This is an example of text classification using typical neural networks. The model can be chosen from the following:
- LSTM
- CNN + MLP
- BoW + MLP
- Character-based variant models of those
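As a rough illustration of how a word-based model and its character-based variant might differ at the input stage, the sketch below tokenizes the same text both ways and maps the tokens to integer IDs. The helper names `split_text` and `to_ids`, the `<unk>` entry, and the toy vocabulary are assumptions made for this illustration; they are not taken from this example's code.

```python
def split_text(text, char_based=False):
    """Split a sentence into tokens.

    Word-based: split on whitespace.
    Character-based: every character (spaces included) is a token.
    """
    if char_based:
        return list(text)
    return text.split()


def to_ids(tokens, vocab, unk_token="<unk>"):
    """Map tokens to integer IDs, using the unknown-token ID for unseen tokens."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


# Toy vocabulary, purely hypothetical.
toy_vocab = {"<unk>": 0, "good": 1, "movie": 2}

word_tokens = split_text("a good movie")
char_tokens = split_text("good", char_based=True)
word_ids = to_ids(word_tokens, toy_vocab)
```

The ID sequences produced this way are what an LSTM, CNN, or bag-of-words encoder would consume; only the token granularity changes between the two variants.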
The dataset can also be chosen from the following:
- DBPedia Ontology dataset (dbpedia): Predict its ontology class from the abstract of a Wikipedia article.
- IMDB Movie Review Dataset (imdb.binary, imdb.fine): Predict its sentiment from a review about a movie. The classes of .binary are positive/negative, and the classes of .fine are ratings [0-1]/[2-3]/[7-8]/[9-10].
- TREC Question Classification (TREC): Predict the type of its answer from a factoid question.
- Stanford Sentiment Treebank (stsa.binary, stsa.fine): Predict its sentiment from a review about a movie. The classes of .binary are positive/negative, and the classes of .fine are [negative]/[somewhat negative]/[neutral]/[somewhat positive]/[positive].
- Customer Review Datasets (custrev): Predict its sentiment (positive/negative) from a review about a product.
- MPQA Opinion Corpus (mpqa): Predict its opinion polarity from a phrase.
- Scale Movie Review Dataset (rt-polarity): Predict its sentiment (positive/negative) from a review about a movie.
- Subjectivity datasets (subj): Predict subjectivity (subjective/objective) from a sentence about a movie.
Some of the datasets are downloaded from @harvardnlp's repository. Thank you.
To train a model:
python train_text_classifier.py -g 0 --dataset stsa.binary --model cnn
The output directory result contains:
- best_model.npz: a model snapshot that achieved the best accuracy on the validation data during training
- vocab.json: the model's vocabulary dictionary as a JSON file
- args.json: the model's setup as a JSON file, which also contains the paths of the model and vocabulary
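The saved JSON files can be inspected with the standard library alone. The following is a minimal sketch, not the loading code used by this example: it assumes that args.json stores the vocabulary path under a key named `vocab_path`, which is a guess that should be checked against the actual file. The helper names are hypothetical.

```python
import json


def load_setup(setup_path):
    """Read the model setup (args.json) and return it as a dictionary."""
    with open(setup_path, encoding="utf-8") as f:
        return json.load(f)


def load_vocab(setup, key="vocab_path"):
    """Read the vocabulary file referenced by the setup dictionary.

    ASSUMPTION: the vocabulary path is stored under `key`; adjust the key
    name if args.json uses a different one.
    """
    with open(setup[key], encoding="utf-8") as f:
        return json.load(f)
```

For instance, `load_vocab(load_setup("result/args.json"))` would return the vocabulary dictionary if the key name assumption holds.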
To apply the saved model to your sentences, feed the sentences through stdin:
cat sentences_to_be_classifed.txt | python run_text_classifier.py -g 0 --model-setup result/args.json
The classification result is written to stdout.
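The same pipeline can be driven from another Python program instead of a shell pipe. The sketch below wraps the command shown above with `subprocess`; the helper names `build_command` and `classify` are hypothetical, and it assumes one sentence per input line. The output format is not specified here, so the raw stdout lines are returned unparsed.

```python
import subprocess
import sys


def build_command(setup_path, gpu=0, script="run_text_classifier.py"):
    """Build the argument list for the classification command."""
    return [sys.executable, script, "-g", str(gpu), "--model-setup", setup_path]


def classify(sentences, setup_path, gpu=0):
    """Feed sentences through stdin and return the lines printed to stdout.

    ASSUMPTION: the script expects one sentence per line on stdin.
    """
    completed = subprocess.run(
        build_command(setup_path, gpu),
        input="\n".join(sentences) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return completed.stdout.splitlines()
```

A call such as `classify(["a great movie"], "result/args.json")` would need to run from the directory that contains run_text_classifier.py.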