PyTorch implementations of most common text classification models:
- FastText: Bag of Tricks for Efficient Text Classification
- DPCNN: Deep Pyramid Convolutional Neural Networks for Text Categorization
- TextCNN: Convolutional Neural Networks for Sentence Classification
- RCNN: Recurrent Convolutional Neural Networks for Text Classification
- HAN: Hierarchical Attention Networks for Document Classification
- Bi-GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation
- Bi-LSTM: Long Short-Term Memory
- Transformer-Encoder: Attention Is All You Need
- Transformer-XL-Encoder: Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
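IMDB is a binary (pos, neg) classification dataset of movie reviews, built specifically for sentiment analysis.
The IMDB training set contains 25,000 examples and the test set contains 25,000 examples; no separate validation set is provided.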
seed: 512
epoch: 20
optimizer: Adam
warm_up_proportion: 0.1
lr: 0.0005
batch_size: 128
pretrain_embedding: Glove-6B
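
To make the settings above concrete, here is a minimal sketch of how `epoch`, `batch_size`, `warm_up_proportion` and `lr` could translate into a per-step learning rate. It rests on assumptions: all 25,000 training examples are used, the last partial batch is kept, and warmup is linear and followed by linear decay to zero. The source only gives the warmup proportion, not the decay shape, and the function names are hypothetical.

```python
import math

# Values taken from the configuration listed above.
NUM_TRAIN_EXAMPLES = 25000
BATCH_SIZE = 128
EPOCHS = 20
BASE_LR = 0.0005
WARM_UP_PROPORTION = 0.1


def total_training_steps(num_examples, batch_size, epochs):
    """Optimizer steps over the whole run, keeping the last partial batch."""
    return math.ceil(num_examples / batch_size) * epochs


def warmup_lr_factor(step, total_steps, warmup_proportion):
    """Multiplier applied to the base lr at a 0-based optimizer step.

    Assumption: linear warmup, then linear decay to zero.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # Ramp up from 0 to 1 over the warmup steps.
        return step / max(1, warmup_steps)
    # Ramp down from 1 to 0 over the remaining steps.
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


total_steps = total_training_steps(NUM_TRAIN_EXAMPLES, BATCH_SIZE, EPOCHS)
for step in (0, 196, 392, total_steps):
    factor = warmup_lr_factor(step, total_steps, WARM_UP_PROPORTION)
    print(step, BASE_LR * factor)
```

Under these assumptions the run has 196 steps per epoch and 3,920 steps in total, of which the first 392 are warmup. In PyTorch, a function like this could be bound to its last two arguments and passed as `lr_lambda` to `torch.optim.lr_scheduler.LambdaLR`, wrapping a `torch.optim.Adam` optimizer created with `lr=0.0005`.
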
- For the non-pretrained models, GloVe-6B pretrained word vectors are used in three dimensions: 50, 100, and 200.
Model | Test(dim=50) | Test(dim=100) | Test(dim=200) |
---|---|---|---|
FastText | 0.8809 | 0.8830 | 0.8836 |
DPCNN | 0.8648 | 0.8722 | 0.8798 |
TextCNN | 0.8680 | 0.8689 | 0.8808 |
RCNN | 0.8716 | 0.8864 | 0.9006 |
HAN | 0.8852 | 0.8904 | 0.8973 |
Bi-GRU | 0.8535 | 0.8715 | 0.8873 |
Bi-LSTM | 0.8611 | 0.8708 | 0.8871 |
Transformer-Encoder | 0.8650 | 0.8670 | 0.8774 |
Transformer-XL-Encoder | 0.8751 | 0.8844 | 0.8791 |
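
For readers who want to reproduce the embedding setup behind the table above, the following is a minimal sketch of turning GloVe text vectors into an embedding matrix aligned with a model vocabulary. It is not the repository's loader. The function names, the `<pad>` token, the zero padding row and the random initialization of out-of-vocabulary tokens are all assumptions made for illustration. The toy input uses 3-dimensional vectors, whereas the real files have 50, 100 or 200 dimensions.

```python
import io
import random


def load_glove(lines, dim):
    """Parse GloVe text lines ('word v1 ... v_dim') into a dict of float lists."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


def build_embedding_matrix(vocab, vectors, dim, pad_token="<pad>", seed=512):
    """Return one row per vocab token, in vocab order.

    Assumptions: the padding token maps to zeros, and tokens missing
    from GloVe get small uniform random values.
    """
    rng = random.Random(seed)
    matrix = []
    for token in vocab:
        if token == pad_token:
            matrix.append([0.0] * dim)
        elif token in vectors:
            matrix.append(vectors[token])
        else:
            matrix.append([rng.uniform(-0.1, 0.1) for _ in range(dim)])
    return matrix


# Toy stand-in for a GloVe file; real vectors are much longer.
toy_file = io.StringIO("good 0.1 0.2 0.3\nbad -0.1 -0.2 -0.3\n")
glove = load_glove(toy_file, dim=3)
vocab = ["<pad>", "good", "bad", "unseen"]
matrix = build_embedding_matrix(vocab, glove, dim=3)
print(len(matrix), len(matrix[0]))
```

With the real GloVe 6B download, the same functions could read a file such as `glove.6B.100d.txt` opened with UTF-8 encoding and `dim=100`. The resulting matrix can then be converted with `torch.tensor` and passed to `torch.nn.Embedding.from_pretrained`. Whether the embeddings are frozen or fine-tuned in this project is not stated here.
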