Language modeling is the task of predicting the next word or character in a document.
\* Indicates models using dynamic evaluation.
A common evaluation dataset for language modeling is the Penn Treebank, as pre-processed by Mikolov et al. (2010). The dataset consists of 929k training words, 73k validation words, and 82k test words. As part of the pre-processing, words were lower-cased, numbers were replaced with N, newlines were replaced with `<eos>`, and all other punctuation was removed. The vocabulary is the most frequent 10k words, with the remaining tokens replaced by an `<unk>` token. Models are evaluated based on perplexity, which is the exponentiated average negative per-word log-probability (lower is better).
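Perplexity can be computed directly from a model's per-word log-probabilities on the test set. The snippet below is a minimal sketch of that computation; the values in `log_probs` are made up for illustration and are not outputs of any model in the table.

```python
import math

# Hypothetical per-word natural-log probabilities a model assigns to a test set.
log_probs = [-3.2, -1.7, -4.5, -2.1]

# Perplexity is the exponential of the average negative log-probability per word.
avg_neg_log_prob = -sum(log_probs) / len(log_probs)
perplexity = math.exp(avg_neg_log_prob)

print(f"{perplexity:.2f}")  # lower is better
```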
Model | Validation perplexity | Test perplexity | Paper / Source |
---|---|---|---|
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 48.33 | 47.69 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 51.6 | 51.1 | Dynamic Evaluation of Neural Sequence Models |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.9 | 52.8 | Regularizing and Optimizing LSTM Language Models |
AWD-LSTM-MoS (Yang et al., 2018) | 56.54 | 54.44 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM (Merity et al., 2017) | 60.0 | 57.3 | Regularizing and Optimizing LSTM Language Models |
WikiText-2 has been proposed as a more realistic benchmark for language modeling than the pre-processed Penn Treebank. WikiText-2 consists of around 2 million words extracted from Wikipedia articles.
Model | Validation perplexity | Test perplexity | Paper / Source |
---|---|---|---|
AWD-LSTM-MoS + dynamic eval (Yang et al., 2018)* | 42.41 | 40.68 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM + dynamic eval (Krause et al., 2017)* | 46.4 | 44.3 | Dynamic Evaluation of Neural Sequence Models |
AWD-LSTM + continuous cache pointer (Merity et al., 2017)* | 53.8 | 52.0 | Regularizing and Optimizing LSTM Language Models |
AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | Breaking the Softmax Bottleneck: A High-Rank RNN Language Model |
AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | Regularizing and Optimizing LSTM Language Models |
The Hutter Prize Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset. Within these 100 million bytes are 205 unique tokens.
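Models on enwik8 (and on text8 below) are compared by bits per character (BPC), i.e. the average negative base-2 log-probability the model assigns to each character. A minimal sketch of the metric, using made-up probabilities rather than real model outputs:

```python
import math

# Hypothetical per-character probabilities a model assigns to a test sequence.
char_probs = [0.5, 0.25, 0.125, 0.5]

# BPC is the average negative log2-probability per character (lower is better).
bpc = -sum(math.log2(p) for p in char_probs) / len(char_probs)

print(f"{bpc:.3f} bits per character")  # 1.750 for these toy values
```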
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | Dynamic Evaluation of Neural Sequence Models |
3 layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | An Analysis of Neural Language Modeling at Multiple Scales |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.24 | 46M | Multiplicative LSTM for sequence modelling |
Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | Fast-Slow Recurrent Neural Networks |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |
FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | Fast-Slow Recurrent Neural Networks |
The text8 dataset is also derived from Wikipedia text, but has all XML removed and is lowercased, leaving only the 26 letters of the English alphabet plus spaces.
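This cleaning step can be approximated as below. The snippet is only a rough sketch with a made-up input string; the real text8 pipeline (Matt Mahoney's cleaning script) additionally spells out digits and strips wiki markup more carefully.

```python
import re

# Made-up input line; not taken from the actual dataset.
raw = "Anarchism is a <ref>political philosophy</ref> from 1793!"

# Lower-case and keep only the letters a-z, collapsing everything else to spaces.
cleaned = re.sub(r"[^a-z]+", " ", raw.lower()).strip()

print(cleaned)
```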
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | Dynamic Evaluation of Neural Sequence Models |
Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | Multiplicative LSTM for sequence modelling |
Large RHN (Zilly et al., 2016) | 1.27 | 46M | Recurrent Highway Networks |
LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | Hierarchical Multiscale Recurrent Neural Networks |
BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | Recurrent Batch Normalization |
Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | Multiplicative LSTM for sequence modelling |
In the character-level version of the Penn Treebank, the vocabulary of the underlying words is limited to the same 10,000 words used in the word-level dataset. This vastly simplifies character-level language modeling, since character transitions are limited to those occurring within this restricted word-level vocabulary.
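As an illustration, a word-level line can be mapped to a character sequence roughly as follows. The boundary marker used here is an assumption for illustration, not something specified by the results below.

```python
# Hypothetical line from the word-level Penn Treebank (already lower-cased,
# with numbers replaced by N and out-of-vocabulary words by <unk>).
word_line = "the cat sat on the mat"

# One common convention is to spell each word out as characters and mark word
# boundaries with '_'; the marker choice is an assumption made for this sketch.
char_tokens = list("_".join(word_line.split()))

print(char_tokens)  # ['t', 'h', 'e', '_', 'c', 'a', 't', ...]
```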
Model | Bits per character (BPC) | Number of params | Paper / Source |
---|---|---|---|
3 layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
6 layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | An Analysis of Neural Language Modeling at Multiple Scales |
FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | Fast-Slow Recurrent Neural Networks |
FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | Fast-Slow Recurrent Neural Networks |
NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | Neural Architecture Search with Reinforcement Learning |
2-Layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | HyperNetworks |