Merge pull request sebastianruder#10 from ollmer/master
Character Level Language Modeling Results
sebastianruder authored Jun 25, 2018
2 parents 76972ce + 7c90d42 commit aec07e5
47 changes: 46 additions & 1 deletion language_modeling.md
@@ -1,6 +1,10 @@
# Language modeling

Language modeling is the task of predicting the next word or character in a document.

\* Indicates models using dynamic evaluation.
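
The sketch below is a toy illustration of the task and of the perplexity numbers reported in the word-level tables: it trains a bigram model with add-one smoothing on two made-up sentences and scores a held-out one. The corpus, the smoothing choice, and the helper names are illustrative assumptions, not part of any benchmark; dynamically evaluated models (marked with \*) additionally update their parameters on the recent test history as it is consumed, which this static sketch does not do.

```python
import math
from collections import Counter

# Toy corpus; the real benchmarks below use Penn Treebank, WikiText-2, etc.
train = ["the cat sat on the mat", "the dog sat on the rug"]
test = "the cat sat on the rug"

# Count contexts (unigrams) and transitions (bigrams) over the training data.
unigrams, bigrams = Counter(), Counter()
for sent in train:
    toks = ["<s>"] + sent.split() + ["</s>"]
    unigrams.update(toks[:-1])
    bigrams.update(zip(toks[:-1], toks[1:]))

vocab_size = len(set(unigrams) | {"</s>"})

def prob(prev, word):
    # Add-one (Laplace) smoothed estimate of p(word | prev).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Perplexity = exp(average negative log-likelihood per predicted token).
toks = ["<s>"] + test.split() + ["</s>"]
nll = [-math.log(prob(p, w)) for p, w in zip(toks[:-1], toks[1:])]
print("perplexity:", math.exp(sum(nll) / len(nll)))
```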

## Word Level Models

### Penn Treebank

@@ -36,4 +40,45 @@ consists of around 2 million words extracted from Wikipedia articles.
| AWD-LSTM-MoS (Yang et al., 2018) | 63.88 | 61.45 | [Breaking the Softmax Bottleneck: A High-Rank RNN Language Model](https://arxiv.org/abs/1711.03953) |
| AWD-LSTM (Merity et al., 2017) | 68.6 | 65.8 | [Regularizing and Optimizing LSTM Language Models](https://arxiv.org/abs/1708.02182) |

## Character Level Models

### Hutter Prize
[The Hutter Prize](http://prize.hutter1.net) Wikipedia dataset, also known as enwik8, is a byte-level dataset consisting of the
first 100 million bytes of a Wikipedia XML dump. For simplicity we shall refer to it as a character-level dataset.
Within these 100 million bytes are 205 unique tokens.
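
Bits per character (BPC) in the tables below is the average negative base-2 log-probability the model assigns to each successive byte/character of the evaluation stream. A minimal sketch, with made-up probability values standing in for a model's predictions:

```python
import math

# Hypothetical probabilities a model assigned to the correct next character
# at five positions of an evaluation stream (illustrative values only).
probs = [0.42, 0.90, 0.05, 0.61, 0.33]

# BPC = average negative log2-likelihood per character.
bpc = -sum(math.log2(p) for p in probs) / len(probs)
print(f"BPC = {bpc:.3f}")

# Equivalently, 2 ** BPC is the per-character perplexity.
print(f"per-character perplexity = {2 ** bpc:.3f}")
```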

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| ---------------- | :-----: | :-----: | --- |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.08 | 46M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) |
| 3 layer AWD-LSTM (Merity et al., 2018) | 1.232 | 47M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.24 | 46M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) |
| Large FS-LSTM-4 (Mujika et al., 2017) | 1.245 | 47M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) |
| FS-LSTM-4 (Mujika et al., 2017) | 1.277 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) |


### Text8
[The text8 dataset](http://mattmahoney.net/dc/textdata.html) is also derived from Wikipedia text, but has all XML removed and is lowercased, so that it contains only the 26 letters of English text plus spaces.
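
A rough sketch of text8-style normalisation is shown below: lowercase the raw text, map everything outside a–z to a space, and collapse the resulting runs of spaces. This is only an approximation; the official text8 is produced with Matt Mahoney's wikifil.pl script, which also strips the wiki markup and spells out digits as words.

```python
import re

def text8_like_clean(raw: str) -> str:
    """Rough approximation of text8-style cleaning: lowercase, then replace
    every run of non-letters with a single space. The official text8 is built
    with Matt Mahoney's wikifil.pl, which additionally handles wiki markup and
    digits, so treat this as a sketch only."""
    cleaned = re.sub(r"[^a-z]+", " ", raw.lower())
    return cleaned.strip()

print(text8_like_clean("Anarchism (from Greek: αναρχία) arose in the 1800s."))
# -> "anarchism from greek arose in the s"
```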

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| ---------------- | :-----: | :-----: | --- |
| mLSTM + dynamic eval (Krause et al., 2017)* | 1.19 | 45M | [Dynamic Evaluation of Neural Sequence Models](https://arxiv.org/abs/1709.07432) |
| Large mLSTM +emb +WN +VD (Krause et al., 2016) | 1.27 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) |
| Large RHN (Zilly et al., 2016) | 1.27 | 46M | [Recurrent Highway Networks](https://arxiv.org/abs/1607.03474) |
| LayerNorm HM-LSTM (Chung et al., 2017) | 1.29 | 35M | [Hierarchical Multiscale Recurrent Neural Networks](https://arxiv.org/abs/1609.01704) |
| BN LSTM (Cooijmans et al., 2016) | 1.36 | 16M | [Recurrent Batch Normalization](https://arxiv.org/abs/1603.09025) |
| Unregularised mLSTM (Krause et al., 2016) | 1.40 | 45M | [Multiplicative LSTM for sequence modelling](https://arxiv.org/abs/1609.07959) |

### Penn Treebank
The vocabulary of the words in the character-level dataset is limited to 10,000, the same vocabulary as used in the word-level dataset. This vastly simplifies the task of character-level language modeling, as character transitions are restricted to those found within this limited word-level vocabulary.
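
As an illustration of what the character-level input looks like under this restriction, the sketch below maps a word-level sentence onto a character stream, sending out-of-vocabulary words to an `<unk>` symbol and marking word boundaries explicitly. The tiny vocabulary, the `<unk>` token, and the `_` boundary symbol are assumptions for the example; the exact preprocessing differs between papers.

```python
# Illustrative word-to-character conversion under a fixed word vocabulary;
# the vocabulary, "<unk>" and the "_" separator are assumptions, not the
# exact scheme used by any particular paper.
vocab = {"no", "it", "was", "n't", "black", "monday"}

def to_char_stream(sentence):
    words = [w if w in vocab else "<unk>" for w in sentence.split()]
    chars = []
    for i, w in enumerate(words):
        # Keep "<unk>" as a single symbol; split in-vocabulary words into characters.
        chars.extend([w] if w == "<unk>" else list(w))
        if i < len(words) - 1:
            chars.append("_")  # explicit word-boundary symbol
    return chars

print(to_char_stream("it was n't black friday"))
# ['i', 't', '_', 'w', 'a', 's', '_', 'n', "'", 't', '_', 'b', 'l', 'a', 'c', 'k', '_', '<unk>']
```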

| Model | Bits per Character (BPC) | Number of params | Paper / Source |
| ---------------- | :-----: | :-----: | --- |
| 3 layer AWD-LSTM (Merity et al., 2018) | 1.175 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) |
| 6 layer QRNN (Merity et al., 2018) | 1.187 | 13.8M | [An Analysis of Neural Language Modeling at Multiple Scales](https://arxiv.org/abs/1803.08240) |
| FS-LSTM-4 (Mujika et al., 2017) | 1.190 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) |
| FS-LSTM-2 (Mujika et al., 2017) | 1.193 | 27M | [Fast-Slow Recurrent Neural Networks](https://arxiv.org/abs/1705.08639) |
| NASCell (Zoph & Le, 2016) | 1.214 | 16.3M | [Neural Architecture Search with Reinforcement Learning](https://arxiv.org/abs/1611.01578) |
| 2-Layer Norm HyperLSTM (Ha et al., 2016) | 1.219 | 14.4M | [HyperNetworks](https://arxiv.org/abs/1609.09106) |

[Go back to the README](README.md)
