Data for model training and evaluation

  • agreement contains evaluation data generated from long-distance agreement patterns
  • evaluation_output contains the evaluation data and the results of our trained models (this is probably where you want to look if you are interested in using our agreement test sets)
  • linzen_testset contains the subset of the data from Linzen et al. TACL 2016 (https://github.com/TalLinzen/rnn_agreement) that we used for our evaluation
  • raw_mturk_data contains the responses of MTurk subjects for the extended Italian agreement data

Training data based on Wikipedia

Each corpus consists of around 100M tokens; we used the training (80M) and validation (10M) subsets in our experiments. All corpora were shuffled at the sentence level.
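As an illustration of what a sentence-level shuffle followed by a token-budget split can look like, here is a minimal sketch. It is not the script used to build these corpora: the function name `shuffle_and_split`, the whitespace tokenization, the greedy filling strategy, and the fixed seed are all assumptions made for the example.

```python
import random


def shuffle_and_split(sentences, train_tokens, valid_tokens, seed=0):
    """Shuffle sentences, then greedily fill train/validation token budgets.

    Hypothetical helper: tokens are counted by whitespace splitting, and a
    sentence goes to the first subset that still has room for it.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)  # sentence-level shuffle

    train, valid, rest = [], [], []
    n_train = n_valid = 0
    for sent in shuffled:
        n = len(sent.split())
        if n_train + n <= train_tokens:
            train.append(sent)
            n_train += n
        elif n_valid + n <= valid_tokens:
            valid.append(sent)
            n_valid += n
        else:
            rest.append(sent)  # left over once both budgets are full
    return train, valid, rest
```

For a corpus of the size described above, the budgets would be on the order of `train_tokens=80_000_000` and `valid_tokens=10_000_000`, with the remaining sentences left unused.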

Pre-trained language models

For each language, we distribute the trained LSTM model that achieved the lowest perplexity on our test set (the validation subset in the data above). The name of each model file indicates the hyperparameters used to train that model. See the supplementary materials for more details, and the scripts in the src directory for usage examples.

The models were trained with the vocabularies given above. Each vocabulary lists words in order of their indices, starting from 0; the <unk> and <eos> tokens are already included in the vocabulary.
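The index convention can be sketched as follows. This is a minimal example under stated assumptions, not the loading code from the src directory: it assumes the vocabulary file holds one word per line, so that the line position (counting from 0) is the word's index. The helper names `load_vocab` and `encode`, and the choice to append <eos> after each sentence, are hypothetical.

```python
def load_vocab(path):
    """Read a vocabulary file, assuming one word per line.

    Returns (idx2word, word2idx); the index of a word is its 0-based
    line position in the file.
    """
    with open(path, encoding="utf-8") as f:
        idx2word = [line.rstrip("\n") for line in f]
    word2idx = {word: idx for idx, word in enumerate(idx2word)}
    return idx2word, word2idx


def encode(sentence, word2idx, unk="<unk>", eos="<eos>"):
    """Map a whitespace-tokenized sentence to indices.

    Out-of-vocabulary words map to <unk>; appending <eos> at the end is
    an assumption of this sketch.
    """
    ids = [word2idx.get(word, word2idx[unk]) for word in sentence.split()]
    ids.append(word2idx[eos])
    return ids
```

Because <unk> and <eos> are already part of the vocabulary, no extra entries need to be added before encoding; adding any would shift the indices the models were trained with.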


Please cite the paper if you use the above resources.