- agreement: contains evaluation data generated based on long-distance agreement patterns
- evaluation_output: contains the evaluation data and results of our trained models (this is probably where you want to look if you are interested in using our agreement test sets)
- linzen_testset: contains the subset of data from Linzen et al. TACL 2016 (https://github.com/TalLinzen/rnn_agreement) which we used for our evaluation
- raw_mturk_data: contains the responses of MTurk subjects for the extended Italian agreement data
Each corpus consists of around 100M tokens; we used training (80M) and validation (10M) subsets in our experiments. All corpora were shuffled at the sentence level.
- English train / valid / test / vocab
- Hebrew train / valid / test / vocab
- Italian train / valid / test / vocab
- Russian train / valid / test / vocab
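The sentence-level shuffling and splitting described above can be sketched as follows. This is a minimal illustration, not the exact preprocessing script from this repository; the function name, the fixed seed, and the 80/10/10 fractions are assumptions for the example.

```python
import random

def shuffle_and_split(sentences, train_frac=0.8, valid_frac=0.1, seed=0):
    """Shuffle a corpus at the sentence level, then split it into
    train / valid / test portions (illustrative fractions)."""
    sentences = list(sentences)
    # Fixed seed so the split is reproducible across runs.
    random.Random(seed).shuffle(sentences)
    n = len(sentences)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    train = sentences[:n_train]
    valid = sentences[n_train:n_train + n_valid]
    test = sentences[n_train + n_valid:]
    return train, valid, test
```

The remaining 10% falls into the test portion; shuffling before splitting ensures each portion draws sentences from the whole corpus rather than from contiguous document spans.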
For each language, we distribute the trained LSTM model that achieved the lowest perplexity on our test set (the validation files in the data above). The name of each model file indicates the hyperparameters that were used to train it. See the supplementary materials for more details, and the scripts in the src directory for usage examples.
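Since the model file names encode the training hyperparameters, one might recover them with a small parser along these lines. The naming scheme assumed here (lowercase parameter names followed directly by their values, joined with underscores, e.g. hidden650_batch128_dropout0.2.pt) is a hypothetical illustration; check the actual file names and the src scripts for the real convention.

```python
import re

def parse_hyperparams(filename):
    """Parse hyperparameters from a model filename assumed to look like
    'hidden650_batch128_dropout0.2.pt' (hypothetical naming scheme)."""
    stem = filename.rsplit(".", 1)[0]  # drop the file extension
    params = {}
    for part in stem.split("_"):
        # Each part is a parameter name followed by a numeric value.
        m = re.fullmatch(r"([a-z]+)([0-9.]+)", part)
        if m:
            key, value = m.groups()
            params[key] = float(value) if "." in value else int(value)
    return params
```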
The models were trained with the vocabularies given above. Each vocabulary lists words in order of their indices, starting from 0; the <unk> and <eos> tokens are already included in the vocabulary.
Please cite the paper if you use the above resources.