Code and data for the paper "Evaluating Morphological Generalisation in Machine Translation by Distribution-Based Compositionality Assessment"

Scripts numbered 01-13 are meant to be run in succession
run.sh provides examples of running the scripts
exp/subset-d-1m/data contains the 1M sentence pair dataset
exp/subset-d-1m/splits/*/*/*/ids_{train,test_full}.txt.gz contain the data splits with different compound divergences and different random initialisations
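The split files are gzip-compressed text, so they can be inspected without unpacking them. The following is a minimal sketch, not code from this repository: it assumes each file holds one sentence ID per line, which the file names suggest but which is not stated above. The helper name `read_ids` and the demo file contents are hypothetical.

```python
import gzip
import tempfile
from pathlib import Path


def read_ids(path):
    """Return the non-empty lines of a gzip-compressed text file, in order.

    Assumption: one sentence ID per line (not confirmed by the README).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Self-contained demo on a throwaway file with made-up IDs. In the repository,
# one would instead pass a real path matching
# exp/subset-d-1m/splits/*/*/*/ids_train.txt.gz
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "ids_train.txt.gz"
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("3\n17\n42\n")
    demo_ids = read_ids(demo_path)

print(demo_ids)
```

Reading the matching ids_test_full.txt.gz the same way and intersecting the two lists as sets is a quick way to check that a train/test pair does not overlap.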

Data is from the Tatoeba Challenge data release (eng-fin set)
Data filtering is done using OpusFilter
Morphological parsing is done using TNPP, and the CoNLL-U format is parsed using this parser
Data split algorithm uses PyTorch
Tokenisers are trained using sentencepiece
Translation systems are trained with OpenNMT-py
Evaluating translations is done with sacreBLEU
