The data and most code are borrowed from COMET Repo. Thanks for their work.
We have used WMT-19 DARR dataset, and considered the follwing language pairs: de-en
, fi-en
, gu-en
, kk-en
, lt-en
, ru-en
, zh-en
. For each language pair, we converted the original dataset into an unified form as shown below (all texts non-tokenized, normal cased). The unified data form is in each dataset folder, and is named as data.pkl
. Note that there is another file final_p.pkl
in each dataset folder, which is our calculated score file.
{
"doc_id": {
"src": "This is the source text.",
"ref": "This is the reference translation.",
"better": {
"sys_name": "System name 1",
"sys": "This is system translation 1.",
"scores": {}
},
"worse": {
"sys_name": "System name 2",
"sys": "This is system translation 2.",
"scores": {}
}
}
}
After calculating scores using automatic metrics, the scores
field for each system is updated, like the one below.
"scores": {
"auto_metric1": "0.3", # We use string score to save space
"auto_metric2": "0.1",
"auto_metric3": "0.7"
}
To use BLEURT, please run the following to set up.
git clone https://github.com/google-research/bleurt.git
cd bleurt
pip install .
Please run the following commands to download the PRISM model, BLEURT model and COMET model.
mkdir models
sh download.sh
Our trained BARTScore (on ParaBank2) can be downloaded here. Please also move it to the models
folder for subsequent experiments if you consider using it.
Run the following to see all the arguments that are supported by the score.py
script.
python score.py --help
To reproduce the results, run the following as an example.
python score.py --file kk-en/data.pkl --device cuda:0 --output kk-en/scores.pkl --bleu --chrf --bleurt --prism --comet --bert_score --bart_score --bart_score_cnn --bart_score_para --prompt bart_para_ref