This repository contains the code and data for our EMNLP 2023 main conference paper: CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction.
CLEME is a reference-based metric that evaluates Grammatical Error Correction (GEC) systems at the chunk level, aiming to provide unbiased F0.5 scores for multi-reference GEC evaluation.
- CLEME is unbiased, allowing a more objective evaluation pipeline.
- CLEME can visualize the evaluation process as tables.
- CLEME currently supports English and Chinese. We will extend it to other languages in the future.
- Python version >= 3.7
- ERRANT or variants designed for other specific languages.
| Language | Link |
| --- | --- |
| English | ERRANT |
| Arabic | arabic_error_type_annotation |
| Chinese | ChERRANT |
| Czech | errant_czech |
| German | ERRANT-German |
| Greek | ELERRANT |
| Hindi | hindi_grammar_correction |
| Korean | Standard_Korean_GEC |
| Russian | ERRANT-Russian |
| Turkish | ERRANT-TR |

We recommend the newest version of ERRANT for speed gain, although we use ERRANT v2.3.3 in the paper.
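CLEME consumes ERRANT-style M2 files for both references and hypotheses. For English, these are typically produced with ERRANT (for example, its `errant_parallel` command builds an M2 file from parallel original and corrected text files). The snippet below is only a minimal sketch of ERRANT's Python API with made-up example sentences; it assumes ERRANT and a spaCy English model are installed, and it is not part of CLEME itself.

```python
# Minimal sketch: extracting ERRANT edits with the Python API (English).
# Assumes `pip install errant` and an installed spaCy English model.
import errant

annotator = errant.load("en")
orig = annotator.parse("This are a example sentence .")
cor = annotator.parse("This is an example sentence .")
for e in annotator.annotate(orig, cor):
    # Each edit carries the original span, the correction, and an error type,
    # which is the information stored on the A-lines of an M2 (.errant) file.
    print(e.o_start, e.o_end, e.o_str, "->", e.c_str, e.type)
```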
- Clone this repository:

```bash
git clone https://github.com/THUKElab/CLEME.git
cd ./CLEME
```
```bash
python scripts/evaluate.py --ref tests/examples/conll14.errant --hyp tests/examples/conll14-AMU.errant
```
```
{'num_sample': 1312, 'F': 0.2514, 'Acc': 0.7634, 'P': 0.2645, 'R': 0.2097, 'tp': 313.51, 'fp': 871.8, 'fn': 1181.71, 'tn': 6312.0}
```
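For orientation, the reported scores are the standard precision, recall, F0.5 and accuracy computed from the weighted chunk counts in the output above. The snippet below reproduces them (up to rounding); it is only a sanity check, not part of the CLEME API.

```python
# Recomputing P, R, F0.5 and Acc from the weighted chunk counts above.
tp, fp, fn, tn = 313.51, 871.8, 1181.71, 6312.0

p = tp / (tp + fp)                                   # precision ~= 0.2645
r = tp / (tp + fn)                                   # recall    ~= 0.2097
beta = 0.5
f_beta = (1 + beta**2) * p * r / (beta**2 * p + r)   # F0.5      ~= 0.2514
acc = (tp + tn) / (tp + fp + fn + tn)                # accuracy  ~= 0.7634

print(round(p, 4), round(r, 4), round(f_beta, 4), round(acc, 4))
```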
```bash
python scripts/evaluate.py --ref tests/examples/demo.errant --hyp tests/examples/demo-AMU.errant --vis
```
```python
# Excerpt adapted from tests/test_cleme.py, where `self.reader` is the M2 data
# reader created in the test's setUp.

# Read the M2 files of the references and of the system hypothesis.
dataset_ref = self.reader.read(f"{os.path.dirname(__file__)}/examples/demo.errant")
dataset_hyp = self.reader.read(f"{os.path.dirname(__file__)}/examples/demo-AMU.errant")
print(len(dataset_ref), len(dataset_hyp))
print("Example of reference:", dataset_ref[-1])
print("Example of hypothesis:", dataset_hyp[-1])

# Evaluate using CLEME_dependent. The weigher configuration holds the
# hyper-parameters of the chunk weights for TP / FP / FN (see the paper;
# tuned values are listed in ./cleme/constant.py).
config_dependent = {
    "tp": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": False},
    "fp": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": True},
    "fn": {"alpha": 2.0, "min_value": 0.75, "max_value": 1.25, "reverse": False},
}
metric_dependent = DependentChunkMetric(weigher_config=config_dependent)
score, results = metric_dependent.evaluate(dataset_hyp, dataset_ref)
print("==================== Evaluate Demo ====================")
print(score)

# Visualize the chunk-level evaluation as tables.
metric_dependent.visualize(dataset_ref, dataset_hyp)
```
Refer to ./tests/test_cleme.py for more details.
CLEME is language-agnostic, so you can easily apply it to any language as long as you have reference and hypothesis M2 files.
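For reference, an M2 (.errant) file stores each tokenized source sentence on an `S` line, followed by one `A` line per edit giving the start/end token offsets, the error type, and the correction; the last field is the annotator (reference) id, which is how multiple references are kept in a single file. The lines below are a small, made-up illustration using ERRANT's English error types:

```
S This are a example sentence .
A 1 2|||R:VERB:SVA|||is|||REQUIRED|||-NONE-|||0
A 2 3|||R:DET|||an|||REQUIRED|||-NONE-|||0
```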
We searched for the optimal hyper-parameters on the CoNLL-2014 reference set; they are listed in ./cleme/constant.py.
```bibtex
@article{ye-et-al-2023-cleme,
    title   = {CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction},
    author  = {Ye, Jingheng and Li, Yinghui and Zhou, Qingyu and Li, Yangning and Ma, Shirong and Zheng, Hai-Tao and Shen, Ying},
    journal = {arXiv preprint arXiv:2305.10819},
    year    = {2023}
}
```
CLEME v1.0 released.
If you have any questions or feedback, please e-mail us at: [email protected], [email protected]