update desc of eval metrics
ymcui committed Nov 5, 2019
1 parent c007693 commit f7d7170
Showing 2 changed files with 19 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -141,10 +141,13 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

**Note: To ensure reliable results, we run each model 10 times with different random seeds and report the maximum and average scores (averages in brackets). Barring surprises, your own results should very likely fall within this range.**

**In the evaluation results, values in brackets are averages and values outside the brackets are maxima.**
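
To make the reporting format concrete, here is a minimal sketch (not code from this repository; the scores below are made-up placeholders) of how a "maximum (average)" pair over 10 runs could be produced:

```python
# Hypothetical dev-set scores from 10 runs with different random seeds.
scores = [85.1, 84.8, 85.6, 85.0, 84.9, 85.3, 85.2, 84.7, 85.4, 85.0]

maximum = max(scores)
average = sum(scores) / len(scores)

# Same format as the tables below: maximum outside the brackets, average inside.
print(f"{maximum:.1f} ({average:.1f})")  # -> 85.6 (85.1)
```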


### Simplified Chinese Reading Comprehension: CMRC 2018
The [**CMRC 2018 dataset**](https://github.com/ymcui/cmrc2018) is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research (HFL).
Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD.
Evaluation metrics: EM / F1.
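
For reference, below is a simplified character-level sketch of these two metrics (the function names are ours, and the official CMRC 2018 evaluation script adds extra normalization and multi-reference handling that this sketch omits):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span equals the reference span exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def span_f1(prediction: str, reference: str) -> float:
    """Character-level F1 overlap between predicted and reference spans."""
    pred_chars = Counter(prediction.strip())
    ref_chars = Counter(reference.strip())
    overlap = sum((pred_chars & ref_chars).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(ref_chars.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京"))          # 1.0
print(round(span_f1("北京市", "北京"), 2))  # 0.8
```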

| Model | Development | Test | Challenge |
| :------- | :---------: | :---------: | :---------: |
@@ -159,6 +162,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Traditional Chinese Reading Comprehension: DRCD
The [**DRCD dataset**](https://github.com/DRCKnowledgeTeam/DRCD) is released by Delta Research Center, Taiwan. It is a span-extraction reading comprehension dataset in Traditional Chinese, in the same format as SQuAD.
**Since ERNIE removes Traditional Chinese characters from its vocabulary, we do not recommend using ERNIE on Traditional Chinese data (or convert the data to Simplified Chinese before processing).**
Evaluation metrics: EM / F1.
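
If you take the Simplified Chinese route, one common option (our assumption; the repository does not prescribe a specific tool) is an OpenCC binding:

```python
# pip install opencc-python-reimplemented
# (assumed binding; some OpenCC packages expect 't2s.json' as the config name)
from opencc import OpenCC

t2s = OpenCC('t2s')  # Traditional-to-Simplified conversion profile
print(t2s.convert('機器閱讀理解'))  # -> '机器阅读理解'
```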

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -173,6 +177,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Judicial Reading Comprehension: CJRC
The [**CJRC dataset**](http://cail.cipsc.org.cn) is a Chinese machine reading comprehension dataset for the **judicial domain**, released by the Joint Laboratory of HIT and iFLYTEK Research (HFL).
Note that the data used in our experiments is not the final official release, so the results are for reference only.
Evaluation metrics: EM / F1.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -186,6 +191,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

### Natural Language Inference: XNLI
For natural language inference we use the [**XNLI** data](https://github.com/google-research/bert/blob/master/multilingual.md), where each text pair must be classified into one of three categories: `entailment`, `neutral`, or `contradictory`.
Evaluation metric: Accuracy.
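
Accuracy here is simply the fraction of examples whose predicted label matches the gold label; a minimal sketch (the function name is ours, not the repository's evaluation code):

```python
def accuracy(predictions, references):
    """Fraction of predicted labels that match the reference labels."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["entailment", "neutral", "neutral"],
               ["entailment", "neutral", "contradictory"]))  # ~0.667
```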

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -199,6 +205,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

### Sentiment Analysis: ChnSentiCorp
For sentiment analysis we use ChnSentiCorp, a binary sentiment classification dataset.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -215,6 +222,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

#### LCQMC
[LCQMC](http://icrc.hitsz.edu.cn/info/1037/1146.htm) is released by the Intelligent Computing Research Center of the HIT Shenzhen Graduate School.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -228,6 +236,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

#### BQ Corpus
[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html) is a banking-domain dataset released by the Intelligent Computing Research Center of the HIT Shenzhen Graduate School.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -242,6 +251,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Document-Level Text Classification: THUCNews
For document-level text classification we use **THUCNews**, a news dataset released by the Natural Language Processing Lab at Tsinghua University.
We use a subset of it, where each news article must be classified into one of 10 categories.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
9 changes: 9 additions & 0 deletions README_EN.md
@@ -125,10 +125,12 @@ We experiment on several Chinese datasets, including sentence-level to document-

**Note: To ensure the stability of the results, we run 10 times for each experiment and report maximum and average scores.**

**Average scores are shown in brackets; maximum scores are the numbers outside the brackets.**

### [CMRC 2018](https://github.com/ymcui/cmrc2018)
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research.
The model should answer questions by extracting a span from the given passage, in the same format as SQuAD.
Evaluation Metrics: EM / F1

| Model | Development | Test | Challenge |
| :------- | :---------: | :---------: | :---------: |
@@ -142,6 +144,7 @@ The model should answer the questions based on the given passage, which is ident

### [DRCD](https://github.com/DRCKnowledgeTeam/DRCD)
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese.
Evaluation Metrics: EM / F1

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -155,6 +158,7 @@ DRCD is also a span-extraction machine reading comprehension dataset, released b

### CJRC
[**CJRC**](http://cail.cipsc.org.cn) is a Chinese judiciary reading comprehension dataset, released by the Joint Laboratory of HIT and iFLYTEK Research. Note that the data used in these experiments is NOT identical to the official release.
Evaluation Metrics: EM / F1

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -168,6 +172,7 @@ DRCD is also a span-extraction machine reading comprehension dataset, released b

### XNLI
We use the XNLI data for testing the NLI task.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -180,6 +185,7 @@ We use XNLI data for testing NLI task.

### ChnSentiCorp
We use ChnSentiCorp data for testing sentiment analysis.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -194,6 +200,7 @@ We use ChnSentiCorp data for testing sentiment analysis.
### Sentence Pair Matching: LCQMC, BQ Corpus

#### LCQMC
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -205,6 +212,7 @@ We use ChnSentiCorp data for testing sentiment analysis.
| **RoBERTa-wwm-ext-large** | **90.4 (90.0)** | 87.0 (86.8) |

#### BQ Corpus
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -218,6 +226,7 @@ We use ChnSentiCorp data for testing sentiment analysis.

### THUCNews
THUCNews is a news dataset released by Tsinghua University, containing news articles in 10 categories.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
