update desc of eval metrics
ymcui committed Nov 5, 2019
1 parent c007693 commit f7d7170
Showing 2 changed files with 19 additions and 0 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -141,10 +141,13 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

**Note: To ensure reliable results, we run each model 10 times with different random seeds and report the maximum and average scores (averages in brackets). Barring surprises, your own results should very likely fall within this range.**

**In the evaluation results, values in brackets are averages and values outside the brackets are maxima.**
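
To make the reporting format concrete, here is a minimal sketch (not code from this repository; the scores below are made-up placeholders) of how a "maximum (average)" pair over 10 runs could be produced:

```python
# Hypothetical dev-set scores from 10 runs with different random seeds.
scores = [85.1, 84.8, 85.6, 85.0, 84.9, 85.3, 85.2, 84.7, 85.4, 85.0]

maximum = max(scores)
average = sum(scores) / len(scores)

# Same format as the tables below: maximum outside the brackets, average inside.
print(f"{maximum:.1f} ({average:.1f})")  # -> 85.6 (85.1)
```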


### Simplified Chinese Reading Comprehension: CMRC 2018
The [**CMRC 2018 dataset**](https://github.com/ymcui/cmrc2018) is a Chinese machine reading comprehension dataset released by the Joint Laboratory of HIT and iFLYTEK Research (HFL).
Given a question, the system must extract a span from the passage as the answer, in the same format as SQuAD.
Evaluation metrics: EM / F1.
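
For reference, below is a simplified character-level sketch of these two metrics (the function names are ours, and the official CMRC 2018 evaluation script adds extra normalization and multi-reference handling that this sketch omits):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span equals the reference span exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def span_f1(prediction: str, reference: str) -> float:
    """Character-level F1 overlap between predicted and reference spans."""
    pred_chars = Counter(prediction.strip())
    ref_chars = Counter(reference.strip())
    overlap = sum((pred_chars & ref_chars).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_chars.values())
    recall = overlap / sum(ref_chars.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("北京", "北京"))          # 1.0
print(round(span_f1("北京市", "北京"), 2))  # 0.8
```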

| Model | Development | Test | Challenge |
| :------- | :---------: | :---------: | :---------: |
@@ -159,6 +162,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Traditional Chinese Reading Comprehension: DRCD
The [**DRCD dataset**](https://github.com/DRCKnowledgeTeam/DRCD) is released by Delta Research Center, Taiwan. It is a span-extraction reading comprehension dataset in Traditional Chinese, in the same format as SQuAD.
**Since ERNIE removes Traditional Chinese characters from its vocabulary, we do not recommend using ERNIE on Traditional Chinese data (or convert the data to Simplified Chinese before processing).**
Evaluation metrics: EM / F1.
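
If you take the Simplified Chinese route, one common option (our assumption; the repository does not prescribe a specific tool) is an OpenCC binding:

```python
# pip install opencc-python-reimplemented
# (assumed binding; some OpenCC packages expect 't2s.json' as the config name)
from opencc import OpenCC

t2s = OpenCC('t2s')  # Traditional-to-Simplified conversion profile
print(t2s.convert('機器閱讀理解'))  # -> '机器阅读理解'
```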

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -173,6 +177,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Judicial Reading Comprehension: CJRC
The [**CJRC dataset**](http://cail.cipsc.org.cn) is a Chinese machine reading comprehension dataset for the **judicial domain**, released by the Joint Laboratory of HIT and iFLYTEK Research (HFL).
Note that the data used in our experiments is not the final official release, so the results are for reference only.
Evaluation metrics: EM / F1.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -186,6 +191,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

### Natural Language Inference: XNLI
For natural language inference we use the [**XNLI** data](https://github.com/google-research/bert/blob/master/multilingual.md), where each text pair must be classified into one of three categories: `entailment`, `neutral`, or `contradictory`.
Evaluation metric: Accuracy.
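
Accuracy here is simply the fraction of examples whose predicted label matches the gold label; a minimal sketch (the function name is ours, not the repository's evaluation code):

```python
def accuracy(predictions, references):
    """Fraction of predicted labels that match the reference labels."""
    assert len(predictions) == len(references)
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)

print(accuracy(["entailment", "neutral", "neutral"],
               ["entailment", "neutral", "contradictory"]))  # ~0.667
```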

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -199,6 +205,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

### Sentiment Analysis: ChnSentiCorp
For sentiment analysis we use ChnSentiCorp, a binary sentiment classification dataset.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -215,6 +222,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

#### LCQMC
[LCQMC](http://icrc.hitsz.edu.cn/info/1037/1146.htm) is released by the Intelligent Computing Research Center of the HIT Shenzhen Graduate School.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -228,6 +236,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`

#### BQ Corpus
[BQ Corpus](http://icrc.hitsz.edu.cn/Article/show/175.html) is a banking-domain dataset released by the Intelligent Computing Research Center of the HIT Shenzhen Graduate School.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -242,6 +251,7 @@ The PyTorch version contains the files `pytorch_model.bin`, `bert_config.json`, and `vocab.txt`
### Document-Level Text Classification: THUCNews
For document-level text classification we use **THUCNews**, a news dataset released by the Natural Language Processing Lab at Tsinghua University.
We use a subset of it, where each news article must be classified into one of 10 categories.
Evaluation metric: Accuracy.

| Model | Development | Test |
| :------- | :---------: | :---------: |
9 changes: 9 additions & 0 deletions README_EN.md
@@ -125,10 +125,12 @@ We experiment on several Chinese datasets, including sentence-level to document-

**Note: To ensure the stability of the results, we run 10 times for each experiment and report maximum and average scores.**

**Average scores are shown in brackets; maximum scores are the numbers outside the brackets.**

### [CMRC 2018](https://github.com/ymcui/cmrc2018)
The CMRC 2018 dataset is released by the Joint Laboratory of HIT and iFLYTEK Research.
The model should answer questions by extracting a span from the given passage, in the same format as SQuAD.
Evaluation Metrics: EM / F1

| Model | Development | Test | Challenge |
| :------- | :---------: | :---------: | :---------: |
@@ -142,6 +144,7 @@ The model should answer the questions based on the given passage, which is ident

### [DRCD](https://github.com/DRCKnowledgeTeam/DRCD)
DRCD is also a span-extraction machine reading comprehension dataset, released by Delta Research Center. The text is written in Traditional Chinese.
Evaluation Metrics: EM / F1

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -155,6 +158,7 @@ DRCD is also a span-extraction machine reading comprehension dataset, released b

### CJRC
[**CJRC**](http://cail.cipsc.org.cn) is a Chinese judiciary reading comprehension dataset, released by the Joint Laboratory of HIT and iFLYTEK Research. Note that the data used in these experiments is NOT identical to the official release.
Evaluation Metrics: EM / F1

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -168,6 +172,7 @@ DRCD is also a span-extraction machine reading comprehension dataset, released b

### XNLI
We use the XNLI data for testing the NLI task.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -180,6 +185,7 @@ We use XNLI data for testing NLI task.

### ChnSentiCorp
We use ChnSentiCorp data for testing sentiment analysis.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -194,6 +200,7 @@ We use ChnSentiCorp data for testing sentiment analysis.
### Sentence Pair Matching: LCQMC, BQ Corpus

#### LCQMC
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -205,6 +212,7 @@ We use ChnSentiCorp data for testing sentiment analysis.
| **RoBERTa-wwm-ext-large** | **90.4 (90.0)** | 87.0 (86.8) |

#### BQ Corpus
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
@@ -218,6 +226,7 @@ We use ChnSentiCorp data for testing sentiment analysis.

### THUCNews
THUCNews is a news dataset released by Tsinghua University, containing news articles in 10 categories.
Evaluation Metrics: Accuracy

| Model | Development | Test |
| :------- | :---------: | :---------: |
