add test sets from open-sourced dataset

leixy76 · Aug 7, 2021 · 69bf7bc
1 parent d73d4ea
commit 69bf7bc

Showing 4 changed files with 30 additions and 9 deletions.
diff --git a/.gitignore b/.gitignore
@@ -2,6 +2,7 @@
 SAFEBOX/*.cfg
 datasets/download
 datasets/SPEECHIO_*
+datasets/*_TEST
 oss
 tmp.sh
 test_env
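
Review note: the new `datasets/*_TEST` pattern is meant to keep the converted open-sourced test sets out of version control, alongside the existing `datasets/SPEECHIO_*` rule. The sketch below approximates that matching with shell `case` globbing. It is an illustration only: `is_ignored` is a hypothetical helper, not part of the repository, and shell globbing is not identical to gitignore matching (in a `case` pattern, `*` can also match `/`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximates three of the .gitignore rules shown above
# using shell `case` patterns. Assumption: paths are given relative to the
# repository root and contain a single directory level under datasets/.
is_ignored() {
  case "$1" in
    datasets/download|datasets/SPEECHIO_*|datasets/*_TEST) return 0 ;;
    *) return 1 ;;
  esac
}

# Report how each sample path would be classified by the approximation.
for path in \
  datasets/AISHELL-1_TEST \
  datasets/AISHELL-2_MIC_TEST \
  datasets/SPEECHIO_ASR_ZH00001 \
  datasets/run_kaldi_to_speechio.sh; do
  if is_ignored "$path"; then
    echo "ignored: $path"
  else
    echo "not ignored: $path"
  fi
done
```

In a real checkout, `git check-ignore -v <path>` reports which rule, if any, matches a given path.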

diff --git a/HOW_TO_SUBMIT.md b/HOW_TO_SUBMIT.md
@@ -10,18 +10,18 @@ As above figure demonstrates, a benchmark cycle contains following steps:
 
 ---
 
-## Step 1. Prepare model dir for submission
+## Step 1. Prepare submission model dir
 
 Conceptually, for the leaderboard to reproduce and benchmark a submitter's ASR system, submitters need to provide at least 3 things:
 * system dependencies (operating system, software, libraries, packages)
 * runtime resources (e.g. model, config, cloud-api credentials)
 * a program that can decode local audio list
 
-A sample submission `model directory` is listed below:
+The main purpose of the leaderboard pipeline is therefore to formalize the above aspects into a concrete contract. Let's start with the `submission model dir`:
 ```
-jiayu@ubuntu: tree ./sample_model_directory
+jiayu@ubuntu: tree ./sample_submission_model_dir
 
-sample_model_directory
+sample_submission_model_dir
 ├── docker
 │   └── Dockerfile
 ├── model.yaml
@@ -33,7 +33,7 @@ sample_model_directory
 ---
 
 ### 1.1 `docker/Dockerfile`
-Dockerfile is used to construct your runtime envrionment for benchmarking, it should specifies all dependencies of your ASR system.
+The Dockerfile serves as a specification of your ASR runtime environment: the pipeline builds a Docker image from it to reproduce your system on a local machine. Here, `runtime` can be a cloud-API client or a local ASR engine.
 
 <details><summary> cloud-API ASR Dockerfile example </summary><p>
 

diff --git a/README.md b/README.md
@@ -32,9 +32,20 @@ With SpeechIO leaderboard, anyone can benchmark/reproduce/compare performances w
 ---
 
 ## 2. TestSet Zoo
-<details><summary> Test Sets (ZH) </summary><p>
+<details><summary> Test Sets from Open-Sourced Dataset (ZH) </summary><p>
 
-| 编号 <br> ID | 名称 <br> Name |场景 <br> Scenario | 内容领域 <br> Topic Domain | 时长 <br> hours | 难度(1-5) <br> Difficulty  |
+| 编号 <br> TEST_SET_ID | 说明 <br> DESCRIPTION |
+| --- | --- |
+| AISHELL-1_TEST | test set of AISHELL-1 |
+| AISHELL-2_IOS_TEST | test set of AISHELL-2 (iOS channel) |
+| AISHELL-2_ANDROID_TEST | test set of AISHELL-2 (Android channel) |
+| AISHELL-2_MIC_TEST | test set of AISHELL-2 (Microphone channel) |
+
+</p></details>
+
+<details><summary> SpeechIO Test Sets (ZH) </summary><p>
+
+| 编号 <br> TEST_SET_ID | 名称 <br> Name |场景 <br> Scenario | 内容领域 <br> Topic Domain | 时长 <br> hours | 难度(1-5) <br> Difficulty  |
 | --- | --- | --- | --- | --- | --- |
 |SPEECHIO_ASR_ZH00000| 接入调试集 <br> For leaderboard submitter debugging | 视频会议、论坛演讲 <br> video conference & forum speech | 经济、货币、金融 <br> economy, currency, finance | 1.0 | ★★☆ |
 |SPEECHIO_ASR_ZH00001| 新闻联播 | 新闻播报 <br> TV News | 时政 <br> news & politics | 9 | ★ |
@@ -68,7 +79,7 @@ With SpeechIO leaderboard, anyone can benchmark/reproduce/compare performances w
 ---
 
 ## 3. Model Zoo
-<details><summary> Commercial Models (ZH) </summary><p>
+<details><summary> Commercial API (ZH) </summary><p>
 
 | 编号 <br> MODEL_ID | 类型 <br> type | 模型作者/所有人 <br> model author/owner | 简介 <br> description | 链接 <br> url |
 | --- | --- | --- | --- | --- |
@@ -82,7 +93,7 @@ With SpeechIO leaderboard, anyone can benchmark/reproduce/compare performances w
 |yitu_api | Cloud API |依图 <br> YituTech |依图语音开放平台| https://speech.yitutech.com |
 </p></details>
 
-<details><summary> Open-Sourced Models (ZH) </summary><p>
+<details><summary> Open-Sourced Pretrained Models (ZH) </summary><p>
 
 | 编号 <br> MODEL_ID | 类型 <br> type | 模型作者/所有人 <br> model author/owner | 简介 <br> description | 链接 <br> url |
 | --- | --- | --- | --- | --- |

diff --git a/datasets/run_kaldi_to_speechio.sh b/datasets/run_kaldi_to_speechio.sh
@@ -1,4 +1,13 @@
+# mini debug test set
 ./kaldi_to_speechio.py download/MINI MINI
+
+# open-sourced test sets
+./kaldi_to_speechio.py download/AISHELL-1_test AISHELL-1_TEST
+./kaldi_to_speechio.py download/AISHELL-2_iOS_test AISHELL-2_IOS_TEST
+./kaldi_to_speechio.py download/AISHELL-2_Android_test AISHELL-2_ANDROID_TEST
+./kaldi_to_speechio.py download/AISHELL-2_Mic_test AISHELL-2_MIC_TEST
+
+# SpeechIO test sets
 ./kaldi_to_speechio.py download/economy_finance_currency SPEECHIO_ASR_ZH00000
 ./kaldi_to_speechio.py download/cctv_news SPEECHIO_ASR_ZH00001
 ./kaldi_to_speechio.py download/luyu_yirixing SPEECHIO_ASR_ZH00002
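
Review note: for the four open-sourced test sets added in this hunk, each `TEST_SET_ID` is simply the upper-cased download directory name. The four lines could therefore be generated by a loop, sketched below as a dry run that only prints the commands. `to_test_set_id` is a hypothetical helper, not part of the repository, and the naming rule is an observation from these four entries only. It does not hold for the SpeechIO test sets, whose IDs are assigned explicitly.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive a TEST_SET_ID by upper-casing the download
# directory name (assumption inferred from the four AISHELL entries above).
to_test_set_id() {
  printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'
}

# Dry run: print the conversion commands instead of executing them.
for name in AISHELL-1_test AISHELL-2_iOS_test AISHELL-2_Android_test AISHELL-2_Mic_test; do
  echo "./kaldi_to_speechio.py download/${name} $(to_test_set_id "${name}")"
done
```

Dropping the `echo` wrapper would run the conversions for real. The explicit one-line-per-set form used in the commit is easier to grep and to comment out selectively.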