Merge pull request #168 from X-LANCE/ygr_pr2

improve instruction of data preparation for Mala-asr.
X-LANCE · Nov 8, 2024 · 3a729b9 · 3a729b9
2 parents dbfcfca + 378fb87
Showing 1 changed file with 22 additions and 1 deletion.
diff --git a/examples/mala_asr_slidespeech/README.md b/examples/mala_asr_slidespeech/README.md
@@ -20,7 +20,28 @@ Encoder | Projector | LLM | dev | test
 
 
 ## Data preparation
-Refer to official [SLIDESPEECH CORPUS](https://slidespeech.github.io/)
+Refer to official [SLIDESPEECH CORPUS](https://slidespeech.github.io/).
+
+Specifically, taking `slidespeech_dataset.py` as an example, the dataset requires four files: `my_wav.scp`, `utt2num_samples`, `text`, and `hot_related/ocr_1gram_top50_mmr070_hotwords_list`.
+
+`my_wav.scp` lists the audio paths. We convert the wav files to ark files, so this file looks like:
+```
+ID1 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:22  
+ID2 xxx/slidespeech/dev_oracle_v1/data/format.1/data_wav.ark:90445
+...
+```
+
+To generate this file, download the audio wavs from https://www.openslr.org/144/ and the time segments from https://slidespeech.github.io/. The latter site provides the segments, transcription text, and OCR results at https://speech-lab-share-data.oss-cn-shanghai.aliyuncs.com/SlideSpeech/related_files.tar.gz (~1.37GB). You need to segment the wavs using the timestamps provided in the `segments` file.
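The segmentation step could be sketched roughly as follows. This is a minimal illustration, not code from the repository: it assumes the `segments` file uses the Kaldi-style layout `<utt_id> <recording_id> <start_sec> <end_sec>` and that the recordings are uncompressed PCM WAV files readable by Python's standard `wave` module. The helper names `parse_segments` and `cut_segment` are hypothetical.

```python
import wave


def parse_segments(lines):
    """Parse lines assumed to be: <utt_id> <recording_id> <start_sec> <end_sec>."""
    segments = []
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        utt_id, rec_id, start, end = parts
        segments.append((utt_id, rec_id, float(start), float(end)))
    return segments


def cut_segment(src_wav, dst_wav, start_sec, end_sec):
    """Copy the frames in [start_sec, end_sec) to dst_wav; return frames written."""
    with wave.open(str(src_wav), "rb") as src:
        params = src.getparams()
        rate = src.getframerate()
        first = int(round(start_sec * rate))
        count = max(0, int(round(end_sec * rate)) - first)
        # Clamp the start so a timestamp past the end yields an empty segment.
        src.setpos(min(first, src.getnframes()))
        frames = src.readframes(count)
    with wave.open(str(dst_wav), "wb") as dst:
        dst.setparams(params)  # the frame count in the header is fixed on close
        dst.writeframes(frames)
    return len(frames) // (params.sampwidth * params.nchannels)
```

Each parsed tuple can then be passed to `cut_segment` with the path of the matching recording. Converting the resulting wavs to ark files is a separate step that this sketch does not cover.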
+
+
+The _related_files.tar.gz_ archive also provides `text` and a file named `keywords`. The `keywords` file corresponds to `hot_related/ocr_1gram_top50_mmr070_hotwords_list`, which contains the hotword list.
+
+`utt2num_samples` contains the length of each wav in samples and looks like:
+```
+ID1 103680  
+ID2 181600  
+...
+```
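One possible way to produce `utt2num_samples` is sketched below. It is an illustration under stated assumptions rather than the authors' tooling: it reads segmented PCM WAV files directly (not the ark files referenced in `my_wav.scp`), and the helper names `num_samples` and `write_utt2num_samples` are hypothetical.

```python
import wave


def num_samples(wav_path):
    """Return the number of sample frames (per channel) in a PCM WAV file."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes()


def write_utt2num_samples(utt_to_wav, out_path):
    """Write one '<utt_id> <num_samples>' line per utterance, sorted by ID."""
    with open(out_path, "w", encoding="utf-8") as out:
        for utt_id in sorted(utt_to_wav):
            out.write(f"{utt_id} {num_samples(utt_to_wav[utt_id])}\n")
```

Here `utt_to_wav` is a plain dictionary mapping each utterance ID to the path of its segmented wav.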
 
 ## Decode with checkpoints
 ```