Data preparation code for building a Kaldi ASR system.
These scripts prepare data for building an ASR system in Kaldi by creating the following text files in the 'required' folder:
- text
- utt2spk
- segments
- wav.scp
- Before running prep4kaldi.sh, review its input section and set the following variables to fit your needs.
(1) datadir
- Path to the directory that contains the subfolders named by speaker ID.
- For example, given a corpus in the following directory:
/Users/cho/mycorpus/,
├─ s01/
├─ s02/
├─ s03/
├─ ...
├─ s19/
└─ s20/
NB. Each subfolder contains the corresponding speaker's
-> recordings (.wav)
-> transcriptions (.txt) or textgrids (.TextGrid)
- Then, specify as:
$ datadir='/Users/cho/mycorpus/'
(line 39)
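
Before setting datadir, it can help to confirm that the corpus follows the layout above. The sketch below is a hypothetical helper, not part of prep4kaldi.sh: the function name check_datadir is made up for illustration. It assumes the .wav files sit directly inside each speaker subfolder, prints each speaker ID with its recording count, and fails if there are no subfolders or if any subfolder has no recordings.

```shell
#!/bin/sh
# Hypothetical helper (not part of prep4kaldi.sh).
# Prints "<speaker_id> <number of .wav files>" for each subfolder of the
# given directory. Returns non-zero if the directory is missing, has no
# subfolders, or has a subfolder without any .wav file.
check_datadir() {
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "error: '$dir' is not a directory" >&2
        return 1
    fi
    nspk=0
    status=0
    for spk in "$dir"/*/; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -d "$spk" ] || continue
        nspk=$((nspk + 1))
        n=0
        for f in "$spk"*.wav; do
            if [ -f "$f" ]; then
                n=$((n + 1))
            fi
        done
        echo "$(basename "$spk") $n"
        if [ "$n" -eq 0 ]; then
            status=1
        fi
    done
    if [ "$nspk" -eq 0 ]; then
        echo "error: no speaker subfolders found in '$dir'" >&2
        return 1
    fi
    return $status
}

# Example call, using the path from the text above:
# check_datadir '/Users/cho/mycorpus/'
```

The check only counts recordings; whether the matching transcriptions or TextGrids are present depends on the datatype setting.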
(2) datatype
- Type of data from which information should be extracted.
- Choose either 'textgrid' or 'wavtxt'.
- For instance:
$ datatype='textgrid'
(line 40)
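
Since only two values are accepted, a small guard can catch a mistyped datatype early. This is a minimal sketch with a hypothetical function name (annotation_ext). It assumes that 'textgrid' corresponds to .TextGrid annotation files and 'wavtxt' to .txt transcriptions, which is inferred from the corpus description above rather than taken from prep4kaldi.sh itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of prep4kaldi.sh).
# Prints the annotation file extension assumed for each datatype value
# and fails on anything other than 'textgrid' or 'wavtxt'.
annotation_ext() {
    case "$1" in
        textgrid) echo "TextGrid" ;;
        wavtxt)   echo "txt" ;;
        *)
            echo "error: datatype must be 'textgrid' or 'wavtxt', got '$1'" >&2
            return 1
            ;;
    esac
}

datatype='textgrid'
ext=$(annotation_ext "$datatype") || exit 1
echo "expecting .wav and .$ext files in each speaker folder"
```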
(3) tiername
- Name of the TextGrid tier to extract labels from.
- For example, if the transcriptions need to be extracted from the 'utterance' tier, specify as:
$ tiername='utterance'
(line 41)
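
A quick way to confirm that a tier name exists before running the script is to search a TextGrid for it. The sketch below uses a hypothetical function name (has_tier) and rests on two assumptions: the TextGrid is saved in Praat's long text format, where each tier has a line of the form name = "...", and the file is in an encoding grep can read as plain text (for example UTF-8, not UTF-16). The file path in the usage comment is also only illustrative.

```shell
#!/bin/sh
# Hypothetical helper (not part of prep4kaldi.sh).
# Succeeds if the TextGrid file ($1) contains a tier named $2.
# Assumes Praat long text format and a grep-readable encoding.
has_tier() {
    # -F treats the pattern as a fixed string, so tier names with
    # regex metacharacters are matched literally.
    grep -qF -- "name = \"$2\"" "$1"
}

# Example call (illustrative path):
# has_tier '/Users/cho/mycorpus/s01/example.TextGrid' 'utterance' \
#     || echo "tier 'utterance' not found" >&2
```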
After specifying 'datadir', 'datatype', and 'tiername' in prep4kaldi.sh, run the following command:
$ sh prep4kaldi.sh
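
After the script finishes, the four output files can be checked with a short loop. This is a sketch with a hypothetical function name (check_required). It assumes the 'required' folder is created relative to the directory the script is run from; pass a different path if that is not the case.

```shell
#!/bin/sh
# Hypothetical helper (not part of prep4kaldi.sh).
# Checks that text, utt2spk, segments, and wav.scp exist and are
# non-empty in the given folder (default: ./required).
check_required() {
    dir=${1:-required}
    missing=0
    for f in text utt2spk segments wav.scp; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $dir/$f" >&2
            missing=1
        fi
    done
    return $missing
}

# Example call:
# check_required required && echo "all four files are present"
```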