# Train your own CTC model!
You will need the following packages installed before you can train a model using this code. You may have to change `PYTHONPATH` to include the directories of your new packages.
- **theano**

  The underlying deep learning Python library. We suggest using the bleeding-edge version:

  ```
  git clone https://github.com/Theano/Theano
  ```

  Follow the instructions at http://deeplearning.net/software/theano/install.html#bleeding-edge-install-instructions, or simply:

  ```
  cd Theano; python setup.py install
  ```

- **keras**

  A wrapper over Theano that provides convenient functions for building networks. Once again, we suggest using the bleeding-edge version. Make sure you install it with support for `hdf5`; we make use of that to save models.

  ```
  git clone https://github.com/fchollet/keras
  ```

  Follow the installation instructions at https://github.com/fchollet/keras, or simply:

  ```
  cd keras; python setup.py install
  ```

- **warp-ctc**

  Contains the main implementation of the CTC cost function.

  ```
  git clone https://github.com/baidu-research/warp-ctc
  ```

  To install it, follow the instructions at https://github.com/baidu-research/warp-ctc.

- **theano-warp-ctc**

  A Theano wrapper over warp-ctc.

  ```
  git clone https://github.com/sherjilozair/ctc
  ```

  Follow the installation instructions at https://github.com/sherjilozair/ctc.

- **Others**

  You may require some additional packages. Install the Python requirements through `pip`:

  ```
  pip install soundfile
  ```

  On Ubuntu, `avconv` (used here for audio format conversions) requires `libav-tools`:

  ```
  sudo apt-get install libav-tools
  ```
We will make use of the LibriSpeech ASR corpus to train our models. Use the `download.sh` script to download this corpus (~65GB). Use `flac_to_wav.sh` to convert any `flac` files to `wav`.
We make use of a JSON file that aggregates all data for training, validation and testing. Once you have a corpus, create a description file that is a JSON-lines file in the following format:

```
{"duration": 15.685, "text": "spoken text label", "key": "/home/username/LibriSpeech/train-clean-360/5672/88367/5672-88367-0031.wav"}
{"duration": 14.32, "text": "ground truth text", "key": "/home/username/LibriSpeech/train-other-500/8678/280914/8678-280914-0009.wav"}
```
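A description file in this format can be generated with a short script. The sketch below is illustrative only (the helper names and the wav/transcript pairing are not part of this repo); it computes durations with Python's standard `wave` module, which gives the same frames-over-sample-rate value as `soxi -D`.

```python
import json
import wave

def wav_duration(path):
    # Duration in seconds: frames / sample rate.
    with wave.open(path, 'rb') as w:
        return w.getnframes() / float(w.getframerate())

def build_description(pairs, out_path):
    # pairs: iterable of (wav_path, transcript) tuples from your corpus.
    # Writes one JSON object per line, matching the format shown above.
    with open(out_path, 'w') as out:
        for wav_path, text in pairs:
            record = {"duration": round(wav_duration(wav_path), 3),
                      "text": text,
                      "key": wav_path}
            out.write(json.dumps(record) + "\n")
```

For LibriSpeech you would pair each `wav` file with its line from the corresponding transcript file before calling the builder.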
Each line is a JSON object. We will make use of the durations to construct a curriculum for the first epoch (shorter utterances are easier). You can query the duration of a file using `soxi -D filename`. By default, we split this data as 80% training, 10% validation and 10% testing; you can adjust these proportions in `data_generator.py`.
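The duration-based curriculum and the default 80/10/10 split can be sketched as follows. The real logic lives in `data_generator.py`, so treat the function names and details here as illustrative only:

```python
import json

def load_description(path):
    # Read a JSON-lines description file into a list of dicts.
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def split_data(records, train=0.8, valid=0.1):
    # 80/10/10 by default; whatever remains after train+valid is the test set.
    n_train = int(len(records) * train)
    n_valid = int(len(records) * valid)
    return (records[:n_train],
            records[n_train:n_train + n_valid],
            records[n_train + n_valid:])

def first_epoch_order(records):
    # Curriculum for the first epoch: shortest utterances first.
    return sorted(records, key=lambda r: r["duration"])
```

Later epochs would shuffle the training records instead of sorting them.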
Finally, let's train a model!

```
python train.py corpus.json ./save_my_model_here
```

This will checkpoint a model every few iterations into the directory you specify.