This repo is a fork of Baidu's DeepSpeech. Unlike Baidu's repo:
- It works with both TensorFlow and Theano
- It has helpers for better training by training against auto-generated phonograms
- Training with Theano can be much faster, since the CTC calculation can be done on the GPU

If you want to train with Theano you'll need Theano>=0.10, since it has bindings for Baidu's CTC.
The `HalfPhonemeModelWrapper` class in the `model_wrp` module implements training of a model in which half of the RNN layers are trained against phonograms and the rest against the actual output text. To generate phonograms, the Logios tool of CMU Sphinx can be used. Sphinx phonogram symbols are called Arpabets. To generate Arpabets from Baidu's DeepSpeech description files you can:
```bash
$ cat train_corpus.json | sed -e 's/.*"text": "\([^"]*\)".*/\1/' > train_corpus.txt
# make_pronunciation.pl script is provided by logios
# https://github.com/skerit/cmusphinx/tree/master/logios/Tools/MakeDict
$ perl ./make_pronunciation.pl -tools ../ -dictdir . -words prons/train_corpus.txt -dict prons/train_corpus.dict
$ python create_arpabet_json.py train_corpus.json train_corpus.dict train_corpus.arpadesc
```
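As an alternative to the `sed` one-liner above, the transcripts can be extracted with a short Python script. This assumes (as the `sed` pattern does) that each line of the description file is a JSON object with a `"text"` field; `extract_texts` is a hypothetical helper, not part of the repo:

```python
import json

def extract_texts(desc_path, out_path):
    # Read a DeepSpeech description file (one JSON object per line,
    # each with a "text" field) and write one transcript per line.
    with open(desc_path) as src, open(out_path, 'w') as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue
            dst.write(json.loads(line)['text'] + '\n')
```

Unlike the `sed` approach, this won't mis-split transcripts that contain escaped quotes.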
Select the Keras backend by setting the environment variable `KERAS_BACKEND` to `theano` or `tensorflow`.
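For example, in a bash-like shell:

```shell
# Pick one backend before launching training.
export KERAS_BACKEND=theano       # Theano: enables GPU CTC via Baidu's CTC bindings
# export KERAS_BACKEND=tensorflow # or use the TensorFlow backend instead
echo "$KERAS_BACKEND"
```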
Make a training routine, a function like this:

```python
def train_sample_half_phoneme(datagen, save_dir, epochs, sortagrad,
                              start_weights=False, mb_size=60):
    model_wrp = HalfPhonemeModelWrapper()
    model = model_wrp.compile(nodes=1000, conv_context=5, recur_layers=5)
    logger.info('model :\n%s' % (model.to_yaml(),))
    if start_weights:
        model.load_weights(start_weights)
    train_fn, test_fn = (model_wrp.compile_train_fn(1e-4),
                         model_wrp.compile_test_fn())
    trainer = Trainer(model, train_fn, test_fn, on_text=True, on_phoneme=True)
    trainer.run(datagen, save_dir, epochs=epochs, do_sortagrad=sortagrad,
                mb_size=mb_size, stateful=False)
    return trainer, model_wrp
```
Then call it from the `main()` of `train.py`. Training can be done by:

```bash
$ KERAS_BACKEND="tensorflow" python train.py descs/small.arpadesc descs/test-clean.arpadesc models/test --epochs 20 --use-arpabets --sortagrad 1
```
`visualize.py` gives you a semi-shell for testing your model by feeding it input files. There is also a models-evaluation notebook, though it may look a bit rough.
These models were trained for about three days on the LibriSpeech corpus on a GTX 1080 Ti GPU:
- A five-layer unidirectional RNN model trained on LibriSpeech using Theano: mega, drive
- A five-layer unidirectional RNN model trained on LibriSpeech using TensorFlow: mega, drive

The validation WER/CER of these models on `test-clean` is about 5%; on `test-other` it is about 15%.
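For reference, WER is the word-level edit distance between the hypothesis and the reference, divided by the reference length. A minimal sketch of the metric (not the repo's own evaluation code):

```python
def wer(ref, hyp):
    # Word error rate: Levenshtein distance over words, normalized
    # by the number of words in the (non-empty) reference.
    r, h = ref.split(), hyp.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # cost of deleting all reference words up to i
    for j in range(len(h) + 1):
        d[0][j] = j  # cost of inserting all hypothesis words up to j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / float(len(r))
```

CER is computed the same way but over characters instead of words.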