Automatic Speech Recognition Using TensorFlow

The code is written in Python 2.7 with TensorFlow 1.0. The high-level network structure is shown in the figure below.

Dataset

The dataset used in this repo is TIMIT. The training set contains 3699 utterances and the test set contains 1347 utterances (the 'sa' files are removed from the original dataset to avoid biasing the system).
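As a minimal sketch of the 'sa' filtering step, the snippet below drops every path whose file name starts with `sa`. The function name `drop_sa_utterances` and the example paths are hypothetical; the repo's own filtering code may work differently.

```python
import os

def drop_sa_utterances(paths):
    """Keep only paths whose file name does not start with 'sa' (case-insensitive)."""
    return [p for p in paths
            if not os.path.basename(p).lower().startswith("sa")]

# Hypothetical TIMIT-style paths, used only for illustration.
example_paths = [
    "train/dr1/fcjf0/sa1.wav",
    "train/dr1/fcjf0/si1027.wav",
    "train/dr1/fcjf0/SA2.WAV",
    "train/dr1/fcjf0/sx127.wav",
]
kept = drop_sa_utterances(example_paths)
```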

WAV Format Conversion

The original wav files are actually in NIST format, so they must be converted beforehand using the script nist2wav.sh. Please make sure libsndfile is installed on your machine first.
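To check whether a file still needs conversion, one option is to look at its first bytes: NIST SPHERE files begin with the ASCII string `NIST_1A`, while standard wav files begin with `RIFF`. This is only a sketch with hypothetical helper names (`has_nist_magic`, `needs_conversion`); the conversion itself is still left to nist2wav.sh.

```python
NIST_MAGIC = b"NIST_1A"

def has_nist_magic(header_bytes):
    """Return True if the given leading bytes look like a NIST SPHERE header."""
    return header_bytes[:len(NIST_MAGIC)] == NIST_MAGIC

def needs_conversion(path):
    """Read the first bytes of a file and report whether it is still NIST SPHERE."""
    with open(path, "rb") as f:
        return has_nist_magic(f.read(len(NIST_MAGIC)))
```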

Feature Extraction

MFCC is used here to extract features from the raw audio waveform. I'm using the code here to calculate the features.
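For readers unfamiliar with MFCCs, here is a rough NumPy sketch of the usual pipeline: framing, Hamming window, power spectrum, mel filterbank, log, and DCT-II. It is not the code used in this repo. Every parameter value (16 kHz sample rate, 25 ms frames, 10 ms step, 26 filters, 13 coefficients) is an assumption, and pre-emphasis, liftering, and delta features are left out.

```python
import numpy as np

def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale: (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):      # rising edge of the triangle
            fbank[i, k] = (k - left) / float(center - left)
        for k in range(center, right):     # falling edge of the triangle
            fbank[i, k] = (right - k) / float(right - center)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=0.025, frame_step=0.010,
         n_fft=512, n_filters=26, n_ceps=13):
    """Return an array of shape (n_frames, n_ceps). All defaults are assumptions."""
    signal = np.asarray(signal, dtype=np.float64)
    flen = int(round(frame_len * sample_rate))
    fstep = int(round(frame_step * sample_rate))
    if len(signal) < flen:                 # pad very short inputs to one full frame
        signal = np.pad(signal, (0, flen - len(signal)), mode="constant")
    n_frames = 1 + (len(signal) - flen) // fstep
    # Index matrix that slices the signal into overlapping frames.
    idx = np.arange(flen)[None, :] + fstep * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(flen)
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    energies = np.dot(power, mel_filterbank(n_filters, n_fft, sample_rate).T)
    log_energies = np.log(np.maximum(energies, 1e-10))  # floor avoids log(0)
    # Unnormalized DCT-II over the filterbank axis, keeping the first n_ceps terms.
    basis = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * np.arange(n_filters) + 1)
                   / (2.0 * n_filters))
    return np.dot(log_energies, basis.T)

# One second of a 440 Hz tone as a stand-in for a real utterance.
tone = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
features = mfcc(tone)
```

Each row of `features` is one frame, which is the time-major layout a recurrent acoustic model typically consumes.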

Model

A 4-layer bidirectional GRU is used as the acoustic model, and CTC is used to calculate the loss and backpropagate the gradient through the network layers. Dropout is used to prevent overfitting, and gradient clipping is used to prevent gradient explosion.
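The README does not say which clipping variant is used. As an illustration only, the NumPy sketch below shows clipping by global norm, following the formula documented for TensorFlow's `tf.clip_by_global_norm`: every gradient is scaled by `clip_norm / max(global_norm, clip_norm)`. The function name and the example values are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so their combined L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(np.square(g)) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # equals 1.0 when already small enough
    return [g * scale for g in grads], global_norm

# Two toy gradient tensors with a combined norm of 5.
grads = [np.array([3.0, 4.0]), np.array([0.0])]
clipped, norm_before = clip_by_global_norm(grads, clip_norm=1.0)
```

Scaling all tensors by the same factor keeps the direction of the update unchanged, which is the usual argument for this variant over clipping each value independently.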

PER

A PER calculation wrapper around the Levenshtein edit distance is implemented (code), so PER can be computed from this distance without relying on TensorFlow's sub-graph. Specifically, as suggested in "Speaker-independent phone recognition using hidden Markov models", we merge the original 61 phonemes into 39 to get more robust predictions.
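The calculation can be sketched in plain Python as follows: fold the phonemes, compute the Levenshtein distance between the reference and predicted sequences, and divide by the reference length. This is not the repo's wrapper. The function names are hypothetical, and `FOLD` lists only a handful of entries from the commonly used 61-to-39 folding, not the full table.

```python
# Partial, illustrative subset of the 61 -> 39 phoneme folding.
FOLD = {
    "ao": "aa", "ax": "ah", "ix": "ih", "ux": "uw", "hv": "hh",
    "el": "l", "em": "m", "en": "n", "zh": "sh",
}

def fold(phones):
    """Map each phoneme to its folded class; unknown phonemes pass through."""
    return [FOLD.get(p, p) for p in phones]

def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def phoneme_error_rate(ref, hyp):
    """PER = edit distance between folded sequences / folded reference length."""
    ref, hyp = fold(ref), fold(hyp)
    if not ref:
        raise ValueError("reference sequence must not be empty")
    return edit_distance(ref, hyp) / float(len(ref))

reference = ["sh", "ix", "hv", "ae", "d"]
predicted = ["sh", "ih", "hh", "eh", "d"]
per = phoneme_error_rate(reference, predicted)
```

In this example, folding makes `ix`/`ih` and `hv`/`hh` count as matches, so only the `ae`/`eh` substitution contributes to the error.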

The figure below was generated using TensorBoard during the training phase.

