This project implements end-to-end speech recognition with LAS (Listen, Attend and Spell) models in PyTorch.
We appreciate any kind of feedback or contribution.
End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.
For example, training of an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline with models that operate at low audio frame rates. This reduces training and decoding time and allows joint optimization with downstream processing such as natural language understanding.
We mainly referred to the following papers:
「State-of-the-art Speech Recognition with Sequence-to-Sequence Models」
「SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition」
If you want to study audio features, we recommend this paper:
「Voice Recognition Using MFCC Algorithm」
Our project is based on a Seq2seq architecture with attention.
Sequence-to-sequence architectures are still actively studied in speech recognition.
Our model architecture is as follows.
ListenAttendSpell(
  (listener): Listener(
    (conv): Sequential(
      (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): Hardtanh(min_val=0, max_val=20, inplace=True)
      (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): Hardtanh(min_val=0, max_val=20, inplace=True)
      (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (8): Hardtanh(min_val=0, max_val=20, inplace=True)
      (9): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (11): Hardtanh(min_val=0, max_val=20, inplace=True)
      (12): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (rnn): GRU(2560, 256, num_layers=5, batch_first=True, dropout=0.3, bidirectional=True)
  )
  (speller): Speller(
    (rnn): GRU(512, 512, num_layers=3, batch_first=True, dropout=0.3)
    (embedding): Embedding(2040, 512)
    (input_dropout): Dropout(p=0.3, inplace=False)
    (fc): Linear(in_features=512, out_features=2040, bias=True)
    (attention): MultiHeadAttention(
      (W_Q): Linear(in_features=512, out_features=512, bias=True)
      (W_V): Linear(in_features=512, out_features=512, bias=True)
      (fc): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
)
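The GRU input size of 2560 in the listener can be traced from the convolutional stack in the module summary above. The sketch below applies the standard Conv2d/MaxPool2d output-size formula to the frequency axis. The input feature dimension of 80 is an assumption (the summary does not state it); it is simply the value that is consistent with 128 output channels and a GRU input of 2560. The function names are hypothetical and not part of this project.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of one spatial axis after a Conv2d or MaxPool2d layer
    (dilation 1, floor rounding, as in the summary above)."""
    return (size + 2 * padding - kernel) // stride + 1


def listener_feature_dim(n_features=80, out_channels=128):
    """Trace the frequency axis through the listener's conv stack.

    n_features=80 is an assumed input feature size, not taken from the source.
    """
    # (kernel, stride, padding) of every layer that can change the spatial size
    layers = [
        (3, 1, 1),  # (0)  Conv2d 1 -> 64
        (3, 1, 1),  # (3)  Conv2d 64 -> 64
        (2, 2, 0),  # (5)  MaxPool2d
        (3, 1, 1),  # (7)  Conv2d 64 -> 128
        (3, 1, 1),  # (10) Conv2d 128 -> 128
        (2, 2, 0),  # (12) MaxPool2d
    ]
    freq = n_features
    for kernel, stride, padding in layers:
        freq = conv_out(freq, kernel, stride, padding)
    # Channels and the remaining frequency bins are flattened into one vector per frame
    return out_channels * freq


print(listener_feature_dim(80))
```

The two pooling layers also shorten the time axis by a factor of four, so the encoder GRU sees roughly one frame for every four input frames.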
We use the KsponSpeech dataset from AI Hub, which contains 1,000 hours of Korean speech data.
At present our model has recorded an 85.79% CRR, and we are working toward a higher recognition rate.
Our model has also recorded a 91.0% CRR on the Kaldi-zeroth dataset.
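The text does not define CRR. The sketch below assumes it stands for character recognition rate, computed as 1 minus the character error rate, where errors are counted with the Levenshtein edit distance over characters. This is an illustrative definition, not necessarily the exact evaluation code used in this project; in particular, how spaces are handled is an assumption (here they count as ordinary characters).

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_recognition_rate(references, hypotheses):
    """Assumed definition: CRR = 1 - (total edit distance / total reference length)."""
    total_dist = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_len = sum(len(r) for r in references)
    return 1.0 - total_dist / total_len


refs = ["안녕하세요"]
hyps = ["안녕하세오"]  # one substituted character out of five
print(character_recognition_rate(refs, hyps))
```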
We are constantly updating the progress of the project on the Wiki page. Please check this page.
This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtualenv or conda).
- Numpy:
  pip install numpy
  (Refer here for problems installing Numpy.)
- PyTorch: Refer to the PyTorch website to install the version that matches your environment.
- Pandas:
  pip install pandas
  (Refer here for problems installing Pandas.)
- librosa:
  pip install librosa
  (Refer here for problems installing librosa.)
- torchaudio:
  pip install torchaudio
  (Refer here for problems installing torchaudio.)
- tqdm:
  pip install tqdm
  (Refer here for problems installing tqdm.)
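Before building, it can help to confirm that the interpreter and the packages listed above are in place. The following is a minimal sketch using only the standard library; the helper names are hypothetical and not part of this project, and the import name `torch` for PyTorch is the only name not taken directly from the list.

```python
import importlib.util
import sys

# Import names of the packages listed above (PyTorch is imported as "torch")
REQUIRED = ["numpy", "torch", "pandas", "librosa", "torchaudio", "tqdm"]


def python_ok(min_version=(3, 7)):
    """True if the running interpreter meets the recommended minimum version."""
    return sys.version_info >= min_version


def missing_packages(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Python >= 3.7:", python_ok())
print("Missing packages:", missing_packages(REQUIRED))
```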
Currently we only support installation from source code using setuptools. Check out the source code and run the following commands:
pip install -r requirements.txt
python setup.py build
python setup.py install
Refer here before training.
The above document is written in Korean.
We will also provide an English version as soon as possible.
If you already have another dataset, please modify the dataset path in definition.py as appropriate.
You can start training with run.sh as follows.
- Linux
$ ./run.sh
- Windows
$ run.sh
To start testing after training, run with --mode='eval'.
You can set arguments in run.sh or at execution time.
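As a rough illustration of how a mode switch and other arguments can be passed at execution time, here is a minimal argparse sketch. Only `--mode` with the value `eval` comes from the text above; the `train` value, its use as the default, and the `--batch_size` option are assumptions made for illustration and may not match the project's actual argument list.

```python
import argparse


def build_parser():
    """Hypothetical parser; only --mode='eval' is taken from this README."""
    parser = argparse.ArgumentParser(description="Illustrative LAS run options")
    # 'train' as the other mode and as the default is an assumption
    parser.add_argument("--mode", choices=["train", "eval"], default="train")
    # Hypothetical option showing how a default can be overridden at execution time
    parser.add_argument("--batch_size", type=int, default=32)
    return parser


# Equivalent to passing: --mode eval --batch_size 16 on the command line
args = build_parser().parse_args(["--mode", "eval", "--batch_size", "16"])
print(args.mode, args.batch_size)
```

Values written into run.sh and values typed on the command line end up in the same parser, so either place works for overriding a default.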
We introduce the use of an external language model in the performance test.
If you are interested in this, please check here.
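The text does not say which integration method is used. One common approach is shallow fusion, where the language model's log-probability is added to the recognizer's log-probability with a weight at each decoding step. The sketch below assumes that approach; the function names, the weight of 0.3, and the toy probabilities are all illustrative and not taken from this project.

```python
import math


def shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Combine per-token scores: log p_asr(y) + lm_weight * log p_lm(y)."""
    return {tok: asr_log_probs[tok] + lm_weight * lm_log_probs[tok]
            for tok in asr_log_probs}


def pick_token(asr_log_probs, lm_log_probs, lm_weight=0.3):
    """Greedy choice of the next token under the fused score."""
    scores = shallow_fusion_scores(asr_log_probs, lm_log_probs, lm_weight)
    return max(scores, key=scores.get)


# Toy distributions over three candidate tokens (made-up numbers)
asr = {"a": math.log(0.50), "b": math.log(0.45), "c": math.log(0.05)}
lm = {"a": math.log(0.10), "b": math.log(0.80), "c": math.log(0.10)}

print(pick_token(asr, lm, lm_weight=0.0))  # recognizer alone
print(pick_token(asr, lm, lm_weight=0.3))  # recognizer plus language model
```

In practice the same fused score would be used inside beam search rather than a greedy choice, and the weight would be tuned on held-out data.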
If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please visit our Gitter or contact us at [email protected].
We appreciate any kind of feedback or contribution. Feel free to proceed with small issues such as bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.
We follow PEP-8 for code style. Docstring style is especially important because it is used to generate the documentation.
[1] 「Listen, Attend and Spell」 Paper
[2] 「State-of-the-art Speech Recognition with Sequence-to-Sequence Models」 Paper
[3] 「SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition」 Paper
[4] 「An analysis of incorporating an external language model into a sequence-to-sequence model」 Paper
[5] 「Voice Recognition Using MFCC Algorithm」 Paper
[6] IBM pytorch-seq2seq
[7] Character RNN Language Model
[8] KsponSpeech
[9] Documentation
@source_code{
title={End-to-end Speech Recognition},
author={Soohwan Kim, Seyoung Bae, Cheolhwang Won},
year={2020}
}