This is a project for end-to-end speech recognition using LAS (Listen, Attend and Spell) models, implemented in PyTorch.
This repository provides modularized and extensible components for LAS models, training and inference, checkpoints, and more.
We appreciate any kind of feedback or contribution.
We use KsponSpeech, which contains 1,000 hours of Korean speech data.
At present our model has recorded an 86.09% CRR, and we are working toward a higher recognition rate.
Our model has also recorded a 91.0% CRR on the Kadi-zeroth dataset.
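CRR here denotes the character recognition rate. As a rough illustration only (not the repository's evaluation code), the sketch below assumes the common definition CRR = (1 − CER) × 100, where CER is the character-level Levenshtein distance divided by the reference length.

```python
# Illustrative only: assumes CRR = (1 - CER) * 100, where CER is the
# character-level edit distance divided by the reference length.
def levenshtein(ref: str, hyp: str) -> int:
    """Edit distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def crr(ref: str, hyp: str) -> float:
    cer = levenshtein(ref, hyp) / max(len(ref), 1)
    return (1.0 - cer) * 100.0

print(crr("안녕하세요", "안녕하세오"))  # 80.0
```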
- E2E automatic speech recognition
- Convolutional encoder
- MaskConv & pack_padded_sequence
- Multi-Head Attention
- Top K Decoding (Beam Search)
- Provides a variety of feature extraction methods
- Delete silence
- SpecAugment (see the sketch after this list)
- Noise Injection
- Label Smoothing
- Save & load Checkpoint
- Various options can be set via the argument parser
- Multi-threaded data loader for speed
- Inference with batching
- Multi-GPU training
- Training states shown in logs
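As a rough illustration of the SpecAugment item above, time and frequency masking can be applied directly to a log-mel feature tensor. The function below is only a sketch, with parameter names borrowed from the augmentation options used later in this README (time_mask_para, freq_mask_para, time_mask_num, freq_mask_num); it is not the repository's implementation.

```python
import torch

def spec_augment(feat: torch.Tensor,
                 time_mask_para: int = 50, freq_mask_para: int = 12,
                 time_mask_num: int = 2, freq_mask_num: int = 2) -> torch.Tensor:
    """Minimal SpecAugment sketch: zero out random time and frequency bands.

    feat: (time, n_mels) log-mel spectrogram. Parameter names follow the
    training options shown later in this README; the code itself is only
    an illustration.
    """
    feat = feat.clone()
    n_frames, n_mels = feat.shape

    for _ in range(time_mask_num):                           # time masking
        t = torch.randint(0, time_mask_para + 1, (1,)).item()
        t0 = torch.randint(0, max(n_frames - t, 1), (1,)).item()
        feat[t0:t0 + t, :] = 0.0

    for _ in range(freq_mask_num):                           # frequency masking
        f = torch.randint(0, freq_mask_para + 1, (1,)).item()
        f0 = torch.randint(0, max(n_mels - f, 1), (1,)).item()
        feat[:, f0:f0 + f] = 0.0

    return feat
```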
End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are composed of an acoustic model, a language model, and a pronunciation model, require these components to be trained separately, each of which can be complex.
For example, training an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and the output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline, with models that operate at low audio frame rates. This reduces training and decoding time, and allows joint optimization with downstream processing such as natural language understanding.
We mainly referred to the following papers:
「State-of-the-art Speech Recognition with Sequence-to-Sequence Models」
「SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition」.
If you want to study audio features, we recommend this paper:
「Voice Recognition Using MFCC Algorithm」.
Our project is based on a sequence-to-sequence (Seq2seq) with attention architecture.
Sequence-to-sequence architectures are still an active area of research in speech recognition.
Our model architecture is as follows.
ListenAttendSpell(
  (listener): Listener(
    (conv): Sequential(
      (0): Conv2d(1, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): Hardtanh(min_val=0, max_val=20, inplace=True)
      (2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (4): Hardtanh(min_val=0, max_val=20, inplace=True)
      (5): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
      (6): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (7): Conv2d(64, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (8): Hardtanh(min_val=0, max_val=20, inplace=True)
      (9): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (10): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (11): Hardtanh(min_val=0, max_val=20, inplace=True)
      (12): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (rnn): GRU(2560, 256, num_layers=5, batch_first=True, dropout=0.3, bidirectional=True)
  )
  (speller): Speller(
    (rnn): GRU(512, 512, num_layers=3, batch_first=True, dropout=0.3)
    (embedding): Embedding(2040, 512)
    (input_dropout): Dropout(p=0.3, inplace=False)
    (fc): Linear(in_features=512, out_features=2040, bias=True)
    (attention): MultiHeadAttention(
      (W_Q): Linear(in_features=512, out_features=512, bias=True)
      (W_V): Linear(in_features=512, out_features=512, bias=True)
      (fc): Linear(in_features=1024, out_features=512, bias=True)
    )
  )
)
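For readers who prefer code to a module printout, the listener's convolutional front-end above can be reproduced with a plain nn.Sequential. The sketch below is re-derived from the printed structure (the input axis layout is our assumption), not the repository's own module definition.

```python
import torch
import torch.nn as nn

# Sketch of the listener's convolutional front-end, re-derived from the
# module printout above (not the repository's own source).
conv = nn.Sequential(
    nn.Conv2d(1, 64, kernel_size=3, stride=1, padding=1),
    nn.Hardtanh(0, 20, inplace=True),
    nn.BatchNorm2d(64),
    nn.Conv2d(64, 64, kernel_size=3, stride=1, padding=1),
    nn.Hardtanh(0, 20, inplace=True),
    nn.MaxPool2d(2, stride=2),
    nn.BatchNorm2d(64),
    nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
    nn.Hardtanh(0, 20, inplace=True),
    nn.BatchNorm2d(128),
    nn.Conv2d(128, 128, kernel_size=3, stride=1, padding=1),
    nn.Hardtanh(0, 20, inplace=True),
    nn.MaxPool2d(2, stride=2),
)

# A batch of 80-mel features: (batch, channel, time, n_mels) is assumed here.
x = torch.randn(4, 1, 151, 80)
y = conv(x)        # -> (4, 128, 37, 20)
# 128 channels * 20 mel bins = 2560, matching the GRU input size printed above.
print(y.shape)
```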
We are constantly updating the progress of the project on the Wiki page. Please check this page.
This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtual env or conda).
- Numpy:
  pip install numpy
  (Refer here for problems installing Numpy.)
- PyTorch: Refer to the PyTorch website to install the version matching your environment.
- Pandas:
  pip install pandas
  (Refer here for problems installing Pandas.)
- librosa:
  pip install librosa
  (Refer here for problems installing librosa.)
- torchaudio:
  pip install torchaudio
  (Refer here for problems installing torchaudio.)
- tqdm:
  pip install tqdm
  (Refer here for problems installing tqdm.)
Currently we only support installation from source using setuptools. Check out the source code and run the following commands:
pip install -r requirements.txt
python setup.py build
python setup.py install
Refer here before training; this document contains information on preprocessing KsponSpeech.
The above document is written in Korean.
We will also write a document in English as soon as possible, so please wait a little bit.
If you already have another dataset, please modify the dataset path in definition.py as appropriate.
- Default setting
$ ./train.sh
- Custom setting
python ./train.py -use_multi_gpu -init_uniform -mode 'train' -batch_size 32 -num_workers 4 \
-num_epochs 20 -use_augment -augment_num 1 -max_len 151 \
-use_cuda -lr 3e-04 -min_lr 1e-05 -lr_patience 1/3 -valid_ratio 0.01 \
-label_smoothing 0.1 -save_result_every 1000 -print_every 10 -checkpoint_every 5000 \
-use_bidirectional -hidden_dim 256 -dropout 0.3 -num_heads 8 -rnn_type 'gru' \
-listener_layer_size 5 -speller_layer_size 3 -teacher_forcing_ratio 0.99 \
-input_reverse -normalize -del_silence -sample_rate 16000 -window_size 20 -stride 10 -n_mels 80 \
-feature_extract_by 'librosa' -time_mask_para 50 -freq_mask_para 12 \
-time_mask_num 2 -freq_mask_num 2
You can train the model with the commands above.
To train with the default settings, use the Default setting command; to specify hyperparameters yourself, use the Custom setting command.
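For reference, the feature-related flags above (-sample_rate 16000, -window_size 20, -stride 10, -n_mels 80, -feature_extract_by 'librosa') roughly correspond to the log-mel extraction sketched below. This only illustrates how such flags map onto librosa calls; it is not the repository's actual feature pipeline.

```python
import librosa
import numpy as np

def log_mel_spectrogram(path: str, sample_rate: int = 16000,
                        window_size_ms: int = 20, stride_ms: int = 10,
                        n_mels: int = 80) -> np.ndarray:
    """Sketch of log-mel feature extraction matching the flags above."""
    signal, sr = librosa.load(path, sr=sample_rate)
    n_fft = int(sr * window_size_ms / 1000)      # 20 ms -> 320 samples at 16 kHz
    hop_length = int(sr * stride_ms / 1000)      # 10 ms -> 160 samples at 16 kHz

    mel = librosa.feature.melspectrogram(
        y=signal, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel)              # shape: (n_mels, time)
```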
- Default setting
$ ./infer.sh
- Custom setting
python ./infer.py -mode 'infer' -use_multi_gpu -use_cuda -batch_size 32 -num_workers 4 \
-use_beam_search -k 5 -print_every 100 \
-sample_rate 16000 -window_size 20 -stride 10 -n_mels 80 -feature_extract_by 'librosa' \
-normalize -del_silence -input_reverse
Now you have a model which you can use to predict on new data. We do this by running beam search (or greedy search).
Like training, you can choose between the Default setting and the Custom setting commands.
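For reference, the idea behind the Top-K (beam search) decoding enabled by -use_beam_search -k 5 is sketched below. Here decode_step is a hypothetical callable returning next-character log-probabilities for a prefix; it is not a function from this repository, and the real speller decodes batches of encoder outputs.

```python
import torch

def beam_search(decode_step, sos_id: int, eos_id: int, k: int = 5, max_len: int = 151):
    """Conceptual beam search sketch (hypothetical decode_step, illustration only)."""
    beams = [([sos_id], 0.0)]                        # (prefix, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos_id:                 # keep finished hypotheses as-is
                candidates.append((prefix, score))
                continue
            log_probs = decode_step(prefix)          # shape: (vocab_size,)
            topk = torch.topk(log_probs, k)
            for lp, idx in zip(topk.values, topk.indices):
                candidates.append((prefix + [idx.item()], score + lp.item()))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:k]  # keep k best
        if all(p[-1] == eos_id for p, _ in beams):
            break
    return beams[0][0]                               # best hypothesis
```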
Checkpoints are organized by experiments and timestamps as shown in the following file structure.
save_dir
+-- checkpoints
| +-- YYYY_mm_dd_HH_MM_SS
| +-- trainer_states.pt
| +-- model.pt
You can resume and load from checkpoints.
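As a minimal sketch (assuming the .pt files were written with torch.save and hold the full objects, which may differ from the trainer's actual format), a checkpoint can be reloaded like this; the timestamped directory name is a placeholder.

```python
import torch

# Sketch of reloading a checkpoint; the timestamp directory is a placeholder
# and the exact objects stored in these files depend on the trainer.
checkpoint_dir = "save_dir/checkpoints/2020_05_01_12_00_00"

model = torch.load(f"{checkpoint_dir}/model.pt", map_location="cpu")
trainer_states = torch.load(f"{checkpoint_dir}/trainer_states.pt", map_location="cpu")

model.eval()  # ready for inference, or resume training from trainer_states
```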
We introduce incorporating an external language model in the performance test.
If you are interested in this content, please check here.
If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please visit our Gitter, or contact [email protected].
We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes and documentation improvements. For major contributions and new features, please discuss with the collaborators in the corresponding issues.
We follow PEP 8 for code style. The style of docstrings is especially important for generating documentation.
[1] 「Listen, Attend and Spell」 Paper
[2] 「State-of-the-art Speech Recognition with Sequence-to-Sequence Models」 Paper
[3] 「SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition」 Paper
[4] 「An analysis of incorporating an external language model into a sequence-to-sequence model」 Paper
[5] 「Voice Recognition Using MFCC Algorithm」 Paper
[6] IBM pytorch-seq2seq
[7] Character RNN Language Model
[8] KsponSpeech
[9] Documentation
@github{
  title = {End-to-end Speech Recognition},
  author = {Soohwan Kim, Seyoung Bae, Cheolhwang Won},
  publisher = {GitHub},
  docs = {https://sooftware.github.io/End-to-end-Speech-Recognition/},
  url = {https://github.com/sooftware/End-to-end-Speech-Recognition},
  year = {2020}
}