- January 2021: Add Jasper model
- January 2021: Release v1.2
- January 2021: Add Joint CTC-Attention Transformer model
- January 2021: Add Speech Transformer model
- January 2021: Apply Hydra: framework for elegantly configuring complex applications
- December 2020: Release v1.1
- December 2020: Update pre-train models
- December 2020: Add Joint CTC-Attention LAS (multi-GPU training is not yet supported)
- November 2020: Add Deep Speech 2 passing
- November 2020: Add KsponSpeech Subword & Grapheme Unit (Not Tested)
- November 2020: Add RAdam & AdamP Optimizer
- Currently, beam search may not work properly.
- Subword and grapheme units are currently untested.
KoSpeech is an open-source, modular, and extensible end-to-end Korean automatic speech recognition (ASR) toolkit based on the deep learning library PyTorch. Several open-source ASR toolkits have been released, but all of them target non-Korean languages such as English (e.g. ESPnet, Espresso). Although AI Hub has released KsponSpeech, a 1,000-hour Korean speech corpus, there is no established preprocessing method or baseline model against which to compare model performance. We therefore propose preprocessing methods for the KsponSpeech corpus and several baseline models (Deep Speech 2, LAS, Transformer, Jasper). We hope KoSpeech can serve as a guideline for those researching Korean speech recognition.
Description | Loss | Feature | Dataset | Epochs | CER | Model |
---|---|---|---|---|---|---|
Transformer (12-6) | CTC + CrossEntropy | Kaldi-style fbank 80 | KsponSpeech | 7 | 9.84 | download |
Listen Attend Spell | CrossEntropy | - | KsponSpeech | - | - | to be uploaded |
Listen Attend Spell | CTC + CrossEntropy | - | KsponSpeech | - | - | to be uploaded |
Deep Speech 2 | CTC | - | KsponSpeech | - | - | to be uploaded |
Jasper | CTC | Kaldi-style fbank 80 | KsponSpeech | 2 | 56.5 | download |
VAD Model | - | - | - | - | - | download |
※ Training is in progress. As the training progresses, the pre-trained model will be updated.
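For readers unfamiliar with the CER column above, the following is a minimal sketch of how a character error rate is commonly computed: the character-level edit distance between reference and hypothesis, divided by the reference length. It assumes the reported values are percentages and that spaces are ignored; both are assumptions, and the function names are hypothetical, not KoSpeech's own evaluation code.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete a reference character
                            curr[j - 1] + 1,      # insert a hypothesis character
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses, ignore_spaces=True) -> float:
    """Corpus-level CER in percent: total edits / total reference characters.

    ignore_spaces=True is an assumption made for this sketch.
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        if ignore_spaces:
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        total_edits += edit_distance(ref, hyp)
        total_chars += len(ref)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    return 100.0 * total_edits / total_chars
```

For example, a five-character reference with one substituted character gives a CER of 20.0.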
Dataset | Authentication | Output-Unit | Transcript |
---|---|---|---|
KsponSpeech | Required | Character | download |
KsponSpeech | Required | Subword | download |
KsponSpeech | Required | Grapheme | download |
LibriSpeech | Not required | Subword | download |
KsponSpeech requires permission from AI Hub. To request the transcripts, please e-mail a screenshot of your approval to [email protected]. Replies may be slow, so we recommend running the preprocessing code yourself.
End-to-end (E2E) automatic speech recognition (ASR) is an emerging paradigm in the field of neural network-based speech recognition that offers multiple benefits. Traditional “hybrid” ASR systems, which are comprised of an acoustic model, language model, and pronunciation model, require separate training of these components, each of which can be complex.
For example, training of an acoustic model is a multi-stage process of model training and time alignment between the speech acoustic feature sequence and output label sequence. In contrast, E2E ASR is a single integrated approach with a much simpler training pipeline, and its models operate at low audio frame rates. This reduces training and decoding time and allows joint optimization with downstream processing such as natural language understanding.
- End-to-end (E2E) automatic speech recognition
- Various Options
- (VGG / DeepSpeech2) Extractor
- MaskCNN & pack_padded_sequence
- Attention (Multi-Head / Location-Aware / Additive / Scaled-dot)
- Top K Decoding (Beam Search)
- Various Feature (Spectrogram / Mel-Spectrogram / MFCC / Filter-Bank)
- Delete silence
- SpecAugment / NoiseAugment
- Label Smoothing
- Save & load Checkpoint
- Learning Rate Scheduling
- Multi-threaded data loader for speed
- Scheduled Sampling (Teacher forcing scheduling)
- Inference with batching
- Multi-GPU training
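As an illustration of one item in the list above, here is a minimal pure-Python sketch of label smoothing. It uses the variant that keeps `1 - smoothing` on the target class and spreads the rest uniformly over the other classes; whether KoSpeech uses this exact variant is an assumption, and the function names are hypothetical.

```python
import math


def smoothed_targets(target: int, num_classes: int, smoothing: float = 0.1):
    """Target distribution with 1 - smoothing on the true class and the
    remaining mass shared equally by the other classes."""
    off_value = smoothing / (num_classes - 1)
    dist = [off_value] * num_classes
    dist[target] = 1.0 - smoothing
    return dist


def label_smoothing_loss(log_probs, target: int, smoothing: float = 0.1) -> float:
    """Cross entropy between the smoothed targets and predicted log-probabilities."""
    dist = smoothed_targets(target, len(log_probs), smoothing)
    return -sum(q * lp for q, lp in zip(dist, log_probs))


# Usage: a three-class prediction whose true class is index 0.
log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
plain = label_smoothing_loss(log_probs, target=0, smoothing=0.0)  # ordinary cross entropy
smooth = label_smoothing_loss(log_probs, target=0, smoothing=0.1)
```

With `smoothing=0.0` the loss reduces to ordinary cross entropy, which makes the effect of the smoothing term easy to compare.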
We have referred to several papers to develop the best model possible, and we have tried to make the code as efficient and easy to use as possible. If you run into any inconvenience, however minor, please let us know.
We will respond as soon as possible.
So far, several models are implemented: Deep Speech 2, Listen, Attend and Spell (LAS), Speech Transformer, and Jasper. For details of each architecture, see the figure attached to each section.
- Deep Speech 2
Deep Speech 2 showed faster and more accurate performance on ASR tasks with Connectionist Temporal Classification (CTC) loss. This model has been highlighted for significantly increasing performance compared to the previous end-to-end models.
- Listen, Attend and Spell (LAS)
We follow the architecture proposed in "Listen, Attend and Spell", with some modifications to improve performance. We provide four attention mechanisms: scaled dot-product attention, additive attention, location-aware attention, and multi-head attention. The choice of attention mechanism strongly affects model performance.
- Speech Transformer
Transformer is a powerful architecture in the Natural Language Processing (NLP) field. It has also shown good performance on ASR tasks. In addition, because research on this model continues in the NLP field, it has high potential for further development.
- Joint CTC-Attention
Joint CTC-Attention is an architecture proposed to take advantage of both CTC-based and attention-based models. It makes the model more robust by adding a CTC objective to the encoder. Joint CTC-Attention can be trained in combination with LAS and Speech Transformer.
- Jasper
Jasper (Just Another SPEech Recognizer) is an end-to-end convolutional neural acoustic model. Jasper showed strong performance using only CNN → BatchNorm → ReLU → Dropout blocks and residual connections.
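Several of the models above (Deep Speech 2, Jasper, and the CTC branch of Joint CTC-Attention) are trained with CTC. The sketch below shows the standard greedy CTC decoding rule: take the best label per frame, collapse consecutive repeats, then drop blanks. The vocabulary, the blank index of 0, and the function name are assumptions for illustration, not KoSpeech's actual decoder.

```python
from itertools import groupby


def ctc_greedy_decode(frame_scores, vocab, blank_id=0):
    """Greedy CTC decoding.

    frame_scores: one list of per-label scores for each time frame.
    vocab: list mapping label index to output symbol (hypothetical here).
    blank_id: index of the CTC blank label (assumed to be 0 in this sketch).
    """
    # 1) Best label for every frame.
    best_path = [max(range(len(scores)), key=scores.__getitem__)
                 for scores in frame_scores]
    # 2) Collapse runs of the same label into one.
    collapsed = [label for label, _ in groupby(best_path)]
    # 3) Remove blanks and map indices to symbols.
    return "".join(vocab[label] for label in collapsed if label != blank_id)
```

Because repeats are collapsed before blanks are removed, a blank between two identical labels is what allows a doubled character to survive: the frame path `a a <blank> a b b <blank>` decodes to `aab`.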
This project recommends Python 3.7 or higher.
We recommend creating a new virtual environment for this project (using virtualenv or conda).
- Numpy:
pip install numpy
(Refer here if you have problems installing Numpy)
- Pytorch: Refer to the PyTorch website to install the version that matches your environment.
- Pandas:
pip install pandas
(Refer here if you have problems installing Pandas)
- Matplotlib:
pip install matplotlib
(Refer here if you have problems installing Matplotlib)
- librosa:
conda install -c conda-forge librosa
(Refer here if you have problems installing librosa)
- torchaudio:
pip install torchaudio==0.6.0
(Refer here if you have problems installing torchaudio)
- tqdm:
pip install tqdm
(Refer here if you have problems installing tqdm)
- sentencepiece:
pip install sentencepiece
(Refer here if you have problems installing sentencepiece)
- hydra:
pip install hydra-core --upgrade
(Refer here if you have problems installing hydra)
Currently we only support installation from source code using setuptools. Check out the source code and run the following command:
pip install -e .
We use Hydra to control all the training configurations. If you are not familiar with Hydra we recommend visiting the Hydra website. Generally, Hydra is an open-source framework that simplifies the development of research applications by providing the ability to create a hierarchical configuration dynamically.
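To make the idea of a hierarchical configuration with command-line overrides concrete, here is a simplified standard-library sketch of how a dotted override such as `train.dataset_path=...` maps onto a nested configuration. It does not use Hydra and is not how Hydra is implemented: real Hydra also handles config-group selection (e.g. `model=ds2`) and value typing, which this sketch omits. The default keys other than `dataset_path` are hypothetical.

```python
import copy


def apply_overrides(config: dict, overrides: list) -> dict:
    """Return a copy of config with dotted key=value overrides applied.

    Values are kept as strings; Hydra itself would parse their types.
    """
    merged = copy.deepcopy(config)
    for item in overrides:
        dotted_key, value = item.split("=", 1)
        *parents, leaf = dotted_key.split(".")
        node = merged
        for name in parents:
            node = node.setdefault(name, {})  # descend, creating levels as needed
        node[leaf] = value
    return merged


# Hypothetical defaults; only `train.dataset_path` appears in this README.
defaults = {"train": {"dataset_path": "???", "num_epochs": "20"}}
config = apply_overrides(defaults, ["train.dataset_path=/data/KsponSpeech"])
```

After the call, `config["train"]["dataset_path"]` holds the overridden path while `defaults` is left unchanged.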
Download from here or refer to the following to preprocess.
- KsponSpeech : Check this page
- LibriSpeech : Check this page
You can choose from several models and training options. There are many other training options, so review them carefully and run one of the following commands:
- Deep Speech 2 Training
python ./bin/main.py model=ds2 train=ds2_train train.dataset_path=$DATASET_PATH
- Listen, Attend and Spell Training
python ./bin/main.py model=las train=las_train train.dataset_path=$DATASET_PATH
- Joint CTC-Attention Listen, Attend and Spell Training
python ./bin/main.py model=joint-ctc-attention-las train=las_train train.dataset_path=$DATASET_PATH
- Speech Transformer Training
python ./bin/main.py model=transformer train=transformer_train train.dataset_path=$DATASET_PATH
- Joint CTC-Attention Speech Transformer Training
python ./bin/main.py model=joint-ctc-attention-transformer train=transformer_train train.dataset_path=$DATASET_PATH
- Jasper Training
python ./bin/main.py model=jasper train=jasper_train train.dataset_path=$DATASET_PATH
To evaluate a trained model, run:
python ./bin/eval.py eval.dataset_path=$DATASET_PATH eval.transcripts_path=$TRANSCRIPTS_PATH eval.model_path=$MODEL_PATH
Now you have a model which you can use to predict on new data. We do this by running greedy search or beam search.
- Command
$ python3 ./bin/inference.py --model_path $MODEL_PATH --audio_path $AUDIO_PATH --device $DEVICE
- Output
음성인식 결과 문장이 나옵니다 (i.e. the recognized sentence is printed)
This gives you a quick look at a pre-trained model's inference on a single audio file.
Checkpoints are organized by experiments and timestamps as shown in the following file structure.
outputs
+-- YYYY_mm_dd
| +-- HH_MM_SS
| +-- trainer_states.pt
| +-- model.pt
You can resume and load from checkpoints.
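Given the directory layout above, one way to locate the most recent run is to parse the date and time directory names. The sketch below assumes the names follow the `%Y_%m_%d` and `%H_%M_%S` formats suggested by the placeholders and that a usable run contains `model.pt`; the function name is hypothetical, and actually loading the file is left out.

```python
from datetime import datetime
from pathlib import Path


def latest_checkpoint_dir(outputs_root):
    """Return the newest outputs/YYYY_mm_dd/HH_MM_SS directory that
    contains model.pt, or None if there is no such directory."""
    candidates = []
    for model_file in Path(outputs_root).glob("*/*/model.pt"):
        run_dir = model_file.parent
        try:
            stamp = datetime.strptime(
                f"{run_dir.parent.name} {run_dir.name}", "%Y_%m_%d %H_%M_%S"
            )
        except ValueError:
            continue  # skip directories that do not follow the timestamp layout
        candidates.append((stamp, run_dir))
    if not candidates:
        return None
    return max(candidates, key=lambda pair: pair[0])[1]
```

Calling `latest_checkpoint_dir("outputs")` returns a `Path` whose `model.pt` and `trainer_states.pt` can then be passed to whatever loading code you use.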
If you have any questions, bug reports, or feature requests, please open an issue on GitHub.
For live discussions, please use our Gitter or contact [email protected].
We appreciate any kind of feedback or contribution. Feel free to proceed with small issues like bug fixes, documentation improvement. For major contributions and new features, please discuss with the collaborators in corresponding issues.
We follow PEP-8 for code style. Docstring style is especially important because the documentation is generated from it.
Ilya Sutskever et al. Sequence to Sequence Learning with Neural Networks arXiv: 1409.3215
Dzmitry Bahdanau et al. Neural Machine Translation by Jointly Learning to Align and Translate arXiv: 1409.0473
Jan Chorowski et al. Attention Based Models for Speech Recognition arXiv: 1506.07503
William Chan et al. Listen, Attend and Spell arXiv: 1508.01211
Dario Amodei et al. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin arXiv: 1512.02595
Takaaki Hori et al. Advances in Joint CTC-Attention based E2E Automatic Speech Recognition with a Deep CNN Encoder and RNN-LM arXiv: 1706.02737
Ashish Vaswani et al. Attention Is All You Need arXiv: 1706.03762
Chung-Cheng Chiu et al. State-of-the-art Speech Recognition with Sequence-to-Sequence Models arXiv: 1712.01769
Anjuli Kannan et al. An Analysis Of Incorporating An External LM Into A Sequence-to-Sequence Model arXiv: 1712.01996
Daniel S. Park et al. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition arXiv: 1904.08779
Rafael Müller et al. When Does Label Smoothing Help? arXiv: 1906.02629
Daniel S. Park et al. SpecAugment on large scale datasets arXiv: 1912.05533
Jung-Woo Ha et al. ClovaCall: Korean Goal-Oriented Dialog Speech Corpus for Automatic Speech Recognition of Contact Centers arXiv: 2004.09367
Jason Li et al. Jasper: An End-to-End Convolutional Neural Acoustic Model arXiv: 1902.03288
AppleHolic/2020 AI Challenge - SpeechRecognition
This project is licensed under the Apache-2.0 LICENSE - see the LICENSE.md file for details
A technical report on KoSpeech is available. If you use the system for academic work, please cite:
@ARTICLE{2020kospeech,
author = {Soohwan Kim and Seyoung Bae and Cheolhwang Won},
title = "{KoSpeech: Open-Source Toolkit for End-to-End Korean Speech Recognition}",
journal = {ArXiv e-prints},
eprint = {2009.03092}
}