Add multi-speaker and multi-language support

CanKorkut · Feb 26, 2021 · 7621af2
1 parent e58247c
Showing 164 changed files with 315,354 additions and 2,166 deletions.
diff --git a/.gitignore b/.gitignore
@@ -111,11 +111,8 @@ __pycache__
 montreal-forced-aligner/
 
 # data, checkpoint, and models
-preprocessed/
-ckpt/
-results/
-synth/
-log/
-eval/
-waveglow/pretrained_model/*
-
+raw_data/
+output/
+*.npy
+TextGrid/
+speakers.json
diff --git a/README.md b/README.md
@@ -1,154 +1,142 @@
 # FastSpeech 2 - PyTorch Implementation
 
-This is a PyTorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558). 
-This project is based on [xcmyz's implementation](https://github.com/xcmyz/FastSpeech) of FastSpeech. Feel free to use/modify the code. 
-Any suggestion for improvement is appreciated.
+This is a PyTorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558v1). 
+This project is based on [xcmyz's implementation](https://github.com/xcmyz/FastSpeech) of FastSpeech. Feel free to use/modify the code.
 
-This repository contains only FastSpeech 2 but FastSpeech 2s so far.
-I will update it once I reproduce FastSpeech 2s, the end-to-end version of FastSpeech2, successfully.
+There are several versions of FastSpeech 2.
+This implementation is more similar to [version 1](https://arxiv.org/abs/2006.04558v1), which uses F0 values as the pitch features.
+On the other hand, pitch spectrograms extracted by continuous wavelet transform are used as the pitch features in the [later versions](https://arxiv.org/abs/2006.04558).
 
-![](./model.png)
+![](./img/model.png)
+
+# Updates
+- 2021/2/26: Support English and Mandarin TTS
+- 2021/2/26: Support multi-speaker TTS (AISHELL-3 and LibriTTS)
+- 2021/2/26: Support MelGAN and HiFi-GAN vocoder
 
 # Audio Samples
-Audio samples generated by this implementation can be found [here](https://ming024.github.io/FastSpeech2/).  
-- The model used to generate these samples is trained for 300k steps on the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset.
-- Audio samples are converted from mel-spectrogram to raw waveform via [NVIDIA's pretrained WaveGlow](https://github.com/NVIDIA/waveglow) and [seungwonpark's pretrained MelGAN](https://github.com/seungwonpark/melgan).
+Audio samples generated by this implementation can be found [here](https://ming024.github.io/FastSpeech2/). 
 
 # Quickstart
 
 ## Dependencies
-You can install the python dependencies with
+You can install the Python dependencies with
 ```
 pip3 install -r requirements.txt
 ```
 
-## Synthesis
+## Inference
+
+You have to download the [pretrained models](https://drive.google.com/drive/folders/1DOhZGlTLMbbAAFZmZGDdc77kz1PloS7F?usp=sharing) and put them in ``output/ckpt/LJSpeech/`` or ``output/ckpt/AISHELL3``.
 
-You have to download our [FastSpeech2 pretrained model](https://drive.google.com/file/d/1jXNDPMt1ybTN97_MztoTFyrPIthoQuSO/view?usp=sharing) and put it in the ``ckpt/LJSpeech/`` directory.
+For English single-speaker TTS, run
+```
+python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
+```
 
-Your can run
+For Mandarin multi-speaker TTS, try
 ```
-python3 synthesis.py --step 300000
+python3 synthesize.py --text "大家好" --speaker_id SPEAKER_ID --restore_step 900000 --mode single -p config/AISHELL3/preprocess.yaml -m config/AISHELL3/model.yaml -t config/AISHELL3/train.yaml
 ```
-to generate any desired utterances. 
-The generated utterances will be put in the ``results/`` directory.
 
-Here is a generated spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition"  
-![](./synth/LJSpeech/step_300000.png)
+The generated utterances will be put in ``output/result/``.
+
+Here is an example of a synthesized mel-spectrogram for the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition", generated with the English single-speaker TTS model.  
+![](./img/synthesized_melspectrogram.png)
+
+## Batch Inference
+Batch inference is also supported. Try
+
+```
+python3 synthesize.py --source preprocessed_data/LJSpeech/val.txt --restore_step 900000 --mode batch -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
+```
+to synthesize all utterances in ``preprocessed_data/LJSpeech/val.txt``.
 
-For CPU inference please refer to this [colab tutorial](https://colab.research.google.com/drive/1S60pytpB1OcEFrd-SkYyjtBsBHYepRSG?usp=sharing). One has to clone the original repo of [MelGAN](https://github.com/seungwonpark/melgan) instead of using ``torch.hub`` due to the code architecture of MelGAN.
 ## Controllability
-The duration/pitch/energy of the synthesized utterances can be modified by specifying the desired duration/pitch/energy ratio to the predicted values.
+The pitch/volume/speaking rate of the synthesized utterances can be controlled by specifying the desired pitch/energy/duration ratios.
 For example, one can increase the speaking rate by 20 % and decrease the volume by 20 % by
 
 ```
-python3 synthesis.py --step 300000 --duration_control 0.8 --energy_control 0.8
+python3 synthesize.py --text "YOUR_DESIRED_TEXT" --restore_step 900000 --mode single -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml --duration_control 0.8 --energy_control 0.8
 ```
 
 # Training
 
 ## Datasets
-We use the [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) English dataset, which consists of 13100 short audio clips of a single female speaker reading passages from 7 non-fiction books, approximately 24 hours in total, to train the entire model end-to-end.
 
-After downloading the dataset and extracting the compressed files, you have to modify the ``hp.data_path`` and some other parameters in ``hparams.py``. 
-Default parameters are for the LJSpeech dataset.
+The supported datasets are
 
-## Preprocessing
+- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): a single-speaker English dataset consisting of 13100 short audio clips of a female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.
+- [AISHELL-3](http://www.aishelltech.com/aishell_3): a Mandarin TTS dataset with 218 male and female speakers, roughly 85 hours in total.
+- [LibriTTS](https://research.google/tools/datasets/libri-tts/): a multi-speaker English dataset containing 585 hours of speech by 2456 speakers.
 
-As described in the paper, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/)(MFA) is used to obtain the alignments between the utterances and the phoneme sequences. 
-Alignments for the LJSpeech dataset is provided [here](https://drive.google.com/file/d/1ukb8o-SnqhXCxq7drI3zye3tZdrGvQDA/view?usp=sharing). 
-You have to put the ``TextGrid.zip`` file in your ``hp.preprocessed_path/`` and extract the files before you continue.
+We take LJSpeech as an example hereafter.
 
-After that, run the preprocessing script by
+## Preprocessing
+
+First, run 
 ```
-python3 preprocess.py
+python3 prepare_align.py config/LJSpeech/preprocess.yaml
 ```
+to prepare the files required by MFA.
 
-Alternately, you can align the corpus by yourself. 
-First download the MFA package and the pretrained lexicon file. (We use LibriSpeech lexicon instead of the G2p\_en python package proposed in the paper)
+As described in the paper, [Montreal Forced Aligner](https://montreal-forced-aligner.readthedocs.io/en/latest/) (MFA) is used to obtain the alignments between the utterances and the phoneme sequences.
+Alignments for the LJSpeech and AISHELL-3 datasets are provided [here](https://drive.google.com/drive/folders/1DBRkALpPd6FL9gjHMmMEdHODmkgNIIK4?usp=sharing).
+You have to unzip the files into ``preprocessed_data/LJSpeech/TextGrid/``.
 
+After that, run the preprocessing script by
 ```
-wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.1.0-beta.2/montreal-forced-aligner_linux.tar.gz
-tar -zxvf montreal-forced-aligner_linux.tar.gz
-
-wget http://www.openslr.org/resources/11/librispeech-lexicon.txt -O montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt
+python3 preprocess.py config/LJSpeech/preprocess.yaml
 ```
 
-Then prepare some necessary files required by the MFA.
-
+Alternatively, you can align the corpus yourself.
+Download the official MFA package and run
 ```
-python3 prepare_align.py
+./montreal-forced-aligner/bin/mfa_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt english preprocessed_data/LJSpeech
 ```
-
-Run the MFA and put the .TextGrid files in your ``hp.preprocessed_path``.
+or
 ```
-# Replace $DATA_PATH and $PREPROCESSED_PATH with ./LJSpeech-1.1/wavs and ./preprocessed/LJSpeech/TextGrid, for example
-./montreal-forced-aligner/bin/mfa_align $YOUR_DATA_PATH montreal-forced-aligner/pretrained_models/librispeech-lexicon.txt english $YOUR_PREPROCESSED_PATH -j 8
+./montreal-forced-aligner/bin/mfa_train_and_align raw_data/LJSpeech/ lexicon/librispeech-lexicon.txt preprocessed_data/LJSpeech
 ```
 
-And remember to run the preprocessing script.
+to align the corpus and then run the preprocessing script.
 ```
-python3 preprocess.py
+python3 preprocess.py config/LJSpeech/preprocess.yaml
 ```
 
-After preprocessing, you will get a ``stat.txt`` file in your ``hp.preprocessed_path/``, recording the maximum and minimum values of the fundamental frequency and energy values throughout the entire corpus.
-You have to modify the f0 and energy parameters in the ``hparams.py`` according to the content of ``stat.txt``.
-
 ## Training
 
 Train your model with
 ```
-python3 train.py
+python3 train.py -p config/LJSpeech/preprocess.yaml -m config/LJSpeech/model.yaml -t config/LJSpeech/train.yaml
 ```
 
-The model takes less than 10k steps (less than 1 hour on my GTX1080 GPU) of training to generate audio samples with acceptable quality, which is much more efficient than the autoregressive models such as Tacotron2.
-
-There might be some room for improvement for this repository.
-For example, I just simply add up the duration loss, f0 loss, energy loss and mel loss without any weighting.
+The model takes less than 10k steps (less than 1 hour on my GTX1080Ti GPU) of training to generate audio samples with acceptable quality, which is much more efficient than the autoregressive models such as Tacotron2.
 
 # TensorBoard
 
-The TensorBoard loggers are stored in the ``log/hp.dataset/`` directory. Use
+Use
 ```
-tensorboard --logdir log/hp.dataset/
+tensorboard --logdir output/log/LJSpeech
 ```
-to serve the TensorBoard on your localhost.
-Here is an example training the model on LJSpeech for 400k steps.
 
-![](./tensorboard.png)
+to serve TensorBoard on your localhost.
+The loss curves, synthesized mel-spectrograms, and audio samples are shown.
 
-# Notes
+![](./img/tensorboard_loss.png)
+![](./img/tensorboard_spec.png)
+![](./img/tensorboard_audio.png)
 
-## Implementation Issues
+# Implementation Issues
 
-There are several differences between my implementation and the paper.
-- The paper includes punctuations in the transcripts. 
-  However, MFA discards punctuations by default and I haven't found a way to solve it. 
-  During inference, I replace all punctuations with the ``sp`` (short-pause) phone labels.
-- Following [xcmyz's implementation](https://github.com/xcmyz/FastSpeech), I use an additional Tacotron-2-styled postnet after the FastSpeech decoder, which is not used in the original paper.
-- The [transformer paper](https://arxiv.org/abs/1706.03762) suggests to use dropout after the input and positional embedding.
-  I find that this trick does not make any observable difference so I do not use dropout for positional embedding.
-- The paper suggest to use L1 loss for mel loss and L2 loss for variance predictor losses.
-  But I find it easier to train the model with L2 mel loss and L1 variance adaptor losses.
+- Following [xcmyz's implementation](https://github.com/xcmyz/FastSpeech), I use an additional Tacotron-2-styled Postnet after the decoder, which is not used in the original paper.
 - Gradient clipping is used in the training.
+- In my experience, using phoneme-level pitch and energy prediction instead of frame-level prediction results in much better prosody, and normalizing the pitch and energy features also helps. Please refer to ``config/README.md`` for more details.
 
-Some tips for training this model.
-- You can set the ``hp.acc_steps`` parameter if you wish to train with a large batchsize on a GPU with limited memory.
-- In my experience, carefully masking out the padded parts in loss computation and in model forward parts largely improves the performance.
-
-Please inform me if you find any mistake in this repo, or any useful tip to train the FastSpeech2 model.
-
-## TODO
-- Try different weights for the loss terms.
-- Evaluate the quality of the synthesized audio over the validation set.
-- Multi-speaker, voice cloning, or transfer learning experiment.
-- Implement FastSpeech 2s.
+Please inform me if you find any mistakes in this repo, or any useful tips to train the FastSpeech 2 model.
 
 # References
 - [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren, *et al*.
-- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263), Y. Ren, *et al*.
 - [xcmyz's FastSpeech implementation](https://github.com/xcmyz/FastSpeech)
-- [rishikksh20's FastSpeech2 implementation](https://github.com/rishikksh20/FastSpeech2)
-- [TensorSpeech's FastSpeech2 implementation](https://github.com/TensorSpeech/TensorflowTTS)
-- [NVIDIA's WaveGlow implementation](https://github.com/NVIDIA/waveglow)
-- [seungwonpark's MelGAN implementation](https://github.com/seungwonpark/melgan)
+- [TensorSpeech's FastSpeech 2 implementation](https://github.com/TensorSpeech/TensorflowTTS)
+- [rishikksh20's FastSpeech 2 implementation](https://github.com/rishikksh20/FastSpeech2)
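Note on the Controllability section of the README above: the `--duration_control` and `--energy_control` flags are described as ratios applied to the predicted values. The following is only a minimal sketch of that idea, assuming the ratios are applied multiplicatively and that durations are rounded to whole frames. The helper names `apply_control` and `scale_durations` and the sample values are hypothetical and do not come from this commit.

```python
def apply_control(predicted, ratio):
    """Scale each predicted value by a control ratio (assumed multiplicative)."""
    return [value * ratio for value in predicted]


def scale_durations(frame_counts, ratio):
    """Scale per-phoneme frame counts and round to whole, non-negative frames."""
    return [max(0, round(count * ratio)) for count in frame_counts]


# Hypothetical predictions for a three-phoneme utterance
durations = [5, 10, 20]      # frames per phoneme
energies = [1.0, 0.5, 2.0]   # energy value per phoneme

shorter = scale_durations(durations, 0.8)
quieter = apply_control(energies, 0.8)

print("frames before:", sum(durations), "after:", sum(shorter))
print("energies after:", quieter)
```

Under this assumption, a duration ratio of 0.8 shortens the utterance to 80 % of its predicted length, which corresponds to a speaking rate 1 / 0.8 = 1.25 times the original.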
diff --git a/audio/audio_processing.py b/audio/audio_processing.py
@@ -1,12 +1,18 @@
 import torch
 import numpy as np
-from scipy.signal import get_window
 import librosa.util as librosa_util
-import hparams as hp
+from scipy.signal import get_window
 
 
-def window_sumsquare(window, n_frames, hop_length=hp.hop_length, win_length=hp.win_length,
-                     n_fft=hp.filter_length, dtype=np.float32, norm=None):
+def window_sumsquare(
+    window,
+    n_frames,
+    hop_length,
+    win_length,
+    n_fft,
+    dtype=np.float32,
+    norm=None,
+):
     """
     # from librosa 0.6
     Compute the sum-square envelope of a window function at a given hop length.
@@ -47,14 +53,13 @@ def window_sumsquare(window, n_frames, hop_length=hp.hop_length, win_length=hp.w
 
     # Compute the squared window at the desired length
     win_sq = get_window(window, win_length, fftbins=True)
-    win_sq = librosa_util.normalize(win_sq, norm=norm)**2
+    win_sq = librosa_util.normalize(win_sq, norm=norm) ** 2
     win_sq = librosa_util.pad_center(win_sq, n_fft)
 
     # Fill the envelope
     for i in range(n_frames):
         sample = i * hop_length
-        x[sample:min(n, sample + n_fft)
-          ] += win_sq[:max(0, min(n_fft, n - sample))]
+        x[sample : min(n, sample + n_fft)] += win_sq[: max(0, min(n_fft, n - sample))]
     return x
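
Since this change removes the `hp.*` defaults, callers of `window_sumsquare` now have to pass `hop_length`, `win_length`, and `n_fft` explicitly. The sketch below is a NumPy/SciPy-only approximation of the same overlap-add computation for the `norm=None` case, written without librosa so it runs on its own. The name `window_sumsquare_sketch`, the envelope length formula (taken from the librosa function this code is derived from), and the parameter values in the usage example are assumptions, not part of the commit.

```python
import numpy as np
from scipy.signal import get_window


def window_sumsquare_sketch(window, n_frames, hop_length, win_length, n_fft,
                            dtype=np.float32):
    """Sum-square envelope of a window at a given hop length (norm=None case)."""
    # Total number of samples spanned by n_frames overlapping frames
    n = n_fft + hop_length * (n_frames - 1)
    x = np.zeros(n, dtype=dtype)

    # Squared window, zero-padded symmetrically to n_fft samples
    win_sq = get_window(window, win_length, fftbins=True) ** 2
    pad = n_fft - win_length
    win_sq = np.pad(win_sq, (pad // 2, pad - pad // 2))

    # Overlap-add the squared window at every hop position
    for i in range(n_frames):
        sample = i * hop_length
        x[sample : min(n, sample + n_fft)] += win_sq[: max(0, min(n_fft, n - sample))]
    return x


# All framing parameters are passed explicitly (illustrative values only)
envelope = window_sumsquare_sketch(
    "hann", n_frames=10, hop_length=256, win_length=1024, n_fft=1024
)
print(envelope.shape)
```

With a Hann window and a hop of a quarter of the window length, the envelope is constant in the region where four frames overlap, which is why an inverse STFT can divide by it to undo the windowing.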