Transformer-TTS

A PyTorch implementation of Neural Speech Synthesis with Transformer Network. This model trains about 3 to 4 times faster than well-known seq2seq models such as Tacotron, and the quality of the synthesized speech is almost the same. Experiments confirmed that training took about 0.5 seconds per step. I did not use a WaveNet vocoder; instead, I trained the post network using Tacotron's CBHG module and converted the spectrogram into a raw waveform with the Griffin-Lim algorithm.
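
The Griffin-Lim inversion step can be sketched with librosa. This is a minimal illustration, not code from this repository; the function name and the STFT and `n_iter` values are assumptions and must match the preprocessing settings in hyperparams.py.

    import librosa

    def spectrogram_to_wav(mag, n_iter=60, hop_length=256, win_length=1024):
        """Invert a linear magnitude spectrogram to a waveform via Griffin-Lim.

        `mag` has shape (1 + n_fft // 2, frames); librosa infers n_fft from it.
        All parameter values here are illustrative assumptions.
        """
        return librosa.griffinlim(mag, n_iter=n_iter,
                                  hop_length=hop_length, win_length=win_length)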

Requirements

  • Install Python 3
  • Install PyTorch 0.4.0
  • Install the remaining requirements:
    pip install -r requirements.txt
    

Data

I used the LJSpeech dataset, which consists of pairs of text scripts and wav files. The complete dataset (13,100 pairs) can be downloaded here: https://keithito.com/LJ-Speech-Dataset/. I referred to https://github.com/keithito/tacotron and https://github.com/Kyubyong/dc_tts for the preprocessing code.
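
For reference, LJSpeech ships a metadata.csv whose pipe-separated lines pair a file id with the raw and normalized transcripts. A minimal loader might look like the sketch below; the function name and the directory layout (a wavs/ subfolder) follow the standard LJSpeech release, not this repository's code.

    import csv
    import os

    def load_ljspeech_pairs(data_path):
        """Yield (wav_path, normalized_text) pairs from LJSpeech's metadata.csv.

        Each line looks like: LJ001-0001|raw text|normalized text
        """
        with open(os.path.join(data_path, 'metadata.csv'), encoding='utf-8') as f:
            for row in csv.reader(f, delimiter='|', quoting=csv.QUOTE_NONE):
                wav_path = os.path.join(data_path, 'wavs', row[0] + '.wav')
                yield wav_path, row[2] if len(row) > 2 else row[1]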

Attention plots

A diagonal alignment appeared after about 15k steps. The attention plots below are at 160k steps.

(Plots: encoder self-attention, decoder self-attention, and encoder-decoder attention.)

Learning curves & Alphas

I used the same Noam-style warmup and decay schedule as [Tacotron](https://github.com/Kyubyong/tacotron).
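
For reference, the Noam schedule warms the learning rate up linearly and then decays it with the inverse square root of the step count. A minimal sketch; the `d_model` and `warmup_steps` values are assumptions, not this repository's settings:

    def noam_lr(step, d_model=256, warmup_steps=4000):
        """Noam schedule: linear warmup, then ~1/sqrt(step) decay.

        d_model and warmup_steps are illustrative values (assumptions).
        """
        step = max(step, 1)  # avoid division by zero at step 0
        return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)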

The alpha values for the scaled positional encoding differ from the paper. In the paper, the encoder's alpha increases to about 4, whereas in this experiment it increased slightly at the beginning and then decreased continuously. The decoder's alpha decreased steadily from the start.
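
For context, scaled positional encoding multiplies the usual sinusoidal encoding by a single trainable scalar alpha before adding it to the input. A minimal PyTorch sketch under assumed dimensions (d_model even, batch-first input); this illustrates the idea and is not this repository's module:

    import math
    import torch
    import torch.nn as nn

    class ScaledPositionalEncoding(nn.Module):
        """Sinusoidal positional encoding scaled by a trainable alpha."""

        def __init__(self, d_model, max_len=1024):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(1))  # the trainable scale
            pe = torch.zeros(max_len, d_model)
            position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
            div_term = torch.exp(torch.arange(0, d_model, 2).float()
                                 * (-math.log(10000.0) / d_model))
            pe[:, 0::2] = torch.sin(position * div_term)
            pe[:, 1::2] = torch.cos(position * div_term)
            self.register_buffer('pe', pe.unsqueeze(0))  # (1, max_len, d_model)

        def forward(self, x):
            # x: (batch, time, d_model); add the alpha-scaled encoding
            return x + self.alpha * self.pe[:, :x.size(1)]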

Experimental notes

Generated Samples

File description

  • hyperparams.py includes all the hyperparameters that are needed.
  • prepare_data.py preprocesses the wav files into mel and linear spectrograms and saves them to disk for faster training. The text preprocessing code is in the text/ directory.
  • preprocess.py contains the preprocessing code used when loading data.
  • module.py contains all the building blocks: attention, prenet, postnet, and so on.
  • network.py contains the networks: encoder, decoder, and post-processing network.
  • train_transformer.py trains the autoregressive attention network. (text --> mel)
  • train_postnet.py trains the post network. (mel --> linear)
  • synthesis.py generates a TTS sample.

Training the network

  • STEP 1. Download and extract the LJSpeech data to any directory you want.
  • STEP 2. Adjust the hyperparameters in hyperparams.py, especially 'data_path', the directory where you extracted the files, and the others if necessary.
  • STEP 3. Run prepare_data.py.
  • STEP 4. Run train_transformer.py.
  • STEP 5. Run train_postnet.py.
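
Put together, the whole pipeline comes down to three commands (script names as listed above):

    python prepare_data.py       # wav --> mel / linear spectrograms on disk
    python train_transformer.py  # text --> mel
    python train_postnet.py      # mel --> linear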

Generate TTS wav file

  • STEP 1. Run synthesis.py. Make sure the restore step (the checkpoint to load) is set correctly.

Samples

  • You can check the generated samples in the 'samples/' directory.

Reference

  • Neural Speech Synthesis with Transformer Network: https://arxiv.org/abs/1809.08895

Comments

  • Any comments on the code are always welcome.
