minor revision
ming024 committed Jun 25, 2020
1 parent d146fc0 commit 796d8bc
Showing 1 changed file, README.md, with 6 additions and 10 deletions.

This is a PyTorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558). This project is based on [xcmyz's implementation](https://github.com/xcmyz/FastSpeech) of FastSpeech. Feel free to use or modify the code. Any suggestions for improvement are appreciated.

![](./model.png)

# Audio Samples
Audio samples generated by this implementation can be found [here](https://ming024.github.io/FastSpeech2/).
Run ``python3 synthesis.py --step 300000`` to generate any utterances you wish. The generated utterances will be saved in the ``results/`` directory.

Here is a generated spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition":
![](./synth/LJSpeech/step_300000_0.png)

# Training

## Datasets
This project supports two datasets:
- [LJSpeech](https://keithito.com/LJ-Speech-Dataset/): consisting of 13100 short audio clips of a single female speaker reading passages from 7 non-fiction books, approximately 24 hours in total.

```
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```
- [Blizzard2013](http://www.cstr.ed.ac.uk/projects/blizzard/2013/lessac_blizzard2013/): a male speaker reading 10 audiobooks. The prosody variance is greater than that of the LJSpeech dataset. Only the 9741 segmented utterances are used in this project.

After downloading the dataset and extracting the compressed files, modify ``hp.data_path`` and some other parameters in ``hparams.py``. The default parameters are for the LJSpeech dataset.
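As a rough illustration, the relevant part of ``hparams.py`` might look like the following. Only ``hp.data_path`` is named in this README; the other parameter names and values here are illustrative assumptions, not the repository's actual contents:

```python
# hparams.py -- hypothetical sketch; only `data_path` is named in this
# README, the other names/values are illustrative assumptions.
dataset = "LJSpeech"          # or "Blizzard2013"
data_path = "./LJSpeech-1.1"  # point this at the extracted dataset
```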

There might be some room for improvement for this repository. For example, I jus
## Implementation Issues

There are several differences between my implementation and the paper.
- The paper includes punctuation in the transcripts. However, MFA discards punctuation by default, and I haven't found a way around it.
- Following [xcmyz's implementation](https://github.com/xcmyz/FastSpeech), I use an additional Tacotron-2-styled postnet after the FastSpeech decoder, which is not used in the original paper.
- The [transformer paper](https://arxiv.org/abs/1706.03762) suggests using dropout after the input and positional embeddings. I haven't tried it yet.
- The paper suggests using L1 loss for the mel loss and L2 loss for the variance predictor losses, but I find it easier to train the model with L2 mel loss and L1 variance adaptor losses, for unknown reasons.
- I use gradient clipping and weight decay in the training.
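The loss combination described above (L2 for the mel spectrogram, L1 for the variance predictions) can be sketched as follows. This is a minimal NumPy illustration rather than the repository's actual PyTorch code, and the function name and signature are made up for the example:

```python
import numpy as np

def combined_loss(mel_pred, mel_target, var_pred, var_target):
    """Hypothetical sketch: L2 (MSE) mel loss plus L1 (MAE) variance loss."""
    mel_loss = np.mean((mel_pred - mel_target) ** 2)   # L2 on the mel spectrogram
    var_loss = np.mean(np.abs(var_pred - var_target))  # L1 on the variance predictions
    return mel_loss + var_loss
```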
- Try different weights for the loss terms.
- My loss computation does not mask out the paddings.
- Evaluate the quality of the synthesized audio over the validation set.
- Find the difference between the F0 & energy predicted by the variance predictors and the F0 & energy of the synthesized utterance, as measured by the PyWorld vocoder.
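On the padding point above: one standard fix is to mask padded frames out before averaging, so they do not dilute the loss. A NumPy sketch under assumed shapes (not code from this repository):

```python
import numpy as np

def masked_l1(pred, target, lengths):
    # pred/target: (batch, max_len); lengths: true length of each sequence.
    # Padded positions are zeroed out so they do not dilute the average.
    batch, max_len = pred.shape
    mask = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    return np.sum(np.abs(pred - target) * mask) / mask.sum()
```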

# References
- [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren, *et al*.
