From d146fc063b91ac6140f9cd5c9af984aded2df0be Mon Sep 17 00:00:00 2001
From: ming024
Date: Thu, 25 Jun 2020 22:00:26 +0800
Subject: [PATCH] first commit

---
 README.md | 38 +++++++++++++++++++++++++-------------
 1 file changed, 25 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index 4a8dea91aa..eeb0c3a477 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,8 @@
 This is a Pytorch implementation of Microsoft's text-to-speech system [**FastSpeech 2: Fast and High-Quality End-to-End Text to Speech**](https://arxiv.org/abs/2006.04558). This project is based on [xcmyz's implementation](https://github.com/xcmyz/FastSpeech) of FastSpeech. Feel free to use/modify the code. Any improvement suggestion is appreciated.
+
+# Audio Samples
 Audio samples generated by this implementation can be found [here](https://ming024.github.io/FastSpeech2/).
 - The model used to generate these samples is trained for 30k steps on [LJSpeech](https://keithito.com/LJ-Speech-Dataset/) dataset.
@@ -32,7 +34,7 @@ python3 synthesis.py --step 300000
 to generate any utterances you wish to. The generated utterances will be put in the ``results/`` directory.
 
 Here is a generated spectrogram of the sentence "Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition"
-
+
 
 # Training
 
@@ -91,19 +93,29 @@ Train your model with
 python3 train.py
 ```
 
-There might be some room for improvement for this repository. For example, I just simply add up
+The model takes fewer than 10000 training steps (less than 1 hour on my GTX1080 GPU) to generate audio samples of acceptable quality, which is much more efficient than autoregressive models such as Tacotron 2.
+
+There might be some room for improvement in this repository. For example, I simply add up
 the duration loss, f0 loss, energy loss and mel loss without any weighting. Please inform me if you find any useful tip for training the FastSpeech2 model.
+
+# Notes
+## Implementation Issues
+There are several differences between my implementation and the paper.
+- The paper includes punctuation in the transcripts. However, MFA discards punctuation by default, and I haven't found a way to solve this.
+- Following [xcmyz's implementation](https://github.com/xcmyz/FastSpeech), I use an additional Tacotron-2-style postnet after the FastSpeech decoder, which is not used in the original paper.
+- The [Transformer paper](https://arxiv.org/abs/1706.03762) suggests applying dropout after the input and positional embeddings. I haven't tried it yet.
+- The paper suggests using L1 loss for the mel loss and L2 loss for the variance predictor losses, but I find it easier to train the model with an L2 mel loss and L1 variance adaptor losses, for reasons unknown.
+- I use gradient clipping and weight decay during training.
+## TODO
+- Try different weights for the loss terms.
+- Mask out the paddings in the loss computation.
+- Evaluate the quality of the synthesized audio on the validation set.
+- Compare the F0 & energy predicted by the variance predictor with the F0 & energy of the synthesized utterances.
-MFA with punctuation...?
-input embedding dropout...?
-additional postnet
-gradient clipping
-weight decay
-L1 loss or L2 loss?
-weights of the loss terms?
-loss masking
-eval the output
-some problem in eval.py(dataset)
-util/plot
\ No newline at end of file
+# References
+- [FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558), Y. Ren *et al*.
+- [FastSpeech: Fast, Robust and Controllable Text to Speech](https://arxiv.org/abs/1905.09263), Y. Ren *et al*.
+- [xcmyz's FastSpeech implementation](https://github.com/xcmyz/FastSpeech)
+- [NVIDIA's WaveGlow implementation](https://github.com/NVIDIA/waveglow)
\ No newline at end of file
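
The notes in the patch above say that the duration, f0, energy and mel losses are simply summed without weighting, that an L2 mel loss with L1 variance losses trains more easily, and that the loss computation does not yet mask out the paddings. The following is a minimal PyTorch sketch of how a weighted, padding-masked version of that combined loss could look; the function name, tensor shapes and default weights are illustrative assumptions, not code from this repository.

```python
import torch
import torch.nn.functional as F

def fastspeech2_loss(mel_pred, mel_target, log_d_pred, log_d_target,
                     f0_pred, f0_target, energy_pred, energy_target,
                     src_mask, mel_mask, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted, padding-masked combined loss (illustrative sketch).

    src_mask:  (B, T_src) boolean, True at valid (non-padded) phoneme positions.
    mel_mask:  (B, T_mel) boolean, True at valid (non-padded) mel frames.
    """
    w_mel, w_d, w_f0, w_energy = weights

    # L2 (MSE) mel loss, averaged over valid frames only.
    mel_mask_ = mel_mask.unsqueeze(-1)                      # (B, T_mel, 1)
    mel_loss = F.mse_loss(mel_pred.masked_select(mel_mask_),
                          mel_target.masked_select(mel_mask_))

    # L1 losses for the variance predictors, restricted to valid phoneme positions.
    d_loss = F.l1_loss(log_d_pred.masked_select(src_mask),
                       log_d_target.masked_select(src_mask))
    f0_loss = F.l1_loss(f0_pred.masked_select(src_mask),
                        f0_target.masked_select(src_mask))
    energy_loss = F.l1_loss(energy_pred.masked_select(src_mask),
                            energy_target.masked_select(src_mask))

    total = (w_mel * mel_loss + w_d * d_loss
             + w_f0 * f0_loss + w_energy * energy_loss)
    return total, (mel_loss, d_loss, f0_loss, energy_loss)
```

With `weights=(1.0, 1.0, 1.0, 1.0)` this reduces to the plain unweighted sum described in the notes, so trying different loss weights only means changing that tuple.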
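The implementation notes also mention an additional Tacotron-2-style postnet after the FastSpeech decoder, following xcmyz's implementation. Below is a compact sketch of what such a postnet usually looks like, predicting a residual that refines the decoder's mel output; the layer count, channel width, kernel size and dropout rate follow the Tacotron 2 paper and are assumptions here, not a copy of this repository's module.

```python
import torch
import torch.nn as nn

class Postnet(nn.Module):
    """Tacotron-2-style postnet: five 1-D convolutions predicting a residual
    that is added to the decoder's mel output (hyperparameters are assumptions)."""

    def __init__(self, n_mels=80, channels=512, kernel_size=5, n_layers=5, p_dropout=0.5):
        super().__init__()
        pad = (kernel_size - 1) // 2
        layers = []
        for i in range(n_layers):
            in_ch = n_mels if i == 0 else channels
            out_ch = n_mels if i == n_layers - 1 else channels
            block = [nn.Conv1d(in_ch, out_ch, kernel_size, padding=pad),
                     nn.BatchNorm1d(out_ch)]
            if i != n_layers - 1:
                block.append(nn.Tanh())          # tanh on all but the last layer
            block.append(nn.Dropout(p_dropout))
            layers.append(nn.Sequential(*block))
        self.layers = nn.ModuleList(layers)

    def forward(self, mel):                      # mel: (B, T, n_mels)
        x = mel.transpose(1, 2)                  # Conv1d expects (B, C, T)
        for layer in self.layers:
            x = layer(x)
        return mel + x.transpose(1, 2)           # residual refinement
```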
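Another note refers to the Transformer paper's suggestion of applying dropout to the sum of the input embedding and the positional encoding, which has not been tried here. The sketch below shows that placement; the module name, the sinusoidal-encoding helper and the 0.1 dropout rate are assumptions for illustration.

```python
import math
import torch
import torch.nn as nn

class EmbeddingWithDropout(nn.Module):
    """Phoneme embedding + sinusoidal positional encoding, followed by dropout
    on their sum, as suggested in the Transformer paper (illustrative sketch)."""

    def __init__(self, vocab_size, d_model, max_len=2000, p_dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.dropout = nn.Dropout(p_dropout)

        # Standard sinusoidal positional encoding table, shape (1, max_len, d_model).
        position = torch.arange(max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float()
                             * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.register_buffer("pe", pe.unsqueeze(0))

    def forward(self, tokens):                   # tokens: (B, T) int64 phoneme IDs
        x = self.embed(tokens) + self.pe[:, :tokens.size(1)]
        return self.dropout(x)                   # dropout applied to the summed embeddings
```

Calling `EmbeddingWithDropout(vocab_size=300, d_model=256)` on a `(batch, time)` tensor of phoneme IDs returns a `(batch, time, 256)` tensor with dropout already applied before the encoder.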
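Finally, the notes state that gradient clipping and weight decay are used during training, without giving values. The generic training-step sketch below shows where the two typically fit in PyTorch; the dummy model, the Adam optimizer, the 1e-6 weight-decay value and the 1.0 max-norm threshold are all assumptions, not this repository's settings.

```python
import torch
import torch.nn as nn

# Dummy stand-ins so the sketch runs end to end; in the real training script the
# model would be FastSpeech 2 and the criterion the combined loss sketched above.
model = nn.Linear(80, 80)
criterion = nn.MSELoss()
batches = [(torch.randn(16, 80), torch.randn(16, 80)) for _ in range(3)]

# Weight decay is configured on the optimizer (the 1e-6 value is an assumption).
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-6)

for inputs, targets in batches:
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()

    # Gradient clipping sits between backward() and step();
    # the max-norm threshold of 1.0 is an assumption.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
```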