This repository provides an unofficial PyTorch implementation of Tacotron2. For a more detailed and accurate implementation, it is strongly recommended to refer to NVIDIA's Tacotron2 repository. This implementation serves as the acoustic model, generating mel-spectrograms from text. To produce actual speech, you will need to integrate a vocoder model provided by torchaudio or train one yourself. Most of the code in this repository is based on NVIDIA's Tacotron2 implementation.
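For a sense of how the acoustic model and the vocoder fit together, torchaudio ships a pretrained Tacotron2 + WaveRNN pipeline that runs the full text-to-speech chain end to end. The sketch below is independent of this repository's code and uses torchaudio's own pretrained weights:

```python
import torch
import torchaudio

# Pretrained character-based Tacotron2 + WaveRNN bundle from torchaudio.
bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
processor = bundle.get_text_processor()  # text -> token IDs
tacotron2 = bundle.get_tacotron2()       # token IDs -> mel-spectrogram
vocoder = bundle.get_vocoder()           # mel-spectrogram -> waveform

with torch.inference_mode():
    tokens, lengths = processor("Hello, Tacotron2!")
    mel, mel_lengths, _ = tacotron2.infer(tokens, lengths)
    waveforms, _ = vocoder(mel, mel_lengths)

torchaudio.save("sample.wav", waveforms[0:1].cpu(), vocoder.sample_rate)
```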
- Docker and NVIDIA GPU Drivers (for GPU support)
1. Clone the repository:

   ```bash
   git clone https://github.com/Orca0917/Tacotron2.git
   cd Tacotron2
   ```

2. Build the Docker image:

   ```bash
   docker build -t tacotron2 .
   ```

3. Run the Docker container:

   ```bash
   docker run -it --name tacotron2-container --gpus all tacotron2
   ```

4. Train the model inside the container:

   ```bash
   python train.py
   ```
Once the model is trained, the acoustic model produces a mel-spectrogram as its output, which can be visualized like the image below:
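To reproduce such a plot yourself, a minimal matplotlib sketch could look like the following; the tensor name `mel` and the `(n_mels, frames)` shape are assumptions about this model's output, not values confirmed by the repository:

```python
import matplotlib.pyplot as plt
import torch

# Placeholder for the acoustic model's prediction; assumed shape (n_mels, frames).
mel = torch.rand(80, 500)

plt.figure(figsize=(10, 4))
plt.imshow(mel.numpy(), aspect="auto", origin="lower", interpolation="none")
plt.xlabel("Frames")
plt.ylabel("Mel bins")
plt.title("Predicted mel-spectrogram")
plt.colorbar()
plt.tight_layout()
plt.savefig("mel_spectrogram.png")
```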
This implementation focuses on generating mel-spectrograms. To complete the text-to-speech pipeline, you will need to use a vocoder (e.g., WaveNet, Griffin-Lim, or a model from torchaudio) to convert the spectrograms into waveform audio.
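As a rough starting point for the Griffin-Lim route, here is a minimal sketch using torchaudio. The parameters (n_fft=1024, hop_length=256, 80 mel bins, 22050 Hz) are common Tacotron2 defaults, not values confirmed by this repository, and they must match those used to compute the training spectrograms:

```python
import torch
import torchaudio

# Placeholder for the acoustic model's output, assumed shape (n_mels, frames).
# If the model predicts log-mel values, convert back first, e.g. mel = torch.exp(mel).
mel = torch.rand(80, 500)

# Map the mel-spectrogram back to a linear-frequency spectrogram, then
# estimate a waveform with Griffin-Lim phase reconstruction.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=1024 // 2 + 1, n_mels=80, sample_rate=22050
)
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=1024, hop_length=256, n_iter=60
)

waveform = griffin_lim(inverse_mel(mel))  # shape (time,)
torchaudio.save("output.wav", waveform.unsqueeze(0), 22050)
```

Griffin-Lim output is noticeably lower quality than a neural vocoder such as WaveNet or WaveRNN, but it is a quick, dependency-free way to sanity-check the acoustic model.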
[1] J. Shen et al., "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions." https://arxiv.org/abs/1712.05884

[2] NVIDIA/tacotron2, GitHub. https://github.com/NVIDIA/tacotron2