We propose to pretrain a TTS model with unlabeled multilingual speech corpora and then fine-tune it with a small paired speech-text dataset.
The overall procedure is as follows:
- Make a pseudo phoneme sequence for each unlabeled waveform using wav2vec 2.0 XLSR-53 and k-means clustering (see the sketch after this list)
- Pre-train a model with those waveforms and pseudo phoneme sequences as if they were paired speech-text data
- Fine-tune the model with a small paired dataset of the target language
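
For intuition, here is a minimal sketch of how such pseudo phoneme sequences can be derived with Hugging Face `transformers` and scikit-learn. This is not the exact code in this repository (see `get_cluster.py` / `get_cluster_idx.py` below); the hidden-layer index, cluster count, and file names are illustrative assumptions.

```python
# Hedged sketch of the pseudo phoneme labeling idea, NOT the repository's code.
# The hidden-layer index (15), the cluster count (128), and "a.wav"/"b.wav"
# are illustrative placeholders.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

@torch.no_grad()
def frame_features(wav_path, layer=15):
    """Return per-frame XLSR-53 hidden states of shape (T, D) for one waveform."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    out = model(inputs.input_values, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)

# 1) Fit k-means centroids on frame features pooled over the unlabeled corpus
#    (conceptually what get_cluster.py does).
feats = torch.cat([frame_features(p) for p in ["a.wav", "b.wav"]])
kmeans = MiniBatchKMeans(n_clusters=128, batch_size=4096).fit(feats.numpy())

# 2) Assign every frame of a waveform to its nearest centroid; the resulting
#    index sequence is used as the pseudo phoneme sequence for that waveform
#    (conceptually what get_cluster_idx.py does).
pseudo_phonemes = kmeans.predict(frame_features("a.wav").numpy())
print(pseudo_phonemes[:20])
```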
- Python 3.7
- Clone this repository
- Install Python packages. Please refer to requirements.txt.
- Download datasets
  - We used a subset of the MLS dataset as the multilingual speech corpus and a subset of LibriTTS as the low-resource dataset.
- Build the Cython version of Monotonic Alignment Search if you want:
  ```sh
  cd utils_/monotonic_align
  python setup.py build_ext --inplace
  ```
```sh
# get and save centroids
python get_cluster.py data_dir_path --save-dir save_dir_path --checkpoint facebook/wav2vec2-large-xlsr-53 --gpu gpu_index

# get pseudo phoneme sequences using the nearest centroids for each waveform
python get_cluster_idx.py data_dir_path --checkpoint facebook/wav2vec2-large-xlsr-53 --path centroids_dir --gpu gpu_index
```
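
As a rough illustration of the nearest-centroid lookup performed by the second script (the actual centroid file format and argument handling in this repo may differ; `centroids.npy` and the feature tensor below are placeholders):

```python
# Hedged illustration of nearest-centroid assignment; get_cluster_idx.py may
# load and store centroids differently. "centroids.npy" is a hypothetical name.
import numpy as np
import torch

centroids = torch.from_numpy(np.load("centroids.npy")).float()  # (K, D) cluster centers
features = torch.randn(250, centroids.shape[1])                 # (T, D) XLSR-53 frame features

# Each frame gets the index of its closest centroid (Euclidean distance);
# the index sequence serves as the pseudo phoneme sequence for that waveform.
cluster_idx = torch.cdist(features, centroids).argmin(dim=1)    # (T,)
```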
```sh
# pre-train with the P1 setting
python main_.py --config configs/P1.yaml --train

# pre-train with the P2 setting
python main_.py --config configs/P2.yaml --train

# fine-tune the model pre-trained with the P1 setting
python main_.py --config configs/P1T1F2.yaml --train

# fine-tune the model pre-trained with the P2 setting
python main_.py --config configs/P2T2F2.yaml --train
```
Some of our code is copied and modified from:
- facebookresearch/fairseq (MIT license)
- coqui-ai/TTS (MPL-2.0 license)
- huggingface/transformers (Apache-2.0 license)
- VITS implementation (MIT license)
- NANSY implementation
Our code is under the MPL-2.0 license.