We propose to pretrain a TTS model with unlabeled multilingual speech corpora and then fine-tune it with a small paired speech-text dataset.
The overall procedure is as follows:
- Make a pseudo phoneme sequence for each unlabeled waveform using wav2vec 2.0 XLSR-53 and k-means clustering (see the sketch after this list)
- Pre-train a model with those waveforms and pseudo phoneme sequences as if they were paired speech-text data
- Fine-tune the model with a small paired dataset of the target language
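
For intuition, here is a minimal sketch of how such pseudo phoneme sequences can be derived with Hugging Face `transformers` and scikit-learn. This is not the exact code in this repository (see `get_cluster.py` / `get_cluster_idx.py` below); the hidden-layer index, cluster count, and file names are illustrative assumptions.

```python
# Hedged sketch of the pseudo phoneme labeling idea, NOT the repository's code.
# The hidden-layer index (15), the cluster count (128), and "a.wav"/"b.wav"
# are illustrative placeholders.
import torch
import torchaudio
from sklearn.cluster import MiniBatchKMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

CHECKPOINT = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(CHECKPOINT)
model = Wav2Vec2Model.from_pretrained(CHECKPOINT).eval()

@torch.no_grad()
def frame_features(wav_path, layer=15):
    """Return per-frame XLSR-53 hidden states of shape (T, D) for one waveform."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav.mean(dim=0), sr, 16000)
    inputs = extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    out = model(inputs.input_values, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)

# 1) Fit k-means centroids on frame features pooled over the unlabeled corpus
#    (conceptually what get_cluster.py does).
feats = torch.cat([frame_features(p) for p in ["a.wav", "b.wav"]])
kmeans = MiniBatchKMeans(n_clusters=128, batch_size=4096).fit(feats.numpy())

# 2) Assign every frame of a waveform to its nearest centroid; the resulting
#    index sequence is used as the pseudo phoneme sequence for that waveform
#    (conceptually what get_cluster_idx.py does).
pseudo_phonemes = kmeans.predict(frame_features("a.wav").numpy())
print(pseudo_phonemes[:20])
```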
- Python 3.7
- Clone this repository
- Install Python packages. Please refer to requirements.txt.
- Download datasets
  - We used a subset of the MLS dataset as the multilingual speech corpus and a subset of LibriTTS as the low-resource dataset.
- Build the Cython version of Monotonic Alignment Search if you want:
  ```sh
  cd utils_/monotonic_align
  python setup.py build_ext --inplace
  ```
```sh
# get and save centroids
python get_cluster.py data_dir_path --save-dir save_dir_path --checkpoint facebook/wav2vec2-large-xlsr-53 --gpu gpu_index

# get pseudo phoneme sequences using the nearest centroids for each waveform
python get_cluster_idx.py data_dir_path --checkpoint facebook/wav2vec2-large-xlsr-53 --path centroids_dir --gpu gpu_index
```
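
As a rough illustration of the nearest-centroid lookup performed by the second script (the actual centroid file format and argument handling in this repo may differ; `centroids.npy` and the feature tensor below are placeholders):

```python
# Hedged illustration of nearest-centroid assignment; get_cluster_idx.py may
# load and store centroids differently. "centroids.npy" is a hypothetical name.
import numpy as np
import torch

centroids = torch.from_numpy(np.load("centroids.npy")).float()  # (K, D) cluster centers
features = torch.randn(250, centroids.shape[1])                 # (T, D) XLSR-53 frame features

# Each frame gets the index of its closest centroid (Euclidean distance);
# the index sequence serves as the pseudo phoneme sequence for that waveform.
cluster_idx = torch.cdist(features, centroids).argmin(dim=1)    # (T,)
```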
```sh
# pre-train with the P1 setting
python main_.py --config configs/P1.yaml --train

# pre-train with the P2 setting
python main_.py --config configs/P2.yaml --train

# fine-tune the model pre-trained with the P1 setting
python main_.py --config configs/P1T1F2.yaml --train

# fine-tune the model pre-trained with the P2 setting
python main_.py --config configs/P2T2F2.yaml --train
```
Some of our code is copied and modified from:
- facebookresearch/fairseq (MIT license)
- coqui-ai/TTS (MPL-2.0 license)
- huggingface/transformers (Apache-2.0 license)
- VITS implementation (MIT license)
- NANSY implementation
Our code is under the MPL-2.0 license.