VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

[~~this is a work in progress, feel free to contribute! Model will be ready if this line is removed~~] [Most of the code is ready and the model is ready to train. Hopefully, someone from the community can share some results ASAP! Thanks!]

Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim

Unofficial implementation of the VITS2 paper, the sequel to the VITS paper. (Thanks to the authors for their work!)

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Credits

We will build this repo based on the VITS repo. Currently I am adding the vits2 changes in the 'notebooks' folder. The goal is to make transfer learning from a pretrained VITS model easy!
(08-17-2023) - The authors were really kind to guide me through the paper and answer my questions. I am open to discuss any changes or answer questions regarding the implementation. Please feel free to open an issue or contact me directly.

Jupyter Notebook for initial experiments

check the 'notebooks' folder

pre-requisites

1. Python >= 3.6
2. Now supports PyTorch version 2.0
3. Clone this repository
4. Install Python requirements. Please refer to requirements.txt
    1. You may need to install espeak first: apt-get install espeak
5. Download datasets
    1. Download and extract the LJ Speech dataset, then rename or create a link to the dataset folder: ln -s /path/to/LJSpeech-1.1/wavs DUMMY1
    2. For the multi-speaker setting, download and extract the VCTK dataset, and downsample the wav files to 22050 Hz. Then rename or create a link to the dataset folder: ln -s /path/to/VCTK-Corpus/downsampled_wavs DUMMY2
6. Build Monotonic Alignment Search and run preprocessing if you use your own datasets.

# Cython-version Monotonic Alignment Search
cd monotonic_align
python setup.py build_ext --inplace

# Preprocessing (g2p) for your own datasets. Preprocessed phonemes for LJ Speech and VCTK have already been provided.
# python preprocess.py --text_index 1 --filelists filelists/ljs_audio_text_train_filelist.txt filelists/ljs_audio_text_val_filelist.txt filelists/ljs_audio_text_test_filelist.txt 
# python preprocess.py --text_index 2 --filelists filelists/vctk_audio_sid_text_train_filelist.txt filelists/vctk_audio_sid_text_val_filelist.txt filelists/vctk_audio_sid_text_test_filelist.txt
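
The --text_index flag above selects which column of a filelist line holds the transcript. A minimal sketch of that idea, assuming the pipe-separated layout used by the original VITS filelists (path|text for LJ Speech, path|speaker_id|text for VCTK). The helper name split_filelist_line and the sample lines are hypothetical and are not part of preprocess.py:

```python
def split_filelist_line(line, text_index, sep="|"):
    """Split one filelist line and return (fields, transcript).

    Assumes a pipe-separated layout; text_index is the column
    that holds the transcript (1 for LJ Speech, 2 for VCTK).
    """
    fields = line.rstrip("\n").split(sep)
    if not 0 <= text_index < len(fields):
        raise ValueError(
            f"text_index {text_index} out of range for {len(fields)} fields"
        )
    return fields, fields[text_index]


# Hypothetical sample lines, for illustration only.
ljs_line = "DUMMY1/LJ001-0001.wav|Printing, in the only sense.\n"
vctk_line = "DUMMY2/p225/p225_001.wav|0|Please call Stella.\n"

_, ljs_text = split_filelist_line(ljs_line, text_index=1)
vctk_fields, vctk_text = split_filelist_line(vctk_line, text_index=2)
print(ljs_text)
print(vctk_fields[1], vctk_text)
```

If a transcript itself could contain the separator, the split would need a maxsplit argument; the sketch ignores that case.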

How to run (dry-run)

model forward pass (dry-run)

import torch
from models import SynthesizerTrn

net_g = SynthesizerTrn(
    n_vocab=256,
    spec_channels=80, # <--- vits2 parameter (changed from 513 to 80)
    segment_size=8192,
    inter_channels=192,
    hidden_channels=192,
    filter_channels=768,
    n_heads=2,
    n_layers=6,
    kernel_size=3,
    p_dropout=0.1,
    resblock="1", 
    resblock_kernel_sizes=[3, 7, 11],
    resblock_dilation_sizes=[[1, 3, 5], [1, 3, 5], [1, 3, 5]],
    upsample_rates=[8, 8, 2, 2],
    upsample_initial_channel=512,
    upsample_kernel_sizes=[16, 16, 4, 4],
    n_speakers=0,
    gin_channels=0,
    use_sdp=True, 
    use_transformer_flows=True, # <--- vits2 parameter
    transformer_flow_type="fft", # <--- vits2 parameter (choose from "pre_conv","fft","mono_layer")
    use_spk_conditioned_encoder=True, # <--- vits2 parameter
    use_noise_scaled_mas=True, # <--- vits2 parameter
)

x = torch.LongTensor([[1, 2, 3],[4, 5, 6]]) # token ids
x_lengths = torch.LongTensor([3, 2]) # token lengths
y = torch.randn(2, 80, 100) # mel spectrograms
y_lengths = torch.LongTensor([100, 80]) # mel spectrogram lengths (integer frame counts)

net_g(
    x=x,
    x_lengths=x_lengths,
    y=y,
    y_lengths=y_lengths,
)

# calculate loss and backpropagate
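
The decoder settings in the dry run are related by simple arithmetic. A small worked check, assuming (as in the HiFi-GAN-style decoder of the original VITS) that the product of upsample_rates is the number of audio samples generated per spectrogram frame, and that a segment size of 8192 is counted in audio samples as in the VITS training configs. The variable names are illustrative:

```python
upsample_rates = [8, 8, 2, 2]
segment_size = 8192  # assumed to be counted in audio samples

# Samples produced per spectrogram frame (assumed equal to the hop length).
hop_length = 1
for rate in upsample_rates:
    hop_length *= rate

# Number of spectrogram frames covered by one training segment.
frames_per_segment = segment_size // hop_length

# Waveform length the decoder would produce for a 100-frame input.
samples_for_100_frames = 100 * hop_length

print(hop_length, frames_per_segment, samples_for_100_frames)
```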

Training Example

# LJ Speech
python train.py -c configs/vits2_ljs_base.json -m ljs_base

# VCTK
python train_ms.py -c configs/vits2_vctk_base.json -m vctk_base

Updates, TODOs, features and notes

note - the duration predictor is not adversarial yet. In my earlier experiments with VITS-1, I used the deterministic duration predictor (no SDP) and found it to be quite good, so I am not sure the adversarial duration predictor is necessary. I will add it if it turns out to be needed. I also want to combine the learnable upsampling layers of Parallel Tacotron 2 and NaturalSpeech 1 to remove MAS completely and get an E2E differentiable model.

(08/17/2023) update 4 - Fixed multi-spk DataLoader
(08/17/2023) update 3 - QOL changes to generate mel spec from existing lin spec. Updated inference.ipynb.
(08/17/2023) update 2 - hotfix for "use_mel_posterior_encoder" flag in config file. Should fix #8 and #9. Will do an if-else cleanup later.
(08/17/2023) update 1 - After some discussions with the authors, I implemented the "mono-layer" transformer flow, which seems to be the closest to what they intend. It is a single-layer transformer flow used as the first layer before the traditional conv-residual flows. Experiments are still needed to find the best of the three transformer flow types (pre_conv, fft, mono_layer). Each of them serves a similar purpose: modeling long-range dependencies using attention.
(08/10/2023) update 1 - updated multi_GPU training, support pytorch2.0 #5
(08/09/2023) update - Corrected MAS with noise_scale and updated train_ms.py, train.py (thanks to @KdaiP for testing and pointing out the bug in MAS)
(08/08/2023) update 2 - Changed data_utils.py to take in "use_mel_posterior_encoder" flag.
(08/08/2023) update 1 - Added "use_noise_scaled_mas" flag in config file. Added sanity checks in notebooks. Everything except the adversarial duration predictor is ready to train.
(08/07/2023) update 2 - transformer_flow_type "fft" and "pre_conv" added. @lexkoro suggested that the "fft" transformer flow is better than the "pre_conv" transformer flow in his initial experiments.
(08/07/2023) update 1 - vits2_vctk_base.json and vits2_ljs_base.json are ready to train (multi-speaker and single-speaker models, respectively).
(08/06/2023) update - dry run is ready; the duration predictor will be completed within the next week.
(08/05/2023) update - everything except the duration predictor is ready to train, and we can expect some improvement over VITS1.
(08/04/2023) update - initial codebase is ready; paper is being read.

Duration predictor (fig 1a)

Added LSTM discriminator to duration predictor in notebook.
Added adversarial loss to duration predictor (TODO)
Monotonic Alignment Search with Gaussian Noise added in 'notebooks' folder; need expert verification (Section 2.2)
Added "use_noise_scaled_mas" flag in config file. Choose from True or False; updates noise while training based on number of steps and never goes below 0.0
Update models.py/train.py/train_ms.py
Update config files (vits2_vctk_base.json; vits2_ljs_base.json)
Update losses.py (TODO)
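
The "use_noise_scaled_mas" behaviour listed above (noise that shrinks with the number of training steps and never goes below 0.0) can be sketched as a clamped schedule. This is a minimal sketch assuming a linear decay; the function name and the default values (initial scale 0.01, decrement 2e-6 per step) are illustrative assumptions, not values stated in this README:

```python
def mas_noise_scale(step, initial=0.01, delta=2e-6):
    """Return the MAS noise scale at a given training step.

    Assumes a linear decay of `delta` per step, clamped at 0.0.
    """
    return max(initial - delta * step, 0.0)


# The scale starts at `initial`, shrinks as training proceeds,
# and stays at 0.0 once the decay has used it up.
for step in (0, 2500, 10_000_000):
    print(step, mas_noise_scale(step))
```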

Transformer block in the normalizing flow (fig 1b)

Added transformer block to the normalizing flow in notebook. There are three types of transformer blocks: pre-convolution (my implementation), FFT (from so-vits-svc repo) and mono-layer.
Added "transformer_flow_type" flag in config file. Choose from "pre_conv" or "fft" or "mono_layer".
Added layers and blocks in models.py (ResidualCouplingTransformersLayer, ResidualCouplingTransformersBlock, FFTransformerCouplingLayer, MonoTransformerFlowLayer)
Add in config file (vits2_ljs_base.json; can be turned on using "use_transformer_flows" flag)
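
Since "transformer_flow_type" accepts only the three values listed above, it can be useful to check the flags before building a model. A minimal sketch, assuming the flags are available as a plain dict; the helper check_flow_flags is hypothetical and is not part of models.py:

```python
ALLOWED_FLOW_TYPES = ("pre_conv", "fft", "mono_layer")


def check_flow_flags(model_cfg):
    """Return the selected flow type, or None if transformer flows are off.

    Raises ValueError when transformer flows are enabled but the
    flow type is not one of the three documented values.
    """
    use_flows = bool(model_cfg.get("use_transformer_flows", False))
    if not use_flows:
        return None
    flow_type = model_cfg.get("transformer_flow_type")
    if flow_type not in ALLOWED_FLOW_TYPES:
        raise ValueError(
            f"transformer_flow_type must be one of {ALLOWED_FLOW_TYPES}, "
            f"got {flow_type!r}"
        )
    return flow_type


print(check_flow_flags({"use_transformer_flows": True,
                        "transformer_flow_type": "fft"}))
```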

Speaker-conditioned text encoder (fig 1c)

Added speaker embedding to the text encoder in notebook.
Added speaker embedding to the text encoder in models.py (TextEncoder; backward compatible with VITS)
Add in config file (vits2_ljs_base.json; can be turned on using "use_spk_conditioned_encoder" flag)

Mel spectrogram posterior encoder (Section 3)

Added mel spectrogram posterior encoder in notebook.
Added mel spectrogram posterior encoder in train.py
Added new config file (vits2_ljs_base.json; can be turned on using "use_mel_posterior_encoder" flag)
Updated 'data_utils.py' to use the "use_mel_posterior_encoder" flag for vits2
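
The change of spec_channels from 513 to 80 in the dry run follows from the posterior encoder reading a mel spectrogram instead of a linear spectrogram. A small sketch of that choice, assuming a 1024-point FFT (which gives 1024 // 2 + 1 = 513 linear bins) and 80 mel channels; the function name and its defaults are illustrative:

```python
def posterior_channels(use_mel_posterior_encoder,
                       filter_length=1024,
                       n_mel_channels=80):
    """Return the number of input channels for the posterior encoder."""
    if use_mel_posterior_encoder:
        # VITS2-style: mel spectrogram input.
        return n_mel_channels
    # VITS-style: linear spectrogram input, one bin per
    # non-negative FFT frequency.
    return filter_length // 2 + 1


print(posterior_channels(True), posterior_channels(False))
```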

Training scripts

Added vits2 flags to train.py (single-speaker model)
Added vits2 flags to train_ms.py (multi-speaker model)

Special mentions

@erogol for quick feedback and guidance. (Please check his awesome CoquiTTS repo).
@lexkoro for discussions and help with the prototype training.
@manmay-nakhashi for discussions and help with the code.
@athenasaurav for offering GPU support for training.
