VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design

Jungil Kong, Jihoon Park, Beomjeong Kim, Jeongmin Kim, Dohee Kong, Sangjin Kim

Unofficial implementation of the VITS2 paper, the sequel to the VITS paper. (Thanks to the authors for their work!)

Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and show that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Notes

Supports a 44100 Hz (44.1 kHz) sample rate.
No language barrier between models: uses almost all IPA phonemes as input.
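
To confirm that a WAV file actually uses the 44100 Hz sample rate mentioned in the notes above, something like the following minimal sketch could be used. It relies only on the Python standard library; the helper names `wav_sample_rate` and `matches_target_rate` are hypothetical and are not part of this repository.

```python
import wave

# Assumption: the target rate is the 44100 Hz figure from the notes above.
TARGET_SAMPLE_RATE = 44100


def wav_sample_rate(path):
    """Return the sample rate (in Hz) stored in a PCM WAV file's header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_target_rate(path, target=TARGET_SAMPLE_RATE):
    """Return True if the WAV file at `path` uses the target sample rate."""
    return wav_sample_rate(path) == target
```

Files that do not match would need to be resampled before use; how that is done depends on your own tooling and is not shown here.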

Prerequisites

Python >= 3.10
Tested on PyTorch version 1.13.1 with Google Colab and LambdaLabs cloud.
Clone this repository.
Install the Python requirements. Please refer to requirements.txt.
1. You may need to install espeak first: apt-get install espeak
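
The prerequisites above can be checked before installing anything else. The following is a minimal sketch, assuming that `espeak` is expected to be on the `PATH` under that name; the function `check_prerequisites` is hypothetical and does not exist in this repository.

```python
import shutil
import sys

# Assumption: minimum version taken from the "Python >= 3.10" prerequisite.
MIN_PYTHON = (3, 10)


def check_prerequisites(min_python=MIN_PYTHON, espeak_binary="espeak"):
    """Return a list of human-readable problems; the list is empty if none are found."""
    problems = []
    if sys.version_info < min_python:
        required = ".".join(str(part) for part in min_python)
        problems.append(f"Python >= {required} is required")
    # shutil.which returns None when the executable is not found on PATH.
    if shutil.which(espeak_binary) is None:
        problems.append(f"'{espeak_binary}' was not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```

Run as a script, it prints one warning per missing prerequisite and nothing if both checks pass. It does not check the installed PyTorch version or the packages listed in requirements.txt.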

Special mentions

@erogol for quick feedback and guidance. (Please check his awesome CoquiTTS repo).
@lexkoro for discussions and help with the prototype training.
@manmay-nakhashi for discussions and help with the code.
@athenasaurav for offering GPU support for training.
@w11wo for ONNX support.
@Subarasheese for Gradio UI.

Repository: EX3exp/MiriVoiceSupport-VITS2
