How much audio data does one need for VITS2 to start generalizing well? #96
michal-bida started this conversation in General (1 comment, 12 replies)
Hi all.
I am using VITS2 to train a Czech voice. My question is - from your experience - how much audio data does one need to feed into the model to get decent performance on out-of-domain words (words not seen in training)? That will surely be language dependent, but any experience is welcome.
As for my experiments: I have a low-quality dataset of 40 minutes of Czech speech. From that you get a decent-sounding Czech voice after about 12 hours of training on an A100. However, this voice does not generalize at all, although it performs well on in-domain words.
Another dataset I have is 4 hours of somewhat better-quality Czech speech. It generalizes a bit better, but it is by no means perfect. I suspect I will need perhaps twice that amount of data - maybe more - for the voice to generalize well.
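For comparison, here is a minimal sketch of how total dataset duration can be tallied, assuming a VITS-style filelist with pipe-separated `wav_path|text` lines and the `soundfile` package; the filelist path below is just a placeholder:

```python
# Minimal sketch: total up training-set duration from a VITS-style filelist.
# Assumes pipe-separated "wav_path|text" lines; the path is a placeholder.
import soundfile as sf

total_seconds = 0.0
with open("filelists/czech_train.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        wav_path = line.split("|", 1)[0]
        total_seconds += sf.info(wav_path).duration  # reads the header only

print(f"Total training audio: {total_seconds / 3600:.2f} h")
```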
Also, it seems to me that when you train the model on audio clips of different lengths, at inference the model somehow prefers the "knowledge" whose duration better matches the required duration. A concrete example: suppose the training data contains plenty of sentences with the word "liked", and also the sentence "Gollum likes his precious ring very much so."
At inference you then want to synthesize this sentence: "I liked his photo that he showed me yesterday."
You would expect "liked" to be pronounced correctly, since there were plenty of examples.
But what happened to me is that the synthesized voice sounded like "I likes his photo that he showed me yesterday." - even though there were plenty of audio examples with the word "liked" in the dataset.
I think this is because 1) "liked his" somehow aligned better in the model with "likes his", and 2) the length of the synthesized sentence aligned better with the length of "Gollum likes his precious ring very much so.".
This leads me to think that if one wants good generalization, one should perhaps use sentences of similar lengths in the training data? Alternatively, this might simply go away when more data is present...
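One way to check this hypothesis would be to look at how utterance durations are spread across the training set. A minimal sketch, assuming the same pipe-separated filelist format as above:

```python
# Minimal sketch: rough histogram of utterance durations in the training set.
import soundfile as sf
from collections import Counter

with open("filelists/czech_train.txt", encoding="utf-8") as f:
    durations = [sf.info(line.split("|", 1)[0]).duration
                 for line in f if line.strip()]

# Bucket into whole seconds to see whether lengths cluster or spread widely.
buckets = Counter(int(d) for d in durations)
for sec in sorted(buckets):
    print(f"{sec:2d}-{sec + 1} s: {buckets[sec]} clips")
```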
EDIT: I should mention that this occurred after 4715 iterations - I will train longer to see if it persists; it might go away.
What are your experiences with VITS2 performance based on the amount and structure of training data?
Reply:
Are you using phonetic transcription? Check whether "liked his" and "likes his" end up producing the same phones; if so, you will need to correct the transcription manually.
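A minimal sketch of that check, assuming the `phonemizer` package with the espeak backend (a common front end in VITS-style recipes; it needs espeak-ng installed):

```python
# Minimal sketch: compare phone sequences produced by the text front end.
from phonemizer import phonemize

for text in ["I liked his photo.", "I likes his photo."]:
    phones = phonemize(text, language="en-us", backend="espeak", strip=True)
    print(f"{text!r} -> {phones!r}")

# If both lines show the same phones for "liked his" / "likes his",
# the transcription needs to be corrected manually.
```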