How much audio data does one need for VITS2 to start generalizing well? #96
michal-bida started this conversation in General (1 comment, 12 replies)
Hi all.
I am using VITS2 to train a Czech voice. My question is - from your experience - how much audio data does one need to feed into the model to get decent performance on out-of-domain words (words not seen in training)? That will surely be language dependent, but any experience is welcome.
As for my experiments: I have a low-quality dataset of 40 minutes of Czech speech. From that you get a decent-sounding Czech voice after about 12 hours of training on an A100. However, this voice does not generalize at all, although it performs well on in-domain words.
Another dataset I have is 4 hours of somewhat better-quality Czech speech. It generalizes a bit better, but it is by no means perfect. I suspect I will need perhaps twice that amount of data - maybe more - for the voice to generalize well.
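For comparison, here is a minimal sketch of how total dataset duration can be tallied, assuming a VITS-style filelist with pipe-separated `wav_path|text` lines and the `soundfile` package; the filelist path below is just a placeholder:

```python
# Minimal sketch: total up training-set duration from a VITS-style filelist.
# Assumes pipe-separated "wav_path|text" lines; the path is a placeholder.
import soundfile as sf

total_seconds = 0.0
with open("filelists/czech_train.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        wav_path = line.split("|", 1)[0]
        total_seconds += sf.info(wav_path).duration  # reads the header only

print(f"Total training audio: {total_seconds / 3600:.2f} h")
```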
Also, it seems to me that when you train the model on audio clips of different lengths, at inference the model somehow prefers the "knowledge" whose duration better matches the required duration. A concrete example: suppose the training data contains plenty of sentences with the word "liked", and also the sentence "Gollum likes his precious ring very much so."
At inference you then want to synthesize this sentence: "I liked his photo that he showed me yesterday."
You would expect "liked" to be pronounced correctly, since there were plenty of examples.
But what happened to me is that the synthesized voice sounded like "I likes his photo that he showed me yesterday." - even though there were plenty of audio examples with the word "liked" in the dataset.
I think this is because 1) "liked his" somehow aligned better in the model with "likes his", and 2) the length of the synthesized sentence aligned better with the length of "Gollum likes his precious ring very much so.".
This leads me to think that if one wants good generalization, one should perhaps use sentences of similar lengths in the training data? Alternatively, this might simply go away when more data is present...
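One way to check this hypothesis would be to look at how utterance durations are spread across the training set. A minimal sketch, assuming the same pipe-separated filelist format as above:

```python
# Minimal sketch: rough histogram of utterance durations in the training set.
import soundfile as sf
from collections import Counter

with open("filelists/czech_train.txt", encoding="utf-8") as f:
    durations = [sf.info(line.split("|", 1)[0]).duration
                 for line in f if line.strip()]

# Bucket into whole seconds to see whether lengths cluster or spread widely.
buckets = Counter(int(d) for d in durations)
for sec in sorted(buckets):
    print(f"{sec:2d}-{sec + 1} s: {buckets[sec]} clips")
```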
EDIT: I should mention that this occurred after 4715 iterations - I will train longer to see if it persists; it might go away.
What are your experiences with VITS2 performance based on the amount and structure of training data?
Reply:
Are you using phonetic transcription? Check whether "liked his" and "likes his" end up producing the same phones; if so, you will need to correct the transcription manually.
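A minimal sketch of that check, assuming the `phonemizer` package with the espeak backend (a common front end in VITS-style recipes; it needs espeak-ng installed):

```python
# Minimal sketch: compare phone sequences produced by the text front end.
from phonemizer import phonemize

for text in ["I liked his photo.", "I likes his photo."]:
    phones = phonemize(text, language="en-us", backend="espeak", strip=True)
    print(f"{text!r} -> {phones!r}")

# If both lines show the same phones for "liked his" / "likes his",
# the transcription needs to be corrected manually.
```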