Indonesian Audio Samples #70
-
Thanks for sharing your results. I believe the future of generative TTS is in conditional flow matching. You can safely skip VITS2; it was just a hobby project for me.
-
"How is the symbols.py file for Indonesian generated? The phonemes I obtained are for each word, and there is no way to split them into individual phonemes." |
-
Hey, firstly thanks for making this repo!
I'm just going to share a few results from training on Indonesian speech. I've trained VITS (via Piper) and VITS2. Both use Indonesian IPA phonemes as modelling units, which I obtained by phonemizing Indonesian text using g2p ID (disclaimer: my team and I built this tool).
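For anyone who wants to reproduce the phonemization step, here is a minimal sketch. It assumes the g2p ID Python package exposes a callable `G2p` class as shown in its README; treat the exact import path and output shape as assumptions, not guarantees:

```python
# Minimal sketch of word-level IPA phonemization with g2p ID.
# Assumption: the package exposes a G2p class whose instances are
# callable on raw text, as shown in the project's README.
from g2p_id import G2p

g2p = G2p()
phonemes = g2p("Apel itu berwarna merah.")
# Expected shape: one list of phoneme symbols per word, e.g.
# [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ...]
print(phonemes)
```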
We also followed Piper's medium model configuration at 44.1kHz, so the resulting VITS2 model is smaller than the default config (about 16M parameters, 66.5MB after exporting to ONNX). We trained both models on a Microsoft Azure Indonesian male voice (ArdiNeural). You can find the relevant files here.
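If you want to sanity-check the parameter count of an exported model yourself, summing the weight initializers in the ONNX graph is a quick way to do it (the file path below is a placeholder):

```python
# Count parameters in an exported ONNX model by summing the element
# counts of all weight initializers. "model.onnx" is a placeholder.
import math
import onnx

model = onnx.load("model.onnx")
n_params = sum(math.prod(init.dims) for init in model.graph.initializer)
print(f"{n_params / 1e6:.1f}M parameters")
```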
At least on this dataset, I don't notice much of a quality difference compared to VITS1. What I do like, though, is being able to drop the blank/padding tokens that VITS1 needs for stability. That's really helpful when deploying to a mobile device where computational resources are very limited, since cutting the input ID sequence length roughly in half helps a lot (see the sketch below).
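For readers unfamiliar with those blank tokens: VITS1 intersperses a blank id between every phoneme id for alignment stability, which is why removing them roughly halves the input length. The sketch below is modeled on the `intersperse` helper in the original VITS code:

```python
def intersperse(lst, item):
    # Place `item` (the blank token id, typically 0) between every
    # symbol and at both ends: a sequence of length n becomes 2n + 1.
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

# Phoneme ids [5, 3, 7] -> [0, 5, 0, 3, 0, 7, 0]
print(intersperse([5, 3, 7], 0))
```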
We were also able to take this model and fine-tune it on a relatively small studio-recorded dataset of about 1,000 utterances. That is the best setup we've found so far for training on a low-resource dataset: first train on a larger synthetic audio dataset (e.g. Azure, ElevenLabs), then fine-tune on the specific target speaker.
I'd also be interested in finding the most efficient setup for training on a low-resource multi-speaker dataset. Currently I'm planning to pre-train on an English single-speaker synthetic dataset and then fine-tune it as a multi-speaker model, but this is going to take a while due to limited resources.
Hope this helps!