Indonesian Audio Samples #70
-
Thanks for sharing your results. I believe the future of generative TTS is in conditional flow matching. You can safely skip VITS2; it was just a hobby project for me.
-
"How is the symbols.py file for Indonesian generated? The phonemes I obtained are for each word, and there is no way to split them into individual phonemes." |
-
Hey, firstly thanks for making this repo!
I'm just going to share a few results from training on Indonesian speech. I've trained VITS (via Piper) and VITS2. Both use Indonesian IPA phonemes as modelling units, which I obtained by phonemizing Indonesian text using g2p ID (disclaimer: my team and I built this tool).
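For anyone who wants to reproduce the phonemization step, here is a minimal sketch. It assumes the g2p ID Python package exposes a callable `G2p` class as shown in its README; treat the exact import path and output shape as assumptions, not guarantees:

```python
# Minimal sketch of word-level IPA phonemization with g2p ID.
# Assumption: the package exposes a G2p class whose instances are
# callable on raw text, as shown in the project's README.
from g2p_id import G2p

g2p = G2p()
phonemes = g2p("Apel itu berwarna merah.")
# Expected shape: one list of phoneme symbols per word, e.g.
# [['a', 'p', 'ə', 'l'], ['i', 't', 'u'], ...]
print(phonemes)
```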
We also followed Piper's medium model configuration at 44.1kHz, so the resulting VITS2 model is smaller than the default config (about 16M parameters, 66.5MB after exporting to ONNX). We trained both models on a Microsoft Azure Indonesian male voice (ArdiNeural). You can find the relevant files here.
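If you want to sanity-check the parameter count of an exported model yourself, summing the weight initializers in the ONNX graph is a quick way to do it (the file path below is a placeholder):

```python
# Count parameters in an exported ONNX model by summing the element
# counts of all weight initializers. "model.onnx" is a placeholder.
import math
import onnx

model = onnx.load("model.onnx")
n_params = sum(math.prod(init.dims) for init in model.graph.initializer)
print(f"{n_params / 1e6:.1f}M parameters")
```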
At least on this dataset, I don't notice much of a quality difference compared to VITS1. What I do like, though, is being able to drop the blank/padding tokens that VITS1 needs for stability. That's really helpful when deploying to a mobile device where computational resources are very limited, since cutting the input ID sequence length roughly in half helps a lot (see the sketch below).
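For readers unfamiliar with those blank tokens: VITS1 intersperses a blank id between every phoneme id for alignment stability, which is why removing them roughly halves the input length. The sketch below is modeled on the `intersperse` helper in the original VITS code:

```python
def intersperse(lst, item):
    # Place `item` (the blank token id, typically 0) between every
    # symbol and at both ends: a sequence of length n becomes 2n + 1.
    result = [item] * (len(lst) * 2 + 1)
    result[1::2] = lst
    return result

# Phoneme ids [5, 3, 7] -> [0, 5, 0, 3, 0, 7, 0]
print(intersperse([5, 3, 7], 0))
```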
We were also able to take this model and fine-tune it on a relatively small studio-recorded dataset of about 1,000 utterances. That is the best setup we've found so far for training on a low-resource dataset: first train on a larger synthetic audio dataset (e.g. Azure, ElevenLabs), then fine-tune on the specific target speaker.
I'd also be interested in finding the most efficient setup for training on a low-resource multi-speaker dataset. Currently I'm planning to pre-train on an English single-speaker synthetic dataset and then fine-tune it as a multi-speaker model, but this is going to take a while due to limited resources.
Hope this helps!