Emilia dataset #2
Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was trying to even find a large enough corpus to piece together a dataset of my own for just Japanese.
Daunting, but I'll see if I have some spare spinning rust I can store it on and pick at it. If anything, I might just start with the smaller languages to squeeze out some of the multi-lingual-ness that I keep meaning to go about doing. There being transcriptions already helps a ton, since half of the pain with audio processing is the time spent waiting on transcription. The other half, having to quantize it all through EnCodec, is still a bit of a bane, but I think my recent setup of being able to split across GPUs should help.

Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch. Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet again.
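For the curious, the ahead-of-time quantization boils down to something like the sketch below, following EnCodec's documented usage; the paths, the `.qnt.pt` suffix, and the per-GPU file splitting are just illustrative of how I chunk the work, not the exact script.

```python
# Rough sketch of ahead-of-time EnCodec quantization, one process per GPU.
# Paths, suffixes, and the worker layout are illustrative.
import sys
import torch
import torchaudio
from pathlib import Path
from encodec import EncodecModel
from encodec.utils import convert_audio

def quantize_files(paths, device="cuda:0", bandwidth=6.0):
    model = EncodecModel.encodec_model_24khz()
    model.set_target_bandwidth(bandwidth)
    model = model.to(device).eval()

    for path in paths:
        wav, sr = torchaudio.load(path)
        wav = convert_audio(wav, sr, model.sample_rate, model.channels)
        with torch.no_grad():
            frames = model.encode(wav.unsqueeze(0).to(device))
        # codes: (1, n_quantizers, n_frames)
        codes = torch.cat([codebook for codebook, _ in frames], dim=-1)
        torch.save(codes.cpu(), Path(path).with_suffix(".qnt.pt"))

if __name__ == "__main__":
    # e.g. one process per GPU: `python quantize.py 0 4` takes every 4th file on GPU 0
    rank, world = int(sys.argv[1]), int(sys.argv[2])
    files = sorted(Path("dataset").rglob("*.wav"))[rank::world]
    quantize_files(files, device=f"cuda:{rank}")
```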
doing large preps and trains can be soul-crushing.
It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training. The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example: something to solve my dilemma of subpar zero-shot performance that may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store for input prompts to sample the closest utterance for a given training sample).
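For the vector store idea, the gist would be something like the below: keep an embedding per utterance and, when building a training sample, grab the speaker's closest other utterance as the prompt. This is just a sketch of the idea; `PromptStore` and the embedding source are hypothetical, not anything actually in the codebase.

```python
# Sketch of "closest utterance" prompt sampling via a tiny in-memory vector store.
# The embeddings would come from some external model (e.g. speaker verification);
# all names here are illustrative.
import torch
import torch.nn.functional as F

class PromptStore:
    def __init__(self):
        self.embeddings = {}  # speaker -> (N, D) tensor of utterance embeddings
        self.paths = {}       # speaker -> list of N utterance paths

    def add(self, speaker, path, embedding):
        self.paths.setdefault(speaker, []).append(path)
        emb = embedding.unsqueeze(0)
        if speaker in self.embeddings:
            self.embeddings[speaker] = torch.cat([self.embeddings[speaker], emb])
        else:
            self.embeddings[speaker] = emb

    def closest(self, speaker, query_embedding, exclude=None):
        # cosine similarity against every stored utterance for this speaker
        sims = F.cosine_similarity(self.embeddings[speaker], query_embedding.unsqueeze(0), dim=-1)
        for i in sims.argsort(descending=True):
            path = self.paths[speaker][int(i)]
            if path != exclude:  # don't hand the model its own target as the prompt
                return path
        return None
```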
I think I'm beyond the hyperfixate-then-burnout-then-hyperfixate cycles I kept getting myself into; it just seems to be lulls in progress until I grow a wild hair and fiddle with a new idea (for example, the STT task being quite promising for reducing the loss overall, so I hope putting emphasis on more languages would in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent model. Although, again, that 50K hours of English audio is still daunting. I think the best approach is to download the N-smallest
Should have everything prepared for the next training session. Detailing the below mostly for my sake to have it in writing somewhere:
As for the actual dataset to train against, I think I'm going to:
I'll just resume training from the existing weights. My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems is having a sloppy dataset. If results look promising, then that'll be my push to deal with processing the rest of the dataset.

One huge oversight I made is that there are ~400k allegedly-unique speakers among the portion of the dataset I collected. A bit of a pain, since I made the assumption each group was its own speaker, so I have to work around having to juggle that many speakers.
I think it's promising? A few user-error hiccups:
My expectations are pretty good. I think my only regret is throwing too many changes at once again (a handful of different languages, the "use the most similar utterance" feature, more STT). It's hard to gauge what really helped, but I can't complain if it all helped together.
I botched the duration "fix" post-training with an old copy of the tokenizer from July (which shouldn't affect things, but a few missing phonemes might cause issues with it training those phonemes against ), but the few results I tested are very pleasing, with the output actually following the prompted speaker, at least for the couple of voices I test. I uploaded the "botched" model to https://huggingface.co/ecker/vall-e/. I should have it fixed for tomorrow (the 25th).

mmm... I had to go back to my 4xV100 system for the duration-post-fix training; ROCm is just being too much of a pill. I think I still need to bake it more since it only got a few hours, sadly (I only thought to use my 4xV100s towards the evening). My notes so far:
I pushed the weights to the HF repo, but I think I need to set aside a good day to let the post-fix training carry out, since I feel like 40% of outputs have extra junk at the end from the stop token taking longer to pop up. Hopefully that can also help fill the gaps for voices it's not so good at, if I elect to sample by speaker rather than by paths. It definitely has potential, but it falling apart on regular people's voices gives me doubts.
I guess I'll give some final thoughts. For Emilia specifically:
I'd definitely recommend it for use in any speech dataset. For the size I used, it performed well. Now, for the model specifically:
I'm pretty sure this won't be my final touches on the model, but until I get a breakthrough again (from another dataset or training technique like I did here), these should be my final thoughts on the model itself. The two core issues it seems to have now are reduced quality / artifacting from the NAR, and some voices not mapping accurately and precisely enough. The former requires more post-training and hoping I can prioritize the NAR more without lobotomizing the AR; for the latter I don't really have much of an idea on fixing it without more post-training too.

That aside, I'll try and get the demo page updated with the currently best-performing outputs when I do my finishing touches. I tried doing it the other day and it seemed mostly fine, but it struggled for some speakers.
Thanks for the great reimplementation of Valle and your interesting thoughts about Emilia. It would be even better if you could compare the model you implemented with the original paper’s metrics (such as WER/SIM-O) on Valle’s paper test set. So far, it seems there hasn’t been a multilingual open-source model that comes close to the performance of Valle/Vallex papers, and I believe this is a big opportunity to train it on Emilia.
Right, objective metrics. I'll see about calculating WER/SIM-O within I suppose then it'd be a good enough reason to re-create the demo page now that the
From what I do know/remember, it's quite the can of worms.
I do admit that I didn't put much care into my implementation's attention to multi-lingual-ness. Partially because, as an EOP, I don't have a keen ear for non-English beyond the small bits I've picked up of Japanese/German/French (not enough to be confident in). Partially because a proper dataset hasn't been around. Despite that, the output seems fine as far as I can tell, sans the normal quirks (speaker-confidence issues, duration degradation), but I suppose objective metrics can help make up for my unkeen ears.
I suppose I can toss in more of its English and some of its Chinese. Additional post-training mostly on Emilia seemed to benefit more than if I had done it on Libri(Vox/Speech/Light/TTS/whatever). I'm mostly just worried I already hit the model's capacity, but most models seem to get trained on at least 50K+ hours of audio, and I'm still at a measly unique ~15-20K or something.
I've grabbed a small portion of Emilia's Chinese dataset (
I don't have an ETA on when it would be done processing and ready to train, since it always seems dreadfully slow to quantize audio through EnCodec ahead-of-time (unbatched; I need to figure out a good way to batch things), but I guesstimate until the end of the week? It would give me enough time to ensure everything is ready, both in the code and with my 4xV100 machine (after it fried a wire from the PSU to the board, but fixing it should be easy), and to play around with some other things like APOLLO and some things to maybe help with better multilingual support. If all seems fine, then I suppose I can scale up further and add in more from the English and Chinese portions of Emilia.

Also, on WER/SIM-O: I'm a bit skeptical, since the SIM-O is very high despite the audio still being crummy. WER is expected to be off because I don't have a good method of normalizing text beyond phonemizing it.

Demo page updated with WER/CER/SIM-O metrics between the last weights and the new ones. The old weights beat the new weights objectively, so I probably botched something during the agony of training for the NAR-len; I might just resume training from the old weights themselves.
After some agony and hiccups:
I still need to do my personal evals, which I suppose I'll do over the next few days, but for the most part I think I'm happy with it for its size.
Thanks for the great effort! Regarding SIM-O, I believe most papers calculate SIM using WavLM-large, fine-tuned on the speaker verification task (model link). This model is used to generate speaker embeddings, which are then used to compute the cosine similarity between the test utterance and reference clips. The average SIM-O should typically range from 0.5 to 0.7 using this checkpoint. This could be a more useful benchmark when compared to the SIM-O results of Valle in other TTS papers like F5-TTS, MaskGCT, Valle, and NS3. As for WER, I just want to provide some information that might be useful: the Amphion framework (GitHub link) reports a Valle model achieving around 4% WER on English, as outlined in the paper (link to paper). They used a speech tokenizer to separate semantic and acoustic tokens, though, which may not be applicable in multilingual contexts. Similar to MaskGCT, they use a multilingual disentangled speech tokenizer (I guess it is also open-sourced). I guess using a similar disentangled tokenizer might help improve WER. (I am mentioning these because I heard the Chinese demos. They are fine to me, but have room for further improvement.)
I suggest considering SEED-Eval (GitHub link), where the speakers come from CommonVoice (which the Emilia-trained model hasn’t encountered).
By the way, what biases were you referring to? 😂
They're under the demo page (at least, averaged from the utterances on the demo page), but for posterity:
They're averaged for each section in the demo page, so they're not too objective of an average, but with a sample size of >30 it should be decent enough.
Will do, I'll grab them sooner than later. I worry the LibriSpeech speakers I have for English are actually seen in other portions of the LibriVox-derived dataset and aren't quite a good validation dataset, although their usual performance suggests they're unseen speakers.
Right right, that's most likely where my problem stems from then. I could have sworn I took care to use the right weights, but I probably immediately used the wrong ones when I was refactoring things around. I'll re-do the SIM-O calcs sooner than later with the right weights.
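For posterity, the SIM-O calc itself is small; a sketch of it below, using the `microsoft/wavlm-base-plus-sv` checkpoint from HF as a stand-in for the WavLM-large speaker-verification model the papers use (so treat it as a sketch rather than the exact script):

```python
# Sketch of SIM-O: cosine similarity between speaker embeddings of the generated
# utterance and the prompt/reference clip. The checkpoint is a stand-in for the
# WavLM-large SV model referenced in the papers.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, WavLMForXVector

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

def load_16k(path):
    wav, sr = torchaudio.load(path)
    wav = wav.mean(dim=0)  # mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    return wav.numpy()

def sim_o(generated_path, reference_path):
    inputs = extractor(
        [load_16k(generated_path), load_16k(reference_path)],
        sampling_rate=16000, return_tensors="pt", padding=True,
    )
    with torch.no_grad():
        emb = model(**inputs).embeddings
    return torch.nn.functional.cosine_similarity(emb[0], emb[1], dim=-1).item()
```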
DAC is still my holy grail, but none of my past experiments proved fruitful (and I suppose it hasn't been fruitful for anyone else): RVQ levels 4+ just fall apart too much to be useful, and I haven't been able to pinpoint the exact issue. I think I would just have to read the papers for both it and EnCodec and see if the devil is in the details, although I am not an expert by any stretch.
The way I'm handling my text tokenizer does need some attention, since it's a bit of an artifact from when I fiddled with TorToiSe: plain IPA phonemes + some symbols + merges. I feel the merges do more harm than good, since I don't recall the current errors (either stuttering or outright omission) showing up in the . The IPAs themselves are robust enough to be language-agnostic, whereas espeak is more liable to problems (for example, in how it wants to pronounce read/lead; that's not necessarily a tokenizer problem, just a contextual one).
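For a concrete picture of the espeak side of it (not necessarily the exact settings my tokenizer pipeline uses), the IPA comes out of something like the phonemizer wrapper below, and the read/lead ambiguity is espeak having to guess from context:

```python
# Minimal example of pulling IPA from espeak via the phonemizer library.
# Flags are illustrative, not necessarily what the actual tokenizer uses.
from phonemizer import phonemize

# "read" is a heteronym: espeak has to guess /ɹiːd/ vs /ɹɛd/ from context,
# which is a contextual problem rather than a tokenizer problem.
print(phonemize("I read the book yesterday.", language="en-us",
                backend="espeak", strip=True, preserve_punctuation=True))
print(phonemize("I will read the book tomorrow.", language="en-us",
                backend="espeak", strip=True, preserve_punctuation=True))
```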
That blame's more on me, I feel, from both using a small portion of Emilia/ZH and the model not seeing it anywhere near as much as the other languages, as they have had the benefit of being in for the NAR RVQ-level 0 post-training.
Non-specifically: listening to my VALL-E's outputs. Since the beginning of working on it and throughout all the checkpoints, there have always been specific traits of the outputs that kind of ruined other models for me. I might also blame it on how there was only really TorToiSe and ElevenLabs at the time VALL-E's paper released, and I took it upon myself to work on my own implementation, so other models haven't ruined me in that way.

Specifically: other implementations' implementation details. I have a mental list of nitpicks for abstractions adding unnecessary complexity, but I think it's mostly from working under TorToiSe. A pure non-autoregressive model was on that list, but caving and taking the NAR-pill removed that bias, since it works just as well as an autoregressive RVQ level 0 (although I still feel there's a difference in utterance quality).
Demo page updated with the correct SIM-O:
The actual SIM-Os seem pretty plausible for what I'd expect given the tech of the model. I was a bit concerned about doing SIM-O against the prompt and not the reference utterance, but a glance at MaskGCT's paper shows it also does this (and it does increase the scores, naturally), so I'm not too silly for doing so. I'm still a bit skeptical about the WER being so low; giving the samples a listen, there are some hiccups that I feel didn't get caught in the calculation, but it might just be whisper not catching them in the transcription.
When I get the chance I'll nab seed-tts-eval and do further evals against it for a more apples-to-apples comparison.
The result looks promising! Surprised by the Japanese results.
I think WER should be computed on exact words instead of phonemes. The original valle paper reported around 5%, and also Amphion's tech report reported around 4.5% (on different test sets). Maybe 4-10% WER on words is an indicator for a good model.
Maybe simply randomly sampling the prompt length, as 5%–20% of the training sample, during training can alleviate this problem.
Got around to adding
I feel the model is fine for what it's worth, but it probably won't hurt to feed it some more of the parts of Emilia EN/ZH I didn't grab, whenever I get the headspace to dedicate a day or two to more training. I did notice the loss still slowly creeping down during the last post-training session, so that's nice.
More of a reason for me to cobble together a validation dataset to make sure it's not a fluke. The Emilia/JA samples sound fine to my gaijin ears (and it's a bit of a chore to read along since they all speak faster than I can read the transcription), but in practice with my personal evals, more dramatic speakers sometimes work a little too nicely, while it sometimes suffers from the confidence issue where the output utterance just sounds choked. There are also some quirks in the phonemization process, as in some kanji aren't correct, blah blah blah.
I'll look into nabbing a thorough text normalizer, or just rely on whisper transcribing the reference as well and hope it normalizes both texts well enough. I just really don't look forward to it, since thumbing through TorToiSe's text cleaners still haunts me. I could have sworn, though, that my initial WER computations still had unusually low WER scores (taking into account that they were normalized too), but I'll double-check later.
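For what it's worth, the word-level WER/CER calc itself is simple enough with something like `jiwer`; the normalization chain below is only a sketch of the kind of cleanup I mean (a "thorough" normalizer would also handle numbers, abbreviations, and so on):

```python
# Sketch of word-level WER / character-level CER with basic text normalization.
# The transform chain is illustrative, not the exact cleanup used for the demo page.
import jiwer

normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

def scores(reference: str, hypothesis: str) -> dict:
    ref, hyp = normalize(reference), normalize(hypothesis)
    return {
        "wer": jiwer.wer(ref, hyp),  # word-level, what the papers report
        "cer": jiwer.cer(ref, hyp),  # character-level, friendlier for ZH/JA
    }

print(scores("The quick brown fox.", "the quick brown fox"))
```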
There's a weird "quirk" in the model where, despite training explicitly on low prompt duration ranges (3-6 seconds), it performs very well on prompt durations well above what it was trained against. I only say it's a quirk since it doesn't carry over for output utterances that go well beyond the average 12 seconds / 3 sentences of audio. Regardless, it only improved the scores a negligible amount (per the old page). Despite that, it's a good enough placebo, since it aligns more with normal usage where I feed it a full clip and trim it down to around 8 seconds in my own evals, rather than abruptly trimmed, small utterances.
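Relatedly, the random prompt-length suggestion above would be a small training-side change; a sketch of sampling the prompt window from the reference codes, with the ranges and frame rate being placeholders rather than my actual config:

```python
# Sketch of sampling a random prompt window from the reference utterance's codes
# at training time. Ranges and frame rate are placeholders.
import random
import torch

def sample_prompt(codes: torch.Tensor, frames_per_second: int = 75,
                  min_ratio: float = 0.05, max_ratio: float = 0.20,
                  min_seconds: float = 3.0, max_seconds: float = 9.0) -> torch.Tensor:
    """codes: (n_quantizers, n_frames) EnCodec codes for the prompt utterance."""
    n_frames = codes.shape[-1]
    # target length: a random fraction of the utterance, clamped to a sane range
    ratio = random.uniform(min_ratio, max_ratio)
    length = int(n_frames * ratio)
    length = max(length, int(min_seconds * frames_per_second))
    length = min(length, int(max_seconds * frames_per_second), n_frames)
    start = random.randint(0, n_frames - length)
    return codes[..., start:start + length]
```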
Re-calculated WER with cleaned/normalized transcriptions, and the WER actually decreased (especially for Emilia JA/ZH).
I'm still skeptical of the WER/CER being very low, but I guess the model is just that robust. However, I might be cheating a bit out of habit by having the NAR RVQ-level 0 (

Aside from that, I did some deliberating on increasing the SIM-O score (/ the model's output quality in general), but I don't think I have very many feasible paths to explore.
I could be entirely wrong in my thoughts, as I'm still not an expert by any stretch. I just feel that if the model hasn't generalized well enough for zero-shot TTS, it probably won't ever as a tiny 220M attention-based transformer. Although, I think I'm fine with that. I suppose when I get the hyperfixation spark to put more work into the model, it'd be on the SpeechX tasks, to see if they amount to anything, mostly because the bulk of the work is already in the trainer code. However, I think the answer is just to feed it more data in hopes the model can better map speakers to its own latent space, or blah blah blah, as the model is still very undertrained in comparison to the other monsters boasting 50K+ hours of audio.
Just saw an interesting paper https://arxiv.org/pdf/2412.10117 which reports the objective evaluation results of almost all SOTA TTS models on the seed-tts-eval dataset, which you may refer to.
Right right, I forgot I needed to read CosyVoice 2's paper. At the moment I can't really give it a solid glance (and I probably need the weekend and ample free time to digest it), but the instruction stuff reminded me of the "in-context learning" VALL-E boasted, which, to my memory, never really seemed to have been explored. It's something I wanted to tackle, but thinking it over always ended with "I just don't have a way to properly annotate data to that degree".
Oh, I completely forgot that I had this in mind before, and I think I referenced it somewhere in some other eval with the Harvard sentences (if I remember right, MaskGCT's demo page had something similar). Those

One thing that did cross my mind, though, is that I might need to do the WER calculation as naively as possible by just comparing the transcriptions as-is without any cleanup, and seeing if that brings it closer to other models (although I could have sworn that's what I did initially and it didn't seem too egregious). I suppose I can do this quickly while it's not too late into the night.
Demo page updated with
I've also added the Harvard sentences I use for my own evals for general robustness, as well as English tongue twisters to try and mirror |
It came to my attention while doing something completely unrelated that, for a while now at least, the model actually wasn't properly trained with a language token, per Line 941 in e7e7f48
On the bright side, I suppose it's impressive it can perform as it does without a language marker. It doesn't seem to offer much of a difference re-enabling it, though. Maybe the attention mechanisms managed to derive the target language/accent from the input phonemes themselves, since it's agnostic to the input voice. I'll have to dig through some papers to see if any other TTS solution made note of this. It didn't seem to matter at all. I guess it's more technically-interesting this way, but I still preferred my old way of being able to influence the accent by changing the language marker. Regardless, I'll still need to re-do the samples in the demo page anyway. I did also enable training for
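For clarity on what the language marker amounts to, conceptually it's just a learned embedding injected alongside the phoneme sequence, along the lines of the sketch below; the module names are illustrative and not the actual ones in the repo:

```python
# Conceptual sketch of a language marker: one learned embedding per language,
# prepended to the phoneme embeddings. Names are illustrative only.
import torch
import torch.nn as nn

class LangConditionedEmbedding(nn.Module):
    def __init__(self, n_phonemes: int, n_langs: int, d_model: int):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.lang_emb = nn.Embedding(n_langs, d_model)

    def forward(self, phonemes: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # phonemes: (batch, seq), lang_id: (batch,)
        tokens = self.phoneme_emb(phonemes)         # (batch, seq, d_model)
        lang = self.lang_emb(lang_id).unsqueeze(1)  # (batch, 1, d_model)
        return torch.cat([lang, tokens], dim=1)     # language marker first
```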
Have you seen this dataset? Maybe it's better suited for the zero-shot task: more natural speech than audiobooks.
https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md