Emilia dataset #2

kunibald413 opened this issue Sep 7, 2024 · 20 comments

@kunibald413

have you seen this dataset? maybe it's better suited for the zero-shot task, with more natural speech than audiobooks

https://github.com/open-mmlab/Amphion/blob/main/preprocessors/Emilia/README.md

@e-c-k-e-r
Owner

101k hours

Hot damn. I think my own collection caps out at ~14K hours between pieces of LibriSpeech and audiobooks, and a very, very small smattering from videogames. Even the smaller languages being >1K hours is a huge help, since the biggest hurdle for me was trying to even find a large enough corpus to piece together for my own dataset for just Japanese.

2.4TB

Daunting, but I'll see if I have some spare spinning rust I can store it on and pick at it. If anything I might just start with the smaller languages to squeeze out some multi-lingual-ness that I keep meaning to go about doing.

There already being transcriptions helps a ton, as half of the pain with audio processing is the time spent waiting to transcribe. The other half, having to quantize it all through EnCodec, is still a bit of a burden, but I think my recent setup of being able to split across GPUs should help.


Having more audio in general, and especially non-monotonous utterances, should help a ton. I'm pretty sure I already hit immense diminishing returns after an epoch.

Appreciated. I'll see what I can pick at over the next few days while my spark hasn't waned yet once again.

@kunibald413
Author

doing large preps and trains can be soul-crushing.
passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

@e-c-k-e-r
Owner

doing large preps and trains can be soul-crushing.

It's not so bad this go-around. It used to be agonizing with system instability (segfaults or hard reboots with anything under PyTorch) on my original training system with my 4070Ti. Swapping to my 7900XTX almost entirely resolved the problems for dataset preparation and non-important training.

The estimated week-and-a-half wait for the dataset to process is always a good time for any last-minute checks or ideas to get added; for example, something to solve my dilemma of subpar zero-shot performance that may stem from my naive prompt sampling (for months I've been entertaining the idea of using a vector store for input prompts to sample the closest utterance for a given training sample).

passion and a little bit of compulsion can keep you going, but i hope you don't burn yourself out.

I think I'm beyond the hyperfixate-then-burnout-then-hyperfixate cycles I kept getting myself into; it just seems to be lulls in progress until I grow a wild hair and fiddle with a new idea (for example, the STT task being quite promising for reducing the loss overall, so I hope putting emphasis on more languages would in turn help the model overall). The urgency and dread of trying to push out a decent model went away by the time I pushed out a quasi-decent ar+nar-llama-8 model with my 4xV100 system: it's rather quick to churn out a model with good results, and I don't need to keep brute-forcing additional epochs for very little gain like I did with the ar+nar-retnet-8.


Although, again, that 50K hours of English audio is still daunting. I think the best approach is to download the N-smallest .tars to prioritize more speakers over speakers with a lot of utterances (unfortunately, a lot of datasets seem to prioritize the latter over the former).
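
Roughly the selection I have in mind, for what it's worth (the root path and count are placeholders, not my actual layout):

```python
# Pick the N smallest tar files first, to favor many speakers over a few prolific ones.
from pathlib import Path

def smallest_tars(root: str, n: int) -> list[Path]:
    """Return the n smallest .tar files under root, sorted by file size."""
    tars = sorted(Path(root).glob("**/*.tar"), key=lambda p: p.stat().st_size)
    return tars[:n]

# e.g. smallest_tars("./emilia/EN", 300)
```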

@e-c-k-e-r
Owner

e-c-k-e-r commented Sep 17, 2024

Should have everything prepared for the next training session. Detailing the below mostly for my sake to have it in writing somewhere:

  • I added both the French and a portion of the English (tar files under I think 300MiB for 312 speakers) datasets, just to make sure I have enough data.
    • I would have added the Korean one and some of the Chinese dataset too, but I didn't want to have to wrestle with the phonemizer to get it to work like I've had to with Japanese (although the more I think about it, Korean shouldn't have issues, and Chinese I can probably find a better phonemizer/procedure for). There's also the issue of updating the tokenizer, since Japanese gave some unique IPAs that I had to remap the last time I tried working with it.
  • (Theoretically) I have my "use the most similar utterance for a given sample" system added to hopefully better help zero-shot voice cloning. Theoretically, since I need to:
    • Verify it works when integrated. It passes my mental model, and separate testing shows that it does "work".
    • Verify that what it thinks is similar is adequate enough, since it's primarily just relying on cosine similarities of audio passed through the resp embedding (similar to inputs for STT); see the sketch after this list.
    • Verify that my tokenizer does contain all the phonemes (and not have a repeat of missing phonemes with the small Japanese portion of my dataset).
    • Actually compute the similarity metadata, since I have yet to actually integrate it into the metadata generation procedure (I'm not looking forward to seeing how long it would take to process).
    • Do the final processed data => HDF5 and shove it to my 4xV100 system (but there's been some weird Linux instability after a while and the entire file gets tainted when it's not closed properly, so I'll see what I can do about it).
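
For reference, the core of the similarity sampling mentioned above is just a cosine-similarity argmax over pooled utterance embeddings, roughly like this (the pooled resp-embedding vectors are assumed inputs; this isn't the integrated code):

```python
import torch
import torch.nn.functional as F

def pick_prompt(target_idx: int, candidate_embs: torch.Tensor) -> int:
    """Pick the utterance whose embedding is closest to the target's (excluding itself).

    candidate_embs: [N, D] pooled embeddings for one speaker's utterances.
    """
    sims = F.cosine_similarity(candidate_embs, candidate_embs[target_idx].unsqueeze(0), dim=-1)
    sims[target_idx] = float("-inf")  # don't pick the sample as its own prompt
    return int(sims.argmax().item())
```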

As for the actual dataset to train against, I think I'm going to:

  • drop the donated audiobooks I've had to supplement my dataset, and LibriSpeech (small+medium+duplicate)
    • The model was already trained a majority of the time against this, and I don't want a majority of this training session dedicated to a control dataset.
    • I feel it would be a huge pain to compute the similarity metadata on speakers with a xboxhueg amount of utterances.
    • Both of these pieces of my dataset are concerningly transcribed and trimmed from whole audio files, so I don't really trust where whisperX decides to slice them.
    • They're the most audiobook-y.
  • retain LibriTTS-R, as it's not sliced, already transcribed, and I feel is the cleanest besides Emilia.
  • retain all my other handpicked vidya-derived audio, since they're still rather small compared to the rest of the dataset, and they're also not audiobook-y.

I'll just resume training from the existing ar+nar-tts+stt-llama-8 weights since I don't think I need to restart from scratch (as the model is still rather malleable from all the other tweaks I've glued on), but keep the same dataset sampling method of "sort by duration with a fixed batch size, and let it go from smallest to largest utterances" (as I still do not trust my batch-by-duration-size method to be stable).
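
For clarity, that sampling method boils down to something like this (durations and batch size here are placeholders; the actual dataloader has more going on):

```python
# "Sort by duration, fixed batch size, smallest to largest" as lists of dataset indices.
def duration_sorted_batches(durations: list[float], batch_size: int) -> list[list[int]]:
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```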

My hope is that the Emilia dataset is the answer to my problems. As time goes on I'm more pilled on having a very, very good dataset rather than a large one, and I feel the big crux of these TTS systems is having a sloppy dataset. If results look promising then that'll be my push to deal with processing the rest of the dataset.


One huge oversight I made is that there are ~400k allegedly-unique speakers among the portion of the dataset I collected. A bit of a pain since I made the assumption each group was its own speaker, so I have to work around juggling that many speakers.

@e-c-k-e-r
Owner

e-c-k-e-r commented Sep 24, 2024

I think it's promising? A few user-error hiccups:

  • initially started training on my 7900XTX, but had to scrap those weights:
    • VRAM usage was sporadic, I assume just ROCm things, so training would cut off randomly during SDPA.
    • some phonemes were being <unk>'d.
    • the "pick the most similar utterance" code was wrong for a number of silly reasons, so it didn't benefit from that.
      • despite this, the training loss seemed to be teetering towards the low 2.xs.
  • ironed out the kinks in the code, threw everything onto my 4xV100 machine:
    • ...despite this, the first session (3-second => 12-second utterances) only used one V100 for the entire 25 hours, so the model trained under 16 batch * 4 gradaccum instead of 16 batch * 4 gradaccum * 4 GPUs (for smoother gradients). I don't think it's that big of a deal.
    • 12 second utterances onward trained properly, although I had to adjust from 16 batch * 4 gradaccum to 8 batch * 8 gradaccum.
  • the metrics are interesting:
    [training metrics plot]
    • I'm not too sure what causes those little restarts (they show even without EMA smoothing), but up to 52k is the initial 3 seconds => 12 seconds batch (after that is the 12 seconds onwards).
      • I wouldn't be surprised if it's just luck with the batch having the "right" amount of RVQ 0s (despite being 50% likely), stt tasks (despite being 25% likely), and LibriTTS-R utterances picked (despite making up ~7% of the dataset).
    • the loss/accuracy seems promising at least.
    • the evaluation/validation samples seem promising, although there seems to be something wrong with the evaluation dataloader giving me nothing but German utterances so far.
  • currently doing final training touches on my 7900XTX with sample_max_duration_batch + sample_shuffle (and 16 batch * 16 gradaccum) to "fix" the model (again) to work on a range of durations (a drawback from sorting by duration), and probably with some benefit from training under bfloat16 instead of float16 + loss scaling (even though I feel the latter produces "better" models than the former).
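
For context on the bfloat16 vs. float16 + loss scaling trade-off: the practical difference is just whether a GradScaler is needed. A generic PyTorch AMP sketch, not the trainer's actual code:

```python
import torch

scaler = torch.cuda.amp.GradScaler()  # only needed for float16

def train_step(model, batch, optimizer, dtype=torch.bfloat16):
    optimizer.zero_grad()
    with torch.autocast(device_type="cuda", dtype=dtype):
        loss = model(batch)
    if dtype == torch.float16:
        scaler.scale(loss).backward()  # scale up so small gradients don't underflow
        scaler.step(optimizer)
        scaler.update()
    else:
        loss.backward()                # bfloat16 has the exponent range to skip scaling
        optimizer.step()
```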

My expectations are pretty good. I think my only regret is throwing too many changes at once again (a handful of different languages, the "use the most similar utterance" feature, more STT). It's hard to gauge what really helped, but I can't complain if it all helped together.

Tomorrow I should have some time mucking with the model and seeing (hearing) if it's as good as it looks.


I botched the duration "fix" post-training with an old copy of the tokenizer from July (which shouldn't affect things, though a few missing phonemes might cause issues when training against them), but the few results I tested are very pleasing and actually follow the prompted speaker, at least for the couple of voices I tested. I uploaded the "botched" model to https://huggingface.co/ecker/vall-e/. I should have it fixed for tomorrow (the 25th).


mmm... I had to go back to my 4xV100 system for the duration-post-fix training, since ROCm is just being too much of a pill. I think I still need to bake it more since it only got a few hours sadly (I only thought to use my 4xV100s towards the evening). My notes so far:

  • when it works, it really works. It's eerie how similar it can get to the prompt, but I'm probably biased from expecting it to be removed far enough from the original speaker.
  • however, voices that don't come from a normal, typical corpus (i.e. from my vidya-derived corpus rather than Libri*/Emilia) kind of break apart into the same kind of crusty male voice.
    • I do notice it does preserve the acoustic part of the input prompt eerily well sometimes (if there's a reverb or if it has a through-a-speaker filter effect).
    • I had to double-check whether the prior weights had the same issue of just falling apart for these voices, and I suppose they did. I could have sworn some voices still somewhat tried to be similar.
  • I feel like it does a good job at Japanese, at least from my cursory tests, even with non-Japanese speakers. I need a concrete test for the other way around, and for German and French.

I pushed the weights to the HF repo, but I think I need to set aside a good day to let the post-fix training carry out, since I feel like 40% of outputs have extra junk at the end from the stop token taking longer to pop up. Hopefully that can also help fill the gaps for voices it's not so good at, if I elect to sample by speaker rather than by paths. It definitely has potential, but its falling apart on regular people's voices gives me doubts.

@e-c-k-e-r
Owner

e-c-k-e-r commented Sep 28, 2024

I guess I'll give some final thoughts.

For Emilia specifically:

  • for what it's worth, it's a decent dataset, as it offers something both LibriSpeech and audiobook readings don't really offer: a ton of speakers and some variety between those speakers.
  • the additional languages seem just enough to provide a good baseline for cross-lingual inference, and a strong baseline to finetune to other speakers (although that's conjecture based on some light tests; I do know my problem with not being able to finetune to a Japanese voice before was because of how small that portion of my dataset was).
  • it already being trimmed and transcribed is a huge plus. The agony of dealing with LibriSpeech is the sheer amount of time transcribing with WhisperX and hoping it's accurate enough.

I'd definitely recommend using it for any speech dataset. For the size I used, it performed well.

Now, for the model specifically:

  • I think the "pick the most similar utterance" prompt sampling approach is a net good. Prompt adherence is really strong when it works.
  • However, voices it's not so confident in seem to vary between having artifacts in the audio (a NAR issue) and the latent space it maps the input prompt into being in the right ballpark but varying a little too much between inferences.
    • I think the former problem simply comes from a speaker having few utterances compared to the big players in the dataset. The caveat of training a unified AR/NAR model is that there's no guarantee a speaker will receive proper training that covers all eight RVQ levels unless there's a ton of utterances for that speaker (such as with the audiobook-derived corpuses); Emilia has a lot of speakers with only a small handful of utterances.
  • A lot of my previous issues with output quality/performance actually stemmed from a bug with the input audio not resampling properly, which would explain why things I thought the old model could do "decently" ended up sounding shitty.
    • Despite this, overall output quality seems to have improved between model weights.
  • Multi-lingual quality is adequate, but is much more likely to suffer from the former artifacting issues rather than the latter "latent mapping" problems.
    • I will say I'm rather surprised it doesn't have the latter "latent mapping" problem, given the speakers I tested against having a very small presence in the dataset.
  • Outputs are mostly consistent, but sometimes there'll be issues in the output. I suppose samplers could fix this, since my tests didn't bother with anything beyond AR temp 1.0 / NAR temp 0.0.

I'm pretty sure this won't be my final touches with the model, but until I get a breakthrough again (from another dataset or training technique like I did here), these should be my final thoughts on the model itself. The two core issues it seems to have now are reduced quality / artifacting from the NAR, and some voices not mapping accurately and precisely enough. The former requires more post-training and hoping I can prioritize the NAR more without lobotomizing the AR, and for the latter I don't really have much of an idea beyond more post-training too.

That aside, I'll try and get the demo page updated with the currently-performing outputs when I do my finishing touches. I tried doing it the other day and it seemed mostly fine, but it struggled for some speakers.

@HarryHe11

Thanks for the great reimplementation of Valle and your interesting thoughts about Emilia. It would be even better if you could compare the model you implemented with the original paper’s metrics (such as WER/SIM-O) on Valle’s paper test set. So far, it seems there hasn’t been a multilingual open-source model that comes close to the performance of Valle/Vallex papers, and I believe this is a big opportunity to train it on Emilia.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 7, 2024

It would be even better if you could compare the model you implemented with the original paper’s metrics (such as WER/SIM-O) on Valle’s paper test set

Right, objective metrics.

I'll see about calculating WER/SIM-O within vall_e.demo; I imagine I would need to borrow some other models to do so (WER needs transcriptions, and I don't trust the model's own stt, and I don't think I can figure out a way to derive SIM-O from the proms embeddings), and it would be beneficial to calculate them across huge batches.

I suppose then it'd be a good enough reason to re-create the demo page now that the NAR-len works.

So far, it seems there hasn’t been a multilingual open-source model that comes close to the performance of Valle/Vallex papers

From what I do know/remember, it's quite the can of worms.

  • VALL-E X's paper seems to briefly cover it, and the later papers in the VALL-E/NaturalSpeech line seem to cover it less and less.
  • Literature in general doesn't seem to cover it much. TTS papers love to cover the underlying mechanisms like ~*attention*~ and ~*flow matching*~ in excruciating detail, but not the things to look for in a dataset or how to format your sequences.
    • For example, it would be real sugoi if I had something explain why my language token seems to actually guide accents, or whether summing a language embedding onto the inputs is preferred over a single token (both options are sketched after this list).
  • XTTS2 is still a TorToiSe retrain/finetune and relies on bloating the text tokenizer.
  • Other solutions that do boast it like MaskGCT only seem to be trained on English and Chinese.
  • Quality datasets are still scarce. Quality non-English/Chinese datasets are a rarity. A cross-lingual dataset is nonexistent.
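
To make the language-conditioning question above concrete, the two options amount to something like this toy module (not my actual embedding code):

```python
import torch
import torch.nn as nn

class LangConditioning(nn.Module):
    """Toy comparison: prepend a single language token vs. sum a language embedding."""
    def __init__(self, n_langs: int, dim: int, mode: str = "token"):
        super().__init__()
        self.emb = nn.Embedding(n_langs, dim)
        self.mode = mode

    def forward(self, x: torch.Tensor, lang_id: torch.Tensor) -> torch.Tensor:
        # x: [B, T, D], lang_id: [B]
        lang = self.emb(lang_id)                             # [B, D]
        if self.mode == "token":
            return torch.cat([lang.unsqueeze(1), x], dim=1)  # one extra position up front
        return x + lang.unsqueeze(1)                         # broadcast-sum over the sequence
```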

I do admit that I didn't put much care into my implementation's attention to multi-lingual-ness. Partially because, as an EOP, I don't have a keen ear to non-English beyond the small bits I've picked up for Japanese/German/French (not enough to be confident in). Partially because a proper dataset hasn't been around.

Despite that, the output seems fine as far as I can tell sans normal quirks (speaker-confidence issues, duration degradation), but I suppose objective metrics can help make up for my unkeen ears.

  • Cross-lingual-ness (for example, input Japanese => output English) seems to only work on a narrower subset of speakers (ones it's very confident in), but a LoRA on an unconfident speaker does fix it (for example, the GLaDOS LoRA can speak Japanese).

and I believe this is a big opportunity to train it on Emilia.

I suppose I can toss in more of its English and some of its Chinese. Additional post-training mostly on Emilia seemed to benefit more than if I had done it on Libri(Vox/Speech/Light/TTS/whatever). I'm mostly just worried I already hit the model's capacity, but most models seem to get trained on at least 50K+ hours of audio, and I'm still at a measly unique ~15-20K or something.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 9, 2024

I've grabbed a small portion of Emilia's Chinese dataset (tar files <500MiB, so 101 of them, don't know how many speakers it ended up having) + Korean to ensure everything is all fine and dandy. I think I should have the phonemizer up to par for handling both. I'm a bit wary since I feel like I'm neglecting a detail, but I think I can get away with using phonemizer/espeak-ng's cmn-latn-pinyin output and remapping the pinyin tone markers alongside the IPAs. In theory:

  • the attention heads can attend to the tone marker tokens and do their magic contextually, handling my main worry with introducing a tonal language. The existing stress markers in my IPAs should help guide it.
  • it might not be so easy, because I shot myself in the foot by relying on token merging in the vocab, so it might not get the chance to. It might benefit from some training without token merges, but that would require some dark arts.
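
A rough sketch of the phonemize-then-remap idea above; the tone-marker mapping is purely illustrative (not the actual vocab), and it assumes phonemizer's espeak-ng backend exposes the cmn-latn-pinyin voice:

```python
from phonemizer import phonemize

# Illustrative remap of numeric pinyin tone markers to standalone tone "phonemes".
TONE_REMAP = {"1": " ˥", "2": " ˧˥", "3": " ˨˩˦", "4": " ˥˩", "5": ""}

def phonemize_zh(text: str) -> str:
    out = phonemize(text, language="cmn-latn-pinyin", backend="espeak", strip=True)
    return "".join(TONE_REMAP.get(ch, ch) for ch in out)
```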

I don't have an ETA on when it would be done processing and ready to train, since it always seems dreadfully slow to quantize audio through EnCodec ahead-of-time (unbatched, I need to figure out a good way to batch things), but I guesstimate until the end of the week? It would give me enough time to ensure everything is ready, both in the code and with my 4xV100 machine (after it fried a wire from the PSU to the board, but fixing it should be easy), and play around with some other things like APOLLO and some things to maybe help with better multilingual support.

If all seems fine then I suppose I can scale up further and add in more from the English and Chinese portions of Emilia.


Also, WER/SIM-O will be added soon; I didn't forget that. I do want to establish metrics for the pre-Emilia weights, the current post-Emilia + pre-NAR-len weights, and then the future weights.

I'm a bit skeptical since the SIM-O is very high despite the audio still being crummy. WER is expected to be off because I don't have a good method of normalizing text beyond phonemizing it.

[screenshot of preliminary WER/SIM-O metrics]


Demo page updated with WER/CER/SIM-O metrics between the last ar+nar-llama-8 weights, and the current weights both through the AR+NAR and the NAR-len (pure NAR).

The old weights beat the new weights objectively, so I probably botched something during the agony of training for the NAR-len; I might just resume training from the old weights themselves.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 17, 2024

After some agony and hiccups:

  • the model is much, much, much better now. I do not know what I did differently, but it resolves my concerns with the NAR-len output.
    • I still need to do my personal evals, but the brief evals I did this morning on its corresponding checkpoint had me very relieved (however HL2 G-Man still had problems, but that's a given anyways).
  • the model can now speak Korean and Chinese (Mandarin?).
    • The output seems fine? I'm extremely rusty with both, but the WER/SIM-O suggests it's fine.
    • I can't remember if I made note of it in a commit message, a comment in the code, or in the docs, but I don't need to coerce into pinyin and then phonemize, since espeak seems good enough to phonemize directly, and replacing the tone markers with additional phonemes seems good enough?
  • Multi-lingual-ness seems to be on par with English.
    • I still need to do my personal evals for cross-lingualness. I still have my concerns with this, since it seems very particular to use the prompted speaker's language as the output audio language, rather than the target text language.
  • I am skeptical if I managed to overfit on some speakers/utterances......
    • For example per the demo page:
      • the == Departments and programs == utterance outputs Topics: Departments and programs just like the reference.
      • the おとうさん utterance has a "leaky" prefix in the prompt and reference that makes its way into the output.
    • However I'm also skeptical about it overfitting since I genuinely don't see how the model is able to do so from how much data it's seen already.
    • I don't have a proper validation dataset of Emilia, like I do for LibriSpeech/LibriLight, so everything under Emilia is already seen by the model anyways.
      • Technically I do have a validation dataset of Emilia through singleton speakers (speakers with one utterance that got culled due to how I'm handling sampling a prompt), but I don't think I have a good way to utilize them without re-using the reference as the prompt.
  • I am a bit skeptical about how SIM-O is calculated.
    • Some utterances scoring above 0.9 have noticeable (to me) problems that keep them from feeling like they deserve that score.
    • Other utterances have quality differences that WavLM does not seem to care so much for?
    • I imagine this is simply a limitation of how the speaker verification model is able to map speakers into its latent space with adequate granularity (and not a precision issue since scores of 0.98 feel deserved).
    • I am very skeptical of the average being 0.9.

I still need to do my personal evals, which I suppose I'll do over the next few days, but for the most part I think I'm happy with it for its size.

  • Emilia really helped improve the model, despite me only using a portion of it and despite the quirks I've caught. Especially for non-English, since I was agonizing over trying to cobble together my own dataset for Japanese from scraps of corpuses.
  • I do need to revisit F5-TTS and MaskGCT (and.... I guess visit Fish 1.5... since that seems all the rage too......) to do comparisons against again, especially now that the model is a pure NAR (and my biases are mostly gone).

@HarryHe11

HarryHe11 commented Dec 17, 2024

  • The output seems fine? I'm extremely rusty with both, but the WER/SIM-O suggests it's fine.
  • I am a bit skeptical about how SIM-O is calculated. Some utterances scoring above 0.9 have noticeable (to me) problems that keep them from feeling like they deserve that score.

Thanks for the great effort! Regarding SIM-O, I believe most papers calculate SIM using WavLM-large, fine-tuned on the speaker verification task (model link). This model is used to generate speaker embeddings, which are then used to compute the cosine similarity between the test utterance and reference clips. The average SIM-O should typically range from 0.5 to 0.7 using this checkpoint. This could be a more useful benchmark when compared to the SIM-O results of Valle in other TTS papers like F5-TTS, MaskGCT, Valle, and NS3.
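
For illustration, the pipeline is just embedding both clips with the SV model and taking the cosine similarity; this sketch uses the HF microsoft/wavlm-base-plus-sv checkpoint as a stand-in for the WavLM-large SV checkpoint the papers use:

```python
import torch
import torch.nn.functional as F
from transformers import AutoFeatureExtractor, WavLMForXVector

extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
model = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv").eval()

@torch.no_grad()
def sim_o(generated_wav, reference_wav, sr: int = 16000) -> float:
    """Cosine similarity between speaker embeddings of two 16 kHz waveforms."""
    inputs = extractor([generated_wav, reference_wav], sampling_rate=sr,
                       return_tensors="pt", padding=True)
    emb = model(**inputs).embeddings  # [2, D] speaker embeddings
    return F.cosine_similarity(emb[0], emb[1], dim=-1).item()
```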

As for WER, I just want to provide some information that might be useful: the Amphion framework (GitHub link) reports a Valle model achieving around 4% WER on English, as outlined in the paper (link to paper). They used a speech tokenizer to separate semantic and acoustic tokens, though, which may not be applicable in multilingual contexts. Similar to MaskGCT, they use a multilingual disentangled speech tokenizer (I guess it is also open-sourced). I guess using a similar disentangled tokenizer might help improve WER. (I am mentioning these because I listened to the Chinese demos. They are fine to me, but have room for further improvement.)

  • I don't have a proper validation dataset of Emilia, like I do for LibriSpeech/LibriLight, so everything under Emilia is already seen by the model anyways.

I suggest considering SEED-Eval (GitHub link), where the speakers come from CommonVoice (which the Emilia-trained model hasn’t encountered).

  • I do need to revisit F5-TTS and MaskGCT (and.... I guess visit Fish 1.5... since that seems all the rage too......) to do comparisons against again, especially now that the model is a pure NAR (and my biases are mostly gone).

By the way, what biases were you referring to? 😂

@e-c-k-e-r
Owner

Thanks for the great effort! May we know the WER/SIM-O on English and Mandarin for current Valle?

They're under the demo page (at least, averaged from the utterances on the demo page), but for posterity:

  • LibriSpeech/Light/Vox EN:
    • WER: 0.047
    • CER: 0.014
    • SIM-O: 0.889
  • Emilia EN:
    • WER: 0.040
    • CER: 0.020
    • SIM-O: 0.894
  • Emilia ZH:
    • WER: 0.119
    • CER: 0.049
    • SIM-O: 0.897

They're averaged for each section in the demo page, so they're not too objective of an average, but with a sample size of >30 it should be decent enough.

I recommend using SEED-Eval: https://github.com/BytedanceSpeech/seed-tts-eval, where speakers are from commonvoice (not seen by Emilia trained model)

Will do, I'll grab them sooner than later.

I worry the LibriSpeech speakers I have for English were actually seen in other portions of the LibriVox-derived dataset and aren't quite a good validation dataset. Although, their usual performance suggests they're unseen speakers.

I think most papers calculate SIM with WavLM-large fine-tuned on the speaker verification task

Right right, that's most likely where my problem stems from then. I could have sworn I took care to use the right weights, but I probably immediately used the wrong ones when I was refactoring things around. I'll re-do the SIM-O calcs sooner rather than later with the right weights.

But it is true that DAC has the best audio quality.

DAC is still my holy grail but none of my past experiments proved fruitful (and I suppose it hasn't been fruitful for anyone else): RVQ levels 4+ just fall apart too much to be useful, but I haven't been able to pinpoint the exact issue; I think I would just have to read the paper for both it and EnCodec and see if the devil is in the details.

Although I am not an expert by any stretch.

I guess using a similar distangled tokenizer might help improve WER.

The way I'm handling my text tokenizer does need some attention, since it's a bit of an artifact from when I fiddled with TorToiSe: plain IPA phonemes + some symbols + merges. I feel the merges do more harm than good, since I don't recall the current errors (either stuttering or outright omission) showing up in the ar+nar-retnet-8 weights when they worked, as they used a very plain tokenizer.

The IPAs themselves are robust enough to be language-agnostic, whereas espeak is more liable to problems (for example, in how it wants to pronounce read/lead; that's not necessarily a tokenizer problem, just a contextual one).

I am mentioning these because I heard the Chinese Demos. They are fine to me, but have room for further improvements.

That's more on me, I feel, from both using a small portion of Emilia/ZH and the model not seeing it anywhere near as much as the other languages, which have had the benefit of being in the NAR RVQ-level 0 post-training.

By the way, what biases were you referring to? 😂

Non-specifically: listening to my VALL-E's outputs. Since the beginning of working on it and throughout all the checkpoints, there have always been specific traits of the outputs that kind of ruined other models for me. I might also blame how there really was only TorToiSe and ElevenLabs at the time VALL-E's paper released and I took it upon myself to work on my own implementation, so other models haven't ruined me in that way.

Specifically: other implementations' implementation details. I have a mental list of nitpicks about abstractions adding unnecessary complexity, but I think it's mostly from working under TorToiSe.

A pure non-autoregressive model was on that list, but caving and taking the NAR-pill removed that bias since it works just as well as an autoregressive RVQ level 0 (although I still feel there's a difference in utterance quality).

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 17, 2024

Demo page updated with the correct SIM-O:

  • LibriVox-derived: 0.376
  • Emilia (EN): 0.520
  • Emilia (DE): 0.554
  • Emilia (FR): 0.469
  • Emilia (JA): 0.631
  • Emilia (KO): 0.519
  • Emilia (ZH): 0.519

The actual SIM-Os seem pretty plausible for what I'd expect given the tech of the model. I was a bit concerned about doing SIM-O against the prompt and not the reference utterance, but a glance at MaskGCT's paper shows it also does this (and it does increase the scores, naturally), so I'm not too silly for doing so.

I'm still a bit skeptical about the WER being too low; giving the samples a listen, there are some hiccups that I feel didn't get caught in the calculation, but it might just be whisper not catching them in the transcription.

  • It might also simply be from me computing the WER on the phonemes themselves to serve as text normalization. I suppose I might need to segregate this into a true word error rate and a phonetic error rate.
  • It might also just be from naively using torcheval.metrics.functional.word_error_rate wholesale; I need to un-normalize the outputted rates (a rough sketch follows this list).
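
Roughly what I mean by un-normalizing: weight each per-utterance rate by its reference length before averaging. A sketch, not what vall_e.demo currently does:

```python
from torcheval.metrics.functional import word_error_rate

def corpus_wer(hypotheses: list[str], references: list[str]) -> float:
    """Aggregate per-utterance WER into a corpus-level rate."""
    total_words, total_errors = 0, 0.0
    for hyp, ref in zip(hypotheses, references):
        n = len(ref.split())
        total_errors += word_error_rate(hyp, ref).item() * n  # rate * words ≈ edit count
        total_words += n
    return total_errors / max(total_words, 1)
```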

When I get the chance I'll nab seed-tts-eval and do further evals against it for a more apples-to-apples comparison.

@HarryHe11

Demo page updated with the correct SIM-O:

  • LibriVox-derived: 0.376
  • Emilia (EN): 0.520
  • Emilia (DE): 0.554
  • Emilia (FR): 0.469
  • Emilia (JA): 0.631
  • Emilia (KO): 0.519
  • Emilia (ZH): 0.519

The results look promising! I'm surprised by the Japanese results.

I'm still a bit skeptical about WER being too low; giving the samples a listen there's some hiccups that I feel didn't get caught in the calculation, but it might just be whisper not catching it in the transcription.

  • It might also simply be from me computing the WER on the phonemes themselves to serve as text normalization. I suppose I might need to segregate this into true-word error rate and a phonetic error rate.

I think WER should be computed on exact words instead of phonemes. The original VALL-E paper reported around 5%, and Amphion's tech report around 4.5% (on different test sets). Maybe 4-10% WER on words is an indicator of a good model.

  • my input prompts were far too short, as I was using the training prompt durations (between 3 seconds and 6 seconds)

Maybe simply randomly sampling the prompt length as 5%–20% of the training sample during training can alleviate this problem.
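
Something as simple as this might do (frame-level slicing is assumed; just a sketch of the idea):

```python
import random

def random_prompt(frames, lo: float = 0.05, hi: float = 0.20):
    """Crop a random 5%-20% slice of the training utterance to use as the prompt."""
    n = len(frames)
    length = max(1, int(n * random.uniform(lo, hi)))
    start = random.randint(0, n - length)
    return frames[start:start + length]
```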

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 18, 2024

Got around to adding seed-tts-eval (as well as some other speakers from my personal portion of my dataset). Some thoughts:

  • naturally, unseen speakers are quite meh, an average of ~0.38-0.42 SIM-O.
  • there's a fundamental problem with English as a whole in not being able to account for all accents/dialects. This eval dataset seems to have more South Asian speakers; while I imagine Emilia would better account for them with more real-world samples, the majority of the model was still trained on LibriVox-derived corpuses and other audiobook readings, which are focused more on typical American English.
    • This most likely explains why non-specific English speakers have a general low SIM-O score in comparison to other languages due to accents/dialects.
  • one thing I've noticed, moreso with Emilia, is that the model seems to handle a variety of acoustic environments (for lack of a better term). It's great that the dataset has a ton of real-world audio, letting the model better isolate what is the environment and what is the utterance, but I don't imagine there's a metric that would quantify that. Per MaskGCT's paper, FSD might be the closest thing to it if I'm understanding it right.
  • I'm still very skeptical about the WER being low, even after taking care to un-normalize the rates and accounting for some outliers completely muddying up the numbers. On one hand, the model does feel rather robust even for low-SIM-O utterances, as they sound pretty natural (or at least in comparison to what I remember hearing from TorToiSe), but it still doesn't seem to account for stuttering or some utterances just having an unnatural delivery at times. I suppose subjective metrics are what account for that.
    • I do find it interesting it does stutter, but I imagine that's just a consequence of modeling against real world utterances. I've noticed there's some breathing that slips through as well.
    • This might also just be a limitation to whisper not seeing stutters. And obviously it won't be able to handle oddities in deliveries.

I feel the model is fine for what it's worth, but it probably won't hurt to feed it some more of the parts of Emilia EN/ZH I didn't grab, whenever I get the headspace to dedicate a day or two to more training. I did notice the loss still slowly creeping down during the last post-training session, so that's nice.


Surprised by the Japanese results.

More of a reason for me to cobble together a validation dataset to make sure it's not a fluke. The Emilia/JA samples sound fine to my gaijin ears (and it's a bit of a chore to read along since they all speak faster than I can read the transcription), but in practice with my personal evals, more dramatic speakers sometimes work a little too nicely, while the model sometimes suffers from the confidence issue where the output utterance just sounds choked. There are also some quirks in the phonemization process, as in some kanji aren't read correctly, blah blah blah.

I think WER should be computed on exact words instead of phonemes.

I'll look into nabbing a thorough text normalizer, or just rely on whisper transcribing the reference as well and hoping it normalizes both text well enough. I just really don't look forward to it since thumbing through TorToiSe's text cleaners still haunts me.

I could have sworn though that my initial WER computations still had unusually low WER scores (taking into account they're normalized too), but I'll double check later.

  • This might also be true since the phonemes are still derived from whisper's transcription.

Maybe simply random sample prompt length, as 5%–20% of the training samples during training can alleviate this problem.

There's a weird "quirk" in the model where, despite training explicitly on low prompt duration ranges (3-6 seconds), it performs very well on prompt durations well above what it was trained against. I only say it's a quirk since it doesn't carry over for output utterances that goes well beyond the average 12 seconds / 3 sentences of audio.

Regardless, it only improved scores by a negligible amount (per the old page). Despite that, it's a good enough placebo since it aligns more with normal usage, where in my own evals I feed it a full clip trimmed down to around 8 seconds, rather than abruptly trimmed, small utterances.

  • Naturally longer input prompts = good, since I feel the attention heads have more to work with for deriving how a speaker utters a phoneme, but it can only do so much for speakers it can't strongly map in its own latent space or blah blah.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 19, 2024

Re-calculated WER with cleaned/normalized transcriptions, and the WER actually decreased (especially for Emilia JA/ZH).

  • debugging the transcriptions showed that whisper-large-v3 is a bit inconsistent when it comes to handling some punctuation and some nuances in English (for example: "Good night" vs "goodnight", "could have" vs "could've"); a toy normalizer sketch follows this list.
  • while I did initially do the WER calc against whisper-base, whisper-large-v3 showed little to no difference when trying to capture stuttering or delivery oddities (I suppose since the datasets used to train whisper won't have those anyways).
  • I didn't fully update the demo page, but the egregiously high WER/CER scores were updated.
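
For reference, the cleanup meant above is just a small normalization pass before scoring; the contraction table here is illustrative, not what the demo script actually uses:

```python
import re

# Hypothetical contraction/compound remaps; extend as inconsistencies show up.
CONTRACTIONS = {"could've": "could have", "goodnight": "good night"}

def normalize(text: str) -> str:
    text = re.sub(r"[^\w\s']", " ", text.lower())             # drop punctuation, lowercase
    return " ".join(CONTRACTIONS.get(w, w) for w in text.split())
```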

I'm still skeptical of the WER/CER being very low, but I guess the model is just that robust. However, I might be cheating a bit out of habit by having the NAR RVQ-level 0 (NAR-len) output use 50 demasking steps. I don't recall if it behaves like this with the current checkpoint, but I vaguely remember previous checkpoints being able to correct some errors simply by increasing the step count. However-however, I think this is a bit negated by having a high (annealed) temperature.


Aside from that, I did some deliberating on increasing the SIM-O score (/ the model's output quality in general) but I don't think I have very many feasible paths to explore.

  • increasing the model size feels more like a bandaid, increasing its knowledge-capacity rather than allowing it to generalize properly (and it would require training from scratch anyways to make the most use of it; gluing on more layers never feels like it improves anything).
    • If anything, I feel it might allow it to virtually overfit for some speakers, given how some of the samples on seen speakers already sound.
    • Any substantial size increase would have to trade off with throughput speed, sadly (although the model only really needs, like, 32 forward passes now that it's almost purely a NAR).
  • feeding it more data will just further the model along the curve of diminishing returns, and I have to be very careful to balance the dataset (the training for Korean/Chinese caused other languages to suffer a non-trivial amount, and post-post training on a balanced dataset did its best to fix it).
  • I think I've actually exhausted every avenue from prompts close to the reference, to implementation details, to exotic features with the demasking pure-NAR for RVQ level 0 (NAR-len).
    • Everything else is either suffering with exotic (meme) backends like SSMs (Mamba), or deviating from the ~spirit~ of the model with diffusers and flow matching or the dark arts of operating on extracted speaker features and the wonderful world of encoders (both of which I don't think I want to delve into desu).
    • Any other exotic feature would require a huge amount of compute anyways for something that might end in failure, which I feel would just be better put into more training on the existing model as-is.
  • there's some hope (cope) in the SpeechX tasks I've still yet to train (noise suppression, speech removal, text editing, speaker extracting) being able to bolster the TTS task, per the SpeechX paper (although I do need to reread the paper to see where exactly it improved), but I do not look forward to the agony of crippled training throughput.

I could be entirely wrong on my thoughts, as I'm still not an expert by any stretch. I just feel that if the model hasn't generalized well enough for zero-shot TTS, it probably won't ever as a tiny 220M attention-based transformer. Although, I think I'm fine with that.

I suppose when I get the hyperfixation spark to put more work into the model, it'd be on the SpeechX tasks and see if it amounts to anything, mostly because a bulk of the work is already in the trainer code. However, I think the answer is just to feed it more data in hopes the model can better map speakers to its own latent space or blah blah blah, as the model is still very undertrained in comparison to the other monsters boasting 50K+ hours of audio.

@HarryHe11

Re-calculated WER with cleaned/normalized transcriptions, and the WER actually decreased (especially for Emilia JA/ZH).

I'm still skeptical of the WER/CER being very low,

Just saw an interesting paper https://arxiv.org/pdf/2412.10117 which reports the objective evaluation results of almost all SOTA TTS models on the seed-tts-eval dataset, which you may refer to.
[screenshot of the paper's seed-tts-eval results table]

@e-c-k-e-r
Owner

Right right, I forgot I needed to read CosyVoice 2's paper. At the moment I can't really give it a solid glance (and I probably need the weekend and ample free time to digest it), but the instruction stuff reminded me of the "in-context learning" VALL-E boasted, which doesn't really seem to have ever been explored, to my memory. It's something I wanted to tackle, but thinking it over always ended with "I just don't have a way to properly annotate data to that degree".

In addition, about 400 hard test cases are also included to evaluate the robustness of TTS models on text repetition, tongue twister and other challenging synthesis cases, denoted as test-hard in this report.

Oh, I completely forgot that I had this in mind before, and I think I referenced it somewhere in some other eval with the Harvard sentences (if I remember right, MaskGCT's demo page had something similar). Those test-hard WER scores suggest I should throw that at the model and see how it fares.

One thing that did cross my mind, though, is that I might need to do WER calculation as naively as possible by just comparing the transcriptions as-is without any cleanup, and seeing if that brings it closer to other models (although I could have sworn that's what I did initially and it didn't seem too egregious). I suppose I can do this quickly while it's not too late into the night.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 19, 2024

Demo page updated with test-hard, but:

  • I need to use Paraformer for transcribing Chinese-based datasets, as well as figure out a better way to compute WER on them. I caught that the WER is deceptively low for JA/ZH because the word error rate is being calculated on one contiguous chunk due to the lack of spaces, and punctuation is the only thing that breaks things apart. I suppose this is noted in that table, where test-zh reports the CER instead but test-hard reports the WER. I suppose I might not need to do anything and can simply drop the WER for those languages so as not to be misleading; the demo page will omit WER for the JA/ZH tables (a CER sketch follows this list).
  • I might actually need to normalize the error rates?
  • This paper has some pieces of wisdom about WER and whisper, so I'll give it a better look tomorrow.
  • I added PER which is just the CER on the phonemized text. I need to compute it for the rest of the tables.
  • Regardless of the metrics, test-hard outputs are pretty mediocre due to Chinese being very undertrained anyways. It's still a nice test to show how the model can perform in this edge case.
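
For the JA/ZH tables, the character-level metric is the meaningful one, and plain edit distance over characters is all it takes; a self-contained sketch, not the demo's actual implementation:

```python
def cer(hyp: str, ref: str) -> float:
    """Character error rate via Levenshtein distance; sidesteps the lack of spaces in JA/ZH."""
    m, n = len(hyp), len(ref)
    dp = list(range(n + 1))            # dp[j] = edit distance for prefixes hyp[:i], ref[:j]
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                                  # deletion
                        dp[j - 1] + 1,                              # insertion
                        prev + (hyp[i - 1] != ref[j - 1]))          # substitution/match
            prev = cur
    return dp[n] / max(n, 1)
```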

I've also added the Harvard sentences I use for my own evals for general robustness, as well as English tongue twisters to try and mirror test-hard. However, I think I need to append multiple sentences to further increase the duration.

@e-c-k-e-r
Owner

e-c-k-e-r commented Dec 20, 2024

It came to my attention, while doing something completely unrelated, that for a while now at least the model actually wasn't properly trained with a language token, per `if "lang" in self.capabilities and lang_list is not None and lang_list[i] is not None:`. I don't know what compelled me to add that in and then completely forget about it. I do clearly remember it working before, though, since I made note of it affecting the accent. I guess I need to do some more post-training. Woo.

On the bright side, I suppose it's impressive it can perform as it does without a language marker. It doesn't seem to make much of a difference re-enabling it, though. Maybe the attention mechanism managed to derive the target language/accent from the input phonemes themselves, since they're agnostic to the input voice. I'll have to dig through some papers to see if any other TTS solution made note of this.


It didn't seem to matter at all. I guess it's more technically-interesting this way, but I still preferred my old way of being able to influence the accent by changing the language marker.

Regardless, I'll still need to re-do the samples in the demo page anyways. I did also enable training for ns (noise-suppression) and sr (speech removal) per SpeechX and the very cursory tests I did are decent despite having relatively little training. I don't expect those tasks to further influence the base tts task but you never know.
