
Reproducibility problems with Librispeech model #13

Open
Vanlogh opened this issue Oct 23, 2024 · 3 comments

Comments

@Vanlogh

Vanlogh commented Oct 23, 2024

I want to thank all the authors for the great work that they have done with this paper.

I am trying to reproduce the LibriSpeech model training to get a better sense of how the model trains, in the hopes of building a 25 Hz version of xcodec in the future.

I downloaded the full 960 h of LibriSpeech training data from here and kept the model config as-is. I only changed the batch size from 8 on 8 GPUs to 16 on 4 GPUs.

The problem I am running into is that the training is not stable. It seems to me that the GAN setting is difficult to train and is the main culprit of this.
[screenshots of training loss curves attached]

I just wanted to ask whether you experienced this during your experiments and how you dealt with it. I am almost tempted to just resume training from an earlier checkpoint. It would be really helpful if you could guide me here.

Thank you and I appreciate the time you've taken to read this!

@ooooolong

Hi, bro. Can I add you on WeChat to discuss a few questions with you?

@zhenye234
Owner

Hi, I have been experimenting a lot with low-bitrate codecs recently. For a 25 Hz codec, you could try a Vocos (iSTFT) decoder [1], since the model then does not need to learn temporal upsampling. In addition, I will release a low-bitrate xcodec next month.

[1] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
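
For readers unfamiliar with the suggestion, here is a minimal sketch of an iSTFT-style decoder head in PyTorch. It is illustrative only: the layer sizes and names are hypothetical and not taken from xcodec or Vocos, and it assumes 16 kHz audio at a 25 Hz token rate (hop of 640 samples).

```python
# Hypothetical sketch of an iSTFT decoder head (not the Vocos or xcodec implementation).
# The head predicts an STFT magnitude and phase at the token frame rate, and the
# inverse STFT performs the temporal upsampling, so no transposed-convolution
# upsampling stack has to be learned.
import torch
import torch.nn as nn


class ISTFTHead(nn.Module):
    def __init__(self, dim: int = 512, hop_length: int = 640, n_fft: int = 2560):
        # hop_length = sample_rate / frame_rate, e.g. 16000 / 25 Hz = 640 samples.
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1
        # Project decoder features to a magnitude and a phase per STFT bin.
        self.proj = nn.Linear(dim, 2 * n_bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) features at the codec frame rate.
        mag, phase = self.proj(x).chunk(2, dim=-1)
        mag = torch.exp(mag).clamp(max=1e2)                    # positive, bounded magnitudes
        spec = (mag * torch.exp(1j * phase)).transpose(1, 2)   # (batch, n_bins, frames), complex
        window = torch.hann_window(self.n_fft, device=x.device)
        return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length, window=window)
```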

@Vanlogh
Author

Vanlogh commented Nov 20, 2024

@zhenye234 thank you for responding. I noticed that audio reconstruction was already very good when using only 1 of the 8 available RVQ layers. I was wondering what the cause of that might be and whether it is an intended result.

I noticed you mention doing some kind of "dropout" of the quantizer layers (i.e. randomly selecting the number of RVQ layers from [1, 2, 3, 4, 8]). However, it doesn't seem to me that this alone would allow audio reconstruction with a single RVQ layer.
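
For context, the "dropout" referred to here is usually quantizer dropout during training: the number of active RVQ layers is sampled per step, so the first quantizer is forced to carry most of the signal on its own, which is why single-layer reconstruction can already sound decent. A minimal sketch of the idea (hypothetical names, not the xcodec code):

```python
# Hypothetical sketch of quantizer dropout in a residual VQ stack (illustrative only).
import random
import torch
import torch.nn as nn


class ResidualVQ(nn.Module):
    def __init__(self, num_quantizers: int = 8, dim: int = 512, codebook_size: int = 1024):
        super().__init__()
        self.quantizers = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_quantizers)
        )

    def quantize_one(self, codebook: nn.Embedding, x: torch.Tensor) -> torch.Tensor:
        # Nearest-neighbour lookup with a straight-through gradient estimator.
        dists = torch.cdist(x, codebook.weight)   # (tokens, codebook_size)
        codes = dists.argmin(dim=-1)
        quantized = codebook(codes)
        return x + (quantized - x).detach()

    def forward(self, x: torch.Tensor, n_active=None) -> torch.Tensor:
        # x: (tokens, dim). During training, sample how many layers to keep active.
        if n_active is None:
            n_active = random.choice([1, 2, 3, 4, 8]) if self.training else len(self.quantizers)
        residual, out = x, torch.zeros_like(x)
        for codebook in self.quantizers[:n_active]:
            q = self.quantize_one(codebook, residual)
            out = out + q
            residual = residual - q
        return out
```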
