I want to thank all the authors for the great work they have done on this paper.
I am trying to reproduce the LibriSpeech model training to get a better sense of how the model trains, in the hope of building a 25 Hz version of xcodec in the future.
I downloaded the full 960 h LibriSpeech training set from here and kept the model config as is. The only change I made was the batch size: from 8 per GPU on 8 GPUs to 16 per GPU on 4 GPUs, so the global batch size stays at 64.
The problem I am running into is that training is not stable. It seems to me that the adversarial (GAN) setup is difficult to train and is the main culprit.
I just wanted to ask whether you experienced this during your experiments and how you dealt with it. I am almost tempted to simply resume training from an earlier checkpoint, as sketched below. It would be really helpful if you could guide me here.
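For context, this is roughly what I mean by resuming. The checkpoint keys here are hypothetical (my own naming, not anything from the repo):

```python
import torch

def resume_from(path, generator, discriminator, opt_g, opt_d):
    """Reload a full GAN training state from an earlier, stable checkpoint.

    Key names are hypothetical; adjust to whatever the training script
    actually saves.
    """
    ckpt = torch.load(path, map_location="cpu")
    generator.load_state_dict(ckpt["generator"])
    discriminator.load_state_dict(ckpt["discriminator"])
    # Restoring optimizer state matters for GANs: resetting Adam moments on a
    # fresh optimizer can destabilize training all over again.
    opt_g.load_state_dict(ckpt["opt_g"])
    opt_d.load_state_dict(ckpt["opt_d"])
    return ckpt["step"] + 1  # step to continue training from
```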
Thank you and I appreciate the time you've taken to read this!
Hi, I have tried a lot of things on low-bitrate codecs recently. For a 25 Hz codec, maybe you can try a Vocos-style (iSTFT) decoder [1], since the model then does not need to learn temporal upsampling. In addition, I will release a low-bitrate xcodec next month.
[1] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
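To illustrate, here is a minimal sketch of such a head, assuming 16 kHz audio at a 25 Hz frame rate (hop length 640). The class name, layer sizes, and FFT settings are illustrative, not the actual Vocos or xcodec code:

```python
import torch
import torch.nn as nn

class ISTFTHead(nn.Module):
    """Sketch of a Vocos-style head: frame-rate features -> waveform via one iSTFT."""

    def __init__(self, dim: int = 512, n_fft: int = 1280, hop_length: int = 640):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        # one (log-magnitude, phase) spectrogram frame per input step
        self.proj = nn.Linear(dim, (n_fft // 2 + 1) * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) at the 25 Hz codec frame rate
        log_mag, phase = self.proj(x).chunk(2, dim=-1)
        mag = torch.exp(log_mag).clamp(max=1e2)   # keep magnitudes finite
        spec = mag * torch.exp(1j * phase)        # complex spectrogram
        # torch.istft expects (batch, freq, frames); with hop 640 at 16 kHz,
        # the 640x temporal upsampling is done by the transform itself
        return torch.istft(
            spec.transpose(1, 2),
            n_fft=self.n_fft,
            hop_length=self.hop_length,
            window=torch.hann_window(self.n_fft, device=x.device),
        )
```

Because the hop length equals the full upsampling factor, the waveform length comes from the inverse STFT itself, so the decoder stack can stay at the frame rate instead of learning transposed-convolution upsampling.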
@zhenye234 thank you for responding. I did notice that audio reconstruction was very good when using only 1 RVQ layer out of the 8 available quantizers. I was wondering what the cause of that might be and whether it is an intended result.
I noticed you mention doing some kind of "dropout" of the quantizer layers (i.e., randomly sampling the number of active RVQ layers from [1, 2, 3, 4, 8]). However, it doesn't seem to me that this alone should make reconstruction from a single RVQ layer that good. My reading of the scheme is sketched below.
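For concreteness, this is how I read that scheme. The module and variable names are mine, and the toy nearest-neighbour codebook just stands in for the real quantizer:

```python
import random
import torch
import torch.nn as nn

class VQ(nn.Module):
    """Toy nearest-neighbour codebook, standing in for one real RVQ stage."""

    def __init__(self, codebook_size: int = 1024, dim: int = 512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, x):  # x: (batch, frames, dim)
        # squared distance from every frame vector to every codebook entry
        d = (x.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        return self.codebook(d.argmin(dim=-1))  # quantized vectors

class RVQWithDropout(nn.Module):
    """Residual VQ whose depth is randomly truncated during training."""

    NQ_CHOICES = [1, 2, 3, 4, 8]

    def __init__(self, n_q: int = 8, dim: int = 512):
        super().__init__()
        self.quantizers = nn.ModuleList(VQ(dim=dim) for _ in range(n_q))

    def forward(self, x):
        # sample the number of active stages per batch, so the decoder is
        # regularly trained on codes from the first quantizer alone
        nq = random.choice(self.NQ_CHOICES) if self.training else len(self.quantizers)
        quantized, residual = torch.zeros_like(x), x
        for vq in self.quantizers[:nq]:
            q = vq(residual)
            residual = residual - q
            quantized = quantized + q
        return quantized
```

Since nq = 1 is sampled regularly, the decoder does see first-layer-only codes during training; I am just surprised that this alone yields reconstruction quality that high.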