About the input on ARCH benchmark #14

dzr1026 · 2024-11-13T06:10:15Z

Thank you for your work, it's a very innovative piece of research.
I have a question regarding the ARCH benchmark results (Table 5): What is the input for these results? Specifically, what is the "semantic representation"? Is it the latent space after RVQ (Residual Vector Quantization)? Or is the semantic representation the sum of the latent spaces from all eight quantizers?

zhenye234 · 2024-11-13T13:53:05Z

The input is the quantized semantic feature, here:

xcodec/models/soundstream_semantic.py

Line 114 in a2e52d3

o_semantic = self.decoder_semantic(quantized_semantic )
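
For context on the two options raised in the question: in a standard residual vector quantizer, the quantized output is itself the element-wise sum of the code vectors chosen by each quantizer stage, so "the latent after RVQ" and "the sum over all eight quantizers" normally describe the same tensor. The following is a minimal sketch in plain Python under that assumption. It is not the xcodec implementation; the helper names (`nearest`, `rvq`), the random codebooks, and the sizes are all hypothetical and chosen only for illustration.

```python
import random


def nearest(codebook, x):
    """Return the codebook vector closest to x (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((ci - xi) ** 2 for ci, xi in zip(c, x)))


def rvq(x, codebooks):
    """Toy residual vector quantization of a single vector x.

    Each stage quantizes the residual left over by the previous stages.
    Returns the summed quantized vector and the per-stage code vectors.
    """
    residual = list(x)
    quantized = [0.0] * len(x)
    stages = []
    for codebook in codebooks:
        code = nearest(codebook, residual)
        stages.append(code)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return quantized, stages


# Hypothetical setup: 8 quantizers, 16 entries each, feature dimension 4.
rng = random.Random(0)
dim, n_quantizers, codebook_size = 4, 8, 16
codebooks = [
    [[rng.gauss(0, 1.0 / (k + 1)) for _ in range(dim)] for _ in range(codebook_size)]
    for k in range(n_quantizers)
]
x = [rng.gauss(0, 1) for _ in range(dim)]

quantized, stages = rvq(x, codebooks)

# Summing the per-stage code vectors reproduces the RVQ output.
summed = [sum(stage[i] for stage in stages) for i in range(dim)]
```

Whether `quantized_semantic` in the repository is built in exactly this way should be checked against the quantizer code at the referenced commit.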

dzr1026 · 2024-11-14T00:43:59Z

Thank you for your reply!

ggiggit · 2024-11-20T07:32:55Z

@zhenye234 Thanks for your previous response! I have a couple more questions about Table 5, if you don't mind:

Could you please clarify the semantic representations for DAC, Encodec, and the Baseline Acoustic Codec in Table 5?
Also, I'm curious why SpeechTokenizer was excluded from the comparison?

Thanks so much for your help!
