
Question about dataset features #29

Open · LithiumZhou opened this issue May 27, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

LithiumZhou commented May 27, 2024

Hi Yuan,

The features you provide for the ESC-50 dataset seem to include only whisper-large-v1, but the provided code appears to support features from more than one model.
Thanks

YuanGongND (Owner) commented May 27, 2024

Hi there,

Yes, in the paper we compare the performance of multiple models on ESC-50 (Figure 1); the purpose was to show Whisper's advantage there, so this repo may contain code for that.

Since this project is mostly about Whisper, and Whisper has the strongest performance, we only release the Whisper features. If you want the features of the other models, the extraction code is available at https://github.com/YuanGongND/whisper-at/tree/main/src/noise_robust_asr/intermediate_feat_extract/esc-50; you can run it yourself. Hope this helps.

-Yuan
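
(For reference: a minimal sketch of capturing per-layer Whisper encoder features with PyTorch forward hooks, in the spirit of the extraction code linked above. This is not the repo's script; the openai-whisper calls are standard, but the model size, input file, and final stacking are illustrative assumptions.)

import torch
import whisper  # assumption: the openai-whisper package

model = whisper.load_model("large-v1", device="cpu")  # on GPU, mel may need .half()
model.eval()

features = []  # will hold one [T, D] tensor per encoder block

def save_output(module, inputs, output):
    features.append(output.detach().squeeze(0).cpu())

hooks = [blk.register_forward_hook(save_output) for blk in model.encoder.blocks]

audio = whisper.load_audio("example.wav")  # hypothetical input file
audio = whisper.pad_or_trim(audio)         # Whisper's fixed 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    model.encoder(mel.unsqueeze(0))  # run the encoder only

for h in hooks:
    h.remove()

audio_rep = torch.stack(features)  # [num_layers, T, D], e.g. [32, 1500, 1280] for large-v1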

YuanGongND added the question label on May 27, 2024
LithiumZhou (Author) commented

Thank you for your quick answer.
I have another question about ESC-50: I ran your code on ESC-50 but could not reproduce the paper's results on the two baselines, last-mlp and wa-mlp; my accuracy is approximately 82%. Could you share the hyperparameters for this experiment, such as the learning rate, learning-rate decay, and mixup?
Your work means a lot to me.

YuanGongND (Owner) commented

Can you reproduce the results of the TL-TR method?

The only hyperparameter we may have tuned per model is the learning rate. You can try values 5-10x larger or 5-10x smaller.

-Yuan
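
(A minimal sketch of that sweep; the base learning rate and the tiny stand-in model are illustrative, not the paper's values.)

import torch

model = torch.nn.Linear(1280, 50)  # stand-in for the baseline head being trained
base_lr = 1e-4                     # illustrative default, not the paper's value
for scale in (0.1, 0.2, 1.0, 5.0, 10.0):  # 5-10x below/above the default
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr * scale)
    print(f"trying lr = {base_lr * scale:.0e}")
    # ... run the usual ESC-50 training loop with this optimizer ...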

YuanGongND (Owner) commented

And are you using the following code?

# (baseline) average over all layers, then over time
if self.mode == 'mean_mlp':
    audio_rep = torch.mean(audio_rep, dim=1)  # average over layers: [B, T, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # average over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) use only the last layer, averaged over time
elif self.mode == 'last_mlp':
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, T, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # average over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) learnable weighted average over layers
elif self.mode == 'wa_mlp':
    audio_rep = torch.mean(audio_rep, dim=2)         # average over time: [B, 32, 1280]
    audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
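
(For context, a self-contained sketch of the module these branches assume. The 32 layers and 1280-dim features match whisper-large-v1, but the class name and the MLP head are assumptions, not the repo's exact definitions.)

import torch
import torch.nn as nn

class BaselineHead(nn.Module):  # hypothetical name
    def __init__(self, mode='wa_mlp', num_layers=32, feat_dim=1280, n_classes=50):
        super().__init__()
        self.mode = mode
        # one learnable scalar weight per encoder layer, used by 'wa_mlp'
        self.layer_weight = nn.Parameter(torch.ones(num_layers))
        self.mlp_layer = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, n_classes))

    def forward(self, audio_rep):  # audio_rep: [B, num_layers, T, feat_dim]
        if self.mode == 'mean_mlp':
            audio_rep = torch.mean(audio_rep, dim=1)  # over layers
            audio_rep = torch.mean(audio_rep, dim=1)  # over time
        elif self.mode == 'last_mlp':
            audio_rep = torch.mean(audio_rep[:, -1, :, :], dim=1)
        elif self.mode == 'wa_mlp':
            audio_rep = torch.mean(audio_rep, dim=2)         # [B, L, D]
            audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, D, L]
            audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()
        return self.mlp_layer(audio_rep)

x = torch.randn(4, 32, 25, 1280)  # [B, layers, time, dim] dummy features
print(BaselineHead()(x).shape)    # torch.Size([4, 50])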

LithiumZhou (Author) commented

Thank you for your response.

Yes, I can reproduce the results of TL-TR. I'm also using the same code as you describe, and I will keep trying different learning rates to see how they perform on the baselines.
