
Question about dataset features #29

Open · LithiumZhou opened this issue May 27, 2024 · 5 comments
Labels: question (Further information is requested)

Comments

LithiumZhou commented May 27, 2024

Hi Yuan,

The features you provide for the ESC-50 dataset seem to include only whisper-large-v1, but the provided code appears to support features from more than one model.
Thanks

YuanGongND (Owner) commented May 27, 2024

Hi there,

Yes, in the paper we compare the performance of multiple models on ESC-50 (Figure 1); the purpose was to show Whisper's advantage there, so this repo may contain code for that.

Since this project is mostly about Whisper, and Whisper has the strongest performance, we only release the Whisper features. If you want the features of the other models, the extraction code is available at https://github.com/YuanGongND/whisper-at/tree/main/src/noise_robust_asr/intermediate_feat_extract/esc-50; you can run it yourself. Hope this helps.

-Yuan
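
(For reference: a minimal sketch of capturing per-layer Whisper encoder features with PyTorch forward hooks, in the spirit of the extraction code linked above. This is not the repo's script; the openai-whisper calls are standard, but the model size, input file, and final stacking are illustrative assumptions.)

import torch
import whisper  # assumption: the openai-whisper package

model = whisper.load_model("large-v1", device="cpu")  # on GPU, mel may need .half()
model.eval()

features = []  # will hold one [T, D] tensor per encoder block

def save_output(module, inputs, output):
    features.append(output.detach().squeeze(0).cpu())

hooks = [blk.register_forward_hook(save_output) for blk in model.encoder.blocks]

audio = whisper.load_audio("example.wav")  # hypothetical input file
audio = whisper.pad_or_trim(audio)         # Whisper's fixed 30 s window
mel = whisper.log_mel_spectrogram(audio).to(model.device)

with torch.no_grad():
    model.encoder(mel.unsqueeze(0))  # run the encoder only

for h in hooks:
    h.remove()

audio_rep = torch.stack(features)  # [num_layers, T, D], e.g. [32, 1500, 1280] for large-v1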

YuanGongND added the question label on May 27, 2024
LithiumZhou (Author) commented

Thank you for your quick answer.
I have another question about ESC-50: I ran your code on ESC-50 but could not reproduce the paper's results on the two baselines, last-mlp and wa-mlp; my accuracy is approximately 82%. Could you share the hyperparameters for this experiment, such as the learning rate, learning-rate decay, and mixup?
Your work means a lot to me.

YuanGongND (Owner) commented

Can you reproduce the results of the TL-TR method?

The only hyperparameter we may have tuned per model is the learning rate. You can try values 5-10x larger or 5-10x smaller.

-Yuan
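
(A minimal sketch of that sweep; the base learning rate and the tiny stand-in model are illustrative, not the paper's values.)

import torch

model = torch.nn.Linear(1280, 50)  # stand-in for the baseline head being trained
base_lr = 1e-4                     # illustrative default, not the paper's value
for scale in (0.1, 0.2, 1.0, 5.0, 10.0):  # 5-10x below/above the default
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr * scale)
    print(f"trying lr = {base_lr * scale:.0e}")
    # ... run the usual ESC-50 training loop with this optimizer ...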

YuanGongND (Owner) commented

And are you using the following code?

# (baseline) average over all layers, then over time
if self.mode == 'mean_mlp':
    audio_rep = torch.mean(audio_rep, dim=1)  # average over layers: [B, T, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # average over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) use only the last layer, averaged over time
elif self.mode == 'last_mlp':
    audio_rep = audio_rep[:, -1, :, :]        # get the last layer: [B, T, 1280]
    audio_rep = torch.mean(audio_rep, dim=1)  # average over time: [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
# (baseline) learnable weighted average over layers
elif self.mode == 'wa_mlp':
    audio_rep = torch.mean(audio_rep, dim=2)         # average over time: [B, 32, 1280]
    audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, 1280, 32]
    audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()  # [B, 1280]
    audio_rep = self.mlp_layer(audio_rep)
    return audio_rep
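
(For context, a self-contained sketch of the module these branches assume. The 32 layers and 1280-dim features match whisper-large-v1, but the class name and the MLP head are assumptions, not the repo's exact definitions.)

import torch
import torch.nn as nn

class BaselineHead(nn.Module):  # hypothetical name
    def __init__(self, mode='wa_mlp', num_layers=32, feat_dim=1280, n_classes=50):
        super().__init__()
        self.mode = mode
        # one learnable scalar weight per encoder layer, used by 'wa_mlp'
        self.layer_weight = nn.Parameter(torch.ones(num_layers))
        self.mlp_layer = nn.Sequential(nn.LayerNorm(feat_dim), nn.Linear(feat_dim, n_classes))

    def forward(self, audio_rep):  # audio_rep: [B, num_layers, T, feat_dim]
        if self.mode == 'mean_mlp':
            audio_rep = torch.mean(audio_rep, dim=1)  # over layers
            audio_rep = torch.mean(audio_rep, dim=1)  # over time
        elif self.mode == 'last_mlp':
            audio_rep = torch.mean(audio_rep[:, -1, :, :], dim=1)
        elif self.mode == 'wa_mlp':
            audio_rep = torch.mean(audio_rep, dim=2)         # [B, L, D]
            audio_rep = torch.permute(audio_rep, (0, 2, 1))  # [B, D, L]
            audio_rep = (audio_rep @ self.layer_weight) / self.layer_weight.sum()
        return self.mlp_layer(audio_rep)

x = torch.randn(4, 32, 25, 1280)  # [B, layers, time, dim] dummy features
print(BaselineHead()(x).shape)    # torch.Size([4, 50])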

LithiumZhou (Author) commented

Thank you for your response.

Yes, I can reproduce the results of TL-TR. I'm also using the same code as you describe, and I will keep trying different learning rates to see how they perform on the baselines.
