This project investigates audio spoofing detection using state-of-the-art self-supervised learning (SSL) and transformer architectures. Several models are evaluated on ASVspoof benchmark datasets, showing significant improvements in distinguishing genuine speech from synthesized speech.
- Wav2Vec 2.0
- HuBERT
- SSL Wav2Vec 2.0 with PSFAN Backend
- Audio Spectrogram Transformer (AST)
- Sound Event Detection Model (EfficientNet-B0)
- WavLM Base
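The SSL models above share a common recipe: a pretrained encoder turns raw audio into frame-level embeddings, and a lightweight head pools those frames and emits a bonafide/spoof score. A minimal NumPy sketch of the pooling-plus-head stage (the 768-dim embedding size, the random inputs, and the untrained weights are placeholder assumptions; in practice the embeddings come from the pretrained encoder and the head is trained):

```python
import numpy as np

def classify_utterance(frame_embeddings, w, b):
    """Pool SSL frame embeddings and score an utterance as bonafide vs. spoof.

    frame_embeddings: (T, D) array of frame vectors from a pretrained encoder
    w: (D,) head weights, b: scalar bias (learned during fine-tuning)
    """
    pooled = frame_embeddings.mean(axis=0)   # temporal mean pooling -> (D,)
    logit = pooled @ w + b                   # linear classification head
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> P(bonafide)

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 768))            # e.g. 200 frames of 768-dim features
w = rng.normal(scale=0.01, size=768)         # placeholder (untrained) weights
score = classify_utterance(emb, w, 0.0)
print(score)
```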
- Gaussian noise injection
- Signal-to-noise ratio modifications
- Dynamic gain variations
- Background noise injection
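The first two augmentations above can be combined into one operation: inject white Gaussian noise scaled to hit a target signal-to-noise ratio. A small NumPy sketch (the 440 Hz test tone and 10 dB target are illustrative choices, not values from the project):

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Inject white Gaussian noise scaled so the result has the target SNR (dB)."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(signal.shape)
    sig_power = np.mean(signal ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale noise so that 10*log10(sig_power / scaled_noise_power) == snr_db
    scale = np.sqrt(sig_power / (noise_power * 10 ** (snr_db / 10)))
    return signal + scale * noise

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)          # 1 s, 440 Hz tone at 16 kHz
noisy = add_noise_at_snr(clean, snr_db=10, rng=np.random.default_rng(0))
achieved = 10 * np.log10(np.mean(clean**2) / np.mean((noisy - clean)**2))
print(round(achieved, 1))  # → 10.0
```

Because the injected noise is measured and rescaled sample-exactly, the achieved SNR matches the target exactly rather than only in expectation.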
- Losses: Focal Loss, Cross-Entropy Loss, BCEWithLogitsLoss
- Optimizer: AdamW with linear/cosine schedulers
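Of the losses listed above, focal loss is the least standard: it down-weights easy, confidently-classified examples so training focuses on hard ones, which helps with bonafide/spoof class imbalance. A NumPy sketch of the binary form (gamma=2.0 and alpha=0.25 are the commonly used defaults, not necessarily this project's settings):

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss on predicted probabilities p for labels y in {0, 1}.

    The (1 - p_t)**gamma factor shrinks the loss of well-classified examples;
    alpha_t re-weights the two classes.
    """
    p_t = np.where(y == 1, p, 1 - p)            # probability of the true class
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t))

y = np.array([1, 0, 1, 0])
easy = np.array([0.95, 0.05, 0.90, 0.10])   # confident, correct predictions
hard = np.array([0.55, 0.45, 0.60, 0.40])   # uncertain predictions
print(focal_loss(easy, y) < focal_loss(hard, y))  # → True: easy batch penalized far less
```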
The Audio Spectrogram Transformer and SSL Wav2Vec outperformed the other models, with AST reaching near-perfect precision and recall (0.999), demonstrating the strength of transformer architectures for spoof detection.
| Model | Public LB EER | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Wav2Vec 2.0 | 0.46516 | 0.888 | 0.788 | 0.835 |
| SSL Wav2Vec | 0.02925 | - | - | - |
| Audio Spectrogram Transformer | 0.01384 | 0.999 | 0.999 | 0.999 |
| HuBERT | 8.11672 | 0.877 | 0.764 | 0.817 |
| Pretrained Wav2Vec | 0.77492 | 0.845 | 0.725 | 0.780 |
| WavLM Base | 1.87658 | 0.820 | 0.690 | 0.750 |
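The EER (equal error rate) reported above is the operating point where the false-acceptance rate (spoofs accepted) equals the false-rejection rate (bonafide rejected); lower is better. A NumPy sketch of how it can be computed from raw scores (the scores and labels below are made up for illustration):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: the point where FAR and FRR curves cross.

    scores: higher = more bonafide-like; labels: 1 = bonafide, 0 = spoof.
    """
    order = np.argsort(scores)               # sweep thresholds from low to high
    labels = np.asarray(labels)[order]
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    frr = np.cumsum(labels) / n_pos          # bonafide at or below threshold
    far = 1 - np.cumsum(1 - labels) / n_neg  # spoof above threshold
    idx = np.argmin(np.abs(far - frr))       # closest crossing point
    return (far[idx] + frr[idx]) / 2

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1])
labels = np.array([1,   1,   1,   0,   1,   0,   0,   0])
print(compute_eer(scores, labels))  # → 0.25
```

Production evaluations typically interpolate between thresholds rather than picking the nearest one, but the crossing-point idea is the same.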