Inference codebase for "Cacophony: An Improved Contrastive Audio-Text Model". Preprint: https://arxiv.org/abs/2402.06986

Beatoven/Cacophony

Cacophony

Inference codebase for "Cacophony: An Improved Contrastive Audio-Text Model"

Abstract

Despite recent improvements in audio-text modeling, audio-text contrastive models still lag behind their image-text counterparts in scale and performance. We propose a method to improve both the scale and the training of audio-text contrastive models. Specifically, we craft a large-scale audio-text dataset consisting of over 13,000 hours of text-labeled audio, aided by large language model (LLM) processing and audio captioning. Further, we employ a masked autoencoder (MAE) pre-pretraining phase with random patch dropout, which allows us to both scale unlabeled audio datasets and train efficiently with variable-length audio. After MAE pre-pretraining of our audio encoder, we train a contrastive model with an auxiliary captioning objective. Our final model, which we name Cacophony, achieves state-of-the-art performance on audio-text retrieval tasks and exhibits competitive results on other downstream tasks such as zero-shot classification.
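For readers new to contrastive audio-text training, the contrastive objective mentioned in the abstract can be illustrated with a small, dependency-free sketch. It assumes the generic CLIP-style symmetric cross-entropy loss over cosine similarities; it is not taken from the Cacophony codebase and leaves out the auxiliary captioning objective. The function names and the temperature value of 0.07 are assumptions made for illustration only.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss for a batch of paired embeddings.

    audio_emb[i] and text_emb[i] are assumed to describe the same clip.
    The temperature is a hypothetical default, not a value from the paper.
    """
    audio = [l2_normalize(v) for v in audio_emb]
    text = [l2_normalize(v) for v in text_emb]
    n = len(audio)

    # logits[i][j]: scaled similarity between audio clip i and caption j.
    logits = [
        [sum(a * t for a, t in zip(audio[i], text[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    # Audio-to-text direction: each clip should pick its own caption.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio direction: transpose the logits and repeat.
    logits_t = [list(col) for col in zip(*logits)]
    t2a = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)


audio = [[1.0, 0.0], [0.0, 1.0]]
matched_text = [[1.0, 0.0], [0.0, 1.0]]
swapped_text = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(audio, matched_text)
loss_swapped = contrastive_loss(audio, swapped_text)
```

In this toy batch the loss is close to zero when each clip lines up with its own caption and large when the captions are swapped. A real implementation would vectorize the same computation in JAX.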



Requirements

JAX and Flax are used for the model implementation. Tested on an RTX 2080Ti with CUDA 11.5, cuDNN 8.2.1, cudatoolkit 11.3.1, and Python 3.8.17.

pip install -r requirements.txt
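After installation, it can be useful to confirm that the interpreter can locate both libraries before loading a checkpoint. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical, and the check only confirms the packages can be found, not that CUDA or cuDNN are configured correctly.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level packages from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "jax" and "flax" are the import names of the two libraries listed above.
missing = missing_packages(["jax", "flax"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("JAX and Flax are importable.")
```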

Pretrained Models

We provide pretrained models for both stages of the Cacophony model:

AudioMAE

Cacophony

Evaluation

Audio-Text Retrieval

Zero-Shot Classification

Audio Captioning

HEAR Benchmark

Our environment does not support the HEAR benchmark, but the code to run it is provided in the hear directory. To run the benchmark, follow the instructions in that directory.

| Model | ESC50 | Libri Count | CREMAD | Gunshot | SC 5hr | SC Full | Vox Lingua | Vocal Imitation | NSynth Pitch 5hr | NSynth Pitch 50hr | GTZAN Genre | GTZAN Music Speech | Beijing Opera Percussion |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LAION-CLAP-fusion | 0.964 | 0.625 | 0.566 | 0.914 | 0.693 | 0.758 | 0.264 | 0.155 | 0.172 | 0.376 | 0.842 | 0.962 | 0.962 |
| LAION-CLAP | 0.971 | 0.659 | 0.557 | 0.845 | 0.693 | 0.774 | 0.189 | 0.151 | 0.180 | 0.423 | 0.838 | 0.969 | 0.953 |
| MS-CLAP | 0.930 | 0.649 | 0.547 | 0.798 | 0.511 | 0.626 | 0.236 | 0.106 | 0.112 | 0.274 | 0.818 | 0.992 | 0.932 |
| WavCaps-CNN14 | 0.962 | 0.646 | 0.556 | 0.789 | 0.583 | 0.640 | 0.270 | 0.158 | 0.140 | 0.324 | 0.861 | 0.992 | 0.957 |
| WavCaps-HTSAT | 0.961 | 0.690 | 0.595 | 0.929 | 0.752 | 0.806 | 0.234 | 0.168 | 0.256 | 0.548 | 0.847 | 0.962 | 0.958 |
| Stage1: AudioMAE (Ours) | 0.841 | 0.754 | 0.661 | 0.893 | 0.841 | 0.893 | 0.439 | 0.161 | 0.700 | 0.827 | 0.828 | 0.985 | 0.945 |
| Stage2: Cacophony (Ours) | 0.970 | 0.660 | 0.593 | 0.833 | 0.680 | 0.762 | 0.262 | 0.191 | 0.420 | 0.726 | 0.850 | 0.985 | 0.970 |
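To read the HEAR results at a glance, the scores above can be loaded into a small script that reports the top-scoring model(s) for each task. This is only a convenience sketch: the numbers are copied from the results listed above (with the LAION-CLAP-fusion NSynth Pitch 5hr entry read as 0.172), and the names `TASKS`, `SCORES`, and `best_per_task` are hypothetical.

```python
TASKS = [
    "ESC50", "Libri Count", "CREMAD", "Gunshot", "SC 5hr", "SC Full",
    "Vox Lingua", "Vocal Imitation", "NSynth Pitch 5hr", "NSynth Pitch 50hr",
    "GTZAN Genre", "GTZAN Music Speech", "Beijing Opera Percussion",
]

# One row per model, in the same task order as TASKS.
SCORES = {
    "LAION-CLAP-fusion": [0.964, 0.625, 0.566, 0.914, 0.693, 0.758, 0.264,
                          0.155, 0.172, 0.376, 0.842, 0.962, 0.962],
    "LAION-CLAP": [0.971, 0.659, 0.557, 0.845, 0.693, 0.774, 0.189,
                   0.151, 0.180, 0.423, 0.838, 0.969, 0.953],
    "MS-CLAP": [0.930, 0.649, 0.547, 0.798, 0.511, 0.626, 0.236,
                0.106, 0.112, 0.274, 0.818, 0.992, 0.932],
    "WavCaps-CNN14": [0.962, 0.646, 0.556, 0.789, 0.583, 0.640, 0.270,
                      0.158, 0.140, 0.324, 0.861, 0.992, 0.957],
    "WavCaps-HTSAT": [0.961, 0.690, 0.595, 0.929, 0.752, 0.806, 0.234,
                      0.168, 0.256, 0.548, 0.847, 0.962, 0.958],
    "Stage1: AudioMAE (Ours)": [0.841, 0.754, 0.661, 0.893, 0.841, 0.893, 0.439,
                                0.161, 0.700, 0.827, 0.828, 0.985, 0.945],
    "Stage2: Cacophony (Ours)": [0.970, 0.660, 0.593, 0.833, 0.680, 0.762, 0.262,
                                 0.191, 0.420, 0.726, 0.850, 0.985, 0.970],
}


def best_per_task(tasks, scores):
    """Map each task to (top score, list of models that reach it)."""
    best = {}
    for i, task in enumerate(tasks):
        top = max(row[i] for row in scores.values())
        winners = [model for model, row in scores.items() if row[i] == top]
        best[task] = (top, winners)
    return best


best = best_per_task(TASKS, SCORES)
for task, (score, models) in best.items():
    print(f"{task}: {score:.3f} ({', '.join(models)})")
```

Ties are kept rather than broken, so a task can list more than one model.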

Acknowledgements
