This is the code repository for the ICASSP 2022 paper (under review) *Deep Embeddings for Robust User-Based Amateur Vocal Percussion Transcription* by Alejandro Delgado, Emir Demirel, Vinod Subramanian, Charalampos Saitis, and Mark Sandler.
- `src` – the main codebase, with scripts for processing data, models, and results (details in the sections below).
- `data` – datasets and processed data used throughout the study.
- `models` – folder that hosts the already-trained models.
- `results` – folder that hosts information on the final accuracy results.
To install requirements:

```
pip install -r requirements.txt
```
If you are a Mac user, you may need to install Essentia using Homebrew.
Once the AVP-LVT dataset is downloaded and built following the instructions inside, place its contents in the `data/external` directory.
The first step is to generate the spectrogram representations that are later fed to the networks. These are 64×48 log Mel spectrograms computed with a frame size of 23 ms and a hop size of 8 ms. Several engineered (hand-crafted) feature vectors also need to be extracted for the baseline methods, using the same frame-wise parameters as the spectrograms.
To build the spectrogram representations, which will be saved in the `data/interim` directory, run this command:

```
python src/data/generate_interim_datasets.py
```
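For reference, a 64×48 log Mel patch with these frame-wise parameters can be sketched in plain NumPy. This is an illustration, not a transcription of the repository's script: the 44.1 kHz sample rate, the Hann window, and the pad-or-trim-to-48-frames step are assumptions.

```python
import numpy as np

SR = 44100                     # assumed sample rate (not stated in this README)
N_MELS, N_FRAMES = 64, 48      # target 64x48 representation
FRAME = int(0.023 * SR)        # 23 ms analysis window
HOP = int(0.008 * SR)          # 8 ms hop

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters spaced uniformly on the mel scale.
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        if c > l:
            fb[i, l:c] = (np.arange(l, c) - l) / (c - l)   # rising slope
        if r > c:
            fb[i, c:r] = (r - np.arange(c, r)) / (r - c)   # falling slope
    return fb

def log_mel_spectrogram(y, sr=SR):
    window = np.hanning(FRAME)
    # Frame the signal and take the power spectrum of each windowed frame.
    frames = [np.abs(np.fft.rfft(y[s:s + FRAME] * window)) ** 2
              for s in range(0, len(y) - FRAME + 1, HOP)]
    spec = np.array(frames).T                              # (freq_bins, time)
    logmel = np.log(mel_filterbank(N_MELS, FRAME, sr) @ spec + 1e-10)
    # Pad or trim to a fixed 48-frame patch per onset.
    if logmel.shape[1] < N_FRAMES:
        logmel = np.pad(logmel, ((0, 0), (0, N_FRAMES - logmel.shape[1])))
    return logmel[:, :N_FRAMES]

# 0.5 s of noise standing in for one vocal percussion onset -> one 64x48 patch
patch = log_mel_spectrogram(np.random.default_rng(0).standard_normal(SR // 2))
```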
To extract engineered features, also saved in the `data/interim` directory, run

```
python src/data/extract_engineered_features_mfcc_env.py
```

to extract "MFCCs + Envelope" features, or

```
python src/data/extract_engineered_features_all.py
```

to extract the 258-dimensional feature vectors that feed the feature selection algorithms.
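As a rough sketch of what an "MFCCs + Envelope" vector might contain: MFCCs from a DCT over the log Mel bands, plus a frame-wise RMS envelope with the same frame/hop parameters. The coefficient count, the RMS envelope, and the mean/max summary below are illustrative assumptions, not the repository's exact recipe.

```python
import numpy as np
from scipy.fft import dct

SR = 44100                                     # assumed sample rate
FRAME, HOP = int(0.023 * SR), int(0.008 * SR)  # same 23 ms / 8 ms parameters

def mfcc_env_features(y, log_mel, n_mfcc=13):
    # MFCCs: DCT across the mel-band axis, keeping the lowest coefficients.
    mfcc = dct(log_mel, type=2, norm='ortho', axis=0)[:n_mfcc]
    # Envelope: frame-wise RMS with the same frame/hop as the spectrograms.
    env = np.array([np.sqrt(np.mean(y[s:s + FRAME] ** 2))
                    for s in range(0, len(y) - FRAME + 1, HOP)])
    # Summarise each onset as one fixed-length vector (mean over time here).
    return np.concatenate([mfcc.mean(axis=1), [env.mean(), env.max()]])

# Synthetic stand-ins: a noise burst and a random 64x48 log Mel patch.
rng = np.random.default_rng(0)
feats = mfcc_env_features(rng.standard_normal(SR // 2),
                          rng.standard_normal((64, 48)))
```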
To train the deep learning models and save the embeddings predicted from the evaluation data, run this command:

```
python src/models/train_deep.py
```
To train the feature selection methods and save the feature importances, run this command:

```
python src/models/train_selection.py
```
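Conceptually, the feature-importance step can be pictured with this scikit-learn sketch on synthetic data: fit a model on the 258-dimensional engineered vectors and rank dimensions by importance. The random-forest importances and the choice of keeping 32 dimensions are illustrative assumptions, not necessarily the selection method the scripts implement.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 258))   # stand-in 258-D engineered feature vectors
y = rng.integers(0, 4, 200)           # stand-in labels for four percussion classes

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_   # one relevance score per feature
top = np.argsort(importances)[::-1][:32]    # keep the most informative dimensions
```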
To evaluate the performance of the learnt embeddings and selected features, which should be stored in `data/processed` by now, run

```
python src/results/eval_knn.py
```

for KNN classification, or

```
python src/results/eval_alt.py
```

for classification with three alternative classifiers (logistic regression, random forest, and extreme gradient boosting).
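The KNN evaluation amounts to fitting a classifier on the saved training embeddings and scoring it on the held-out ones. A minimal scikit-learn sketch on synthetic stand-ins (the embedding dimensionality and the value of k here are assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
# Stand-ins for the embeddings saved in data/processed, one matrix per split.
train_emb = rng.standard_normal((120, 32))   # 32-D embeddings (assumed size)
train_lab = rng.integers(0, 4, 120)
test_emb = rng.standard_normal((40, 32))
test_lab = rng.integers(0, 4, 40)

knn = KNeighborsClassifier(n_neighbors=5).fit(train_emb, train_lab)
accuracy = knn.score(test_emb, test_lab)     # fraction of correctly classified boxemes
```

Swapping `KNeighborsClassifier` for `LogisticRegression`, `RandomForestClassifier`, or an XGBoost model gives the alternative-classifier variant.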
Our learnt embeddings and engineered features achieve the following performances on the AVP-LVT dataset with a KNN classifier:
| Method | Participant-wise Accuracy | Boxeme-wise Accuracy |
|---|---|---|
| GMM-HMM | .725 | .734 |
| Timbre | .840 | .835 |
| Feature Selection | .827 ± .012 | .795 ± .011 |
| Instrument Original | .812 ± .012 | .774 ± .014 |
| Instrument Reduced | .779 ± .019 | .738 ± .031 |
| Syllable Original | .899 ± .005 | .874 ± .008 |
| Syllable Reduced | .883 ± .005 | .852 ± .012 |
| Phoneme Original | .876 ± .014 | .840 ± .018 |
| Phoneme Reduced | .874 ± .013 | .838 ± .019 |
| Boxeme Original | .861 ± .016 | .832 ± .018 |
Weights for the final pretrained models of each of the seven embedding learning methods can be downloaded here: (link)
We recommend using `cnn_syllable_level_original.h5` for feature extraction, as it yields the best performance in the table above.
- Add full table with results
- Add data and paper links
- Finish tidying up code
- Write routines for personal use
This work has received funding from the European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.