shufl

GI15 - Information Retrieval Data Mining Group Project (Group 37)

This project aims to address the cold-start problem in music recommendation and is meant to be an exercise in machine learning. It is largely based on Sander Dieleman's blog post, which can be found here: http://benanne.github.io/2014/08/05/spotify-cnns.html

We use the MagnaTagATune dataset, consisting of approximately 26,000 music clips of 29 seconds each, encoded as 16 kHz, 32 kbps, mono MP3. Each clip is annotated with a combination of 188 tags, such as 'classical', 'guitar', 'slow', etc.

First, a 'ground truth' model is obtained using gensim's Doc2Vec algorithm: given the set of tags for each clip, a 40-dimensional latent vector representation is produced.
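For illustration, a minimal sketch of what that training could look like with gensim (the tag lists and file path here are made up, and the actual code lives in train_tags_model.py; older gensim versions spell vector_size as size and model.dv as model.docvecs):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per clip: its tags are the "words",
# its clip id is the document tag
docs = [
    TaggedDocument(words=['classical', 'strings', 'violin'], tags=['CLIP_2']),
    TaggedDocument(words=['drums', 'electronic', 'fast'], tags=['CLIP_8713']),
]

# 40-dimensional latent vectors, matching latent_v in config/shufl.cfg
model = Doc2Vec(docs, vector_size=40, min_count=1, epochs=50)
model.save('data/d2vmodel.doc2vec')

vec = model.dv['CLIP_2']  # the clip's 40-dimensional representation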

The mp3 samples are then converted to their (mel) spectrogram representation using librosa.
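A minimal sketch of that conversion, assuming a 16 kHz clip and the 128 mel bands from config/shufl.cfg (the clip path is hypothetical; the exact n_fft/hop_length that yield the project's 599 time frames live in the preparation scripts, and librosa versions before 0.5 call power_to_db logamplitude):

import librosa
import numpy as np

# load the clip at the dataset's native 16 kHz sampling rate
y, sr = librosa.load('data/magna/2/some_clip.mp3', sr=16000)

# 128 mel bands, matching mel_y in config/shufl.cfg; n_fft and
# hop_length control how many time frames (mel_x) a 29 s clip yields
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# log-scaled amplitudes are the usual input for spectrogram CNNs
log_S = librosa.power_to_db(S, ref=np.max)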

Finally, a convolutional neural network following Sander Dieleman's architecture is built and trained with Theano and Lasagne on an Amazon AWS GPU instance.
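A rough Lasagne sketch of that kind of network: 1D convolutions along the time axis with the mel bands as input channels, pooled down to a 40-dimensional output matching the Doc2Vec space (filter counts and sizes here are illustrative assumptions, not the project's exact values):

import lasagne
from lasagne.layers import (InputLayer, Conv1DLayer, MaxPool1DLayer,
                            GlobalPoolLayer, DenseLayer)

# input: (batch, mel bands, time frames) = (None, 128, 599)
net = InputLayer(shape=(None, 128, 599))

# convolve along time, treating the 128 mel bands as input channels
net = Conv1DLayer(net, num_filters=256, filter_size=4)
net = MaxPool1DLayer(net, pool_size=4)
net = Conv1DLayer(net, num_filters=256, filter_size=4)
net = MaxPool1DLayer(net, pool_size=4)

# collapse whatever remains of the time axis
net = GlobalPoolLayer(net)

# regress onto the 40-dimensional Doc2Vec latent space
net = DenseLayer(net, num_units=512)
net = DenseLayer(net, num_units=40, nonlinearity=lasagne.nonlinearities.linear)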

Installation

These instructions are for a computer running Mac OS X and assume no previous Python development environment is present.

Install homebrew package manager (http://brew.sh/):

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Install python using brew:

brew install python

Make sure /usr/local/bin is on the first line of /etc/paths so that when you type python the brewed version is used.

Move it there if it's not:

sudo nano /etc/paths

Install virtualenv so you can sandbox your development environment:

pip install virtualenv

Make sure you don't install packages to your global Python environment unless explicitly stated:

$ nano ~/.bashrc

# refuse pip installs outside an active virtualenv
export PIP_REQUIRE_VIRTUALENV=true

# gpip deliberately bypasses the restriction for global installs
gpip(){
    PIP_REQUIRE_VIRTUALENV="" pip "$@"
}

Clone shufl to your place of choice

git clone https://github.com/iogakos/shufl.git
cd shufl

Create a virtual environment for shufl:

virtualenv venv-shufl

Activate the virtual environment:

source venv-shufl/bin/activate

Install dependencies:

librosa (https://bmcfee.github.io/librosa/install.html)

pip install librosa
brew install libsamplerate
pip install scikits.samplerate

Lasagne + Theano (http://lasagne.readthedocs.org/en/latest/user/installation.html):

pip install --upgrade https://github.com/Theano/Theano/archive/master.zip
pip install --upgrade https://github.com/Lasagne/Lasagne/archive/master.zip

gensim (https://radimrehurek.com/gensim/install.html):

pip install --upgrade gensim

Download the audio samples from the MagnaTagATune dataset (http://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset):

cd ~/Downloads
curl -O http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.001
curl -O http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.002
curl -O http://mi.soi.city.ac.uk/datasets/magnatagatune/mp3.zip.003

You need to concatenate the partial archives first:

cat mp3.zip.* > magna.zip
unzip magna.zip -d SHUFL_ROOT/scripts/data/magna

Some MP3 files are broken/empty; you can find them with:

cd SHUFL_ROOT/scripts/data/magna
$ du -ah . | grep -v "/$" | sort | head -n 20
  0B	./6/norine_braun-now_and_zen-08-gently-117-146.mp3
  0B	./8/jacob_heringman-josquin_des_prez_lute_settings-19-gintzler__pater_noster-204-233.mp3
  0B	./9/american_baroque-dances_and_suites_of_rameau_and_couperin-25-le_petit_rien_xiveme_ordre_couperin-88-117.mp3

Remove these files, as they are zero bytes long:

rm ./6/norine_braun-now_and_zen-08-gently-117-146.mp3
rm ./8/jacob_heringman-josquin_des_prez_lute_settings-19-gintzler__pater_noster-204-233.mp3
rm ./9/american_baroque-dances_and_suites_of_rameau_and_couperin-25-le_petit_rien_xiveme_ordre_couperin-88-117.mp3

Cloud installation

TODO: writing off the top of my head, verify

Follow the Stanford guide at http://cs231n.github.io/aws-tutorial/ to set up a GPU instance with Theano and Lasagne preinstalled.

SSH into your machine.

Follow the dependency installation steps above.

libsamplerate needs to be installed from source, and scikits.samplerate might need libsamplerate0-dev:

sudo apt-get install libsamplerate0-dev
pip install scikits.samplerate

Getting ffmpeg on Ubuntu (so librosa can load MP3s), following http://stackoverflow.com/questions/29125229/how-to-reinstall-ffmpeg-clean-on-ubuntu-14-04:

sudo apt-get --purge remove ffmpeg
sudo apt-get --purge autoremove

sudo apt-get install ppa-purge
sudo ppa-purge ppa:jon-severinsson/ffmpeg

sudo add-apt-repository ppa:mc3man/trusty-media
sudo apt-get update
sudo apt-get dist-upgrade

sudo apt-get install ffmpeg

Update Lasagne and Theano to the newest versions (see the dependency installation steps above).

You will need the newest cuDNN (v5) from https://developer.nvidia.com/cudnn. Register and download the Linux archive.

Install it (adapted from https://gist.github.com/joyofdata/11e936d0603dd7dd63f6):

scp -i ~/.ssh/aws.pem ~/Downloads/cudnn.tar.gz ubuntu@<instance-public-dns>:/home/ubuntu/downloads

gzip -d cudnn.tar.gz
tar xf cudnn.tar

echo -e "\nexport LD_LIBRARY_PATH=/home/ubuntu/cudnn-6.5-linux-x64-v2:$LD_LIBRARY_PATH" >> ~/.bashrc

sudo cp cudnn.h /usr/local/cuda-7.0/include
sudo cp libcudnn* /usr/local/cuda-7.0/lib64

The persistent (root) filesystem on the AWS instance is only 12 GB, much of which is taken up by Python and some other preinstalled packages. We recommend removing Caffe and Torch if you don't plan to use them, and keeping the zipped MP3 samples in ~/downloads so you don't need to re-download them each time.

Once booted, you have an additional 60 GB of storage under the /mnt folder (root access only) that gets wiped on reboot. Change the owner to ubuntu for easier access:

sudo chown -R ubuntu:ubuntu /mnt

Extract the mp3s:

unzip ~/downloads/mp3.zip -d /mnt/data/magna/

and remove the broken files:

rm /mnt/data/magna/6/norine_braun-now_and_zen-08-gently-117-146.mp3
rm /mnt/data/magna/8/jacob_heringman-josquin_des_prez_lute_settings-19-gintzler__pater_noster-204-233.mp3
rm /mnt/data/magna/9/american_baroque-dances_and_suites_of_rameau_and_couperin-25-le_petit_rien_xiveme_ordre_couperin-88-117.mp3

To make Theano run on the GPU by default:

echo -e '[global]\nfloatX=float32\ndevice=gpu\nmode=FAST_RUN\n\n[lib]\ncnmem=0.9\n\n[nvcc]\nfastmath=True\n\n[cuda]\nroot=/usr/local/cuda' > ~/.theanorc

Usage

The following is a walkthrough of Shufl's pipeline, from preprocessing the dataset to getting recommendations from the trained model.

The following steps assume you have changed into the scripts directory of Shufl.

First, we prepare a line-separated file, each line starting with a clip id followed by its (space-separated) tags, like the following:

58870 vocals foreign voice female_vocal vocal indian singing notenglish women

Execute the following script to build this file:

python prepare_tags.py

Then, we move from tags to a latent vector space representation for each clip using the Doc2Vec model available in the gensim library:

python train_tags_model.py

This will save the Doc2Vec model in a file called d2vmodel.doc2vec under the data directory. Having switched to the latent vector space, we are now ready to prepare a fully fledged dataset for training, validation and testing. By default, the dataset preparation script assumes execution in a local environment and includes only 1000 samples for training. If you want to prepare a dataset for a production environment, you can use the -p flag, which takes the whole dataset into account, generating pickle files of approximately 15 GB (this takes ~1.5 h on an 8-core machine). We assume running on a commodity machine for now, so we prepare the dataset simply by executing:

python prepare_dataset.py

This script generates the following files, briefly explained below:

711M  mels.pickle # serialized mel spectrograms for training/validation
70M   mels-test.pickle # serialized mel spectrograms for testing
11M   tags.pickle # serialized tags for training/validation
900K  tags-test.pickle # serialized tags for testing
9.4K  clips # line separated file, each line containing the corresponding clip ids

Having prepared the dataset, we are now ready to move on to training and evaluation using the shufl.py script. Below we show the different execution modes supported by the utility; each flag is explained further afterwards:

$ python shufl.py -h
usage: shufl.py [-h] [-e N] [-m <mode>] [-t TRACK_ID] [-c] [-p] [-s]

GI15 2016 - Shufl

optional arguments:
  -h, --help            show this help message and exit
  -e N, --epochs N      Number of training epochs (default: 500)
  -m <mode>, --mode <mode>
                        Set to `train` for training, `val` for validating.
                        `user` for user mode (Default `train`)
  -t TRACK_ID, --track-id TRACK_ID
                        Return the closest vectors for a specific track id
                        from the MagnaTagATune dataset
  -c, --checkpoint      Continue training from the last epoch checkpoint
  -p, --production      Load training specific options appropriate for
                        production
  -s, --shuffle         Shuffle the training data (caution: loads the whole
                        dataset in memory)

-e controls how many epochs the network goes through while training. As a note, we trained our model using 21,600 samples on an EC2 instance equipped with a GPU, where each epoch took ~40 s, resulting in 200 epochs of training in ~2.5 h.

-m controls the execution mode. train mode starts training, assuming the already prepared dataset exists. val mode enters validation mode: no training is applied, and only the validation dataset is fed through the network. user mode fetches the closest predictions for a clip id based on an already trained model; it must be combined with the -t flag, which specifies the target clip id in CLIP_X format, where X is the id of the track in the MagnaTagATune dataset.
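Under the hood, finding the closest clips amounts to a nearest-neighbour query in the 40-dimensional latent space. A sketch of what that could look like with the saved gensim model (in user mode the query vector would come from the CNN's prediction rather than a stored clip id; older gensim exposes model.dv as model.docvecs):

from gensim.models.doc2vec import Doc2Vec

model = Doc2Vec.load('data/d2vmodel.doc2vec')

# ten nearest clips by cosine similarity in the latent space
for clip_id, score in model.dv.most_similar('CLIP_2', topn=10):
    print(clip_id, score)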

At the end of each training epoch, shufl.py saves the parameters of the trained model under data/shufl.pickle. The -c flag provides checkpoint support: when used, training continues from the last persisted epoch.
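A sketch of the usual Lasagne pattern for such checkpoints (shufl.py's actual persistence code may differ):

import pickle
import lasagne

# 'network' is assumed to be the output layer of the built model
# after each epoch: persist the current parameter values
with open('data/shufl.pickle', 'wb') as f:
    pickle.dump(lasagne.layers.get_all_param_values(network), f)

# with -c: rebuild the network first, then restore the persisted parameters
with open('data/shufl.pickle', 'rb') as f:
    lasagne.layers.set_all_param_values(network, pickle.load(f))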

To train the fully fledged model we set up an EC2 instance on AWS, so we incorporated production environment support with the -p flag to easily switch between development and production environments. The environments are configured via a configuration file under config/shufl.cfg; the config file we used follows:

[production]
data = 21642  # the total number of training/validation/test samples
mel_x = 599   # the length of the time domain of the mel spectrograms
mel_y = 128   # the number of frequency bins for each sample
latent_v = 40 # the size of the latent vector (1,40)

epochs = 200   # number of training epochs
train_data = 18000  # number of training samples
val_data = 2000     # number of validation samples
test_data = 1600    # number of test samples

train_batchsize = 500 # training minibatch size
val_batchsize = 500   # validation minibatch size
test_batchsize = 100  # test minibatch size

[local]
data=1000
mel_x=599
mel_y=128
latent_v=40

epochs=2
train_data = 800
val_data = 100
test_data = 100

train_batchsize=100
val_batchsize=10
test_batchsize=10
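For reference, reading one of these sections is straightforward with Python's ConfigParser. A minimal sketch, assuming a boolean derived from the -p flag (shufl.py's actual parsing may differ, and on Python 2 the module is named ConfigParser):

from configparser import ConfigParser

# inline_comment_prefixes lets us keep the inline # comments used above
cfg = ConfigParser(inline_comment_prefixes=('#',))
cfg.read('config/shufl.cfg')

production = False  # e.g. derived from the -p/--production flag
section = 'production' if production else 'local'

epochs = cfg.getint(section, 'epochs')
train_data = cfg.getint(section, 'train_data')
train_batchsize = cfg.getint(section, 'train_batchsize')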

-s controls shuffling of the training data as it is fed into the InputLayer. Shuffling helps the model avoid moving too quickly towards specific patterns occurring in the dataset. The MagnaTagATune dataset contains multiple samples of the same track/artist in alphabetical order, so we wanted to avoid our model learning filters biased towards specific instruments, for example, which would make it hard for the network to move away from a genre while going through the training dataset. Shuffling introduces a tradeoff between the quality of the learnt filters and memory requirements, as it requires loading the whole dataset into memory, which is about ~6 GB of deserialized vectors in a production environment. We tackled this on the EC2 instance, where we were able to load the whole dataset in memory.
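A shuffled minibatch iterator in the style of the Lasagne examples illustrates the tradeoff: the arrays must already sit fully in memory for the permutation to be cheap (a sketch, not shufl.py's exact code):

import numpy as np

def iterate_minibatches(inputs, targets, batchsize, shuffle=False):
    # with shuffle=True each epoch visits the samples in a fresh random
    # order, which presumes inputs/targets are fully loaded in memory
    indices = np.arange(len(inputs))
    if shuffle:
        np.random.shuffle(indices)
    for start in range(0, len(inputs) - batchsize + 1, batchsize):
        batch = indices[start:start + batchsize]
        yield inputs[batch], targets[batch]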

Having built this utility, we can easily switch between training and validation modes; some basic examples are provided below, again assuming a local execution environment.

By default, given no flags, shufl.py trains a model using configuration parameters for a local execution environment, starting training from scratch with no shuffling:

$ python shufl.py -e 200
Loading local configuration
Loading data...
Building model and compiling functions...
Entered training mode
Starting training from scratch...
Epoch 1 of 200 took 57.074s
  training loss:    0.175772
  validation loss:    0.150191
  validation accuracy:    84.98 %
Saving model...
Epoch 2 of 200 took 60.318s
  training loss:    0.170642
  validation loss:    0.143875
  validation accuracy:    85.61 %
Saving model...
...

To execute shufl.py in a production environment, you can use the following command:

nohup python -u shufl.py --production -m train -s > log.txt.1 2> log.err.1 < /dev/null &

This will spawn a process redirecting output to a non-tty, allowing you to terminate an SSH session without killing the training process.

As stated above, user mode provides a means of evaluating a track as follows:

$ python shufl.py -m user -t CLIP_2

query: CLIP_2 'classical','strings','opera','violin'
CLIP_10204 'classical','strings','violin','cello'
CLIP_51984 'classical','strings','violin','cello'
CLIP_8713 'drums','electronic','fast','techno','beat'
CLIP_54119 'classical','strings','violin','cello'
CLIP_54125 'classical','strings','violin','cello'
CLIP_51680 'classical','strings','violin','cello'
CLIP_54825 'classical','strings','violin','cello'
CLIP_41176 'classical','strings','violin','cello'
CLIP_49620 'classical','guitar','strings'
CLIP_54824 'classical','strings','violin','cello'

You can extract the clip ids using:

$ python shufl.py -m user -t CLIP_2 | grep -o 'CLIP_\d\+' | tr '\n' ' '
CLIP_2 CLIP_53212 CLIP_10204 CLIP_51984 CLIP_41176 CLIP_51680 CLIP_54125 CLIP_49620 CLIP_37278 CLIP_33383 CLIP_54825 

And using the clips2mp3.py script, fetch the filenames:

$ python clips2mp3.py CLIP_2 CLIP_53212 CLIP_10204 CLIP_51984 CLIP_41176 CLIP_51680 CLIP_54125 CLIP_49620 CLIP_37278 CLIP_33383 CLIP_54825
data/magna/f/american_bach_soloists-j_s__bach_solo_cantatas-01-bwv54__i_aria-30-59.mp3
data/magna/c/vito_paternoster-cd1bach_sonatas_and_partitas_for_solo_violin-15-sonata_seconda_in_re_minore__andante-204-233.mp3
data/magna/1/vito_paternoster-cd1bach_cello_suites-02-suite_i_in_sol_maggiore__allemande-88-117.mp3
data/magna/c/vito_paternoster-cd1bach_sonatas_and_partitas_for_solo_violin-14-sonata_seconda_in_re_minore__fuga-291-320.mp3
data/magna/2/vito_paternoster-cd2bach_cello_suites-09-suite_iv_in_mi_bemolle_maggiore__courante-59-88.mp3
data/magna/9/vito_paternoster-cd2bach_sonatas_and_partitas_for_solo_violin-14-partita_terza_in_la_maggiore__bouree-30-59.mp3
data/magna/1/vito_paternoster-cd1bach_cello_suites-16-suite_vi_in_re_magiore__sarabande-88-117.mp3
data/magna/a/jacob_heringman-black_cow-13-bakfark_non_accedat_ad_te_malum_secunda_pars-320-349.mp3
data/magna/d/processor-insomnia-08-shiraio_pig-117-146.mp3
data/magna/9/vito_paternoster-cd2bach_sonatas_and_partitas_for_solo_violin-07-sonata_terza_in_fa_maggiore__fuga-552-581.mp3
data/magna/1/vito_paternoster-cd1bach_cello_suites-17-suite_vi_in_re_magiore__gavotte_i_e_ii-59-88.mp3
