The environment prerequisites are as follows:
- Python 3.8
The datasets utilized in our paper are as follows:
The AIST Dance Video Database (AIST Dance DB) is a shared database containing original street dance videos with copyright-cleared dance music. The database is available here.
The GTZAN dataset is a collection of 1,000 audio files spanning 10 music genres, each 30 seconds long. The audio files are available here.
The Groove MIDI Dataset (GMD) is composed of 13.6 hours of aligned MIDI and (synthesized) audio of human-performed, tempo-aligned expressive drumming. The MIDI data is available in the documentation.
The Lakh Pianoroll Dataset (LPD) is a collection of 174,154 multitrack pianorolls derived from the Lakh MIDI Dataset (LMD). We use its lpd-5-cleansed subset, which contains 21,425 five-track pianorolls.
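For reference, a pianoroll from lpd-5-cleansed can be inspected with the pypianoroll package. This is only a minimal sketch: the file path is a placeholder, and it assumes the files are in pypianoroll's standard .npz format.

```python
# Minimal sketch (assumption): lpd-5-cleansed files are pypianoroll .npz archives.
import pypianoroll

multitrack = pypianoroll.load("lpd_5_cleansed/example.npz")  # placeholder path
for track in multitrack.tracks:
    # Each of the five tracks (Drums, Piano, Guitar, Bass, Strings) is a
    # (time_steps x 128) pianoroll matrix.
    print(track.name, track.is_drum, track.pianoroll.shape)
```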
The data preprocessing steps are as follows (illustrative sketches for these steps are given after the list):
- extract human skeleton keypoints using OpenPose
- extract ground-truth music beats
- extract the log mel-scaled spectrogram
- convert the drum-track/multi-track MIDI into a token sequence
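The skeleton keypoints themselves are produced by the OpenPose binary; the sketch below only shows how its per-frame JSON output (the standard `people` / `pose_keypoints_2d` fields) might be collected into a (frames, joints, 3) array. The directory name and the 25-joint BODY_25 layout are assumptions.

```python
# Minimal sketch, assuming OpenPose was run with --write_json and the BODY_25 model.
import json
from pathlib import Path

import numpy as np

def load_openpose_keypoints(json_dir: str) -> np.ndarray:
    """Stack per-frame OpenPose keypoints into a (frames, 25, 3) array of (x, y, confidence)."""
    frames = []
    for path in sorted(Path(json_dir).glob("*_keypoints.json")):
        with open(path) as f:
            data = json.load(f)
        if data["people"]:
            kp = np.array(data["people"][0]["pose_keypoints_2d"]).reshape(-1, 3)
        else:
            kp = np.zeros((25, 3))  # no person detected in this frame
        frames.append(kp)
    return np.stack(frames)

keypoints = load_openpose_keypoints("./openpose_output/")  # placeholder path
print(keypoints.shape)
```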
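For the two audio features, one common way to obtain beat times and a log mel-scaled spectrogram is via librosa, as sketched below. The STFT/mel parameters are placeholders rather than the values used in the paper, and the beat tracker only approximates ground-truth beats (which may instead come from dataset annotations).

```python
# Minimal sketch (assumed parameters), using librosa for beats and log-mel features.
import librosa
import numpy as np

y, sr = librosa.load("music.wav", sr=22050)  # placeholder audio file

# Estimated beat positions in seconds (a proxy for ground-truth beat annotations).
tempo, beat_times = librosa.beat.beat_track(y=y, sr=sr, units="time")

# Log mel-scaled spectrogram: (n_mels x frames), in dB.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)

print(tempo, beat_times[:4], log_mel.shape)
```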
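The MIDI-to-token step depends on the repository's own vocabulary; the sketch below shows only a generic event-style encoding of a drum track (time-shift and note-on tokens on a fixed grid) using pretty_midi. The grid resolution, token names, and file path are assumptions, not the paper's exact encoding.

```python
# Minimal sketch of a generic drum-MIDI tokenization (not the paper's exact vocabulary).
import pretty_midi

def drum_midi_to_tokens(midi_path: str, steps_per_second: int = 8) -> list:
    """Encode drum note onsets as alternating TIME_SHIFT / NOTE_ON tokens on a fixed grid."""
    pm = pretty_midi.PrettyMIDI(midi_path)
    onsets = []
    for inst in pm.instruments:
        if inst.is_drum:
            onsets += [(note.start, note.pitch) for note in inst.notes]
    onsets.sort()

    tokens, prev_step = [], 0
    for start, pitch in onsets:
        step = int(round(start * steps_per_second))
        if step > prev_step:
            tokens.append(f"TIME_SHIFT_{step - prev_step}")
            prev_step = step
        tokens.append(f"NOTE_ON_{pitch}")
    return tokens

tokens = drum_midi_to_tokens("groove_example.mid")  # placeholder path
print(tokens[:10])
```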
To train the MBPN:
python ./src/MBPN/train_MBPN.py
To pre-train the dance style embedding network on the AIST Dance DB:
python ./src/SSM/train_dance_network.py
To pre-train the music style embedding network on GTZAN:
python ./src/SSM/train_music_network.py
To jointly train the dance and music style embedding networks:
python ./src/SSM/train_joint.py
To train the Drum Transformer on the Groove MIDI Dataset (GMD):
python ./src/PCMG/train_drum_Transformer.py
To train the Multi-track Transformer on the LPD:
python ./src/PCMG/train_multi_Transformer.py