- Author: Yueh-Kao Wu, Ching-Yu Chiu, Yi-Hsuan Yang
- This repository contains the official implementation of the following paper: JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE [arxiv] [demo]
- JukeDrummer is a project on drum accompaniment generation: given a song in which percussion instruments are completely absent (a drumless song) as input, the generated drum accompaniment should not only sound consistent with the input but also sound like real drums.
- We use a joint dataset consisting of three multi-track datasets: MUSDB18, MedleyDB, and MixingSecret, after removing several duplicated songs from the combined collection.
- We put our results on the demo page. For more examples, please visit the site.
- Python version >= 3.6
- Install dependencies
pip3 install -r requirements.txt
- GPU with >10 GB RAM (optional, but recommended)
The script below downloads the checkpoints to the ckpt/ folder.
bash script/get_ckpt.sh
The model loads the pre-trained parameters from ckpt/ at inference time.
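If you want to sanity-check a downloaded checkpoint manually, a minimal sketch with PyTorch is shown below; the file path is only an assumption about the layout inside ckpt/, and inference.py loads the parameters for you during normal use.

```python
# Minimal sketch: inspect a downloaded checkpoint. The path below is an
# assumption about the layout inside ckpt/; inference.py handles loading
# automatically during normal use.
import torch

ckpt_path = "ckpt/1/model.pt"  # hypothetical file name
state = torch.load(ckpt_path, map_location="cpu")

# A checkpoint is usually a state_dict (or a dict containing one);
# printing the top-level keys shows what was saved.
print(list(state.keys())[:10] if isinstance(state, dict) else type(state))
```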
python3 inference.py \
--exp_idx \ # id of the checkpoint whose pre-trained parameters are loaded
--cuda \ # CUDA device id
--input_dir \ # directory of input drumless audio
--output_dir \ # directory for output audio
--sample_iters \ # number of sampling iterations
Note that exp_idx can be chosen from 1, 2, 11, or 12:
- 1: Transformer encoder/decoder + low-level beat information
- 2: Transformer encoder/decoder
- 11: Transformer encoder + low-level beat information
- 12: Transformer encoder
According to our experiments, the model with checkpoint exp_idx=1 performs best on both subjective and objective metrics.
(For more configuration settings, please refer to hparams.py.)
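For example, to run inference with the best-performing checkpoint (the directory names and the number of sampling iterations below are placeholder values; substitute your own):
python3 inference.py --exp_idx 1 --cuda 0 --input_dir ./drumless_songs --output_dir ./generated --sample_iters 100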
- Run the script below to create the required directories. To speed up training, several intermediate files will be generated and stored in these directories.
bash script/build_folder.sh
- Every raw wave file should be separated into a drum track and a drumless track. Put the drum tracks into the audio/target folder and the drumless tracks into the audio/others folder.
- The preprocessing has 4 stages:
  - Segmentation by either downbeats or a hop window
  - Extraction of Mel spectrograms from the segmented audio (see the sketch after the preprocessing command below)
  - Division of the dataset into training and validation subsets
  - Beat information extraction
- Users can run the script below to perform the whole preprocessing pipeline directly:
bash script/preprocessing.sh
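For reference, the sketch below illustrates the Mel-spectrogram stage with librosa; the sample rate, window, hop, and mel-bin values are placeholder assumptions, and the settings actually used by the preprocessing script are defined in hparams.py.

```python
# Minimal sketch of Mel-spectrogram extraction, assuming librosa. The STFT
# and mel parameters below are placeholders, not the values from hparams.py.
import librosa
import numpy as np

def extract_mel(path, sr=22050, n_fft=2048, hop_length=512, n_mels=80):
    # Load one (already segmented) audio clip as a mono waveform.
    y, _ = librosa.load(path, sr=sr, mono=True)
    # Mel-scaled power spectrogram.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels
    )
    # Convert to log scale (dB), the representation spectrogram models usually consume.
    return librosa.power_to_db(mel, ref=np.max)

mel = extract_mel("audio/others/example_segment.wav")  # hypothetical file
print(mel.shape)  # (n_mels, n_frames)
```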
- The complete training process has 5 stages:
  - Train the drum VQ-VAE
  - Train the drumless VQ-VAE
  - Extract drum tokens from the Mel spectrograms with the drum VQ-VAE
  - Extract drumless tokens from the Mel spectrograms with the drumless VQ-VAE
  - Train the language model (Transformer) on the extracted tokens
- Users can either run the script below to train the whole pipeline or run the individual commands in the script separately; a minimal sketch of the token-extraction step is given after the command.
bash script/train.sh
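To illustrate the token-extraction stage, here is a minimal sketch of how a VQ-VAE turns encoder outputs into discrete token ids via nearest-codebook lookup; the encoder, codebook size, and latent dimension below are stand-ins, not the actual models defined in this repository.

```python
# Minimal sketch of VQ-VAE token extraction: map each encoder output frame
# to the index of its nearest codebook vector. The random tensors below are
# stand-ins for a trained encoder and codebook, not the models in this repo.
import torch

def extract_tokens(latents, codebook):
    """latents: (T, D) encoder outputs; codebook: (K, D) code vectors.
    Returns a (T,) LongTensor of discrete token ids."""
    dists = torch.cdist(latents, codebook)  # pairwise distances, (T, K)
    return dists.argmin(dim=-1)             # nearest code per frame, (T,)

T, D, K = 16, 64, 512
latents = torch.randn(T, D)   # stand-in for encoder(mel_spectrogram)
codebook = torch.randn(K, D)  # stand-in for the learned VQ codebook
tokens = extract_tokens(latents, codebook)
print(tokens.shape, tokens.dtype)  # torch.Size([16]) torch.int64
```

Tokens of this form are what the Transformer language model is trained on in the final stage.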
However, several problems remain to be addressed in future work:
- Generalizability: generated accompaniments are worse on recordings outside our joint dataset.
- Stability: the model struggles to adjust its tempo across different sections of a song.
- Dependency: when the input provides insufficient cues for locating beats and tempo, the generated accompaniment degrades.
Feel free to email me at [email protected] if you have any problems.