Install Python >= 3.6 and PyTorch (with GPU support if desired), then install the dependencies and the package itself:

pip install -r requirements.txt
pip install -e .
Demiurge is a tripartite neural network architecture devised to generate and sequence audio waveforms (Donahue et al. 2019). The architecture combines a synthesis engine based on a una-GAN + mel-GAN / hifi-GAN model with a custom transformer-based sequencer. The diagram below explains the relation between elements.
Audio generation and sequencing neural-network-based processes work as follows:

- Modified versions of mel-GAN (a vocoder that is a convolutional non-autoregressive feed-forward adversarial network) / hifi-GAN (introducing a similar approach to mel-GAN but with one multi-scale and one multi-period discriminator) and una-GAN (an auto-regressive unconditional sound-generating boundary-equilibrium GAN) first process `.wav` audio files from an original database, `RECORDED AUDIO DB`, to produce GAN-generated `.wav` sound files, which are compiled into a new database, `RAW GENERATED AUDIO DB`.
- The descriptor in the sequencer model extracts a series of Mel-frequency cepstral coefficient (`MFCC`) `.json` strings from the audio files in the `PREDICTOR DB`, while the predictor, a time-series prediction model, generates projected descriptor sequences based on that data.
- As the predicted descriptors are just statistical values, a query engine matches them with those extracted from the `RAW GENERATED AUDIO DB` and then replaces the predicted descriptors with their matches, using the audio references from the `RAW GENERATED AUDIO DB`, merging and combining the resultant sound sequences into an output `.wav` audio file.
Please bear in mind that our model uses WandB to track and monitor training.
The chart below explains the GAN-based sound synthesis process. Please bear in mind that for ideal results the mel-GAN / hifi-GAN and una-GAN audio databases should be the same. Cross-feeding between different databases generates unpredictable (although sometimes musically interesting) results. Please record the `wandb_run_id`s for the final sound generation process.
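As a minimal sketch of how a run id can be recorded with Weights & Biases, the snippet below calls the standard `wandb` API; the project name and the logged metric are placeholders, not the ones used by this repository's notebooks.

```python
import wandb

run = wandb.init(project="demiurge-synthesis")   # placeholder project name
print("wandb_run_id:", run.id)                   # record this id for the later notebooks
run.log({"generator_loss": 0.0})                 # training metrics are logged per step
run.finish()
```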
mel-GAN (Kumar et al. 2019) is a fully convolutional non-autoregressive feed-forward adversarial network that uses mel-spectrograms as a lower-resolution audio representation that can be both efficiently computed from and inverted back to raw audio. An average mel-GAN run on Google Colab using a single V100 GPU may need a week to produce satisfactory results. The results obtained using a multi-GPU approach with parallel data vary. To train the model please use the following notebook.
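The sketch below shows the kind of mel-spectrogram representation that the vocoder inverts, using `torchaudio`; the parameter values and the input file name are illustrative assumptions, not the configuration used here.

```python
import torchaudio

# Illustrative parameters; the repository's own configuration may differ.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=22050, n_fft=1024, hop_length=256, n_mels=80
)

waveform, sample_rate = torchaudio.load("example.wav")  # assumes a 22050 Hz file
mel_spec = mel_transform(waveform)                      # (channels, n_mels, frames)
# The mel-GAN generator learns the inverse mapping: mel-spectrogram -> raw waveform.
```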
hifi-GAN (Bae et al. 2020) is a convolutional non-autoregressive adversarial network that, like mel-GAN, uses mel-spectrograms as an intermediate audio representation. The model combines a multi-receptive field fusion (MRF) generator with two different discriminators, introducing multi-period (MPD) and multi-scale (MSD) structures.
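To illustrate the idea behind the multi-period discriminator, the toy function below (not the actual hifi-GAN code) folds a 1-D waveform into a 2-D grid of width `period`, so that ordinary 2-D convolutions can look at periodic structure.

```python
import torch
import torch.nn.functional as F

def fold_for_period(waveform, period):
    """Reshape a batch of 1-D waveforms into 2-D grids of width `period`."""
    batch, channels, length = waveform.shape
    if length % period != 0:                       # pad so the length divides the period
        pad = period - (length % period)
        waveform = F.pad(waveform, (0, pad), mode="reflect")
        length += pad
    return waveform.view(batch, channels, length // period, period)

x = torch.randn(1, 1, 22050)                       # one second of dummy audio
print(fold_for_period(x, 3).shape)                 # torch.Size([1, 1, 7350, 3])
```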
una-GAN (Liu et al. 2019) is an auto-regressive unconditional sound-generating boundary-equilibrium GAN (Berthelot et al. 2017) that takes variable-length sequences of noise vectors to produce variable-length mel-spectrograms. The first una-GAN model was later revised by Liu et al. at Academia Sinica to improve the resultant audio quality by introducing a hierarchical architecture in the generator and cycle regularization to avoid mode collapse. The model produces satisfactory results after 2 days of training on a single V100 GPU. The results obtained using a multi-GPU approach with parallel data vary. To train the model please use the following notebook.
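The toy module below (not the una-GAN implementation; layer sizes are assumptions) only illustrates the basic contract of the generator: a variable-length sequence of noise vectors goes in, a mel-spectrogram with the same number of frames comes out.

```python
import torch
import torch.nn as nn

class ToyUnconditionalGenerator(nn.Module):
    """Maps a variable-length sequence of noise vectors to mel-spectrogram frames."""

    def __init__(self, noise_dim=20, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(noise_dim, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, z):              # z: (batch, frames, noise_dim)
        h, _ = self.rnn(z)
        return self.proj(h)            # (batch, frames, n_mels)

generator = ToyUnconditionalGenerator()
z = torch.randn(1, 345, 20)            # any number of frames works
print(generator(z).shape)              # torch.Size([1, 345, 80])
```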
After training mel-GAN/hifi-GAN and una-GAN, you will have to use una-GAN generate to output `.wav` audio files. Please set the `melgan_run_id` or `hifi_run_id` and the `unagan_run_id` created in the previous training steps. The output `.wav` files will be saved to the `output_dir` specified in the notebook. To generate the audio please use the following notebook. The table below provides a selection of our best `wandb_run_id`s that may be used to run una-GAN generate.
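One way to sanity-check the recorded run ids before generation is to look them up through the standard `wandb` API, as sketched below; the entity and project names, the placeholder ids, and the `output_dir` variable are assumptions, not the parameters the generation notebook actually expects.

```python
import wandb

melgan_run_id = "xxxxxxxx"         # placeholder: id recorded after mel-GAN / hifi-GAN training
unagan_run_id = "yyyyyyyy"         # placeholder: id recorded after una-GAN training
output_dir = "generated_audio/"    # placeholder: mirrors the notebook's output_dir setting

api = wandb.Api()
vocoder_run = api.run(f"my-entity/demiurge/{melgan_run_id}")   # entity/project are placeholders
unagan_run = api.run(f"my-entity/demiurge/{unagan_run_id}")
print(vocoder_run.state, unagan_run.state)                     # confirm both runs exist
```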
The sequencer combines an `MFCC` descriptor extraction model with a descriptor predictor generator and query and playback engines that generate `.wav` audio files out of those `MFCC` `.json` files. The diagram below explains the relation between the different elements of the prediction-transformer-query-playback workflow.
As outlined above, the descriptor model plays a crucial role in the prediction workflow. You may use pretrained descriptor data by selecting a `wandb_run_id` from the descriptor model, or train your own model using this notebook, following the instructions found there, to generate `MFCC` `.json` files.
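For orientation, the sketch below extracts per-frame MFCC descriptors from a `.wav` file with `librosa` and dumps them to JSON; the file names, the number of coefficients, and the JSON layout are illustrative assumptions, not the repository's format.

```python
import json

import librosa

def extract_mfcc(audio_path, n_mfcc=13, sr=22050):
    """Return a list of per-frame MFCC vectors for one audio file."""
    y, sr = librosa.load(audio_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    return mfcc.T.tolist()                                   # frames x coefficients

descriptors = extract_mfcc("example.wav")                    # hypothetical input file
with open("example_mfcc.json", "w") as f:                    # hypothetical output name
    json.dump(descriptors, f)
```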
Four different time-series predictors were implemented as training options. Both the "LSTM" and the "transformer encoder-only model" are one-step prediction models, while the "LSTM encoder-decoder model" and the "transformer model" can predict descriptor sequences of a specified length (a toy sketch of the one-step LSTM option follows the list below).
- LSTM (Hochreiter et al. 1997)
- LSTM encoder-decoder model (Cho et al. 2014)
- Transformer encoder-only model
- Transformer model (Vaswani et al. 2017)
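As a minimal sketch of the first option above, the toy model below predicts the next MFCC descriptor from a window of previous ones; the layer sizes and descriptor dimension are assumptions, not the repository's configuration.

```python
import torch
import torch.nn as nn

class OneStepLSTMPredictor(nn.Module):
    """Predicts the next MFCC descriptor from a window of previous descriptors."""

    def __init__(self, n_mfcc=13, hidden=128, num_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, num_layers, batch_first=True)
        self.head = nn.Linear(hidden, n_mfcc)

    def forward(self, x):                  # x: (batch, window_size, n_mfcc)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])       # next descriptor: (batch, n_mfcc)

model = OneStepLSTMPredictor()
window = torch.randn(4, 32, 13)            # a batch of descriptor windows
print(model(window).shape)                  # torch.Size([4, 13])
```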
Once you train the model, record the `wandb_run_id` and paste it into the prediction notebook. Then provide paths to the `RAW GENERATED AUDIO DB` and `PREDICTION DB` databases and run the notebook to generate new descriptors. The descriptors generated from the `PREDICTION DB` will be used as the input of the neural sequencer to predict subsequent descriptors, which will be converted into `.wav` audio files using the query and playback engines (see below). To train the model please use the following notebook.
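To make the prediction step concrete, the sketch below rolls a one-step predictor (such as the toy LSTM sketched earlier) forward to produce a descriptor sequence of arbitrary length; it is an illustration of the idea, not the notebook's code.

```python
import torch

@torch.no_grad()
def rollout(model, seed_window, steps):
    """Roll a one-step predictor forward to produce `steps` new descriptors."""
    window = seed_window.clone()                 # (1, window_size, n_mfcc)
    predicted = []
    for _ in range(steps):
        next_descriptor = model(window)          # (1, n_mfcc)
        predicted.append(next_descriptor)
        # Slide the window: drop the oldest frame, append the new prediction.
        window = torch.cat([window[:, 1:], next_descriptor.unsqueeze(1)], dim=1)
    return torch.cat(predicted, dim=0)           # (steps, n_mfcc)
```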
You may alternatively train the descriptor model using a database containing files in `.wav` format by running:

python desc/train_function.py --selected_model <1 of 4 models above> --audio_db_dir <path to database> --window_size <input sequence length> --forecast_size <output sequence length>
This is the workflow of the query and playback engines, which translate `MFCC` `.json` files into `.wav` audio files. This workflow partially overlaps with the instructions provided above on the descriptor predictor model.
- The descriptor model processes the `PREDICTION DB` database (see diagram above) to generate descriptor input sequences and saves them in `DESCRIPTOR DB II`. It then predicts subsequent descriptor strings based on that data.
- The model processes the audio database into `DESCRIPTOR DB I` and links each descriptor to an `ID reference` connected to the specific audio segment.
- The query function replaces the new predicted descriptors generated by the descriptor model with the closest match, based on a distance function, found in `DESCRIPTOR DB I` (a toy sketch of this matching step follows the list).
- The model combines and merges the segments referenced by the replaced descriptors from the query function into a new `.wav` audio file.
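The sketch below illustrates the matching-and-merging idea (not the repository's query engine): each predicted descriptor is replaced by its nearest neighbour under Euclidean distance, and the referenced audio segments are concatenated into one file. The function name, arguments, and use of `soundfile` are assumptions.

```python
import numpy as np
import soundfile as sf

def query_and_render(predicted, db_descriptors, db_segments, out_path, sr=22050):
    """Match each predicted descriptor to its nearest neighbour and concatenate the audio."""
    # predicted:      (n_predictions, n_mfcc) array of predicted descriptors
    # db_descriptors: (n_segments, n_mfcc) array of descriptors from DESCRIPTOR DB I
    # db_segments:    list of 1-D numpy arrays, one audio segment per descriptor
    output = []
    for descriptor in predicted:
        distances = np.linalg.norm(db_descriptors - descriptor, axis=1)  # Euclidean distance
        output.append(db_segments[int(np.argmin(distances))])
    sf.write(out_path, np.concatenate(output), sr)
```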
To train the model please use the following notebook.