This repository contains our code for using emotional voice conversion as a form of data augmentation for emotion recognition.
The general procedure is:
- Download and preprocess datasets
- Extract phoneme transcripts
- Pretrain EVC model on Common Voice
- Fine-tune EVC models for each dataset
- Train and fine-tune a HiFi-GAN vocoder
- Convert speech into each emotion category and extract features
- Run emotion recognition experiments with the augmented data
This repo is based on and contains code from other projects, notably seq2seq EVC and HiFi-GAN.
The datasets we use are IEMOCAP, MSP-IMPROV, EmoV-DB, CREMA-D, ESD, and Common Voice. Download these datasets, then use ERTK to preprocess the data, extract features, and extract phone transcriptions. For example, to preprocess CREMA-D:
cd emotion/datasets/CREMA-D
python process.py /path/to/CREMA-D
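The other emotional datasets can be preprocessed in the same way; as a sketch, assuming each of the other dataset directories under emotion/datasets/ provides an analogous process.py (the raw data paths are placeholders):
cd emotion/datasets
for corpus in IEMOCAP MSP-IMPROV EmoV-DB ESD; do
    (cd $corpus && python process.py /path/to/$corpus)
done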
Extract phone transcriptions from the text transcripts, e.g.:
cd emotion/datasets/CREMA-D
ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
cd emotion/datasets/ESD
ertk-dataset process --features phonemize --batch_size -1 transcripts_en.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
ertk-dataset process --features phonemize --batch_size -1 transcripts_zh.csv phones_ipa.csv language=cmn backend=espeak language_switch=remove-flags
For MSP-IMPROV, no official transcripts are given and a number of clips have arbitrary text content, so we use a state-of-the-art speech recogniser (wav2vec 2.0) to generate transcripts:
cd emotion/datasets/MSP-IMPROV
ertk-dataset process --features huggingface files_all.txt transcript_w2v.csv model=facebook/wav2vec2-large-960h-lv60-self
ertk-dataset process --features phonemize --batch_size -1 transcript_w2v.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
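The remaining English datasets can presumably be phonemized with the same command; this is a sketch that assumes their processed transcripts end up in a transcript.csv like CREMA-D's:
cd emotion/datasets
for corpus in IEMOCAP EmoV-DB; do
    (cd $corpus && ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags)
done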
We use a 10-language subset of Common Voice 10.0 consisting of Chinese, English, Arabic, Greek, Bangla, Farsi, Italian, Portuguese, Urdu and Estonian.
The gen_evc_data.py file in data/CommonVoice/ will generate train/validation subsets, as well as phone transcriptions and speaker information for each clip:
python gen_evc_data.py /path/to/CommonVoice
The gen_evc_data.py scripts in the data/* directories will generate train and validation subsets for training the EVC models.
The pre-train directory is modelled similarly to seq2seq EVC, so also check there for additional details.
Generate mel spectrogram standardisation info:
python mel_mean_std.py data/CommonVoice/train.txt data/CommonVoice
This will generate data/CommonVoice/mel_mean_std.npy, containing the mean and standard deviation of each mel band over the whole training data.
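The fine-tuning datasets need the same statistics (e.g. the CREMA-D commands below reference ../data/CREMA-D/mel_mean_std.npy); a sketch, assuming each data/<corpus>/ directory has a train.txt generated by its gen_evc_data.py script:
for corpus in CREMA-D IEMOCAP MSP-IMPROV EmoV-DB ESD; do
    python mel_mean_std.py data/$corpus/train.txt data/$corpus
done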
You can train with the following command; adjust it as needed for your setup:
python train.py \
--output_directory out_cv_10lang \
--hparams distributed_run=True,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
--n_gpus 4
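For a single-GPU setup, the same command can presumably be run with distributed_run=False and --n_gpus 1 (batch_size may need to be reduced to fit in memory); this is an assumption based on the hparams shown above:
python train.py \
--output_directory out_cv_10lang \
--hparams distributed_run=False,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
--n_gpus 1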
We provide model weights for a pretrained model here.
Use the gen_embedding.py file to generate initial emotion embeddings from the pretrained model. For CREMA-D:
cd conversion
for emo in anger disgust fear happiness neutral sadness; do \
python gen_embedding.py \
--checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
--hparams mel_mean_std=../data/CREMA-D/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
--input ../data/CREMA-D/train_${emo}.txt \
--output embeddings/CREMA-D/cv_10lang/${emo}.npy; \
done
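The other datasets are handled analogously; for example, for the 4-class IEMOCAP setup used in the conversion example below (this sketch assumes data/IEMOCAP/ has been prepared in the same way, with per-emotion train_${emo}.txt lists and mel statistics):
for emo in anger happiness neutral sadness; do \
python gen_embedding.py \
--checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
--hparams mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
--input ../data/IEMOCAP/train_${emo}.txt \
--output embeddings/IEMOCAP/${emo}.npy; \
done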
The fine-tuning commands are given in conversion/run_train.sh.
Next, train the HiFi-GAN vocoder on the Common Voice data:
cd hifi-gan
python train.py --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang --checkpoint_interval 2000 --validation_interval 2000 --summary_interval 200
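Training progress can be monitored with TensorBoard; the logs subdirectory under the checkpoint path is an assumption based on the standard HiFi-GAN setup:
tensorboard --logdir cp/v1_cv_10lang/logs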
We provide a HiFi-GAN model pretrained on the Common Voice data here.
To fine-tune HiFi-GAN on the EVC model outputs, first generate forward outputs from the EVC model using gen_fwd_mels.py:
cd pre-train
for split in train dev; do
python gen_fwd_mels.py --checkpoint out_cv_10lang/logdir/checkpoint_84000 --output_dir out_cv_10lang/fwd_mels --input_list ../data/CommonVoice/${split}.txt --hparams training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=../data/CommonVoice/mel_mean_std.npy,phones_csv=../data/CommonVoice/phones_ipa.csv,speaker_csv=../data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315
done
To finetune HiFi-GAN, copy the pretrained checkpoint and use the fine-tuning version of the script:
cd hifi-gan
mkdir cp/v1_cv_10lang_ft_cv_10lang
cp cp/v1_cv_10lang/{g,do}_00152000 cp/v1_cv_10lang_ft_cv_10lang/
python train.py --fine_tuning True --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --input_mels_dir ../pre-train/out_cv_10lang/fwd_mels --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang_ft_cv_10lang/ --checkpoint_interval 2000 --validation_interval 1000 --summary_interval 200
We provide a HiFi-GAN model fine-tuned on EVC model outputs from the Common Voice data here.
Use convert_all.py to convert a set of speech files into all emotion categories:
cd conversion
python convert_all.py \
--checkpoint_path out_ft_IEMOCAP_en_4class/logdir/checkpoint_13651 \
--input_list ../emotion/datasets/MSP-IMPROV/files_neutral.txt \
--output_dir ../augmentation/datasets/MSP-IMPROV_aug/IEMOCAP_evc/hifi-gan_v1_ft_mel_vocoded/ \
--wav \
--hifi_gan_path ../hifi-gan/cp/v1_cv_10lang_ft_cv_10lang/g_00496000 \
--hparams \""emo_list=[anger,happiness,neutral,sadness]"\",emo_embedding_dir=embeddings/IEMOCAP/,mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315
Extract features for the real speech, where $corpus is the name of each dataset:
cd emotion
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN
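For example, looping over all of the emotional corpora:
for corpus in IEMOCAP MSP-IMPROV EmoV-DB CREMA-D ESD; do \
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN; \
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN; \
done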
Similarly, extract features for the synthetic speech:
cd augmentation/datasets/MSP-IMPROV_aug/MSP-IMPROV_evc
find hifi-gan_v1_ft_mel_vocoded -type f | sort > files.txt
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 files.txt wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
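The audeering embeddings can presumably be extracted from the synthetic speech in the same way, mirroring the command used for the real speech above:
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 files.txt wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN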
The synthetic speech can be aligned to the original dataset using align_aug_annots.py:
cd augmentation/datasets
python align_aug_annots.py IEMOCAP_aug --original ../../emotion/datasets/IEMOCAP/corpus.yaml --annot speaker --annot session
This will generate a corpus.yaml file for each augmented dataset and create the appropriate metadata, including subsets for each EVC model and the specified annotations.
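The other augmented datasets can be aligned in the same way; for example, for MSP-IMPROV (the annotations to copy are an assumption and should match those in the original corpus.yaml):
python align_aug_annots.py MSP-IMPROV_aug --original ../../emotion/datasets/MSP-IMPROV/corpus.yaml --annot speaker --annot session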
The experiments can be generated using the gen_evc_runs.sh script for each feature set. You can then run multiple experiments in parallel:
bash gen_evc_runs.sh > jobs.txt
ertk-util parallel_jobs jobs.txt --cpus 24 --failed failed.txt