This repository contains our code for using emotional voice conversion as a form of data augmentation for emotion recognition.
The general procedure is:
- Download and preprocess datasets
- Extract phoneme transcripts
- Pretrain EVC model on Common Voice
- Fine-tune EVC models for each dataset
- Train and fine-tune a HiFi-GAN vocoder
- Convert speech into each emotion category and extract features
- Run emotion recognition experiments with the augmented data
This repo is based on and contains code from other projects, notably seq2seq EVC and HiFi-GAN.
The datasets we use are IEMOCAP, MSP-IMPROV, EmoV-DB, CREMA-D, ESD, and Common Voice. Download these datasets, then use ERTK to preprocess the data, extract features, and extract phone transcriptions. For example, to preprocess CREMA-D:
cd emotion/datasets/CREMA-D
python process.py /path/to/CREMA-D
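The other emotional datasets can be preprocessed in the same way; as a sketch, assuming each of the other dataset directories under emotion/datasets/ provides an analogous process.py (the raw data paths are placeholders):
cd emotion/datasets
for corpus in IEMOCAP MSP-IMPROV EmoV-DB ESD; do
    (cd $corpus && python process.py /path/to/$corpus)
done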
Extract phone transcriptions from the text transcripts, e.g.:
cd emotion/datasets/CREMA-D
ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
cd emotion/datasets/ESD
ertk-dataset process --features phonemize --batch_size -1 transcripts_en.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
ertk-dataset process --features phonemize --batch_size -1 transcripts_zh.csv phones_ipa.csv language=cmn backend=espeak language_switch=remove-flags
For MSP-IMPROV, no official transcripts are given and a number of clips have arbitrary text content, so we use a state-of-the-art speech recogniser (wav2vec 2.0) to generate transcripts:
cd emotion/datasets/MSP-IMPROV
ertk-dataset process --features huggingface files_all.txt transcript_w2v.csv model=facebook/wav2vec2-large-960h-lv60-self
ertk-dataset process --features phonemize --batch_size -1 transcript_w2v.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
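The remaining English datasets can presumably be phonemized with the same command; this is a sketch that assumes their processed transcripts end up in a transcript.csv like CREMA-D's:
cd emotion/datasets
for corpus in IEMOCAP EmoV-DB; do
    (cd $corpus && ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags)
done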
We use a 10-language subset of Common Voice 10.0 consisting of Chinese, English, Arabic, Greek, Bangla, Farsi, Italian, Portuguese, Urdu and Estonian.
The gen_evc_data.py file in data/CommonVoice/ will generate train/validation subsets, as well as phone transcriptions and speaker information for each clip:
python gen_evc_data.py /path/to/CommonVoice
The gen_evc_data.py scripts in the data/* directories will generate train and validation subsets for training the EVC models.
The pre-train directory is modelled similarly to seq2seq EVC, so also check there for additional details.
Generate mel spectrogram standardisation info:
python mel_mean_std.py data/CommonVoice/train.txt data/CommonVoice
This will generate data/CommonVoice/mel_mean_std.npy, containing the mean and standard deviation of each mel band over the whole training data.
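The fine-tuning datasets need the same statistics (e.g. the CREMA-D commands below reference ../data/CREMA-D/mel_mean_std.npy); a sketch, assuming each data/<corpus>/ directory has a train.txt generated by its gen_evc_data.py script:
for corpus in CREMA-D IEMOCAP MSP-IMPROV EmoV-DB ESD; do
    python mel_mean_std.py data/$corpus/train.txt data/$corpus
done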
You can train with the following command; adjust it as needed for your setup:
python train.py \
--output_directory out_cv_10lang \
--hparams distributed_run=True,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
--n_gpus 4
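For a single-GPU setup, the same command can presumably be run with distributed_run=False and --n_gpus 1 (batch_size may need to be reduced to fit in memory); this is an assumption based on the hparams shown above:
python train.py \
--output_directory out_cv_10lang \
--hparams distributed_run=False,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
--n_gpus 1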
We provide model weights for a pretrained model here.
Use the gen_embedding.py file to generate initial emotion embeddings from the pretrained model. For CREMA-D:
cd conversion
for emo in anger disgust fear happiness neutral sadness; do \
python gen_embedding.py \
--checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
--hparams mel_mean_std=../data/CREMA-D/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
--input ../data/CREMA-D/train_${emo}.txt \
--output embeddings/CREMA-D/cv_10lang/${emo}.npy; \
done
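The other datasets are handled analogously; for example, for the 4-class IEMOCAP setup used in the conversion example below (this sketch assumes data/IEMOCAP/ has been prepared in the same way, with per-emotion train_${emo}.txt lists and mel statistics):
for emo in anger happiness neutral sadness; do \
python gen_embedding.py \
--checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
--hparams mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
--input ../data/IEMOCAP/train_${emo}.txt \
--output embeddings/IEMOCAP/${emo}.npy; \
done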
The fine-tuning commands are given in conversion/run_train.sh.
Next, train the HiFi-GAN vocoder on the Common Voice data:
cd hifi-gan
python train.py --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang --checkpoint_interval 2000 --validation_interval 2000 --summary_interval 200
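Training progress can be monitored with TensorBoard; the logs subdirectory under the checkpoint path is an assumption based on the standard HiFi-GAN setup:
tensorboard --logdir cp/v1_cv_10lang/logs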
We provide a HiFi-GAN model pretrained on the Common Voice data here.
To fine-tune HiFi-GAN on the EVC model outputs, first generate forward outputs from the EVC model using gen_fwd_mels.py:
cd pre-train
for split in train dev; do
python gen_fwd_mels.py --checkpoint out_cv_10lang/logdir/checkpoint_84000 --output_dir out_cv_10lang/fwd_mels --input_list ../data/CommonVoice/${split}.txt --hparams training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=../data/CommonVoice/mel_mean_std.npy,phones_csv=../data/CommonVoice/phones_ipa.csv,speaker_csv=../data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315
done
To finetune HiFi-GAN, copy the pretrained checkpoint and use the fine-tuning version of the script:
cd hifi-gan
mkdir cp/v1_cv_10lang_ft_cv_10lang
cp cp/v1_cv_10lang/{g,do}_00152000 cp/v1_cv_10lang_ft_cv_10lang/
python train.py --fine_tuning True --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --input_mels_dir ../pre-train/out_cv_10lang/fwd_mels --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang_ft_cv_10lang/ --checkpoint_interval 2000 --validation_interval 1000 --summary_interval 200
We provide a HiFi-GAN model fine-tuned on EVC model outputs from the Common Voice data here.
Use convert_all.py to convert a set of speech files into all emotion categories:
cd conversion
python convert_all.py \
--checkpoint_path out_ft_IEMOCAP_en_4class/logdir/checkpoint_13651 \
--input_list ../emotion/datasets/MSP-IMPROV/files_neutral.txt \
--output_dir ../augmentation/datasets/MSP-IMPROV_aug/IEMOCAP_evc/hifi-gan_v1_ft_mel_vocoded/ \
--wav \
--hifi_gan_path ../hifi-gan/cp/v1_cv_10lang_ft_cv_10lang/g_00496000 \
--hparams \""emo_list=[anger,happiness,neutral,sadness]"\",emo_embedding_dir=embeddings/IEMOCAP/,mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315
Extract features for the real speech, where $corpus is the name of each dataset:
cd emotion
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN
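For example, looping over all of the emotional corpora:
for corpus in IEMOCAP MSP-IMPROV EmoV-DB CREMA-D ESD; do \
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN; \
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN; \
done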
Similarly, extract features for the synthetic speech:
cd augmentation/datasets/MSP-IMPROV_aug/MSP-IMPROV_evc
find hifi-gan_v1_ft_mel_vocoded -type f | sort > files.txt
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 files.txt wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
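The audeering embeddings can presumably be extracted from the synthetic speech in the same way, mirroring the command used for the real speech above:
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 files.txt wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN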
The synthetic speech can be aligned to the original dataset using align_aug_annots.py:
cd augmentation/datasets
python align_aug_annots.py IEMOCAP_aug --original ../../emotion/datasets/IEMOCAP/corpus.yaml --annot speaker --annot session
This will generate a corpus.yaml file for each augmented dataset and create the appropriate metadata, including subsets for each EVC model and the specified annotations.
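The other augmented datasets can be aligned in the same way; for example, for MSP-IMPROV (the annotations to copy are an assumption and should match those in the original corpus.yaml):
python align_aug_annots.py MSP-IMPROV_aug --original ../../emotion/datasets/MSP-IMPROV/corpus.yaml --annot speaker --annot session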
The experiments can be generated using the gen_evc_runs.sh script for each feature set. You can then run multiple experiments in parallel:
bash gen_evc_runs.sh > jobs.txt
ertk-util parallel_jobs jobs.txt --cpus 24 --failed failed.txt