This repository contains our code for using emotional voice conversion as a form of data augmentation for emotion recognition.
The general procedure is:
- Download and preprocess datasets
- Extract phoneme transcripts
- Pretrain EVC model on Common Voice
- Fine-tune EVC models for each dataset
This repo is based on and contains code from the following other projects:
The datasets we use are IEMOCAP, MSP-IMPROV, EmoV-DB, CREMA-D, ESD and Common Voice. Please download these datasets and use ERTK to preprocess the data, extract features, and extract phone transcriptions.
Download the datasets and preprocess them using ERTK. For example:
cd emotion/datasets/CREMA-D
python /path/to/CREMA-D
Extract phone transcriptions of the transcripts, e.g.:
cd emotion/datasets/CREMA-D
ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
cd emotion/datasets/ESD
ertk-dataset process --features phonemize --batch_size -1 transcripts_en.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
ertk-dataset process --features phonemize --batch_size -1 transcripts_zh.csv phones_ipa.csv language=cmn backend=espeak language_switch=remove-flags
For MSP-IMPROV, since no official transcripts are given, and a number of clips have arbitrary text content, we use a state-of-the-art speech recogniser to generate transcripts.
cd emotion/datasets/MSP-IMPROV
ertk-dataset process --features huggingface files_all.txt transcript_w2v.csv model=facebook/wav2vec2-large-960h-lv60-self
ertk-dataset process --features phonemize --batch_size -1 transcript_w2v.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
We use a 10-language subset of Common Voice 10.0 consisting of Chinese, English, Arabic, Greek, Bangla, Farsi, Italian, Portuguese, Urdu and Estonian.
The script in data/CommonVoice/ will generate train/validation subsets, as well as phone transcriptions and speaker information for each clip:
python /path/to/CommonVoice
The scripts in the data/* directories will generate train and validation subsets for training the EVC models.
The pre-train directory is modelled similarly to seq2seq EVC, so also check there for additional details.
Generate mel spectrogram standardisation info:
python data/CommonVoice/train.txt data/CommonVoice
This will generate data/CommonVoice/mel_mean_std.npy, containing the mean and standard deviation of each mel-band over the whole training data.
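As an illustration of what such per-band statistics look like, here is a minimal sketch that accumulates the mean and standard deviation of each mel-band over a set of spectrograms and then uses them to standardise one. The function names, the `(n_mels, n_frames)` input layout and the `(2, n_mels)` output layout are assumptions made for this example; the layout actually stored in `mel_mean_std.npy` may differ.

```python
import numpy as np


def mel_mean_std(mels):
    """Per-band mean and std over all frames of all utterances.

    mels: iterable of arrays shaped (n_mels, n_frames) (assumed layout).
    Returns an array shaped (2, n_mels): row 0 is the mean, row 1 the std.
    """
    total = None
    total_sq = None
    n_frames = 0
    for mel in mels:
        mel = np.asarray(mel, dtype=np.float64)
        if total is None:
            total = np.zeros(mel.shape[0])
            total_sq = np.zeros(mel.shape[0])
        # Accumulate sums so the whole dataset never has to sit in memory.
        total += mel.sum(axis=1)
        total_sq += (mel ** 2).sum(axis=1)
        n_frames += mel.shape[1]
    if n_frames == 0:
        raise ValueError("no frames given")
    mean = total / n_frames
    # Clip tiny negative values caused by floating point rounding.
    var = np.maximum(total_sq / n_frames - mean ** 2, 0.0)
    return np.stack([mean, np.sqrt(var)])


def standardise(mel, stats):
    """Standardise one (n_mels, n_frames) spectrogram band by band."""
    mean, std = stats
    return (mel - mean[:, None]) / std[:, None]
```

The result could be written with `np.save` and loaded again with `np.load`. A band with zero variance would need special handling before dividing.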
You can train with the following command; adjust it as per your setup:
python \
--output_directory out_cv_10lang \
--hparams distributed_run=True,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
--n_gpus 4
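The `n_speakers` and `n_symbols` values above depend on the data subset used. If you train on a different subset, a rough way to estimate them is to count the distinct entries in the speaker and phone files. The sketch below assumes each CSV has a header row and that phone symbols are single characters; the column names in the usage note are hypothetical, and the model may reserve extra symbols (for example padding or end-of-sequence), so treat the counts as a starting point only.

```python
import csv


def count_unique(csv_path, column):
    """Number of distinct values in one column of a CSV with a header row."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return len({row[column] for row in csv.DictReader(f)})


def count_symbols(csv_path, column):
    """Number of distinct characters across all strings in one column."""
    symbols = set()
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            # Every character (including spaces) counts as a symbol here.
            symbols.update(row[column])
    return len(symbols)
```

For example, `count_unique("data/CommonVoice/speaker.csv", "speaker")` and `count_symbols("data/CommonVoice/phones_ipa.csv", "phones")`, where `speaker` and `phones` stand in for whatever the real column headers are.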
We provide model weights for a pretrained model here.
Use the provided script to generate initial emotion embeddings from the pretrained model. For CREMA-D:
cd conversion
for emo in anger disgust fear happiness neutral sadness; do \
python \
--checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
--hparams mel_mean_std=../data/CREMA-D/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
--input ../data/CREMA-D/train_${emo}.txt \
--output embeddings/CREMA-D/cv_10lang/${emo}.npy; \
done
The fine-tuning commands are given in
cd hifi-gan
python --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang --checkpoint_interval 2000 --validation_interval 2000 --summary_interval 200
We provide a HiFi-GAN model pretrained on the Common Voice data here.
First generate forward outputs from the EVC model using
cd pre-train
for split in train dev; do
python --checkpoint out_cv_10lang/logdir/checkpoint_84000 --output_dir out_cv_10lang/fwd_mels --input_list ../data/CommonVoice/${split}.txt --hparams training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=../data/CommonVoice/mel_mean_std.npy,phones_csv=../data/CommonVoice/phones_ipa.csv,speaker_csv=../data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315
done
To fine-tune HiFi-GAN, copy the pretrained checkpoint and use the fine-tuning version of the script:
cd hifi-gan
mkdir cp/v1_cv_10lang_ft_cv_10lang
cp cp/v1_cv_10lang/{g,do}_00152000 cp/v1_cv_10lang_ft_cv_10lang/
python --fine_tuning True --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --input_mels_dir ../pre-train/out_cv_10lang/fwd_mels --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang_ft_cv_10lang/ --checkpoint_interval 2000 --validation_interval 1000 --summary_interval 200
We provide a HiFi-GAN model fine-tuned on EVC model outputs from the Common Voice data here.
to convert a set of speech files into all emotion
cd conversion
python \
--checkpoint_path out_ft_IEMOCAP_en_4class/logdir/checkpoint_13651 \
--input_list ../emotion/datasets/MSP-IMPROV/files_neutral.txt \
--output_dir ../augmentation/datasets/MSP-IMPROV_aug/IEMOCAP_evc/hifi-gan_v1_ft_mel_vocoded/ \
--wav \
--hifi_gan_path ../hifi-gan/cp/v1_cv_10lang_ft_cv_10lang/g_00496000 \
--hparams \""emo_list=[anger,happiness,neutral,sadness]"\",emo_embedding_dir=embeddings/IEMOCAP/,mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315
Extract features for real speech:
cd emotion
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/ model_type=wav2vec checkpoint=/path/to/ layer=context aggregate=MEAN
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/ model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN
and for synthetic speech:
cd augmentation/datasets/MSP-IMPROV_aug/MSP-IMPROV_evc
find hifi-gan_v1_ft_mel_vocoded -type f | sort > files.txt
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 files.txt model_type=wav2vec checkpoint=/path/to/ layer=context aggregate=MEAN
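The `aggregate=MEAN` and `agg=MEAN` options in the commands above presumably average the frame-level embeddings over time, giving one fixed-size vector per clip. A minimal sketch of that pooling step, assuming the frame-level output is an array shaped `(n_frames, dim)`; `mean_pool` is a name chosen for this example and is not ERTK's own code.

```python
import numpy as np


def mean_pool(frames):
    """Average frame-level embeddings over time.

    frames: array shaped (n_frames, dim) (assumed layout).
    Returns a vector shaped (dim,).
    """
    frames = np.asarray(frames, dtype=np.float64)
    if frames.ndim != 2 or frames.shape[0] == 0:
        raise ValueError("expected a non-empty (n_frames, dim) array")
    return frames.mean(axis=0)
```

Pooling this way makes clips of different lengths comparable, at the cost of discarding temporal ordering.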
The synthetic speech can be aligned to the original dataset using
cd augmentation/datasets
python IEMOCAP_aug --original ../../emotion/datasets/IEMOCAP/corpus.yaml --annot speaker --annot session
This will generate a corpus.yaml file for each augmented dataset and create the appropriate metadata, including subsets for each EVC model and specified annotations.
The experiments can be generated using the script for each feature set. You can then run multiple experiments in parallel:
bash > jobs.txt
ertk-util parallel_jobs jobs.txt --cpus 24 --failed failed.txt
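To make the idea behind the job file concrete, here is a minimal sketch of running one shell command per line concurrently and recording the ones that exit with a non-zero status. It is not how `ertk-util parallel_jobs` is implemented; the function names and the `max_workers` parameter are assumptions made for this example.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_jobs(commands, max_workers=4):
    """Run shell commands concurrently and return those that failed."""

    def run(cmd):
        # Each command runs in its own shell; only the exit code is kept.
        return cmd, subprocess.run(cmd, shell=True).returncode

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run, commands))
    return [cmd for cmd, code in results if code != 0]


def run_job_file(jobs_path, failed_path, max_workers=4):
    """Read one command per line, run them all, write failures to a file."""
    with open(jobs_path, encoding="utf-8") as f:
        commands = [line.strip() for line in f if line.strip()]
    failed = run_jobs(commands, max_workers=max_workers)
    with open(failed_path, "w", encoding="utf-8") as f:
        f.writelines(cmd + "\n" for cmd in failed)
    return failed
```

Threads are sufficient here because the work happens in child processes. The file of failed commands can be passed back in as a new job file to retry them.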