
Emotional voice conversion as data augmentation

This repository contains our code for using emotional voice conversion as a form of data augmentation for emotion recognition.

The general procedure is:

  1. Download and preprocess datasets
  2. Extract phoneme transcripts
  3. Pretrain EVC model on Common Voice
  4. Fine-tune EVC models for each dataset

This repo is based on and contains code from other projects, notably seq2seq EVC (see pre-train/) and HiFi-GAN (see hifi-gan/).

Data

The datasets we use are IEMOCAP, MSP-IMPROV, EmoV-DB, CREMA-D, ESD and Common Voice. Please download these datasets, then use ERTK to preprocess the data, extract features, and extract phone transcriptions.

Preprocessing

Download the datasets and preprocess using ERTK. For example:

cd emotion/datasets/CREMA-D
python process.py /path/to/CREMA-D
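
The same pattern applies to the other corpora. A hedged sketch, assuming each dataset directory in emotion/datasets/ ships a process.py that takes the raw corpus root as its only argument (check each script for extra options):

# Run each dataset's preprocessing script from the repository root.
# Adjust /path/to/$corpus to wherever the raw data was downloaded.
for corpus in IEMOCAP MSP-IMPROV EmoV-DB CREMA-D ESD; do
    (cd emotion/datasets/$corpus && python process.py /path/to/$corpus)
done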

Phonemization

Extract phone transcriptions of the transcripts, e.g.:

cd emotion/datasets/CREMA-D
ertk-dataset process --features phonemize --batch_size -1 transcript.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags

cd emotion/datasets/ESD
ertk-dataset process --features phonemize --batch_size -1 transcripts_en.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags
ertk-dataset process --features phonemize --batch_size -1 transcripts_zh.csv phones_ipa.csv language=cmn backend=espeak language_switch=remove-flags

For MSP-IMPROV, no official transcripts are provided and a number of clips have arbitrary text content, so we use a state-of-the-art speech recogniser to generate transcripts:

cd emotion/datasets/MSP-IMPROV
ertk-dataset process --features huggingface files_all.txt transcript_w2v.csv model=facebook/wav2vec2-large-960h-lv60-self
ertk-dataset process --features phonemize --batch_size -1 transcript_w2v.csv phones_ipa.csv language=en-us backend=espeak language_switch=remove-flags

Common Voice

We use a 10-language subset of Common Voice 10.0 consisting of Chinese, English, Arabic, Greek, Bangla, Farsi, Italian, Portuguese, Urdu and Estonian.

The gen_evc_data.py script in data/CommonVoice/ generates train/validation subsets, as well as phone transcriptions and speaker information for each clip:

python gen_evc_data.py /path/to/CommonVoice

Emotional voice conversion

The gen_evc_data.py scripts in the data/* directories will generate train and validation subsets for training the EVC models.
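
For example, a hedged sketch assuming each per-dataset script takes the corpus path as its only positional argument, like the Common Voice one above (check each script for additional options):

cd data/IEMOCAP
python gen_evc_data.py /path/to/IEMOCAP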

Pre-training EVC models

The pre-train directory is modelled on seq2seq EVC, so also check that project for additional details.

Generate mel spectrogram standardisation info:

python mel_mean_std.py data/CommonVoice/train.txt data/CommonVoice

This will generate data/CommonVoice/mel_mean_std.npy, containing the mean and standard deviation of each mel band over the whole training set.
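
As a quick sanity check, the file can be inspected from the shell. This only assumes it is a standard NumPy array; the exact layout of the mean and standard deviation within it may differ:

# Print the shape and dtype of the saved statistics.
python -c "import numpy as np; s = np.load('data/CommonVoice/mel_mean_std.npy'); print(s.shape, s.dtype)"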

Training

You can train with the following command; adjust it to your setup:

python train.py \
    --output_directory out_cv_10lang \
    --hparams distributed_run=True,batch_size=64,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
    --n_gpus 4
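
On a single GPU, a hedged variant is to disable distributed training and shrink the batch size (the batch size of 16 below is an assumption; tune it to fit your GPU memory):

python train.py \
    --output_directory out_cv_10lang \
    --hparams distributed_run=False,batch_size=16,training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=data/CommonVoice/mel_mean_std.npy,phones_csv=data/CommonVoice/phones_ipa.csv,speaker_csv=data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315 \
    --n_gpus 1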

We provide model weights for a pretrained model here.

Fine-tuning EVC models

Use the gen_embedding.py script to generate initial emotion embeddings from the pretrained model. For example, for CREMA-D:

cd conversion
for emo in anger disgust fear happiness neutral sadness; do \
    python gen_embedding.py \
        --checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
        --hparams mel_mean_std=../data/CREMA-D/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
        --input ../data/CREMA-D/train_${emo}.txt \
        --output embeddings/CREMA-D/cv_10lang/${emo}.npy; \
done
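
A hedged sketch of the analogous step for IEMOCAP, covering the four classes converted later in this README; it assumes per-emotion train lists under data/IEMOCAP/ mirroring the CREMA-D layout:

for emo in anger happiness neutral sadness; do \
    python gen_embedding.py \
        --checkpoint_path ../pre-train/out_cv_10lang/logdir/checkpoint_84000 \
        --hparams mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315 \
        --input ../data/IEMOCAP/train_${emo}.txt \
        --output embeddings/IEMOCAP/cv_10lang/${emo}.npy; \
done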

The fine-tuning commands are given in conversion/run_train.sh.

HiFi-GAN Pretraining

cd hifi-gan
python train.py --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang --checkpoint_interval 2000 --validation_interval 2000 --summary_interval 200

We provide a HiFi-GAN model pretrained on the Common Voice data here.

HiFi-GAN Fine-tuning

First, generate forward mel spectrogram outputs from the EVC model using gen_fwd_mels.py:

cd pre-train
for split in train dev; do
    python gen_fwd_mels.py --checkpoint out_cv_10lang/logdir/checkpoint_84000 --output_dir out_cv_10lang/fwd_mels --input_list ../data/CommonVoice/${split}.txt --hparams training_list=../data/CommonVoice/train.txt,validation_list=../data/CommonVoice/dev.txt,mel_mean_std=../data/CommonVoice/mel_mean_std.npy,phones_csv=../data/CommonVoice/phones_ipa.csv,speaker_csv=../data/CommonVoice/speaker.csv,n_speakers=1967,n_symbols=315
done

To fine-tune HiFi-GAN, copy the pretrained generator and discriminator checkpoints and run train.py with --fine_tuning True:

cd hifi-gan
mkdir cp/v1_cv_10lang_ft_cv_10lang
cp cp/v1_cv_10lang/{g,do}_00152000 cp/v1_cv_10lang_ft_cv_10lang/
python train.py --fine_tuning True --input_training_file ../data/CommonVoice/train.txt --input_validation_file ../data/CommonVoice/dev.txt --input_mels_dir ../pre-train/out_cv_10lang/fwd_mels --config config_v1_16k_0.05_0.0125.json --checkpoint_path cp/v1_cv_10lang_ft_cv_10lang/ --checkpoint_interval 2000 --validation_interval 1000 --summary_interval 200

We provide a HiFi-GAN model fine-tuned on EVC model outputs from the Common Voice data here.

Emotion conversion

Use convert_all.py to convert a set of speech files into all emotion categories:

cd conversion
python convert_all.py \
    --checkpoint_path out_ft_IEMOCAP_en_4class/logdir/checkpoint_13651 \
    --input_list ../emotion/datasets/MSP-IMPROV/files_neutral.txt \
    --output_dir ../augmentation/datasets/MSP-IMPROV_aug/IEMOCAP_evc/hifi-gan_v1_ft_mel_vocoded/ \
    --wav \
    --hifi_gan_path ../hifi-gan/cp/v1_cv_10lang_ft_cv_10lang/g_00496000 \
    --hparams \""emo_list=[anger,happiness,neutral,sadness]"\",emo_embedding_dir=embeddings/IEMOCAP/,mel_mean_std=../data/IEMOCAP/mel_mean_std.npy,pretrain_n_speakers=1967,n_symbols=315

Augmentation experiments

Extract features for real speech ($corpus below stands for the dataset name, e.g. CREMA-D):

cd emotion
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 datasets/$corpus/files_all.txt features/$corpus/wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN

and for synthetic speech:

cd augmentation/datasets/MSP-IMPROV_aug/MSP-IMPROV_evc
find hifi-gan_v1_ft_mel_vocoded -type f | sort > files.txt
ertk-dataset process --features fairseq --corpus $corpus --sample_rate 16000 files.txt wav2vec_c_mean.nc model_type=wav2vec checkpoint=/path/to/wav2vec_large.pt layer=context aggregate=MEAN
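
The second feature set can be extracted for the synthetic clips in the same way as for real speech above; a hedged example reusing the same options:

ertk-dataset process --features huggingface --corpus $corpus --sample_rate 16000 files.txt wav2vec_audeering_ft_c_mean.nc model=audeering/wav2vec2-large-robust-12-ft-emotion-msp-dim task=EMBEDDINGS layer=context agg=MEAN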

The synthetic speech can be aligned to the original dataset using align_aug_annots.py:

cd augmentation/datasets
python align_aug_annots.py IEMOCAP_aug --original ../../emotion/datasets/IEMOCAP/corpus.yaml --annot speaker --annot session

This will generate a corpus.yaml file for each augmented dataset and create the appropriate metadata, including subsets for each EVC model and specified annotations.
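
The same alignment can be run for the other augmented datasets; a hedged example for MSP-IMPROV (the --annot names are assumptions and should match the annotations defined in the original corpus.yaml):

python align_aug_annots.py MSP-IMPROV_aug --original ../../emotion/datasets/MSP-IMPROV/corpus.yaml --annot speaker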

Experiments

The experiments can be generated using the gen_evc_runs.sh script for each feature set. You can then run multiple experiments in parallel:

bash gen_evc_runs.sh > jobs.txt
ertk-util parallel_jobs jobs.txt --cpus 24 --failed failed.txt
