Implementation of "Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning"
https://arxiv.org/abs/2206.07229
Ubuntu 18.04.5 LTS
- GPU: Quadro RTX 6000
- Driver version: 450.80.02
- CUDA version: 11.0
Python 3.5
- tensorflow-gpu 2.0.0b1 (cudnn=7.6.0)
- scipy
- pandas
- matplotlib
- librosa
For example, to set up the environment:

    conda create -n strengthnet python=3.5
    conda activate strengthnet
    pip install -r requirements.txt
    conda install cudnn=7.6.0
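The contents of requirements.txt are not reproduced here; a plausible version, simply listing the dependencies named above, might look like the sketch below (the h5py entry is an assumption, added because the features are stored as .h5 files):

```text
tensorflow-gpu==2.0.0b1
scipy
pandas
matplotlib
librosa
h5py  # assumption: used to read/write the .h5 feature files
```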
- Run `python utils.py` to extract the .wav files into .h5 feature files (a sketch of this step follows the list);
- Run `python train.py` to train the CNN-BLSTM based StrengthNet (a simplified model sketch also follows).
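As a rough illustration of the extraction step, here is a minimal sketch of turning a folder of .wav files into .h5 feature files. It assumes log-mel-spectrogram features and h5py storage; the sampling rate, mel-band count, directory names, and dataset key are assumptions, not the repo's exact utils.py:

```python
# A minimal sketch, assuming log-mel features and h5py storage; not the repo's utils.py.
import glob
import os

import h5py
import librosa
import numpy as np

SR = 16000   # assumed sampling rate
N_MELS = 80  # assumed number of mel bands


def extract_to_h5(wav_dir, h5_dir):
    """Extract a log-mel spectrogram from each .wav and save it as an .h5 file."""
    os.makedirs(h5_dir, exist_ok=True)
    for wav_path in sorted(glob.glob(os.path.join(wav_dir, "*.wav"))):
        y, _ = librosa.load(wav_path, sr=SR)
        mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
        logmel = np.log(mel + 1e-8).T.astype(np.float32)  # (frames, n_mels)
        name = os.path.splitext(os.path.basename(wav_path))[0]
        with h5py.File(os.path.join(h5_dir, name + ".h5"), "w") as f:
            f.create_dataset("mel", data=logmel)  # dataset key is an assumption


if __name__ == "__main__":
    extract_to_h5("wav", "h5")  # hypothetical input/output directories
```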
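And here is a minimal sketch of a CNN-BLSTM strength predictor in tf.keras, in the spirit of MOSNet-style frame-level scoring: convolutions downsample the mel axis, a bidirectional LSTM models the frame sequence, and frame-level strength predictions are averaged into an utterance-level score. Layer counts and sizes are assumptions, not the paper's exact architecture (which also includes an auxiliary emotion-prediction branch in a multi-task setup):

```python
# A simplified CNN-BLSTM sketch in tf.keras; layer sizes and pooling are assumptions,
# not the exact StrengthNet architecture.
import tensorflow as tf
from tensorflow.keras import Model, layers

N_MELS = 80  # assumed mel dimension, matching the extraction sketch


def build_strength_model(n_mels=N_MELS):
    # variable-length mel-spectrogram input: (frames, n_mels, 1)
    inp = layers.Input(shape=(None, n_mels, 1))
    # CNN encoder: downsample the frequency axis, keep the time axis intact
    x = layers.Conv2D(16, 3, padding="same", activation="relu")(inp)
    x = layers.Conv2D(16, 3, strides=(1, 4), padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, strides=(1, 4), padding="same", activation="relu")(x)
    # flatten the remaining frequency bands into one feature vector per frame
    x = layers.Reshape((-1, (n_mels // 16) * 32))(x)
    # BLSTM over the frame sequence
    x = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(x)
    # frame-level strength, averaged into an utterance-level score
    frame_scores = layers.TimeDistributed(layers.Dense(1))(x)
    utt_score = layers.GlobalAveragePooling1D()(frame_scores)
    return Model(inp, utt_score)


model = build_strength_model()
model.compile(optimizer="adam", loss="mse")  # MSE against annotated strength scores
```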
- Put the waveforms you wish to evaluate in a folder, for example `<path>/<to>/<samples>`;
- Run `python test.py --rootdir <path>/<to>/<samples>` to evaluate all the .wav files in `<path>/<to>/<samples>` and write the results to `<path>/<to>/<samples>/StrengthNet_result_raw.txt`. By default, the pretrained model `output/strengthnet.h5` is used. A sketch of this evaluation loop follows.
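For reference, the evaluation step could look roughly like this: load the pretrained model, score every .wav under `--rootdir`, and write one line per file. The feature parameters match the extraction sketch above; the output format and the use of tf.keras.models.load_model are assumptions, not the repo's exact test.py:

```python
# A hypothetical sketch of the evaluation loop; output format and model loading
# are assumptions, not the repo's exact test.py.
import argparse
import glob
import os

import librosa
import numpy as np
import tensorflow as tf

SR, N_MELS = 16000, 80  # assumed, matching the extraction sketch above

parser = argparse.ArgumentParser()
parser.add_argument("--rootdir", required=True, help="folder containing .wav files")
args = parser.parse_args()

model = tf.keras.models.load_model("output/strengthnet.h5")

results = []
for wav_path in sorted(glob.glob(os.path.join(args.rootdir, "*.wav"))):
    y, _ = librosa.load(wav_path, sr=SR)
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_mels=N_MELS)
    feat = np.log(mel + 1e-8).T[np.newaxis, :, :, np.newaxis]  # (1, frames, n_mels, 1)
    score = float(model.predict(feat)[0])
    results.append("{} {:.3f}".format(os.path.basename(wav_path), score))

with open(os.path.join(args.rootdir, "StrengthNet_result_raw.txt"), "w") as f:
    f.write("\n".join(results) + "\n")
```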
If you find this work useful in your research, please consider citing:
@misc{liu2021strengthnet,
  title={StrengthNet: Deep Learning-based Emotion Strength Assessment for Emotional Speech Synthesis},
  author={Rui Liu and Berrak Sisman and Haizhou Li},
  year={2021},
  eprint={2110.03156},
  archivePrefix={arXiv},
  primaryClass={cs.SD}
}
The ESD corpus is released by the HLT lab, NUS, Singapore.
The strength scores for the English samples of the ESD corpus are available here.
- MOSNet: https://github.com/lochenchou/MOSNet
- Relative Attributes (Parikh and Grauman, ICCV 2011)
This work is released under the MIT License (see the LICENSE file for details).