Deep Xi is implemented in TensorFlow 2 and is used for speech enhancement, noise estimation, mask estimation, and as a front-end for robust ASR.
Deep Xi (where the Greek letter 'xi' or ξ is pronounced /zaɪ/) is a deep learning approach to a priori SNR estimation that was proposed in [1] and is implemented in TensorFlow 2. Some of its use cases include:
- It can be used by minimum mean-square error (MMSE) approaches to speech enhancement like the MMSE short-time spectral amplitude (MMSE-STSA) estimator.
- It can be used by minimum mean-square error (MMSE) approaches to noise estimation, as in DeepMMSE [2].
- It can be used to estimate the ideal binary mask (IBM) for missing feature approaches or the ideal ratio mask (IRM).
- It can be used as a front-end for robust ASR, as shown in Figure 1.
Figure 1: Deep Xi used as a front-end for robust ASR. The back-end (Deep Speech) is available here. The noisy speech magnitude spectrogram, as shown in (a), is a mixture of clean speech with voice babble noise at an SNR level of -5 dB, and is the input to Deep Xi. Deep Xi estimates the a priori SNR, as shown in (b). The a priori SNR estimate is used to compute an MMSE approach gain function, which is multiplied elementwise with the noisy speech magnitude spectrum to produce the clean speech magnitude spectrum estimate, as shown in (c). MFCCs are computed from the estimated clean speech magnitude spectrogram, producing the estimated clean speech cepstrogram, as shown in (d). The back-end system, Deep Speech, computes the hypothesis transcript from the estimated clean speech cepstrogram, as shown in (e).
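The gain step described in Figure 1 can be illustrated with a minimal sketch. It assumes the textbook Wiener filter (WF) gain ξ/(1+ξ), its square root (SRWF, which coincides with the IRM), and an IBM formed by thresholding the a priori SNR at 0 dB. The function names and example values are hypothetical and are not taken from the Deep Xi code.

```python
import math

def wf_gain(xi):
    """Wiener filter gain for an a priori SNR value xi (linear scale)."""
    return xi / (1.0 + xi)

def srwf_gain(xi):
    """Square-root Wiener filter gain (coincides with the IRM)."""
    return math.sqrt(wf_gain(xi))

def ibm_gain(xi):
    """Binary mask: 1 where xi exceeds 0 dB (xi > 1). The threshold is an assumption."""
    return 1.0 if xi > 1.0 else 0.0

def enhance_frame(noisy_mag, xi_hat, gain_fn=wf_gain):
    """Multiply the noisy magnitude spectrum elementwise by the gain."""
    return [gain_fn(xi) * mag for xi, mag in zip(xi_hat, noisy_mag)]

# Hypothetical values for one frame with three frequency bins.
noisy_mag = [2.0, 1.0, 0.5]
xi_hat = [9.0, 1.0, 0.0]  # a priori SNR estimate (linear scale)
clean_mag_hat = enhance_frame(noisy_mag, xi_hat)
```

Bins with a high a priori SNR are passed almost unchanged, while bins with an SNR near zero are suppressed.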
A training example is shown in Figure 2. A deep neural network (DNN) within the Deep Xi framework is fed the noisy-speech short-time magnitude spectrum as input. The training target of the DNN is a mapped version of the instantaneous a priori SNR (i.e. the mapped a priori SNR). The instantaneous a priori SNR is mapped to the interval [0,1] to improve the rate of convergence of the stochastic gradient descent algorithm used for training. The map is the cumulative distribution function (CDF) of the instantaneous a priori SNR, as given by Equation (13) in [1]. The statistics for the CDF are computed over a sample of the training set. An example of the mean and standard deviation of the sample for each frequency bin is shown in Figure 3. The training examples in each mini-batch are padded to the longest sequence length in the mini-batch. The sequence mask is used by TensorFlow to ensure that the DNN is not trained on the padding. During inference, the a priori SNR estimate is computed from the mapped a priori SNR using the sample statistics and Equation (12) from [2].
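The map and its inverse can be sketched as follows. This is a sketch under an assumption: that the CDF is that of a normal distribution fitted to the instantaneous a priori SNR in dB for a given frequency bin, which is how we read the description of the per-bin mean and standard deviation. The function names and the statistics used below are hypothetical.

```python
import math
from statistics import NormalDist

def map_xi(xi, mu_db, sigma_db):
    """Map a linear a priori SNR (xi > 0) to [0, 1] using a normal CDF over its dB value."""
    xi_db = 10.0 * math.log10(xi)
    return NormalDist(mu_db, sigma_db).cdf(xi_db)

def unmap_xi(xi_bar, mu_db, sigma_db):
    """Invert the map: recover the linear a priori SNR from a mapped value in (0, 1)."""
    xi_db = NormalDist(mu_db, sigma_db).inv_cdf(xi_bar)
    return 10.0 ** (xi_db / 10.0)

# Hypothetical sample statistics for one frequency bin (in dB).
mu_db, sigma_db = -5.0, 10.0

xi = 2.0                                     # instantaneous a priori SNR
xi_bar = map_xi(xi, mu_db, sigma_db)         # training target in [0, 1]
xi_rec = unmap_xi(xi_bar, mu_db, sigma_db)   # value recovered at inference
```

In this sketch, an SNR equal to the sample mean maps to 0.5, and values far from the mean saturate towards 0 or 1.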
Figure 2: A training example for Deep Xi. Generated using
Deep Xi operates on mono/single-channel audio (not stereo/dual-channel audio). Single-channel audio is used because most cell phones use a single microphone. The available trained models operate on a sampling frequency of f_s=16000 Hz, which is currently the standard sampling frequency used in the speech enhancement community. The sampling frequency can be changed in run.sh. Deep Xi can be trained using a higher sampling frequency (e.g. f_s=44100 Hz), but this is unnecessary as human speech rarely exceeds 8 kHz (the Nyquist frequency of f_s=16000 Hz is 8 kHz). The available trained models operate on a window duration and shift of T_d=32 ms and T_s=16 ms, respectively. To train a model on a different window duration and shift, T_d and T_s can be changed in run.sh. Currently, Deep Xi supports .wav, .mp3, and .flac audio codecs. The audio codec and bit rate do not affect the performance of Deep Xi.
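As a worked example of these settings, the window duration and shift can be converted from milliseconds to samples. The names N_d, N_s, and nyquist_hz are chosen here for illustration only.

```python
f_s = 16000   # sampling frequency (Hz)
T_d = 32      # window duration (ms)
T_s = 16      # window shift (ms)

N_d = f_s * T_d // 1000   # window duration in samples
N_s = f_s * T_s // 1000   # window shift in samples
nyquist_hz = f_s / 2      # highest representable frequency (Hz)

print(N_d, N_s, nyquist_hz)  # 512 256 8000.0
```

With these values, consecutive windows overlap by half of their length.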
Open-source training and testing sets are available for Deep Xi on IEEE DataPort:
Deep Xi Training Set: http://dx.doi.org/10.21227/3adt-pb04.
Deep Xi Test Set: http://dx.doi.org/10.21227/h3xh-tm88.
Test set from the original Deep Xi paper: http://dx.doi.org/10.21227/0ppr-yy46.
The MATLAB scripts used to generate these sets can be found in set.
The following is already configured in the Deep Xi Training Set and Deep Xi Test Set.
Training set
The filenames of the waveforms in the train_clean_speech and train_noise directories are not restricted. There can be a different number of waveforms in each. The Deep Xi framework uses each of the waveforms in train_clean_speech once during an epoch. For each train_clean_speech waveform of a mini-batch, the Deep Xi framework selects a random section of a randomly selected waveform from train_noise (whose length is greater than or equal to that of the train_clean_speech waveform) and adds it to the train_clean_speech waveform at a randomly selected SNR level (the SNR level range can be set in run.sh).
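The mixing procedure can be sketched in plain Python. This is not the Deep Xi implementation: the function names, the power-based scaling of the noise, and the SNR range of -10 dB to 20 dB are assumptions made for illustration.

```python
import math
import random

def power(x):
    """Mean power of a waveform given as a list of samples."""
    return sum(s * s for s in x) / len(x)

def mix_at_snr(clean, noise_waveforms, snr_db, rng):
    """Add a random section of a randomly chosen noise waveform to `clean` at snr_db."""
    # Only noise waveforms at least as long as the clean speech are eligible.
    candidates = [n for n in noise_waveforms if len(n) >= len(clean)]
    noise = rng.choice(candidates)  # raises IndexError if none is long enough
    start = rng.randint(0, len(noise) - len(clean))  # bounds are inclusive
    section = noise[start:start + len(clean)]
    # Scale the noise so that 10*log10(P_clean / P_noise) equals snr_db.
    scale = math.sqrt(power(clean) / (power(section) * 10.0 ** (snr_db / 10.0)))
    scaled = [scale * n for n in section]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled

rng = random.Random(0)
clean = [math.sin(0.1 * i) for i in range(1000)]        # stand-in for clean speech
noise_waveforms = [
    [rng.gauss(0.0, 1.0) for _ in range(500)],          # too short, never selected
    [rng.gauss(0.0, 1.0) for _ in range(4000)],
]
snr_db = rng.uniform(-10.0, 20.0)                        # hypothetical SNR range
noisy, scaled_noise = mix_at_snr(clean, noise_waveforms, snr_db, rng)
achieved_snr_db = 10.0 * math.log10(power(clean) / power(scaled_noise))
```

Computing achieved_snr_db from the scaled noise is a simple way to confirm that the mixture was created at the requested SNR level.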
Validation set
As the validation set must not change from epoch to epoch, a set of restrictions applies to the waveforms in val_clean_speech and val_noise. There must be the same number of waveforms in val_clean_speech and val_noise. One waveform in val_clean_speech corresponds to only one waveform in val_noise, i.e. a clean speech and noise validation waveform pair. Each clean speech and noise validation waveform pair must have identical filenames and an identical number of samples. Each clean speech and noise validation waveform pair must have the SNR level (dB) that they are to be mixed at placed at the end of their filenames. The convention used is _XdB, where X is replaced with the desired SNR level, e.g. val_clean_speech/NAME_-5dB.wav and val_noise/NAME_-5dB.wav. An example of the filenames for a clean speech and noise validation waveform pair is as follows: val_clean_speech/198_19-198-0003_Machinery17_15dB.wav and val_noise/198_19-198-0003_Machinery17_15dB.wav.
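The _XdB filename convention can be read back with a short helper. This is an illustrative sketch: snr_from_filename is a hypothetical function, and it assumes that X is an integer.

```python
import re
from pathlib import Path

# Matches the _XdB suffix at the end of the filename stem (X may be negative).
SNR_SUFFIX = re.compile(r"_(-?\d+)dB$")

def snr_from_filename(path):
    """Return the SNR level (dB) encoded at the end of a validation filename."""
    match = SNR_SUFFIX.search(Path(path).stem)
    if match is None:
        raise ValueError(f"no _XdB suffix in {path!r}")
    return int(match.group(1))

print(snr_from_filename("val_clean_speech/198_19-198-0003_Machinery17_15dB.wav"))  # 15
print(snr_from_filename("val_noise/NAME_-5dB.wav"))  # -5
```

Running the same helper over both directories and comparing the filenames is a quick way to check that every clean speech waveform has a matching noise waveform.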
Test set
The filenames of the waveforms in the test_noisy_speech directory are not restricted. This is all that is required if you want inference outputs from Deep Xi, i.e. ./run.sh VER="ANY_NAME" INFER=1. If you are obtaining objective scores by using ./run.sh VER="ANY_NAME" TEST=1, then reference waveforms for the objective measures need to be placed in test_clean_speech. The waveforms in test_clean_speech and test_noisy_speech that correspond to each other must have the same number of samples (i.e. the same sequence length). The filename of the waveform in test_clean_speech that corresponds to a waveform in test_noisy_speech must be contained in the filename of that test noisy speech waveform. E.g. if the filename of a test noisy speech waveform is test_noisy_speech/61-70968-0000_SIGNAL021_-5dB.wav, then the filename of the corresponding test clean speech waveform must be test_clean_speech/61-70968-0000.wav. This is because a test clean speech waveform may be used as a reference for multiple waveforms in test_noisy_speech (e.g. test_noisy_speech/61-70968-0000_SIGNAL021_0dB.wav, test_noisy_speech/61-70968-0000_SIGNAL021_5dB.wav, and test_noisy_speech/61-70968-0000_SIGNAL021_10dB.wav are additional test noisy speech waveforms that the test clean speech waveform from the previous example is a reference for).
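The filename containment rule can be illustrated with a small lookup. This sketch only mirrors the rule as stated above. find_reference is a hypothetical helper, and the way Deep Xi itself resolves references may differ.

```python
from pathlib import Path

def find_reference(noisy_path, clean_paths):
    """Return the clean speech path whose stem is contained in the noisy filename."""
    noisy_stem = Path(noisy_path).stem
    matches = [p for p in clean_paths if Path(p).stem in noisy_stem]
    if len(matches) != 1:
        raise ValueError(f"expected one reference for {noisy_path!r}, found {len(matches)}")
    return matches[0]

clean_paths = [
    "test_clean_speech/61-70968-0000.wav",
    "test_clean_speech/61-70968-0001.wav",  # hypothetical second reference
]
ref = find_reference("test_noisy_speech/61-70968-0000_SIGNAL021_-5dB.wav", clean_paths)
print(ref)  # test_clean_speech/61-70968-0000.wav
```

Raising an error when zero or several references match makes ambiguous filenames visible before any objective scores are computed.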
Recurrent neural networks (RNNs) and temporal convolutional networks (TCNs) are available:
- ResNet: Residual network.
- ResLSTM: Residual long short-term memory network.
Deep Xi utilising a ResNet TCN (Deep Xi-ResNet) was proposed in [2]. It uses bottleneck residual blocks and a cyclic dilation rate. The network comprises approximately 2 million parameters and has a contextual field of approximately 8 seconds. An example of Deep Xi-ResNet is shown in Figure 4. A trained model for version resnet-1.0c is available in the model directory. It is trained using the Deep Xi Training Set.
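The contextual field of approximately 8 seconds can be checked with a short calculation. It assumes one dilated convolution of kernel size k per residual block, a dilation rate that cycles through 1, 2, 4, ..., max_d_rate, and a frame shift of 16 ms. These assumptions are our reading of the description and of the resnet-1.0c hyperparameters, not a statement of the exact implementation.

```python
n_blocks = 40     # number of residual blocks (resnet-1.0c)
k = 3             # kernel size of the dilated convolution
max_d_rate = 16   # maximum dilation rate
T_s_ms = 16       # frame shift (ms)

# One cycle of dilation rates: 1, 2, 4, 8, 16.
cycle = []
d = 1
while d <= max_d_rate:
    cycle.append(d)
    d *= 2

# Assumed cyclic schedule: the cycle repeats until every block has a rate.
d_rates = [cycle[i % len(cycle)] for i in range(n_blocks)]

# Each dilated convolution widens the receptive field by (k - 1) * d frames.
context_frames = 1 + sum((k - 1) * d for d in d_rates)
context_seconds = context_frames * T_s_ms / 1000

print(context_frames, context_seconds)  # 497 7.952
```

Under these assumptions the network sees 497 frames, or just under 8 seconds, which is consistent with the stated contextual field.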
Deep Xi utilising a ResLSTM network (Deep Xi-ResLSTM) was proposed in [1]. Each of its residual blocks contains a single LSTM cell. The network comprises approximately 10 million parameters.
There are multiple Deep Xi versions, comprising different networks and restrictions. An example of the ver naming convention is resnet-1.0c. The network type is given at the start of ver. Versions with c are causal. Versions with n are non-causal. The version iteration is also given, i.e. 1.0. Here are the current versions:
resnet-1.0c (available in the model directory)
d_model=256
n_blocks=40
d_f=64
k=3
max_d_rate=16
test_epoch=180
mbatch_size=8
causal=1
resnet-1.0n (technically, this is not a TCN due to the use of non-causal dilated 1D kernels)
d_model=256
n_blocks=40
d_f=64
k=3
max_d_rate=16
test_epoch=180
mbatch_size=8
causal=0
reslstm-1.0c
d_model=512
n_blocks=5
test_epoch=
mbatch_size=8
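The ver naming convention can be parsed with a regular expression, as in the sketch below. The pattern and the parse_ver helper are hypothetical and are inferred only from the version names listed above.

```python
import re

# <network>-<iteration><c|n>, e.g. resnet-1.0c
VER_PATTERN = re.compile(r"^(?P<network>[a-z]+)-(?P<iteration>\d+\.\d+)(?P<mode>[cn])$")

def parse_ver(ver):
    """Split a version string into network type, iteration, and causality."""
    match = VER_PATTERN.match(ver)
    if match is None:
        raise ValueError(f"unrecognised version string: {ver!r}")
    return {
        "network": match["network"],
        "iteration": match["iteration"],
        "causal": match["mode"] == "c",  # c: causal, n: non-causal
    }

print(parse_ver("resnet-1.0c"))
# {'network': 'resnet', 'iteration': '1.0', 'causal': True}
```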
Average objective scores obtained over the conditions in the Deep Xi Test Set. Only SNR levels between -10 dB and 20 dB are considered. MOS-LQO is the mean opinion score (MOS) objective listening quality score obtained using Wideband PESQ. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). eSTOI is extended STOI. Results for each condition can be found in log/results.
Method | Gain | Causal | MOS-LQO | PESQ | STOI | eSTOI |
---|---|---|---|---|---|---|
Deep Xi-ResNet (resnet-1.0c) | MMSE-STSA | Yes | 1.90 | 2.34 | 80.92 | 65.90 |
Deep Xi-ResNet (resnet-1.0c) | MMSE-LSA | Yes | 1.92 | 2.37 | 80.79 | 65.77 |
Deep Xi-ResNet (resnet-1.0c) | SRWF/IRM | Yes | 1.87 | 2.31 | 80.98 | 65.94 |
Deep Xi-ResNet (resnet-1.0c) | cWF | Yes | 1.92 | 2.34 | 81.11 | 65.79 |
Deep Xi-ResNet (resnet-1.0c) | WF | Yes | 1.75 | 2.21 | 78.30 | 63.96 |
Deep Xi-ResNet (resnet-1.0c) | IBM | Yes | 1.38 | 1.73 | 70.85 | 55.95 |
Objective scores obtained on the DEMAND--Voicebank test set described here. As in previous works, the objective scores are averaged over all tested conditions. CSIG, CBAK, and COVL are mean opinion score (MOS) predictors of the signal distortion, background-noise intrusiveness, and overall signal quality, respectively. PESQ is the perceptual evaluation of speech quality measure. STOI is the short-time objective intelligibility measure (in %). The highest scores attained for each measure are indicated in boldface.
Method | Causal | CSIG | CBAK | COVL | PESQ | STOI |
---|---|---|---|---|---|---|
Noisy speech | -- | 3.35 | 2.44 | 2.63 | 1.97 | 92 (91.5) |
Wiener | Yes | 3.23 | 2.68 | 2.67 | 2.22 | -- |
SEGAN | No | 3.48 | 2.94 | 2.80 | 2.16 | 93 |
WaveNet | No | 3.62 | 3.23 | 2.98 | -- | -- |
MMSE-GAN | No | 3.80 | 3.12 | 3.14 | 2.53 | 93 |
Deep Feature Loss | Yes | 3.86 | 3.33 | 3.22 | -- | -- |
Metric-GAN | No | 3.99 | 3.18 | 3.42 | 2.86 | -- |
Deep Xi-ResNet (1.0c, causal) MMSE-LSA | Yes | 4.14 | 3.32 | 3.46 | 2.77 | 93 (93.2) |
Deep Xi-ResNet (1.0n, non-causal) MMSE-LSA | No | 4.28 | 3.46 | 3.64 | 2.95 | 94 (93.6) |
Prerequisites for GPU usage:
To install:
git clone https://github.com/anicolson/DeepXi.git
virtualenv --system-site-packages -p python3 ~/venv/DeepXi
source ~/venv/DeepXi/bin/activate
cd DeepXi
pip install -r requirements.txt
Use run.sh to configure and run Deep Xi.
Inference: To perform inference and save the outputs, use the following:
./run.sh VER="resnet-1.0c" INFER=1 GAIN="mmse-lsa"
Please look in thoth/args.py for available gain functions and run.sh for further options.
Testing: To perform testing and get objective scores, use the following:
./run.sh VER="resnet-1.0c" TEST=1 GAIN="mmse-lsa"
Please look in log/results for the results.
Training:
./run.sh VER="resnet-1.0c" TRAIN=1 GAIN="mmse-lsa"
Be sure to delete the data directory before training. This allows the training lists and statistics for your training set to be saved and used. To resume training from a certain epoch, set --resume_epoch in run.sh to the desired epoch.
If you would like to contribute to Deep Xi, please investigate the following and compare it to current models:
- Currently, the ResLSTM network is not performing as well as expected (when compared to TensorFlow 1.x performance).
Please cite the following depending on what you are using:
- If using Deep Xi-ResLSTM, please cite [1].
- If using Deep Xi-ResNet, please cite [1] and [2].
- If using DeepMMSE, please cite [2].