I used DNNs (3×128 and 6×512), a GRU, and a DSCNN for a keyword spotting task.
Feature representation → Model → Posterior handling → Evaluation on test set
- Keyword: HELLO XIAOGUA
- Training set: 1642 positive examples, 9383 negative examples
- Test set: 559 positive examples, 3453 negative examples
- Feature extraction rate: computed every 10 ms over a window of 25 ms (see the sketch after this list)
- Feature shape: (frameNumbers, 40)
- Performance metrics: false-reject rate and false-positive rate
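As a hedged illustration of the feature setup above, here is a minimal sketch using the python_speech_features library; the filename and the library choice are my assumptions, not the project's actual code:

```python
import scipy.io.wavfile as wav
from python_speech_features import logfbank

# Sketch: 40-dim log filterbank features, 25 ms window, 10 ms shift.
# "example.wav" is a placeholder path.
rate, signal = wav.read("example.wav")
feats = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01, nfilt=40)
print(feats.shape)  # -> (frameNumbers, 40)
```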
To mimic the mechanism of the human ear, we stack nearby frames when labeling a specific frame, as the picture below shows:
There are two types of feature stacking: one stacks vertically (Type 2) and the other stacks horizontally (Type 1). Type 1 can be considered an image, so it is easy to perform convolution on and is used as the input of the DSCNN. Type 2 is a single vector, so it can be used as the input of the GRU or DNN.
The values of n and m (the numbers of left and right context frames, i.e. leftFrames and rightFrames) for each model are as follows; a stacking sketch follows the table.
model | n | m |
---|---|---|
DNN | 30 | 10 |
GRU, DSCNN | 15 | 5 |
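A minimal sketch of Type 2 stacking, assuming edge frames are padded by repeating the boundary frame (the padding strategy is my assumption):

```python
import numpy as np

def stack_frames(feats, n, m):
    """Concatenate n left and m right neighbor frames with each frame (Type 2)."""
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], n, axis=0),
                             feats,
                             np.repeat(feats[-1:], m, axis=0)])
    return np.stack([padded[t:t + n + m + 1].reshape(-1) for t in range(T)])

# For the DNN (n=30, m=10), each stacked vector has (30+10+1)*40 = 1640 dims,
# matching the input size in the parameter table below.
```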
Now let's take a look at the models I used.
DNN with 3 (or 6) layers, each layer having 128 (or 512) units. The output is handled in two steps (see the sketch after this list):
- Step 1: softmax
- Step 2: cross entropy
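A minimal sketch of the 3×128 DNN in TF 1.x; the activation function (ReLU) and the fused softmax/cross-entropy op are my assumptions:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1640])   # stacked features (41 frames x 40)
labels = tf.placeholder(tf.int32, [None])      # 0: filler, 1: hello, 2: xiaogua

h = x
for _ in range(3):                             # 3 hidden layers of 128 units
    h = tf.layers.dense(h, 128, activation=tf.nn.relu)
logits = tf.layers.dense(h, 3)

# Step 1 (softmax) and Step 2 (cross entropy), fused for numerical stability
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```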
```python
trainBatchSize = 100
testBatchSize = 100
leftFrames = 30
rightFrames = 10
learningRate = 0.00001
decay_rate = 0.8
numEpochs = 5
w_smooth = 3
w_max = 30
```
- Shuffle the training and test data every epoch
- Exponential learning-rate decay (see the sketch below)
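A sketch of exponential decay in TF 1.x; decay_steps and the optimizer choice are assumptions (the actual code may decay per epoch and use a different optimizer):

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(learning_rate=0.00001, global_step=global_step,
                                decay_steps=1000, decay_rate=0.8, staircase=True)
loss = tf.Variable(1.0)  # stands in for the cross-entropy loss above
train_op = tf.train.GradientDescentOptimizer(lr).minimize(
    loss, global_step=global_step)
```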
The GRU cell is built from an update gate, a reset gate, and a candidate hidden state.
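For reference, one common formulation of the GRU equations (TensorFlow's GRUCell fuses the two gate matrices, which is why the parameter table below shows a single (968, 256) gate kernel):

$$ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) $$

$$ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) $$

$$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) $$

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$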
- Loss: tensorflow.contrib.seq2seq.sequence_loss
Applies cross-entropy loss between each element of a sequence and its label. The sequence length is not fixed, so we need to pass in a mask as a filter (see the sketch below).
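A minimal sketch of the masked sequence loss in TF 1.x; the placeholder shapes are my assumptions:

```python
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, None, 3])  # (batch, time, classes)
targets = tf.placeholder(tf.int32, [None, None])      # (batch, time)
seq_len = tf.placeholder(tf.int32, [None])            # true length per utterance

# Mask out padded frames so they do not contribute to the loss
mask = tf.sequence_mask(seq_len, maxlen=tf.shape(targets)[1], dtype=tf.float32)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights=mask)
```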
modelName = "GRU" # "GRU" "DNN_6_512" "DNN_3_128"
lossFunc = "seqLoss" # "Paper" "crossEntropy"
trainBatchSize = 16
testBatchSize = 16
leftFrames = 15
shuffle = True
rightFrames = 5
learningRate = 0.001
decay_rate = 0.895
numEpochs = 60
w_smooth = 5
w_max = 70
Layer | Parameters |
---|---|
1: Conv + BatchNorm | kernel (10,4), y-stride 2, x-stride 1, output features 172 |
2: DS-Conv | kernel (3,3), y-stride 2, x-stride 2, output features 172 |
3: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
4: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
5: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
6: AvgPooling + FullyConnected | output shape (None, 3) |
Note: DS-Conv stands for Depthwise Separable Convolution, which consists of a depthwise convolution, batch norm, a pointwise convolution, and batch norm (see the sketch below).
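A minimal sketch of one DS-Conv block in TF 1.x; the ReLU placement and SAME padding are my assumptions:

```python
import tensorflow as tf

def ds_conv(x, name, out_channels=172, stride=(1, 1), training=True):
    """Depthwise conv -> batch norm -> pointwise (1x1) conv -> batch norm."""
    with tf.variable_scope(name):
        in_channels = x.get_shape().as_list()[-1]
        dw_filter = tf.get_variable("dw_filter", [3, 3, in_channels, 1])
        x = tf.nn.depthwise_conv2d(x, dw_filter,
                                   strides=[1, stride[0], stride[1], 1],
                                   padding="SAME")
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
        x = tf.layers.conv2d(x, out_channels, kernel_size=1, use_bias=False)
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
    return x
```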
The loss function is the same as the DNN's (softmax + cross entropy).
```python
trainBatchSize = 10
testBatchSize = 10
leftFrames = 15
rightFrames = 5
shuffle = True
learningRate = 0.000002
decay_rate = 0.895
numEpochs = 60
w_smooth = 5
w_max = 70
```
For the output of the model, take the following two steps (a code sketch follows):

- Smoothing

$$ p'_{ij}=\frac{1}{j-h_{smooth}+1}\sum_{k=h_{smooth}}^{j}p_{ik} $$

$$ h_{smooth}=\max\{1,\ j-w_{smooth}+1\} $$

Here i takes values in {0, 1, 2}, which stand for {filler, keyword1, keyword2}.

- Calculate confidence

$$ confidence_j=\sum_{i=1}^{n-1}\max_{h_{max}\le k\le j}p'_{ik},\qquad h_{max}=\max\{1,\ j-w_{max}+1\} $$

Then I select the maximum confidence over all frames as the score of a specific utterance.
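A minimal numpy sketch of this posterior handling, using 0-based frame indices; `posteriors` is assumed to be the (frameNumbers, 3) softmax output of a model:

```python
import numpy as np

def compute_score(posteriors, w_smooth=5, w_max=70):
    T, n = posteriors.shape
    # Step 1: smooth each posterior over a trailing window of w_smooth frames
    smoothed = np.zeros_like(posteriors)
    for j in range(T):
        h_smooth = max(0, j - w_smooth + 1)
        smoothed[j] = posteriors[h_smooth:j + 1].mean(axis=0)
    # Step 2: confidence = sum over keyword labels (skip filler at column 0)
    # of the max smoothed posterior within a trailing window of w_max frames
    confidence = np.zeros(T)
    for j in range(T):
        h_max = max(0, j - w_max + 1)
        confidence[j] = sum(smoothed[h_max:j + 1, i].max() for i in range(1, n))
    return confidence.max()  # utterance-level score
```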
First, let's briefly compare these four models.
MODEL | PARAMETERS |
---|---|
3-128 DNN | (1640,128)+2*(128,128)+3*(128,)+(128,3)+(3,) = 243,459 |
6-512 DNN | (1640,512)+5*(512,512)+6*(512,)+(512,3)+(3,) = 2,155,011 |
GRU_128 | (968,256)+(256,)+(968,128)+(128,)+(128,3)+(3,) = 372,483 |
DSCNN | 135,023 |
Note: leftFrames and rightFrames refer to the numbers of context frames used during frame stacking.
- Performance ranking: DSCNN > GRU > DNN_512_6 > DNN_128_3
- To examine the performance of the models and make debugging easier, I made the following visualization. From top to bottom, it shows (a plotting sketch follows this list):
- Wave form: a plot of the raw .wav file
- Desired label: 0 stands for 'filler', 1 for 'hello', and 2 for 'xiaogua'
- Modeloutput_label_0: the probability of label 0 (filler) in the model output
- Modeloutput_label_1_2: the probabilities of label 1 (hello) and label 2 (xiaogua) in the model output
- Confidence: the value obtained after posterior handling of the model output
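A hypothetical matplotlib sketch of this five-panel figure, with random dummy data standing in for the real waveform, labels, posteriors, and confidence:

```python
import numpy as np
import matplotlib.pyplot as plt

T = 300                                          # number of frames (dummy)
wave = np.random.randn(T * 160)                  # raw samples (placeholder)
labels = np.random.randint(0, 3, T)              # desired label per frame
posteriors = np.random.dirichlet(np.ones(3), T)  # model output per frame
confidence = np.random.rand(T)                   # after posterior handling

fig, axes = plt.subplots(5, 1, figsize=(10, 8))
axes[0].plot(wave);             axes[0].set_title("Wave form")
axes[1].plot(labels);           axes[1].set_title("Desired label")
axes[2].plot(posteriors[:, 0]); axes[2].set_title("Modeloutput_label_0")
axes[3].plot(posteriors[:, 1], label="hello")
axes[3].plot(posteriors[:, 2], label="xiaogua")
axes[3].set_title("Modeloutput_label_1_2"); axes[3].legend()
axes[4].plot(confidence);       axes[4].set_title("Confidence")
plt.tight_layout()
plt.show()
```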
- I recorded the following problems in my blog:
  - The loss value soars abnormally after one epoch
  - The smoothing and confidence calculation described in the paper don't make sense for this project, so I improved them
  - The data transformation process is too slow, and GPU utilization is low
  - The usage of TensorFlow
- DSCNN: It's the best among these models due to its light weight (fewer parameters), high concurrency, and best performance. First, DSCNN exploits the strengths of CNNs: the kernel weights are shared across different regions, making the model both robust and easy to train. Second, it lowers the weight count even further compared with a plain CNN: by applying depthwise and pointwise convolutions, DSCNN can expand features efficiently while reducing parameters significantly.
- GRU: It has moderate performance in this project. It's better than the DNN in two aspects: weight sharing and the hidden state. First, the weight-sharing mechanism both shrinks the model and enables efficient use of parameters. Second, the hidden state, which can be seen as an embedding of prior frames, is also useful when classifying the current frame.
- DNN: It is both cumbersome and incompetent. The advantage of this model, if it has one, may be its simplicity of implementation.
- It took me about seventeen days to complete this project, my first project at ASLP. I think I have learned a lot during this period. First of all, I now know the common procedure for building a speech recognition system. Secondly, I dove deeper into DL through careful examination, debugging, and implementation of the GRU, DNN, and DSCNN. Third, I became more proficient with Python and TensorFlow. Besides, I found the process really interesting, and it gave me a sense of accomplishment. Now I'm ready for more challenging tasks in the future!