I used DNNs (3×128 and 6×512), a GRU, and a DSCNN for a keyword spotting task.
Feature representation → Model → Posterior handling → Evaluation on test set
- Keyword: HELLO XIAOGUA
- Training set: 1642 positive examples, 9383 negative examples
- Test set: 559 positive examples, 3453 negative examples
- Feature extraction rate: computed every 10 ms over a window of 25 ms (see the sketch after this list)
- Feature shape: (frameNumbers, 40)
- Performance metrics: false-reject rate and false-positive rate
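As a hedged illustration of the feature setup above, here is a minimal sketch using the python_speech_features library; the filename and the library choice are my assumptions, not the project's actual code:

```python
import scipy.io.wavfile as wav
from python_speech_features import logfbank

# Sketch: 40-dim log filterbank features, 25 ms window, 10 ms shift.
# "example.wav" is a placeholder path.
rate, signal = wav.read("example.wav")
feats = logfbank(signal, samplerate=rate, winlen=0.025, winstep=0.01, nfilt=40)
print(feats.shape)  # -> (frameNumbers, 40)
```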
To mimic the mechanism of the human ear, we stack nearby frames when labeling a specific frame, as the picture below shows:
There are two types of feature stacking: one stacks vertically (Type 2) and the other stacks horizontally (Type 1). Type 1 can be considered an image, so it is easy to perform convolution on and is used as the input of the DSCNN. Type 2 is a single vector, so it can be used as the input of the GRU or DNN.
The values of n and m (the numbers of left and right context frames, i.e. leftFrames and rightFrames) for each model are as follows; a stacking sketch follows the table.
model | n | m |
---|---|---|
DNN | 30 | 10 |
GRU, DSCNN | 15 | 5 |
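A minimal sketch of Type 2 stacking, assuming edge frames are padded by repeating the boundary frame (the padding strategy is my assumption):

```python
import numpy as np

def stack_frames(feats, n, m):
    """Concatenate n left and m right neighbor frames with each frame (Type 2)."""
    T, d = feats.shape
    padded = np.concatenate([np.repeat(feats[:1], n, axis=0),
                             feats,
                             np.repeat(feats[-1:], m, axis=0)])
    return np.stack([padded[t:t + n + m + 1].reshape(-1) for t in range(T)])

# For the DNN (n=30, m=10), each stacked vector has (30+10+1)*40 = 1640 dims,
# matching the input size in the parameter table below.
```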
Now let's take a look at the models I used.
DNN with 3 (or 6) layers, each layer having 128 (or 512) units. The output is handled in two steps (see the sketch after this list):
- Step 1: softmax
- Step 2: cross entropy
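A minimal sketch of the 3×128 DNN in TF 1.x; the activation function (ReLU) and the fused softmax/cross-entropy op are my assumptions:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 1640])   # stacked features (41 frames x 40)
labels = tf.placeholder(tf.int32, [None])      # 0: filler, 1: hello, 2: xiaogua

h = x
for _ in range(3):                             # 3 hidden layers of 128 units
    h = tf.layers.dense(h, 128, activation=tf.nn.relu)
logits = tf.layers.dense(h, 3)

# Step 1 (softmax) and Step 2 (cross entropy), fused for numerical stability
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=labels, logits=logits))
```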
```python
trainBatchSize = 100
testBatchSize = 100
leftFrames = 30
rightFrames = 10
learningRate = 0.00001
decay_rate = 0.8
numEpochs = 5
w_smooth = 3
w_max = 30
```
- Shuffle the training and test data every epoch
- Exponential learning-rate decay (see the sketch below)
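A sketch of exponential decay in TF 1.x; decay_steps and the optimizer choice are assumptions (the actual code may decay per epoch and use a different optimizer):

```python
import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
lr = tf.train.exponential_decay(learning_rate=0.00001, global_step=global_step,
                                decay_steps=1000, decay_rate=0.8, staircase=True)
loss = tf.Variable(1.0)  # stands in for the cross-entropy loss above
train_op = tf.train.GradientDescentOptimizer(lr).minimize(
    loss, global_step=global_step)
```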
The GRU cell is built from an update gate, a reset gate, and a candidate hidden state.
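For reference, one common formulation of the GRU equations (TensorFlow's GRUCell fuses the two gate matrices, which is why the parameter table below shows a single (968, 256) gate kernel):

$$ z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z) $$

$$ r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r) $$

$$ \tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) $$

$$ h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t $$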
- Loss: tensorflow.contrib.seq2seq.sequence_loss
Applies cross-entropy loss between each element of a sequence and its label. The sequence length is not fixed, so we need to pass in a mask as a filter (see the sketch below).
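A minimal sketch of the masked sequence loss in TF 1.x; the placeholder shapes are my assumptions:

```python
import tensorflow as tf

logits = tf.placeholder(tf.float32, [None, None, 3])  # (batch, time, classes)
targets = tf.placeholder(tf.int32, [None, None])      # (batch, time)
seq_len = tf.placeholder(tf.int32, [None])            # true length per utterance

# Mask out padded frames so they do not contribute to the loss
mask = tf.sequence_mask(seq_len, maxlen=tf.shape(targets)[1], dtype=tf.float32)
loss = tf.contrib.seq2seq.sequence_loss(logits, targets, weights=mask)
```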
modelName = "GRU" # "GRU" "DNN_6_512" "DNN_3_128"
lossFunc = "seqLoss" # "Paper" "crossEntropy"
trainBatchSize = 16
testBatchSize = 16
leftFrames = 15
shuffle = True
rightFrames = 5
learningRate = 0.001
decay_rate = 0.895
numEpochs = 60
w_smooth = 5
w_max = 70
Layer | Parameters |
---|---|
1: Conv + BatchNorm | kernel (10,4), y-stride 2, x-stride 1, output features 172 |
2: DS-Conv | kernel (3,3), y-stride 2, x-stride 2, output features 172 |
3: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
4: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
5: DS-Conv | kernel (3,3), y-stride 1, x-stride 1, output features 172 |
6: AvgPooling + FullyConnected | output shape (None, 3) |
Note: DS-Conv stands for Depthwise Separable Convolution, which consists of a depthwise convolution, batch norm, a pointwise convolution, and batch norm (see the sketch below).
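A minimal sketch of one DS-Conv block in TF 1.x; the ReLU placement and SAME padding are my assumptions:

```python
import tensorflow as tf

def ds_conv(x, name, out_channels=172, stride=(1, 1), training=True):
    """Depthwise conv -> batch norm -> pointwise (1x1) conv -> batch norm."""
    with tf.variable_scope(name):
        in_channels = x.get_shape().as_list()[-1]
        dw_filter = tf.get_variable("dw_filter", [3, 3, in_channels, 1])
        x = tf.nn.depthwise_conv2d(x, dw_filter,
                                   strides=[1, stride[0], stride[1], 1],
                                   padding="SAME")
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
        x = tf.layers.conv2d(x, out_channels, kernel_size=1, use_bias=False)
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
    return x
```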
The loss function is the same as the DNN's (softmax + cross entropy).
```python
trainBatchSize = 10
testBatchSize = 10
leftFrames = 15
rightFrames = 5
shuffle = True
learningRate = 0.000002
decay_rate = 0.895
numEpochs = 60
w_smooth = 5
w_max = 70
```
For the output of the model, take the following two steps (a code sketch follows):

- Smoothing

$$ p'_{ij}=\frac{1}{j-h_{smooth}+1}\sum_{k=h_{smooth}}^{j}p_{ik} $$

$$ h_{smooth}=\max\{1,\ j-w_{smooth}+1\} $$

Here i takes values in {0, 1, 2}, which stand for {filler, keyword1, keyword2}.

- Calculate confidence

$$ confidence_j=\sum_{i=1}^{n-1}\max_{h_{max}\le k\le j}p'_{ik},\qquad h_{max}=\max\{1,\ j-w_{max}+1\} $$

Then I select the maximum confidence over all frames as the score of a specific utterance.
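A minimal numpy sketch of this posterior handling, using 0-based frame indices; `posteriors` is assumed to be the (frameNumbers, 3) softmax output of a model:

```python
import numpy as np

def compute_score(posteriors, w_smooth=5, w_max=70):
    T, n = posteriors.shape
    # Step 1: smooth each posterior over a trailing window of w_smooth frames
    smoothed = np.zeros_like(posteriors)
    for j in range(T):
        h_smooth = max(0, j - w_smooth + 1)
        smoothed[j] = posteriors[h_smooth:j + 1].mean(axis=0)
    # Step 2: confidence = sum over keyword labels (skip filler at column 0)
    # of the max smoothed posterior within a trailing window of w_max frames
    confidence = np.zeros(T)
    for j in range(T):
        h_max = max(0, j - w_max + 1)
        confidence[j] = sum(smoothed[h_max:j + 1, i].max() for i in range(1, n))
    return confidence.max()  # utterance-level score
```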
First, let's briefly compare these four models.
MODEL | PARAMETERS |
---|---|
3-128 DNN | (1640,128)+2*(128,128)+3*(128,)+(128,3)+(3,) = 243,459 |
6-512 DNN | (1640,512)+5*(512,512)+6*(512,)+(512,3)+(3,) = 2,155,011 |
GRU_128 | (968,256)+(256,)+(968,128)+(128,)+(128,3)+(3,) = 372,483 |
DSCNN | 135,023 |
Note: leftFrames and rightFrames refer to the numbers of context frames used during frame stacking.
- Performance ranking: DSCNN > GRU > DNN_512_6 > DNN_128_3
- To examine the performance of the models and make debugging easier, I made the following visualization. From top to bottom, it shows (a plotting sketch follows this list):
- Wave form: a plot of the raw .wav file
- Desired label: 0 stands for 'filler', 1 for 'hello', and 2 for 'xiaogua'
- Modeloutput_label_0: the probability of label 0 (filler) in the model output
- Modeloutput_label_1_2: the probabilities of label 1 (hello) and label 2 (xiaogua) in the model output
- Confidence: the value obtained after posterior handling of the model output
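A hypothetical matplotlib sketch of this five-panel figure, with random dummy data standing in for the real waveform, labels, posteriors, and confidence:

```python
import numpy as np
import matplotlib.pyplot as plt

T = 300                                          # number of frames (dummy)
wave = np.random.randn(T * 160)                  # raw samples (placeholder)
labels = np.random.randint(0, 3, T)              # desired label per frame
posteriors = np.random.dirichlet(np.ones(3), T)  # model output per frame
confidence = np.random.rand(T)                   # after posterior handling

fig, axes = plt.subplots(5, 1, figsize=(10, 8))
axes[0].plot(wave);             axes[0].set_title("Wave form")
axes[1].plot(labels);           axes[1].set_title("Desired label")
axes[2].plot(posteriors[:, 0]); axes[2].set_title("Modeloutput_label_0")
axes[3].plot(posteriors[:, 1], label="hello")
axes[3].plot(posteriors[:, 2], label="xiaogua")
axes[3].set_title("Modeloutput_label_1_2"); axes[3].legend()
axes[4].plot(confidence);       axes[4].set_title("Confidence")
plt.tight_layout()
plt.show()
```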
- I recorded the following problems in my blog:
  - The loss value soars abnormally after one epoch
  - The smoothing and confidence calculation described in the paper don't make sense for this project, so I improved them
  - The data transformation process is too slow, and GPU utilization is low
  - The usage of TensorFlow
- DSCNN: It's the best among these models due to its light weight (fewer parameters), high concurrency, and best performance. First, DSCNN exploits the strengths of CNNs: the kernel weights are shared across different regions, making the model both robust and easy to train. Second, it lowers the weight count even further compared with a plain CNN: by applying depthwise and pointwise convolutions, DSCNN can expand features efficiently while reducing parameters significantly.
- GRU: It has moderate performance in this project. It's better than the DNN in two aspects: weight sharing and the hidden state. First, the weight-sharing mechanism both shrinks the model and enables efficient use of parameters. Second, the hidden state, which can be seen as an embedding of prior frames, is also useful when classifying the current frame.
- DNN: It is both cumbersome and incompetent. The advantage of this model, if it has one, may be its simplicity of implementation.
- It took me about seventeen days to complete this project, my first project at ASLP. I think I have learned a lot during this period. First of all, I now know the common procedure for building a speech recognition system. Secondly, I dove deeper into DL through careful examination, debugging, and implementation of the GRU, DNN, and DSCNN. Third, I became more proficient with Python and TensorFlow. Besides, I found the process really interesting, and it gave me a sense of accomplishment. Now I'm ready for more challenging tasks in the future!