Postprocessing skips words in final result #74

Open
ldeseynes opened this issue Oct 16, 2018 · 15 comments

@ldeseynes

Hi Tanel!
First of all, thanks for your great work.

I'm using gst-kaldi-nnet2-online to decode French speech. When running the client.py script, I get a fairly accurate transcription with my model, but at some point the decoding stops and starts again a few seconds later. This results in missing words in the output.
Here is an example of the result, with two dots marking the missing words at the end of each sentence:

bonjour , je m' appelle Jean-Christophe je suis agriculteur dans le Loiret sur une exploitation céréalières . je me suis installé il y a une dizaine d' années .. <unk> trente-cinq ans aujourd' hui . je suis papa de trois enfants .. j' ai repris l' exploitation qui était consacré à la culture de la betterave sucrière de céréales depuis que je suis installé diversifiés .. j' y cultive aujourd' hui des oléagineux , comme le colza du maïs .. plusieurs types de céréales ..

Every time this occurs, I get a warning from worker.py:
WARNING ([5.5.76~1-535b]:LatticeWordAligner():word-align-lattice.cc:263) [Lattice has input epsilons and/or is not input-deterministic (in Mohri sense)]-- i.e. lattice is not deterministic. Word-alignment may be slow and-or blow up in memory.

Any idea about this issue?

Maybe there is a way to control the length of the final result. For example, I get the following output from worker.py:
2018-10-16 16:59:55 - INFO: decoder2: 2faa241b-89ed-497a-9c2a-3b974bf1f8da: Got final result: bonjour , je m' appelle Jean-Christophe je suis agriculteur dans le Loiret sur une exploitation céréalières . je me suis installé il y a une dizaine d' années .

How can I change the code to split the results into two shorter ones?
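
I suppose the final-result length is governed by endpointing: a final result is emitted whenever an endpoint rule fires. If I read the plugin correctly, kaldinnet2onlinedecoder maps Kaldi's OnlineEndpointConfig options onto GStreamer properties, so something like the following should yield shorter finals. The property names and values here are my assumption, to be verified with gst-inspect-1.0:

decoder:
    do-endpointing: true
    # fire an endpoint after less trailing silence (in seconds); values are guesses
    endpoint-rule2-min-trailing-silence: 0.5
    endpoint-rule3-min-trailing-silence: 0.75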

Thanks in advance!

@tingyang01

Hello ldeseynes,
It's wonderful that you got a reasonable result for French STT.
I've also tried to use gst-kaldi-nnet2-online to decode French speech,
but I got only one word per call.
When I tested offline decoding instead, it gave a reasonable result.
I configured worker.yaml like this:

use-nnet2: True
decoder:
    # All the properties nested here correspond to the kaldinnet2onlinedecoder GStreamer plugin properties.
    # Use gst-inspect-1.0 ./libgstkaldionline2.so kaldinnet2onlinedecoder to discover the available properties
    use-threaded-decoder: True
    model: test/models/french/librifrench/final.mdl
    word-syms: test/models/french/librifrench/words.txt
    fst: test/models/french/librifrench/HCLG.fst
    mfcc-config: test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config: test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 10000
    beam: 40.0
    lattice-beam: 6.0
    acoustic-scale: 0.083
    do-endpointing: true
    endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    # Additional functionality that you can play with:
    #lm-fst: test/models/english/librispeech_nnet_a_online/G.fst
    #big-lm-const-arpa: test/models/english/librispeech_nnet_a_online/G.carpa
    #phone-syms: test/models/english/librispeech_nnet_a_online/phones.txt
    #word-boundary-file: test/models/english/librispeech_nnet_a_online/word_boundary.int
    #do-phone-alignment: true
out-dir: tmp

use-vad: False
silence-timeout: 10

# Just a sample post-processor that appends "." to the hypothesis
post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'
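
For clarity: as I understand it, the worker pipes each transcript to the post-processor on stdin, one per line, and reads one processed line back, so output must be flushed immediately. Here is a rough Python equivalent of the perl one-liner above, just a sketch of that contract, not the project's code:

import sys

# Read transcripts line by line, append "." and flush right away
# so the worker is never left waiting on buffered output.
for line in sys.stdin:
    sys.stdout.write(line.rstrip("\n") + ".\n")
    sys.stdout.flush()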
Then I got the following log messages.

  • gst worker server:
    2019-09-18 01:37:14 - INFO: main: d01660ed-3b16-4401-a941-187b3bceb971: Postprocessing done.
    2019-09-18 01:37:14 - DEBUG: main: d01660ed-3b16-4401-a941-187b3bceb971: After postprocessing: {u'status': 0, u'segment-start': 5.18, u'segment-length': 3.54, u'total-length': 8.72, u'result': {u'hypotheses': [{u'likelihood': -6.27475, u'transcript': u'il.', 'original-transcript': u'il'}, {u'likelihood': -7.54526, u'transcript': u'ils.', 'original-transcript': u'ils'}, {u'likelihood': -8.89724, u'transcript': u'il jeta.', 'original-transcript': u'il jeta'}, {u'likelihood': -10.2568, u'transcript': u'il je.', 'original-transcript': u'il je'}, {u'likelihood': -10.3752, u'transcript': u"il j'.", 'original-transcript': u"il j'"}, {u'likelihood': -10.9077, u'transcript': u'il ne.', 'original-transcript': u'il ne'}, {u'likelihood': -11.0849, u'transcript': u'ils je.', 'original-transcript': u'ils je'}, {u'likelihood': -11.1516, u'transcript': u"ils j'.", 'original-transcript': u"ils j'"}, {u'likelihood': -11.2717, u'transcript': u'il me.', 'original-transcript': u'il me'}, {u'likelihood': -11.2749, u'transcript': u'de.', 'original-transcript': u'de'}], u'final': True}, 'segment': 0, 'id': u'd01660ed-3b16-4401-a941-187b3bceb971'}

  • gst master server
    INFO 2019-09-18 01:37:09,763 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'segment': 0, u'result': {u'hypotheses': [{u'transcript': u'de.'}], u'final': Fal... to client
    INFO 2019-09-18 01:37:14,013 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'segment-start': 5.18, u'segment-length': 3.54, u'total-length': 8.72, u'result':... to client
    INFO 2019-09-18 01:37:14,024 d01660ed-3b16-4401-a941-187b3bceb971: Sending event {u'status': 0, u'adaptation_state': {u'type': u'string+gzip+base64', u'id': u'd01660ed-3b16-4401-... to client

  • client
    Audio sent, now sending EOS
    il.

I would be grateful to know what's wrong with my configuration.
How can I get the correct result?
Thanks for your help.

@ldeseynes
Author

Hi,
In your yaml config file, set acoustic-scale to 1.0 and add frame-subsampling-factor: 3.
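
That is, in the decoder: section of your yaml file:

    acoustic-scale: 1.0
    frame-subsampling-factor: 3

Chain-style models are trained with frame subsampling, so these two settings go together.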

@tingyang01

Hello,
I changed the config file and tested again.
"""
use-nnet2: True
decoder:
    # All the properties nested here correspond to the kaldinnet2onlinedecoder GStreamer plugin properties.
    # Use gst-inspect-1.0 ./libgstkaldionline2.so kaldinnet2onlinedecoder to discover the available properties
    use-threaded-decoder: True
    model: test/models/french/librifrench/final.mdl
    word-syms: test/models/french/librifrench/words.txt
    fst: test/models/french/librifrench/HCLG.fst
    mfcc-config: test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config: test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 10000
    beam: 13.0
    lattice-beam: 8.0
    acoustic-scale: 1.0
    frame-subsampling-factor: 3
    #acoustic-scale: 0.083
    do-endpointing: true
    #endpoint-silence-phones: "1:2:3:4:5:6:7:8:9:10"
    endpoint-silence-phones: "1:2:3:4:5"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    # Additional functionality that you can play with:
    #lm-fst: test/models/english/librispeech_nnet_a_online/G.fst
    #big-lm-const-arpa: test/models/english/librispeech_nnet_a_online/G.carpa
    #phone-syms: test/models/english/librispeech_nnet_a_online/phones.txt
    #word-boundary-file: test/models/english/librispeech_nnet_a_online/word_boundary.int
    #do-phone-alignment: true
out-dir: tmp
"""
I got the following result.
"""
une.
l' ai.jamais.
de. Audio sent, now sending EOS
de.
une. l' ai. de.
"""
Could you share your parameters?
I can share my model.
I hope you can help.

@tingyang01

I just found the following command:

online2-wav-nnet2-latgen-faster --online=true --do-endpointing=false \
    --config=exp/nnet2_online/nnet_ms_a_online/conf/online_nnet2_decoding.conf \
    --max-active=7000 --beam=15.0 --lattice-beam=6.0 --acoustic-scale=0.1 \
    --word-symbol-table=exp/tri4b/graph_SRILM/words.txt \
    exp/nnet2_online/nnet_ms_a_online/final.mdl exp/tri4b/graph_SRILM/HCLG.fst \
    ark:data/test_hires/split8/1/spk2utt \
    'ark,s,cs:extract-segments scp,p:data/test_hires/split8/1/wav.scp data/test_hires/split8/1/segments ark:- |' \
    'ark:|gzip -c > exp/nnet2_online/nnet_ms_a_online/decode_SRILM/lat.1.gz'

It decodes audio files like this:

LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0030
13-1410-0031 rez de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0031
13-1410-0032 de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0032
13-1410-0033 rit de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0033
13-1410-0034 de
LOG (online2-wav-nnet2-latgen-faster[5.5.463~1-9f3d8]:main():online2-wav-nnet2-latgen-faster.cc:276) Decoded utterance 13-1410-0034
13-1410-0035 de

I would be grateful if you could check it carefully.

@ldeseynes
Author

Hi,

Here are the parameters I set, but I have not used the system for a while. In your yaml file, you should add nnet-mode: 3. Also, check that you're decoding your audio file with the correct sample rate and number of channels.

use-threaded-decoder=true
nnet-mode=3
frame-subsampling-factor=3
acoustic-scale=1.0
model=models/final.mdl
fst=models/HCLG.fst
word-syms=models/words.txt
phone-syms=models/phones.txt
word-boundary-file=models/word_boundary.int
num-nbest=10
num-phone-alignment=3
do-phone-alignment=true
feature-type=mfcc
mfcc-config=models/conf/mfcc.conf
ivector-extraction-config=models/conf/ivector_extractor.conf
max-active=1000
beam=11.0
lattice-beam=5.0
do-endpointing=true
endpoint-silence-phones="1:2:3:4:5:6:7:8:9:10"
chunk-length-in-secs=0.23
phone-determinize=true
determinize-lattice=true
frames-per-chunk=10
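
Note these are written in GStreamer property syntax; in the worker's yaml file the same settings go under the decoder: section with colons instead of equals signs, e.g.:

decoder:
    use-threaded-decoder: True
    nnet-mode: 3
    frame-subsampling-factor: 3
    acoustic-scale: 1.0

and likewise for the rest.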

@tingyang01

Thanks for your kind reply.
You seem to be using an NNet3 online model, while I've trained an NNet2 online model.
I tried nnet-mode=3, but the engine crashed.
Let me know your thoughts on it.
Best,
Ting

@ldeseynes
Author

What's your command to start the scripts?

@tingyang01

tingyang01 commented Sep 25, 2019

I am using the GStreamer server and client from https://github.com/alumae/kaldi-gstreamer-server like this:

  • master server
    python kaldigstserver/master_server.py --port=8888

  • 'onlinegmmdecodefaster' based worker
    python kaldigstserver/worker.py -u ws://localhost:8888/worker/ws/speech -c french_stt.yaml

french_stt.yaml is as follows:
use-nnet2: True
decoder:
    use-threaded-decoder: True
    model: test/models/french/librifrench/final.mdl
    word-syms: test/models/french/librifrench/words.txt
    fst: test/models/french/librifrench/HCLG.fst
    mfcc-config: test/models/french/librifrench/conf/mfcc.conf
    ivector-extraction-config: test/models/french/librifrench/conf/ivector_extractor.conf
    max-active: 1000
    beam: 13.0
    lattice-beam: 8.0
    acoustic-scale: 1.0
    frame-subsampling-factor: 3
    do-endpointing: true
    nnet-mode: 2
    endpoint-silence-phones: "1:2:3:4:5"
    traceback-period-in-secs: 0.25
    chunk-length-in-secs: 0.25
    num-nbest: 10
    frames-per-chunk: 10
out-dir: tmp

use-vad: False
silence-timeout: 10

post-processor: perl -npe 'BEGIN {use IO::Handle; STDOUT->autoflush(1);} s/(.*)/\1./;'

logging:
    version: 1
    disable_existing_loggers: False
    formatters:
        simpleFormater:
            format: '%(asctime)s - %(levelname)7s: %(name)10s: %(message)s'
            datefmt: '%Y-%m-%d %H:%M:%S'
    handlers:
        console:
            class: logging.StreamHandler
            formatter: simpleFormater
            level: DEBUG
    root:
        level: DEBUG
        handlers: [console]

  • client
    python kaldigstserver/client.py -r test.wav

I've confirmed that test.wav is 16 kHz, 16-bit, mono.
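
One way to double-check the format, assuming sox is installed:

soxi test.wav

which prints the channel count, sample rate and bit depth.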

Let me know your thoughts on it.
I look forward to hearing from you.

@ldeseynes
Author

This looks fine to me. Just check the parameters you used for your training (acoustic scale and frame subsampling factor), because I'm not sure about their values in the nnet2 setup. Anyway, you'd rather use a more recent model type if you want to get decent results.
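
For what it's worth, chain recipes write the factor into the training directory, so you can simply check it there; nnet2 recipes don't write that file, since they don't use frame subsampling:

cat exp/chain/tdnn/frame_subsampling_factor   # hypothetical path; adjust to your training dir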

@tingyang01

Thanks for your reply.
Have you ever compared Kaldi's nnet-latgen-faster and online2-wav-nnet2-latgen-faster?
I think the problem may lie in the difference between those two decoding methods.
And could you tell me more about a more recent model?

@ldeseynes
Author

Just use a chain model; you'll get better results, and the recipe gives far more detail.

@tingyang01

tingyang01 commented Sep 25, 2019

I built my French STT model using wsj/s5/local/online/run_nnet2.sh.
Thanks, let me try again.

@tingyang01

One thing: do you mean an NNet3 chain model?
Could you tell me which script I should use?

@ldeseynes
Author

Sure, you can retrain a model using tedlium/s5_r3/run.sh. You don't need the rnnlm stuff after stage 18 for your GStreamer application.
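
To use the resulting chain model with the GStreamer plugin you still need the online setup (final.mdl plus the conf/ directory with the mfcc and ivector configs). If I remember correctly, that is what steps/online/nnet3/prepare_online_decoding.sh produces; a sketch, with directory names taken from a typical recipe layout (yours may differ):

steps/online/nnet3/prepare_online_decoding.sh --mfcc-config conf/mfcc_hires.conf \
    data/lang exp/nnet3/extractor exp/chain/tdnn1b_sp exp/chain/tdnn1b_sp_online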

@tingyang01

Thank you, will try.
