This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
forked from kaldi-asr/kaldi
-
Notifications
You must be signed in to change notification settings - Fork 65
/
fst_algo.dox
188 lines (152 loc) · 10.1 KB
/
fst_algo.dox
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
// doc/fst_algo.dox
// Copyright 2009-2011 Microsoft Corporation
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page fst_algo Finite State Transducer algorithms
Here we describe the FST algorithms in the Kaldi toolkit that are new or
different than the the ones in OpenFst (we use the OpenFst code itself
for many algorithms). These algorithms are
in the directory fstext/, and the corresponding command-line programs,
where they exist, are in fstbin/. This code uses the OpenFst library.
We will only describe here the algorithms that are actually used
in our current recipes.
\section fst_algo_det Determinization
We use a different determinization algorithm from the one in OpenFst. We
distinguish this with the name DeterminizeStar(); the corresponding command-line
program is named fstdeterminizestar. Our determinization algorithm is actually
closer to the standard FST determinization algorithm than the one in OpenFst,
in that it does epsilon removal along with determinization (thus, like
many other FST algorithms, we do not consider epsilon to be a "real symbol".
Our determinization algorithm has a different way of handling what happens
when a transition in the initial determinized output has more than one output
symbol on it. The OpenFst determinization algorithm uses a
function called FactorWeights that moves around the output symbols (encoded
as weights) around while maintaining equivalence, in order to ensure that
no arc has more than one (encoded) output symbol; it does not introduce new states with epsilon
symbols on their input, which is the most "obvious" thing to do. However, the
FactorWeights algorithm can fail for the output of DeterminizeStar, because
there can be cycles with more outputs than states on the cycle (this is not
possible for the output of the normal Determinize algorithm, because it does not do
epsilon removal). Instead, whenever we encounter a link with more than one
output symbols, we create a chain with a sufficient number of intermediate
states to accommodate all the output symbols. The weight and input symbol
go on the first link of this chain. The output of our DeterminizeStar
algorithm is deterministic according to the definitions OpenFst uses,
i.e. treating epsilons as a normal symbols. Its output does have epsilons
on the input side, which is against the normal definition of
determinism, but this is to be viewed as an encoding mechanism for allowing
more than one output symbol on a link, and in any case it only happens
in quite specific circumstances (an epsilon arc is always the only
arc out of a state).
One other difference is that our program fstdeterminizestar does not require
the input FST to have its output symbols encoded as weights.
\subsection fst_algo_det_debug Debugging determinization
In general debugging determinization is quite hard, because when a process
takes a long time it's hard to tell whether ever terminate. The program
fstdeterminizestar has the feature that if you send it a signal with
"kill -SIGUSR1 \<its processid\>", it will stop and print out some
information that can be useful for debugging the determinization.
\subsection fst_algo_det_log Determinization in the log semiring
We supply a function DeterminizeInLog() which casts an Fst in the
normal (tropical) semiring to the log semiring before determinizing, and
then converts back. This is the form of determinization used in our
algorithms, as it preserves stochasticity (see \ref fst_algo_stochastic).
\section fst_algo_eps Removing epsilons
We supply an epsilon-removal algorithm called RemoveEpsLocal() that is
guaranteed to never blow up the FST, but on the other hand is not guaranteed
to remove all epsilons. Essentially it removes whatever epsilons it
can easily remove without making the graph larger. There is no optimality
guaranteed here, as this is a hard problem. The RemoveEpsLocal() function
has slightly different behaviour from OpenFst's RemoveEps() function, because
it will combine two arcs if one has an input epsilon and one an output epsilon.
The function RemoveEpsLocal() preserves FST equivalence.
There is also a function RemoveEpsLocalSpecial() that preserves equivalence
in the tropical semiring while preserving stochasticity in the log semiring
(for more on stochasticity see next section). This appears to be a case
where the usefulness of the semiring formalism breaks down a little bit, as
we have to consider two semirings simultaneously.
\section fst_algo_stochastic Preserving stochasticity and testing it
We define a stochastic FST as an FST in which, in the FST's semiring the sum of
the weights of the arcs out of any given state (plus the final-weight) equals
one (in the semiring). This concept is most useful and natural in the log
semiring; essentially, a stochastic FST is one where the sum of the weights out
of a given arc is one (e.g. a properly normalized HMM would correspond to
a stochastic FST).
The function IsStochasticFst() tests stochasticity. Optionally it can output
minimum and maximum weights, to inform the user of the extent to which the
FST fails to be stochastic. The command-line version of this is fstisstochastic.
We aim for most of the FST algorithms we use to preserve stochasticity, in the
sense that they will produce stochastic outputs given stochastic inputs. For
non-stochastic inputs, we aim that the minimum and maximum range of the weights
will not get larger. By default the program isstochasticfst will test stochasticity
after casting to the log semiring, as this is the most useful case (you can stop this
by supplying the option --test-in-log=false).
In order to preserve stochasticity, the FSTs that we compose with have to have
certain properties. These should apply to L, C and H. Consider L, for example.
We require that for any linear FST corresponding to a path through G, call this
FST F, the product L o F must be stochastic. This basically means that L
have properly normalized pronunciation probabilities. The actual property formally
required may be a little stronger than this (this relates to ensuring that
the probabilities appear at the "right time"). In practice we verify stochasticity
by running the program isstochasticfst for each stage of graph creation.
\section fst_algo_minimization Minimization
We use the minimization algorithm supplied by OpenFst, but we apply a patch
before compiling OpenFst so that minimization can be applied to non-deterministic
FSTs. The reason for this is so that we can remove disambiguation symbols before
minimizing, which is more optimal (it allows minimization to combine more states).
The patch fixes data-structure related issues; fundamentally, OpenFst's minimization
algorithm is applicable to non-deterministic FSTs. We supply a command-line program called
fstminimizeencoded that minimizes FSTs after encoding weights and output symbols
as part of the input symbols (so the FST becomes an acceptor). This is the same
thing the fstminimize program does except that it does not do weight pushing. This
is desirable for us because the way we ensure stochasticity entails avoiding any
weight pushing.
\section fst_algo_composition Composition
For the most part we use OpenFst's own composition algorithms, but we do
make use of a function TableCompose(), and a corresponding command-line program
fsttablecompose, which is a more efficient composition algorithm for certain
common cases. It uses the "Matcher" concept of OpenFst; a Matcher is a kind
of helper class used during composition that performs lookup on the arcs out of
a state to find any arcs with a particular input or output symbol. The normal
matcher that OpenFst uses is SortedMatcher, which relies on arcs being sorted
on the relevant label, and does a binary search. TableMatcher detects
cases where it would be efficient to build a table indexed by label, and for
those states it avoids the overhead of binary search. This leads to a
speedup when composing with things like lexicons that have a very high
out-degree.
\section fst_algo_disambig Adding and removing disambiguation symbols
Our FST recipes (like other transducer-based recipes) rely on
disambiguation symbols. In the normal recipes, these
are added to the input side of the lexicon FST (L) to make it determizable.
We also add disambiguation symbols to G and C (see \ref graph_disambig).
Whenever we do a composition and the FST on the right has disambiguation
symbols on its input, we (in theory) add to each state in the left-hand
FST a self-loop for each of the disambiguation symbols, which has that
symbol on both its input and output. Note that the actual integer symbols
id's for the disambiguation symbols on the left and right may not be the same.
For instance, we have a special symbol \#0 in G (where epsilon would normally
be). The symbol-id for this would generally be the highest-numbered word
plus one. But when we want to pass this symbol through L, we need a symbol
in L's input symbol table (which mainly contains phones), to represent \#0.
We have a function AddSelfLoops() that takes a mutable FST and
two vectors of labels (a label is an integer id for a symbol). The vectors
are the same size, and represent corresponding input and output labels for
the disambiguation symbols. This function adds self-loops to each final
state and each state with non-epsilon output symbols on at least one arc
out of it.
We remove disambiguation symbols with the function DeleteISymbols(), accessible
on the command line with the program fstrmsymbols.
*/
}