This repository has been archived by the owner on Apr 16, 2024. It is now read-only.
forked from kaldi-asr/kaldi
-
Notifications
You must be signed in to change notification settings - Fork 65
/
tree_internals.dox
252 lines (215 loc) · 14.2 KB
/
tree_internals.dox
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
// doc/tree_internals.dox
// Copyright 2009-2011 Microsoft Corporation
// See ../../COPYING for clarification regarding multiple authors
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
// http://www.apache.org/licenses/LICENSE-2.0
// THIS CODE IS PROVIDED *AS IS* BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
// KIND, EITHER EXPRESS OR IMPLIED, INCLUDING WITHOUT LIMITATION ANY IMPLIED
// WARRANTIES OR CONDITIONS OF TITLE, FITNESS FOR A PARTICULAR PURPOSE,
// MERCHANTABLITY OR NON-INFRINGEMENT.
// See the Apache 2 License for the specific language governing permissions and
// limitations under the License.
namespace kaldi {
/**
\page tree_internals Decision tree internals
This page describes the internals of the phonetic decision-tree clustering
code, which are implemented as quite generic structures and algorithms. For a
higher-level view that explains the overall algorithm being implemented and how it
is used by the toolkit, see \ref tree_externals.
\section treei_event_map Event maps
The central concept in the decision-tree building code is the event map,
represented by type EventMap. Don't be misled by the term "event" to think
that this is something happening at a particular time. An event is just a set
of (key,value) pairs, with no key repeated. Conceptually, it could be represented by
the type std::map<int32,int32>. In fact, for efficiency we represent it as a
sorted vector of pairs, via the typedef:
\code
typedef std::vector<std::pair<EventKeyType,EventValueType> > EventType;
\endcode
Here, EventKeyType and EventValueType are just typedefs to int32, but the code is
easier to understand if we use distinct names for them. Think of an Event
(type EventType) as being like a collection of variables, except that the
names and values of the variables are both integers. There is also the type
\ref EventAnswerType, and this is also int32. It is the type returned by the EventMap;
in practice it is a pdf identifier (acoustic-state index).
Functions that require EventType expect you to have sorted it first (e.g. by calling
std::sort(e.begin(),e.end()); )
The purpose of the EventMap class is best illustrated via an example. Suppose our phone-in-context is a/b/c;
suppose we have the standard 3-state topology; and suppose we want to ask what is the index of the pdf
for the central state of phone "b" in this context. The state in question would be
state 1, since we use zero-based indexing.
We are skimming over a few details here when we refer to "state"; see \ref pdf_class for more information.
Suppose the integer indices of phones a, b and c are 10, 11 and 12 respectively.
The "event" would correspond to the mapping:
\verbatim
(0 -> 10), (1 -> 11), (2 -> 12), (-1 -> 1)
\endverbatim
where 0, 1 and 2 are the positions in the 3-phone window "a/b/c", and -1 is a special
index we use to encode the state-id (c.f. the constant \ref kPdfClass = -1). The way this
would be represented as a sorted vector of pairs is:
\code
// caution: not valid C++.
EventType e = { {-1,1}, {0,10}, {1,11}, {2,12} };
\endcode
Suppose we that the acoustic-state index (pdf-id) corresponding to this
acoustic state happens to be 1000. Then if we have an
EventMap "emap" representing the tree, then we would expect the following assert
not to fail:
\code
EventAnswerType ans;
bool ret = emap.Map(e, &ans); // emap is of type EventMap; e is EventType
KALDI_ASSERT(ret == true && ans == 1000);
\endcode
So when we state that an EventMap is a map from EventType to EventAnswerType, you can think of
this roughly as a map from a phone-in-context to an integer index. The phone-in-context
is represented as a set of (key-value) pairs, and in principle we could add new keys and
put in more information. Notice that the EventMap::Map() function returns bool. This is
because it's possible for certain events not to map to any answer at all (e.g. consider
an invalid phone, or imagine what happens when the EventType doesn't contain all the keys
that the EventMap is querying).
The EventMap is a very passive object. It does not have any
capability to learn decision-trees, it is just a means to store decision-trees.
Think of it as a structure that represents a function from EventType to EventAnswerType.
EventMap is a polymorphic, pure-virtual class (i.e. it cannot be instantiated because it
has virtual functions that are not implemented). There are three concrete classes
that implement the EventMap interface:
- ConstantEventMap: Think of this as a decision-tree leaf node.
This class stores an integer of type EventAnswerType and its Map function always returns that value.
- SplitEventMap: Think of this as a decision-tree non-leaf node that queries a certain
key and goes to the "yes" or "no" child node depending on the answer. Its Map
function calls the Map function of the appropriate child node. It stores a set of
integers of type kAnswerType that correspond to the "yes" child (everything else goes to "no").
- TableEventMap: This does a complete split on a particular key. A typical example is:
you might first split completely on the central phone and then have a separate
decision tree for each value of that phone. Internally it stores a vector of
EventMap* pointers. It looks up the value corresponding to the key that it
is splitting on, and calls the Map function of the EventMap at the corresponding
position in the vector.
The EventMap cannot really do very much apart from map EventType to EventAnswerType.
Its interface does not provide functions that allow you to traverse the tree, either
up or down. There is just one function that allows you to modify the EventMap, and this
is the EventMap::Copy() function, declared like this (as a class-member):
\code
virtual EventMap *Copy(const std::vector<EventMap*> &new_leaves) const;
\endcode
This has an effect similar to function composition.
If you call Copy() with an empty vector "new_leaves", it will just return a deep copy
of the entire object, copying all the pointers as it goes down the tree. However,
if new_leaves is nonempty, each time the Copy() function reaches a leaf, if that leaf
\c l is within the range (0, new_leaves.size()-1) and \c new_leaves[\c l] is non-NULL,
the Copy() function will return the result of calling \c new_leaves[\c l]->\c Copy(), rather
than the \c Copy() function of the leaf itself (which would have returned a new ConstantEventMap).
A typical example is where you decide to split a particular leaf, say leaf 852.
You could create an object of type vector<EventMap*> whose only non-NULL member is
as position 852. It would contain a pointer to an object whose real type is
SplitEventMap, and the "yes" and "no" pointers would be to type ConstantEventMap with
leaf values, say, 852 and 1234 (we reuse the old leaf-node id for the new leaf). The real
tree-building code is not this inefficient.
\section treei_stats Statistics for building the tree
The statistics used to build the phonetic decision tree are of the following type:
\code
typedef std::vector<std::pair<EventType, Clusterable*> > BuildTreeStatsType;
\endcode
A reference to an object of this type is passed to all the decision-tree
building routines. These statistics are expected to contain no duplicates with
the same EventType member, i.e. they conceptually represent a map from \ref EventType
to Clusterable. In our current code the Clusterable objects are really of type
GaussClusterable, but the tree-building code doesn't know this. A program that
accumulates these statistics is acc-tree-stats.cc.
\section treei_func Classes and functions involved in tree-building
\subsection treei_func_questions Questions (config class)
The class Questions is a class that in its interaction with the
tree-building code, behaves a little bit like a "configuration" class.
It is really a map from the "key" value (of type \ref EventKeyType),
to a configuration class of type QuestionsForKey.
Class QuestionsForKey
has a set of "questions", each of which is just a set of integers of
type \ref EventValueType; these mostly correspond to sets of
phones, or if the key is -1, they would in the typical case correspond to
sets of HMM-state indices, i.e. subsets of {0,1,2}. QuestionsForKey also contains
a configuration class
of type RefineClustersOptions. This controls tree-building behaviour in that
the tree-building code will attempt to iteratively move values (e.g. phones) between
the two sides of the split in order to maximize likelihood (as in K-means
with K=2). However, this can be turned off by setting the number
of iterations of "refining clusters" to zero, which corresponds to just choosing
from among a fixed list of questions. This seems to work a bit better.
\subsection treei_func_low Lowest-level functions
See \ref tree_group_lower for a complete list;
we summarize some important ones here.
These functions mostly involve manipulations of objects of type BuildTreeStatsType, which as
we saw above is a vector of pairs of (EventType, Clusterable*). The simplest ones
are DeleteBuildTreeStats(), WriteBuildTreeStats() and ReadBuildTreeStats().
The function PossibleValues() finds out which values a particular key takes within
a collection of statistics (and informs the user whether or not the key was always
defined); SplitStatsByKey() will split up an object of type
BuildTreeStatsType into a vector<BuildTreeStatsType> according to the value taken by
a particular key (e.g. splitting on the central phone); SplitStatsByMap() does the
same but the index is not the value for that key, but the answer returned by an EventMap
that is provided to the function.
SumStats() sums up the statistics (i.e. the Clusterable objects) in an object
of BuildTreeStatsType and returns corresponding Clusterable* object.
SumStatsVec() takes an object of
type vector<BuildTreeStatsType> and outputs something of type vector<Clusterable*>,
i.e. it's like SumStats() but for a vector; it's useful in processing the output of
SplitStatsByKey() and SplitStatsByMap().
ObjfGivenMap() evaluates the objective function given some statistics and an EventMap:
it sums up all the statistics within each cluster (corresponding to each distinct answer
of the EventMap::Map() function), adds up the objective functions across the clusters,
and returns the total.
FindAllKeys() will find all the keys that are defined in a collection of statistics, and
depending on the arguments it can find all keys that are defined for all the events, or
all the keys that are defined for any event (i.e. take the intersection or union of sets of
defined keys).
\subsection treei_func_intermediate Intermediate-level functions
The next group of functions involved in tree building are listed \ref
tree_group_intermediate "here" and correspond to various stages of tree building.
We will now only mention some representative ones.
Firstly, we point out that a lot of these functions have a parameter int32
*num_leaves. The integer which this points to acts as a counter to allocate new
leaves. At the start of tree building the caller would set it to zero.
Aach time a new leaf is required, the tree-building code takes the number it
currently points to as the new leaf-id, and then increments it.
One important function is GetStubMap(). This function returns a tree that has
not yet been split, i.e. the pdfs do not depend on the context. The input to this
function controls whether all the phones are distinct or some of them share
decision-tree roots, and whether all the states within a particular phone share
the same decision-tree root.
The function SplitDecisionTree() takes as input an EventMap that would normally
correspond to a non-split "stub" decision tree of the type created by GetStubMap(),
and does the decision-tree splitting until either the maximum number of leaves
has been reached, or the gain from splitting a leaf is less than a specified threshold.
The function ClusterEventMap() takes an EventMap and a threshold, and merges the
leaves of the EventMap as long as the cost of doing so is less than the threshold.
This function would normally be called after SplitDecisionTree().
There are other versions of this function that operate with restrictions (e.g. to
avoid merging leaves from distinct phones).
\subsection treei_func_top Top-level tree-building functions
The top-level tree-building functions are listed \ref tree_group_top "here".
These functions are called directly from the command-line programs.
The most important one is BuildTree(). It is given a Questions configuration class,
and some information about sets of phones whose roots to share and (for each
set of phones) whether to share the \ref pdf_class "pdf-classes" in a single decision
tree root or to have a separate root for each pdf-class. It is also passed information
about the lengths of the phones, plus various thresholds. It builds the tree and
returns the EventMap object which is used to construct the ContextDependency object.
Another important routine in this group is
AutomaticallyObtainQuestions(), which used to obtain questions via automatic clustering of
phones. It clusters the phones into a tree, and for each node in the tree, all leaves
accessible from that node form one question (questions are equivalent to sets of phones).
This (called from cluster-phones.cc) makes our recipes independent of human-generated
phone sets.
The function AccumulateTreeStats() accumulates statistics for training the tree, given
features and an alignment that is a sequence of transition-ids (see \ref transition_model_identifiers).
This function is defined in a different directory from the other tree-building-related functions
(hmm/ not tree/) as it depends on more code (e.g. it knows about the the TransitionModel class),
and we preferred to keep the dependencies of the core tree-building code to a minimum.
*/
/**
\defgroup tree_group Modules involved in tree building
See \page tree_internals for documentation.
*/
}