<HEAD>
<TITLE>How to use the Bayes Net Toolbox</TITLE>
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<!-- white background is better for the pictures and equations -->
<h1>How to use the Bayes Net Toolbox</h1>
This documentation was last updated on 29 October 2007.
<br>
Click
<a href="http://bnt.insa-rouen.fr/">here</a>
for a French version of this documentation (last updated in 2005).
<p>
<ul>
<li> <a href="install.html">Installation</a>
<li> <a href="#basics">Creating your first Bayes net</a>
<ul>
<li> <a href="#basics">Creating a model by hand</a>
<li> <a href="#file">Loading a model from a file</a>
<li> <a href="#GUI">Creating a model using a GUI</a>
<li> <a href="graphviz.html">Graph visualization</a>
</ul>
<li> <a href="#inference">Inference</a>
<ul>
<li> <a href="#marginal">Computing marginal distributions</a>
<li> <a href="#joint">Computing joint distributions</a>
<li> <a href="#soft">Soft/virtual evidence</a>
<li> <a href="#mpe">Most probable explanation</a>
</ul>
<li> <a href="#cpd">Conditional Probability Distributions</a>
<ul>
<li> <a href="#tabular">Tabular (multinomial) nodes</a>
<li> <a href="#noisyor">Noisy-or nodes</a>
<li> <a href="#deterministic">Other (noisy) deterministic nodes</a>
<li> <a href="#softmax">Softmax (multinomial logit) nodes</a>
<li> <a href="#mlp">Neural network nodes</a>
<li> <a href="#root">Root nodes</a>
<li> <a href="#gaussian">Gaussian nodes</a>
<li> <a href="#glm">Generalized linear model nodes</a>
<li> <a href="#dtree">Classification/regression tree nodes</a>
<li> <a href="#nongauss">Other continuous distributions</a>
<li> <a href="#cpd_summary">Summary of CPD types</a>
</ul>
<li> <a href="#examples">Example models</a>
<ul>
<li> <a
href="http://www.media.mit.edu/wearables/mithril/BNT/mixtureBNT.txt">
Gaussian mixture models</a>
<li> <a href="#pca">PCA, ICA, and all that</a>
<li> <a href="#mixep">Mixtures of experts</a>
<li> <a href="#hme">Hierarchical mixtures of experts</a>
<li> <a href="#qmr">QMR</a>
<li> <a href="#cg_model">Conditional Gaussian models</a>
<li> <a href="#hybrid">Other hybrid models</a>
</ul>
<li> <a href="#param_learning">Parameter learning</a>
<ul>
<li> <a href="#load_data">Loading data from a file</a>
<li> <a href="#mle_complete">Maximum likelihood parameter estimation from complete data</a>
<li> <a href="#prior">Parameter priors</a>
<li> <a href="#bayes_learn">(Sequential) Bayesian parameter updating from complete data</a>
<li> <a href="#em">Maximum likelihood parameter estimation with missing values (EM)</a>
<li> <a href="#tying">Parameter tying</a>
</ul>
<li> <a href="#structure_learning">Structure learning</a>
<ul>
<li> <a href="#enumerate">Exhaustive search</a>
<li> <a href="#K2">K2</a>
<li> <a href="#hill_climb">Hill-climbing</a>
<li> <a href="#mcmc">MCMC</a>
<li> <a href="#active">Active learning</a>
<li> <a href="#struct_em">Structural EM</a>
<li> <a href="#graphdraw">Visualizing the learned graph structure</a>
<li> <a href="#constraint">Constraint-based methods</a>
</ul>
<li> <a href="#engines">Inference engines</a>
<ul>
<li> <a href="#jtree">Junction tree</a>
<li> <a href="#varelim">Variable elimination</a>
<li> <a href="#global">Global inference methods</a>
<li> <a href="#quickscore">Quickscore</a>
<li> <a href="#belprop">Belief propagation</a>
<li> <a href="#sampling">Sampling (Monte Carlo)</a>
<li> <a href="#engine_summary">Summary of inference engines</a>
</ul>
<li> <a href="#influence">Influence diagrams/ decision making</a>
<li> <a href="usage_dbn.html">DBNs, HMMs, Kalman filters and all that</a>
</ul>
<h1><a name="basics">Creating your first Bayes net</h1>
To define a Bayes net, you must specify the graph structure and then
the parameters. We look at each in turn, using a simple example
(adapted from Russell and
Norvig, "Artificial Intelligence: a Modern Approach", Prentice Hall,
1995, p454).
<h2>Graph structure</h2>
Consider the following network.
<p>
<center>
<IMG SRC="Figures/sprinkler.gif">
</center>
<p>
<P>
To specify this directed acyclic graph (dag), we create an adjacency matrix:
<PRE>
N = 4;
dag = zeros(N,N);
C = 1; S = 2; R = 3; W = 4;
dag(C,[R S]) = 1;
dag(R,W) = 1;
dag(S,W)=1;
</PRE>
<P>
We have numbered the nodes as follows:
Cloudy = 1, Sprinkler = 2, Rain = 3, WetGrass = 4.
<b>The nodes must always be numbered in topological order, i.e.,
ancestors before descendants.</b>
For a more complicated graph, this is a little inconvenient: we will
see how to get around this <a href="usage_dbn.html#bat">below</a>.
<p>
In Matlab 6, you can use logical arrays instead of double arrays,
which require less memory:
<pre>
dag = false(N,N);
dag(C,[R S]) = true;
...
</pre>
However, <b>some graph functions (e.g., acyclic) do not work on
logical arrays</b>!
<p>
You can visualize the resulting graph structure using
the methods discussed <a href="#graphdraw">below</a>.
For details on GUIs,
click <a href="#GUI">here</a>.
<h2>Creating the Bayes net shell</h2>
In addition to specifying the graph structure,
we must specify the size and type of each node.
If a node is discrete, its size is the
number of possible values
it can take on; if a node is continuous,
it can be a vector, and its size is the length of this vector.
In this case, we will assume all nodes are discrete and binary.
<PRE>
discrete_nodes = 1:N;
node_sizes = 2*ones(1,N);
</pre>
If the nodes were not binary, you could type e.g.,
<pre>
node_sizes = [4 2 3 5];
</pre>
meaning that Cloudy has 4 possible values,
Sprinkler has 2 possible values, etc.
Note that these are categorical values, not ordinal, i.e.,
they are not ordered in any way (unlike, say, 'low', 'medium', 'high').
<p>
We are now ready to make the Bayes net:
<pre>
bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes);
</PRE>
By default, all nodes are assumed to be discrete, so we can also just
write
<pre>
bnet = mk_bnet(dag, node_sizes);
</PRE>
You may also specify which nodes will be observed.
If you don't know, or if this is not fixed in advance,
just use the empty list (the default).
<pre>
onodes = [];
bnet = mk_bnet(dag, node_sizes, 'discrete', discrete_nodes, 'observed', onodes);
</PRE>
Note that optional arguments are specified using a name/value syntax.
This is common for many BNT functions.
In general, to find out more about a function (e.g., which optional
arguments it takes), please see its
documentation string by typing
<pre>
help mk_bnet
</pre>
See also other <a href="matlab_tips.html">useful Matlab tips</a>.
<p>
It is possible to associate names with nodes, as follows:
<pre>
bnet = mk_bnet(dag, node_sizes, 'names', {'cloudy','S','R','W'}, 'discrete', 1:4);
</pre>
You can then refer to a node by its name:
<pre>
C = bnet.names('cloudy'); % bnet.names is an associative array
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
</pre>
This feature uses my own associative array class.
<h2><a name="cpt">Parameters</h2>
A model consists of the graph structure and the parameters.
The parameters are represented by CPD objects (CPD = Conditional
Probability Distribution), which define the probability distribution
of a node given its parents.
(We will use the terms "node" and "random variable" interchangeably.)
The simplest kind of CPD is a table (multi-dimensional array), which
is suitable when all the nodes are discrete-valued. Note that the discrete
values are not assumed to be ordered in any way; that is, they
represent categorical quantities, like male and female, rather than
ordinal quantities, like low, medium and high.
(We will discuss CPDs in more detail <a href="#cpd">below</a>.)
<p>
Tabular CPDs, also called CPTs (conditional probability tables),
are stored as multidimensional arrays, where the dimensions
are arranged in the same order as the nodes, e.g., the CPT for node 4
(WetGrass) is indexed by Sprinkler (2), Rain (3) and then WetGrass (4) itself.
Hence the child is always the last dimension.
If a node has no parents, its CPT is a column vector representing its
prior.
Note that in Matlab (unlike C), arrays are indexed
from 1, and are laid out in memory such that the first index toggles
fastest, e.g., the CPT for node 4 (WetGrass) is as follows
<P>
<P><IMG ALIGN=BOTTOM SRC="Figures/CPTgrass.gif"><P>
<P>
where we have used the convention that false==1, true==2.
We can create this CPT in Matlab as follows
<PRE>
CPT = zeros(2,2,2);
CPT(1,1,1) = 1.0;
CPT(2,1,1) = 0.1;
...
</PRE>
Here is an easier way:
<PRE>
CPT = reshape([1 0.1 0.1 0.01 0 0.9 0.9 0.99], [2 2 2]);
</PRE>
In fact, we don't need to reshape the array, since the CPD constructor
will do that for us. So we can just write
<pre>
bnet.CPD{W} = tabular_CPD(bnet, W, 'CPT', [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
</pre>
The other nodes are created similarly (using the old syntax for
optional parameters)
<PRE>
bnet.CPD{C} = tabular_CPD(bnet, C, [0.5 0.5]);
bnet.CPD{R} = tabular_CPD(bnet, R, [0.8 0.2 0.2 0.8]);
bnet.CPD{S} = tabular_CPD(bnet, S, [0.5 0.9 0.5 0.1]);
bnet.CPD{W} = tabular_CPD(bnet, W, [1 0.1 0.1 0.01 0 0.9 0.9 0.99]);
</PRE>
<h2><a name="rnd_cpt">Random Parameters</h2>
If we do not specify the CPT, random parameters will be
created, i.e., each "row" of the CPT will be drawn from the uniform distribution.
To ensure repeatable results, use
<pre>
rand('state', seed);
randn('state', seed);
</pre>
To control the degree of randomness (entropy),
you can sample each row of the CPT from a Dirichlet(p,p,...) distribution.
If p << 1, this encourages "deterministic" CPTs (one entry near 1, the rest near 0).
If p = 1, each entry is drawn from U[0,1].
If p >> 1, the entries will all be near 1/k, where k is the arity of
this node, i.e., each row will be nearly uniform.
You can do this as follows, assuming this node
is number i, and ns is the node_sizes.
<pre>
k = ns(i);
ps = parents(dag, i);
psz = prod(ns(ps));
CPT = sample_dirichlet(p*ones(1,k), psz);
bnet.CPD{i} = tabular_CPD(bnet, i, 'CPT', CPT);
</pre>
<h2><a name="file">Loading a network from a file</h2>
If you already have a Bayes net represented in the XML-based
<a href="http://www.cs.cmu.edu/afs/cs/user/fgcozman/www/Research/InterchangeFormat/">
Bayes Net Interchange Format (BNIF)</a> (e.g., downloaded from the
<a
href="http://www.cs.huji.ac.il/labs/compbio/Repository">
Bayes Net repository</a>),
you can convert it to BNT format using
the
<a href="http://www.digitas.harvard.edu/~ken/bif2bnt/">BIF-BNT Java
program</a> written by Ken Shan.
(This is not necessarily up-to-date.)
<p>
<b>It is currently not possible to save/load a BNT matlab object to
file</b>, but this is easily fixed if you modify all the constructors
for all the classes (see matlab documentation).
<h2><a name="GUI">Creating a model using a GUI</h2>
<ul>
<li>Senthil Nachimuthu
has started (Oct 07) an open source
GUI for BNT called
<a href="http://projeny.sourceforge.net">projeny</a>
using Java. This is a successor to BNJ.
<li>
Philippe LeRay has written (Sep 05)
a
<a href="http://banquiseasi.insa-rouen.fr/projects/bnt-editor/">
BNT GUI</a> in matlab.
<li>
<a
href="http://www.dataonstage.com/BNT/PACKAGES/LinkStrength/index.html">
LinkStrength</a>,
a package by Imme Ebert-Uphoff for visualizing the strength of
dependencies between nodes.
</ul>
<h2>Graph visualization</h2>
Click <a href="graphviz.html">here</a> for more information
on graph visualization.
<h1><a name="inference">Inference</h1>
Having created the BN, we can now use it for inference.
There are many different algorithms for doing inference in Bayes nets,
that make different tradeoffs between speed,
complexity, generality, and accuracy.
BNT therefore offers a variety of different inference
"engines". We will discuss these
in more detail <a href="#engines">below</a>.
For now, we will use the junction tree
engine, which is the mother of all exact inference algorithms.
This can be created as follows.
<pre>
engine = jtree_inf_engine(bnet);
</pre>
The other engines have similar constructors, but might take
additional, algorithm-specific parameters.
All engines are used in the same way, once they have been created.
We illustrate this in the following sections.
<h2><a name="marginal">Computing marginal distributions</h2>
Suppose we want to compute the probability that the sprinkler was on
given that the grass is wet.
The evidence consists of the fact that W=2. All the other nodes
are hidden (unobserved). We can specify this as follows.
<pre>
evidence = cell(1,N);
evidence{W} = 2;
</pre>
We use a 1D cell array instead of a vector to
cope with the fact that nodes can be vectors of different lengths.
In addition, the value [] can be used
to denote 'no evidence', instead of having to specify the observation
pattern as a separate argument.
(Click <a href="cellarray.html">here</a> for a quick tutorial on cell
arrays in matlab.)
<p>
We are now ready to add the evidence to the engine.
<pre>
[engine, loglik] = enter_evidence(engine, evidence);
</pre>
The behavior of this function is algorithm-specific, and is discussed
in more detail <a href="#engines">below</a>.
In the case of the jtree engine,
enter_evidence implements a two-pass message-passing scheme.
The first return argument contains the modified engine, which
incorporates the evidence. The second return argument contains the
log-likelihood of the evidence. (Not all engines are capable of
computing the log-likelihood.)
<p>
Finally, we can compute p=P(S=2|W=2) as follows.
<PRE>
marg = marginal_nodes(engine, S);
marg.T
ans =
0.57024
0.42976
p = marg.T(2);
</PRE>
We see that p = 0.4298.
<p>
Now let us add the evidence that it was raining, and see what
difference it makes.
<PRE>
evidence{R} = 2;
[engine, loglik] = enter_evidence(engine, evidence);
marg = marginal_nodes(engine, S);
p = marg.T(2);
</PRE>
We find that p = P(S=2|W=2,R=2) = 0.1945,
which is lower than
before, because the rain can ``explain away'' the
fact that the grass is wet.
<p>
You can plot a marginal distribution over a discrete variable
as a bar chart using the built-in 'bar' function:
<pre>
bar(marg.T)
</pre>
This is what it looks like
<p>
<center>
<IMG SRC="Figures/sprinkler_bar.gif">
</center>
<p>
<h2><a name="observed">Observed nodes</h2>
What happens if we ask for the marginal on an observed node, e.g. P(W|W=2)?
An observed discrete node effectively only has 1 value (the observed
one) --- all other values would result in 0 probability.
For efficiency, BNT treats observed (discrete) nodes as if their
effective size were 1, as we see below:
<pre>
evidence = cell(1,N);
evidence{W} = 2;
engine = enter_evidence(engine, evidence);
m = marginal_nodes(engine, W);
m.T
ans =
1
</pre>
This can get a little confusing, since we assigned W=2.
So we can ask BNT to add the evidence back in by passing in an optional argument:
<pre>
m = marginal_nodes(engine, W, 1);
m.T
ans =
0
1
</pre>
This shows that P(W=1|W=2) = 0 and P(W=2|W=2) = 1.
<h2><a name="joint">Computing joint distributions</h2>
We can compute the joint probability on a set of nodes as in the
following example.
<pre>
evidence = cell(1,N);
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W]);
</pre>
m is a structure. The 'T' field is a multi-dimensional array (in
this case, 3-dimensional) that contains the joint probability
distribution on the specified nodes.
<pre>
>> m.T
ans(:,:,1) =
0.2900 0.0410
0.0210 0.0009
ans(:,:,2) =
0 0.3690
0.1890 0.0891
</pre>
We see that P(S=1,R=1,W=2) = 0, since it is impossible for the grass
to be wet if both the rain and sprinkler are off.
<p>
Let us now add some evidence to R.
<pre>
evidence{R} = 2;
[engine, ll] = enter_evidence(engine, evidence);
m = marginal_nodes(engine, [S R W])
m =
domain: [2 3 4]
T: [2x1x2 double]
>> m.T
m.T
ans(:,:,1) =
0.0820
0.0018
ans(:,:,2) =
0.7380
0.1782
</pre>
The joint T(i,j,k) = P(S=i,R=j,W=k|evidence)
should have T(i,1,k) = 0 for all i,k, since R=1 is incompatible
with the evidence that R=2.
Instead of creating large tables with many 0s, BNT sets the effective
size of observed (discrete) nodes to 1, as explained above.
This is why m.T has size 2x1x2.
To get a 2x2x2 table, type
<pre>
m = marginal_nodes(engine, [S R W], 1)
m =
domain: [2 3 4]
T: [2x2x2 double]
>> m.T
m.T
ans(:,:,1) =
0 0.082
0 0.0018
ans(:,:,2) =
0 0.738
0 0.1782
</pre>
<p>
Note: It is not always possible to compute the joint on arbitrary
sets of nodes: it depends on which inference engine you use, as discussed
in more detail <a href="#engines">below</a>.
<h2><a name="soft">Soft/virtual evidence</h2>
Sometimes a node is not observed, but we have some distribution over
its possible values; this is often called "soft" or "virtual"
evidence.
One can use this as follows
<pre>
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
</pre>
where soft_evidence{i} is either [] (if node i has no soft evidence)
or is a vector representing the probability distribution over i's
possible values.
For example, if we don't know i's exact value, but we know its
likelihood ratio is 60/40, we can write evidence{i} = [] and
soft_evidence{i} = [0.6 0.4].
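<p>
For example, on the sprinkler network above, we can combine hard evidence
that the grass is wet with soft evidence about the Cloudy node. This is a
minimal sketch; the 0.7/0.3 weighting is made up for illustration.
<pre>
soft_evidence = cell(1,N);
soft_evidence{C} = [0.7 0.3];    % we suspect it is probably not cloudy
evidence = cell(1,N);
evidence{W} = 2;                 % the grass is definitely wet
[engine, loglik] = enter_evidence(engine, evidence, 'soft', soft_evidence);
marg = marginal_nodes(engine, S);
</pre>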
<p>
Currently only jtree_inf_engine supports this option.
It assumes that all hidden nodes, and all nodes for
which we have soft evidence, are discrete.
For a longer example, see BNT/examples/static/softev1.m.
<h2><a name="mpe">Most probable explanation</h2>
To compute the most probable explanation (MPE) of the evidence (i.e.,
the most probable assignment, or a mode of the joint), use
<pre>
[mpe, ll] = calc_mpe(engine, evidence);
</pre>
mpe{i} is the most likely value of node i.
This calls enter_evidence with the 'maximize' flag set to 1, which
causes the engine to do max-product instead of sum-product.
The resulting max-marginals are then thresholded.
If there is more than one maximum probability assignment, we must take
care to break ties in a consistent manner (thresholding the
max-marginals may give the wrong result). To force this behavior,
type
<pre>
[mpe, ll] = calc_mpe(engine, evidence, 1);
</pre>
Note that computing the MPE is sometimes called abductive reasoning.
<p>
You can also use <tt>calc_mpe_bucket</tt>, written by Ron Zohar,
which does a forwards max-product pass followed by a backwards traceback
pass, which is how Viterbi is traditionally implemented.
<h1><a name="cpd">Conditional Probability Distributions</h1>
A Conditional Probability Distribution (CPD)
defines P(X(i) | X(Pa(i))), where X(i) is the i'th node, and X(Pa(i))
are the parents of node i. There are many ways to represent this
distribution, which depend in part on whether X(i) and X(Pa(i)) are
discrete, continuous, or a combination.
We will discuss various representations below.
<h2><a name="tabular">Tabular nodes</h2>
If the CPD is represented as a table (i.e., if it is a multinomial
distribution), it has a number of parameters that is exponential in
the number of parents. See the example <a href="#cpt">above</a>.
<h2><a name="noisyor">Noisy-or nodes</h2>
A noisy-OR node is like a regular logical OR gate except that
sometimes the effects of parents that are on get inhibited.
Let the prob. that parent i gets inhibited be q(i).
Then a node, C, with 2 parents, A and B, has the following CPD, where
we use F and T to represent off and on (1 and 2 in BNT).
<pre>
A  B  P(C=off)  P(C=on)
------------------------------
F  F  1.0       0.0
T  F  q(A)      1-q(A)
F  T  q(B)      1-q(B)
T  T  q(A)q(B)  1-q(A)q(B)
</pre>
Thus we see that the causes get inhibited independently.
It is common to associate a "leak" node with a noisy-or CPD, which is
like a parent that is always on. This can account for all other unmodelled
causes which might turn the node on.
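<p>
As a sanity check, here is a small Matlab sketch (the variable names and
numbers are ours) that builds the 2x2x2 CPT in the table above directly
from the inhibition probabilities, using the convention false==1, true==2,
with the child as the last dimension:
<pre>
qA = 0.1; qB = 0.2;             % inhibition probabilities (illustrative values)
CPT = zeros(2,2,2);             % indexed by (A, B, C)
CPT(:,:,1) = [1   qB;           % P(C=off | A, B)
              qA  qA*qB];
CPT(:,:,2) = 1 - CPT(:,:,1);    % P(C=on | A, B)
</pre>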
<p>
The noisy-or distribution is similar to the logistic distribution.
To see this, let the nodes, S(i), have values in {0,1}, and let q(i,j)
be the prob. that j inhibits i. Then
<pre>
Pr(S(i)=1 | parents(S(i))) = 1 - prod_{j} q(i,j)^S(j)
</pre>
Now define w(i,j) = -ln q(i,j) and rho(x) = 1-exp(-x). Then
<pre>
Pr(S(i)=1 | parents(S(i))) = rho(sum_j w(i,j) S(j))
</pre>
For a sigmoid node, we have
<pre>
Pr(S(i)=1 | parents(S(i))) = sigma(-sum_j w(i,j) S(j))
</pre>
where sigma(x) = 1/(1+exp(-x)). Hence they differ in the choice of
the activation function (although both are monotonically increasing).
In addition, in the case of a noisy-or, the weights are constrained to be
positive, since they derive from probabilities q(i,j).
In both cases, the number of parameters is <em>linear</em> in the
number of parents, unlike the case of a multinomial distribution,
where the number of parameters is exponential in the number of parents.
We will see an example of noisy-OR nodes <a href="#qmr">below</a>.
<h2><a name="deterministic">Other (noisy) deterministic nodes</h2>
Deterministic CPDs for discrete random variables can be created using
the deterministic_CPD class. It is also possible to 'flip' the output
of the function with some probability, to simulate noise.
The boolean_CPD class is just a special case of a
deterministic CPD, where the parents and child are all binary.
<p>
Both of these classes are just "syntactic sugar" for the tabular_CPD
class.
<h2><a name="softmax">Softmax nodes</h2>
If we have a discrete node with a continuous parent,
we can define its CPD using a softmax function
(also known as the multinomial logit function).
This acts like a soft thresholding operator, and is defined as follows:
<pre>
                  exp(w(:,i)'*x + b(i))
Pr(Q=i | X=x) = -----------------------------
                sum_j exp(w(:,j)'*x + b(j))
</pre>
The parameters of a softmax node, w(:,i) and b(i), i=1..|Q|, have the
following interpretation: w(:,i)-w(:,j) is the normal vector to the
decision boundary between classes i and j,
and b(i)-b(j) is its offset (bias). For example, suppose
X is a 2-vector, and Q is binary. Then
<pre>
w = [1 -1;
     0 0];
b = [0 0];
</pre>
means that class 1 consists of points in the 2D plane with a positive x coordinate,
and class 2 consists of points with a negative x coordinate.
If w has large magnitude, the decision boundary is sharp, otherwise it
is soft.
In the special case that Q is binary (0/1), the softmax function reduces to the logistic
(sigmoid) function.
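<p>
Here is a quick Matlab check of this example (the test point x is chosen
by us for illustration):
<pre>
w = [1 -1;
     0 0];
b = [0 0];
x = [2; 5];                     % a point with positive x coordinate
a = w'*x + b';                  % one activation per class
p = exp(a) / sum(exp(a));       % softmax; p(1) is close to 1, as expected
</pre>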
<p>
Fitting a softmax function can be done using the iteratively reweighted
least squares (IRLS) algorithm.
We use the implementation from
<a href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
Note that since
the softmax distribution is not in the exponential family, it does not
have finite sufficient statistics, and hence we must store all the
training data in uncompressed form.
If this takes too much space, one should use online (stochastic) gradient
descent (not implemented in BNT).
<p>
If a softmax node also has discrete parents,
we use a different set of w/b parameters for each combination of
parent values, as in the <a href="#gaussian">conditional linear
Gaussian CPD</a>.
This feature was implemented by Pierpaolo Brutti.
He is currently extending it so that discrete parents can be treated
as if they were continuous, by adding indicator variables to the X
vector.
<p>
We will see an example of softmax nodes <a href="#mixexp">below</a>.
<h2><a name="mlp">Neural network nodes</h2>
Pierpaolo Brutti has implemented the mlp_CPD class, which uses a multi-layer perceptron
to implement a mapping from continuous parents to discrete children,
similar to the softmax function.
(If there are also discrete parents, it creates a mixture of MLPs.)
It uses code from <a
href="http://www.ncrg.aston.ac.uk/netlab/">Netlab</a>.
This is work in progress.
<h2><a name="root">Root nodes</h2>
A root node has no parents and no parameters; it can be used to model
an observed, exogenous input variable, i.e., one which is "outside"
the model.
This is useful for conditional density models.
We will see an example of root nodes <a href="#mixexp">below</a>.
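<p>
In BNT, such a node is given a root_CPD, which takes no parameters beyond
the node index, e.g. (assuming the node in question is number i):
<pre>
bnet.CPD{i} = root_CPD(bnet, i);   % no parameters to specify
</pre>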
<h2><a name="gaussian">Gaussian nodes</h2>
We now consider a distribution suitable for the continuous-valued nodes.
Suppose the node is called Y, its continuous parents (if any) are
called X, and its discrete parents (if any) are called Q.
The distribution on Y is defined as follows:
<pre>
- no parents: Y ~ N(mu, Sigma)
- cts parents : Y|X=x ~ N(mu + W x, Sigma)
- discrete parents: Y|Q=i ~ N(mu(:,i), Sigma(:,:,i))
- cts and discrete parents: Y|X=x,Q=i ~ N(mu(:,i) + W(:,:,i) * x, Sigma(:,:,i))
</pre>
where N(mu, Sigma) denotes a Normal distribution with mean mu and
covariance Sigma. Let |X|, |Y| and |Q| denote the sizes of X, Y and Q
respectively.
If there are no discrete parents, |Q|=1; if there is
more than one, then |Q| = a vector of the sizes of each discrete parent.
If there are no continuous parents, |X|=0; if there is more than one,
then |X| = the sum of their sizes.
Then mu is a |Y|*|Q| vector, Sigma is a |Y|*|Y|*|Q| positive
semi-definite matrix, and W is a |Y|*|X|*|Q| regression (weight)
matrix.
<p>
We can create a Gaussian node with random parameters as follows.
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i);
</pre>
We can specify the value of one or more of the parameters as in the
following example, in which |Y|=2, and |Q|=1.
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i, 'mean', [0; 0], 'weights', randn(Y,X), 'cov', eye(Y));
</pre>
<p>
We will see an example of conditional linear Gaussian nodes <a
href="#cg_model">below</a>.
<p>
<b>When learning Gaussians from data</b>, it is helpful to ensure the
data has a small magnitude
(see e.g., KPMstats/standardize) to prevent numerical problems.
Unless you have a lot of data, it is also a very good idea to use
diagonal instead of full covariance matrices.
(BNT does not currently support spherical covariances, although it
would be easy to add, since KPMstats/clg_Mstep supports this option;
you would just need to modify gaussian_CPD/update_ess to accumulate
weighted inner products.)
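<p>
For example, we believe a diagonal covariance can be requested via the
'cov_type' option (check <tt>help gaussian_CPD</tt> for the exact option
names):
<pre>
bnet.CPD{i} = gaussian_CPD(bnet, i, 'cov_type', 'diag');  % 'cov_type' assumed; see help gaussian_CPD
</pre>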
<h2><a name="nongauss">Other continuous distributions</h2>
Currently BNT does not support any CPDs for continuous nodes other
than the Gaussian.
However, you can use a mixture of Gaussians to
approximate other continuous distributions. We will see an example
of this with the IFA model <a href="#pca">below</a>.
<h2><a name="glm">Generalized linear model nodes</h2>
In the future, we may incorporate some of the functionality of
<a href =
"http://www.sci.usq.edu.au/staff/dunn/glmlab/glmlab.html">glmlab</a>
into BNT.
<h2><a name="dtree">Classification/regression tree nodes</h2>
We plan to add classification and regression trees to define CPDs for
discrete and continuous nodes, respectively.
Trees have many advantages: they are easy to interpret, they can do
feature selection, they can
handle discrete and continuous inputs, they do not make strong
assumptions about the form of the distribution, the number of
parameters can grow in a data-dependent way (i.e., they are
semi-parametric), they can handle missing data, etc.
However, they are not yet implemented.
<!--
Yimin Zhang is currently (Feb '02) implementing this.
-->
<h2><a name="cpd_summary">Summary of CPD types</h2>
We list all the different types of CPDs supported by BNT.
For each CPD, we specify if the child and parents can be discrete (D) or
continuous (C) (Binary (B) nodes are a special case).
We also specify which methods each class supports.
If a method is inherited, the name of the parent class is mentioned.
If a parent class calls a child method, this is mentioned.
<p>
The <tt>CPD_to_CPT</tt> method converts a CPD to a table; this
requires that the child and all parents are discrete.
The CPT might be exponentially big...
<tt>convert_to_table</tt> evaluates a CPD with evidence, and
represents the resulting potential as an array.
This requires that the child is discrete, and any continuous parents
are observed.
<tt>convert_to_pot</tt> evaluates a CPD with evidence, and
represents the resulting potential as a dpot, gpot, cgpot or upot, as
requested. (d=discrete, g=Gaussian, cg = conditional Gaussian, u =
utility).
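<p>
For example, for the all-discrete sprinkler network defined above, the
table for the WetGrass node can be recovered like this (a one-line sketch
using the <tt>CPD_to_CPT</tt> method just described):
<pre>
CPT = CPD_to_CPT(bnet.CPD{W});   % a 2x2x2 array, child dimension last
</pre>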
<p>
When we sample a node, all the parents are observed.
When we compute the (log) probability of a node, all the parents and
the child are observed.
<p>
We also specify if the parameters are learnable.
For learning with EM, we require
the methods <tt>reset_ess</tt>, <tt>update_ess</tt> and
<tt>maximize_params</tt>.
For learning from fully observed data, we require
the method <tt>learn_params</tt>.
By default, all classes inherit this from generic_CPD, which simply
calls <tt>update_ess</tt> N times, once for each data case, followed
by <tt>maximize_params</tt>, i.e., it is like EM, without the E step.
Some classes implement a batch formula, which is quicker.
<p>
Bayesian learning means computing a posterior over the parameters
given fully observed data.
<p>
Pearl means we implement the methods <tt>compute_pi</tt> and
<tt>compute_lambda_msg</tt>, used by
<tt>pearl_inf_engine</tt>, which runs on directed graphs.
<tt>belprop_inf_engine</tt> only needs <tt>convert_to_pot</tt>.
The pearl methods can exploit special properties of the CPDs for
computing the messages efficiently, whereas belprop does not.
<p>
The only method implemented by generic_CPD is <tt>adjustable_CPD</tt>,
which is not shown, since it is not very interesting.
<p>
<table border=1>
<tr>
<td align=center>Name
<td align=center>Child
<td align=center>Parents
<td align=center>Comments
<td align=center>CPD_to_CPT
<td align=center>conv_to_table
<td align=center>conv_to_pot
<td align=center>sample
<td align=center>prob
<td align=center>learn
<td align=center>Bayes
<td align=center>Pearl
<tr>
<!-- Name--><td>
<!-- Child--><td>
<!-- Parents--><td>
<!-- Comments--><td>
<!-- CPD_to_CPT--><td>
<!-- conv_to_table--><td>
<!-- conv_to_pot--><td>
<!-- sample--><td>
<!-- prob--><td>
<!-- learn--><td>
<!-- Bayes--><td>
<!-- Pearl--><td>
<tr>
<!-- Name--><td>boolean
<!-- Child--><td>B
<!-- Parents--><td>B
<!-- Comments--><td>Syntactic sugar for tabular
<!-- CPD_to_CPT--><td>-
<!-- conv_to_table--><td>-
<!-- conv_to_pot--><td>-
<!-- sample--><td>-
<!-- prob--><td>-
<!-- learn--><td>-
<!-- Bayes--><td>-
<!-- Pearl--><td>-
<tr>
<!-- Name--><td>deterministic
<!-- Child--><td>D
<!-- Parents--><td>D
<!-- Comments--><td>Syntactic sugar for tabular
<!-- CPD_to_CPT--><td>-
<!-- conv_to_table--><td>-
<!-- conv_to_pot--><td>-
<!-- sample--><td>-
<!-- prob--><td>-
<!-- learn--><td>-
<!-- Bayes--><td>-
<!-- Pearl--><td>-
<tr>
<!-- Name--><td>Discrete
<!-- Child--><td>D
<!-- Parents--><td>C/D
<!-- Comments--><td>Virtual class
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>Calls CPD_to_CPT
<!-- conv_to_pot--><td>Calls conv_to_table
<!-- sample--><td>Calls conv_to_table
<!-- prob--><td>Calls conv_to_table
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>N
<tr>
<!-- Name--><td>Gaussian
<!-- Child--><td>C
<!-- Parents--><td>C/D
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>Y
<!-- sample--><td>Y
<!-- prob--><td>Y
<!-- learn--><td>Y
<!-- Bayes--><td>N
<!-- Pearl--><td>N
<tr>
<!-- Name--><td>gmux
<!-- Child--><td>C
<!-- Parents--><td>C/D
<!-- Comments--><td>multiplexer
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>N
<!-- conv_to_pot--><td>Y
<!-- sample--><td>N
<!-- prob--><td>N
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>Y
<tr>
<!-- Name--><td>MLP
<!-- Child--><td>D
<!-- Parents--><td>C/D
<!-- Comments--><td>multi layer perceptron
<!-- CPD_to_CPT--><td>N
<!-- conv_to_table--><td>Y
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>Y
<!-- Bayes--><td>N
<!-- Pearl--><td>N
<tr>
<!-- Name--><td>noisy-or
<!-- Child--><td>B
<!-- Parents--><td>B
<!-- Comments--><td>-
<!-- CPD_to_CPT--><td>Y
<!-- conv_to_table--><td>Inherits from discrete
<!-- conv_to_pot--><td>Inherits from discrete
<!-- sample--><td>Inherits from discrete
<!-- prob--><td>Inherits from discrete
<!-- learn--><td>N
<!-- Bayes--><td>N
<!-- Pearl--><td>Y
<tr>
<!-- Name--><td>root
<!-- Child--><td>C/D