% LaTeX source for ``Think Bayes, 2nd edition''
% Copyright 2018 Allen B. Downey.
% License: Creative Commons Attribution-NonCommercial 4.0 International License.
% http://creativecommons.org/licenses/by-nc/4.0/
%
\documentclass[12pt]{book}
\usepackage[width=5.5in,height=8.5in,
hmarginratio=3:2,vmarginratio=1:1]{geometry}
% for some of these packages, you might have to install
% texlive-latex-extra (in Ubuntu)
\usepackage[T1]{fontenc}
\usepackage{url}
\usepackage{graphicx}
\usepackage{exercise}
\usepackage{makeidx}
\usepackage{upquote}
\usepackage{fancyhdr}
\usepackage{amsmath}
\usepackage[bookmarks]{hyperref}
\newenvironment{exercise}{\begin{Exercise}}{\end{Exercise}}
\input{latexonly}
\newcommand{\PMF}{\mathrm{PMF}}
\newcommand{\PDF}{\mathrm{PDF}}
\newcommand{\CDF}{\mathrm{CDF}}
\newcommand{\ICDF}{\mathrm{ICDF}}
\newcommand{\p}[1]{\ensuremath{P(#1)}}
\newcommand{\odds}[1]{\ensuremath{\mathrm{o}(#1)}}
\newcommand{\T}[1]{\mbox{#1}}
\newcommand{\AND}{~\mathrm{and}~}
\newcommand{\NOT}{\mathrm{not}~}
\newcommand{\given}{~|~}
\title{Think Bayes}
\author{Allen B. Downey}
\newcommand{\thetitle}{Think Bayes: Bayesian Statistics Made Simple}
\newcommand{\theversion}{2.0.0}
\makeindex
\begin{document}
\maketitle
\frontmatter
\pagebreak
\thispagestyle{empty}
Copyright \copyright ~2018 Allen B. Downey.
\vspace{0.2in}
\begin{flushleft}
Green Tea Press \\
9 Washburn Ave \\
Needham MA 02492
\end{flushleft}
Permission is granted to copy, distribute, and/or modify this document
under the terms of the Creative Commons Attribution-NonCommercial 4.0
International License, which is available at \url{http://creativecommons.org/licenses/by-nc/4.0/}.
\vspace{0.2in}
\chapter{Preface}
\label{preface}
\section{My theory, which is mine}
The premise of this book, and the other books in the {\it Think X}
series, is that if you know how to program, you
can use that skill to learn other topics.
Most books on Bayesian statistics use mathematical notation and
present ideas in terms of mathematical concepts like calculus.
This book uses Python code instead of math, and discrete approximations
instead of continuous mathematics. As a result, what would
be an integral in a math book becomes a summation, and
most operations on probability distributions are simple loops.
I think this presentation is easier to understand, at least for people with
programming skills. It is also more general, because when we make
modeling decisions, we can choose the most appropriate model without
worrying too much about whether the model lends itself to conventional
analysis.
Also, it provides a smooth development path from simple examples to
real-world problems. Chapter~\ref{estimation} is a good example. It
starts with a simple example involving dice, one of the staples of
basic probability. From there it proceeds in small steps to the
locomotive problem, which I borrowed from Mosteller's {\it
Fifty Challenging Problems in Probability with Solutions}, and from
there to the German tank problem, a famously successful application of
Bayesian methods during World War II.
\section{Modeling and approximation}
Most chapters in this book are motivated by a real-world problem, so
they involve some degree of modeling. Before we can apply Bayesian
methods (or any other analysis), we have to make decisions about which
parts of the real-world system to include in the model and which
details we can abstract away. \index{modeling}
For example, in Chapter~\ref{prediction}, the motivating problem is to
predict the winner of a hockey game. I model goal-scoring as a
Poisson process, which implies that a goal is equally likely at any
point in the game. That is not exactly true, but it is probably a
good enough model for most purposes.
\index{Poisson process}
In Chapter~\ref{evidence} the motivating problem is interpreting SAT
scores (the SAT is a standardized test used for college admissions in
the United States). I start with a simple model that assumes that all
SAT questions are equally difficult, but in fact the designers of the
SAT deliberately include some questions that are relatively easy and
some that are relatively hard. I present a second model that accounts
for this aspect of the design, and show that it doesn't have a big
effect on the results after all.
I think it is important to include modeling as an explicit part
of problem solving because it reminds us to think about modeling
errors (that is, errors due to simplifications and assumptions
of the model).
Many of the methods in this book are based on discrete distributions,
which makes some people worry about numerical errors. But for
real-world problems, numerical errors are almost always
smaller than modeling errors.
Furthermore, the discrete approach often allows better modeling
decisions, and I would rather have an approximate solution
to a good model than an exact solution to a bad model.
On the other hand, continuous methods sometimes yield performance
advantages---for example by replacing a linear- or quadratic-time
computation with a constant-time solution.
So I recommend a general process with these steps:
\begin{enumerate}
\item While you are exploring a problem, start with simple models and
implement them in code that is clear, readable, and demonstrably
correct. Focus your attention on good modeling decisions, not
optimization.
\item Once you have a simple model working, identify the
biggest sources of error. You might need to increase the number of
values in a discrete approximation, or increase the number of
iterations in a Monte Carlo simulation, or add details to the model.
\item If the performance of your solution is good enough for your
application, you might not have to do any optimization. But if you
do, there are two approaches to consider. You can review your code
and look for optimizations; for example, if you cache previously
computed results you might be able to avoid redundant computation.
Or you can look for analytic methods that yield computational
shortcuts.
\end{enumerate}
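The caching idea mentioned in Step 3 can be sketched with Python's {\tt functools.lru_cache}. The recursive function below is a hypothetical stand-in for an expensive computation (it is not part of the book's code); the decorator stores previously computed results, so each pair of arguments is evaluated only once:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_ways(n, k):
    # Hypothetical expensive computation: the number of ways to choose
    # k items from n, computed by Pascal's rule. Without the cache,
    # this recursion takes exponential time; with it, each (n, k)
    # pair is computed once and reused.
    if k == 0 or k == n:
        return 1
    return count_ways(n - 1, k - 1) + count_ways(n - 1, k)

print(count_ways(30, 15))  # fast with the cache
```

If the cached version still agrees with a slow reference implementation on small inputs, that is exactly the kind of regression test described below.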
One benefit of this process is that Steps 1 and 2 tend to be fast, so you
can explore several alternative models before investing heavily in any
of them.
Another benefit is that if you get to Step 3, you will be starting
with a reference implementation that is likely to be correct,
which you can use for regression testing (that is, checking that the
optimized code yields the same results, at least approximately).
\index{regression testing}
\section{Working with the code}
\label{download}
The code used in this book is available from
\url{https://github.com/AllenDowney/ThinkBayes}. Git is a version
control system that allows you to keep track of the files that
make up a project. A collection of files under Git's control is
called a ``repository''. GitHub is a hosting service that provides
storage for Git repositories and a convenient web interface.
\index{repository}
\index{Git}
\index{GitHub}
The GitHub homepage for my repository provides several ways to
work with the code:
\begin{itemize}
\item You can create a copy of my repository
on GitHub by pressing the {\sf Fork} button. If you don't already
have a GitHub account, you'll need to create one. After forking, you'll
have your own repository on GitHub that you can use to keep track
of code you write while working on this book. Then you can
clone the repo, which means that you copy the files
to your computer.
\index{fork}
\item Or you could clone
my repository. You don't need a GitHub account to do this, but you
won't be able to write your changes back to GitHub.
\index{clone}
\item If you don't want to use Git at all, you can download the files
in a Zip file using the button in the lower-right corner of the
GitHub page.
\end{itemize}
The code for the first edition of the book works with Python 2.
If you are using Python 3, you might want to use the updated code
in \url{https://github.com/AllenDowney/ThinkBayes2} instead.
I developed this book using Anaconda from
Continuum Analytics, which is a free Python distribution that includes
all the packages you'll need to run the code (and lots more).
I found Anaconda easy to install. By default it does a user-level
installation, not system-level, so you don't need administrative
privileges. You can
download Anaconda from \url{http://continuum.io/downloads}.
\index{Anaconda}
If you don't want to use Anaconda, you will need the following
packages:
\begin{itemize}
\item NumPy for basic numerical computation, \url{http://www.numpy.org/};
\index{NumPy}
\item SciPy for scientific computation,
\url{http://www.scipy.org/};
\index{SciPy}
\item matplotlib for visualization, \url{http://matplotlib.org/}.
\index{matplotlib}
\end{itemize}
Although these are commonly used packages, they are not included with
all Python installations, and they can be hard to install in some
environments. If you have trouble installing them, I
recommend using Anaconda or one of the other Python distributions
that include these packages.
\index{installation}
Many of the examples in this book use classes and functions defined in
{\tt thinkbayes.py}. Some of them also use {\tt thinkplot.py}, which
provides wrappers for some of the functions in {\tt pyplot}, which is
part of {\tt matplotlib}.
\section{Code style}
Experienced Python programmers will notice that the code in this
book does not comply with PEP 8, which is the most common
style guide for Python (\url{http://www.python.org/dev/peps/pep-0008/}).
\index{PEP 8}
Specifically, PEP 8 calls for lowercase function names with
underscores between words, \verb"like_this". In this book and
the accompanying code, function and method names begin with
a capital letter and use camel case, \verb"LikeThis".
I broke this rule because I developed some of the code
while I was a Visiting Scientist at Google, so I followed
the Google style guide, which deviates from PEP 8 in a few
places. Once I got used to Google style, I found that I liked
it. And at this point, it would be too much trouble to change.
Also on the topic of style, I write ``Bayes's theorem''
with an {\it s} after the apostrophe, which is preferred in some
style guides and deprecated in others. I don't have a strong
preference. I had to choose one, and this is the one I chose.
And finally one typographical note: throughout the book, I use
PMF and CDF for the mathematical concept of a probability
mass function or cumulative distribution function, and Pmf and Cdf
to refer to the Python objects I use to represent them.
\section{Prerequisites}
There are several excellent modules for doing Bayesian statistics in
Python, including PyMC and OpenBUGS. I chose not to use them
for this book because you need a fair amount of background knowledge
to get started with these modules, and I want to keep the
prerequisites minimal. If you know Python and a little bit about
probability, you are ready to start this book.
Chapter~\ref{intro} is about probability and Bayes's theorem; it has
no code. Chapter~\ref{compstat} introduces {\tt Pmf}, a thinly disguised
Python dictionary I use to represent a probability mass function
(PMF). Then Chapter~\ref{estimation} introduces {\tt Suite}, a kind
of Pmf that provides a framework for doing Bayesian updates.
In some of the later chapters, I use
analytic distributions including the Gaussian (normal) distribution,
the exponential and Poisson distributions, and the beta distribution.
In Chapter~\ref{species} I break out the less-common Dirichlet
distribution, but I explain it as I go along. If you are not familiar
with these distributions, you can read about them on Wikipedia. You
could also read the companion to this book, {\it Think Stats}, or an
introductory statistics book (although I'm afraid most of them take
a mathematical approach that is not particularly helpful for practical
purposes).
\section*{Contributor List}
If you have a suggestion or correction, please send email to
{\it [email protected]}. If I make a change based on your
feedback, I will add you to the contributor list
(unless you ask to be omitted).
\index{contributors}
If you include at least part of the sentence the
error appears in, that makes it easy for me to search. Page and
section numbers are fine, too, but not as easy to work with.
Thanks!
\small
\begin{itemize}
\item First, I have to acknowledge David MacKay's excellent book,
{\it Information Theory, Inference, and Learning Algorithms}, which is
where I first came to understand Bayesian methods. With his
permission, I use several problems from
his book as examples.
\item This book also benefited from my interactions with Sanjoy
Mahajan, especially in fall 2012, when I audited his class on
Bayesian Inference at Olin College.
\item I wrote parts of this book during project nights with the Boston
Python User Group, so I would like to thank them for their
company and pizza.
\item Olivier Yiptong sent several helpful suggestions.
\item Yuriy Pasichnyk found several errors.
\item Kristopher Overholt sent a long list of corrections and suggestions.
\item Max Hailperin suggested a clarification in Chapter~\ref{intro}.
\item Markus Dobler pointed out that drawing cookies from a bowl
with replacement is an unrealistic scenario.
\item In spring 2013, students in my class, Computational Bayesian
Statistics, made many helpful corrections and suggestions: Kai
Austin, Claire Barnes, Kari Bender, Rachel Boy, Kat Mendoza, Arjun
Iyer, Ben Kroop, Nathan Lintz, Kyle McConnaughay, Alec Radford,
Brendan Ritter, and Evan Simpson.
\item Greg Marra and Matt Aasted helped me clarify the discussion of
{\it The Price is Right} problem.
\item Marcus Ogren pointed out that the original statement of the
locomotive problem was ambiguous.
\item Jasmine Kwityn and Dan Fauxsmith at O'Reilly Media proofread the
book and found many opportunities for improvement.
\item Linda Pescatore found a typo and made some helpful suggestions.
\item Tomasz Mi\k{a}sko sent many excellent corrections and suggestions.
% ENDCONTRIB
\end{itemize}
Other people who spotted typos and small errors include
Tom Pollard,
Paul A. Giannaros,
Jonathan Edwards,
George Purkins,
Robert Marcus,
Ram Limbu,
James Lawry,
Ben Kahle,
Jeffrey Law, and
Alvaro Sanchez.
\normalsize
\pagebreak
\tableofcontents
% START THE BOOK
\mainmatter
\chapter{Bayes's Theorem}
\label{intro}
The fundamental idea behind all Bayesian statistics is Bayes's theorem,
which is surprisingly easy to derive if you understand
conditional probability. So we'll start with probability, then
conditional probability, then Bayes's theorem, and on to Bayesian
statistics.
\index{conditional probability}
\index{probability!conditional}
\section{The Linda problem}
As part of a psychological experiment, Tversky and Kahneman posed the following question\footnote{See \url{https://en.wikipedia.org/wiki/Conjunction_fallacy}}:
\begin{quote}
Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations. Which is more probable?
\begin{enumerate}
\item Linda is a bank teller.
\item Linda is a bank teller and is active in the feminist movement.
\end{enumerate}
\end{quote}
Many people choose the second answer, presumably because it seems more consistent with the description. It seems unlikely that Linda would be ``just'' a bank teller; if she is a bank teller, it seems likely that she would also be a feminist.
But the second answer cannot be correct. Suppose we find 1000 people who fit Linda's description and 10 of them work as bank tellers. How many of them are also feminists? At most, all 10 of them are; in that case, the two options are equally likely. Or some of them are; in that case the second option is less likely. But there can't be more than 10 out of 10, so the second option cannot be more likely.
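As a sanity check, this counting argument can be expressed in a few lines of Python (a sketch using the numbers from the text):

```python
# Out of 1000 people who fit the description, 10 are bank tellers.
population = 1000
tellers = 10

# However many of the tellers are also feminists, there are at most
# 10 of them, so P(teller and feminist) can never exceed P(teller).
for feminist_tellers in range(tellers + 1):
    p_teller = tellers / population
    p_teller_and_feminist = feminist_tellers / population
    assert p_teller_and_feminist <= p_teller
print("P(teller and feminist) <= P(teller) in every case")
```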
If this example makes you uncomfortable, you are in good company. The biologist Stephen Jay Gould wrote:
\begin{quote}
I am particularly fond of this example because I know that the [second] statement is least probable, yet a little homunculus\footnote{See \url{https://en.wikipedia.org/wiki/Homunculus_argument}.} in my head continues to jump up and down, shouting at me, ``but she can't just be a bank teller; read the description.''
\end{quote}
In the following sections I'll use this example to demonstrate probability, conditional probability, and Bayes's theorem.
\section{Probability}
The definition of probability is more controversial than you might expect (see \url{https://en.wikipedia.org/wiki/Probability_interpretations}). To avoid getting bogged down before we get started, I will start with a simple definition: a {\bf probability} is a fraction of a {\bf population}.
For example, if we survey 1000 people, and 10 of them are bank tellers, the fraction that work as bank tellers is 0.01 or 1\%. If we choose a person from this population at random, the probability that they are a bank teller is 1\%. (By ``at random'' I mean that every person in the population has the same chance of being chosen.)
In this example, the population is finite, so we can compute probabilities by counting. To demonstrate, and to get back to the Linda problem, I'll use a data set from the General Social Survey\footnote{See \url{http://gss.norc.org/}} (GSS).
As you will see in the Jupyter notebook for this chapter, I downloaded a dataset with a population of 50,287 people who responded to the survey. For each respondent I extract the following variables:
\begin{description}
\item[female]: \py{True} if the respondent is female, \py{False} otherwise.
\item[liberal]: \py{True} if the respondent self-identifies as liberal.
\item[democrat]: \py{True} if the respondent self-identifies as a Democrat, that is, a member of the Democratic Party in the United States.
\item[banker]: \py{True} if the respondent works in banking.
\end{description}
Each of these variables is a Series object, which is defined by Pandas. If you are not familiar with Pandas, I will explain what you need to know as we go along.
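If you want to experiment without downloading the GSS data, you can construct small boolean Series as toy stand-ins (the values below are made up for illustration; the real data is loaded in the chapter's notebook):

```python
import pandas as pd

# Toy stand-ins for the GSS variables: one boolean per respondent.
female = pd.Series([True, False, True, True, False])
banker = pd.Series([False, False, True, True, False])

print(banker.sum())    # number of bankers, counting True as 1
print(banker.mean())   # fraction of bankers
```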
To compute the number of respondents who are bankers, we could use a for loop:
\begin{code}
total = 0
for x in banker:
    if x:
        total += 1
\end{code}
But it is easier to use methods provided by Series. For example, \py{sum} computes the total of the elements in the Series, treating \py{True} as 1 and \py{False} as 0. So we can replace the for loop with:
\begin{code}
total = banker.sum()
\end{code}
To get a probability, we could divide the total by the length of the Series:
\begin{code}
prob = total / len(banker)
\end{code}
Or we could use the Series method \py{mean}, which computes the mean of the elements, again treating \py{True} as 1 and \py{False} as 0:
\begin{code}
prob = banker.mean()
\end{code}
To make the code more readable, I'll define a function that computes the fraction of \py{True} values in a Series:
\begin{code}
def prob(A):
    return A.mean()
\end{code}
Then we can compute a probability for each variable:
\begin{code}
prob(female)
0.5385487302881461
prob(liberal)
0.14741384453238413
prob(democrat)
0.36639688189790603
prob(banker)
0.014536560144769025
\end{code}
In this dataset, about 54\% of respondents are female, 15\% say they are liberal, 37\% say they are Democrats, and 1.5\% work in banking.
The mathematical notation for probability is $\p{A}$, where $A$ represents a fact which might be true or false, or a prediction that might come true or not. In the example, $\p{banker}$ is the probability that a random respondent is a banker.
\section{Conditional probability}
I'll use this data to solve a simplified version of the Linda problem:
\begin{quote}
Linda is female and liberal:
\begin{itemize}
\item What is the probability that she is a banker?
\item What is the probability that she is a banker and a Democrat?
\end{itemize}
\end{quote}
The answers to these questions are conditional probabilities.
A {\bf conditional probability} is a probability based on a specified subset of a population. For example, given that Linda is female, what is the probability she is a banker?
\index{conditional probability}
\index{probability!conditional}
One way to answer this question is to select the subset of the population that is female and compute the fraction of that subset who are bankers.
To select a subset, we can use one Series as an index into another:
\begin{code}
prob(banker[female])
0.020788715752160108
\end{code}
The Series \py{banker[female]} contains the elements of \py{banker} for female respondents only. The result indicates that about 2.1\% of female respondents are bankers.
The mathematical notation for conditional probability is \p{A|B}, which is the probability of $A$ given that $B$ is true. In this example, $A$ is the event that Linda is a banker and $B$ is the condition that she is female, so we could write $\p{banker|female}$.
I define the following function to compute conditional probabilities.
\begin{code}
def conditional(A, B):
    return prob(A[B])
\end{code}
For this example, we would call it like this:
\begin{code}
conditional(banker, female)
0.020788715752160108
\end{code}
\section{Conjoint probability}
\label{conjoint}
{\bf Conjoint probability} is a fancy way to say the probability that
two things are true. For example, we might want to know the probability that a random person is a female banker.
\index{conjoint probability}
\index{probability!conjoint}
We can compute this probability using the \py{&} operator, which computes the logical AND of the elements in a Series; for example, \py{female & banker} is \py{True} only where \py{female} and \py{banker} are \py{True}:
\begin{code}
prob(female & banker)
0.011195736472647006
\end{code}
About 1.1\% of the respondents are female bankers.
The mathematical notation for conjoint probability is \p{A \AND B}, so we could write this example as $\p{female \AND banker}$.
As this example shows, the conjoint probability, \p{A \AND B}, is not generally the same as the conditional probability, \p{A|B}. However, they are related by the following equation:
\[ \p{A | B} = \p{A \AND B} / \p{B} \]
For example:
\[ \p{banker | female} = \p{banker \AND female} / \p{female} \]
In other words, we can compute the conditional probability that Linda is a banker, given that she is female, like this:
\begin{code}
prob(banker & female) / prob(female)
0.0207887
\end{code}
And we can confirm that we get the same result if we select the subset of female respondents and compute the fraction of bankers:
\begin{code}
prob(banker[female])
0.0207887
\end{code}
\section{Bayes's theorem}
We can also write the relationship between conditional and conjoint probability the other way around:
%
\[ \p{A \AND B} = \p{B} \p{A | B} \]
%
In other words, the probability that $A$ and $B$ are true is the product of two probabilities: the probability of $B$ and the conditional probability of
$A$ given $B$.
The AND operator is commutative, so:
%
\[ \p{A \AND B} = \p{B \AND A} \]
%
That implies that we can compute a conjoint probability either way:
%
\[ \p{B} \p{A | B} = \p{A} \p{B | A} \]
%
If you think about what that means, it is not surprising: you can check $B$ first, and then $A$ given $B$, or $A$ first, and then $B$ given $A$. You get the same thing either way.
Now, if we divide through by $\p{B}$, we get Bayes's theorem:
%
\[ \p{A | B} = \frac{\p{A} ~ \p{B | A}}{\p{B}} \]
%
We can confirm that Bayes's theorem works with the example data. If $A$ is $banker$ and $B$ is $female$, we can compute the conditional probability $\p{banker|female}$ directly:
\begin{code}
conditional(banker, female)
0.02078871575216
\end{code}
Or we can compute it using Bayes's theorem:
\begin{code}
prob(banker) * conditional(female, banker) / prob(female)
0.02078871575216
\end{code}
The results are the same.
In one sense, there is nothing special about Bayes's theorem. It's just an equation that relates conditional probabilities.
And in this example, it is not particularly useful, because it is easier to compute the probability we want directly. So let's see an example where it is more useful.
\section{The cookie problem}
\label{cookie}
\begin{quote}
Suppose there are two bowls of cookies\footnote{Based on an example from
\url{http://en.wikipedia.org/wiki/Bayes'_theorem} that is no longer
there.}. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla. What is
the probability that it came from Bowl 1?
\end{quote}
\index{Bayes's theorem}
\index{cookie problem}
The probability we want is
\p{\T{Bowl 1} \given \T{vanilla}},
but it is not obvious how to compute it. Bayes's theorem provides a way: if we can't compute $\p{A|B}$ directly, we can find it by computing:
\begin{itemize}
\item \p{A}, which is the probability of Bowl 1, unconditioned on the vanilla cookie. Since we chose the bowls at random, $\p{\T{Bowl 1}} = 1/2$.
\item \p{B|A}, which is the chance of getting a vanilla cookie from Bowl 1. From the statement of the problem, we know $\p{\T{vanilla} \given \T{Bowl 1}} = 3/4$.
\item \p{B}, which is the chance of getting a vanilla cookie, unconditioned on which bowl we chose from. In the example, the bowls have the same number of cookies, so every cookie has the same chance of being chosen. Out of 80 cookies in both bowls, 50 are vanilla, so $\p{\T{vanilla}} = 5/8$.
\end{itemize}
From Bayes's theorem we have:
%
\[ \p{\T{Bowl 1} \given \T{vanilla}} = \frac{\p{\T{Bowl 1}} ~ \p{\T{vanilla} \given \T{Bowl 1}}}{\p{\T{vanilla}}}\]
%
Plugging in the numbers, we get:
%
\[ \p{\T{Bowl 1} \given \T{vanilla}} = \frac{(1/2) ~ (3/4)}{(5/8)} = 3/5 \]
%
This example demonstrates one use of Bayes's theorem: it provides
a strategy to get from \p{B|A} to \p{A|B}. This strategy is useful
in cases, like the cookie problem, where it is easier to compute
the terms on the right side of Bayes's theorem than the term on the
left.
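The same computation can be checked in Python; this sketch uses {\tt fractions.Fraction} to keep the arithmetic exact (the variable names are mine, not part of the book's code):

```python
from fractions import Fraction

# The cookie problem by Bayes's theorem, with the numbers from the text.
prior = Fraction(1, 2)           # P(Bowl 1)
likelihood = Fraction(3, 4)      # P(vanilla | Bowl 1): 30 of 40 cookies
prob_vanilla = Fraction(5, 8)    # P(vanilla): 50 of 80 cookies

posterior = prior * likelihood / prob_vanilla
print(posterior)  # 3/5
```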
\section{The diachronic interpretation}
There is another way to think of Bayes's theorem: it gives us a
way to update the probability of a hypothesis, $H$, in light of
some body of data, $D$.
\index{diachronic interpretation}
This way of thinking about Bayes's theorem is called the
{\bf diachronic interpretation}. ``Diachronic'' means that something
is happening over time; in this case
the probability of the hypotheses changes over time as
we see new data.
Rewriting Bayes's theorem with $H$ and $D$ yields:
%
\[ \p{H|D} = \frac{\p{H}~\p{D|H}}{\p{D}} \]
%
In this interpretation, each term has a name:
\index{prior}
\index{posterior}
\index{likelihood}
\index{normalizing constant}
\begin{itemize}
\item \p{H} is the probability of the hypothesis before we see
the data, called the prior probability, or just {\bf prior}.
\item \p{H|D} is what we want to compute, the probability of
the hypothesis after we see the data, called the {\bf posterior}.
\item \p{D|H} is the probability of the data under the hypothesis,
called the {\bf likelihood}.
\item \p{D} is the probability of the data under any hypothesis,
called the {\bf total probability of the data}.
\end{itemize}
Sometimes we can compute the prior based on background
information. For example, the cookie problem specifies that we choose
a bowl at random with equal probability.
In other cases the prior is subjective; that is, reasonable people
might disagree, either because they use different background
information or because they interpret the same information
differently.
\index{subjective prior}
The likelihood is usually the easiest part to compute. In the
cookie problem, if we know which bowl the cookie came from,
we find the probability of a vanilla cookie by counting.
Computing the total probability of the data can be tricky. It is supposed to be the
probability of seeing the data under any hypothesis at all, but in the
most general case it is hard to nail down what that means.
Most often we simplify things by specifying a set of hypotheses
that are
\index{mutually exclusive}
\index{collectively exhaustive}
\begin{description}
\item[Mutually exclusive:] At most one hypothesis in the set can be true, and
\item[Collectively exhaustive:] There are no other possibilities; at least one of the hypotheses has to be true.
\end{description}
I use the word {\bf suite} for a set of hypotheses that has these properties.
\index{suite}
\index{total probability}
\index{law of total probability}
With a suite of $N$ hypotheses, numbered $H_1$ to $H_N$, we can compute \p{D} using the {\bf law of total probability}:
%
\[ \p{D} = \sum_{i=1}^N \p{H_i} ~ \p{D \given H_i} \]
%
In the cookie problem, there are only two hypotheses---the cookie
came from Bowl 1 or Bowl 2---and they are mutually exclusive and
collectively exhaustive. So we can write:
%
\begin{equation*}
\begin{split}
\p{D} = & \p{\T{Bowl 1}} ~ \p{\T{vanilla} \given \T{Bowl 1}} \\
+ & \p{\T{Bowl 2}} ~ \p{\T{vanilla} \given \T{Bowl 2}}
\end{split}
\end{equation*}
%
Plugging in the values from the cookie problem, we have
%
\[ \p{D} = (1/2)~(3/4) + (1/2)~(1/2) = 5/8 \]
%
which is what we computed earlier.
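The arithmetic above can be sketched in a few lines of Python (the variable names are mine, not an established convention):

```python
# A quick check of the total-probability arithmetic for the
# cookie problem.
priors = [1/2, 1/2]        # P(Bowl 1), P(Bowl 2)
likelihoods = [3/4, 1/2]   # P(vanilla | Bowl 1), P(vanilla | Bowl 2)

# Law of total probability: P(D) = sum over i of P(H_i) * P(D | H_i)
prob_data = sum(p * q for p, q in zip(priors, likelihoods))
print(prob_data)  # 0.625, which is 5/8
```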
However, as we'll see in the next chapter, there are easier ways to compute the total probability of the data. And sometimes you don't have to compute it at all.
\chapter{Computational probability}
\section{Bayes table}
In Section~\ref{cookie} we solved the cookie problem by hand:
\begin{quote}
Suppose there are two bowls of cookies. Bowl 1 contains 30 vanilla cookies and 10 chocolate cookies. Bowl 2 contains 20 of each.
Now suppose you choose one of the bowls at random and, without
looking, select a cookie at random. The cookie is vanilla. What is
the probability that it came from Bowl 1?
\end{quote}
Now we'll solve this problem using a {\bf Bayes table}. A Bayes table contains one row for each hypothesis, numbered $H_1$ to $H_N$. For each hypothesis it has the following columns:
\begin{itemize}
\item Prior: The prior probability, $\p{H_i}$.
\item Likelihood: The likelihood of the data under each hypothesis, $\p{D | H_i}$.
\item Unnormalized posterior: The product $\p{H_i} ~ \p{D | H_i}$.
\item Posterior: The normalized posterior $\p{H_i | D}$.
\end{itemize}
The following is the Bayes table for the cookie problem:
\begin{tabular}{|c|c|c|c|c|}
\hline
& Prior & Likelihood & Unnorm Post & Posterior \\
& $\p{H}$ & $\p{D|H}$ & $\p{H}~\p{D|H}$ & $\p{H|D}$ \\
\hline
Bowl 1 & 1/2 & 3/4 & 3/8 & 3/5 \\
Bowl 2 & 1/2 & 1/2 & 2/8 & 2/5 \\
\hline
\end{tabular}
Since we choose a bowl at random, the priors are both 1/2. The likelihoods are the conditional probabilities of the data: in Bowl 1, the probability of a vanilla cookie is 3/4; in Bowl 2 it is 1/2.
The third column, the unnormalized posteriors, contains the products of the first two columns. To say that they are ``unnormalized'' means that they do not add up to 1. But because the hypotheses are mutually exclusive and collectively exhaustive, we know that the posteriors have to add up to 1.
So we can compute the posteriors by normalizing, that is, by computing the sum of the unnormalized posteriors and dividing through.
The sum of the unnormalized posteriors is 5/8, which you might recognize as the total probability of the data. Dividing the unnormalized posteriors by 5/8 yields the posteriors, 3/5 and 2/5.
Using a Bayes table makes it easier to compute the total probability of the data. It is also straightforward to implement a Bayes table computationally using a spreadsheet or, as we'll see in the next section, a Pandas DataFrame.
\section{Computational Bayes table}
%TODO: write this
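As a sketch of where this section is headed, here is one way the Bayes table for the cookie problem might be implemented with a Pandas DataFrame; the column names are my own choice, not part of a library API:

```python
import pandas as pd

# A sketch of a Bayes table as a Pandas DataFrame,
# using the priors and likelihoods from the cookie problem.
table = pd.DataFrame(index=['Bowl 1', 'Bowl 2'])
table['prior'] = [1/2, 1/2]
table['likelihood'] = [3/4, 1/2]

# Unnormalized posterior: prior times likelihood.
table['unnorm'] = table['prior'] * table['likelihood']

# The sum of the unnormalized posteriors is the total
# probability of the data, P(D).
prob_data = table['unnorm'].sum()

# Normalize to get the posteriors.
table['posterior'] = table['unnorm'] / prob_data
print(table)
```

Running this reproduces the table above: `prob_data` is 5/8, and the posteriors are 3/5 and 2/5.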
\section{Independence}
In Section~\ref{conjoint} we defined conjoint probability as the probability that two facts, $A$ and $B$, are both true, and we wrote the general equation:
%
\[ \p{A \AND B} = \p{A} ~ \p{B|A} \]
%
If you previously learned about probability in the context of coin tosses and
dice, you might have seen the following equation:
%
\[ \p{A \AND B} = \p{A}~\p{B} \quad\quad\mbox{WARNING: not always true}\]
%
This simpler equation is sometimes true, but not always.
For example, if I toss two coins, and $A$ means the first coin lands
face up, and $B$ means the second coin lands face up, then $\p{A} =
\p{B} = 0.5$, and sure enough, $\p{A \AND B} = \p{A}~\p{B} = 0.25$.
But if $A$ means it rains on Saturday and $B$ means it rains on Sunday, I can't use the simple formula, because if it rains on Saturday, it is more likely to rain again on Sunday.
In the first example, knowing the outcome of the first coin does not affect the probability of the second coin, so $\p{B|A} = \p{B}$. In this case we say that the two outcomes are {\bf independent}.
In the second example, knowing that it rained on Saturday affects the probability of rain on Sunday, and $\p{B|A} > \p{B}$. In this case, the outcomes are {\bf dependent}.
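For the coin example, a short simulation (mine, not the book's) can confirm that the joint probability of two independent events is close to the product of their probabilities:

```python
import random

random.seed(17)  # arbitrary seed, for reproducibility
n = 100_000

# Toss two fair coins n times; A and B are the events that the
# first and second coin land heads.
count_a = count_b = count_both = 0
for _ in range(n):
    a = random.random() < 0.5
    b = random.random() < 0.5
    count_a += a
    count_b += b
    count_both += a and b

# For independent events, P(A and B) should be close to P(A) * P(B),
# which is 0.25 for two fair coins.
p_a, p_b = count_a / n, count_b / n
p_both = count_both / n
print(p_both, p_a * p_b)
```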
Knowing when events are independent is important. In the 2016 U.S. presidential election, many forecasters underestimated the probability Donald Trump would win because they underestimated the dependence between states, that is, the conditional probability that Trump would win Ohio, for example, given that he won Pennsylvania.
However, it is often difficult to know when events are dependent, or how strongly. Sometimes we make a modeling decision to treat events as independent even if we think they are not. Other times we are convinced that events are independent, and it turns out they are not.
We will see two notorious examples at the end of this chapter.
\newcommand{\MM}{M\&M}
\section{The \MM~problem}
\MM's are small candy-coated chocolates that come in a variety of
colors. Mars, Inc., which makes \MM's, changes the mixture of
colors from time to time.
\index{M and M problem}
In 1995, they introduced blue \MM's. Before then, the color mix in
a bag of plain \MM's was 30\% Brown, 20\% Yellow, 20\% Red, 10\%
Green, 10\% Orange, 10\% Tan. Afterward it was 24\% Blue, 20\%
Green, 16\% Orange, 14\% Yellow, 13\% Red, 13\% Brown.
Suppose a friend of mine has two bags of \MM's, and he tells me
that one is from 1994 and one from 1996. He won't tell me which is
which, but he gives me one \MM~from each bag. One is yellow and
one is green. What is the probability that the yellow one came
from the 1994 bag?
This problem is similar to the cookie problem, with the twist that I
draw one sample from each bowl/bag. This problem also gives me a
chance to demonstrate the table method, which is useful for solving
problems like this on paper. In the next chapter we will
solve them computationally.
\index{table method}
The first step is to enumerate the hypotheses. The bag the yellow
\MM~came from I'll call Bag 1; I'll call the other Bag 2. So
the hypotheses are:
\begin{itemize}
\item A: Bag 1 is from 1994, which implies that Bag 2 is from 1996.
\item B: Bag 1 is from 1996 and Bag 2 from 1994.
\end{itemize}
Now we construct a table with a row for each hypothesis and a
column for each term in Bayes's theorem:
\begin{tabular}{|c|c|c|c|c|}
\hline
& Prior & Likelihood & & Posterior \\
& \p{H} & \p{D|H} & \p{H}~\p{D|H} & \p{H|D} \\
\hline
A & 1/2 & (20)(20) & 200 & 20/27 \\
B & 1/2 & (14)(10) & 70 & 7/27 \\
\hline
\end{tabular}
The first column has the priors.
Based on the statement of the problem,
it is reasonable to choose $\p{A} = \p{B} = 1/2$.
The second column has the likelihoods, which follow from the
information in the problem. For example, if $A$ is true, the yellow
\MM~came from the 1994 bag with probability 20\%, and the green came
from the 1996 bag with probability 20\%. If $B$ is true, the yellow
\MM~came from the 1996 bag with probability 14\%, and the green came
from the 1994 bag with probability 10\%.
Because the selections are
independent, we get the conjoint probability by multiplying.
\index{independence}
The third column is just the product of the previous two.
The sum of this column, 270, is the normalizing constant.
To get the last column, which contains the posteriors, we divide
the third column by the normalizing constant.
That's it. Simple, right?
Well, you might be bothered by one detail. I write \p{D|H}
in terms of percentages, not probabilities, which means it
is off by a factor of 10,000. But that
cancels out when we divide through by the normalizing constant, so
it doesn't affect the result.
\index{normalizing constant}
When the set of hypotheses is mutually exclusive and collectively
exhaustive, you can multiply the likelihoods by any factor, if it is
convenient, as long as you apply the same factor to the entire column.
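The table method translates directly into code. Here is a sketch (the variable names are mine) that uses the percentages as unnormalized likelihoods, letting the constant factor cancel:

```python
# Hypotheses for the M&M problem:
#   A = Bag 1 is from 1994, B = Bag 1 is from 1996.
priors = {'A': 1/2, 'B': 1/2}

# Likelihoods in percentage points; the factor of 10,000
# cancels when we normalize.
likelihoods = {'A': 20 * 20,   # yellow from 1994, green from 1996
               'B': 14 * 10}   # yellow from 1996, green from 1994

unnorm = {h: priors[h] * likelihoods[h] for h in priors}
norm_const = sum(unnorm.values())          # 270
posteriors = {h: unnorm[h] / norm_const for h in unnorm}
print(posteriors['A'])  # 20/27, about 0.74
```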
\section{The Monty Hall problem}
The Monty Hall problem might be the most contentious question in
the history of probability. The scenario is simple, but the correct
answer is so counterintuitive that many people just can't accept
it, and many smart people have embarrassed themselves not just by
getting it wrong but by arguing the wrong side, aggressively,