```{r include=FALSE}
library(ggplot2)
```
## Continuous Variables and Eliciting Probability Distributions
In this section, we introduce continuous random variables and show how to elicit probability distributions, moving from a prior belief to a posterior distribution using the Bayesian framework.

### From the Discrete to the Continuous
This section leads the reader from discrete random variables to continuous random variables. Let's start with a binomial random variable, such as the number of heads in ten coin tosses, which can only take a discrete set of values: 0, 1, 2, up to 10.
When the probability of a coin landing heads is $p$, the chance of getting $k$ heads in $n$ tosses is
$$P(X = k) = \left( \begin{array}{c} n \\ k \end{array} \right) p^k (1-p)^{n-k}.$$
This formula is called the **probability mass function** (pmf) for the binomial.
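
To connect this formula to computation, here is a minimal check in R; the choice of $n = 10$ tosses of a fair coin ($p = 0.5$) is just an arbitrary illustration. The pmf evaluated from the formula matches R's built-in `dbinom()`, and it sums to one over all possible values of $k$.

```{r}
# Evaluate the binomial pmf from the formula and compare with dbinom()
n <- 10; p <- 0.5; k <- 0:n                       # arbitrary illustrative values
manual <- choose(n, k) * p^k * (1 - p)^(n - k)
all.equal(manual, dbinom(k, size = n, prob = p))  # TRUE: formula matches dbinom()
sum(manual)                                       # the pmf sums to 1 over all k
```
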
The probability mass function can be visualized as a histogram in Figure \@ref(fig:histogram). The area under the histogram is one, and the area of each bar is the probability that the binomial random variable takes the value at the center of the bar's base.
```{r histogram, fig.height=3, fig.width=3, fig.align='center', fig.cap="Histogram of binomial random variable", echo=FALSE, message=FALSE, warning = F}
library(ggthemes)
data = data.frame(trials = c(0,1,1,1,2,2,2,3))
ggplot(data, aes(x = trials)) +
  geom_histogram(aes(y = ..count.. / sum(..count..))) +
xlab("") + ylab("") + theme_tufte()
```
In contrast, the normal distribution, a.k.a. the Gaussian distribution or the bell-shaped curve, can take any numerical value in $(-\infty,+\infty)$. A random variable generated from a normal distribution is continuous because it can take a continuum of values.
In general, if the set of possible values a random variable can take consists of separated points, it is a discrete random variable. But if it can take any value in some (possibly infinite) interval, then it is a continuous random variable.
When the random variable is **discrete**, it has a **probability mass function** or pmf. The pmf tells us the probability that the random variable takes each of its possible values. But when the random variable is continuous, it has probability zero of taking any single value. (Hence probability zero does not mean impossible: an event of probability zero can still happen.)
We can only talk about the probability of a continuous random variable lying within some interval. For example, suppose that heights are approximately normally distributed. The probability of finding someone who is exactly 6 feet and 0.0000 inches tall (with an infinite number of 0s after the decimal point) is 0. But we can easily calculate the probability of finding someone who is between 5'11" tall and 6'1" tall.
A **continuous** random variable has a **probability density function** or pdf, instead of a probability mass function. The probability of finding someone whose height lies between 5'11" (71 inches) and 6'1" (73 inches) is the area under the pdf curve for height between those two values, as shown in the blue area of Figure \@ref(fig:pdf-auc).<!--^[Code reference: http://www.statmethods.net/advgraphs/probability.html]-->
```{r pdf-auc, out.width = '70%', fig.align='center', fig.cap="Area under curve for the probability density function", echo=FALSE, warning=F, message=F}
library(ggplot2)
library(ggthemes)
# 5'11"=71 inches; 6'1"=73 inches
x = seq(from = 62, to = 82, by = 0.001)
hx = dnorm(x,mean = 72, sd = 2)
#plot(x,hx,type="n",xlab="Height (inches)",ylab="Density")
#lb=+71; ub=+73
#ii <- x >= lb & x <= ub
#lines(x, hx)
#polygon(c(lb,x[ii],ub), c(0,hx[ii],0), col="blue")
# New code
mydata = data.frame(x = x, y = hx)
shade = rbind(c(71, 0), subset(mydata, x > 71 & x < 73),
c(73, 0))
ytop1 = dnorm(71, mean = 72, sd = 2)
ytop2 = dnorm(73, mean = 72, sd = 2)
p = qplot(x = mydata$x, y = mydata$y, geom = "line", col = I("black"))
p = p + geom_segment(aes(x = 71, y = 0, xend = 71, yend = ytop1), color = "dodgerblue3") +
geom_segment(aes(x = 73, y = 0, xend = 73, yend = ytop2), color = "dodgerblue3") +
geom_polygon(data = shade, aes(x = x, y = y), fill = "dodgerblue3") +
xlab("Height (inches)") + ylab("Probability Density") + theme_tufte()
print(p)
```
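
We can compute this shaded area directly in R. The sketch below assumes, as in the figure, that heights follow a normal distribution with mean 72 inches and standard deviation 2 inches.

```{r}
# P(71 < height < 73) under the N(72, 2^2) model used in the figure
pnorm(73, mean = 72, sd = 2) - pnorm(71, mean = 72, sd = 2)
```
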
For example, a normal distribution with mean $\mu$ and standard deviation $\sigma$ (i.e., variance $\sigma^2$) is defined as
$$f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp[-\frac{1}{2\sigma^2}(x-\mu)^2],$$
where $x$ is any value the random variable $X$ can take. This is denoted as $X \sim N(\mu,\sigma^2)$, where $\mu$ and $\sigma^2$ are the parameters of the normal distribution.
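
As a quick sanity check of this formula, we can compare a direct evaluation of the density with R's `dnorm()`; the particular values of $\mu$, $\sigma$, and $x$ below are arbitrary.

```{r}
mu <- 72; sigma <- 2                                # arbitrary illustrative parameters
x <- seq(66, 78, by = 0.5)
manual <- 1 / sqrt(2 * pi * sigma^2) * exp(-(x - mu)^2 / (2 * sigma^2))
all.equal(manual, dnorm(x, mean = mu, sd = sigma))  # TRUE: formula matches dnorm()
```
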
Recall that a probability mass function assigns the probability that a random variable takes a specific value for the discrete set of possible values. The sum of those probabilities over all possible values must equal one.
Similarly, a probability density function is any $f(x)$ that is non-negative and has area one underneath its curve. The pdf can be regarded as the limit of histograms made from its sample data. As the sample size becomes infinitely large, the bin width of the histogram shrinks to zero.
There are an infinite number of pmf's and an infinite number of pdf's. Some distributions are so important that they have been given names:

* Continuous: normal, uniform, beta, gamma
* Discrete: binomial, Poisson

Here is a summary of the key ideas in this section:

1. Continuous random variables exist and they can take any value within some possibly infinite range.
2. The probability that a continuous random variable takes a specific value is zero.
3. Probabilities for a continuous random variable are determined by its density function, which is non-negative and has total area one beneath it.
4. We can find the probability that a random variable lies between two values ($c$ and $d$) as the area under the density function that lies between them; a short numerical sketch follows this list.
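
To make key idea 4 concrete, here is a small sketch: the area under a normal density between two values, obtained by numerical integration, matches the interval probability computed with `pnorm()`. The normal parameters and endpoints are the arbitrary height values used earlier.

```{r}
# Key idea 4: P(c < X < d) is the area under the density between c and d
c_val <- 71; d_val <- 73
integrate(dnorm, lower = c_val, upper = d_val, mean = 72, sd = 2)$value
pnorm(d_val, mean = 72, sd = 2) - pnorm(c_val, mean = 72, sd = 2)  # same answer
```
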
### Elicitation
Next, we introduce the concept of prior elicitation in Bayesian statistics. Often, one has a belief about the distribution of one's data. You may think that your data come from a binomial distribution; in that case you typically know $n$, the number of trials, but you usually do not know $p$, the probability of success. Or you may think that your data come from a normal distribution, but you do not know the mean $\mu$ or the standard deviation $\sigma$ of the normal. Besides knowing the distribution of one's data, you may also have beliefs about the unknown $p$ in the binomial or the unknown mean $\mu$ in the normal.
Bayesians express their belief in terms of personal probabilities. These personal probabilities encapsulate everything a Bayesian knows or believes about the problem. But these beliefs must obey the laws of probability, and be consistent with everything else the Bayesian knows.
```{example, label="200percent"}
You cannot say that your probability of passing this course is 200%, no matter how confident you are. A probability value must be between zero and one. (If you still think you have a probability of 200% to pass the course, you are definitely not going to pass it.)
```
```{example, label="binomial-data"}
You may know nothing at all about the value of $p$ that generated some binomial data. For instance, you may want to make an inference about the proportion of people who would buy a new brand of toothpaste. If you have industry experience, you may have a strong belief about the value of $p$, but if you are new to the industry, you would know nothing about $p$, and any value between zero and one would seem equally likely. This personal probability is the uniform distribution, whose probability density function is flat, denoted as $\text{Unif}(0,1)$.
```
```{example, label="coin-toss"}
Suppose you are tossing a coin. Most people believe that the probability of heads is pretty close to half. They know that some coins are biased and that some coins may have two heads or two tails. They probably also know that coins are not perfectly balanced. Nonetheless, before they start to collect data by tossing the coin and counting the number of heads, their belief is that values of $p$ near 0.5 are very likely, whereas values of $p$ near 0 or 1 are very unlikely.
```
```{example, label="marriage"}
In real life, here are two ways to elicit the probability that your cousin will get married. A frequentist might go to the U.S. Census records and determine what proportion of people get married (or, better, what proportion of people of your cousin's ethnicity, education level, religion, and age cohort are married). In contrast, a Bayesian might think, "My cousin is brilliant, attractive, and fun. The probability that my cousin gets married is really high -- probably around 0.97."
```
So a Bayesian needs to express their belief about the value of $p$ through a probability distribution, and a very flexible family of distributions for this purpose is the **beta family**. A member of the beta family is specified by two parameters, $\alpha$ and $\beta$; we denote this as $p \sim \text{beta}(\alpha, \beta)$. The probability density function is
\begin{equation}
f(p) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1},
(\#eq:beta)
\end{equation}
where $0 \leq p \leq 1$, $\alpha>0$, $\beta>0$, and $\Gamma$ is the gamma function, which generalizes the factorial: for a positive integer $n$,
$$\Gamma(n) = (n-1)! = (n-1) \times (n-2) \times \cdots \times 1.$$
When $\alpha=\beta=1$, the beta distribution becomes the uniform distribution, i.e., the probability density function is a flat line. In other words, the uniform distribution is a special case of the beta family.
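
Both of these facts are easy to verify numerically; here is a minimal sketch in base R.

```{r}
gamma(5) == factorial(4)              # Gamma(n) = (n-1)! for a positive integer n
p <- seq(0, 1, by = 0.1)
all.equal(dbeta(p, 1, 1), dunif(p))   # beta(1, 1) has the same density as Unif(0, 1)
```
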
The expected value of $p$ is $\frac{\alpha}{\alpha+\beta}$, so $\alpha$ can be regarded as the prior number of successes, and $\beta$ the prior number of failures. When $\alpha=\beta$, then one gets a symmetrical pdf around 0.5. For large but equal values of $\alpha$ and $\beta$, the area under the beta probability density near 0.5 is very large. Figure \@ref(fig:beta) compares the beta distribution with different parameter values.
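
For example, with the arbitrary choice $\alpha = 2$ and $\beta = 5$, the prior mean $\alpha/(\alpha+\beta)$ can be checked against a Monte Carlo sample drawn from the prior.

```{r}
a <- 2; b <- 5                             # an arbitrary beta(2, 5) prior
a / (a + b)                                # exact prior mean of p
mean(rbeta(1e5, shape1 = a, shape2 = b))   # simulation-based check; close to the exact mean
```
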
```{r beta, out.width = '80%', fig.align="center", fig.cap="Beta family", warning = F, message = F, echo=FALSE}
library(ggthemes)
p.range = seq(from=0, to=1, by=0.01)
beta0 = dbeta(p.range, 1, 1)
beta1 = dbeta(p.range, 0.5, 0.5)
beta2 = dbeta(p.range, 5, 1)
beta3 = dbeta(p.range, 1, 3)
beta4 = dbeta(p.range, 2, 2)
beta5 = dbeta(p.range, 2, 5)
#texts = c("alpha=beta=1", "alpha=beta=0.5", "alpha=5, beta=1",
# "alpha=1, beta=3", "alpha=2, beta=2", "alpha=2, beta=5")
colors = c("brown","red","green","blue","pink","black")
#plot(p.range, beta1, type="l", col=colors[2],
# xlab="p", ylab="probabilty density")
# lines(p.range, beta0, type="l", col=colors[1])
# lines(p.range, beta2, type="l", col=colors[3])
# lines(p.range, beta3, type="l", col=colors[4])
# lines(p.range, beta4, type="l", col=colors[5])
# lines(p.range, beta5, type="l", col=colors[6])
# legend("top", texts, lty=rep(c(1),6), col = colors)
# New code
my_beta = data.frame(x = p.range, beta0 = beta0, beta1 = beta1, beta2 = beta2,
beta3 = beta3, beta4 = beta4, beta5 = beta5)
ggplot(data = my_beta) + geom_line(aes(x = x, y = beta0, col = "a")) +
geom_line(aes(x = x, y = beta1, col = "b")) +
geom_line(aes(x = x, y = beta2, col = "c")) +
geom_line(aes(x = x, y = beta3, col = "d")) +
geom_line(aes(x = x, y = beta4, col = "e")) +
geom_line(aes(x = x, y = beta5, col = "f")) +
scale_color_manual(values = colors,
labels = c(bquote(paste(alpha, " = ", 1, ", ", beta, " = ", 1)),
bquote(paste(alpha, " = ", 0.5, ", ", beta, " = ", 0.5)),
bquote(paste(alpha, " = ", 5, ", ", beta, " = ", 1)),
bquote(paste(alpha, " = ", 1, ", ", beta, " = ", 3)),
bquote(paste(alpha, " = ", 2, ", ", beta, " = ", 2)),
bquote(paste(alpha, " = ", 2, ", ", beta, " = ", 5))),
name = "Beta Distributions") +
ylab("Probability Density") + xlab("p") + theme_tufte()
#+ theme(legend.position = c(1, 1), legend.justification = c(1.2, 1.1))
```
These kinds of priors are probably appropriate if you want to infer the probability of getting heads in a coin toss. The beta family also includes skewed densities, which are appropriate if you think that $p$, the probability of success in the binomial trial, is close to zero or one.
Bayes' rule is a machine to turn one's prior beliefs into posterior beliefs. With binomial data, you start with whatever beliefs you may have about $p$, then you observe data in the form of the number of heads, say 20 tosses of a coin with 15 heads.
Next, Bayes' rule tells you how the data changes your opinion about $p$. The same principle applies to all other inferences. You start with your prior probability distribution over some parameter, then you use data to update that distribution to become the posterior distribution that expresses your new belief.
These rules ensure that the change in distributions from prior to posterior is the uniquely rational solution. So, as long as you begin with the prior distribution that reflects your true opinion, you can hardly go wrong.
However, expressing that prior can be difficult. There are proofs and methods whereby a rational and coherent thinker can self-elicit their true prior distribution, but these are impractical, and people are rarely rational and coherent.
The good news is that, under a few simple conditions, no matter what prior distribution you choose, if enough data are observed, you will converge to an accurate posterior distribution. So two Bayesians, say the Reverend Thomas Bayes and the agnostic I. J. Good, can start with different priors but observe the same data. As the amount of data increases, they will converge to the same posterior distribution.
Here is a summary of the key ideas in this section:

1. Bayesians express their uncertainty through probability distributions.
2. One can think about the situation and self-elicit a probability distribution that approximately reflects his/her personal probability.
3. One's personal probability should change according to Bayes' rule, as new data are observed.
4. The beta family of distributions can describe a wide range of prior beliefs.

### Conjugacy
Next, let's introduce the concept of conjugacy in Bayesian statistics.
Suppose we have the following prior beliefs about the data:

* Binomial distribution $\text{Bin}(n,p)$ with $n$ known and $p$ unknown
* Prior belief about $p$ is $\text{beta}(\alpha,\beta)$

Then we observe $x$ successes in $n$ trials, and it turns out that Bayes' rule implies that our new belief about the probability density of $p$ is also a beta distribution, but with different parameters. In mathematical terms,
\begin{equation}
p|x \sim \text{beta}(\alpha+x, \beta+n-x).
(\#eq:beta-binomial)
\end{equation}
This is an example of conjugacy. Conjugacy occurs when the **posterior distribution** is in the **same family** of probability density functions as the prior belief, but with **new parameter values**, which have been updated to reflect what we have learned from the data.
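
As a concrete sketch, take a flat $\text{beta}(1, 1)$ prior and the earlier data of 15 heads in 20 tosses; Equation \@ref(eq:beta-binomial) then gives a $\text{beta}(16, 6)$ posterior.

```{r}
# Beta-binomial updating: prior beta(a, b), data x successes in n trials
a <- 1; b <- 1                             # flat prior, i.e. Unif(0, 1)
n <- 20; x <- 15                           # the coin-toss data mentioned earlier
a_post <- a + x
b_post <- b + n - x
c(a_post, b_post)                          # posterior parameters: 16 and 6
a_post / (a_post + b_post)                 # posterior mean of p
qbeta(c(0.025, 0.975), a_post, b_post)     # central 95% posterior interval for p
```
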
Why are the beta and binomial families conjugate? Here is a mathematical explanation.
Recall the discrete form of the Bayes' rule:
$$P(A_i|B) = \frac{P(B|A_i)P(A_i)}{\sum^n_{j=1}P(B|A_j)P(A_j)}$$
However, this formula does not apply to continuous random variables, such as the $p$ which follows a beta distribution, because the denominator sums over all possible values of the random variable, and a continuous random variable has uncountably many possible values.
But the good news is that $p$ has a finite range -- it can take values **only** between 0 and 1. Hence we can perform integration, which is a generalization of summation. Bayes' rule can also be written in continuous form as:
$$\pi^*(p|x) = \frac{P(x|p)\pi(p)}{\int^1_0 P(x|p)\pi(p) dp}.$$
This is analogous to the discrete form, since the integral in the denominator is also equal to some constant, just like a summation. This constant ensures that the total area under the curve, i.e., under the posterior density function, equals 1.
Note that in the numerator, the first term, $P(x|p)$, is the data likelihood -- the probability of observing the data given a specific value of $p$. The second term, $\pi(p)$, is the probability density function that reflects the prior belief about $p$.
In the beta-binomial case, we have $P(x|p)=\text{Bin}(n,p)$ and $\pi(p)=\text{beta}(\alpha,\beta)$.
Plugging in these distributions, we get
$$\begin{aligned}
\pi^*(p|x) &= \frac{1}{\text{some number}} \times P(x|p)\pi(p) \\
&= \frac{1}{\text{some number}} [\left( \begin{array}{c} n \\ x \end{array} \right) p^x (1-p)^{n-x}] [\frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\Gamma(\beta)} p^{\alpha-1} (1-p)^{\beta-1}] \\
&= \frac{\Gamma(\alpha+\beta+n)}{\Gamma(\alpha+x)\Gamma(\beta+n-x)} \times p^{\alpha+x-1} (1-p)^{\beta+n-x-1}
\end{aligned}$$
Let $\alpha^* = \alpha + x$ and $\beta^* = \beta+n-x$, and we get
$$\pi^*(p|x) = \text{beta}(\alpha^*,\beta^*) = \text{beta}(\alpha+x, \beta+n-x),$$
same as the posterior formula in Equation \@ref(eq:beta-binomial).
We can recognize the posterior distribution from the numerator $p^{\alpha+x-1} (1-p)^{\beta+n-x-1}$. Everything else is just a constant, and that constant must take the unique value that makes the area under the curve between 0 and 1 equal to 1. So it must be the normalizing constant of the beta distribution with parameters $\alpha+x$ and $\beta+n-x$.
This is a cute trick: we can find the answer without doing the integral, simply by looking at the form of the numerator.
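
The same trick can be checked numerically: the product of the likelihood and the prior, evaluated over a grid of $p$ values, is a constant multiple of the $\text{beta}(\alpha+x, \beta+n-x)$ density. The sketch below again uses the flat prior and the 15-heads-in-20-tosses data as an arbitrary illustration.

```{r}
# Likelihood times prior is proportional to the conjugate beta posterior
a <- 1; b <- 1; n <- 20; x <- 15
p_grid <- seq(0.01, 0.99, by = 0.01)
unnormalized <- dbinom(x, size = n, prob = p_grid) * dbeta(p_grid, a, b)
posterior <- dbeta(p_grid, a + x, b + n - x)
range(unnormalized / posterior)   # a constant ratio confirms proportionality
```
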
Without conjugacy, one has to do the integral. Often, the integral is impossible to evaluate. That obstacle is the primary reason that most statistical theory in the 20th century was not Bayesian. The situation did not change until modern computing allowed researchers to compute integrals numerically.
In summary, some pairs of distributions are conjugate. If your prior is in one and your data comes from the other, then your posterior is in the same family as the prior, but with new parameters. We explored this in the context of the beta-binomial conjugate families. And we saw that conjugacy meant that we could apply the continuous version of Bayes' rule without having to do any integration.
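
As a final illustration of the earlier claim that different priors converge with enough data, here is a sketch in which two analysts with different beta priors observe the same large (hypothetical) dataset and end up with nearly identical posteriors.

```{r}
# Same large dataset, two different priors, nearly identical posteriors
n <- 1000; x <- 700                          # hypothetical data: 700 successes in 1000 trials
qbeta(c(0.025, 0.975), 1 + x, 1 + n - x)     # 95% posterior interval from a beta(1, 1) prior
qbeta(c(0.025, 0.975), 5 + x, 5 + n - x)     # 95% posterior interval from a beta(5, 5) prior
```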