-
Notifications
You must be signed in to change notification settings - Fork 49
/
16_bayes.tex
279 lines (237 loc) · 13.3 KB
/
16_bayes.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
\chapter{Bayes analysis}
\section{Introduction to Bayesian analysis}
Bayesian analysis is a form of statistical inference
relying on Baye's rule. The general version of Baye's rule states
that
$$
f(y | x) = f(x | y) f(y) / f(x)
$$
where we're using $f$ (loosely) as the appropriate density, mass function
or probability and $x$ and $y$ represent random variables or events.
In the context of Bayesian analysis, Baye's rule is used in the following
way. Let ${\cal L}(\theta; y)$ be the likelihood associated with data, $y$,
and parameter $\theta$. We codify our prior knowledge about $\theta$ with
a prior distribution, $\pi(\theta)$. Then, a Bayesian analysis is performed
via the posterior distribution
$$
\pi(\theta ~|~ y) = f(y ~|~ \theta) \pi(\theta) / f(y)
\propto_\theta f(y ~|~ \theta) \pi(\theta) \propto_\theta {\cal L}(\theta; y) \pi(\theta).
$$
Therefore, one obtains the posterior, up to multiplicative constants,
by multiplying the likelihood times the prior.
Coupled with Bayesian analysis is Bayesian interpretation of the probabilities.
The prior is viewed a belief and the posterior is then an updated belief
coupling the objective evidence (the likelihood) with the subjective
belief (the prior). By viewing probabilities as personal quantifications
of beliefs, a Bayesian can talk about the probability of things that frequentists
cant. So, for example, if I roll a die and don't show you the result, you as a Bayesian
can say that the probability that this specific roll is a six is one sixth. You as a frequentist,
in contrast, must say that in one sixth of repetitions of this experiment, the result
will be a six. To a frequentist, this specific roll is either six or not.
This distinction in probabilistic interpretation
has consequences in statistical interpretations. For example in
diagnostic tests, a Bayesian can talk about the probability that a person has a disease, whereas
a frequency interpretation relies on the percentage of diseased people in a
population of those similar.
Personally, I've never minded either interpretation,
but to many, the Bayesian interpretation seems more natural. In contrast, many
practitioners dislike Bayesian analysis because of the prior specification, and
the heavy reliance on fully specified models.
It should be noted that the discussion up to this point contrasted classical
frequency thinking with classical subjective Bayesian thinking. In fact, most modern
applied statisticians use hybrid approaches. They might, for example, develop a
procedure with Bayesian tools (the manipulation of conditional distributions with
priors on the parameters), but evaluate the procedure using frequency error rates.
For all intents and purposes, such a procedure is frequentist, just developed
with a Bayesian mindset. In contrast, many frequency statistical practitioners
interpret their results with an approximate Bayesian mindset. Such procedures
are simply Bayesian without the formalism. Even between these approaches there's
continuous shades of gray. Therefore, saying a modern
statistician is either Bayesian or frequentist is usually misleading, unless
that person does research or writes on statistical foundations. Nonetheless, foundational
thinking is useful for understanding and clarifying thinking. It's worth then
reading and internatlizing the literature on foundations for this reason alone. It's a lot like
working on core drills to get better at a sport. That and it's quite a bit of fun! Some of my favorite modern writers on the topic include (heavily
emphasizing people I know pretty well or have run into recently):
Jim Berger, Nancy Reid,
Richard Royall, Deborah Mayo, David Cox, Charles Rohde, Andrew Gelman,
Larry Wasserman
and Jeff Blume. Their work will point to many others (Basu, Birnbaum, De Finetti, Lindley all also come to mind).
Finally, it should also be noted that Bayes versus frequency is far from the only
schism in statistical foundations. Personally, I find the distinction between direct
use of the design in frequency analysis to obtain robustness, like is often done is randomization testing and survey sampling, versus fully specified
modeling a larger distinction than
how one uses the model (Bayes versus frequentist). In addition, causal analysis
versus association (non-causal) analysis forms a large distinction and one can perform Bayes or
frequentists causal analysis and non-causal analyses. Furthermore, the likelihood
paradigm \citep{royall1997statistical} offers a third inferential
technique given a model over Bayes and frequency interpretations.
\section{Basic Bayesian models}
\subsection{Binomial}
We'll begin our discussion of Bayesian models by using some count outcome cases to build
intuition. First, consider a series of coin flips, $X_1, \ldots, X_n \sim \mbox{Bernoulli}(\theta)$.
The likelihood associated with this experiment is
$$
{\cal L}(\theta) \propto \theta^{\sum_i x_i} (1 - \theta)^{n-\sum_i x_i} = \theta^x (1 - \theta)^{n-x}
$$
where $x = \sum_i x_i$. Notice the likelihood depends only on the total number of successes.
Consider putting a Beta$(\alpha, \beta)$ prior on $\theta$. The the posterior is
$$
\pi(\theta ~|~ x) \propto_\theta {\cal L}(\theta) \times \pi(\theta) \propto \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1}
$$
therefore the posterior distribution is Beta$(x + \alpha, n - x + \beta)$. The posterior mean is
$$
E[\theta ~|~ x] = \frac{x + \alpha}{n + \alpha + \beta}
= \delta \hat p + (1 - \delta) \frac{\alpha}{\alpha + \beta}
$$
Therefore, the posterior mean is a weighted average of the MLE ($\hat p$) and the prior
mean $\frac{\alpha}{\alpha + \beta}$. The weight is
$$
\delta = \frac{n}{n + \alpha + \beta}.
$$
Notice that, as $n \rightarrow \infty$ for fixed $\alpha$ and $\beta$, $\delta \rightarrow 1$ and
the MLE
dominates. That is, as we collect more data, the prior becomes less relevant
and the data, in the form of the likelihood, dominates. On the other hand,
for fixed $n$, as either $\alpha$ or $\beta$ go to infinity (or both), the prior
dominates ($\delta \rightarrow 0$). For the Beta distribution $\alpha$ or $\beta$ going to infinity makes
the distribution much more peaked. Thus, if we are more certain of our prior
distribution, the data matters less.
\subsection{Poisson}
Let $X \sim \mbox{Poisson}(t\lambda)$. Then
$$
{\cal L}(\lambda) \propto \lambda^x e^{-t \lambda}.
$$
Consider putting a Gamma($\alpha, \tau^{-1}$) prior on $\lambda$. Then
we have that
$$
\pi(\lambda ~|~ x) \propto \lambda^{x + \alpha - 1} e^{-\lambda (t + \tau)}
$$
and thus the posterior is Gamma$(x + \alpha, (t + \tau)^{-1})$. Because of
the inversion of the second scale parameter of the Gamma, often Bayesians
specify it in the terms of the inverse (as in Gamma($x + \alpha, t + \tau$)).
Often to avoid confusion, the mean of the gamma will be given to ensure no
confusion over the parameterization.
The posterior mean is:
$$
E[ \lambda ~|~ x] = \frac{x + \alpha}{t + \tau}
= \delta \hat \lambda + (1 - \delta) \frac{\alpha}{\tau}
$$
where $\hat \lambda = x / t$ is the MLE (the observed rate) and
$\alpha / \tau$ is the prior estimate. In this case
$$
\delta = \frac{t}{t + \tau}
$$
so that as $t \rightarrow \infty$ the MLE dominates while the prior
dominates as $\tau \rightarrow \infty$.
\section{Bayesian Linear Models}
\subsection{Bayesian Linear Models}
Recall our standard Gaussian linear model, where $\bY ~|~ \bX, \bbeta, \sigma^2
\sim N(\bX \bbeta, \sigma^2 \bI)$. Consider three common prior specifications:
\begin{enumerate}
\item $\bbeta ~|~ \sigma^2 \sim N(\bbeta_0, \sigma^2 \bSigma_0)$ and $\sigma^{-2} \sim \mbox{Gamma}(\alpha_0, \tau^{-1}_0)$.
\item $\bbeta \sim N(\bbeta_0, \bSigma_0)$ and $\sigma^{-2} \sim \mbox{Gamma}(\alpha_0, \tau^{-1}_0)$.
\item $(\bbeta, \sigma^2) \sim \sigma^{-2}$.
\end{enumerate}
The final prior specification is not a proper density. It doesn't have a defined integral for
the elements of $\bbeta$ from $-\infty$ to $\infty$ and for $0\leq \sigma^2 < \infty$. However,
proceeding as if it were a proper density yields a proper distribution for the posterior.
Such ``improper'' priors are often used to specify putatively uninformative distributions that
yield valid posteriors. In this case, the posterior has the property of the posterior mode
being centered around $\hat \bbeta$.
The distinction between the first case and the second is the inclusion of $\sigma^2$ in
the prior specification for $\bbeta$. This is useful for making all posterior distributions
tractable, including that of $\bbeta$ integrated over $\sigma^2$. However, it may or may
not reflect the desired prior distribution.
The second specification has tractable full conditionals. That is, we can easily figure out
$\bbeta ~|~ \sigma^2, \by, \bX$ and $\sigma^2 ~|~ \bbeta, \by, \bX$. However, the posterior
marginals of the parameters ($\bbeta ~|~ \by, \bX$ in particular) are not tractable. This posterior
is often explored using Monte Carlo.
The third specification is also completely tractable.
\section{Monte Carlo sampling}
Even though many of our Bayesian models are completely tractable, we will explore the posteriors
via Monte Carlo. The reason for this is to get students familiar with Monte Carlo so that they
can apply it in the more complex settings that they are likely to encounter in practice.
Specifically, usually fully tractable posteriors are more of an exception than the rule. For the
most part, for linear models, one should use the fully tractable results as they are much faster.
We now give some strategies for Monte Carlo sampling from a posterior.
\subsection{Sequential Monte Carlo}
Notice for three variables, $X$, $Y$ and $Z$, sampling $f_z(z)$, $f_{y|z}(y)$ and $f_{x|y,z}(x ~|~ y, z)$ yields a multivariate draw from the joint distribution $f(x, y, z)$. So, for example, consider
setting 3 of our prior specifications
$$
\bbeta ~|~ \sigma^2, \by, \bX \sim
N(\hat \bbeta, \bX^t\bX \sigma^2) ~~\mbox{and}~~
\sigma^{-2} ~|~ \by, \bX \sim \mbox{Gamma}\{(n - p) / 2, 2 / (n - p) S^2\}
$$
Notice that $E[\sigma^{-2} ~|~ \bX, \by] = 1 / S^2$. To simualte from this distribution,
we first simulate from $\sigma^{-2} ~|~ \bX, \by$
then plug that simulation into $\bbeta ~|~ \sigma^2, \by, \bX$
and simulate an $\bbeta$. The pair is a draw from the joint posterior
distribution of $\bbeta$ and $\sigma^{-2}$.
\subsection{Gibbs sampling}
Consider again our three random variables. Suppose that an initial value of $x$ and $y$, say $x^(1)$ and $y^{(1)}$ was obtained. Then consider simulating
\begin{enumerate}
\item $z^{(1)} \sim f_{z|x,y}(z ~|~ x^{(1)}, y^{(1)})$
\item $x^{(2)} \sim f_{x|y,z}(x ~|~ y^{(1)},z^{(2)})$
\item $y^{(2)} \sim f_{y|x,z}(y ~|~ x^{(2)}, z^{(2)})$
\item $z^{(2)} \sim f_{z|x,y}(z ~|~ x^{(2)}, y^{(2)})$
\item $x^{(3)} \sim f_{x|y,z}(x ~|~ y^{(2)}, z^{(3)})$
\end{enumerate}
and so on. In other words, always update a simulated variable using the most recently simulated
version of the other variables. In fact, one need not use the full conditionals. Any collection
of conditionals would work. Moreover, any random order works, or even randomizing the order
each iteration. However, some conditions have to be met for the asymptotics of the
sampler to work. For example, you have to update every variable infinitely often and the
whole space has to be explorable by the sampler. If the conditions are met, the sampler
is a Markov chain whose stationary distribution, i.e. the limiting distribution, is
$f(x, y, z)$. Moreover, there's lots of results saying that you can use the ouput of the
sampler in much the same way one uses iid samples. For example approximating posterior
means with the average of the simulated variables. However, the Markovian nature of the sampler
makes using the samples a little trickier. One could try to combat this by running the chain for a while and throwing out all of the early simulations used to ``burn in'' the sampler. This throws away
a lot of data. Our preferred method is to use good starting values (why not start at the MLE?)
and use all of the simulated data.
Let's illustrate the sampler with prior specification prior 2. Consider the
simplified model:
$$
\bY ~|~ \bbeta, \bX, \theta \sim N(\bX \bbeta, \theta^{-1}\bI)
~~\mbox{and}~~
\bbeta \sim N(\bzero, \psi^{-1} \bI)
~~\mbox{and}~~
\theta \sim \mbox{Gamma}(\alpha / 2, \tau^{-1}.
$$
Under this specification, the full conditionals are:
\begin{align*}
\bbeta & ~|~ \by, \bX, \theta \sim N\{(\xtx \theta + \psi \bI)^{-1} \bX^t\by, (\xtx \theta + \psi \bI)^{-1}\} \\
\theta & ~|~ \by, \bX, \bbeta \sim \mbox{Gamma}\{(n + \alpha) / 2, 2(||\by - \bX \bbeta||^2 + \tau)^{-1} \}
\end{align*}
\begin{verbatim}
data("mtcars")
y = mtcars$mpg - mean(mtcars$mpg)
x = cbind(1, mtcars$wt - mean(mtcars$wt))
n = length(y)
p = ncol(x)
fitML = lm(y ~ x - 1)
xtx = t(x) %*% x
xty = as.vector(t(x) %*% y)
nosim = 10000
rmvnorm = function(mu, Sigma) as.vector(mu + chol(Sigma) %*% rnorm(length(mu)))
thetaCurrent = 1 / summary(fitML)$sigma^2
beta = NULL
theta = thetaCurrent * 100
psi = .01
alpha = .01
tau = .01 * summary(fitML)$sigma^2
for (i in 1 : nosim){
V = solve(xtx * thetaCurrent + psi * diag(1, p, p))
mu = V %*% xty * thetaCurrent
betaCurrent = rmvnorm(mu, V)
sumesq = sum((y - x %*% betaCurrent)^2)
thetaCurrent = rgamma(1, (n + alpha) / 2, rate = (sumesq + tau)/2)
theta = c(theta, thetaCurrent)
beta = rbind(beta, betaCurrent)
}
sigma = sqrt(1/ theta)
quantile(beta[,1], c(.025, .975))
quantile(beta2[,2], c(.025, .975))
quantile(sigma, c(0.025, .975))
\end{verbatim}