16_bayes.tex

\chapter{Bayes analysis}

\section{Introduction to Bayesian analysis}

Bayesian analysis is a form of statistical inference
relying on Baye's rule. The general version of Baye's rule states
that 
$$
f(y | x) = f(x | y) f(y) / f(x)
$$
where we're using $f$ (loosely) as the appropriate density, mass function
or probability and $x$ and $y$ represent random variables or events. 

In the context of Bayesian analysis, Baye's rule is used in the following
way. Let ${\cal L}(\theta; y)$ be the likelihood associated with data, $y$,
and  parameter $\theta$. We codify our prior knowledge about $\theta$ with
a prior distribution, $\pi(\theta)$. Then, a Bayesian analysis is performed
via the posterior distribution 
$$
\pi(\theta ~|~ y) = f(y ~|~ \theta) \pi(\theta) / f(y) 
\propto_\theta f(y ~|~ \theta) \pi(\theta) \propto_\theta {\cal L}(\theta; y) \pi(\theta).
$$
Therefore, one obtains the posterior, up to multiplicative constants, 
by multiplying the likelihood times the prior. 

Coupled with Bayesian analysis is Bayesian interpretation of the probabilities.
The prior is viewed a belief and the posterior is then an updated belief
coupling the objective evidence (the likelihood) with the subjective 
belief (the prior). By viewing probabilities as personal quantifications
of beliefs, a Bayesian can talk about the probability of things that frequentists
cant. So, for example, if I roll a die and don't show you the result, you as a Bayesian
can say that the probability that this specific roll is a six is one sixth. You as a frequentist,
in contrast, must say that in one sixth of repetitions of this experiment, the result
will be a six. To a frequentist, this specific roll is either six or not. 

This distinction in probabilistic interpretation 
has consequences in statistical interpretations. For example in
diagnostic tests,  a Bayesian can talk about the probability that a person has a disease, whereas
a frequency interpretation relies on the percentage  of diseased people in a 
population of those similar. 

Personally, I've never minded either interpretation,
but to many, the Bayesian interpretation seems more natural. In contrast, many
practitioners dislike Bayesian analysis because of the prior specification, and
the heavy reliance on fully specified models. 

It should be noted that the discussion up to this point contrasted classical
frequency thinking with classical subjective Bayesian thinking. In fact, most modern
applied statisticians use hybrid approaches. They might, for example, develop a
procedure with Bayesian tools (the manipulation of conditional distributions with 
priors on the parameters), but evaluate the procedure using frequency error rates. 
For all intents and purposes, such a procedure is frequentist, just developed
with a Bayesian mindset. In contrast, many frequency statistical practitioners 
interpret their results with an approximate Bayesian mindset. Such procedures
are simply Bayesian without the formalism. Even between these approaches there's
continuous shades of gray. Therefore, saying a modern
statistician is either Bayesian or frequentist is usually misleading, unless
that person does research or writes on statistical foundations. Nonetheless, foundational
thinking is useful for understanding and clarifying thinking. It's worth then
reading and internatlizing the literature on foundations for this reason alone. It's a lot like
working on core drills to get better at a sport. That and it's quite a bit of fun! Some of my favorite modern writers on the topic include (heavily
emphasizing people I know pretty well or have run into recently):
Jim Berger, Nancy Reid, 
Richard Royall, Deborah Mayo, David Cox, Charles Rohde, Andrew Gelman,
Larry Wasserman
and  Jeff Blume. Their work will point to many others (Basu, Birnbaum, De Finetti, Lindley all also come to mind).


Finally, it should also be noted that Bayes versus frequency is far from the only
schism in statistical foundations. Personally, I find the distinction between direct
use of the design in frequency analysis to obtain robustness, like is often done is randomization testing and survey sampling,  versus fully specified 
modeling a larger distinction than
how one uses the model (Bayes versus frequentist). In addition, causal analysis
versus association (non-causal) analysis forms a large distinction and one can perform Bayes or
frequentists causal analysis and non-causal analyses. Furthermore, the likelihood
paradigm \citep{royall1997statistical} offers a third inferential 
technique given a model over Bayes and frequency interpretations.


\section{Basic Bayesian models}
\subsection{Binomial}
We'll begin our discussion of Bayesian models by using some count outcome cases to build 
intuition. First, consider a series of coin flips, $X_1, \ldots, X_n \sim \mbox{Bernoulli}(\theta)$.
The likelihood associated with this experiment is
$$
{\cal L}(\theta) \propto \theta^{\sum_i x_i} (1 - \theta)^{n-\sum_i x_i} = \theta^x (1 - \theta)^{n-x}
$$
where $x = \sum_i x_i$. Notice the likelihood depends only on the total number of successes.
Consider putting a Beta$(\alpha, \beta)$ prior on $\theta$. The the posterior is 
$$
\pi(\theta ~|~ x) \propto_\theta {\cal L}(\theta) \times \pi(\theta) \propto \theta^{x + \alpha - 1} (1 - \theta)^{n - x + \beta - 1}
$$
therefore the posterior distribution is Beta$(x + \alpha, n - x + \beta)$. The posterior mean is
$$
E[\theta ~|~ x] = \frac{x + \alpha}{n + \alpha + \beta} 
= \delta \hat p + (1 - \delta) \frac{\alpha}{\alpha + \beta}
$$
Therefore, the posterior mean is a weighted average of the MLE ($\hat p$) and the prior
mean $\frac{\alpha}{\alpha + \beta}$. The weight is 
$$
\delta = \frac{n}{n + \alpha + \beta}.
$$
Notice that, as $n \rightarrow \infty$ for fixed $\alpha$ and $\beta$, $\delta \rightarrow 1$ and
the MLE
dominates. That is, as we collect more data, the prior becomes less relevant
and the data, in the form of the likelihood, dominates. On the other hand,
for fixed $n$, as either $\alpha$ or $\beta$ go to infinity (or both), the prior
dominates ($\delta \rightarrow 0$). For the Beta distribution $\alpha$ or $\beta$ going to infinity makes
the distribution much more peaked. Thus, if we are more certain of our prior
distribution, the data matters less. 


\subsection{Poisson}
Let $X \sim \mbox{Poisson}(t\lambda)$. Then 
$$
{\cal L}(\lambda) \propto \lambda^x e^{-t \lambda}.
$$
Consider putting a Gamma($\alpha, \tau^{-1}$) prior on $\lambda$. Then
we have that
$$
\pi(\lambda ~|~ x) \propto \lambda^{x + \alpha - 1} e^{-\lambda (t + \tau)} 
$$
and thus the posterior is Gamma$(x + \alpha, (t + \tau)^{-1})$. Because of
the inversion of the second scale parameter of the Gamma, often Bayesians
specify it in the terms of the inverse (as in Gamma($x + \alpha, t + \tau$)). 
Often to avoid confusion, the mean of the gamma will be given to ensure no
confusion over the parameterization.

The posterior mean is:
$$
E[ \lambda ~|~ x]  = \frac{x + \alpha}{t + \tau}
= \delta \hat \lambda + (1 - \delta) \frac{\alpha}{\tau}
$$
where $\hat \lambda = x / t$ is the MLE (the observed rate) and
$\alpha / \tau$ is the prior estimate. In this case 
$$
\delta = \frac{t}{t + \tau}
$$
so that as $t \rightarrow \infty$ the MLE dominates while the prior
dominates as $\tau \rightarrow \infty$. 


\section{Bayesian Linear Models}
\subsection{Bayesian Linear Models}

Recall our standard Gaussian linear model, where $\bY ~|~ \bX, \bbeta, \sigma^2
\sim N(\bX \bbeta, \sigma^2 \bI)$. Consider three common prior specifications:

\begin{enumerate}
\item $\bbeta ~|~ \sigma^2 \sim N(\bbeta_0, \sigma^2 \bSigma_0)$ and $\sigma^{-2} \sim \mbox{Gamma}(\alpha_0, \tau^{-1}_0)$.
\item $\bbeta \sim N(\bbeta_0, \bSigma_0)$ and $\sigma^{-2} \sim \mbox{Gamma}(\alpha_0, \tau^{-1}_0)$.
\item $(\bbeta, \sigma^2) \sim \sigma^{-2}$.
\end{enumerate}
The final prior specification is not a proper density. It doesn't have a defined integral for
the elements of $\bbeta$ from $-\infty$ to $\infty$ and for $0\leq \sigma^2 < \infty$. However,
proceeding as if it were a proper density yields a proper distribution for the posterior. 
Such ``improper'' priors are often used to specify putatively uninformative distributions that
yield valid posteriors. In this case, the posterior has the property of the posterior mode
being centered around $\hat \bbeta$. 

The distinction between the first case and the second is the inclusion of $\sigma^2$ in
the prior specification for $\bbeta$. This is useful for making all posterior distributions
tractable, including that of $\bbeta$ integrated over $\sigma^2$. However, it may or may
not reflect the desired prior distribution. 

The second specification has tractable full conditionals. That is, we can easily figure out
$\bbeta ~|~ \sigma^2, \by, \bX$ and $\sigma^2 ~|~ \bbeta, \by, \bX$. However, the posterior
marginals of the parameters ($\bbeta ~|~ \by, \bX$ in particular) are not tractable. This posterior
is often explored using Monte Carlo.

The third specification is also completely tractable. 


\section{Monte Carlo sampling}

Even though many of our Bayesian models are completely tractable, we will explore the posteriors
via Monte Carlo. The reason for this is to get students familiar with Monte Carlo so that they
can apply it in the more complex settings that they are likely to encounter in practice. 
Specifically, usually fully tractable posteriors are more of an exception than the rule. For the
most part, for linear models, one should use the fully tractable results as they are much faster.

We now give some strategies for Monte Carlo sampling from a posterior.

\subsection{Sequential Monte Carlo}

Notice for three variables, $X$, $Y$ and $Z$, sampling $f_z(z)$, $f_{y|z}(y)$ and $f_{x|y,z}(x ~|~ y, z)$ yields a multivariate draw from the joint distribution $f(x, y, z)$. So, for example, consider
setting 3 of our prior specifications
$$
\bbeta ~|~ \sigma^2, \by, \bX \sim 
N(\hat \bbeta, \bX^t\bX \sigma^2) ~~\mbox{and}~~
\sigma^{-2} ~|~ \by, \bX \sim \mbox{Gamma}\{(n - p) / 2, 2 / (n - p) S^2\}
$$
Notice that $E[\sigma^{-2} ~|~ \bX, \by] = 1 / S^2$. To simualte from this distribution,
we first simulate from $\sigma^{-2} ~|~ \bX, \by$ 
then plug that simulation into $\bbeta ~|~ \sigma^2, \by, \bX$
and simulate an $\bbeta$. The pair is a draw from the joint posterior 
distribution of $\bbeta$ and $\sigma^{-2}$.

\subsection{Gibbs sampling}

Consider again our three random variables. Suppose that an initial value of $x$ and $y$, say $x^(1)$ and $y^{(1)}$ was obtained. Then consider simulating 
\begin{enumerate}
\item $z^{(1)} \sim f_{z|x,y}(z ~|~ x^{(1)}, y^{(1)})$
\item $x^{(2)} \sim f_{x|y,z}(x ~|~ y^{(1)},z^{(2)})$
\item $y^{(2)} \sim f_{y|x,z}(y ~|~ x^{(2)}, z^{(2)})$
\item $z^{(2)} \sim f_{z|x,y}(z ~|~ x^{(2)}, y^{(2)})$
\item $x^{(3)} \sim f_{x|y,z}(x ~|~ y^{(2)}, z^{(3)})$
\end{enumerate}
and so on. In other words, always update a simulated variable using the most recently simulated
version of the other variables. In fact, one need not use the full conditionals. Any collection
of conditionals would work. Moreover, any random order works, or even randomizing the order
each iteration. However, some conditions have to be met for the asymptotics of the 
sampler to work. For example, you have to update every variable infinitely often and the
whole space has to be explorable by the sampler. If the conditions are met, the sampler
is a Markov chain whose stationary distribution, i.e. the limiting distribution, is 
$f(x, y, z)$. Moreover, there's lots of results saying that you can use the ouput of the
sampler in much the same way one uses iid samples. For example approximating posterior
means with the average of the simulated variables. However, the Markovian nature of the sampler
makes using the samples a little trickier. One could try to combat this by running the chain for a while and throwing out all of the early simulations used to ``burn in'' the sampler. This throws away
a lot of data. Our preferred method is to use good starting values (why not start at the MLE?)
and use all of the simulated data. 

Let's illustrate the sampler with prior specification prior 2. Consider the
simplified model:
$$
\bY ~|~ \bbeta, \bX, \theta \sim N(\bX \bbeta, \theta^{-1}\bI)
~~\mbox{and}~~
\bbeta \sim N(\bzero, \psi^{-1} \bI)
~~\mbox{and}~~
\theta \sim \mbox{Gamma}(\alpha / 2, \tau^{-1}.
$$
Under this specification, the full conditionals are:

\begin{align*}
\bbeta & ~|~  \by, \bX, \theta \sim N\{(\xtx \theta + \psi \bI)^{-1} \bX^t\by, (\xtx \theta + \psi \bI)^{-1}\} \\
\theta & ~|~ \by, \bX, \bbeta \sim \mbox{Gamma}\{(n + \alpha) / 2, 2(||\by - \bX \bbeta||^2 + \tau)^{-1} \}
\end{align*}


\begin{verbatim}
data("mtcars")
y = mtcars$mpg - mean(mtcars$mpg)
x = cbind(1, mtcars$wt - mean(mtcars$wt))

n = length(y)
p = ncol(x)

fitML  = lm(y ~ x - 1)


xtx = t(x) %*% x
xty = as.vector(t(x) %*% y)

nosim = 10000

rmvnorm = function(mu, Sigma) as.vector(mu + chol(Sigma) %*% rnorm(length(mu)))

thetaCurrent = 1 / summary(fitML)$sigma^2
beta = NULL
theta = thetaCurrent * 100
psi = .01
alpha = .01
tau = .01 * summary(fitML)$sigma^2
for (i in 1 : nosim){
  V = solve(xtx * thetaCurrent + psi * diag(1, p, p))  
  mu = V %*% xty * thetaCurrent
  betaCurrent = rmvnorm(mu, V)
  sumesq = sum((y - x %*% betaCurrent)^2)
  thetaCurrent = rgamma(1, (n + alpha) / 2, rate = (sumesq + tau)/2)
  theta = c(theta, thetaCurrent)
  beta = rbind(beta, betaCurrent)
}
sigma = sqrt(1/ theta)
quantile(beta[,1], c(.025, .975))
quantile(beta2[,2], c(.025, .975))
quantile(sigma, c(0.025, .975))
\end{verbatim}