# Priors
> Yeah, well, you know, that’s just, like, your opinion, man. — The Dude (*The Big Lebowski*)
## Levels of Priors
The levels of prior informativeness, roughly from least to most informative, are:
1. Flat prior
1. Vague but proper prior, e.g. $\dnorm(\cdot | 0, 10^6)$
1. Very weakly informative prior, e.g. $\dnorm(0, 10)$
1. Generic weakly informative prior: $\dnorm(0, 1)$
1. Specific informative prior
## Conjugate Priors
In a few cases, the posterior distribution,
$$
p(\theta | y) = \frac{p(y | \theta) p(\theta)}{\int p(y | \theta') p(\theta') d\theta'},
$$
has a [closed-form solution](https://en.wikipedia.org/wiki/Closed-form_expression) and can be calculated exactly, so costly numerical approximation methods are not needed.
Unfortunately, such cases are few, and most of them involve **conjugate priors**.
In the case of a conjugate prior, the posterior distribution is in the same family as the prior distribution.
Here is a diagram of a few common conjugate priors.[^conjugate]
```{r conjugate-gamma, echo=FALSE}
DiagrammeR::grViz("diagrams/conjugate_gamma.gv")
DiagrammeR::grViz("diagrams/conjugate_beta.gv")
```
[^conjugate]: Based on John Cook's [Diagram of Conjugate Prior distributions](https://www.johndcook.com/blog/conjugate_prior_diagram/).
The table in the Wikipedia page for [Conjugate priors](https://en.wikipedia.org/wiki/Conjugate_prior#cite_note-beta-interp-4) is as complete as any out there.
See @Fink1997a for a compendium of references.
Also see [Distributions] for more information about probability distributions.
### Binomial-Beta
Binomial distribution: If $N \in \Nats$ (number of trials), $\theta \in (0, 1)$ (success probability in each trial),
then for $n \in \{0, \dots, N\}$,
$$
\dBinom(n | N, \theta) = \binom{N}{n} \theta^{n} (1 - \theta)^{N - n} .
$$
Beta distribution: If $\alpha \in \RealPos$ (shape) and $\beta \in \RealPos$ (shape), then for $\theta \in (0, 1)$,
$$
\dbeta(\theta | \alpha, \beta) = \frac{1}{\mathrm{B}(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1},
$$
where $\mathrm{B}$ is the beta function,
$$
\mathrm{B}(\alpha, \beta) = \frac{\Gamma(\alpha) \Gamma(\beta)}{\Gamma(\alpha + \beta)} .
$$
Then,
$$
\begin{aligned}[t]
p(\theta | \alpha, \beta) &= \dbeta(\theta | \alpha, \beta) && \text{Beta prior} \\
p(y | \theta) &= \dBinom(y | n, \theta) && \text{Binomial likelihood} \\
p(\theta | y, \alpha, \beta) &= \dbeta(\theta | \alpha + y, \beta + n - y) && \text{Beta posterior}
\end{aligned}
$$
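As a concrete illustration, here is a minimal R sketch of the Beta-Binomial update; the prior shape parameters and the data below are hypothetical.

```{r beta-binomial-example}
# Hypothetical data: y = 7 successes in n = 20 trials, with a Beta(2, 2) prior
a <- 2; b <- 2     # prior shape parameters
n <- 20; y <- 7    # number of trials and successes

# Conjugate update: posterior is Beta(a + y, b + n - y)
a_post <- a + y
b_post <- b + n - y

# Compare prior and posterior densities over a grid of theta
theta <- seq(0, 1, length.out = 201)
plot(theta, dbeta(theta, a_post, b_post), type = "l",
     ylab = "density", main = "Beta prior (dashed) and posterior (solid)")
lines(theta, dbeta(theta, a, b), lty = 2)
```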
### Categorical-Dirichlet
The Dirichlet distribution is a multivariate generalization of the Beta distribution.
If $K \in \N$ and $\alpha \in (\R^{+})^{K}$, then for $\theta$ in the $K$-simplex, meaning $\theta_k > 0$ for all $k$ and $\sum_{k = 1}^{K} \theta_k = 1$,
$$
\ddirichlet(\theta | \alpha) = \frac{\Gamma(\sum_{k = 1}^K \alpha_k)}{\prod_{k = 1}^K \Gamma(\alpha_k)} \prod_{k = 1}^K \theta_{k}^{\alpha_k - 1}
$$
The multinomial distribution is a generalization of the binomial distribution with $K$ categories instead of 2.
If $K \in \N$, $N \in \N$, and $\theta$ is in the $K$-simplex, then for $y \in \N^{K}$ such that $\sum_{k = 1}^K y_k = N$,
$$
\dmultinom(y | \theta) = \binom{N}{y_1, \dots, y_K} \prod_{k = 1}^{K} \theta_k^{y_k},
$$
where the multinomial coefficient is defined as,
$$
\binom{N}{y_1, \dots, y_K} = \frac{N!}{\prod_{k = 1}^K y_k!}
$$
$$
\begin{aligned}[t]
p(\theta | \alpha) &= \ddirichlet(\theta | \alpha) && \text{Dirichlet prior} \\
p(y | \theta) &= \dmultinom(y | \theta) && \text{Multinomial likelihood} \\
p(\theta | y, \alpha) &= \ddirichlet(\theta | \alpha + y) && \text{Dirichlet posterior}
\end{aligned}
$$
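A minimal R sketch of the Dirichlet-Multinomial update, using hypothetical counts over $K = 3$ categories:

```{r dirichlet-multinomial-example}
# Hypothetical counts over K = 3 categories with a symmetric Dirichlet(1, 1, 1) prior
alpha <- c(1, 1, 1)   # prior concentration parameters
y <- c(12, 5, 3)      # observed category counts (N = 20 trials)

# Conjugate update: posterior is Dirichlet(alpha + y)
alpha_post <- alpha + y

# Posterior mean of each category probability
alpha_post / sum(alpha_post)
```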
### Poisson-Gamma
Let $\lambda \in \R^{+}$ be the rate parameter of the Poisson distribution. Then for $n \in \N$,
$$
\dpois(n|\lambda) = \frac{1}{n!} \lambda^n \exp(-\lambda)
$$
If $\alpha \in \R^{+}$ (shape parameter), $\beta \in \R^{+}$ (inverse scale parameter), then for $y \in \R^{+}$,
$$
\dgamma(y | \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} y^{\alpha - 1} \exp(- \beta y)
$$
Then,
$$
\begin{aligned}[t]
p(\lambda) &= \dgamma(\lambda | \alpha, \beta) \\
p(n | \lambda) &= \dpois(n | \lambda) \\
p(\lambda | n, \alpha, \beta) &= \dgamma(\lambda | \alpha + n, \beta + 1)
\end{aligned}
$$
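A minimal R sketch of the Poisson-Gamma update for a single observed count; the prior parameters and the count are hypothetical.

```{r poisson-gamma-example}
# Hypothetical: Gamma(2, 1) prior on the rate and a single observed count n = 5
alpha <- 2; beta <- 1
n <- 5

# Conjugate update for one observation: posterior is Gamma(alpha + n, beta + 1)
alpha_post <- alpha + n
beta_post <- beta + 1

# Posterior mean and 90% credible interval for lambda
alpha_post / beta_post
qgamma(c(0.05, 0.95), shape = alpha_post, rate = beta_post)
```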
### Normal with known variance
$$
\begin{aligned}[t]
p(\mu | \mu_0, \sigma_0^2) &= \dnorm(\mu | \mu_0, \sigma_0^2) && \text{Normal prior} \\
p(y | \mu) &= \dnorm(y | \mu, \sigma^2) && \text{Normal likelihood} \\
p(\mu | y, \sigma^2, \mu_0, \sigma_0^2) &= \dnorm(\mu | \tilde{\mu}, \tilde{\sigma}^2) && \text{Normal posterior} \\
\tilde{\mu} &= \tilde{\sigma}^{2} \left(\frac{\mu_0}{\sigma_0^2} + \frac{y}{\sigma^2} \right) \\
\tilde{\sigma}^2 &= \left(\frac{1}{\sigma_0^2} +\frac{1}{\sigma^2}\right)^{-1} \\
\end{aligned}
$$
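A minimal R sketch of the precision-weighted update above, with hypothetical values for the prior and a single observation:

```{r normal-normal-example}
# Hypothetical: prior mu ~ N(0, 10^2); one observation y = 3 with known sigma = 2
mu0 <- 0; sigma0 <- 10
y <- 3;  sigma <- 2

# Posterior variance and mean from the formulas above
sigma2_tilde <- 1 / (1 / sigma0^2 + 1 / sigma^2)
mu_tilde <- sigma2_tilde * (mu0 / sigma0^2 + y / sigma^2)
c(mean = mu_tilde, sd = sqrt(sigma2_tilde))
```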
### Exponential Family
Likelihood functions in the [exponential family](https://en.wikipedia.org/wiki/Exponential_family) have conjugate priors, often also in the exponential family.[^expconj]
## Improper Priors
If a parameter is given an [improper uniform prior](https://en.wikipedia.org/wiki/Prior_probability), $p(\theta) \propto 1$, then the posterior distribution is proportional to the likelihood,
$$
p(\theta | y) \propto p(y | \theta) p(\theta) \propto p(y | \theta)
$$
## Cromwell's Rule
Priors that place a probability of 0 or 1 on an event should be avoided, except where the event is excluded by logical impossibility.
If a prior places probabilities of 0 or 1 on an event, then no amount of data can update that prior.
The name, Cromwell's Rule, comes from a quotation of Oliver Cromwell:
> I beseech you, in the bowels of Christ, think it possible that you may be mistaken.
Lindley (1991) describes it as
> Leave a little probability for the moon being made of green cheese; it can be as small as 1 in a million, but have it there since otherwise an army of astronauts returning with samples of the said cheese will leave you unmoved.
If $p(\theta = x) = 0$ for some value $x$, then the posterior probability of $\theta = x$ is always zero, no matter what data are observed:
$$
p(\theta = x | y) \propto p(y | \theta = x) p(\theta = x) = 0
$$
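A small numerical illustration (with made-up numbers) of why a zero prior probability can never be revised: even data strongly consistent with $\theta = 0.5$ leave its posterior probability at exactly zero.

```{r cromwell-example}
# A discrete prior that puts zero mass on theta = 0.5
theta <- c(0.1, 0.3, 0.5, 0.7, 0.9)
prior <- c(0.25, 0.25, 0, 0.25, 0.25)

# Data strongly consistent with theta = 0.5: 50 successes in 100 trials
lik <- dbinom(50, size = 100, prob = theta)

# The posterior at theta = 0.5 remains exactly zero
posterior <- prior * lik / sum(prior * lik)
rbind(theta, posterior)
```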
## Asymptotics
As the sample size increases, the posterior distribution converges to a normal distribution centered on the true value of the parameter.
Suppose the data $y_1, \dots, y_n$ are an i.i.d. sample from the distribution $f(y)$.
Suppose that the data are modeled with a parametric family $p(y | \theta)$ and a prior distribution $p(\theta)$.
If the data distribution is included in the parametric family, meaning that there exists a $\theta_0$ such that $p(y | \theta_0) = f(y)$, then the posterior distribution is *consistent* in that it converges to the true parameter value $\theta_0$ as $n \to \infty$.
Otherwise, the posterior converges to the value of $\theta$ for which $p(y | \theta)$ is closest to the true distribution $f(y)$.
As $n \to \infty$, the likelihood dominates the prior in determining the posterior distribution.
There are cases in which the normal approximation is incorrect.
1. parameters are non-identified
1. the number of parameters increases with sample size
1. aliasing or non-identified parameters due to label switching
1. unbounded likelihoods. This can happen if variance parameters go to zero.
1. improper posterior distributions
1. prior distributions that exclude the point of convergence. See Cromwell's Rule.
1. convergence to the edge of the parameter space
1. tails of the distribution can be inaccurate even if the normal approximation converges to the correct value; e.g. the normal approximation will still place a non-zero density on negative values of a non-negative parameter.
## Proper and Improper Priors
A prior distribution $p(\theta)$ is improper when it is not a probability distribution, meaning
$$
\int p(\theta) \,d\theta = \infty .
$$
Perhaps the most common improper distribution is an unbounded uniform distribution,
$$
p(\theta) \propto 1
$$
for $-\infty < \theta < \infty$.
Improper priors are sometimes used because the posterior distribution can still be proper even when the prior is not.

| Prior    | Posterior                            |
| -------- | ------------------------------------ |
| Improper | Proper or improper (must be checked) |
| Proper   | Proper                               |
One common case of this is a linear regression model with improper priors.
$$
\begin{aligned}[t]
y &\sim \dnorm(X \beta, \sigma^{2} I) && \text{likelihood} \\
p(\beta, \sigma) &\propto 1 && \text{improper prior}
\end{aligned}
$$
If the number of observations (rows of $X$) is less than the number of
columns of $X$ (variables plus the constant), then the MLE of $\beta$ is not
uniquely defined, and the posterior distribution under the improper prior is also improper.
But if we alter that model to include proper priors,
$$
\begin{aligned}[t]
y &\sim \dnorm(X \beta, \sigma^{2} I) \\
\beta &\sim \dnorm(\mu_\beta, \Sigma_\beta) \\
\sigma &\sim p(\sigma)
\end{aligned}
$$
then we can estimate $p(\beta, \sigma | y, X)$ even if the number of
observations is less than the number of variables. Consider the extreme case in which we
observe *no* data: then the posterior distribution is equal to the prior, and
since the prior is a proper distribution, so is the posterior.
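The point can be illustrated numerically. The following R sketch uses simulated, hypothetical data with more predictors than observations, and for simplicity fixes $\sigma = 1$: least squares under a flat prior leaves some coefficients undefined, while a proper normal prior on $\beta$ yields a well-defined posterior mean.

```{r improper-prior-regression}
set.seed(1)
n <- 5; p <- 10
X <- cbind(1, matrix(rnorm(n * (p - 1)), n, p - 1))
y <- rnorm(n)

# With a flat (improper) prior, the MLE/OLS solution is not unique:
coef(lm(y ~ X - 1))   # some coefficients are NA

# With a proper N(0, 1) prior on each coefficient (and sigma = 1), the
# posterior mean is well-defined (a ridge-like formula):
sigma <- 1
Sigma_beta_inv <- diag(p)   # prior precision matrix
beta_post_mean <- solve(crossprod(X) / sigma^2 + Sigma_beta_inv,
                        crossprod(X, y) / sigma^2)
drop(beta_post_mean)
```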
However, the fact that a proper prior yields a proper posterior does not imply
that the posterior distribution is any "good"; assessing that is the role of
the evaluation step in Bayesian analysis. In the cases where an improper prior
would lead to an improper posterior, the choice of prior is especially important,
because the prior will dominate the shape of the posterior distribution.
One way of thinking about many "identification" assumptions in MLE models is
that they can loosely be considered "priors". The division between what is called
the likelihood and what is called the prior is not well defined, and the choice of
likelihood function is often both subjective and the most important part of the
analysis.
Regarding improper priors, also recall the asymptotic result above: the posterior
distribution depends increasingly on the likelihood as the sample size increases.
Stan: if no prior distribution is specified for a parameter, it is implicitly given
a uniform prior over its declared support; for an unconstrained parameter this is an
improper uniform prior on $(-\infty, +\infty)$.
## Hyperpriors and Hyperparameters
A hyperparameter is a parameter of a prior distribution.
A hyperprior is a prior distribution placed on a hyperparameter.
Consider the case of a binomial likelihood with a beta prior on the proportion parameter $\theta$.
The data are the observed number of successes $n$ and the number of trials $N$:
$$
\begin{aligned}[t]
p(n | \theta) &= \dBinom(n | N, \theta) && \text{likelihood for } n \\
p(\theta) &= \dbeta(\theta | a, b) && \text{prior on } \theta \\
\end{aligned}
$$
This is a model of the posterior distribution of $\theta$ given the data,
where the data consist of the $n$ successes, the total number of trials $N$, *and* $a$ and $b$, the assumed shape parameters of the beta prior on $\theta$.
Thus our posterior distribution is,
$$
p(\theta | n, N, a, b) .
$$
However, suppose we have no good reason to choose any particular values of $a$ and $b$ for the prior distribution. We could instead treat the shape parameters of the beta distribution as parameters themselves and assign them their own prior distributions,
$$
\begin{aligned}[t]
p(n | \theta) &= \dBinom(n | N, \theta) && \text{likelihood for } n \\
p(\theta | \alpha, \beta) &= \dbeta(\theta | \alpha, \beta) && \text{prior on } \theta \\
p(\alpha) &= \dexp(\alpha | a^*) && \text{hyperprior} \\
p(\beta) &= \dexp(\beta | b^*) && \text{hyperprior}
\end{aligned}
$$
Now the parameters of the model are $\theta$, $\alpha$, and $\beta$, and the data are $n$, $N$, $a^{*}$, and $b^{*}$.
Since the prior of one parameter in the model ($\theta$) depends on other parameters, $\alpha$ and $\beta$, we call $\alpha$ and $\beta$ hyperparameters and their priors hyperpriors.
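With hyperpriors the posterior is no longer available in closed form. Below is a rough Monte Carlo sketch (importance sampling from the prior, adequate only for illustration) of the model above, using hypothetical data and hypothetical hyperprior rates:

```{r hyperprior-example}
set.seed(1)
N <- 20; n <- 7                # hypothetical data: 7 successes in 20 trials
a_star <- 0.1; b_star <- 0.1   # hypothetical hyperprior rates

# Draw (alpha, beta, theta) from the prior and weight by the likelihood
S <- 1e5
alpha <- rexp(S, a_star)
beta  <- rexp(S, b_star)
theta <- rbeta(S, alpha, beta)
w <- dbinom(n, N, theta)

# Approximate posterior means of theta, alpha, and beta
c(theta = weighted.mean(theta, w),
  alpha = weighted.mean(alpha, w),
  beta  = weighted.mean(beta, w))
```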
## References
- The [Stan Wiki](https://github.com/stan-dev/stan/wiki/Prior-Choice-Recommendations) and the [rstanarm](https://cran.r-project.org/web/packages/rstanarm/vignettes/priors.html) vignette include comprehensive prior choice recommendations.
- @Betancourt2017a provides numerical simulations of how the shapes of weakly informative priors affect inferences.
- @Stan2016a for discussion of some types of priors in regression models
- @ChungRabe-HeskethDorieEtAl2013a discuss scale priors in penalized MLE models
- @GelmanJakulinPittauEtAl2008a discusses using Cauchy(0, 2.5) for prior distributions
- @Gelman2006a provides a prior distribution on variance parameters in hierarchical models.
- @PolsonScott2012a on using Half-Cauchy priors for scale parameters
[^expconj]: <https://en.wikipedia.org/wiki/Exponential_family#Bayesian_estimation:_conjugate_distributions>