# Shrinkage and Hierarchical Models
```{r setup,message=FALSE}
library("tidyverse")
library("rstan")
library("loo")
```
## Hierarchical Models
- *Hierarchical models:* often groups of parameters, $\{\theta_1, \dots, \theta_J\}$, are related.
- E.g. parameters for countries, states, counties, or years, or even the regression coefficients $\beta_1, \dots, \beta_k$ seen in the [Shrinkage and Regularization] chapter.
- We can treat those $\theta_j$ as drawn from a common *population distribution*, $\theta_j \sim p(\theta \mid \phi)$.
- The parameters $\phi$ of the population distribution are *hyperparameters*, and the prior placed on them, $p(\phi)$, is the *hyperprior*.
*Exchangeability:*
- Parameters $(\theta_1, \dots, \theta_J)$ are *exchangeable* if the joint distribution $p(\theta_1, \dots, \theta_J)$ is invariant to permutations of the indices.
- I.i.d. models are a special case of exchangeability.
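De Finetti's theorem (stated informally here, as a supplement to the notes above) is what connects exchangeability to hierarchical models: an exchangeable joint distribution of the $\theta_j$ can be represented as i.i.d. draws from a population distribution, mixed over a hyperparameter $\phi$,
$$
p(\theta_1, \dots, \theta_J) = \int \prod_{j = 1}^{J} p(\theta_j \mid \phi) \, p(\phi) \, d\phi .
$$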
## Baseball Hits
@EfronMorris1975a analyzed data from 18 players in the 1970 season.
The goal was to predict each player's batting average over the remainder of the 1970 season from his first 45 at-bats.
The following example is based on @CarpenterGabryGoodrich2017a and the `r rpkg("rstanarm")` vignette [Hierarchical Partial Pooling for Repeated Binary Trials](https://cran.r-project.org/web/packages/rstanarm/vignettes/pooling.html).
The hitting data used in @EfronMorris1975a is included in `r rpkg("rstanarm")` as `r rdoc("rstanarm", "bball1970")`:
```{r}
data("bball1970", package = "rstanarm")
bball1970 <-
  mutate(bball1970,
         # batting average in the first 45 at-bats
         BatAvg1 = Hits / AB,
         # batting average for the remainder of the season
         BatAvg2 = RemainingHits / RemainingAB)
head(bball1970)
```
Let $y_i$ be the number of hits in the first 45 at-bats for player $i$,
$$
\begin{aligned}[t]
y_i & \sim \dBinom(45, \mu_i),
\end{aligned}
$$
where $\mu_i \in (0, 1)$ is the player-specific batting average.
Priors will be placed on the log-odds parameters, $\eta_i \in \R$, which map to the batting averages through the inverse-logit function,
$$
\begin{aligned}[t]
\mu_i &= \frac{1}{1 + \exp(-\eta_i)} .
\end{aligned}
$$
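In R, `plogis()` computes exactly this inverse-logit transformation (and `qlogis()` its inverse, the logit), which makes it easy to move between the two scales:
```{r}
eta <- c(-5, 0, 5)
plogis(eta)          # log-odds -> probability
qlogis(plogis(eta))  # the logit recovers the original log-odds
```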
This example considers three ways of modeling $\mu_i$:
1. **Complete Pooling:** All players have the same batting average parameter.
$$
\eta_i = \eta .
$$
The common (log-odds) batting average is given a weakly informative prior,
$$
\eta \sim \dnorm(0, 2.5) .
$$
This places approximately 95% of the prior mass between $-5$ and $5$ on the log-odds scale, i.e., between `r round(plogis(-5) * 100, 1)`% and `r round(plogis(5) * 100, 1)`% on the proportion scale (see the quick check after this list).
1. **Non-pooled:** Each player's (log-odds) batting average is independent, with each assigned a separate weakly informative prior,
$$
\begin{aligned}[t]
\eta_i &\sim \dnorm(0, 2.5)
\end{aligned}
$$
1. **Partial-pooling:** Each player has a separate (log-odds) batting average, but these parameters are drawn from a common normal distribution whose scale is itself estimated,
$$
\begin{aligned}[t]
\eta_i &\sim \dnorm(0, \tau) , \\
\tau &\sim \dnorm(0, 1) ,
\end{aligned}
$$
where $\tau \geq 0$, so the prior on the scale is effectively half-normal.
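A quick check of the weakly informative prior's implications, promised above, using only base R (`qnorm()` for normal quantiles, `plogis()` for the inverse logit):
```{r}
# central 95% interval of the Normal(0, 2.5) prior on the log-odds scale
interval_logodds <- qnorm(c(0.025, 0.975), mean = 0, sd = 2.5)
interval_logodds
# the same interval on the proportion (batting average) scale
plogis(interval_logodds)
```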
```{r}
bball1970_data <- list(
  N = nrow(bball1970),            # number of players
  k = bball1970$AB,               # at-bats (trials) in the first 45
  y = bball1970$Hits,             # hits (successes) in the first 45
  k_new = bball1970$RemainingAB,  # remaining-season at-bats (held out)
  y_new = bball1970$RemainingHits # remaining-season hits (held out)
)
```
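A small sanity check (an addition, not part of the original example): @EfronMorris1975a take each player's first 45 at-bats, so the number of trials should be constant.
```{r}
stopifnot(all(bball1970_data$k == 45))
```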
Create a list to store models:
```{r}
models <- list()
```
```{r results='hide'}
models[["nopool"]] <- stan_model("stan/binomial-no-pooling.stan")
```
```{r echo=FALSE, comment=NA, prompt=FALSE}
cat(models[["nopool"]]@model_code)
```
```{r results='hide'}
models[["pool"]] <- stan_model("stan/binomial-complete-pooling.stan")
```
```{r echo=FALSE, comment=NA, prompt=FALSE}
cat(models[["pool"]]@model_code)
```
```{r results='hide'}
models[["partial"]] <- stan_model("stan/binomial-partial-pooling.stan")
```
```{r echo=FALSE, comment=NA, prompt=FALSE}
cat(models[["partial"]]@model_code)
```
Draw samples from all three models:
```{r results='hide', message=FALSE, warning=FALSE}
fits <- map(models, sampling, data = bball1970_data, refresh = -1) %>%
  set_names(names(models))
```
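As a cross-check (not part of the original example), the partial-pooling model is close to what `r rpkg("rstanarm")` fits as a varying-intercept binomial model; a sketch, assuming the `bball1970` columns used above:
```{r eval=FALSE}
library("rstanarm")
# one intercept per player on the log-odds scale; rstanarm's default
# priors differ in detail from the hand-written Stan model above
fit_glmer <- stan_glmer(cbind(Hits, AB - Hits) ~ (1 | Player),
                        family = binomial("logit"),
                        data = bball1970)
```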
For each model, calculate the posterior mean of $\mu_i$ for each player:
```{r}
bball1970 <-
  map2_df(names(fits), fits,
          function(nm, fit) {
            mu <- broom::tidy(fit) %>%
              filter(str_detect(term, "^mu"))
            # the complete-pooling model has a single mu;
            # repeat it so that every player gets the same estimate
            if (nrow(mu) == 1) {
              out <- tibble(estimate = rep(mu$estimate, nrow(bball1970)))
            } else {
              out <- select(mu, estimate)
            }
            out$model <- nm
            out$.id <- seq_len(nrow(out))
            out
          }) %>%
  spread(model, estimate) %>%
  bind_cols(bball1970)
```
The partially pooled estimates are shrunk towards the overall average and fall between the no-pooling and complete-pooling estimates, as the check and plot below show.
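A quick numeric check of this claim (it should hold, at least approximately, for every player):
```{r}
# is each partially pooled estimate between the no-pooling
# and complete-pooling estimates for that player?
with(bball1970,
     all(partial >= pmin(nopool, pool) & partial <= pmax(nopool, pool)))
```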
```{r}
select(bball1970, Player, nopool, partial, pool) %>%
  mutate(Player = factor(Player, levels = Player)) %>%
  gather(variable, value, -Player) %>%
  ggplot(aes(y = value, x = factor(variable), group = Player)) +
  geom_point() +
  geom_line() +
  labs(x = "", y = expression(mu))
```
We can plot the actual batting averages (`BatAvg1` and `BatAvg2`) and the model estimates:
```{r}
select(bball1970, Player, nopool, partial, pool, BatAvg1, BatAvg2) %>%
  mutate(Player = factor(Player, levels = Player)) %>%
  gather(variable, value, -Player) %>%
  ggplot(aes(y = Player, x = value, colour = variable)) +
  geom_point() +
  labs(x = expression(mu), y = "")
```
The estimates from the no-pooling model are almost exactly the same as `BatAvg1`.
The out-of-sample batting averages `BatAvg2` show regression to the mean.
For these models, compare overall out-of-sample performance by calculating the actual average out-of-sample log-pointwise predictive density (lppd) and the expected lppd estimated with PSIS-LOO.
The PSIS-LOO estimates of the out-of-sample lppd are optimistic.
However, they still rank the complete-pooling and partial-pooling estimates as superior to the no-pooling estimates.
By the actual average out-of-sample lppd, the partially pooled model fits best.
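Concretely, the average out-of-sample lppd computed below averages, over the $N$ held-out observations, the log of the posterior-mean predictive density over the $S$ posterior draws $\theta^{(s)}$,
$$
\mathrm{lppd} = \frac{1}{N} \sum_{i = 1}^{N} \log \left( \frac{1}{S} \sum_{s = 1}^{S} p(\tilde{y}_i \mid \theta^{(s)}) \right) ,
$$
which is what `mean(log(colMeans(exp(ll_new))))` computes from the matrix of out-of-sample log-likelihood draws.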
```{r warning=FALSE,message=FALSE}
map2_df(names(fits), fits,
        function(nm, fit) {
          # expected out-of-sample lppd, estimated with PSIS-LOO
          loo <- loo(extract_log_lik(fit, "log_lik"))
          # actual out-of-sample log-likelihood draws for the remaining season
          ll_new <- rstan::extract(fit)[["log_lik_new"]]
          tibble(model = nm,
                 loo = loo$elpd_loo / bball1970_data$N,
                 ll_out = mean(log(colMeans(exp(ll_new)))))
        })
```
To see why, plot the average errors for each observation both in- and out-of-sample.
The no-pooling model's in-sample errors are essentially zero, but out of sample it over-estimates (under-estimates) the players with the highest (lowest) batting averages in their first 45 at-bats---regression to the mean.
In sample, the partial-pooling model shrinks the estimates towards the mean; this costs a little in-sample accuracy but pays off out of sample.
Out of sample, the errors of the partially pooled model are not much different from those of the no-pooling model, except that the extreme observations have lower errors.
```{r}
select(bball1970, Player, nopool, partial, pool, BatAvg1, BatAvg2) %>%
  mutate(Player = as.integer(factor(Player, levels = Player))) %>%
  gather(variable, value, -Player, -matches("BatAvg")) %>%
  mutate(`In-sample Errors` = value - BatAvg1,
         `Out-of-sample Errors` = value - BatAvg2) %>%
  select(-matches("BatAvg"), -value) %>%
  gather(sample, error, -variable, -Player) %>%
  ggplot(aes(y = error, x = Player, colour = variable)) +
  geom_hline(yintercept = 0, colour = "white", size = 2) +
  geom_point() +
  geom_line() +
  facet_wrap(~ sample, ncol = 1) +
  theme(legend.position = "bottom")
```
Extensions (a sketch for the first three follows this list):
- Use a beta distribution for the prior of $\mu_i$. How would you specify the prior beta distribution so that it is uninformative?
- If you use a beta distribution, how would you parameterize it as a function of its mean?
- The lowest batting average of the modern era is approximately 0.16 and the highest is approximately 0.4. Use this information to construct an informative prior distribution.
- There may be some truly exceptional players. Model this by replacing the normal prior for $\eta$ with a wide-tailed distribution, such as a Student-$t$.
- The distribution of batting averages may be asymmetric, since there may be a few great players, but a player can only be so bad before being relegated to the minor leagues. Find a skewed distribution to use as a prior.
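A starting point for the first three extensions. This is a sketch only: the mean-and-concentration parameterization of the beta distribution and the quantile matching for the informative prior are standard devices, and the league-average value `mu0 = 0.26` is an illustrative assumption, not from the original text.
```{r eval=FALSE}
# beta prior parameterized by its mean mu0 and a concentration kappa:
# alpha = mu0 * kappa, beta = (1 - mu0) * kappa
beta_params <- function(mu0, kappa) {
  c(alpha = mu0 * kappa, beta = (1 - mu0) * kappa)
}
beta_params(mu0 = 0.26, kappa = 50)  # mu0 = 0.26 is an assumed league average

# informative prior on the log-odds scale: pick a normal whose central
# 95% interval spans batting averages from roughly 0.16 to 0.4
bounds <- qlogis(c(0.16, 0.4))       # approx -1.66 and -0.41
c(mean = mean(bounds), sd = diff(bounds) / (2 * qnorm(0.975)))
```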
### References
- Albert, Jim. [Revisiting Efron and Morris's Baseball Study](https://baseballwithr.wordpress.com/2016/02/15/revisiting-efron-and-morriss-baseball-study/). February 15, 2016.
- Carpenter, Bob. [Hierarchical Bayesian Batting Ability, with Multiple Comparisons](https://lingpipe-blog.com/2009/11/04/hierarchicalbayesian-batting-ability-with-multiple-comparisons/). November 4, 2009.
- Kruschke, John. [Shrinkage in multi-level hierarchical models](http://doingbayesiandataanalysis.blogspot.com/2012/11/shrinkage-in-multi-level-hierarchical.html). November 27, 2012.