```{r, echo = F, cache = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
options(width = 110)
```
# Models With Memory
> **Multilevel models**... remember features of each cluster in the data as they learn about all of the clusters. Depending upon the variation among clusters, which is learned from the data as well, the model pools information across clusters. This pooling tends to improve estimates about each cluster. This improved estimation leads to several, more pragmatic sounding, benefits of the multilevel approach. [@mcelreathStatisticalRethinkingBayesian2020, p. 400, **emphasis** in the original]
These benefits include:
* better estimates for repeated sampling (i.e., in longitudinal data),
* better estimates when there are imbalances among subsamples,
* estimates of the variation across subsamples, and
* avoiding simplistic averaging by retaining variation across subsamples.
> All of these benefits flow out of the same strategy and model structure. You learn one basic design and you get all of this for free.
>
> When it comes to regression, multilevel regression deserves to be the default approach. There are certainly contexts in which it would be better to use an old-fashioned single-level model. But the contexts in which multilevel models are superior are much more numerous. It is better to begin to build a multilevel analysis, and then realize it's unnecessary, than to overlook it. And once you grasp the basic multilevel strategy, it becomes much easier to incorporate related tricks such as allowing for measurement error in the data and even modeling missing data itself ([Chapter 15][Missing Data and Other Opportunities]). (p. 400)
I'm totally on board with this. After learning about the multilevel model, I see it everywhere. For more on the sentiment that it should be the default, check out McElreath's blog post, [*Multilevel regression as default*](https://elevanth.org/blog/2017/08/24/multilevel-regression-as-default/).
## Example: Multilevel tadpoles
Let's load the `reedfrogs` data [see @voneshCompensatoryLarvalResponses2005] and fire up **brms**.
```{r, message = F, warning = F}
library(brms)
data(reedfrogs, package = "rethinking")
d <- reedfrogs
rm(reedfrogs)
```
Go ahead and acquaint yourself with the `reedfrogs`.
```{r, message = F, warning = F}
library(tidyverse)
d %>%
glimpse()
```
Making the `tank` cluster variable is easy.
```{r}
d <-
d %>%
mutate(tank = 1:nrow(d))
```
Here's the formula for the un-pooled model in which each `tank` gets its own intercept:
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{tank}[i]} \\
\alpha_j & \sim \operatorname{Normal} (0, 1.5) & \text{for } j = 1, \dots, 48,
\end{align*}
where $n_i$ is indexed by the `density` column. Its values are distributed like so.
```{r}
d %>%
count(density)
```
Now fit this simple aggregated binomial model much like we practiced in [Chapter 11][Aggregated binomial: Chimpanzees again, condensed.].
```{r b13.1}
b13.1 <-
brm(data = d,
family = binomial,
surv | trials(density) ~ 0 + factor(tank),
prior(normal(0, 1.5), class = b),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
file = "fits/b13.01")
```
We don't need a `depth=2` argument to discover we have 48 different intercepts. The default `print()` behavior will do.
```{r}
print(b13.1)
```
This is much like the models we've fit in earlier chapters using McElreath's index approach, but on steroids. It'll be instructive to take a look at the distribution of the $\alpha_j$ parameters in density plots. We'll plot them in both their log-odds and probability metrics.
For kicks and giggles, let's use a [FiveThirtyEight-like theme](https://github.com/alex23lemm/theme_fivethirtyeight) for this chapter's plots. An easy way to do so is with help from the [**ggthemes** package](https://cran.r-project.org/package=ggthemes).
```{r, fig.width = 7, fig.height = 3.5, warning = F, message = F}
library(ggthemes)
library(tidybayes)
# change the default
theme_set(theme_gray() + theme_fivethirtyeight())
tibble(estimate = fixef(b13.1)[, 1]) %>%
mutate(p = inv_logit_scaled(estimate)) %>%
pivot_longer(estimate:p) %>%
mutate(name = if_else(name == "p", "expected survival probability", "expected survival log-odds")) %>%
ggplot(aes(x = value, fill = name)) +
stat_dots(size = 0) +
scale_fill_manual(values = c("orange1", "orange4")) +
scale_y_continuous(breaks = NULL) +
labs(title = "Tank-level intercepts from the no-pooling model",
subtitle = "Notice now inspecting the distributions of the posterior means can offer insights you\nmight not get if you looked at them one at a time.") +
theme(legend.position = "none",
panel.grid = element_blank()) +
facet_wrap(~ name, scales = "free_x")
```
Even though it seems like we can derive important insights from how the `tank`-level intercepts are distributed, that information is not explicitly encoded in the statistical model. Keep that in mind as we now consider the multilevel alternative. Its formula is
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{tank}[i]} \\
\alpha_j & \sim \operatorname{Normal}(\color{#CD8500}{\bar \alpha}, \color{#CD8500} \sigma) \\
\color{#CD8500}{\bar \alpha} & \color{#CD8500} \sim \color{#CD8500}{\operatorname{Normal}(0, 1.5)} \\
\color{#CD8500} \sigma & \color{#CD8500} \sim \color{#CD8500}{\operatorname{Exponential}(1)},
\end{align*}
where
> the prior for the tank intercepts is now a function of two parameters, $\bar \alpha$ and $\sigma$. You can say $\bar \alpha$ like "bar alpha." The bar means average. These two parameters inside the prior are where the "multi" in multilevel arises. The Gaussian distribution with mean $\bar \alpha$ and standard deviation $\sigma$ is the prior for each tank's intercept. But that prior itself has priors for $\bar \alpha$ and $\sigma$. So there are two levels in the model, each resembling a simpler model. (p. 403, *emphasis* in the original)
With **brms**, you might specify the corresponding multilevel model like this.
```{r b13.2}
b13.2 <-
brm(data = d,
family = binomial,
surv | trials(density) ~ 1 + (1 | tank),
prior = c(prior(normal(0, 1.5), class = Intercept), # alpha bar
prior(exponential(1), class = sd)), # sigma
iter = 5000, warmup = 1000, chains = 4, cores = 4,
sample_prior = "yes",
seed = 13,
file = "fits/b13.02")
```
The syntax for the varying effects follows the [**lme4** style](https://cran.r-project.org/package=brms/vignettes/brms_overview.pdf), `( <varying parameter(s)> | <grouping variable(s)> )`. In this case `(1 | tank)` indicates only the intercept, `1`, varies by `tank`. The extent to which parameters vary is controlled by the prior, `prior(exponential(1), class = sd)`, which is <u>parameterized in the standard deviation metric</u>. Do note that last part. It's common in multilevel software to model in the variance metric, instead. For technical reasons we won't really get into until [Chapter 14][Adventures in Covariance], Stan parameterizes this as a standard deviation.
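If you ever want to see which parameter classes are available for a given model `formula` before fitting, the `get_prior()` function can help. Here's a quick sketch for our varying-intercepts model; note how the `sd` rows are the standard-deviation parameters we just discussed.
```{r, eval = F}
# list the parameter classes and default priors for this formula;
# the `sd` class rows are on the standard-deviation metric
get_prior(data = d,
          family = binomial,
          surv | trials(density) ~ 1 + (1 | tank))
```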
Let's compute the WAIC comparisons.
```{r, message = F}
b13.1 <- add_criterion(b13.1, "waic")
b13.2 <- add_criterion(b13.2, "waic")
w <- loo_compare(b13.1, b13.2, criterion = "waic")
print(w, simplify = F)
```
The `se_diff` is small relative to the `elpd_diff`. If we convert the $\text{elpd}$ difference to the WAIC metric, the message stays the same.
```{r}
cbind(waic_diff = w[, 1] * -2,
se = w[, 2] * 2)
```
Here are the WAIC weights.
```{r}
model_weights(b13.1, b13.2, weights = "waic") %>%
round(digits = 2)
```
I'm not going to show it here, but if you'd like a challenge, try comparing the models with the PSIS-LOO. You'll get some great practice with high `pareto_k` values and the moment matching for problematic observations [see @paananenMomentMatching2020; @paananenImplicitlyAdaptiveImportance2020].
```{r, eval = F, echo = F}
b13.1 <- add_criterion(b13.1, "loo")
b13.2 <- add_criterion(b13.2, "loo")
```
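If you do take up that challenge, a minimal sketch might look something like this. Be aware that `moment_match = TRUE` requires access to all the posterior draws, so you may first need to refit the models with `save_pars = save_pars(all = TRUE)`.
```{r, eval = F}
# refit, saving the draws of all parameters (required for moment matching)
b13.1 <- update(b13.1, save_pars = save_pars(all = TRUE), cores = 4, seed = 13)
b13.2 <- update(b13.2, save_pars = save_pars(all = TRUE), cores = 4, seed = 13)

# compute the PSIS-LOO, applying moment matching to the observations
# with problematic pareto_k values
b13.1 <- add_criterion(b13.1, criterion = "loo", moment_match = TRUE)
b13.2 <- add_criterion(b13.2, criterion = "loo", moment_match = TRUE)

loo_compare(b13.1, b13.2, criterion = "loo") %>%
  print(simplify = F)
```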
But back on track, McElreath commented on the number of effective parameters for the two models. This, recall, is listed in the column for $p_\text{WAIC}$.
```{r}
w[, "p_waic"]
```
And indeed, even though our multilevel model (`b13.2`) technically had two more parameters than the conventional single-level model (`b13.1`), its $p_\text{WAIC}$ is substantially smaller, due to the regularizing level-2 $\sigma$ parameter. Speaking of which, let's examine the model summary.
```{r}
print(b13.2)
```
This time we don't get a list of 48 separate `tank`-level parameters. However, we do get a description of their distribution in terms of $\bar \alpha$ (i.e., `Intercept`) and $\sigma$ (i.e., `sd(Intercept)`). If you'd like the actual `tank`-level parameters, don't worry; they're coming in Figure 13.1. We'll need to do a little prep work, though.
```{r}
post <- as_draws_df(b13.2)
post_mdn <-
coef(b13.2, robust = T)$tank[, , ] %>%
data.frame() %>%
bind_cols(d) %>%
mutate(post_mdn = inv_logit_scaled(Estimate))
head(post_mdn)
```
Here's the **ggplot2** code to reproduce Figure 13.1.
```{r, fig.width = 5, fig.height = 4}
post_mdn %>%
ggplot(aes(x = tank)) +
geom_hline(yintercept = inv_logit_scaled(median(post$b_Intercept)), linetype = 2, linewidth = 1/4) +
geom_vline(xintercept = c(16.5, 32.5), linewidth = 1/4, color = "grey25") +
geom_point(aes(y = propsurv), color = "orange2") +
geom_point(aes(y = post_mdn), shape = 1) +
annotate(geom = "text",
x = c(8, 16 + 8, 32 + 8), y = 0,
label = c("small tanks", "medium tanks", "large tanks")) +
scale_x_continuous(breaks = c(1, 16, 32, 48)) +
scale_y_continuous(breaks = 0:5 / 5, limits = c(0, 1)) +
labs(title = "Multilevel shrinkage!",
subtitle = "The empirical proportions are in orange while the model-\nimplied proportions are the black circles. The dashed line is\nthe model-implied average survival proportion.") +
theme(panel.grid.major = element_blank())
```
Here is the code for our version of Figure 13.2.a, where we visualize the model-implied population distribution of log-odds survival (i.e., the population distribution yielding all the `tank`-level intercepts).
```{r}
# this makes the output of `slice_sample()` reproducible
set.seed(13)
p1 <-
post %>%
slice_sample(n = 100) %>%
expand_grid(x = seq(from = -4, to = 5, length.out = 100)) %>%
mutate(density = dnorm(x, mean = b_Intercept, sd = sd_tank__Intercept)) %>%
ggplot(aes(x = x, y = density, group = .draw)) +
geom_line(alpha = .2, color = "orange2") +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Population survival distribution",
subtitle = "log-odds scale") +
coord_cartesian(xlim = c(-3, 4))
```
Now we make our Figure 13.2.b and then bind the two subplots with **patchwork**.
```{r, fig.width = 6, fig.height = 3.25}
set.seed(13)
p2 <-
post %>%
slice_sample(n = 8000, replace = T) %>%
mutate(sim_tanks = rnorm(n(), mean = b_Intercept, sd = sd_tank__Intercept)) %>%
ggplot(aes(x = inv_logit_scaled(sim_tanks))) +
geom_density(linewidth = 0, fill = "orange2", adjust = 0.1) +
scale_y_continuous(NULL, breaks = NULL) +
labs(title = "Probability of survival",
subtitle = "transformed by the inverse-logit function")
library(patchwork)
(p1 + p2) &
theme(plot.title = element_text(size = 12),
plot.subtitle = element_text(size = 10))
```
Both plots show different ways of expressing the model uncertainty in terms of both the location $\bar \alpha$ and the scale $\sigma$.
#### Rethinking: Varying intercepts as over-dispersion.
> In the previous chapter ([page 369][Over-dispersed counts]), the beta-binomial and gamma-Poisson models were presented as ways for coping with **over-dispersion** of count data. Varying intercepts accomplish the same thing, allowing count outcomes to be over-dispersed. They accomplish this, because when each observed count gets its own unique intercept, but these intercepts are pooled through a common distribution, the predictions expect over-dispersion just like a beta-binomial or gamma-Poisson model would. Multilevel models are also mixtures. Compared to a beta-binomial or gamma-Poisson model, a binomial or Poisson model with a varying intercept on every observed outcome will often be easier to estimate and easier to extend. (p. 407, **emphasis** in the original)
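Incidentally, since each row in our `d` data is a unique `tank`, our `b13.2` is already an example of this kind of over-dispersion model: every observed count got its own intercept, pooled through a common distribution. For a count outcome, a hedged sketch of the analogous alternative to the gamma-Poisson might look like this, where `dat` is a hypothetical data frame with a count outcome `y`, and `obs` is a row-index variable playing the same role `tank` did.
```{r, eval = F}
# give every observation its own pooled intercept
dat <- dat %>%
  mutate(obs = 1:n())

fit <-
  brm(data = dat,
      family = poisson,
      y ~ 1 + (1 | obs),
      prior = c(prior(normal(0, 1.5), class = Intercept),
                prior(exponential(1), class = sd)))
```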
#### Overthinking: Prior for variance components.
Yep, you can use the half-Normal distribution for your priors in **brms**, too. Here it is for model `b13.2`.
```{r b13.2b, message = F}
b13.2b <-
update(b13.2,
prior = c(prior(normal(0, 1.5), class = Intercept),
prior(normal(0, 1), class = sd)),
iter = 5000, warmup = 1000, chains = 4, cores = 4,
sample_prior = "yes",
seed = 13,
file = "fits/b13.02b")
```
McElreath mentioned how one might set a lower bound at zero for the half-Normal prior when using `rethinking::ulam()`. There's no need to do so when using `brms::brm()`. The lower bounds for priors of `class = sd` are already set to zero by default.
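If you'd like to confirm the bounds for yourself, `prior_summary()` returns the priors as used in the fitted model; in recent versions of **brms**, the output includes the bounds in the `lb` and `ub` columns.
```{r, eval = F}
# the `sd` row should show a lower bound (`lb`) of 0
prior_summary(b13.2b)
```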
Check the model summary.
```{r}
print(b13.2b)
```
If you're curious how the exponential and half-Normal priors compare to one another and to their posteriors, you might just plot.
```{r, fig.width = 5.5, fig.height = 2.75}
# for annotation
text <-
tibble(value = c(0.5, 2.4),
density = c(1, 1.85),
distribution = factor(c("prior", "posterior"), levels = c("prior", "posterior")),
prior = "Exponential(1)")
# gather and wrangle the prior and posterior draws
tibble(`prior_Exponential(1)` = prior_draws(b13.2) %>% pull(sd_tank),
`posterior_Exponential(1)` = as_draws_df(b13.2) %>% pull(sd_tank__Intercept),
`prior_Half-Normal(0, 1)` = prior_draws(b13.2b) %>% pull(sd_tank),
`posterior_Half-Normal(0, 1)` = as_draws_df(b13.2b) %>% pull(sd_tank__Intercept)) %>%
pivot_longer(everything(),
names_sep = "_",
names_to = c("distribution", "prior")) %>%
mutate(distribution = factor(distribution, levels = c("prior", "posterior"))) %>%
# plot!
ggplot(aes(x = value, fill = distribution)) +
geom_density(linewidth = 0, alpha = 2/3, adjust = 0.25) +
geom_text(data = text,
aes(y = density, label = distribution, color = distribution)) +
scale_fill_manual(NULL, values = c("orange4", "orange2")) +
scale_color_manual(NULL, values = c("orange4", "orange2")) +
scale_y_continuous(NULL, breaks = NULL) +
labs(subtitle = expression(Hierarchical~sigma~parameter)) +
coord_cartesian(xlim = c(0, 4)) +
theme(legend.position = "none") +
facet_wrap(~ prior)
```
By the way, this is why we set `iter = 5000` and `sample_prior = "yes"` for the last two models. Neither were necessary to fit the models, but both helped us out with this plot.
## Varying effects and the underfitting/overfitting trade-off
> Varying intercepts are just regularized estimates, but adaptively regularized by estimating how diverse the clusters are while estimating the features of each cluster. This fact is not easy to grasp....
>
> A major benefit of using varying effects estimates, instead of the empirical raw estimates, is that they provide more accurate estimates of the individual cluster (tank) intercepts. On average, the varying effects actually provide a better estimate of the individual tank (cluster) means. The reason that the varying intercepts provide better estimates is that they do a better job of trading off underfitting and overfitting. (p. 408)
In this section, we explicate this by contrasting three perspectives:
* complete pooling (i.e., a single-$\alpha$ model),
* no pooling (i.e., the single-level $\alpha_{\text{tank}[i]}$ model), and
* partial pooling [i.e., the multilevel model for which $\alpha_j \sim \operatorname{Normal} (\bar \alpha, \sigma)$].
> To demonstrate [the magic of the multilevel model], we'll simulate some tadpole data. That way, we'll know the true per-pond survival probabilities. Then we can compare the no-pooling estimates to the partial pooling estimates, by computing how close each gets to the true values they are trying to estimate. The rest of this section shows how to do such a simulation. (p. 409)
### The model.
The simulation formula should look familiar.
\begin{align*}
\text{surv}_i & \sim \operatorname{Binomial}(n_i, p_i) \\
\operatorname{logit}(p_i) & = \alpha_{\text{pond}[i]} \\
\alpha_j & \sim \operatorname{Normal}(\bar \alpha, \sigma) \\
\bar \alpha & \sim \operatorname{Normal}(0, 1.5) \\
\sigma & \sim \operatorname{Exponential}(1)
\end{align*}
### Assign values to the parameters.
Here we follow along with McElreath and "assign specific values representative of the actual tadpole data" (p. 409). Because he included a `set.seed()` line in his **R** code 13.8, our results should match his exactly.
```{r}
a_bar <- 1.5
sigma <- 1.5
n_ponds <- 60
set.seed(5005)
dsim <-
tibble(pond = 1:n_ponds,
ni = rep(c(5, 10, 25, 35), each = n_ponds / 4) %>% as.integer(),
true_a = rnorm(n = n_ponds, mean = a_bar, sd = sigma))
head(dsim)
```
McElreath twice urged us to inspect the contents of this simulation. In addition to looking at the data with `head()`, we might well plot.
```{r, fig.width = 5, fig.height = 3}
dsim %>%
mutate(ni = factor(ni)) %>%
ggplot(aes(x = true_a, y = ni)) +
stat_dotsinterval(fill = "orange2", slab_size = 0, .width = .5) +
ggtitle("Log-odds varying by # tadpoles per pond") +
theme(plot.title = element_text(size = 14))
```
### Simulate survivors.
> Each pond $i$ has $n_i$ potential survivors, and nature flips each tadpole's coin, so to speak, with probability of survival $p_i$. This probability $p_i$ is implied by the model definition, and is equal to:
>
> $$p_i = \frac{\exp (\alpha_i)}{1 + \exp (\alpha_i)}$$
>
> The model uses a logit link, and so the probability is defined by the [`inv_logit_scaled()`] function. (p. 411)
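As a quick sanity check, we might confirm `inv_logit_scaled()` really does return the same values as the formula in the block quote above.
```{r}
alpha <- c(-2, 0, 1.5)

# by hand
exp(alpha) / (1 + exp(alpha))

# with the convenience function
inv_logit_scaled(alpha)
```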
Although McElreath shared his `set.seed()` number in the last section, he didn't share it for this bit. We'll go ahead and carry over the one from last time. However, in a moment we'll see this clearly wasn't the one he used here. As a consequence, our results will deviate a bit from his.
```{r}
set.seed(5005)
(
dsim <-
dsim %>%
mutate(si = rbinom(n = n(), prob = inv_logit_scaled(true_a), size = ni))
)
```
### Compute the no-pooling estimates.
The no-pooling estimates (i.e., $\alpha_{\text{pond}[i]}$) are the results of simple algebra.
```{r}
(
dsim <-
dsim %>%
mutate(p_nopool = si / ni)
)
```
"These are the same no-pooling estimates you'd get by fitting a model with a dummy variable for each pond and flat priors that induce no regularization" (p. 411). That is, these are the same kinds of estimates we got back when we fit `b13.1`.
### Compute the partial-pooling estimates.
Fit the multilevel (partial-pooling) model.
```{r b13.3}
b13.3 <-
brm(data = dsim,
family = binomial,
si | trials(ni) ~ 1 + (1 | pond),
prior = c(prior(normal(0, 1.5), class = Intercept),
prior(exponential(1), class = sd)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
file = "fits/b13.03")
```
Here's our standard **brms** summary.
```{r}
print(b13.3)
```
I'm not aware that you can use McElreath's `depth=2` trick in **brms** for `summary()` or `print()`. However, you can get most of that information and more with the Stan-like summary using the `$fit` syntax.
```{r}
b13.3$fit
```
As an aside, notice how this summary still reports the old-style `n_eff` values, rather than the updated `Bulk_ESS` and `Tail_ESS` values. I suspect this will change sometime soon. In the meantime, here's a [thread on the Stan Forums](https://discourse.mc-stan.org/t/new-r-hat-and-ess/8165) featuring members of the Stan team discussing the new $\widehat R$ and ESS diagnostics.
Let's get ready for the diagnostic plot of Figure 13.3. First we add the partially-pooled estimates, as summarized by their posterior means, to the `dsim` data. Then we compute error values.
```{r}
# we could have included this step in the block of code below, if we wanted to
p_partpool <-
coef(b13.3)$pond[, , ] %>%
data.frame() %>%
transmute(p_partpool = inv_logit_scaled(Estimate))
dsim <-
dsim %>%
bind_cols(p_partpool) %>%
mutate(p_true = inv_logit_scaled(true_a)) %>%
mutate(nopool_error = abs(p_nopool - p_true),
partpool_error = abs(p_partpool - p_true))
dsim %>%
glimpse()
```
Here is our code for Figure 13.3. The extra data processing for `dfline` is how we get the values necessary for the horizontal summary lines.
```{r, fig.width = 5, fig.height = 5, warning = F, message = F}
dfline <-
dsim %>%
select(ni, nopool_error:partpool_error) %>%
pivot_longer(-ni) %>%
group_by(name, ni) %>%
summarise(mean_error = mean(value)) %>%
mutate(x = c( 1, 16, 31, 46),
xend = c(15, 30, 45, 60))
dsim %>%
ggplot(aes(x = pond)) +
geom_vline(xintercept = c(15.5, 30.5, 45.5),
color = "white", linewidth = 2/3) +
geom_point(aes(y = nopool_error), color = "orange2") +
geom_point(aes(y = partpool_error), shape = 1) +
geom_segment(data = dfline,
aes(x = x, xend = xend,
y = mean_error, yend = mean_error),
color = rep(c("orange2", "black"), each = 4),
linetype = rep(1:2, each = 4)) +
annotate(geom = "text",
x = c(15 - 7.5, 30 - 7.5, 45 - 7.5, 60 - 7.5), y = .45,
label = c("tiny (5)", "small (10)", "medium (25)", "large (35)")) +
scale_x_continuous(breaks = c(1, 10, 20, 30, 40, 50, 60)) +
labs(title = "Estimate error by model type",
subtitle = "The horizontal axis displays pond number. The vertical axis measures\nthe absolute error in the predicted proportion of survivors, compared to\nthe true value used in the simulation. The higher the point, the worse\nthe estimate. No-pooling shown in orange. Partial pooling shown in black.\nThe orange and dashed black lines show the average error for each kind\nof estimate, across each initial density of tadpoles (pond size).",
y = "absolute error") +
theme(panel.grid.major = element_blank(),
plot.subtitle = element_text(size = 10))
```
If you wanted to quantify the difference in simple summaries, you might execute something like this.
```{r, warning = F, message = F}
dsim %>%
select(ni, nopool_error:partpool_error) %>%
pivot_longer(-ni) %>%
group_by(name) %>%
summarise(mean_error = mean(value) %>% round(digits = 3),
median_error = median(value) %>% round(digits = 3))
```
Although many years of work in statistics have shown that partially pooled estimates are better, on average, this is not always the case. Our results are an example of this. McElreath addressed this directly:
> But there are some cases in which the no-pooling estimates are better. These exceptions often result from ponds with extreme probabilities of survival. The partial pooling estimates shrink such extreme ponds towards the mean, because few ponds exhibit such extreme behavior. But sometimes outliers really are outliers. (p. 414)
I originally learned about the multilevel model in order to work with [longitudinal data](https://gseacademic.harvard.edu/alda/). In that context, I found the basic principles of a multilevel structure quite intuitive. The concept of partial pooling, however, took me some time to wrap my head around. If you're struggling with this, be patient and keep chipping away.
When McElreath [lectured on this topic in 2015](https://youtu.be/82TaniPgzQc?t=2048), he traced partial pooling to statistician [Charles M. Stein](https://imstat.org/2017/05/15/obituary-charles-m-stein-1920-2016/). Efron and Morris [-@efronSteinParadoxStatistics1977] wrote the now classic paper, [*Stein's paradox in statistics*](https://statweb.stanford.edu/~ckirby/brad/other/Article1977.pdf), which does a nice job breaking down why partial pooling can be so powerful. One of the primary examples they used in the paper was of 1970 batting average data. If you'd like more practice seeing how partial pooling works--or if you just like baseball--, check out my blog post, [*Stein's paradox and what partial pooling can do for you*](https://solomonkurz.netlify.app/blog/2019-02-23-stein-s-paradox-and-what-partial-pooling-can-do-for-you/).
#### Overthinking: Repeating the pond simulation.
Within the **brms** workflow, we can reuse a compiled model with `update()`. But first, we'll simulate new data.
```{r}
a_bar <- 1.5
sigma <- 1.5
n_ponds <- 60
set.seed(1999) # for new data, set a new seed
new_dsim <-
tibble(pond = 1:n_ponds,
ni = rep(c(5, 10, 25, 35), each = n_ponds / 4) %>% as.integer(),
true_a = rnorm(n = n_ponds, mean = a_bar, sd = sigma)) %>%
mutate(si = rbinom(n = n(), prob = inv_logit_scaled(true_a), size = ni)) %>%
mutate(p_nopool = si / ni)
glimpse(new_dsim)
```
Fit the new model.
```{r b13.3_new}
b13.3_new <-
update(b13.3,
newdata = new_dsim,
chains = 4, cores = 4,
seed = 13,
file = "fits/b13.03_new")
```
```{r}
print(b13.3_new)
```
Why not plot the first simulation versus the second one?
```{r, fig.width = 8, fig.height = 5}
bind_rows(as_draws_df(b13.3),
as_draws_df(b13.3_new)) %>%
mutate(model = rep(c("b13.3", "b13.3_new"), each = n() / 2)) %>%
ggplot(aes(x = b_Intercept, y = sd_pond__Intercept)) +
stat_density_2d(geom = "raster",
aes(fill = after_stat(density)),
contour = F, n = 200) +
geom_vline(xintercept = a_bar, color = "orange3", linetype = 3) +
geom_hline(yintercept = sigma, color = "orange3", linetype = 3) +
scale_fill_gradient(low = "grey25", high = "orange3") +
ggtitle("Our simulation posteriors contrast a bit",
subtitle = expression(alpha*" is on the x and "*sigma*" is on the y, both in log-odds. The dotted lines intersect at the true values.")) +
coord_cartesian(xlim = c(.7, 2),
ylim = c(.8, 1.9)) +
theme(legend.position = "none",
panel.grid.major = element_blank()) +
facet_wrap(~ model, ncol = 2)
```
If you'd like the `stanfit` portion of your `brm()` object, subset with `$fit`. Take `b13.3`, for example. You might check out its structure via `b13.3$fit %>% str()`. Here's the actual Stan code.
```{r}
b13.3$fit@stanmodel
```
## More than one type of cluster
"We can use and often should use more than one type of cluster in the same model" (p. 415).
#### Rethinking: Cross-classification and hierarchy.
> The kind of data structure in `data(chimpanzees)` is usually called a **cross-classified multilevel** model. It is cross-classified, because actors are not nested within unique blocks. If each chimpanzee had instead done all of his or her pulls on a single day, within a single block, then the data structure would instead be *hierarchical*. However, the model specification would typically be the same. So the model structure and code you'll see below will apply both to cross-classified designs and hierarchical designs. (p. 415, **emphasis** in the original)
### Multilevel chimpanzees.
The initial multilevel update from model `b11.4` from [Section 11.1.1][Logistic regression: Prosocial chimpanzees.] follows the statistical formula
\begin{align*}
\text{left_pull}_i & \sim \operatorname{Binomial}(n_i = 1, p_i) \\
\operatorname{logit} (p_i) & = \alpha_{\text{actor}[i]} + \color{#CD8500}{\gamma_{\text{block}[i]}} + \beta_{\text{treatment}[i]} \\
\beta_j & \sim \operatorname{Normal}(0, 0.5) \;\;\; , \text{for } j = 1, \dots, 4 \\
\alpha_j & \sim \operatorname{Normal}(\bar \alpha, \sigma_\alpha) \;\;\; , \text{for } j = 1, \dots, 7 \\
\color{#CD8500}{\gamma_j} & \color{#CD8500} \sim \color{#CD8500}{\operatorname{Normal}(0, \sigma_\gamma) \;\;\; , \text{for } j = 1, \dots, 6} \\
\bar \alpha & \sim \operatorname{Normal}(0, 1.5) \\
\sigma_\alpha & \sim \operatorname{Exponential}(1) \\
\color{#CD8500}{\sigma_\gamma} & \color{#CD8500} \sim \color{#CD8500}{\operatorname{Exponential}(1)}.
\end{align*}
`r emo::ji("warning")` WARNING `r emo::ji("warning")`
I am so sorry, but we are about to head straight into a load of confusion. If you follow along linearly in the text, we won't have the language to parse this all out until [Section 13.4][Divergent transitions and non-centered priors]. In short, our difficulties will have to do with what are called the centered and the non-centered parameterizations for multilevel models. For the next several models in the text, McElreath used the **centered parameterization**. As we'll learn in Section 13.4, this often causes problems when you use Stan to fit your multilevel models. Happily, the solution to those problems is often the **non-centered parameterization**, which is well known among the Stan team. This issue is so well known, in fact, that Bürkner only supports the non-centered parameterization with **brms** (see [here](https://discourse.mc-stan.org/t/disable-non-centered-parameterization-in-brms/7184/7)). To my knowledge, there is no easy way around this. In the long run, this is a good thing. Your **brms** models will likely avoid some of the problems McElreath highlighted in this part of the text. In the short term, this also means that our results will not completely match up with those in the text. If you really want to reproduce McElreath's models `m13.4` through `m13.6`, you'll have to fit them with the **rethinking** package or directly in Stan. Our models `b13.4` through `b13.6` will be the non-centered **brms** alternatives. Either way, the models make the same predictions, but the nuts and bolts and gears we'll use to construct our multilevel golems will look a little different. With all that in mind, here's how we might express our statistical model using the non-centered parameterization more faithful to the way it will be expressed with `brms::brm()`:
\begin{align*}
\text{left_pull}_i & \sim \operatorname{Binomial}(n_i = 1, p_i) \\
\operatorname{logit} (p_i) & = \bar \alpha + \beta_{\text{treatment}[i]} + \color{#CD8500}{z_{\text{actor}[i]} \sigma_\alpha + x_{\text{block}[i]} \sigma_\gamma} \\
\bar \alpha & \sim \operatorname{Normal}(0, 1.5) \\
\beta_j & \sim \operatorname{Normal}(0, 0.5) \;\;\; , \text{for } j = 1, \dots, 4 \\
\color{#CD8500}{z_j} & \color{#CD8500}\sim \color{#CD8500}{\operatorname{Normal}(0, 1)} \\
\color{#CD8500}{x_j} & \color{#CD8500}\sim \color{#CD8500}{\operatorname{Normal}(0, 1)} \\
\sigma_\alpha & \sim \operatorname{Exponential}(1) \\
\sigma_\gamma & \sim \operatorname{Exponential}(1).
\end{align*}
If you jump ahead to [Section 13.4.2][Non-centered chimpanzees.], you'll see this is just a re-write of the formula on the top of page 424. For now, let's load the data.
```{r}
data(chimpanzees, package = "rethinking")
d <- chimpanzees
rm(chimpanzees)
```
Wrangle and view.
```{r}
d <-
d %>%
mutate(actor = factor(actor),
block = factor(block),
treatment = factor(1 + prosoc_left + 2 * condition))
glimpse(d)
```
Even when using the non-centered parameterization, McElreath's `m13.4` is a bit of an odd model to translate into **brms** syntax. To my knowledge, it can't be done with conventional syntax. But we can fit the model with careful use of the non-linear syntax, which might look like this.
```{r b13.4}
b13.4 <-
brm(data = d,
family = binomial,
bf(pulled_left | trials(1) ~ a + b,
a ~ 1 + (1 | actor) + (1 | block),
b ~ 0 + treatment,
nl = TRUE),
prior = c(prior(normal(0, 0.5), nlpar = b),
prior(normal(0, 1.5), class = b, coef = Intercept, nlpar = a),
prior(exponential(1), class = sd, group = actor, nlpar = a),
prior(exponential(1), class = sd, group = block, nlpar = a)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
file = "fits/b13.04")
```
The `b ~ 0 + treatment` part of the `formula` is our expression of what we wrote above as $\beta_{\text{treatment}[i]}$. There's a lot going on with the `a ~ 1 + (1 | actor) + (1 | block)` part of the formula. The initial `1` outside of the parenthesis is $\bar \alpha$. The `(1 | actor)` and `(1 | block)` parts correspond to $z_{\text{actor}[i]} \sigma_\alpha$ and $x_{\text{block}[i]} \sigma_\gamma$, respectively.
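If you'd like to confirm **brms** really used the non-centered parameterization under the hood, you can inspect the model's underlying Stan code with `stancode()`. Look for the standardized `z_1` and `z_2` parameters, which get multiplied by the standard deviations `sd_1` and `sd_2`.
```{r, eval = F}
# display the Stan code underlying the fit
stancode(b13.4)
```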
Check the trace plots.
```{r, fig.width = 8, fig.height = 6, warning = F, message = F}
library(bayesplot)
color_scheme_set("orange")
as_draws_df(b13.4) %>%
mcmc_trace(pars = vars(b_a_Intercept:`r_block__a[6,Intercept]`),
facet_args = list(ncol = 4),
linewidth = 0.15) +
theme(legend.position = "none")
```
They all look fine. In the text (e.g., page 416), McElreath briefly mentioned warnings about divergent transitions. We didn't get any warnings like that. Keep following along and you'll soon learn why.
Here's a look at the summary when using `print()`.
```{r}
print(b13.4)
```
When you use the `(1 | <group>)` syntax within `brm()`, the group-specific parameters are not shown with `print()`. You only get the hierarchical $\sigma_\text{<group>}$ summaries, shown here as the two rows for `sd(a_Intercept)`. However, you can get a summary of all the parameters with the `posterior_summary()` function.
```{r}
posterior_summary(b13.4) %>% round(digits = 2)
```
We might make the coefficient plot of Figure 13.4.a with `bayesplot::mcmc_plot()`.
```{r, fig.width = 3.75, fig.height = 4}
b13.4 %>%
mcmc_plot(variable = c("^r_", "^b_", "^sd_"), regex = T) +
theme(axis.text.y = element_text(hjust = 0))
```
For a little more control, we might switch to a **tidybayes**-oriented approach.
```{r, fig.width = 3.75, fig.height = 4, warning = F}
# extract the posterior draws
post <- as_draws_df(b13.4)
# this is all stylistic fluff
levels <-
c("sd_block__a_Intercept", "sd_actor__a_Intercept",
"b_a_Intercept",
str_c("r_block__a[", 6:1, ",Intercept]"),
str_c("r_actor__a[", 7:1, ",Intercept]"),
str_c("b_b_treatment", 4:1))
text <-
  tibble(x = posterior_summary(b13.4, probs = c(0.055, 0.945))["r_actor__a[2,Intercept]", c(3, 1)],
         y = c(13.5, 16.5),
         label = c("89% CI", "mean"),
         hjust = c(.5, 0))
arrow <-
  tibble(x = posterior_summary(b13.4, probs = c(0.055, 0.945))["r_actor__a[2,Intercept]", c(3, 1)] + c(-0.3, 0.2),
         xend = posterior_summary(b13.4, probs = c(0.055, 0.945))["r_actor__a[2,Intercept]", c(3, 1)],
         y = c(14, 16),
         yend = c(14.8, 15.35))
# here's the main event
post %>%
pivot_longer(b_a_Intercept:`r_block__a[6,Intercept]`) %>%
mutate(name = factor(name, levels = levels)) %>%
ggplot(aes(x = value, y = name)) +
stat_pointinterval(point_interval = mean_qi,
.width = .89, shape = 21, size = 1, point_size = 2, point_fill = "blue") +
geom_text(data = text,
aes(x = x, y = y, label = label, hjust = hjust)) +
geom_segment(data = arrow,
aes(x = x, xend = xend,
y = y, yend = yend),
arrow = arrow(length = unit(0.15, "cm"))) +
theme(axis.text.y = element_text(hjust = 0),
panel.grid.major.y = element_line(linetype = 3))
```
Regardless of whether we use a **bayesplot**- or **tidybayes**-oriented workflow, a careful look at our coefficient plots will show the parameters are a little different from those McElreath reported. Again, this is because of the subtle differences between our non-centered parameterization and McElreath's centered parameterization. This will all make more sense in Section 13.4.
Now use `post` to compare the group-level $\sigma$ parameters as in Figure 13.4.b.
```{r, fig.width = 3, fig.height = 3, warning = F}
post %>%
pivot_longer(starts_with("sd")) %>%
ggplot(aes(x = value, fill = name)) +
geom_density(linewidth = 0, alpha = 3/4, adjust = 2/3, show.legend = F) +
annotate(geom = "text", x = 0.67, y = 2, label = "block", color = "orange4") +
annotate(geom = "text", x = 2.725, y = 0.5, label = "actor", color = "orange1") +
scale_fill_manual(values = str_c("orange", c(1, 4))) +
scale_y_continuous(NULL, breaks = NULL) +
ggtitle(expression(sigma["<group>"])) +
coord_cartesian(xlim = c(0, 4))
```
Since both the coefficient plots and the density plots indicate there is much more variability among the `actor` parameters than in the `block` parameters, we might fit a model that ignores the variation among the levels of `block`.
```{r b13.5}
b13.5 <-
brm(data = d,
family = binomial,
bf(pulled_left | trials(1) ~ a + b,
a ~ 1 + (1 | actor),
b ~ 0 + treatment,
nl = TRUE),
prior = c(prior(normal(0, 0.5), nlpar = b),
prior(normal(0, 1.5), class = b, coef = Intercept, nlpar = a),
prior(exponential(1), class = sd, group = actor, nlpar = a)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
file = "fits/b13.05")
```
We might compare our models by their WAIC estimates.
```{r, message = F}
b13.4 <- add_criterion(b13.4, "waic")
b13.5 <- add_criterion(b13.5, "waic")
loo_compare(b13.4, b13.5, criterion = "waic") %>%
print(simplify = F)
model_weights(b13.4, b13.5, weights = "waic") %>%
round(digits = 2)
```
The two models yield nearly-equivalent WAIC estimates. Just as in the text, our `p_waic` column shows the models differ by about 2 effective parameters due to the shrinkage from the multilevel partial pooling. Yet recall what McElreath wrote:
> There is nothing to gain here by selecting either model. The comparison of the two models tells a richer story... Since this is an experiment, there is nothing to really select. The experimental design tells us the relevant causal model to inspect. (pp. 418--419)
### Even more clusters.
We can extend partial pooling to the `treatment` conditions, too. With **brms**, it will be more natural to revert to the conventional `formula` syntax.
```{r b13.6}
b13.6 <-
brm(data = d,
family = binomial,
pulled_left | trials(1) ~ 1 + (1 | actor) + (1 | block) + (1 | treatment),
prior = c(prior(normal(0, 1.5), class = Intercept),
prior(exponential(1), class = sd)),
iter = 2000, warmup = 1000, chains = 4, cores = 4,
seed = 13,
file = "fits/b13.06")
```
Recall that with **brms**, we don't have a `coeftab()` function like with McElreath's **rethinking**. For us, one approach would be to compare the relevant rows from `fixef(b13.4)` to the relevant elements from `ranef(b13.6)`.
```{r}
tibble(parameter = str_c("b[", 1:4, "]"),
`b13.4` = fixef(b13.4)[2:5, 1],
`b13.6` = ranef(b13.6)$treatment[, 1, "Intercept"]) %>%
mutate_if(is.double, round, digits = 2)
```
Like in the text, "these are not identical, but they are very close" (p. 419). We might compare the group-level $\sigma$ parameters with a plot.
```{r, fig.width = 5.25, fig.height = 3, warning = F}
as_draws_df(b13.6) %>%
pivot_longer(starts_with("sd")) %>%
mutate(group = str_remove(name, "sd_") %>% str_remove(., "__Intercept")) %>%
mutate(parameter = str_c("sigma[", group,"]")) %>%
ggplot(aes(x = value, y = parameter)) +
stat_halfeye(.width = .95, size = 1, fill = "orange", adjust = 0.1) +
scale_y_discrete(labels = ggplot2:::parse_safe, expand = expansion(add = 0.1)) +
labs(subtitle = "The variation among treatment levels is small, but the\nvariation among the levels of block is still the smallest.") +
theme(axis.text.y = element_text(hjust = 0))
```
Among the three $\sigma_\text{<group>}$ parameters, $\sigma_\text{block}$ is the smallest. Now we'll compare `b13.6` to the last two models with the WAIC.
```{r, message = F}
b13.6 <- add_criterion(b13.6, "waic")
loo_compare(b13.4, b13.5, b13.6, criterion = "waic") %>%
print(simplify = F)
model_weights(b13.4, b13.5, b13.6, weights = "waic") %>%
round(digits = 2)
```
The models show little difference "on purely predictive criteria. This is the typical result, when each cluster (each treatment here) has a lot of data to inform its parameters" (p. 419). Unlike in the text, we didn't have a problem with divergent transitions. We'll see why in the next section.
Before we move on, this section just hints at a historical software difficulty. In short, it's not uncommon to have a theory-based model that includes multiple sources of clustering (i.e., requiring many `( <varying parameter(s)> | <grouping variable(s)> )` parts in the model `formula`). This can make for all kinds of computational difficulties and result in software error messages, inadmissible solutions, and so on. One of the practical solutions to difficulties like these has been to simplify the statistical models by removing some of the clustering terms. Even though such simpler models were not the theory-based ones, at least they yielded solutions. Nowadays, Stan (via **brms** or otherwise) is making it easier to fit the full theoretically-based model. To learn more about this topic, check out this nice blog post by [Michael Frank](https://web.stanford.edu/~mcfrank/), [*Mixed effects models: Is it time to go Bayesian by default?*](http://babieslearninglanguage.blogspot.com/2018/02/mixed-effects-models-is-it-time-to-go.html). Make sure to check out the discussion in the comments section, which includes all-stars like Bürkner and [Douglas Bates](http://pages.stat.wisc.edu/~bates/). You can get more context for the issue from @barrRandomEffectsStructure2013, [*Random effects structure for confirmatory hypothesis testing: Keep it maximal*](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3881361/).
## Divergent transitions and non-centered priors
Although we did not get divergent-transition warnings from our last few models the way McElreath did with his, the issue is still relevant for **brms**.
> One of the best things about Hamiltonian Monte Carlo is that it provides internal checks of efficiency and accuracy. One of these checks comes free, arising from the constraints on the physics simulation. Recall that HMC simulates the frictionless flow of a particle on a surface. In any given transition, which is just a single flick of the particle, the total energy at the start should be equal to the total energy at the end. That's how energy in a closed system works. And in a purely mathematical system, the energy is always conserved correctly. It's just a fact about the physics.
>
> But in a numerical system, it might not be. Sometimes the total energy is not the same at the end as it was at the start. In these cases, the energy is *divergent.* How can this happen? It tends to happen when the posterior distribution is very steep in some region of parameter space. Steep changes in probability are hard for a discrete physics simulation to follow. When that happens, the algorithm notices by comparing the energy at the start to the energy at the end. When they don't match, it indicates numerical problems exploring that part of the posterior distribution.
>
> Divergent transitions are rejected. They don't directly damage your approximation of the posterior distribution. But they do hurt it indirectly, because the region where divergent transitions happen is hard to explore correctly. (p. 420, *emphasis* in the original)
Two primary ways to handle divergent transitions are by increasing the `adapt_delta` parameter, which we've already done a few times in previous chapters, or reparameterizing the model. As McElreath will cover in a bit, switching from the centered to the non-centered parameterization will often work when using multilevel models.
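For the record, here's what the first approach looks like with **brms**: you pass `adapt_delta` through the `control` argument, trading smaller step sizes (and thus slower sampling) for fewer divergences.
```{r, eval = F}
# a sketch: refit an existing model with a stricter adapt_delta
fit <-
  update(b13.4,
         control = list(adapt_delta = 0.99),
         cores = 4, seed = 13)
```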
### The Devil's Funnel.
McElreath posed a joint distribution
\begin{align*}
v & \sim \operatorname{Normal}(0, 3) \\
x & \sim \operatorname{Normal}(0, \exp(v)),
\end{align*}
where the scale of $x$ depends on another variable, $v$. In **R** code 13.26, McElreath then proposed fitting the following model with `rethinking::ulam()`.
```{r, eval = F}
m13.7 <-
ulam(
data = list(N = 1),
alist(
v ~ normal(0, 3),
x ~ normal(0, exp(v))
),
chains = 4
)
```
I'm not aware that you can do something like this with **brms**. If you think I'm in error, [please share your solution](https://github.com/ASKurz/Statistical_Rethinking_with_brms_ggplot2_and_the_tidyverse_2_ed/issues). We can at least get a sense of the model by simulating from the joint distribution and plotting.
```{r, fig.width = 5, fig.height = 2.75, warning = F}
set.seed(13)
tibble(v = rnorm(1e3, mean = 0, sd = 3)) %>%
mutate(x = rnorm(1e3, mean = 0, sd = exp(v))) %>%
ggplot(aes(x = x)) +
geom_histogram(binwidth = 1, fill = "orange2") +
annotate(geom = "text",
x = -100, y = 490, hjust = 0,
label = expression(italic(v)%~%Normal(0, 3))) +
annotate(geom = "text",
x = -100, y = 440, hjust = 0,
label = expression(italic(x)%~%Normal(0, exp(italic(v))))) +
coord_cartesian(xlim = c(-100, 100)) +
scale_y_continuous(breaks = NULL)
```
The distribution looks something like a Student-$t$ with a very low $\nu$ parameter. We can express the joint likelihood of $p(v, x)$ as
$$p(v, x) = p(x | v)\ p(v).$$
Here that is in a plot.
```{r, fig.width = 3.5, fig.height = 3.5}
# define the parameter space
parameter_space <- seq(from = -4, to = 4, length.out = 200)
# simulate
crossing(v = parameter_space,
x = parameter_space) %>%
mutate(likelihood_v = dnorm(v, mean = 0, sd = 3),
likelihood_x = dnorm(x, mean = 0, sd = exp(v))) %>%
mutate(joint_likelihood = likelihood_v * likelihood_x) %>%
# plot!
ggplot(aes(x = x, y = v, fill = joint_likelihood)) +
geom_raster(interpolate = T) +
scale_fill_viridis_c(option = "B") +
labs(subtitle = "Centered parameterization") +
theme(legend.position = "none")
```
This ends up as a version of McElreath's Figure 13.5.a.
> At low values of $v$, the distribution of $x$ contracts around zero. This forms a very steep valley that the Hamiltonian particle needs to explore. Steep surfaces are hard to simulate, because the simulation is not actually continuous. It happens in discrete steps. If the steps are too big, the simulation will overshoot. (p. 421)
To avoid the divergent transitions that can arise from steep valleys like this, we can switch from our original formula to a non-centered parameterization, such as:
\begin{align*}
v & \sim \operatorname{Normal}(0, 3) \\
z & \sim \operatorname{Normal}(0, 1) \\
x & = z \exp(v),
\end{align*}
where $x$ is now a deterministic function of two independent random variables, $v$ and $z$. With this parameterization, we can express the joint likelihood $p(v, z)$ as
$$p(v, z) = p(z) \ p(v),$$
where $p(z)$ is not conditional on $v$ and $p(v)$ is not conditional on $z$. Here's what that looks like in a plot.
```{r, fig.width = 3.5, fig.height = 3.5}
# simulate
crossing(v = parameter_space,
z = parameter_space / 2) %>%
mutate(likelihood_v = dnorm(v, mean = 0, sd = 3),
likelihood_z = dnorm(z, mean = 0, sd = 1)) %>%
mutate(joint_likelihood = likelihood_v * likelihood_z) %>%
# plot!
ggplot(aes(x = z, y = v, fill = joint_likelihood)) +
geom_raster(interpolate = T) +
scale_fill_viridis_c(option = "B") +
labs(subtitle = "Non-centered parameterization") +
theme(legend.position = "none")
```
This is our version of the right-hand panel of McElreath's Figure 13.5. No nasty funnel--just a friendly glowing likelihood orb.
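Though we can't fit the centered version of the Devil's Funnel with `brm()`, we can at least check by simulation that the two parameterizations imply the same distribution for $x$. Here we draw $x$ both ways and compare a few quantiles.
```{r}
set.seed(13)

tibble(v = rnorm(1e5, mean = 0, sd = 3)) %>%
  # centered: x drawn directly with sd = exp(v);
  # non-centered: a standard-normal z, scaled by exp(v)
  mutate(`centered x` = rnorm(1e5, mean = 0, sd = exp(v)),
         `non-centered x` = rnorm(1e5, mean = 0, sd = 1) * exp(v)) %>%
  pivot_longer(-v) %>%
  group_by(name) %>%
  summarise(`25%` = quantile(value, probs = .25),
            `50%` = quantile(value, probs = .5),
            `75%` = quantile(value, probs = .75))
```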
### Non-centered chimpanzees.
At the top of the section, McElreath reported the `rethinking::ulam()` default is to set `adapt_delta = 0.95`. Readers should be aware that the `brms::brm()` default is `adapt_delta = 0.80`. A consequence of this difference is `rethinking::ulam()` will tend to take smaller step sizes than `brms::brm()`, at the cost of slower exploration of the posterior. I don't know that one is inherently better than the other. They're just defaults.
Recall that due to how **brms** only supports the non-centered parameterization, we have already fit our version of McElreath's `m13.4nc`. We called it `b13.4`. Here is the model summary, again.
```{r}
print(b13.4)
```
Because we only fit this model using the non-centered parameterization, we won't be able to fully reproduce McElreath's Figure 13.6. But we can still plot our effective sample sizes. Recall that unlike the way **rethinking** only reports `n_eff`, **brms** now reports both `Bulk_ESS` and `Tail_ESS` [see @vehtariRanknormalizationFoldingLocalization2019]. At the moment, **brms** does not offer a convenience function that allows users to collect those values in a data frame. However you can do so with help from the [**posterior** package](https://github.com/stan-dev/posterior). For our purposes, the function of interest is `summarise_draws()`, which will take the output from `as_draws_df()` as input.
```{r, warning = F, message = F}
library(posterior)
as_draws_df(b13.4) %>%
summarise_draws()
```
Note how the last three columns are the `rhat`, the `ess_bulk`, and the `ess_tail`. Here we summarize those two effective sample size columns in a scatter plot similar to Figure 13.6, but based only on our `b13.4`, which used the non-centered parameterization.
```{r, fig.width = 3.5, fig.height = 3.5, warning = F}
as_draws_df(b13.4) %>%
summarise_draws() %>%
ggplot(aes(x = ess_bulk, y = ess_tail)) +
geom_abline(linetype = 2) +
geom_point(color = "blue") +
xlim(0, 4700) +