forked from paul-buerkner/brms
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathbrms_phylogenetics.Rmd
365 lines (302 loc) · 11.7 KB
/
brms_phylogenetics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
---
title: "Estimating Phylogenetic Multilevel Models with brms"
author: "Paul Bürkner"
date: "`r Sys.Date()`"
output:
rmarkdown::html_vignette:
toc: yes
vignette: >
%\VignetteIndexEntry{Estimating Phylogenetic Multilevel Models with brms}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
params:
EVAL: !r identical(Sys.getenv("NOT_CRAN"), "true")
---
```{r, SETTINGS-knitr, include=FALSE}
stopifnot(require(knitr))
options(width = 90)
opts_chunk$set(
comment = NA,
message = FALSE,
warning = FALSE,
eval = if (isTRUE(exists("params"))) params$EVAL else FALSE,
dev = "jpeg",
dpi = 100,
fig.asp = 0.8,
fig.width = 5,
out.width = "60%",
fig.align = "center"
)
library(brms)
ggplot2::theme_set(theme_default())
```
## Introduction
In the present vignette, we want to discuss how to specify phylogenetic
multilevel models using **brms**. These models are relevant in evolutionary
biology when data of many species are analyzed at the same time. The usual
approach would be to model species as a grouping factor in a multilevel model
and estimate varying intercepts (and possibly also varying slopes) over species.
However, species are not independent as they come from the same phylogenetic
tree and we thus have to adjust our model to incorporate this dependency. The
examples discussed here are from chapter 11 of the book *Modern Phylogenetic
Comparative Methods and the application in Evolutionary Biology* (de Villemeruil
& Nakagawa, 2014). The necessary data can be downloaded from the corresponding
website (https://www.mpcm-evolution.com/). Some of these models may take a few
minutes to fit.
## A Simple Phylogenetic Model
Assume we have measurements of a phenotype, `phen` (say the body size), and a
`cofactor` variable (say the temperature of the environment). We prepare the
data using the following code.
```{r}
phylo <- ape::read.nexus("https://paul-buerkner.github.io/data/phylo.nex")
data_simple <- read.table(
"https://paul-buerkner.github.io/data/data_simple.txt",
header = TRUE
)
head(data_simple)
```
The `phylo` object contains information on the relationship between species.
Using this information, we can construct a covariance matrix of species
(Hadfield & Nakagawa, 2010).
```{r}
A <- ape::vcv.phylo(phylo)
```
Now we are ready to fit our first phylogenetic multilevel model:
```{r, results='hide'}
model_simple <- brm(
phen ~ cofactor + (1|gr(phylo, cov = A)),
data = data_simple,
family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0, 10), "b"),
prior(normal(0, 50), "Intercept"),
prior(student_t(3, 0, 20), "sd"),
prior(student_t(3, 0, 20), "sigma")
)
)
```
With the exception of `(1|gr(phylo, cov = A))` instead of `(1|phylo)` this is a
basic multilevel model with a varying intercept over species (`phylo` is an
indicator of species in this data set). However, by using `cov = A` in the `gr`
function, we make sure that species are correlated as specified by the
covariance matrix `A`. We pass `A` itself via the `data2` argument which can be
used for any kinds of data that does not fit into the regular structure of the
`data` argument. Setting priors is not required for achieving good convergence
for this model, but it improves sampling speed a bit. After fitting, the results
can be investigated in detail.
```{r}
summary(model_simple)
plot(model_simple, N = 2, ask = FALSE)
plot(conditional_effects(model_simple), points = TRUE)
```
The so called phylogenetic signal (often symbolize by $\lambda$) can be computed
with the `hypothesis` method and is roughly $\lambda = 0.7$ for this example.
```{r}
hyp <- "sd_phylo__Intercept^2 / (sd_phylo__Intercept^2 + sigma^2) = 0"
(hyp <- hypothesis(model_simple, hyp, class = NULL))
plot(hyp)
```
Note that the phylogenetic signal is just a synonym of the intra-class
correlation (ICC) used in the context phylogenetic analysis.
## A Phylogenetic Model with Repeated Measurements
Often, we have multiple observations per species and this allows to fit more
complicated phylogenetic models.
```{r}
data_repeat <- read.table(
"https://paul-buerkner.github.io/data/data_repeat.txt",
header = TRUE
)
data_repeat$spec_mean_cf <-
with(data_repeat, sapply(split(cofactor, phylo), mean)[phylo])
head(data_repeat)
```
The variable `spec_mean_cf` just contains the mean of the cofactor for each
species. The code for the repeated measurement phylogenetic model looks as
follows:
```{r, results='hide'}
model_repeat1 <- brm(
phen ~ spec_mean_cf + (1|gr(phylo, cov = A)) + (1|species),
data = data_repeat,
family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0,10), "b"),
prior(normal(0,50), "Intercept"),
prior(student_t(3,0,20), "sd"),
prior(student_t(3,0,20), "sigma")
),
sample_prior = TRUE, chains = 2, cores = 2,
iter = 4000, warmup = 1000
)
```
The variables `phylo` and `species` are identical as they are both identifiers
of the species. However, we model the phylogenetic covariance only for `phylo`
and thus the `species` variable accounts for any specific effect that would be
independent of the phylogenetic relationship between species (e.g.,
environmental or niche effects). Again we can obtain model summaries as well as
estimates of the phylogenetic signal.
```{r}
summary(model_repeat1)
```
```{r}
hyp <- paste(
"sd_phylo__Intercept^2 /",
"(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat1, hyp, class = NULL))
plot(hyp)
```
So far, we have completely ignored the variability of the cofactor within
species. To incorporate this into the model, we define
```{r}
data_repeat$within_spec_cf <- data_repeat$cofactor - data_repeat$spec_mean_cf
```
and then fit it again using `within_spec_cf` as an additional predictor.
```{r, results='hide'}
model_repeat2 <- update(
model_repeat1, formula = ~ . + within_spec_cf,
newdata = data_repeat, chains = 2, cores = 2,
iter = 4000, warmup = 1000
)
```
The results are almost unchanged, with apparently no relationship between the
phenotype and the within species variance of `cofactor`.
```{r}
summary(model_repeat2)
```
Also, the phylogenetic signal remains more or less the same.
```{r}
hyp <- paste(
"sd_phylo__Intercept^2 /",
"(sd_phylo__Intercept^2 + sd_species__Intercept^2 + sigma^2) = 0"
)
(hyp <- hypothesis(model_repeat2, hyp, class = NULL))
```
## A Phylogenetic Meta-Analysis
Let's say we have Fisher's z-transformed correlation coefficients $Zr$ per
species along with corresponding sample sizes (e.g., correlations between male
coloration and reproductive success):
```{r}
data_fisher <- read.table(
"https://paul-buerkner.github.io/data/data_effect.txt",
header = TRUE
)
data_fisher$obs <- 1:nrow(data_fisher)
head(data_fisher)
```
We assume the sampling variance to be known and as $V(Zr) = \frac{1}{N - 3}$ for
Fisher's values, where $N$ is the sample size per species. Incorporating the
known sampling variance into the model is straight forward. One has to keep in
mind though, that **brms** requires the sampling standard deviation (square root
of the variance) as input instead of the variance itself. The group-level effect
of `obs` represents the residual variance, which we have to model explicitly in
a meta-analytic model.
```{r, results='hide'}
model_fisher <- brm(
Zr | se(sqrt(1 / (N - 3))) ~ 1 + (1|gr(phylo, cov = A)) + (1|obs),
data = data_fisher, family = gaussian(),
data2 = list(A = A),
prior = c(
prior(normal(0, 10), "Intercept"),
prior(student_t(3, 0, 10), "sd")
),
control = list(adapt_delta = 0.95),
chains = 2, cores = 2, iter = 4000, warmup = 1000
)
```
A summary of the fitted model is obtained via
```{r}
summary(model_fisher)
plot(model_fisher)
```
The meta-analytic mean (i.e., the model intercept) is $0.16$ with a credible
interval of $[0.08, 0.25]$. Thus the mean correlation across species is positive
according to the model.
## A phylogenetic count-data model
Suppose that we analyze a phenotype that consists of counts instead of being a
continuous variable. In such a case, the normality assumption will likely not be
justified and it is recommended to use a distribution explicitly suited for
count data, for instance the Poisson distribution. The following data set (again
retrieved from mpcm-evolution.org) provides an example.
```{r}
data_pois <- read.table(
"https://paul-buerkner.github.io/data/data_pois.txt",
header = TRUE
)
data_pois$obs <- 1:nrow(data_pois)
head(data_pois)
```
As the Poisson distribution does not have a natural overdispersion parameter, we
model the residual variance via the group-level effects of `obs` (e.g., see
Lawless, 1987).
```{r, results='hide'}
model_pois <- brm(
phen_pois ~ cofactor + (1|gr(phylo, cov = A)) + (1|obs),
data = data_pois, family = poisson("log"),
data2 = list(A = A),
chains = 2, cores = 2, iter = 4000,
control = list(adapt_delta = 0.95)
)
```
Again, we obtain a summary of the fitted model via
```{r}
summary(model_pois)
plot(conditional_effects(model_pois), points = TRUE)
```
Now, assume we ignore the fact that the phenotype is count data and fit a linear
normal model instead.
```{r, results='hide'}
model_normal <- brm(
phen_pois ~ cofactor + (1|gr(phylo, cov = A)),
data = data_pois, family = gaussian(),
data2 = list(A = A),
chains = 2, cores = 2, iter = 4000,
control = list(adapt_delta = 0.95)
)
```
```{r}
summary(model_normal)
```
We see that `cofactor` has a positive relationship with the phenotype in both
models. One should keep in mind, though, that the estimates of the Poisson model
are on the log-scale, as we applied the canonical log-link function in this
example. Therefore, estimates are not comparable to a linear normal model even
if applied to the same data. What we can compare, however, is the model fit, for
instance graphically via posterior predictive checks.
```{r}
pp_check(model_pois)
pp_check(model_normal)
```
Apparently, the distribution of the phenotype predicted by the Poisson model
resembles the original distribution of the phenotype pretty closely, while the
normal models fails to do so. We can also apply leave-one-out cross-validation
for direct numerical comparison of model fit.
```{r}
loo(model_pois, model_normal)
```
Since smaller values of loo indicate better fit, it is again evident that the
Poisson model fits the data better than the normal model. Of course, the Poisson
model is not the only reasonable option here. For instance, you could use a
negative binomial model (via family `negative_binomial`), which already contains
an overdispersion parameter so that modeling a varying intercept of `obs`
becomes obsolete.
## Phylogenetic models with multiple group-level effects
In the above examples, we have only used a single group-level effect (i.e., a
varying intercept) for the phylogenetic grouping factors. In **brms**, it is
also possible to estimate multiple group-level effects (e.g., a varying
intercept and a varying slope) for these grouping factors. However, it requires
repeatedly computing Kronecker products of covariance matrices while fitting the
model. This will be very slow especially when the grouping factors have many
levels and matrices are thus large.
## References
de Villemeruil P. & Nakagawa, S. (2014) General quantitative genetic methods for
comparative biology. In:
*Modern phylogenetic comparative methods and their application in evolutionary biology: concepts and practice*
(ed. Garamszegi L.) Springer, New York. pp. 287-303.
Hadfield, J. D. & Nakagawa, S. (2010) General quantitative genetic methods for
comparative biology: phylogenies, taxonomies, and multi-trait models for
continuous and categorical characters. *Journal of Evolutionary Biology*. 23.
494-508.
Lawless, J. F. (1987). Negative binomial and mixed Poisson regression.
*Canadian Journal of Statistics*, 15(3), 209-225.