---
output:
github_document:
df_print: kable
---
# tidybayes: Bayesian analysis + tidy data + geoms <img id="tidybayes_logo" src="man/figures/logo.svg" align="right" />
[![Build Status](https://travis-ci.org/mjskay/tidybayes.png?branch=master)](https://travis-ci.org/mjskay/tidybayes)
[![DOI](https://zenodo.org/badge/116701609.svg)](https://zenodo.org/badge/latestdoi/116701609)
![Preview of tidybayes plots](man/figures/preview.png)
`tidybayes` is an R package that aims to make it easy to integrate popular Bayesian
modelling methods into a tidy data + ggplot workflow.
[Tidy](http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)
data frames (one observation per row) are particularly convenient for use
in a variety of R data manipulation and visualization packages. However,
when using MCMC / Bayesian samplers like JAGS or Stan in R, we often have
to translate this data into a form the sampler understands, and then after
running the model, translate the resulting sample (or predictions) into a more tidy
format for use with other R functions. `tidybayes` aims to simplify these
two common (often tedious) operations:
* __Composing data__ for use with the sampler. This often means translating
data from a `data.frame` into a `list`, making sure `factors` are encoded as
numerical data, adding variables to store the length of indices, etc. This
package helps automate these operations using the `compose_data` function, which
automatically handles data types like `numeric`, `logical`, `factor`, and `ordinal`,
and allows easy extensions for converting other datatypes into a format the
sampler understands by providing your own implementation of the generic `as_data_list`.
* __Extracting tidy samples__ from the sampler. This often means extracting indices
from parameters with names like `"b[1,1]"`, `"b[1,2]"` into separate columns
of a data frame, like `i = c(1,1,..)` and `j = c(1,2,...)`. More tediously,
sometimes these indices actually correspond to levels of a factor in the original
data; e.g. `"x[1]"` might correspond to a value of `x` for the first level of
some factor. We provide several straightforward ways to convert samples of a
variable with indices into useful long-format
("[tidy](http://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html)")
data frames, with automatic back-conversion of common data types (factors, logicals)
using the `spread_samples` and `gather_samples` functions, including automatic
recovery of factor levels corresponding to variable indices. In most cases this
kind of long-format data is much easier to use with other data-manipulation and
plotting packages (e.g., `dplyr`, `tidyr`, `ggplot2`) than the format provided
by default from the sampler.
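
The `as_data_list` extension point mentioned above can be sketched roughly as follows (the class and method names here are invented for illustration; see `?as_data_list` for the actual generic):

```r
# Hypothetical sketch: teach compose_data() about a custom "percentage" class
# by delegating to the machinery that already handles numeric vectors.
as_data_list.percentage = function(object, name = "", ...) {
  # convert to a plain numeric proportion, then reuse the existing method
  as_data_list(unclass(object) / 100, name = name, ...)
}
```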
`tidybayes` also provides some additional functionality for data manipulation
and visualization tasks common to many models:
* __Extracting tidy fits and predictions__ from models. For models like those
provided by `rstanarm` and `brms`, `tidybayes` provides a tidy analog of the
`fitted` and `predict` functions, called `add_fitted_samples` and
`add_predicted_samples`. These functions are modeled after the `modelr::add_predictions`
function, and turn a grid of predictions into a long-format data frame of
samples from either the fits or predictions from a model. These functions make
it straightforward to generate arbitrary fit lines from a model.
* __Summarizing posterior distributions__ from models. The `point_interval`
family of functions (`mean_qi`, `median_qi`, `mode_hdi`, etc) are methods
for generating estimates and intervals that are designed with tidy workflows
in mind. They can generate estimates plus an arbitrary number of probability
intervals *from* tidy data frames of samples, they *return* tidy data frames,
and they **respect data frame groups**.
* __Visualizing posteriors__. The focus on tidy data makes the output from tidybayes
easy to visualize using `ggplot`. Existing `geom`s (like `geom_pointrange` and
`geom_linerange`) can give useful output, but `tidybayes` also includes several
geoms to simplify common combinations of `stats` and `geoms` with sensible
defaults suitable for visualizing posterior estimates and intervals (`geom_pointinterval`,
`geom_pointintervalh`, `stat_pointinterval`, `stat_pointintervalh`), visualizing
densities with point estimates and intervals ("eye plots", `geom_eye` and `geom_eyeh`;
or "half-eye plots", `geom_halfeyeh`), and visualizing fit lines with an arbitrary
number of uncertainty bands (`geom_lineribbon` and `stat_lineribbon`). Combining the
base-R `quantile` function with `geom_dotplot` also facilitates the construction
of quantile dotplots of posteriors (see example in this document).
* __Comparing a variable across levels of a factor__, which often means first
generating pairs of levels of a factor (according to some desired set of
comparisons) and then computing a function over the value of the comparison
variable for those pairs of levels. Assuming your data is in the format
returned by `spread_samples`, the `compare_levels` function allows comparison
across levels to be made easily.
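
As a sketch, a pairwise comparison of condition means might look like this (assuming a fitted model `m` with a `condition_mean[condition]` parameter, like the one used in the Examples section of this document):

```r
# Sketch: pairwise differences between condition means, summarized
# with means and quantile intervals. Assumes a fitted model `m`.
m %>%
  spread_samples(condition_mean[condition]) %>%
  compare_levels(condition_mean, by = condition) %>%
  mean_qi()
```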
Finally, `tidybayes` aims to fit into common workflows through compatibility with
other packages:
* __Compatibility with other tidy packages__. Default column names in the output
have also been chosen for compatibility with `broom::tidy`, which
makes comparison with estimates from non-Bayesian models straightforward.
* __Compatibility with non-tidy packages__. The `unspread_samples` and
`ungather_samples` functions invert `spread_samples` and `gather_samples`,
aiding compatibility with other Bayesian plotting packages (notably `bayesplot`).
The `gather_emmeans_samples` function turns the output from `emmeans::emmeans`
(formerly `lsmeans`) into long-format data frames (when applied to supported
model types, like `MCMCglmm` and `rstanarm` models).
## Supported model types
`tidybayes` aims to support a variety of models with a uniform interface. Currently supported models include [rstan](https://cran.r-project.org/package=rstan), [coda::mcmc and coda::mcmc.list](https://cran.r-project.org/package=coda), [runjags](https://cran.r-project.org/package=runjags), [rstanarm](https://cran.r-project.org/package=rstanarm), [brms](https://cran.r-project.org/package=brms), [MCMCglmm](https://cran.r-project.org/package=MCMCglmm), and anything with its own `as.mcmc.list` implementation. If you install the [tidybayes.rethinking](https://github.com/mjskay/tidybayes.rethinking) package, models from the [rethinking](https://github.com/rmcelreath/rethinking) package are also supported.
## Installation
`tidybayes` is not yet on CRAN, but I am in the process of preparing a CRAN submission. You can track progress towards the CRAN submission at the [CRAN Milestone](https://github.com/mjskay/tidybayes/milestone/1).
In the meantime, you can install the latest development version from GitHub with these R
commands:
```{r, eval=FALSE}
install.packages("devtools")
devtools::install_github("mjskay/tidybayes")
```
## Examples
This example shows the use of tidybayes with the Stan modeling language; however, tidybayes supports many other samplers and models, such as JAGS, brms, rstanarm, and (theoretically) any model type supported by `coda::as.mcmc.list`.
```{r setup, message = FALSE, warning = FALSE}
library(magrittr)
library(dplyr)
library(ggplot2)
library(ggstance)
library(rstan)
library(tidybayes)
library(emmeans)
library(broom)
library(brms)
library(modelr)
library(forcats)
library(dotwhisker)
```
```{r hidden_options, include=FALSE}
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
#ggplot options
theme_set(theme_light())
#misc options
options(width = 90)
```
Imagine this dataset:
```{r}
set.seed(5)
n = 10
n_condition = 5
ABC =
data_frame(
condition = rep(c("A","B","C","D","E"), n),
response = rnorm(n * n_condition, c(0,1,2,1,-1), 0.5)
)
ABC %>%
ggplot(aes(x = response, y = condition)) +
geom_point(alpha = 0.5) +
ylab("condition")
```
A hierarchical model of this data might estimate an overall mean across the conditions (`overall_mean`), the standard deviation of the condition means (`condition_mean_sd`), the mean within each condition (`condition_mean[condition]`) and the standard deviation of the responses given a condition mean (`response_sd`):
```{stan, output.var="ABC_stan", eval=FALSE}
data {
int<lower=1> n;
int<lower=1> n_condition;
int<lower=1, upper=n_condition> condition[n];
real response[n];
}
parameters {
real overall_mean;
vector[n_condition] condition_zoffset;
real<lower=0> response_sd;
real<lower=0> condition_mean_sd;
}
transformed parameters {
vector[n_condition] condition_mean;
condition_mean = overall_mean + condition_zoffset * condition_mean_sd;
}
model {
response_sd ~ cauchy(0, 1); // => half-cauchy(0, 1)
condition_mean_sd ~ cauchy(0, 1); // => half-cauchy(0, 1)
overall_mean ~ normal(0, 5);
condition_zoffset ~ normal(0, 1); // => condition_mean ~ normal(overall_mean, condition_mean_sd)
for (i in 1:n) {
response[i] ~ normal(condition_mean[condition[i]], response_sd);
}
}
```
```{r echo=FALSE, eval=TRUE}
ABC_stan <- stan_model(model_code =
'data {
int<lower=1> n;
int<lower=1> n_condition;
int<lower=1, upper=n_condition> condition[n];
real response[n];
}
parameters {
real overall_mean;
vector[n_condition] condition_zoffset;
real<lower=0> response_sd;
real<lower=0> condition_mean_sd;
}
transformed parameters {
vector[n_condition] condition_mean;
condition_mean = overall_mean + condition_zoffset * condition_mean_sd;
}
model {
response_sd ~ cauchy(0, 1); // => half-cauchy(0, 1)
condition_mean_sd ~ cauchy(0, 1); // => half-cauchy(0, 1)
overall_mean ~ normal(0, 5);
condition_zoffset ~ normal(0, 1); // => condition_mean ~ normal(overall_mean, condition_mean_sd)
for (i in 1:n) {
response[i] ~ normal(condition_mean[condition[i]], response_sd);
}
}
')
```
### Composing data for input to model: `compose_data`
We have compiled and loaded this model into the variable `ABC_stan`. Rather than munge the data into a format Stan likes ourselves, we will use the `tidybayes::compose_data` function, which takes our `ABC` data frame and automatically generates a list of the following elements:
* `n`: number of observations in the data frame
* `n_condition`: number of levels of the condition factor
* `condition`: a vector of integers indicating the condition of each observation
* `response`: a vector of observations
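For this dataset, the generated list looks roughly like the following (a sketch; the exact element order may differ):

```r
compose_data(ABC)
# approximately:
# list(
#   condition = c(1, 2, 3, 4, 5, 1, 2, ...),  # factor encoded as integers
#   n_condition = 5,
#   response = c(...),                         # the raw responses
#   n = 50
# )
```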
So we can skip right to modeling:
```{r}
m = sampling(ABC_stan, data = compose_data(ABC), control = list(adapt_delta=0.99))
```
### Getting tidy samples from the model: `spread_samples`
We decorate the fitted model using `tidybayes::recover_types`, which will ensure that numeric indices (like `condition`) are translated back into factors when we extract data:
```{r}
m %<>% recover_types(ABC)
```
Now we can extract parameters of interest using `spread_samples`, which automatically parses indices, converts them back into their original format, and turns them into data frame columns. This function accepts a symbolic specification of Stan variables using the same syntax you would use to index variables in Stan. For example, we can extract the condition means and the residual standard deviation:
```{r}
m %>%
spread_samples(condition_mean[condition], response_sd) %>%
head(15) # just show the first few rows
```
The condition numbers are automatically turned back into text ("A", "B", "C", ...) and split into their own column. A long-format data frame is returned with a row for every iteration $\times$ every combination of indices across all variables given to `spread_samples`; for example, because `response_sd` here is not indexed by `condition`, within the same iteration it has the same value for each row corresponding to a different `condition` (some other formats supported by `tidybayes` are discussed in `vignette("tidybayes")`; in particular, the format returned by `gather_samples`).
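For reference, `gather_samples` returns the same information in an even longer format, with variable names gathered into a `term` column and their values into an `estimate` column; a sketch:

```r
# Sketch: gather_samples() returns one row per variable per combination of
# indices per iteration, with names in `term` and values in `estimate`
m %>%
  gather_samples(condition_mean[condition], response_sd) %>%
  head(10)
```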
### Plotting posteriors as eye plots: `geom_eye` / `geom_eyeh`
Automatic splitting of indices into columns makes it easy to plot the condition means here. We will employ the `tidybayes::geom_eyeh` geom (horizontal version of `tidybayes::geom_eye`), which combines a violin plot of the posterior density with the mean and 95% quantile interval to give an "eye plot" of the posterior. The point and interval types are customizable using the `point_interval` family of functions. A "half-eye" plot (non-mirrored density) is also available as `tidybayes::geom_halfeyeh`.
```{r}
m %>%
spread_samples(condition_mean[condition]) %>%
ggplot(aes(x = condition_mean, y = condition)) +
geom_eyeh()
```
Or one can employ the similar "half-eye" plot:
```{r}
m %>%
spread_samples(condition_mean[condition]) %>%
ggplot(aes(x = condition_mean, y = condition)) +
geom_halfeyeh()
```
### Plotting posteriors as quantile dotplots
Intervals are nice if the alpha level happens to line up with whatever decision you are trying to make, but seeing the shape of the posterior is better (hence eye plots, above). On the other hand, making inferences from density plots is imprecise (estimating the area of one shape as a proportion of another is a hard perceptual task). Reasoning about probability in frequency formats is easier, motivating [quantile dotplots](https://github.com/mjskay/when-ish-is-my-bus/blob/master/quantile-dotplots.md), which also allow precise estimation of arbitrary intervals (down to the dot resolution of the plot, here 100):
```{r}
m %>%
spread_samples(condition_mean[condition]) %>%
do(data_frame(condition_mean = quantile(.$condition_mean, ppoints(100)))) %>%
ggplot(aes(x = condition_mean)) +
geom_dotplot(binwidth = .04) +
facet_grid(fct_rev(condition) ~ .) +
scale_y_continuous(breaks = NULL)
```
The idea is to get away from thinking about the posterior as indicating one canonical point or interval, but instead to represent it as (say) 100 approximately equally likely points.
### Model comparison via compatibility with `broom`
The output of the `tidybayes::mean_qi` function (and other `point_interval` functions) is compatible with `broom::tidy`, so we can compare parameter estimates easily to models supported by `broom`.
For example, let's compare to ordinary least squares (OLS) regression:
```{r}
linear_estimates =
lm(response ~ condition, data = ABC) %>%
emmeans(~ condition) %>%
tidy() %>%
mutate(model = "OLS")
linear_estimates
```
The output from `mean_qi` when given a single parameter uses `conf.low` and `conf.high` for interval names so that it lines up with `tidy`:
```{r}
bayes_estimates = m %>%
spread_samples(condition_mean[condition]) %>%
mean_qi(estimate = condition_mean) %>%
mutate(model = "Bayes")
bayes_estimates
```
This makes it easy to bind the two estimates together and plot them:
```{r}
bind_rows(linear_estimates, bayes_estimates) %>%
ggplot(aes(y = condition, x = estimate, xmin = conf.low, xmax = conf.high, color = model)) +
geom_pointrangeh(position = position_dodgev(height = .3))
```
Shrinkage towards the overall mean is visible in the Bayesian estimates.
Compatibility with `broom::tidy` also gives compatibility with `dotwhisker::dwplot`:
```{r, warning = FALSE}
bind_rows(linear_estimates, bayes_estimates) %>%
rename(term = condition) %>%
dotwhisker::dwplot()
```
### Posterior prediction and complex custom plots
The tidy data format returned by `spread_samples` also facilitates additional computation on parameters followed by the construction of more complex custom plots. For example, we can generate posterior predictions easily, and use the `.prob` argument (passed internally to `mean_qi`) to generate any number of intervals from the posterior predictions, then plot them alongside parameter estimates and the data:
```{r}
m %>%
spread_samples(condition_mean[condition], response_sd) %>%
mutate(pred = rnorm(n(), condition_mean, response_sd)) %>%
ggplot(aes(y = condition)) +
# posterior predictive intervals
stat_intervalh(aes(x = pred), .prob = c(.5, .8, .95)) +
scale_color_brewer() +
# mean and quantile intervals of condition mean
stat_pointintervalh(aes(x = condition_mean), .prob = c(.66, .95), position = position_nudge(y = -0.2)) +
# data
geom_point(data = ABC, aes(x = response, y = condition), inherit.aes = FALSE)
```
This plot shows 66% and 95% quantile credible intervals of the posterior mean for each condition (point and black line); 95%, 80%, and 50% posterior predictive intervals (blue); and the data.
### Fit curves
For models that support it (like `rstanarm` and `brms` models), we can also use the `add_fitted_samples` or `add_predicted_samples` functions to generate posterior fits or predictions. Combined with the functions from the `modelr` package, this makes it easy to generate fit curves.
Let's fit a slightly naive model to miles per gallon versus horsepower in the `mtcars` dataset:
```{r, results = "hide", message = FALSE, warning = FALSE}
m_mpg = brm(mpg ~ log(hp), data = mtcars, family = lognormal)
```
Now we will use `modelr::data_grid`, `tidybayes::add_predicted_samples`, and `tidybayes::stat_lineribbon` to generate a fit curve with multiple probability bands:
```{r}
mtcars %>%
data_grid(hp = seq_range(hp, n = 101)) %>%
add_predicted_samples(m_mpg) %>%
ggplot(aes(x = hp)) +
stat_lineribbon(aes(x = hp, y = pred), .prob = c(.99, .95, .8, .5), inherit.aes = FALSE) +
geom_point(aes(y = mpg, x = hp), data = mtcars, inherit.aes = FALSE) +
scale_fill_brewer()
```
`stat_lineribbon(aes(y = pred), .prob = c(.99, .95, .8, .5))` is one of several shortcut geoms that simplify common combinations of `tidybayes` functions and `ggplot` geoms. It is roughly equivalent to the following:
```{r, eval=FALSE}
stat_summary(
aes(y = pred, fill = forcats::fct_rev(ordered(...prob..)), group = -...prob..),
geom = "ribbon", fun.data = median_qi, fun.args = list(.prob = c(.99, .95, .8, .5))
) +
stat_summary(aes(y = pred), fun.y = median, geom = "line", color = "red", size = 1.25)
```
Because this is all tidy data, if you wanted to build a model with interactions among different categorical variables (say, a different curve for automatic and manual transmissions), you could generate predictions conditional on that variable and then use the faceting features built into ggplot to plot them.
Such a model might be:
```{r, results = "hide", message = FALSE, warning = FALSE}
m_mpg_am = brm(mpg ~ log(hp)*am, data = mtcars, family = lognormal)
```
Then we can generate and plot predictions as before (differences from above are highlighted as comments):
```{r}
mtcars %>%
data_grid(hp = seq_range(hp, n = 101), am) %>% # add am to the prediction grid
add_predicted_samples(m_mpg_am) %>%
ggplot() +
stat_lineribbon(aes(x = hp, y = pred), .prob = c(.99, .95, .8, .5), inherit.aes = FALSE) +
geom_point(data = mtcars, aes(x = hp, y = mpg), inherit.aes = FALSE) +
scale_fill_brewer() +
facet_wrap(~ am) # facet by am
```
Or, if you would like overplotted posterior fit lines, you can instead use `tidybayes::add_fitted_samples` to get samples from fit lines (instead of predictions), select some reasonable number of them (say `n = 100`), and then plot them:
```{r}
mtcars %>%
data_grid(hp = seq_range(hp, n = 101), am) %>%
add_fitted_samples(m_mpg_am, n = 100) %>% # sample 100 fits from the posterior
ggplot() +
geom_line(aes(x = hp, y = estimate, group = .iteration), alpha = 0.2, color = "red") +
geom_point(data = mtcars, aes(x = hp, y = mpg), inherit.aes = FALSE) +
facet_wrap(~ am)
```
See `vignette("tidybayes")` for a variety of additional examples and more explanation of how it works.
## Feedback, issues, and contributions
I welcome feedback, suggestions, issues, and contributions! Contact me at <[email protected]>. If you have found a bug, please file it [here](https://github.com/mjskay/tidybayes/issues/new) with minimal code to reproduce the issue. Pull requests should be filed against the [dev](https://github.com/mjskay/tidybayes/tree/dev) branch.
`tidybayes` grew out of helper functions I wrote to make my own analysis pipelines tidier. Over time it has expanded to cover more use cases I have encountered, but I would love to make it cover more!