broom.Rmd

# 模型输出结果的规整 {#broom}


## 案例

还是用第 \@ref(ggplot2-geom) 章的`gapminder`案例

```{r broom-1}
library(tidyverse)
library(gapminder)
gapminder
```


### 可视化探索

画个简单的图
```{r broom-2}
gapminder %>%
  ggplot(aes(x = log(gdpPercap), y = lifeExp)) +
  geom_point(alpha = 0.2)
```

我们想用**不同的模型**拟合`log(gdpPercap)`与`lifeExp`的关联

```{r broom-3}
library(colorspace)

model_colors <- colorspace::qualitative_hcl(4, palette = "dark 2")
# model_colors <- c("darkorange", "purple", "cyan4")

ggplot(
  data = gapminder,
  mapping = aes(x = log(gdpPercap), y = lifeExp)
) +
  geom_point(alpha = 0.2) +
  geom_smooth(
    method = "lm",
    aes(color = "OLS", fill = "OLS") # one
  ) +
  geom_smooth(
    method = "lm", formula = y ~ splines::bs(x, df = 3),
    aes(color = "Cubic Spline", fill = "Cubic Spline") # two
  ) +
  geom_smooth(
    method = "loess",
    aes(color = "LOESS", fill = "LOESS") # three
  ) +
  scale_color_manual(name = "Models", values = model_colors) +
  scale_fill_manual(name = "Models", values = model_colors) +
  theme(legend.position = "top")
```


### 简单模型

还是回到我们今天的主题。我们建立一个简单的线性模型
```{r broom-4}
out <- lm(
  formula = lifeExp ~ gdpPercap + pop + continent,
  data = gapminder
)
out
```


```{r broom-5, eval=FALSE}
str(out)
```


```{r broom-6}
summary(out)
```


模型的输出结果是一个复杂的list，图 \@ref(fig:lm-object-schematic)给出了`out`的结构
```{r lm-object-schematic, out.width = '35%', echo = FALSE, fig.cap = '线性模型结果的示意图'}
knitr::include_graphics("images/lm-object-schematic.png")
```

我们发现`out`对象包含了很多元素，比如系数、残差、模型残差自由度等等，用读取列表的方法可以直接读取
```{r broom-7, eval=FALSE}
out$coefficients
out$residuals
out$fitted.values
```

事实上，前面使用的`suammary()`函数只是选取和打印了`out`对象的一小部分信息，同时这些信息的结构不适合用`dplyr`操作和`ggplot2`画图。


## broom

为规整模型结果，这里我们推荐用[David Robinson](http://varianceexplained.org/about/) 开发的`broom`宏包。


```{r broom-8, message = FALSE, warning = FALSE}
library(broom)
```

`broom` 宏包将常用的100多种模型的输出结果规整成数据框
`tibble()`的格式，在模型比较和可视化中就可以方便使用`dplyr`函数了。
`broom` 提供了三个主要的函数:

- `tidy()` 提取模型输出结果的主要信息，比如 `coefficients` 和 `t-statistics`
- `glance()` 把模型视为一个整体，提取如 `F-statistic`，`model deviance` 或者 `r-squared`等信息
- `augment()` 模型输出的信息添加到建模用的数据集中，比如`fitted values` 和 `residuals` 


### tidy

```{r broom-9}
tidy(out)
```


```{r broom-10}
out %>%
  tidy() %>%
  ggplot(mapping = aes(
    x = term,
    y = estimate
  )) +
  geom_point() +
  coord_flip()
```


可以很方便的获取系数的置信区间
```{r broom-11}
out %>%
  tidy(conf.int = TRUE)
```


```{r broom-12}
out %>%
  tidy(conf.int = TRUE) %>%
  filter(!term %in% c("(Intercept)")) %>%
  ggplot(aes(
    x = reorder(term, estimate),
    y = estimate, ymin = conf.low, ymax = conf.high
  )) +
  geom_pointrange() +
  coord_flip() +
  labs(x = "", y = "OLS Estimate")
```


### augment

`augment()`会返回一个数据框，这个数据框是在原始数据框的基础上，增加了模型的拟合值（`.fitted`）, 拟合值的标准误（`.se.fit`）, 残差（`.resid`）等列。


```{r broom-13}
augment(out)
```


```{r broom-14}
out %>%
  augment() %>%
  ggplot(mapping = aes(x = lifeExp, y = .fitted)) +
  geom_point()
```


### glance

`glance()` 函数也会返回数据框，但这个数据框只有一行，内容实际上是`summary()`输出结果的最底下一行。


```{r broom-15}
glance(out)
```


## 应用

broom的三个主要函数在分组统计建模时，格外方便。

```{r broom-16}
penguins <-
  palmerpenguins::penguins %>%
  drop_na()
```


```{r broom-17}
penguins %>%
  group_nest(species) %>%
  mutate(model = purrr::map(data, ~ lm(bill_depth_mm ~ bill_length_mm, data = .))) %>%
  mutate(glance = purrr::map(model, ~ broom::glance(.))) %>%
  tidyr::unnest(glance)
```


```{r broom-18}
fit_ols <- function(df) {
  lm(body_mass_g ~ bill_depth_mm + bill_length_mm, data = df)
}


out_tidy <- penguins %>%
  group_nest(species) %>%
  mutate(model = purrr::map(data, fit_ols)) %>%
  mutate(tidy = purrr::map(model, ~ broom::tidy(.))) %>%
  tidyr::unnest(tidy) %>%
  dplyr::filter(!term %in% "(Intercept)")

out_tidy
```


```{r broom-19}
out_tidy %>%
  ggplot(aes(
    x = species, y = estimate,
    ymin = estimate - 2 * std.error,
    ymax = estimate + 2 * std.error,
    color = term
  )) +
  geom_pointrange(position = position_dodge(width = 0.25)) +
  theme(legend.position = "top") +
  labs(x = NULL, y = "Estimate", color = "系数")
```

## 练习

假定数据是
```{r broom-19-1}
df <- tibble(
  x = runif(30, 2, 10),
  y = -2*x + rnorm(30, 0, 5)
  )
df
```

用`broom::augment()`和ggplot2做出类似的残差图

```{r broom-19-2, echo=FALSE, out.width='90%', fig.align = "left"}
fitted_lm <- lm(y ~ x, data = df)
#fitted_lm %>% broom::augment()
#fitted_lm %>% broom::augment_columns(df, type = "lm")

# residuals plot adapted from: https://drsimonj.svbtle.com/visualising-residuals

fitted_lm %>% 
  broom::augment() %>% 
  select(x, y, predicted = .fitted, residuals = .resid) %>% 
  ggplot(aes(x = x, y = y)) +
  geom_smooth(method = "lm", se = FALSE, color = "gray50") +
  geom_segment(aes(xend= x, yend = predicted), alpha = 0.2) +
  geom_point(aes(size = abs(residuals), color = abs(residuals))) +
  scale_color_continuous(low = "grey", high = "#FFB612", aesthetics = c("fill", "color")) +

  theme(panel.grid.minor = element_blank(),
        panel.grid.major = element_line(color = "gray"),
        panel.background = element_rect(fill = "#f0f0f0", color = NA),
        plot.background = element_rect(fill = "#f0f0f0", color = NA),
        axis.ticks = element_blank(),
        legend.position = "none"
        )

```


```{r broom-20, echo = F}
# remove the objects
# rm(list=ls())
rm(out, out_tidy, penguins, model_colors, fit_ols, df)
```


```{r broom-21, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```