tidyverse_tips.Rmd

# tidyverse中的若干技巧 {#tips}

聊聊tidyverse中常用的一些小技巧

> "most of data science is counting, and sometimes dividing"
>  --- Hadley Wickham


```{r tidyverse-tips-1, message = FALSE, warning = FALSE}
library(tidyverse)
library(patchwork)  # install.packages("patchwork")
```


## count()

我之前多次用到`count()`函数，其功能就是统计某个变量中**各组**出现的次数

```{r tidyverse-tips-2}
df <- tibble(
  name = c("Alice", "Alice", "Bob", "Bob", "Carol", "Carol"),
  type = c("english", "math", "english", "math", "english", "math"),
  score = c(60.2, 90.5, 92.2, 98.8, 82.5, 74.6)
)

df
```

```{r tidyverse-tips-3}
df %>% count(name)
```

如果用之前讲的`group_by() + summarise()`来写，
```{r tidyverse-tips-4}
df %>% 
  group_by(name) %>% 
  summarise( n = n())
```


`count()` 还有更多强大的参数， 比如
```{r tidyverse-tips-5}
df %>% count(name,
  sort = TRUE,
  wt = score,
  name = "total_score"
)
```


如果不用`count()`，用`group_by() + summarise()`写，
```{r tidyverse-tips-6}
df %>%
  group_by(name) %>%
  summarise(
    n = n(),
    total_score = sum(score, na.rm = TRUE)
  ) %>%
  arrange(desc(total_score))
```

当然，`count()`在特定场合下的简便写法，遇到复杂的分组统计，还是得用用`group_by() + summarise()`组合。


## 在 count() 中创建新变量

可以在`count()`里构建新变量，并利用这个新变量完成统计
```{r tidyverse-tips-7}
df %>% count(range = 10 * (score %/% 10))
```


## add_count()

想增加一列，代表每人参加的考试次数

```{r tidyverse-tips-8}
df %>%
  group_by(name) %>%
  mutate(n = n()) %>%
  ungroup()
```


可以有更简单的方法
```{r tidyverse-tips-9}
df %>% add_count(name)
```


## nth(), first(),  last()


```{r tidyverse-tips-10}
v <- c("a", "c", "d", "k")
```


```{r tidyverse-tips-11}
v[1]
v[length(v)]
```


```{r tidyverse-tips-12}
c("a", "c", "d", "k") %>% nth(3)
```


```{r tidyverse-tips-13}
c("a", "c", "d", "k") %>% first()
c("a", "c", "d", "k") %>% last()
```


用在数据框中，同样可以使用
```{r tidyverse-tips-14}
df %>%
  filter(score == first(score))
```


```{r tidyverse-tips-15}
df %>%
  group_by(name) %>%
  filter(score == last(score))
```

## 列变量重新排序

比如想把score放在第一列
```{r tidyverse-tips-16}
df %>%
  select(score, everything())
```

这个方法，对列变量较多的情形非常适用。


## if_else


```{r tidyverse-tips-17}
df %>% mutate(
  assess = if_else(score > 85, "very_good", "good")
)
```


## case_when

```{r tidyverse-tips-18}
df %>% mutate(
  assess = case_when(
    score < 70 ~ "general",
    score >= 70 & score < 80 ~ "good",
    score >= 80 & score < 90 ~ "very_good",
    score >= 90 ~ "best",
    TRUE ~ "other"
  )
)
```


## 找出前几名
```{r tidyverse-tips-19}
df %>%
  top_n(2, score)
```


## 去除多余的空白

```{r tidyverse-tips-20}
library(stringr)

str_trim(" excess    whitespace in a string be gone!")
```


```{r tidyverse-tips-21}
# Use str_squish() to remove any leading, trailing, or excess whitespace
str_squish(" excess    whitespace in a string be gone!")
```

## 取反操作
```{r tidyverse-tips-22}
3:10 %in% c(1:5)
```

有时候需要一个**不属于**的操作符
```{r tidyverse-tips-23}
# 自定义一个不属于操作符
`%nin%` <- Negate(`%in%`)
3:10 %nin% c(1:5)
```

```{r tidyverse-tips-24}
# 使用purrr::negate()自定义反向操作符
`%nin%` <- purrr::negate(`%in%`)
3:10 %nin% c(1:5)
```


## drop_na()

```{r tidyverse-tips-25}
dt <- tribble(
  ~x, ~y,
  1, NA,
  2, NA,
  NA, -3,
  NA, -4,
  5, -5
)

dt
```


```{r tidyverse-tips-26}
dt %>% drop_na()
# dt %>% drop_na(x)
```


## replace_na()
```{r tidyverse-tips-27}
dt <- tribble(
  ~x, ~y,
  1, NA,
  2, NA,
  NA, -3,
  NA, -4,
  5, -5
)

dt %>% mutate(x = replace_na(x, 0))
```


```{r tidyverse-tips-28}
dt %>% mutate(
  x = replace_na(x, mean(x, na.rm = TRUE))
  )
```

之前讲正则表达式也有类似的函数`stringr::str_replace_na()`，


## coalesce

```{r tidyverse-tips-29}
dt <- tribble(
  ~x, ~y,
  1, NA,
  2, NA,
  NA, -3,
  NA, -4,
  5, -5
)

dt %>% mutate(
  z = coalesce(x, 0)
  # z = coalesce(x, y)
)
```

有时候，我们可能为了减少信息丢失，想填充NA
```{r tidyverse-tips-30}
dt <- tribble(
  ~name, ~age,
  "a", 1,
  "b", 2,
  "c", NA,
  "d", 2
)


dt %>%
  mutate(
    age_adj = ifelse(is.na(age), mean(age, na.rm = TRUE), age)
  )
```


## summarise() 生成 list-column
summarize()会生成一个value，

```{r tidyverse-tips-31}
library(gapminder)
gapminder %>%
  group_by(continent) %>%
  summarise(
    avg_gdpPercap = mean(gdpPercap)
  )
```

summarize()也可以生成一个list，

```{r tidyverse-tips-32}
library(gapminder)
gapminder %>%
  group_by(continent) %>%
  summarise(test = list(t.test(gdpPercap))) %>% # 单样本的t检验

  mutate(tidied = purrr::map(test, broom::tidy)) %>%
  unnest(tidied) %>%
  ggplot(aes(estimate, continent)) +
  geom_point() +
  geom_errorbarh(aes(
    xmin = conf.low,
    xmax = conf.high
  ))
```


```{r tidyverse-tips-33}
gapminder %>%
  group_by(continent) %>%
  summarise(test = list(lm(lifeExp ~ gdpPercap))) %>% # 线性回归

  mutate(tidied = purrr::map(test, broom::tidy, conf.int = TRUE)) %>%
  unnest(tidied) %>%
  filter(term != "(Intercept)") %>%
  ggplot(aes(estimate, continent)) +
  geom_point() +
  geom_errorbarh(aes(
    xmin = conf.low,
    xmax = conf.high,
    height = .3
  ))
```


以下两种方法，同样完成上面的工作，具体方法会在第 \@ref(advR) 章介绍
```{r tidyverse-tips-34, eval=FALSE}
gapminder %>%
  group_nest(continent) %>%
  mutate(test = map(data, ~ t.test(.x$gdpPercap))) %>%
  mutate(tidied = map(test, broom::tidy)) %>%
  unnest(tidied)
```


```{r tidyverse-tips-35, eval=FALSE}
gapminder %>%
  group_by(continent) %>%
  group_modify(
    ~ broom::tidy(t.test(.x$gdpPercap))
  )
```


## count() + fct_reorder() + geom_col() + coord_flip()

最好用的四件套

```{r tidyverse-tips-36}
gapminder %>%
  distinct(continent, country) %>%
  count(continent) %>%
  ggplot(aes(x = continent, y = n)) +
  geom_col()
```

```{r tidyverse-tips-37}
gapminder %>%
  distinct(continent, country) %>%
  count(continent) %>%
  ggplot(aes(x = fct_reorder(continent, n), y = n)) +
  geom_col() +
  coord_flip()
```


画图容易，但画出一张好图并不容易
```{r tidyverse-tips-38, eval=FALSE, include=FALSE}
c("#b3b3b3a0", "#D55E00", "#0072B2") %>% scales::show_col()
```

```{r tidyverse-tips-39}
gapminder %>%
  distinct(continent, country) %>%
  count(continent) %>% 
  mutate(coll = if_else(continent == "Asia", "red", "gray")) %>% 


  ggplot(aes(x = fct_reorder(continent, n), y = n)) +
  geom_text(aes(label = n), hjust = -0.25) +
  geom_col(width = 0.8, aes(fill = coll) ) +
  coord_flip() +
  theme_classic() +
  scale_fill_manual(values = c("#b3b3b3a0", "#D55E00")) +
  theme(legend.position = "none",
        axis.text = element_text(size = 11)
        ) +
  labs(title = "我的标题", x = "")
```


或者偷懒，将`continent == "Asia"`的结果直接赋值给`aes(fill = ___ )`， 效果与上面是一样的。
```{r tidyverse-tips-40}
gapminder %>%
  distinct(continent, country) %>%
  count(continent) %>% 


  ggplot(aes(x = fct_reorder(continent, n), y = n)) +
  geom_text(aes(label = n), hjust = -0.25) +
  geom_col(width = 0.8, aes(fill = continent == "Asia") ) +
  coord_flip() +
  theme_classic() +
  scale_fill_manual(values = c("#b3b3b3a0", "#D55E00")) +
  annotate("text", x = 3.8, y = 48, label = "this is important\ncase", 
           color = "#D55E00", size = 5) +
  annotate(
    geom = "curve", x = 4.1, y = 48, xend = 4.1, yend = 35, 
    curvature = .3, arrow = arrow(length = unit(2, "mm"))
  ) +
  theme(legend.position = "none",
        axis.text = element_text(size = 11)
        ) +
  labs(title = "我的标题", x = "")
```


## scale_x/y_log10

现实世界很多满足对数规则

- 各国人均GDP
- 各国人口
- 不同人士的收入
- 公司的营业额


```{r tidyverse-tips-41}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point()
```


```{r tidyverse-tips-42}
gapminder %>%
  ggplot(aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  scale_x_log10() # A better way to log transform
```

## fct_lump

门诊病症的流水记录

```{r tidyverse-tips-43}
tb <- tibble::tribble(
  ~disease,  ~n,
      "鼻塞", 112,
      "流涕", 130,
      "发热",  89,
      "腹泻",   5,
      "呕吐",  12,
      "咳嗽", 102,
      "咽痛",  98,
      "乏力",  15,
      "腹痛",   2,
      "妄想",   3,
      "幻听",   6,
      "失眠",   1,
      "贫血",   8,
      "多动",   2,
      "胸痛",   4,
      "胸闷",   5
  )
```

```{r tidyverse-tips-44}
p1 <- tb %>% 
  uncount(n) %>% 

  ggplot(aes(x = disease, fill = disease)) +
  geom_bar() +
  coord_flip() +
  theme(legend.position = "none")


p2 <- tb %>% 
  uncount(n) %>% 
  mutate(
    disease = forcats::fct_lump(disease, 5),
    disease = forcats::fct_reorder(disease, .x = disease, .fun = length)
  ) %>% 
  ggplot(aes(x = disease, fill = disease)) +
  geom_bar() +
  coord_flip() +
  theme(legend.position = "none")
  
```

```{r tidyverse-tips-45}
p1 + p2
```


## fct_reoder2

让图例的顺序与图的曲线顺序一致
```{r tidyverse-tips-46}
dat_wide <- tibble(
  x = 1:3,
  top = c(4.5, 4, 5.5),
  middle = c(4, 4.75, 5),
  bottom = c(3.5, 3.75, 4.5)
)


dat_wide %>%
  pivot_longer(
    cols = c(top, middle, bottom),
    names_to = "region",
    values_to = "awfulness")


dat <- dat_wide %>%
  pivot_longer(
    cols = c(top, middle, bottom),
    names_to = "region",
    values_to = "awfulness") %>%
  mutate(
    region_ABCD = factor(region),
    region_sane = fct_reorder2(region, x, awfulness)
  )

p_ABCD <- ggplot(dat, aes(x, awfulness, colour = region_ABCD)) +
  geom_line() + theme(legend.justification = c(1, 0.85))

p_sane <- ggplot(dat, aes(x, awfulness, colour = region_sane)) +
  geom_line() + theme(legend.justification = c(1, 0.85))

p_ABCD + p_sane +
  plot_annotation(
    title = 'Make the legend order = data order, with forcats::fct_reorder2()')
```


## unite

```{r tidyverse-tips-47}
dfa <- tribble(
   ~school, ~class,
  "chuansi", "01",
  "chuansi", "02",
  "shude", "07",
  "shude", "08",
  "huapulu", "101",
  "huapulu", "103"
)

dfa
```

```{r tidyverse-tips-48}
df_united <- dfa %>% 
   tidyr::unite(school, class, col = "school_plus_class", sep = "_", remove = FALSE)

df_united
```


当然，简单的情况也可以用`mutate()`实现
```{r tidyverse-tips-49}
dfa %>% mutate(newcol = str_c(school, "_", class))
```


## separate()

```{r tidyverse-tips-50, eval=FALSE}
df_united %>%
  tidyr::separate(school_plus_class, into = c("sch", "cls"), sep = "_", remove = F)
```


如果用mutate()来实现，语句就会比较复杂些
```{r tidyverse-tips-51}
df_united %>% 
  mutate(sch = str_split(school_plus_class, "_") %>% map_chr(1)) %>% 
  mutate(cls = str_split(school_plus_class, "_") %>% map_chr(2)) 
```


如果每行不是都恰好分隔成两部分呢？就需要`tidyr::extract()`, 使用方法和`tidyr::separate()`类似

```{r tidyverse-tips-52}
dfb <- tribble(
   ~school_class,
  "chuansi_01",
  "chuansi_02_03",
  "shude_07_0",
  "shude_08_0",
  "huapulu_101_u",
  "huapulu_103__p"
)
dfb
```


```{r tidyverse-tips-53, eval=FALSE}
dfb %>% tidyr::separate(school_class, 
                into = c("sch", "cls"), 
                sep = "_", 
                extra = "drop",
                remove = F)
```


## extract()

有时候分隔符搞不定的，可以用正则表达式，讲捕获的每组弄成一列
```{r tidyverse-tips-54}
dfc <- tibble(x = c("1-12week", "1-10wk", "5-12w", "01-05weeks"))
dfc
```


```{r tidyverse-tips-55}
dfc %>% tidyr::extract(
  x,
  c("start", "end", "letter"), "(\\d+)-(\\d+)([a-z]+)",
  remove = FALSE
)
```


## crossing()  

先看看效果
```{r tidyverse-tips-56}
tidyr::crossing(x = c("F", "M"), y = c("a", "b"), z = c(1:2))
```


这个函数在**数据模拟**的时候很方便，
```{r tidyverse-tips-57}
tidyr::crossing(trials = 1:10, m = 1:5) %>%
  group_by(trials) %>%
  mutate(
    guess = sample.int(5, n()),
    result = m == guess
  ) %>%
  summarise(score = sum(result) / n())
```


再来一个例子
```{r tidyverse-tips-58}
sim <- tribble(
  ~f, ~params,
  "rbinom", list(size = 1, prob = 0.5, n = 10)
)
sim %>%
  mutate(sim = invoke_map(f, params))
```


```{r tidyverse-tips-59}
rep_sim <- sim %>%
  crossing(rep = 1:1e5) %>%
  mutate(sim = invoke_map(f, params)) %>%
  unnest(sim) %>%
  group_by(rep) %>%
  summarise(mean_sim = mean(sim))

head(rep_sim)
```


```{r tidyverse-tips-60, fig.width= 6, fig.height= 4}
rep_sim %>% 
  ggplot(aes(x = mean_sim)) +
  geom_histogram(binwidth = 0.05,  fill = "skyblue") +
  theme_classic()
```


也可用在较复杂的模拟，比如下面介绍的**大数极限定理**， 
```{r tidyverse-tips-61}
sim <- tribble(
  ~n_tosses, ~f, ~params,
     10, "rbinom", list(size = 1, prob = 0.5, n = 15),
     30, "rbinom", list(size = 1, prob = 0.5, n = 30),
    100, "rbinom", list(size = 1, prob = 0.5, n = 100),
   1000, "rbinom", list(size = 1, prob = 0.5, n = 1000),
  10000, "rbinom", list(size = 1, prob = 0.5, n = 1e4)
)
sim_rep <- sim %>%
  crossing(replication = 1:50) %>%
  mutate(sims = invoke_map(f, params)) %>%
  unnest(sims) %>%
  group_by(replication, n_tosses) %>%
  summarise(avg = mean(sims))
```


```{r tidyverse-tips-62, fig.width = 8,  fig.height = 6}
sim_rep %>%
  ggplot(aes(x = factor(n_tosses), y = avg)) +
  ggbeeswarm::geom_quasirandom(color = "lightgrey") +
  scale_y_continuous(limits = c(0, 1)) +
  geom_hline(
    yintercept = 0.5,
    color = "skyblue", lty = 1, size = 1, alpha = 3 / 4
  ) +
  ggthemes::theme_pander() +
  labs(
    title = "50 Replicates Of Mean 'Heads' As Number Of Tosses Increase",
    y = "mean",
    x = "Number Of Tosses"
  )
```

数值模拟我们会在第 \@ref(sampling) 章专门介绍。


```{r tidyverse-tips-63, echo = F}
# remove the objects
# rm(list=ls())
rm(`%nin%`, dat, dat_wide, df, df_united, dfa, dfb, dfc, dt, long, p_ABCD, p_sane, p1, p2, plant_heigt, rep_sim, sim, sim_rep, tb, v, wide)
```

```{r tidyverse-tips-64, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```