chapter_5.Rmd

---
title: ""
subtitle: ""
author: ""
institute: ""
date: ""
output:
  xaringan::moon_reader:
    css: [default, css/Font_Style.css]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: center, middle
<span style="font-size: 60px;">第五章</span> <br>
<span style="font-size: 50px;">如何清理数据 二 数据的预处理</span> <br>
<span style="font-size: 30px;">如何使用 dplyr 和 tidyr</span> <br>
<br>
<br>
<span style="font-size: 30px;">胡传鹏</span> <br>
<span style="font-size: 30px;">`r Sys.Date()`</span> <br>

---
# <h1 lang="zh-CN">回顾</h1>
# <h2 lang="zh-CN">了解函数</h2>
# <h2 lang="zh-CN">向量类型</h2>
# <h2 lang="zh-CN">比较运算</h2>
# <h2 lang="zh-CN">数据筛选</h2>
<br>
# <h1 lang="zh-CN">本节课内容</h1>
# <h2 lang="zh-CN">函数拓展（批量读取数据）</h2>
# <h2 lang="zh-CN">dplyr包的函数应用（filter, select, mutate等）</h2>
# <h2 lang="zh-CN">tidyr包的函数应用（separate, drop_na等）</h2>

---
# <h1 lang="zh-CN">批量读取文件</h1>
在第二节课中，我们使用了`read.csv()`读取了一个.out文件，<br>
但是一些行为实验中，多个被试的数据存在一个文件夹中，一个一个读取效率就很低了。<br>
这时，我们可以批量读取这些文件。
<br>

# <h2 lang="en">Plan A: for loop</h2> 
_思维难度低，书写难度高_<br>
_更好理解，但是代码更长_<br>
_存在中间变量_<br>

# <h2 lang="en">Plan B: lapply</h2>
_思维难度高，书写难度低_<br>
_更难理解，但是代码更简洁_<br>
_存在"看不见"的局部变量_<br>

---
## <h1 lang="zh-CN">什么是通配符</h1>
<font size=5>
&emsp;&emsp;通配符是一种特殊字符，它可以在匹配文件名或其他文本字符串时代替其他字符。<br>
<br>
&emsp;&emsp;R中常使用的通配符包括"*""?"和"[]"。<br>
&emsp;&emsp;*：代表任意数量的字符，例如*.csv将匹配所有以.csv结尾的文件。<br>
&emsp;&emsp;?：代表单个字符，例如file?.txt将匹配file1.txt，file2.txt等文件，但不会匹配file10.txt。<br>
&emsp;&emsp;[]：用于匹配指定的一组字符。例如，file[123].txt将匹配file1.txt，file2.txt和file3.txt。
</font>

---
# <h1 lang="zh-CN">载入包</h1>
```{r used pacakge}
# 所有路径使用相对路径
library(here)
# 包含了dplyr和%>%等好用的包的集合
library(tidyverse)
```

---
# <h1 lang="zh-CN">设置工作路径</h1>
```{r Set Working Directory}
# 养成用相对路径的好习惯，便于其他人运行你的代码
WD <-  here::here()
getwd()
```

---
# <h1 lang="zh-CN">批量读取文件</h1>
## <h2 lang="en">for loop</h2>
```{r for loop list.files, error=FALSE}
# 把所有符合某种标题的文件全部读取到一个list中
files <- list.files(file.path("data/match"), pattern = "data_exp7_rep_match_.*\\.out$")


head(files, n = 10L)

str(files)
```

*P.S.尽管函数叫list.files，但它得到的变量的属性是value，而不是list*

---
# <h1 lang="zh-CN">批量读取文件</h1>
## <h2 lang="en">for loop</h2>
```{r df.mt.out.fl}
# 创建一个空的列表来存储读取的数据框
df_list <- list()
# 循环读取每个文件，处理数据并添加到列表中
for (i in seq_along(files)) { # 重复"读取到的.out个数"的次数
  # 对每个.out，使用read.table
  df <- read.table(file.path("data/match", files[i]), header = TRUE) #read.table似乎比read.csv更聪明，不需要指定分隔符
  # 给读取到的每个.out文件的每个变量统一变量格式
  df <- dplyr::filter(df, Date != "Date") %>% # 因为有些.out文件中部还有变量名，所需需要用filter把这些行过滤掉
    dplyr::mutate(Date = as.character(Date),Prac = as.character(Prac),
                  Sub = as.numeric(Sub),Age = as.numeric(Age),Sex = as.character(Sex),Hand = as.character(Hand),
                  Block = as.numeric(Block),Bin = as.numeric(Bin),Trial = as.numeric(Trial),
                  Shape = as.character(Shape),Label = as.character(Shape),Match = as.character(Match),
                  CorrResp = as.character(CorrResp),Resp = as.character(Resp),
                  ACC = as.numeric(ACC),RT = as.numeric(RT))
  # 将数据框添加到列表中
  df_list[[i]] <- df
}
# 合并所有数据框，只有当变量的属性一致时，才可以bind_rows
# bind_rows 意味着把list中的所有表格整合成一个大表格
df.mt.out.fl <- dplyr::bind_rows(df_list)
# 清除中间变量
rm(df,df_list,files,i)
# 如果你将这个步骤写成函数，则这些变量自然不会出现在全局变量中
```

---
# <h1 lang="zh-CN">批量读取文件</h1>
## <h2 lang="en">for loop</h2>
_应该是25920 obs 16 variables_
```{r df.mt.out DT, echo=FALSE}
DT::datatable(head(df.mt.out.fl, 100),
              fillContainer = TRUE, options = list(pageLength = 7))
```

---
# <h1 lang="zh-CN">批量读取文件</h1>
## <h2 lang="en">lapply</h2>
```{r df.mt.raw.la}
# 获取所有的.out文件名
df.mt.out.la <- list.files(file.path("data/match"), pattern = "data_exp7_rep_match_.*\\.out$") %>%
  # 对读取到的所有.out文件x都执行函数read.table
  lapply(function(x) read.table(file.path("data/match", x), header = TRUE)) %>% 
  # 对所有被read.table处理过的数据执行dplyr的清洗
  lapply(function(df) dplyr::filter(df, Date != "Date") %>% # 因为有些.out文件中部还有变量名，所需需要用filter把这些行过滤掉
                      dplyr::mutate(Date = as.character(Date),Prac = as.character(Prac),
                                    Sub = as.numeric(Sub),Age = as.numeric(Age),Sex = as.character(Sex),Hand = as.character(Hand),
                                    Block = as.numeric(Block),Bin = as.numeric(Bin),Trial = as.numeric(Trial),
                                    Shape = as.character(Shape),Label = as.character(Shape),Match = as.character(Match),
                                    CorrResp = as.character(CorrResp),Resp = as.character(Resp),
                                    ACC = as.numeric(ACC),RT = as.numeric(RT)
                                    ) # 有些文件里读出来的数据格式不同，在这里统一所有out文件中的数据格式
         ) %>%
  bind_rows()
```

---
# <h1 lang="zh-CN">批量读取文件</h1>
## <h2 lang="en">lapply</h2>
_应该是25920 obs 16 variables_
```{r df.mt.out.la DT, echo=FALSE}
DT::datatable(head(df.mt.out.la, 100),
              fillContainer = TRUE, options = list(pageLength = 7))
```

---
# <h1 lang="zh-CN">保存文件</h1>
```{r write.csv}
#for loop 或 lapply的都可以
#write.csv(df.mt.out.fl, file = "./data/match/match_raw.csv",row.names = FALSE)
write.csv(df.mt.out.la, file = "./data/match/match_raw.csv",row.names = FALSE)
```

---
class: center, middle
<span style="font-size: 60px;">5.1 数据预处理准备</span> <br>

---
# <h1 lang="zh-CN">读取原始数据 </h1>
## <h2 lang="en">Raw Data: Penguin </h2>
```{r Read Penguin RawData}
# 读取原始数据
df.pg.raw <-  read.csv('./data/penguin/penguin_rawdata.csv',
                       header = T, sep=",", stringsAsFactors = FALSE)
# 这里查看表格使用的是DT::datatable，为了PPT里好看
# 你可以直接点R Studio右边的环境变量来看，或者用str()或者head()
```
```{r Read Penguin RawData DT, echo=FALSE}
DT::datatable(head(df.pg.raw, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="zh-CN">读取原始数据 </h1>
## <h2 lang="en">Raw Data: Match Task </h2>
```{r Read Match Task RawData}
# 读取原始数据
df.mt.raw <-  read.csv('./data/match/match_raw.csv',
                       header = T, sep=",", stringsAsFactors = FALSE) 
```
```{r Read Match Task RawData DT, echo=FALSE}
DT::datatable(head(df.mt.raw, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
class: center, middle
<span style="font-size: 60px;">5.2 数据预处理的基本操作</span> <br>
---
# <h1 lang="en">dplyr</h1>
<br>

<body>
  <p lang="en">dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: </p>
</body>

<br>

<img src="https://dplyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
---
# <h1 lang="en">dplyr::functions</h1>

- filter() 选择符合某个条件的行（可能代表被试） <br>

- mutate() 生成新的变量 <br>

- group_by() 依据某些变量产生的条件，给数据分组 <br>
  **如果你使用了 "group_by",** <br>
  **一定要在summarise后使用 "ungroup".** <br>
  
- summarise() 进行某些加减乘除的运算 <br>  

- ungroup() 取消刚刚进行的分组 <br>  

- select() 选择最终进行分析时需要用到的变量，同时也起到了为所有变量排序的功能 <br>

- arrange() 某一列的值，按照某个顺序排列（其他列也会随之变动） <br>

_当你清洗数据时，也基本上会按照这个顺序来使用_

---
class: center, middle
# <h1 lang="zh-CN">接下来就要正式讲dplyr了</h1>
<img src="https://dplyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
<br>


---
# <h1 lang="en">dplyr::filter</h1>
## <h2 lang="zh-CN">选择个案</h2>

```{r example of filter rawdata_penguin}
# 使用filter筛选出数据集中1995之后出生的被试
df.clean.filter <- df.pg.raw %>%
  dplyr::filter(.,age >= 1995)
```
```{r example of filter rawdata_penguin DT, echo=FALSE}
# 看看筛选后的数据是不是只有95后
DT::datatable(head(df.clean.filter, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">dplyr::select</h1>
## <h2 lang="zh-CN">选择变量</h2>

```{r example of select rawdata_penguin}
# 使用select选择age和ALEX的所有题目
df.clean.select <- df.pg.raw %>%
  dplyr::select(age, starts_with("ALEX"), eatdrink, avoidance)
#笨一点的方法，就是把16个ALEX都写出来
```
```{r example of select rawdata_penguin DT, echo=FALSE}
# 看看其他变量是不是都消失了
DT::datatable(head(df.clean.select, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="zh-CN">计算变量 对指定列求和</h2>
```{r example of mutate_1 rawdata_penguin}
# 把ALEX1 - 4求和
df.clean.mutate_1 <- df.pg.raw %>% 
  dplyr::mutate(ALEX_SUM = ALEX1 + ALEX2 + ALEX3 + ALEX4)
```
```{r example of mutate_1 rawdata_penguin DT, echo=FALSE}
# 看看是不是真的求和了
DT::datatable(head(df.clean.mutate_1, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="zh-CN">计算变量 对含有某个字符的列求和</h2>
```{r example of mutate_2 rawdata_penguin}
# 对所有含有ALEX的列求和
df.clean.mutate_2 <- df.pg.raw %>% 
  dplyr::mutate(ALEX_SUM = rowSums(select(., starts_with("ALEX"))))
```
```{r example of mutate_2 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_2, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```


---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="zh-CN">重新编码为不同变量</h2>

```{r example of mutate_3 rawdata_penguin}
df.clean.mutate_3 <- df.pg.raw %>% 
  dplyr::mutate(decade = case_when(age <= 1969 ~ 60,
                                   age >= 1970 & age <= 1979 ~ 70,
                                   age >= 1980 & age <= 1989 ~ 80,
                                   age >= 1990 & age <= 1999 ~ 90,
                                   TRUE ~ NA_real_)
                ) %>% #当括号多的时候注意括号的位置 
  dplyr::select(.,decade, everything())
```
```{r example of mutate_3 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_3, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="zh-CN">重新编码为相同变量（反向计分）</h2>

```{r example of mutate_4 rawdata_penguin}
df.clean.mutate_4 <- df.pg.raw %>% 
  dplyr::mutate(ALEX1 = case_when(ALEX1 == '1' ~ '5',
                                  ALEX1 == '2' ~ '4',
                                  ALEX1 == '3' ~ '3',
                                  ALEX1 == '4' ~ '2',
                                  ALEX1 == '5' ~ '1',
                                  TRUE ~ as.character(ALEX1))
                ) %>%
  dplyr::mutate(ALEX1 = as.numeric(ALEX1))
```
```{r example of mutate_4 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_4, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">dplyr::group_by & summarise</h1>
## <h2 lang="zh-CN">拆分文件 分组计算</h2>

```{r example of group_by rawdata_penguin}
df.clean.group_by <- df.clean.mutate_3 %>%
  dplyr::group_by(.,decade) %>% # 根据被试的出生年代，将数据拆分
  dplyr::summarise(mean_avoidance = mean(avoidance)) %>% # 计算不同年代下被试的平均avoidance
  dplyr::ungroup()
```
```{r example of group_by rawdata_penguin DT, echo=FALSE}
# 拆分文件并不会让源文件产生任何视觉上的变化
DT::datatable(head(df.clean.group_by, 4),
              fillContainer = TRUE, options = list(pageLength = 4))
```

---
# <h1 lang="en">dplyr::functions</h1>
## <h2 lang="zh-CN">把之前学到的函数一起使用</h2>

Step1: 选择eatdrink为1的被试    [filter]<br>
<br>
Step2: 选择我们需要的变量   [select]<br>
<br>
Step3: 对反向计分题目重新编码   [mutate]<br>
<br>
Step4: 将出生年份编码为出生年代   [mutate]<br>
<br>
Step5: 按年代计算ALEX的平均值   [group_by, summarise]<br>
<br>

---
# <h1 lang="en">dplyr::functions</h1>
## <h2 lang="zh-CN">把之前学到的函数一起使用</h2>

```{r example of total rawdata_penguin}
df.pg.clean <- df.pg.raw %>%
  dplyr::filter(eatdrink == 1) %>% # 选择eatdrink为1的被试
  dplyr::select(age, starts_with("ALEX"), eatdrink, avoidance) %>%
  dplyr::mutate(ALEX1 = case_when(ALEX1 == '1' ~ '5', # 反向计分
                                  ALEX1 == '2' ~ '4',
                                  ALEX1 == '3' ~ '3',
                                  ALEX1 == '4' ~ '2',
                                  ALEX1 == '5' ~ '1',
                                  TRUE ~ as.character(ALEX1))) %>%
  dplyr::mutate(ALEX1 = as.numeric(ALEX1)) %>%
  dplyr::mutate(ALEX_SUM = rowSums(select(., starts_with("ALEX"))), # 把所有ALEX的题目分数求和
                decade = case_when(age <= 1969 ~ 60, # 把出生年份转换为年代
                                   age >= 1970 & age <= 1979 ~ 70,
                                   age >= 1980 & age <= 1989 ~ 80,
                                   age >= 1990 & age <= 1999 ~ 90,
                                   TRUE ~ NA_real_)) %>%
  dplyr::group_by(decade) %>% # 按照年代将数据拆分
  dplyr::summarise(mean_ALEX = mean(ALEX_SUM)) %>% # 计算每个年代的被试的平均的ALEX_SUM
  dplyr::ungroup() # 解除对数据的拆分
```

---
# <h1 lang="en">dplyr::functions</h1>
## <h2 lang="zh-CN">把之前学到的函数一起使用</h2>

```{r result of total, echo=FALSE}
DT::datatable(head(df.pg.clean, 5),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
class: center, middle
<span style="font-size: 60px;">5.3 数据预处理的进阶操作</span> <br>

---
# <h1 lang="en">tidyr</h1>

**The goal of tidyr is to help you create tidy data. Tidy data is data where:** <br>

- Every column is variable. <br>
- Every row is an observation. <br>
- Every cell is a single value. <br>

<img src="https://tidyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
---
# <h1 lang="en">tidyr::functions</h1>

- separate() 把一个变量的单元格内的字符串拆成两份，变成两个变量 <br>
  **更适合用于按固定分隔符分割字符串，如将“2022-02-25”分成“2022”、“02”和“25”三列** <br>
- extract() 类似于separate <br>
  **更适合用于从字符串中提取特定的信息，如将“John Smith”分成“John”和“Smith”两列** <br>
- unite() 把多个列（字符串）整合为一列 <br>

- pivot_longer() 把宽数据转化为长数据 <br>
- pivot_wider() 把长数据转化为宽数据 <br>   
  
- drop_na() 删除缺失值
---
class: center, middle
# <h1 lang="zh-CN">接下来就要正式讲tidyr了</h1>
<img src="https://tidyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
<br>

---
# <h1 lang="en">tidyr::separate</h1>
## <h2 lang="zh-CN">拆分单元格内字符串</h2>

```{r tidyr::separate | rawdata_matchtask}
df.clean.separate <- df.mt.raw %>%
  tidyr::separate(., col = Shape, into = c("Valence", "Identity"), 
                                  sep = "(?<=moral|immoral)(?=Self|Other)") %>%
  dplyr::select(Sub, Valence, Identity, everything())
```
```{r tidyr::separate | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.separate, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">tidyr::extract</h1>
## <h2 lang="zh-CN">拆分单元格内字符串</h2>

```{r tidyr::extract | rawdata_matchtask}
df.clean.extract <- df.mt.raw %>% 
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                        regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>% 
  dplyr::select(Sub, Valence, Identity, everything())
```
```{r tidyr::extract | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.extract, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">tidyr::unite</h1>
## <h2 lang="zh-CN">合并单元格的字符串</h2>

```{r tidyr::unite | rawdata_matchtask}
df.clean.unite <- df.clean.separate %>%
  tidyr::unite(Shape, Valence, Identity, sep = "") 
```
```{r tidyr::unite | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.unite, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">tidyr::pivot_longer</h1>
## <h2 lang="zh-CN">长数据与宽数据的相互转换</h2>

```{r pivot_longer | rawdata_matchtask}
df.clean.long <- df.mt.raw %>% 
  tidyr::pivot_longer(cols = c(RT, ACC),
                      names_to = "DV",
                      values_to = "Value") 
```
```{r pivot_longer | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.long, 48),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">tidyr::pivot_wider</h1>
## <h2 lang="zh-CN">长数据与宽数据的相互转换</h2>

```{r pivot_wider | rawdata_matchtask, warning=FALSE}
df.clean.wide <- df.clean.long %>% 
  dplyr::select(Sub, Trial, Shape, DV, Value) %>%
  dplyr::group_by(Sub, Shape, DV) %>%
  dplyr::summarise(mean_Value = mean(Value)) %>%
  tidyr::pivot_wider(names_from = c("Shape", "DV"), values_from = "mean_Value", names_glue = "{Shape}_{DV}")
```
```{r pivot_wider | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.wide, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```

---
# <h1 lang="en">tidyr::drop_na</h1>
## <h2 lang="zh-CN">删除含有缺失值的行（被试，试次，实验条件...）</h2>

```{r drop_na | rawdata_matchtask, warning=FALSE}
df.clean.drou_na <- df.mt.raw %>% 
  tidyr::drop_na()
```
```{r drop_na | rawdata_matchtask check NA}
paste("原始数据集有", nrow(df.mt.raw), "行")
paste("删除缺失值后有", nrow(df.clean.drou_na), "行")
# 实际操作中，可能粗暴的删除所有含有缺失值的行并不妥
# 因此建议通过dplyr::的filter来筛选出合格的被试（行）
# 然后再用is.na()来检验是否还存在缺失值
any(is.na(df.mt.raw)); sum(is.na(df.mt.raw))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">把之前学到的函数一起使用</h2>

Step1: 选择我们需要的变量   [select]<br>
<br>
Step2: 删除缺失值，选择符合标准的被试   [drop_na, filter]<br>
<br>
Step3: 分实验条件计算平均反应时和正确率   [group_by, summarise]<br>
<br>
Step4: 将Shape变量拆分为Valence和Identity，选取Match-Moral组    [extract, filter]<br>
<br>
Step5: 将长数据转化为宽数据，得到Self和Other情况下的efficiency    [pivot_wide]<br>
<br>
Step6: 计算实验条件为Match-Moral时efficiency的SPE   [mutate, select]<br>
<br>

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step1: 选择我们需要的变量</h2>

```{r example of total part1 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.raw %>%
  dplyr::select(Sub, Age, Sex, Hand, #人口统计学
                Block, Bin, Trial, # 试次
                Shape, Label, Match, # 刺激
                Resp, ACC, RT) # 反应结果
```
```{r example of total part1 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step2: 删除缺失值，选择符合标准的被试</h2>

```{r example of total part2 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.clean %>%
  tidyr::drop_na() %>% #删除缺失值
  dplyr::filter(.,Hand == "R", # 选择右利手被试
                  ACC == 0 | ACC == 1 , # 排除无效应答（ACC = -1 OR 2)
                  RT >= 0.2 & RT <= 1.5)  # 选择RT属于[200,1500]
```
```{r example of total part2 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step3: 分实验条件计算</h2>

```{r example of total part3 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.clean %>%
  dplyr::group_by(Sub, Shape, Label, Match) %>%
  dplyr::summarise(mean_ACC = mean(ACC),
                   mean_RT = mean(RT)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(efficiency = mean_RT/mean_ACC) 
```
```{r example of total part3 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step4: 将Shape变量拆分</h2>

```{r example of total part4 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                        regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>%
  dplyr::filter(Match == "match" & Valence == "moral") 
  # 自我优势效应一般讨论的是匹配条件下
  # 人们对自己相关的信息反应快于非自我相关的

```
```{r example of total part4 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step5: 将长数据转化为宽数据</h2>

```{r example of total part5 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  dplyr::select(Sub, Identity, efficiency) %>%
  tidyr::pivot_wider(names_from = "Identity", values_from = "efficiency")
```
```{r example of total part5 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="zh-CN">Step6: 计算SPE</h2>

```{r example of total part6 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  dplyr::mutate(eff_moral_SPE = Self - Other) %>%
  dplyr::select(Sub, eff_moral_SPE) 
```
```{r example of total part6 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```

---
## <h2 lang="zh-CN">感受一下如果所有数据清洗的代码放在一起是什么样子</h2>

```{r example of total rawdata_matchtask, message=FALSE}
df.mt.clean <- df.mt.raw %>%
  dplyr::select(Sub, Age, Sex, Hand, #人口统计学
                Block, Bin, Trial, # 试次
                Shape, Label, Match, # 刺激
                Resp, ACC, RT, # 反应结果
                ) %>% 
  tidyr::drop_na() %>% #删除缺失值
  dplyr::filter(.,Hand == "R", # 选择右利手被试
                  ACC == 0 | ACC == 1 , # 排除无效应答（ACC = -1 OR 2)
                  RT >= 0.2 & RT <= 1.5 # 选择RT属于[200,1500]
                ) %>%
  dplyr::group_by(Sub, 
                  Shape, Label, Match) %>%
  dplyr::summarise(mean_ACC = mean(ACC),
                   mean_RT = mean(RT)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(efficiency = mean_RT/mean_ACC) %>%
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                        regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>%
  dplyr::filter(Match == "match" & Valence == "moral") %>%
  dplyr::select(Sub, Identity, efficiency) %>%
  tidyr::pivot_wider(names_from = "Identity", values_from = "efficiency") %>%
  dplyr::mutate(eff_moral_SPE = Self - Other) %>%
  dplyr::select(Sub, eff_moral_SPE) 
```

---
# <h1 lang="zh-CN">课堂练习题</h1>

计算不同Shape情况下（immoralself，moralself，immoralother，moralother）<br>
基于信号检测论match与mismatch之间的d值（match为信号，mismatch噪音）<br>
为了方便大家写代码。以下是计算信号检测论d值的公式 <br>

```{r, eval=FALSE}
dplyr::summarise(
      hit = length(ACC[Match == "match" & ACC == 1]),
      fa = length(ACC[Match == "mismatch" & ACC == 0]),
      miss = length(ACC[Match == "match" & ACC == 0]),
      cr = length(ACC[Match == "mismatch" & ACC == 1]),
      Dprime = qnorm(
        ifelse(hit / (hit + miss) < 1,
               hit / (hit + miss),
               1 - 1 / (2 * (hit + miss))
              )
        ) 
             - qnorm(
        ifelse(fa / (fa + cr) > 0,
              fa / (fa + cr),
              1 / (2 * (fa + cr))
              )
                    )
      ) 
```

---
# <h1 lang="zh-CN">课堂练习题</h1>
## <h2 lang="zh-CN">思路</h2>
Step1: 选择需要的变量 <br>
<br>
Step2: 基于Sub，Block，Bin和Shape分组 <br>
<br>
Step3: 使用计算公式 <br>
<br>
Step4: 删除击中、虚报、误报、正确拒绝 <br>
<br>
Step5: 按Sub和Shape分组 <br>
<br>
Step6: 长转宽，得到每个Shape情况下的信号检测论d值 <br>
<br>