---
title: ""
subtitle: ""
author: ""
institute: ""
date: ""
output:
  xaringan::moon_reader:
    css: [default, css/Font_Style.css]
    lib_dir: libs
    nature:
      highlightStyle: github
      highlightLines: true
      countIncrementalSlides: false
---
class: center, middle
<span style="font-size: 60px;">Chapter 5</span> <br>
<span style="font-size: 50px;">How to Clean Data (II): Data Preprocessing</span> <br>
<span style="font-size: 30px;">Using dplyr and tidyr</span> <br>
<br>
<br>
<span style="font-size: 30px;">胡传鹏</span> <br>
<span style="font-size: 30px;">`r Sys.Date()`</span> <br>
---
# <h1 lang="en">Review</h1>
# <h2 lang="en">Understanding functions</h2>
# <h2 lang="en">Vector types</h2>
# <h2 lang="en">Comparison operators</h2>
# <h2 lang="en">Filtering data</h2>
<br>
# <h1 lang="en">In this session</h1>
# <h2 lang="en">More on functions (reading files in batches)</h2>
# <h2 lang="en">Functions from dplyr (filter, select, mutate, etc.)</h2>
# <h2 lang="en">Functions from tidyr (separate, drop_na, etc.)</h2>
---
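# <h1 lang="en">Reading files in batches</h1>
In session 2 we used `read.csv()` to read a single .out file.<br>
In many behavioral experiments, however, the data of many participants are stored as separate files in one folder, and reading them one by one is inefficient.<br>
In that case we can read the files in a batch.
<br>
# <h2 lang="en">Plan A: for loop</h2>
_Easy to think through, tedious to write_<br>
_Easier to understand, but the code is longer_<br>
_Leaves intermediate variables behind_<br>
# <h2 lang="en">Plan B: lapply</h2>
_Harder to think through, quick to write_<br>
_Harder to understand, but the code is more concise_<br>
_Uses "invisible" local variables_<br>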
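---
# <h1 lang="en">for loop vs. lapply: a toy comparison</h1>
To make the trade-off between the two plans concrete, here is a minimal sketch on made-up data. The object names (`inputs`, `means_loop`, `means_lapply`) are hypothetical, and the three vectors only stand in for three data files.

```r
# Toy input: three numeric vectors standing in for three data files
inputs <- list(a = 1:3, b = 4:6, c = 7:9)

# Plan A: a for loop fills a pre-allocated list and leaves `i` behind
means_loop <- vector("list", length(inputs))
for (i in seq_along(inputs)) {
  means_loop[[i]] <- mean(inputs[[i]])
}
names(means_loop) <- names(inputs)

# Plan B: lapply applies the function to each element;
# `x` exists only inside the anonymous function
means_lapply <- lapply(inputs, function(x) mean(x))

# Both plans give the same result
identical(means_loop, means_lapply)
```

After running this, `i` is still in the workspace, while `x` never appears there.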
---
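# <h1 lang="en">What is a wildcard?</h1>
<font size=5>
A wildcard is a special character that stands in for other characters when matching file names or other text strings.<br>
<br>
The common wildcards are `*`, `?` and `[]`. In R they are understood by `Sys.glob()`; the `pattern` argument of `list.files()` expects a regular expression instead, and `glob2rx()` converts a wildcard pattern into one.<br>
`*`: matches any number of characters; for example, `*.csv` matches every file whose name ends in .csv.<br>
`?`: matches exactly one character; for example, `file?.txt` matches file1.txt, file2.txt and so on, but not file10.txt.<br>
`[]`: matches any one character from the given set; for example, `file[123].txt` matches file1.txt, file2.txt and file3.txt.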
</font>
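---
# <h1 lang="en">Wildcards and regular expressions</h1>
The sketch below shows how a wildcard pattern can be turned into a regular expression with `utils::glob2rx()` and then applied with `grepl()`. The file names are invented for illustration; no files are read.

```r
# Hypothetical file names, used only for illustration
file_names <- c("file1.txt", "file2.txt", "file10.txt", "notes.csv")

# glob2rx() translates a wildcard pattern into a regular expression
rx_single <- utils::glob2rx("file?.txt")
rx_csv    <- utils::glob2rx("*.csv")

# grepl() then tests each name against the regular expression
matches_single <- file_names[grepl(rx_single, file_names)]
matches_csv    <- file_names[grepl(rx_csv, file_names)]

matches_single  # "file1.txt" "file2.txt"
matches_csv     # "notes.csv"
```

The same regular expressions could be passed to the `pattern` argument of `list.files()`.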
---
# <h1 lang="en">Loading packages</h1>
```{r used package}
# Use relative paths everywhere
library(here)
# A collection of useful packages that includes dplyr and the %>% pipe
library(tidyverse)
```
---
# <h1 lang="en">Setting the working directory</h1>
```{r Set Working Directory}
# Get into the habit of using relative paths so that others can run your code
WD <- here::here()
getwd()
```
---
# <h1 lang="en">Reading files in batches</h1>
## <h2 lang="en">for loop</h2>
```{r for loop list.files, error=FALSE}
# Collect the names of all files whose names match the pattern
files <- list.files(file.path("data/match"), pattern = "data_exp7_rep_match_.*\\.out$")
head(files, n = 10L)
str(files)
```
*P.S. Despite its name, `list.files()` returns a character vector, not a list.*
---
# <h1 lang="en">Reading files in batches</h1>
## <h2 lang="en">for loop</h2>
```{r df.mt.out.fl}
# Create an empty list to hold the data frames as they are read
df_list <- list()
# Loop over the files: read each one, clean it, and add it to the list
for (i in seq_along(files)) { # one iteration per .out file found
  # Read each .out file with read.table; its default sep = "" splits on any
  # whitespace, so no separator has to be specified
  df <- read.table(file.path("data/match", files[i]), header = TRUE)
  # Give every variable the same type in every file
  df <- dplyr::filter(df, Date != "Date") %>% # some .out files repeat the header row mid-file; filter those rows out
    dplyr::mutate(Date = as.character(Date), Prac = as.character(Prac),
                  Sub = as.numeric(Sub), Age = as.numeric(Age), Sex = as.character(Sex), Hand = as.character(Hand),
                  Block = as.numeric(Block), Bin = as.numeric(Bin), Trial = as.numeric(Trial),
                  Shape = as.character(Shape), Label = as.character(Label), Match = as.character(Match),
                  CorrResp = as.character(CorrResp), Resp = as.character(Resp),
                  ACC = as.numeric(ACC), RT = as.numeric(RT))
  # Add the data frame to the list
  df_list[[i]] <- df
}
# Combine all data frames; bind_rows only works when matching columns have compatible types
# bind_rows stacks all the tables in the list into one large table
df.mt.out.fl <- dplyr::bind_rows(df_list)
# Remove the intermediate variables
rm(df, df_list, files, i)
# If you wrap these steps in a function, these variables never reach the global environment
```
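---
# <h1 lang="en">Wrapping the loop in a function</h1>
The last comment in the loop code says that putting the steps inside a function keeps the intermediate variables out of the global environment. A minimal sketch of that idea follows. The helper name `read_all()` and the two demo files are hypothetical, and base `rbind()` is used here instead of `bind_rows()` only to keep the example free of dependencies.

```r
# Hypothetical helper: read every file in `dir` whose name matches `pattern`
read_all <- function(dir, pattern) {
  files <- list.files(dir, pattern = pattern, full.names = TRUE)
  df_list <- vector("list", length(files))
  for (i in seq_along(files)) {
    df_list[[i]] <- read.table(files[i], header = TRUE)
  }
  # files, df_list and i are local and disappear when the function returns
  do.call(rbind, df_list)
}

# Demo on two small made-up files written to a temporary folder
demo_dir <- file.path(tempdir(), "demo_out")
dir.create(demo_dir, showWarnings = FALSE)
write.table(data.frame(Sub = 1, RT = c(0.51, 0.62)),
            file.path(demo_dir, "demo_1.out"), row.names = FALSE)
write.table(data.frame(Sub = 2, RT = c(0.48, 0.70)),
            file.path(demo_dir, "demo_2.out"), row.names = FALSE)

demo_all <- read_all(demo_dir, pattern = "^demo_.*\\.out$")
nrow(demo_all)  # 4 rows: two from each file
```

No call to `rm()` is needed afterwards, because nothing except `demo_all` was created by the reading step.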
---
# <h1 lang="en">Reading files in batches</h1>
## <h2 lang="en">for loop</h2>
_Expected: 25,920 observations of 16 variables_
```{r df.mt.out DT, echo=FALSE}
DT::datatable(head(df.mt.out.fl, 100),
              fillContainer = TRUE, options = list(pageLength = 7))
```
---
# <h1 lang="en">Reading files in batches</h1>
## <h2 lang="en">lapply</h2>
```{r df.mt.raw.la}
# Get the names of all .out files
df.mt.out.la <- list.files(file.path("data/match"), pattern = "data_exp7_rep_match_.*\\.out$") %>%
  # Apply read.table to every .out file name x
  lapply(function(x) read.table(file.path("data/match", x), header = TRUE)) %>%
  # Clean every data frame returned by read.table with dplyr
  lapply(function(df) dplyr::filter(df, Date != "Date") %>% # some .out files repeat the header row mid-file; filter those rows out
           dplyr::mutate(Date = as.character(Date), Prac = as.character(Prac),
                         Sub = as.numeric(Sub), Age = as.numeric(Age), Sex = as.character(Sex), Hand = as.character(Hand),
                         Block = as.numeric(Block), Bin = as.numeric(Bin), Trial = as.numeric(Trial),
                         Shape = as.character(Shape), Label = as.character(Label), Match = as.character(Match),
                         CorrResp = as.character(CorrResp), Resp = as.character(Resp),
                         ACC = as.numeric(ACC), RT = as.numeric(RT)
           ) # column types differ between files, so make them the same in every file here
  ) %>%
  dplyr::bind_rows()
```
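---
# <h1 lang="en">Why the types are unified first</h1>
The explicit `as.character()` and `as.numeric()` calls matter because `bind_rows()` will not stack columns of incompatible types. The sketch below illustrates this with two invented one-row tables; the participant numbers are made up.

```r
library(dplyr)

# Two made-up tables whose Sub column was read with different types
df_a <- data.frame(Sub = 7301, RT = 0.55)
df_b <- data.frame(Sub = "7302", RT = 0.61, stringsAsFactors = FALSE)

# bind_rows() refuses to combine a numeric and a character column;
# tryCatch() turns the error into a short message
res <- tryCatch(bind_rows(df_a, df_b), error = function(e) "type mismatch")

# After converting to a common type the tables stack without complaint
df_b_fixed <- mutate(df_b, Sub = as.numeric(Sub))
combined <- bind_rows(df_a, df_b_fixed)
```

Here `res` holds the fallback message rather than a data frame, while `combined` has two rows and a numeric `Sub` column.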
---
# <h1 lang="en">Reading files in batches</h1>
## <h2 lang="en">lapply</h2>
_Expected: 25,920 observations of 16 variables_
```{r df.mt.out.la DT, echo=FALSE}
DT::datatable(head(df.mt.out.la, 100),
              fillContainer = TRUE, options = list(pageLength = 7))
```
---
# <h1 lang="en">Saving the file</h1>
```{r write.csv}
# Either the for-loop result or the lapply result can be saved
# write.csv(df.mt.out.fl, file = "./data/match/match_raw.csv", row.names = FALSE)
write.csv(df.mt.out.la, file = "./data/match/match_raw.csv", row.names = FALSE)
```
---
class: center, middle
<span style="font-size: 60px;">5.1 Preparing for data preprocessing</span> <br>
---
# <h1 lang="en">Reading the raw data</h1>
## <h2 lang="en">Raw Data: Penguin </h2>
```{r Read Penguin RawData}
# Read the raw data
df.pg.raw <- read.csv('./data/penguin/penguin_rawdata.csv',
                      header = TRUE, sep = ",", stringsAsFactors = FALSE)
# Tables are displayed with DT::datatable here so that they look good on the slides
# You can also click the object in RStudio's Environment pane, or use str() or head()
```
```{r Read Penguin RawData DT, echo=FALSE}
DT::datatable(head(df.pg.raw, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">Reading the raw data</h1>
## <h2 lang="en">Raw Data: Match Task </h2>
```{r Read Match Task RawData}
# Read the raw data
df.mt.raw <- read.csv('./data/match/match_raw.csv',
                      header = TRUE, sep = ",", stringsAsFactors = FALSE)
```
```{r Read Match Task RawData DT, echo=FALSE}
DT::datatable(head(df.mt.raw, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
class: center, middle
<span style="font-size: 60px;">5.2 Basic data preprocessing operations</span> <br>
---
# <h1 lang="en">dplyr</h1>
<br>
<body>
<p lang="en">dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges: </p>
</body>
<br>
<img src="https://dplyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
---
# <h1 lang="en">dplyr::functions</h1>
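- filter() keeps the rows (which may represent participants) that meet a condition <br>
- mutate() creates new variables <br>
- group_by() groups the data by the values of one or more variables <br>
**If you use "group_by",** <br>
**always call "ungroup" after summarise.** <br>
- summarise() computes summary values (such as means or sums) within each group <br>
- ungroup() removes the grouping that was just applied <br>
- select() picks the variables needed for the final analysis, and also sets the order of the columns <br>
- arrange() sorts the rows by the values of a column (the other columns move with it) <br>
_When cleaning data, you will mostly use these functions in this order_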
---
class: center, middle
# <h1 lang="en">Next up: dplyr in detail</h1>
<img src="https://dplyr.tidyverse.org/logo.png" alt="dplyr" style="display: block; margin: 0 auto;">
<br>
---
# <h1 lang="en">dplyr::filter</h1>
## <h2 lang="en">Selecting cases</h2>
```{r example of filter rawdata_penguin}
# Use filter to keep participants born in 1995 or later (`age` stores the year of birth)
df.clean.filter <- df.pg.raw %>%
  dplyr::filter(., age >= 1995)
```
```{r example of filter rawdata_penguin DT, echo=FALSE}
# Check that only participants born in 1995 or later remain
DT::datatable(head(df.clean.filter, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
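---
# <h1 lang="en">dplyr::filter with several conditions</h1>
As a small illustration of how conditions combine in `filter()`, the sketch below uses a made-up table (`toy`), not the penguin data. As in the slides, `age` is assumed to hold a year of birth.

```r
library(dplyr)

# Made-up data; `age` holds years of birth
toy <- data.frame(id = 1:5,
                  age = c(1988, 1995, 1997, 2001, 1979))

# Conditions separated by commas must all hold (logical AND)
born_95_99 <- toy %>% filter(age >= 1995, age <= 1999)

# Use | when either condition is enough (logical OR)
early_or_late <- toy %>% filter(age < 1980 | age > 2000)

born_95_99$id     # 2 3
early_or_late$id  # 4 5
```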
---
# <h1 lang="en">dplyr::select</h1>
## <h2 lang="en">Selecting variables</h2>
```{r example of select rawdata_penguin}
# Use select to keep age, all ALEX items, eatdrink and avoidance
df.clean.select <- df.pg.raw %>%
  dplyr::select(age, starts_with("ALEX"), eatdrink, avoidance)
# The clumsy alternative is to type out all 16 ALEX items
```
```{r example of select rawdata_penguin DT, echo=FALSE}
# Check that the other variables are gone
DT::datatable(head(df.clean.select, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
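---
# <h1 lang="en">dplyr::select helpers</h1>
The sketch below shows three common ways of choosing columns with `select()`. The table `toy` and its `other` column are invented; only the column-name prefix mirrors the penguin data.

```r
library(dplyr)

# Made-up data with two ALEX items
toy <- data.frame(age = c(1990, 1985), ALEX1 = c(1, 4), ALEX2 = c(3, 2),
                  avoidance = c(2.5, 3.1), other = c("a", "b"))

only_alex <- toy %>% select(starts_with("ALEX"))     # choose columns by prefix
no_other  <- toy %>% select(-other)                   # drop one column
reordered <- toy %>% select(avoidance, everything())  # move a column to the front

names(only_alex)  # "ALEX1" "ALEX2"
names(reordered)  # "avoidance" "age" "ALEX1" "ALEX2" "other"
```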
---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="en">Computing variables: summing specified columns</h2>
```{r example of mutate_1 rawdata_penguin}
# Sum ALEX1 to ALEX4
df.clean.mutate_1 <- df.pg.raw %>%
  dplyr::mutate(ALEX_SUM = ALEX1 + ALEX2 + ALEX3 + ALEX4)
```
```{r example of mutate_1 rawdata_penguin DT, echo=FALSE}
# Check that the sum was really computed
DT::datatable(head(df.clean.mutate_1, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="en">Computing variables: summing columns whose names share a prefix</h2>
```{r example of mutate_2 rawdata_penguin}
# Sum all columns whose names start with "ALEX"
df.clean.mutate_2 <- df.pg.raw %>%
  dplyr::mutate(ALEX_SUM = rowSums(select(., starts_with("ALEX"))))
```
```{r example of mutate_2 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_2, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
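---
# <h1 lang="en">Row sums with across()</h1>
The `select(., ...)` form relies on the `.` placeholder of the `%>%` pipe. One possible alternative, sketched here on invented data, is `rowSums(across(...))`, which does not need the placeholder. The example also shows what a missing item does to the sum. The new columns are deliberately not given an "ALEX" prefix, because `across()` would otherwise pick up a column created earlier in the same `mutate()` call.

```r
library(dplyr)

# Made-up item scores; participant 2 skipped ALEX3
toy <- data.frame(id = 1:2, ALEX1 = c(1, 4), ALEX2 = c(3, 2), ALEX3 = c(5, NA))

toy_sum <- toy %>%
  mutate(total      = rowSums(across(starts_with("ALEX"))),
         total_narm = rowSums(across(starts_with("ALEX")), na.rm = TRUE))

toy_sum$total       # 9 NA
toy_sum$total_narm  # 9 6
```

Whether a sum over incomplete items is meaningful depends on the scale, so `na.rm = TRUE` is shown only as an option.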
---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="en">Recoding into a different variable</h2>
```{r example of mutate_3 rawdata_penguin}
df.clean.mutate_3 <- df.pg.raw %>%
  dplyr::mutate(decade = case_when(age <= 1969 ~ 60,
                                   age >= 1970 & age <= 1979 ~ 70,
                                   age >= 1980 & age <= 1989 ~ 80,
                                   age >= 1990 & age <= 1999 ~ 90,
                                   TRUE ~ NA_real_)
  ) %>% # with many brackets, watch where each one closes
  dplyr::select(., decade, everything())
```
```{r example of mutate_3 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_3, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">dplyr::mutate</h1>
## <h2 lang="en">Recoding into the same variable (reverse scoring)</h2>
```{r example of mutate_4 rawdata_penguin}
df.clean.mutate_4 <- df.pg.raw %>%
  dplyr::mutate(ALEX1 = case_when(ALEX1 == '1' ~ '5',
                                  ALEX1 == '2' ~ '4',
                                  ALEX1 == '3' ~ '3',
                                  ALEX1 == '4' ~ '2',
                                  ALEX1 == '5' ~ '1',
                                  TRUE ~ as.character(ALEX1))
  ) %>%
  dplyr::mutate(ALEX1 = as.numeric(ALEX1))
```
```{r example of mutate_4 rawdata_penguin DT, echo=FALSE}
DT::datatable(head(df.clean.mutate_4, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
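---
# <h1 lang="en">Reverse scoring by arithmetic</h1>
When every response is known to lie on the 1 to 5 scale, the same recoding can be written as a subtraction. The sketch below assumes exactly that: unlike the `case_when()` version, it would also transform any out-of-range codes, so it is an alternative under a stated assumption rather than a drop-in replacement. The data are made up.

```r
library(dplyr)

# Made-up responses on a 1-5 scale
toy <- data.frame(ALEX1 = c(1, 2, 3, 4, 5))

# (max + min) - x maps 1->5, 2->4, 3->3, 4->2, 5->1
toy_rev <- toy %>% mutate(ALEX1_rev = 6 - ALEX1)

toy_rev$ALEX1_rev  # 5 4 3 2 1
```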
---
# <h1 lang="en">dplyr::group_by & summarise</h1>
## <h2 lang="en">Splitting the data: computing by group</h2>
```{r example of group_by rawdata_penguin}
df.clean.group_by <- df.clean.mutate_3 %>%
  dplyr::group_by(., decade) %>% # split the data by participants' decade of birth
  dplyr::summarise(mean_avoidance = mean(avoidance)) %>% # mean avoidance within each decade
  dplyr::ungroup()
```
```{r example of group_by rawdata_penguin DT, echo=FALSE}
# Grouping by itself does not visibly change the source data
DT::datatable(head(df.clean.group_by, 4),
              fillContainer = TRUE, options = list(pageLength = 4))
```
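---
# <h1 lang="en">group_by & summarise on toy data</h1>
To see what the grouped summary does, here is a minimal sketch on invented data. It also shows two options that are not used in the slides: `na.rm = TRUE`, which ignores missing values when averaging, and `.groups = "drop"`, which returns an ungrouped result directly (available in dplyr 1.0.0 and later).

```r
library(dplyr)

# Made-up data with one missing avoidance score
toy <- data.frame(decade = c(80, 80, 90, 90, 90),
                  avoidance = c(2, 4, 3, NA, 5))

by_decade <- toy %>%
  group_by(decade) %>%
  summarise(n = n(),                                        # rows per group
            mean_avoidance = mean(avoidance, na.rm = TRUE), # mean without NAs
            .groups = "drop")                               # same effect as ungroup()

by_decade$n               # 2 3
by_decade$mean_avoidance  # 3 4
```

Without `na.rm = TRUE`, the mean for the second group would be `NA`.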
---
# <h1 lang="en">dplyr::functions</h1>
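## <h2 lang="en">Putting the functions together</h2>
Step1: Keep participants whose eatdrink is 1 [filter]<br>
<br>
Step2: Select the variables we need [select]<br>
<br>
Step3: Recode the reverse-scored item [mutate]<br>
<br>
Step4: Recode year of birth into decade of birth [mutate]<br>
<br>
Step5: Compute the mean ALEX score for each decade [group_by, summarise]<br>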
<br>
---
# <h1 lang="en">dplyr::functions</h1>
## <h2 lang="en">Putting the functions together</h2>
```{r example of total rawdata_penguin}
df.pg.clean <- df.pg.raw %>%
  dplyr::filter(eatdrink == 1) %>% # keep participants whose eatdrink is 1
  dplyr::select(age, starts_with("ALEX"), eatdrink, avoidance) %>%
  dplyr::mutate(ALEX1 = case_when(ALEX1 == '1' ~ '5', # reverse scoring
                                  ALEX1 == '2' ~ '4',
                                  ALEX1 == '3' ~ '3',
                                  ALEX1 == '4' ~ '2',
                                  ALEX1 == '5' ~ '1',
                                  TRUE ~ as.character(ALEX1))) %>%
  dplyr::mutate(ALEX1 = as.numeric(ALEX1)) %>%
  dplyr::mutate(ALEX_SUM = rowSums(select(., starts_with("ALEX"))), # sum the scores of all ALEX items
                decade = case_when(age <= 1969 ~ 60, # convert year of birth to decade
                                   age >= 1970 & age <= 1979 ~ 70,
                                   age >= 1980 & age <= 1989 ~ 80,
                                   age >= 1990 & age <= 1999 ~ 90,
                                   TRUE ~ NA_real_)) %>%
  dplyr::group_by(decade) %>% # split the data by decade
  dplyr::summarise(mean_ALEX = mean(ALEX_SUM)) %>% # mean ALEX_SUM of the participants in each decade
  dplyr::ungroup() # remove the grouping
```
---
# <h1 lang="en">dplyr::functions</h1>
## <h2 lang="en">Putting the functions together</h2>
```{r result of total, echo=FALSE}
DT::datatable(head(df.pg.clean, 5),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
class: center, middle
<span style="font-size: 60px;">5.3 Advanced data preprocessing operations</span> <br>
---
# <h1 lang="en">tidyr</h1>
**The goal of tidyr is to help you create tidy data. Tidy data is data where:** <br>
- Every column is a variable. <br>
- Every row is an observation. <br>
- Every cell is a single value. <br>
<img src="https://tidyr.tidyverse.org/logo.png" alt="tidyr" style="display: block; margin: 0 auto;">
---
# <h1 lang="en">tidyr::functions</h1>
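- separate() splits the string in each cell of one variable into two or more parts, producing that many variables <br>
**Best for splitting a string at a fixed delimiter, e.g. turning "2022-02-25" into the three columns "2022", "02" and "25"** <br>
- extract() works like separate <br>
**Best for pulling specific pieces of information out of a string, e.g. turning "John Smith" into the two columns "John" and "Smith"** <br>
- unite() combines several (string) columns into one <br>
- pivot_longer() converts wide data to long data <br>
- pivot_wider() converts long data to wide data <br>
- drop_na() removes rows that contain missing values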
---
class: center, middle
# <h1 lang="en">Next up: tidyr in detail</h1>
<img src="https://tidyr.tidyverse.org/logo.png" alt="tidyr" style="display: block; margin: 0 auto;">
<br>
---
# <h1 lang="en">tidyr::separate</h1>
## <h2 lang="en">Splitting the string inside a cell</h2>
```{r tidyr::separate | rawdata_matchtask}
df.clean.separate <- df.mt.raw %>%
  tidyr::separate(., col = Shape, into = c("Valence", "Identity"),
                  sep = "(?<=moral|immoral)(?=Self|Other)") %>%
  dplyr::select(Sub, Valence, Identity, everything())
```
```{r tidyr::separate | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.separate, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">tidyr::extract</h1>
## <h2 lang="en">Splitting the string inside a cell</h2>
```{r tidyr::extract | rawdata_matchtask}
df.clean.extract <- df.mt.raw %>%
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                 regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>%
  dplyr::select(Sub, Valence, Identity, everything())
```
```{r tidyr::extract | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.extract, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
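---
# <h1 lang="en">separate vs. extract on toy data</h1>
The difference between the two functions is easier to see on a tiny table. In this sketch the table `toy` is invented: `separate()` splits the date at a fixed delimiter, while `extract()` captures the two groups of a regular expression.

```r
library(dplyr)
library(tidyr)

# Made-up data: a date string and a Shape label
toy <- data.frame(date = c("2022-02-25", "2023-11-03"),
                  Shape = c("moralSelf", "immoralOther"),
                  stringsAsFactors = FALSE)

toy_split <- toy %>%
  # split at every "-" into three new character columns
  tidyr::separate(date, into = c("year", "month", "day"), sep = "-") %>%
  # each pair of parentheses in the regex becomes one new column
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                 regex = "(immoral|moral)(Self|Other)")

toy_split$month     # "02" "11"
toy_split$Valence   # "moral" "immoral"
toy_split$Identity  # "Self" "Other"
```

By default both functions remove the original column; `remove = FALSE` keeps it, as in the slide above.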
---
# <h1 lang="en">tidyr::unite</h1>
## <h2 lang="en">Combining strings from several cells</h2>
```{r tidyr::unite | rawdata_matchtask}
df.clean.unite <- df.clean.separate %>%
  tidyr::unite(Shape, Valence, Identity, sep = "")
```
```{r tidyr::unite | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.unite, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">tidyr::pivot_longer</h1>
## <h2 lang="en">Converting between long and wide data</h2>
```{r pivot_longer | rawdata_matchtask}
df.clean.long <- df.mt.raw %>%
  tidyr::pivot_longer(cols = c(RT, ACC),
                      names_to = "DV",
                      values_to = "Value")
```
```{r pivot_longer | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.long, 48),
              fillContainer = TRUE, options = list(pageLength = 3))
```
---
# <h1 lang="en">tidyr::pivot_wider</h1>
## <h2 lang="en">Converting between long and wide data</h2>
```{r pivot_wider | rawdata_matchtask, warning=FALSE}
df.clean.wide <- df.clean.long %>%
  dplyr::select(Sub, Trial, Shape, DV, Value) %>%
  dplyr::group_by(Sub, Shape, DV) %>%
  dplyr::summarise(mean_Value = mean(Value)) %>%
  dplyr::ungroup() %>% # remove the grouping before reshaping
  tidyr::pivot_wider(names_from = c("Shape", "DV"), values_from = "mean_Value", names_glue = "{Shape}_{DV}")
```
```{r pivot_wider | rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.clean.wide, 10),
              fillContainer = TRUE, options = list(pageLength = 3))
```
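---
# <h1 lang="en">A pivot round trip on toy data</h1>
The two pivot functions undo each other, which a two-row example makes visible. The table `wide` below is made up; the column names `DV` and `Value` follow the slides.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per participant
wide <- data.frame(Sub = c(1, 2), RT = c(0.50, 0.65), ACC = c(1, 0))

# Wide to long: RT and ACC are stacked into name/value pairs
long <- wide %>%
  pivot_longer(cols = c(RT, ACC), names_to = "DV", values_to = "Value")

# Long to wide: the values in DV become column names again
back <- long %>%
  pivot_wider(names_from = "DV", values_from = "Value")

nrow(long)   # 4: two rows per participant
names(back)  # "Sub" "RT" "ACC"
```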
---
# <h1 lang="en">tidyr::drop_na</h1>
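## <h2 lang="en">Dropping rows with missing values (participants, trials, conditions...)</h2>
```{r drop_na | rawdata_matchtask, warning=FALSE}
df.clean.drop_na <- df.mt.raw %>%
  tidyr::drop_na()
```
```{r drop_na | rawdata_matchtask check NA}
paste("The raw data set has", nrow(df.mt.raw), "rows")
paste("After dropping missing values,", nrow(df.clean.drop_na), "rows remain")
# In practice, bluntly dropping every row that contains a missing value may be inappropriate
# It is better to use dplyr::filter to keep the eligible participants (rows) first
# and then use is.na() to check whether any missing values remain
any(is.na(df.mt.raw)); sum(is.na(df.mt.raw))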
```
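---
# <h1 lang="en">Dropping missing values selectively</h1>
One gentler option than dropping every incomplete row is to name the columns that must be complete. The sketch below uses an invented table; `drop_na()` with column names only looks at those columns.

```r
library(dplyr)
library(tidyr)

# Made-up data: row 2 lacks RT, row 3 lacks Resp
toy <- data.frame(Sub = c(1, 2, 3), RT = c(0.5, NA, 0.7), Resp = c("m", "n", NA),
                  stringsAsFactors = FALSE)

drop_any <- toy %>% drop_na()    # drops rows 2 and 3
drop_rt  <- toy %>% drop_na(RT)  # drops only row 2

# Number of missing values in each column
na_per_col <- colSums(is.na(toy))

nrow(drop_any)  # 1
nrow(drop_rt)   # 2
```

Looking at `na_per_col` first helps to decide which columns actually need to be complete for the analysis.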
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
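## <h2 lang="en">Putting the functions together</h2>
Step1: Select the variables we need [select]<br>
<br>
Step2: Drop missing values and keep the participants that meet the criteria [drop_na, filter]<br>
<br>
Step3: Compute mean RT and accuracy for each experimental condition [group_by, summarise]<br>
<br>
Step4: Split Shape into Valence and Identity, and keep the Match-Moral condition [extract, filter]<br>
<br>
Step5: Convert long data to wide data to get the efficiency for Self and Other [pivot_wider]<br>
<br>
Step6: Compute the SPE in efficiency for the Match-Moral condition [mutate, select]<br>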
<br>
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step1: Select the variables we need</h2>
```{r example of total part1 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.raw %>%
  dplyr::select(Sub, Age, Sex, Hand,  # demographics
                Block, Bin, Trial,    # trial information
                Shape, Label, Match,  # stimuli
                Resp, ACC, RT)        # responses
```
```{r example of total part1 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step2: Drop missing values and keep the participants that meet the criteria</h2>
```{r example of total part2 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.clean %>%
  tidyr::drop_na() %>% # drop rows with missing values
  dplyr::filter(., Hand == "R",          # keep right-handed participants
                ACC == 0 | ACC == 1,     # exclude invalid responses (ACC = -1 or 2)
                RT >= 0.2 & RT <= 1.5)   # keep RTs within [200, 1500] ms (RT is stored in seconds)
```
```{r example of total part2 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step3: Compute by experimental condition</h2>
```{r example of total part3 rawdata_matchtask,message=FALSE}
df.mt.clean <- df.mt.clean %>%
  dplyr::group_by(Sub, Shape, Label, Match) %>%
  dplyr::summarise(mean_ACC = mean(ACC),
                   mean_RT = mean(RT)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(efficiency = mean_RT / mean_ACC)
```
```{r example of total part3 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step4: Split the Shape variable</h2>
```{r example of total part4 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                 regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>%
  dplyr::filter(Match == "match" & Valence == "moral")
# The self-prioritization effect (SPE) is usually examined in the match condition:
# people respond faster to self-related than to non-self-related information
```
```{r example of total part4 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step5: Convert long data to wide data</h2>
```{r example of total part5 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  dplyr::select(Sub, Identity, efficiency) %>%
  tidyr::pivot_wider(names_from = "Identity", values_from = "efficiency")
```
```{r example of total part5 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
# <h1 lang="en">dplyr & tidyr::functions</h1>
## <h2 lang="en">Step6: Compute the SPE</h2>
```{r example of total part6 rawdata_matchtask}
df.mt.clean <- df.mt.clean %>%
  dplyr::mutate(eff_moral_SPE = Self - Other) %>%
  dplyr::select(Sub, eff_moral_SPE)
```
```{r example of total part6 rawdata_matchtask DT, echo=FALSE}
DT::datatable(head(df.mt.clean, 24),
              fillContainer = TRUE, options = list(pageLength = 5))
```
---
## <h2 lang="en">What it looks like when all the data-cleaning code is put together</h2>
```{r example of total rawdata_matchtask, message=FALSE}
df.mt.clean <- df.mt.raw %>%
  dplyr::select(Sub, Age, Sex, Hand,  # demographics
                Block, Bin, Trial,    # trial information
                Shape, Label, Match,  # stimuli
                Resp, ACC, RT         # responses
  ) %>%
  tidyr::drop_na() %>% # drop rows with missing values
  dplyr::filter(., Hand == "R",        # keep right-handed participants
                ACC == 0 | ACC == 1,   # exclude invalid responses (ACC = -1 or 2)
                RT >= 0.2 & RT <= 1.5  # keep RTs within [200, 1500] ms (RT is stored in seconds)
  ) %>%
  dplyr::group_by(Sub,
                  Shape, Label, Match) %>%
  dplyr::summarise(mean_ACC = mean(ACC),
                   mean_RT = mean(RT)) %>%
  dplyr::ungroup() %>%
  dplyr::mutate(efficiency = mean_RT / mean_ACC) %>%
  tidyr::extract(Shape, into = c("Valence", "Identity"),
                 regex = "(moral|immoral)(Self|Other)", remove = FALSE) %>%
  dplyr::filter(Match == "match" & Valence == "moral") %>%
  dplyr::select(Sub, Identity, efficiency) %>%
  tidyr::pivot_wider(names_from = "Identity", values_from = "efficiency") %>%
  dplyr::mutate(eff_moral_SPE = Self - Other) %>%
  dplyr::select(Sub, eff_moral_SPE)
```
---
# <h1 lang="en">In-class exercise</h1>
For each Shape condition (immoralSelf, moralSelf, immoralOther, moralOther),<br>
compute the signal detection theory d' between match and mismatch trials (match is the signal, mismatch is the noise).<br>
To make the coding easier, here is the formula for computing d': <br>
```{r, eval=FALSE}
dplyr::summarise(
  hit = length(ACC[Match == "match" & ACC == 1]),
  fa = length(ACC[Match == "mismatch" & ACC == 0]),
  miss = length(ACC[Match == "match" & ACC == 0]),
  cr = length(ACC[Match == "mismatch" & ACC == 1]),
  # d' = z(hit rate) - z(false alarm rate); a hit rate of 1 or a
  # false alarm rate of 0 is adjusted so that qnorm() stays finite
  Dprime = qnorm(
    ifelse(hit / (hit + miss) < 1,
           hit / (hit + miss),
           1 - 1 / (2 * (hit + miss))
    )
  ) -
    qnorm(
      ifelse(fa / (fa + cr) > 0,
             fa / (fa + cr),
             1 / (2 * (fa + cr))
      )
    )
)
```
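---
# <h1 lang="en">A worked d' example</h1>
To check the formula above by hand, the sketch below plugs in made-up counts for one participant in one condition. The counts are invented purely for illustration; the computation follows the same two `ifelse()` adjustments.

```r
# Made-up counts for one participant in one condition
hit <- 45; miss <- 5   # 50 match trials
fa  <- 10; cr   <- 40  # 50 mismatch trials

hit_rate <- hit / (hit + miss)  # 0.9
fa_rate  <- fa / (fa + cr)      # 0.2

# Same adjustments as in the formula above (not triggered by these counts)
hit_rate <- ifelse(hit_rate < 1, hit_rate, 1 - 1 / (2 * (hit + miss)))
fa_rate  <- ifelse(fa_rate > 0, fa_rate, 1 / (2 * (fa + cr)))

# d' is the difference between the z-transformed rates
dprime <- qnorm(hit_rate) - qnorm(fa_rate)
round(dprime, 2)  # 2.12
```

With a perfect hit rate, for example 50 hits and 0 misses, the first adjustment would replace 1 by 1 - 1/100 = 0.99 before `qnorm()` is applied.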
---
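# <h1 lang="en">In-class exercise</h1>
## <h2 lang="en">Approach</h2>
Step1: Select the variables you need <br>
<br>
Step2: Group by Sub, Block, Bin and Shape <br>
<br>
Step3: Apply the formula <br>
<br>
Step4: Drop the hit, false alarm, miss and correct rejection columns <br>
<br>
Step5: Group by Sub and Shape <br>
<br>
Step6: Convert long to wide to get the d' for each Shape condition <br>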
<br>