forked from perlatex/R_for_Data_Science
-
Notifications
You must be signed in to change notification settings - Fork 0
/
eda_olympics.Rmd
281 lines (211 loc) · 5.56 KB
/
eda_olympics.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
# 探索性数据分析-奥林匹克 {#eda-olympics}
这是Nature期刊上的一篇文章[Nature. 2004 September 30; 431(7008)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3173856/#SD2),
```{r eda-olympics-1, out.width = '80%', fig.align='left', echo = FALSE}
knitr::include_graphics(path = "images/ukmss-36386-f0001.jpg")
```
虽然觉得这个结论不太严谨,但我却无力反驳。
于是在文章补充材料里,我找到了文章使用的数据,现在的任务是,重复这张图和文章的分析过程。
## 导入数据
```{r eda-olympics-2, message=FALSE, warning=FALSE}
library(tidyverse)
library(readxl)
```
```{r eda-olympics-3, message=FALSE, warning=FALSE}
d <- read_excel("./demo_data/olympics.xlsx")
d
```
## 可视化
我们先画图看看
```{r eda-olympics-4, out.width = '100%'}
d %>%
ggplot() +
geom_point(aes(x = Olympic_year, y = Men_score), color = "blue") +
geom_point(aes(x = Olympic_year, y = Women_score), color = "red")
```
这样写也是可以的,只不过最好先tidy数据
```{r eda-olympics-5, out.width = '80%', fig.align='left', echo = FALSE}
knitr::include_graphics(path = "images/pivot.png")
```
```{r eda-olympics-6}
d1 <- d %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
d1
```
然后在画图
```{r eda-olympics-7, out.width = '100%'}
d1 %>%
ggplot(aes(x = Olympic_year, y = winning_time, color = sex)) +
geom_point() +
# geom_smooth(method = "lm") +
scale_color_manual(
values = c("Men_score" = "blue", "Women_score" = "red")
) +
scale_x_continuous(
breaks = seq(1900, 2004, by = 4),
labels = seq(1900, 2004, by = 4)
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
```
## 回归分析
建立年份与成绩的线性关系
$$
\text{score}_i = \alpha + \beta \times \text{year}_i + \epsilon_i; \qquad \epsilon_i\in \text{Normal}(\mu, \sigma)
$$
我们需要求出其中系数$\alpha$和$\beta$,写R语言代码如下
(`lm(y ~ 1 + x,data = d)`, 要求得 $\alpha$和$\beta$,就是对应 1 和 x 前的系数)
```{r eda-olympics-8}
fit_1 <- lm(Men_score ~ 1 + Olympic_year, data = d)
summary(fit_1)
```
```{r eda-olympics-9}
fit_2 <- lm(Women_score ~ 1 + Olympic_year, data = d)
summary(fit_2)
```
## 预测
使用`predict()`完成预测
```{r eda-olympics-10}
df <- data.frame(Olympic_year = 2020)
predict(fit_1, newdata = df)
```
为了图片中的一致,我们使用1900年到2252年(`seq(1900, 2252, by = 4)`)建立预测项,并整理到数据框里
```{r eda-olympics-11}
grid <- tibble(
Olympic_year = as.numeric(seq(1900, 2252, by = 4))
)
grid
```
```{r eda-olympics-12}
tb <- grid %>%
mutate(
Predict_Men = predict(fit_1, newdata = grid),
Predict_Women = predict(fit_2, newdata = grid)
)
tb
```
有时候我喜欢用`modelr::add_predictions()`函数实现相同的功能
```{r eda-olympics-13}
library(modelr)
grid %>%
add_predictions(fit_1, var = "Predict_Men") %>%
add_predictions(fit_2, var = "Predict_Women")
```
## 再次可视化
```{r eda-olympics-14}
tb1 <- tb %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
tb1
```
```{r eda-olympics-15, out.width = '100%'}
tb1 %>%
ggplot(aes(
x = Olympic_year,
y = winning_time,
color = sex
)) +
geom_line(size = 2) +
geom_point(data = d1) +
scale_color_manual(
values = c(
"Men_score" = "blue",
"Women_score" = "red",
"Predict_Men" = "#588B8B",
"Predict_Women" = "#C8553D"
),
labels = c(
"Men_score" = "Men score",
"Women_score" = "Women score",
"Predict_Men" = "Men Predict score",
"Predict_Women" = "Women Predict score"
)
) +
scale_x_continuous(
breaks = seq(1900, 2252, by = 16),
labels = as.character(seq(1900, 2252, by = 16))
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
```
早知道nature文章这么简单,10年前我也可以写啊!
## list_column
这里是另外的一种方法
```{r eda-olympics-16}
library(modelr)
```
```{r eda-olympics-17, out.width = '100%'}
d1 <- d %>%
pivot_longer(
cols = -Olympic_year,
names_to = "sex",
values_to = "winning_time"
)
fit_model <- function(df) lm(winning_time ~ Olympic_year, data = df)
d2 <- d1 %>%
group_nest(sex) %>%
mutate(
mod = map(data, fit_model)
)
d2
# d2 %>% mutate(p = list(grid, grid))
# d3 <- d2 %>% mutate(p = list(grid, grid))
# d3
# d3 %>%
# mutate(
# predictions = map2(p, mod, add_predictions),
# )
# or
tb4 <- d2 %>%
mutate(
predictions = map(mod, ~ add_predictions(grid, .))
) %>%
select(sex, predictions) %>%
unnest(predictions)
tb4 %>%
ggplot(aes(
x = Olympic_year,
y = pred,
group = sex,
color = sex
)) +
geom_point() +
geom_line(size = 2) +
geom_point(
data = d1,
aes(
x = Olympic_year,
y = winning_time,
group = sex,
color = sex
)
) +
scale_x_continuous(
breaks = seq(1900, 2252, by = 16),
labels = as.character(seq(1900, 2252, by = 16))
) +
theme(axis.text.x = element_text(
size = 10, angle = 45, colour = "black",
vjust = 1, hjust = 1
))
```
## 课后作业
- 探索数据,建立身高体重的线性模型
```{r eda-olympics-18, echo = F}
# remove the objects
# rm(list=ls())
rm(d, d1, d2, df, fit_1, fit_2, fit_model, grid, tb, tb1, tb4)
```
```{r eda-olympics-19, echo = F, message = F, warning = F, results = "hide"}
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```