---
title: "Homework 3"
author: "Yunshen Bai"
date: '`r format(Sys.time(), "%Y-%m-%d")`'
output: github_document
---
```{r setup, include = FALSE}
library(tidyverse)
library(ggridges)
library(patchwork)
library(p8105.datasets)

knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6,
  out.width = "90%"
)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
```
## Problem 1
### Read in the data
```{r}
data("instacart")
instacart =
instacart |>
as_tibble()
```
### Answer questions about the data
This dataset contains `r nrow(instacart)` rows and `r ncol(instacart)` columns, with each row representing a single product from an Instacart order. Variables include identifiers for user, order, and product, as well as the order in which each product was added to the cart. There are several order-level variables describing the day and time of the order and the number of days since the prior order. There are also several item-specific variables describing the product name (e.g. Yogurt, Avocado), department (e.g. dairy and eggs, produce), aisle (e.g. yogurt, fresh fruits), and whether the item has been ordered by this user in the past. In total, there are `r instacart |> select(product_id) |> distinct() |> count()` products found in `r instacart |> select(user_id, order_id) |> distinct() |> count()` orders from `r instacart |> select(user_id) |> distinct() |> count()` distinct users.
Below is a table summarizing the number of items ordered from each aisle. In total, there are 134 aisles, with fresh vegetables and fresh fruits holding the most items ordered by far.
```{r}
instacart |>
count(aisle) |>
arrange(desc(n))
```
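The same summary could also be passed to `knitr::kable()` for a cleaner rendered table; a minimal sketch:
```{r, eval = FALSE}
instacart |>
  count(aisle) |>
  arrange(desc(n)) |>
  knitr::kable()
```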
Next is a plot showing the number of items ordered in each aisle, restricted to aisles with more than 10,000 items. Aisles are ordered by ascending number of items.
```{r}
instacart |>
count(aisle) |>
filter(n > 10000) |>
mutate(aisle = fct_reorder(aisle, n)) |>
ggplot(aes(x = aisle, y = n)) +
geom_point() +
labs(title = "Number of items ordered in each aisle") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
```
Our next table shows the three most popular items in the aisles `baking ingredients`, `dog food care`, and `packaged vegetables fruits`, along with the number of times each item was ordered.
```{r}
instacart |>
filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) |>
group_by(aisle) |>
count(product_name) |>
mutate(rank = min_rank(desc(n))) |>
filter(rank < 4) |>
arrange(desc(n)) |>
knitr::kable()
```
Finally, here is a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are ordered on each day of the week. This table has been formatted in an untidy manner for human readers. Pink Lady Apples are generally purchased slightly earlier in the day than Coffee Ice Cream, with the exception of day 5.
```{r}
instacart |>
filter(product_name %in% c("Pink Lady Apples", "Coffee Ice Cream")) |>
group_by(product_name, order_dow) |>
summarize(mean_hour = mean(order_hour_of_day)) |>
pivot_wider(
names_from = order_dow,
values_from = mean_hour) |>
knitr::kable(digits = 2)
```
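If day names are preferred over the numeric `order_dow` codes, the variable could be relabeled before pivoting. A minimal sketch, assuming `order_dow` is coded 0 = Sunday through 6 = Saturday (worth confirming against the Instacart data dictionary):
```{r, eval = FALSE}
instacart |>
  filter(product_name %in% c("Pink Lady Apples", "Coffee Ice Cream")) |>
  mutate(
    order_dow = factor(
      order_dow,
      levels = 0:6,
      labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
  ) |>
  group_by(product_name, order_dow) |>
  summarize(mean_hour = mean(order_hour_of_day), .groups = "drop") |>
  pivot_wider(names_from = order_dow, values_from = mean_hour) |>
  knitr::kable(digits = 2)
```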
## Problem 2
### Read the data
First, do some data cleaning:
- format the data to use appropriate variable names;
- focus on the “Overall Health” topic;
- include only responses from “Excellent” to “Poor”;
- organize responses as a factor taking levels ordered from “Poor” to “Excellent”.
```{r}
data("brfss_smart2010")
brfss_smart2010 =
  brfss_smart2010 |>
  janitor::clean_names() |>
  rename(state = locationabbr, county = locationdesc) |>
  filter(topic == "Overall Health") |>
  mutate(
    response = factor(
      response,
      levels = c("Poor", "Fair", "Good", "Very good", "Excellent"),
      ordered = TRUE
    )
  )
```
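As a quick sanity check that the factor conversion produced the intended ordering, the levels can be inspected directly; for example:
```{r, eval = FALSE}
# Should print: "Poor" "Fair" "Good" "Very good" "Excellent"
levels(pull(brfss_smart2010, response))
```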
### In 2002, which states were observed at 7 or more locations? What about in 2010?
```{r}
more_7_loc_2002=
brfss_smart2010|>
filter(year==2002)|>
group_by(state)|>
summarize(n_obs=n_distinct(county))|>
filter(n_obs>=7)
more_7_loc_2002
```
From this result, `r more_7_loc_2002$state` were observed at 7 or more locations in 2002.
What about in 2010?
```{r}
more_7_loc_2010=
brfss_smart2010|>
filter(year==2010)|>
group_by(state)|>
summarize(n_obs=n_distinct(county))|>
filter(n_obs>=7)
more_7_loc_2010
```
From this result, `r more_7_loc_2010$state` were observed at 7 or more locations in 2010.
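For a cleaner inline rendering, the state vectors could be collapsed into comma-separated strings before being referenced in the text; a minimal sketch:
```{r, eval = FALSE}
str_c(more_7_loc_2002$state, collapse = ", ")
str_c(more_7_loc_2010$state, collapse = ", ")
```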
### Construct a dataset that is limited to Excellent responses, and contains year, state, and a variable that averages the data_value across locations within a state. Make a “spaghetti” plot of this average value over time within a state.
```{r,message = FALSE}
ave_data_value_summ=
brfss_smart2010|>
filter(response=="Excellent")|>
group_by(year,state)|>
summarize(ave_data_value=mean(data_value))
ave_data_value_summ
```
```{r}
ave_data_value_summ|>
ggplot(aes(x=year,y=ave_data_value,color=state))+
theme_bw() +
geom_line()+
labs(
title = "Spaghetti plot of average data value over time within a state",
x = "Year",
y = "Average data value",
color = "State"
)+
viridis::scale_color_viridis(
name = "State",
discrete = TRUE
)
```
### Make a two-panel plot showing, for the years 2006 and 2010, the distribution of data_value for responses (“Poor” to “Excellent”) among locations in NY State.
```{r}
plot_2006=
brfss_smart2010|>
filter(year==2006,state=="NY")|>
ggplot(aes(x=data_value,fill=response))+
theme_bw()+
geom_density(alpha = .5)+
labs(
title = "Distribution of data_value in 2006"
)+
theme(legend.position = "none")
plot_2010=
brfss_smart2010|>
filter(year==2010,state=="NY")|>
ggplot(aes(x=data_value,fill=response))+
theme_bw()+
geom_density(alpha = .5)+
labs(
title = "Distribution of data_value in 2010"
)
plot_2006+plot_2010
```
## Problem 3
### Read the data
```{r,message = FALSE , warning = FALSE}
accel=read_csv("./data/nhanes_accel.csv")
covar=read_csv("./data/nhanes_covar.csv",skip = 4)
```
Tidy, merge, and otherwise organize the data sets.
```{r,message = FALSE}
mims=
left_join(covar,accel)|>
mutate(sex=recode_factor(sex,'1'="male",'2'="female"),
education=recode_factor(education,
'1'="Less than high school",
'2'="High school equivalent",
'3'="More than high school"))|>
drop_na(min1:min1440)|>
pivot_longer(
min1:min1440,
names_to = "min",
values_to = "mims"
)|>
filter(age>=21)
```
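As a quick check on the merge and reshape, every remaining participant should contribute exactly 1440 minute-level rows; a minimal sketch (the result should be an empty tibble):
```{r, eval = FALSE}
mims |>
  count(SEQN) |>
  filter(n != 1440)
```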
### Produce a reader-friendly table for the number of men and women in each education category, and create a visualization of the age distributions for men and women in each education category. Comment on these items.
```{r}
mims|>
group_by(sex,education)|>
summarize(n_obs=n_distinct(SEQN))
```
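To make this table more reader-friendly in the knitted document, the counts could be spread across education levels and passed to `knitr::kable()`; a minimal sketch:
```{r, eval = FALSE}
mims |>
  group_by(sex, education) |>
  summarize(n_obs = n_distinct(SEQN), .groups = "drop") |>
  pivot_wider(names_from = education, values_from = n_obs) |>
  knitr::kable()
```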
```{r}
mims|>
distinct(SEQN,.keep_all = TRUE)|>
ggplot(aes(x=age,fill=sex))+
geom_density(alpha = .5) +
facet_grid(~education) +
viridis::scale_fill_viridis(discrete = TRUE)
```
From the counts and age distributions for men and women in each education category, we can see that the number of men in the High school equivalent category is noticeably higher than the number of women in that category. The average age of women in the Less than high school and High school equivalent categories is higher than that of men, while the average age of women in the More than high school category is lower than that of men. For both men and women, the most highly educated group skews younger.
### Using your tidied dataset, aggregate across minutes to create a total activity variable for each participant.
```{r,message=FALSE}
Traditional_analyses=
mims|>
group_by(SEQN,sex,age,education)|>
summarize(ave_mims=mean(mims))
Traditional_analyses
```
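Because every participant contributes the same 1440 minutes, the mean used above is proportional to a literal total; if a total-activity variable is preferred, `sum()` could be substituted. A minimal sketch:
```{r, eval = FALSE}
mims |>
  group_by(SEQN, sex, age, education) |>
  summarize(total_mims = sum(mims), .groups = "drop")
```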
### Plot these total activities (y-axis) against age (x-axis); your plot should compare men to women and have separate panels for each education level. Include a trend line or a smooth to illustrate differences. Comment on your plot.
```{r,message=FALSE}
Traditional_analyses|>
ggplot(aes(x=age,y=ave_mims,color=sex))+
geom_point()+
geom_smooth(se = FALSE)+
facet_grid(. ~ education)
```
In conclusion, as age increases, MIMS values tend to decrease. Also, MIMS values for women are generally higher than those for men.
### Make a three-panel plot that shows the 24-hour activity time courses for each education level and use color to indicate sex. Describe in words any patterns or conclusions you can make based on this graph; including smooth trends may help identify differences
```{r}
mims|>
mutate(min=rep(1:1440,length(SEQN)/1440))|>
group_by(min,sex,education)|>
summarize(mims=mean(mims))|>
ggplot(aes(x=min,y=mims,color=sex))+
theme_bw()+
geom_point(alpha = .1)+
geom_smooth(se = FALSE)+
scale_x_continuous(
  breaks = c(1, seq(60, 1440, by = 60)),
  labels = as.character(0:24)) +
facet_grid(. ~ education)
```
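The `rep()` trick above relies on rows staying in `min1`–`min1440` order within each participant; an order-independent alternative is to parse the minute number directly out of the column name. A minimal sketch:
```{r, eval = FALSE}
mims |>
  mutate(min = readr::parse_number(min)) |>
  group_by(min, sex, education) |>
  summarize(mims = mean(mims), .groups = "drop")
```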
In conclusion, MIMS values peak around noon and reach their minimum in the early morning (around 4 am). In general, women's MIMS values are higher than men's.