---
title: "Homework 3"
author: "Yunshen Bai"
date: '`r format(Sys.time(), "%Y-%m-%d")`'
output: github_document
---
```{r setup, include = FALSE}
library(tidyverse)
library(ggridges)
library(patchwork)
library(p8105.datasets)

knitr::opts_chunk$set(
  echo = TRUE,
  warning = FALSE,
  fig.width = 8,
  fig.height = 6,
  out.width = "90%"
)

theme_set(theme_minimal() + theme(legend.position = "bottom"))

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis"
)

scale_colour_discrete = scale_colour_viridis_d
scale_fill_discrete = scale_fill_viridis_d
```
## Problem 1
### Read in the data
```{r}
data("instacart")
instacart =
instacart |>
as_tibble()
```
### Answer questions about the data
This dataset contains `r nrow(instacart)` rows and `r ncol(instacart)` columns, with each row representing a single product from an Instacart order. Variables include identifiers for user, order, and product, as well as the order in which each product was added to the cart. There are several order-level variables describing the day and time of the order and the number of days since the prior order. There are also several item-specific variables describing the product name (e.g. Yogurt, Avocado), department (e.g. dairy and eggs, produce), aisle (e.g. yogurt, fresh fruits), and whether the item has been ordered by this user in the past. In total, there are `r instacart |> select(product_id) |> distinct() |> count()` products found in `r instacart |> select(user_id, order_id) |> distinct() |> count()` orders from `r instacart |> select(user_id) |> distinct() |> count()` distinct users.
Below is a table summarizing the number of items ordered from each aisle. In total, there are 134 aisles, with fresh vegetables and fresh fruits holding the most items ordered by far.
```{r}
instacart |>
count(aisle) |>
arrange(desc(n))
```
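The same summary could also be passed to `knitr::kable()` for a cleaner rendered table; a minimal sketch:
```{r, eval = FALSE}
instacart |>
  count(aisle) |>
  arrange(desc(n)) |>
  knitr::kable()
```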
Next is a plot showing the number of items ordered in each aisle, restricted to aisles with more than 10,000 items. Aisles are ordered by ascending number of items.
```{r}
instacart |>
count(aisle) |>
filter(n > 10000) |>
mutate(aisle = fct_reorder(aisle, n)) |>
ggplot(aes(x = aisle, y = n)) +
geom_point() +
labs(title = "Number of items ordered in each aisle") +
theme(axis.text.x = element_text(angle = 60, hjust = 1))
```
Our next table shows the three most popular items in the aisles `baking ingredients`, `dog food care`, and `packaged vegetables fruits`, along with the number of times each item was ordered.
```{r}
instacart |>
filter(aisle %in% c("baking ingredients", "dog food care", "packaged vegetables fruits")) |>
group_by(aisle) |>
count(product_name) |>
mutate(rank = min_rank(desc(n))) |>
filter(rank < 4) |>
arrange(desc(n)) |>
knitr::kable()
```
Finally, here is a table showing the mean hour of the day at which Pink Lady Apples and Coffee Ice Cream are ordered on each day of the week. This table has been formatted in an untidy manner for human readers. Pink Lady Apples are generally purchased slightly earlier in the day than Coffee Ice Cream, with the exception of day 5.
```{r}
instacart |>
filter(product_name %in% c("Pink Lady Apples", "Coffee Ice Cream")) |>
group_by(product_name, order_dow) |>
summarize(mean_hour = mean(order_hour_of_day)) |>
pivot_wider(
names_from = order_dow,
values_from = mean_hour) |>
knitr::kable(digits = 2)
```
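If day names are preferred over the numeric `order_dow` codes, the variable could be relabeled before pivoting. A minimal sketch, assuming `order_dow` is coded 0 = Sunday through 6 = Saturday (worth confirming against the Instacart data dictionary):
```{r, eval = FALSE}
instacart |>
  filter(product_name %in% c("Pink Lady Apples", "Coffee Ice Cream")) |>
  mutate(
    order_dow = factor(
      order_dow,
      levels = 0:6,
      labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"))
  ) |>
  group_by(product_name, order_dow) |>
  summarize(mean_hour = mean(order_hour_of_day), .groups = "drop") |>
  pivot_wider(names_from = order_dow, values_from = mean_hour) |>
  knitr::kable(digits = 2)
```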
## Problem 2
### Read the data
First, do some data cleaning:
- format the data to use appropriate variable names;
- focus on the “Overall Health” topic;
- include only responses from “Excellent” to “Poor”;
- organize responses as a factor taking levels ordered from “Poor” to “Excellent”.
```{r}
data("brfss_smart2010")
brfss_smart2010 =
  brfss_smart2010 |>
  janitor::clean_names() |>
  rename(state = locationabbr, county = locationdesc) |>
  filter(topic == "Overall Health") |>
  mutate(
    response = factor(
      response,
      levels = c("Poor", "Fair", "Good", "Very good", "Excellent"),
      ordered = TRUE
    )
  )
```
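As a quick sanity check that the factor conversion produced the intended ordering, the levels can be inspected directly; for example:
```{r, eval = FALSE}
# Should print: "Poor" "Fair" "Good" "Very good" "Excellent"
levels(pull(brfss_smart2010, response))
```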
### In 2002, which states were observed at 7 or more locations? What about in 2010?
```{r}
more_7_loc_2002=
brfss_smart2010|>
filter(year==2002)|>
group_by(state)|>
summarize(n_obs=n_distinct(county))|>
filter(n_obs>=7)
more_7_loc_2002
```
From this result, `r more_7_loc_2002$state` were observed at 7 or more locations in 2002.
What about in 2010?
```{r}
more_7_loc_2010=
brfss_smart2010|>
filter(year==2010)|>
group_by(state)|>
summarize(n_obs=n_distinct(county))|>
filter(n_obs>=7)
more_7_loc_2010
```
From this result, `r more_7_loc_2010$state` were observed at 7 or more locations in 2010.
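For a cleaner inline rendering, the state vectors could be collapsed into comma-separated strings before being referenced in the text; a minimal sketch:
```{r, eval = FALSE}
str_c(more_7_loc_2002$state, collapse = ", ")
str_c(more_7_loc_2010$state, collapse = ", ")
```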
### Construct a dataset that is limited to Excellent responses, and contains year, state, and a variable that averages the data_value across locations within a state. Make a “spaghetti” plot of this average value over time within a state.
```{r,message = FALSE}
ave_data_value_summ=
brfss_smart2010|>
filter(response=="Excellent")|>
group_by(year,state)|>
summarize(ave_data_value=mean(data_value))
ave_data_value_summ
```
```{r}
ave_data_value_summ|>
ggplot(aes(x=year,y=ave_data_value,color=state))+
theme_bw() +
geom_line()+
labs(
title = "Spaghetti plot of average data value over time within a state",
x = "Year",
y = "Average data value",
color = "State"
)+
viridis::scale_color_viridis(
name = "State",
discrete = TRUE
)
```
### Make a two-panel plot showing, for the years 2006 and 2010, the distribution of data_value for responses (“Poor” to “Excellent”) among locations in NY State.
```{r}
plot_2006=
brfss_smart2010|>
filter(year==2006,state=="NY")|>
ggplot(aes(x=data_value,fill=response))+
theme_bw()+
geom_density(alpha = .5)+
labs(
title = "Distribution of data_value in 2006"
)+
theme(legend.position = "none")
plot_2010=
brfss_smart2010|>
filter(year==2010,state=="NY")|>
ggplot(aes(x=data_value,fill=response))+
theme_bw()+
geom_density(alpha = .5)+
labs(
title = "Distribution of data_value in 2010"
)
plot_2006+plot_2010
```
## Problem 3
### Read the data
```{r,message = FALSE , warning = FALSE}
accel=read_csv("./data/nhanes_accel.csv")
covar=read_csv("./data/nhanes_covar.csv",skip = 4)
```
Tidy, merge, and otherwise organize the data sets.
```{r,message = FALSE}
mims=
left_join(covar,accel)|>
mutate(sex=recode_factor(sex,'1'="male",'2'="female"),
education=recode_factor(education,
'1'="Less than high school",
'2'="High school equivalent",
'3'="More than high school"))|>
drop_na(min1:min1440)|>
pivot_longer(
min1:min1440,
names_to = "min",
values_to = "mims"
)|>
filter(age>=21)
```
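As a quick check on the merge and reshape, every remaining participant should contribute exactly 1440 minute-level rows; a minimal sketch (the result should be an empty tibble):
```{r, eval = FALSE}
mims |>
  count(SEQN) |>
  filter(n != 1440)
```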
### Produce a reader-friendly table for the number of men and women in each education category, and create a visualization of the age distributions for men and women in each education category. Comment on these items.
```{r}
mims|>
group_by(sex,education)|>
summarize(n_obs=n_distinct(SEQN))
```
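To make this table more reader-friendly in the knitted document, the counts could be spread across education levels and passed to `knitr::kable()`; a minimal sketch:
```{r, eval = FALSE}
mims |>
  group_by(sex, education) |>
  summarize(n_obs = n_distinct(SEQN), .groups = "drop") |>
  pivot_wider(names_from = education, values_from = n_obs) |>
  knitr::kable()
```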
```{r}
mims|>
distinct(SEQN,.keep_all = TRUE)|>
ggplot(aes(x=age,fill=sex))+
geom_density(alpha = .5) +
facet_grid(~education) +
viridis::scale_fill_viridis(discrete = TRUE)
```
From the counts and age distributions for men and women in each education category, we can see that the number of men in the High school equivalent category is noticeably higher than the number of women in that category. The average age of women in the Less than high school and High school equivalent categories is higher than that of men, while the average age of women in the More than high school category is lower than that of men. For both men and women, the most highly educated group skews younger.
### Using your tidied dataset, aggregate across minutes to create a total activity variable for each participant.
```{r,message=FALSE}
Traditional_analyses=
mims|>
group_by(SEQN,sex,age,education)|>
summarize(ave_mims=mean(mims))
Traditional_analyses
```
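Because every participant contributes the same 1440 minutes, the mean used above is proportional to a literal total; if a total-activity variable is preferred, `sum()` could be substituted. A minimal sketch:
```{r, eval = FALSE}
mims |>
  group_by(SEQN, sex, age, education) |>
  summarize(total_mims = sum(mims), .groups = "drop")
```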
### Plot these total activities (y-axis) against age (x-axis); your plot should compare men to women and have separate panels for each education level. Include a trend line or a smooth to illustrate differences. Comment on your plot.
```{r,message=FALSE}
Traditional_analyses|>
ggplot(aes(x=age,y=ave_mims,color=sex))+
geom_point()+
geom_smooth(se = FALSE)+
facet_grid(. ~ education)
```
In conclusion, as age increases, MIMS values tend to decrease. Also, MIMS values for women are generally higher than those for men.
### Make a three-panel plot that shows the 24-hour activity time courses for each education level and use color to indicate sex. Describe in words any patterns or conclusions you can make based on this graph; including smooth trends may help identify differences
```{r}
mims|>
mutate(min=rep(1:1440,length(SEQN)/1440))|>
group_by(min,sex,education)|>
summarize(mims=mean(mims))|>
ggplot(aes(x=min,y=mims,color=sex))+
theme_bw()+
geom_point(alpha = .1)+
geom_smooth(se = FALSE)+
scale_x_continuous(
  breaks = c(1, seq(60, 1440, by = 60)),
  labels = as.character(0:24)) +
facet_grid(. ~ education)
```
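The `rep()` trick above relies on rows staying in `min1`–`min1440` order within each participant; an order-independent alternative is to parse the minute number directly out of the column name. A minimal sketch:
```{r, eval = FALSE}
mims |>
  mutate(min = readr::parse_number(min)) |>
  group_by(min, sex, education) |>
  summarize(mims = mean(mims), .groups = "drop")
```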
In conclusion, MIMS values peak around noon and reach their minimum in the early morning (around 4 am). In general, women's MIMS values are higher than men's.