forked from rdpeng/RepData_PeerAssessment1
---
title: "Reproducible Research: Peer Assessment 1"
output:
  html_document:
    keep_md: true
---
## Loading and preprocessing the data
Let's first load up the helper libraries:
```{r message = FALSE, warning = FALSE}
library(tidyverse)
library(lubridate)
library(scales)
```
The dataset is already present in the repository. Let's just unzip it and load it up:
```{r message = FALSE, warning = FALSE}
# let's extract the dataset first:
unzip("activity.zip")
dataset <- read_csv("activity.csv")
```
## What is mean total number of steps taken per day?
Let's first subset the data to only work with complete cases:
```{r}
completes <- dataset[complete.cases(dataset), ]
```
We're interested in the total number of steps taken on each day:
```{r}
completes_with_totals <-
  completes %>%
  group_by(date) %>%
  summarize(total = sum(steps))
mean_total_steps <- completes_with_totals$total %>% mean
median_total_steps <- completes_with_totals$total %>% median
print(paste("Mean total steps:", mean_total_steps))
print(paste("Median total steps:", median_total_steps))
```
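Dropping the incomplete rows first is not the same as keeping them and passing `na.rm = TRUE` to `sum()`. Here is a minimal sketch of the difference on made-up data; the `toy_steps` table and its values are hypothetical and not taken from the dataset:

```r
library(dplyr)

# Hypothetical toy data: the second day has no recorded steps at all.
toy_steps <- tibble(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(10, 20, NA, NA)
)

# Approach used above: drop incomplete rows, then sum per day.
totals_complete <- toy_steps %>%
  filter(!is.na(steps)) %>%
  group_by(date) %>%
  summarize(total = sum(steps))

# Alternative: keep every row and ignore NAs inside sum().
totals_na_rm <- toy_steps %>%
  group_by(date) %>%
  summarize(total = sum(steps, na.rm = TRUE))

mean(totals_complete$total)  # 30: only the observed day counts
mean(totals_na_rm$total)     # 15: the all-NA day contributes a total of 0
```

With `na.rm = TRUE`, a day without any measurements turns into a day with zero steps, which pulls the mean down. Filtering first leaves such days out of the summary entirely.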
Now, let's plot a histogram:
```{r}
completes_with_totals %>%
  ggplot(aes(total)) +
  geom_histogram(bins = 10) +
  labs(title = "Histogram of total number of steps per day") +
  xlab("Number of steps") +
  ylab("Frequency")
```
## What is the average daily activity pattern?
For this we need to group by the interval and compute the mean of steps.
```{r}
interval_mean_steps <- dataset %>%
  group_by(interval) %>%
  summarize(mean_steps = mean(steps, na.rm = TRUE))
mean_max_steps_per_interval <- max(interval_mean_steps$mean_steps)
most_active_interval <- (
  interval_mean_steps %>%
    filter(mean_steps == mean_max_steps_per_interval)
)$interval[1]
# paste() already separates its arguments with a space
print(paste("Most active interval:", most_active_interval))
print(paste("Average number of steps in that interval:", mean_max_steps_per_interval))
interval_mean_steps %>%
  ggplot(aes(interval, mean_steps)) +
  geom_line() +
  # mark the most active interval with a vertical red line
  geom_vline(
    xintercept = most_active_interval, colour = "red"
  ) +
  annotate(
    "text",
    x = most_active_interval + 400,
    y = mean_max_steps_per_interval,
    label = paste("Most active interval:", most_active_interval),
    colour = "red"
  )
```
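As an aside, here is a minimal sketch of another way to pick out the peak, using base R's `which.max()` instead of an equality filter on the computed maximum. The `toy_interval_means` data frame and its values are made up for illustration and merely stand in for `interval_mean_steps`:

```r
# Hypothetical per-interval means, standing in for interval_mean_steps.
toy_interval_means <- data.frame(
  interval   = c(0, 5, 10, 15),
  mean_steps = c(1.5, 40.2, 12.0, 40.2)
)

# which.max() returns the index of the first maximum, so ties resolve
# to the earliest interval, just like taking element [1] after a filter.
peak_row <- toy_interval_means[which.max(toy_interval_means$mean_steps), ]

peak_row$interval    # 5
peak_row$mean_steps  # 40.2
```

Both approaches should give the same answer; `which.max()` just gets there in a single lookup.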
## Imputing missing values
```{r}
total_incompletes <- dataset[!complete.cases(dataset), ] %>% nrow
print(paste("Total number of incomplete rows in the dataset:", total_incompletes))
```
The imputation strategy is to replace each missing value with the mean step count for the same interval on the same day of the week. This should capture differences in a given interval across weekdays (e.g. we expect interval X to look different on a Monday than on a Sunday).
Let's first add the weekday to each row:
```{r}
imputed_dataset <- dataset %>% mutate(wday=wday(date, label=TRUE))
```
Now let's compute the mean steps for each weekday and interval, and attach it to every row with an inner join:
```{r}
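imputed_dataset <- inner_join(
  imputed_dataset,
  imputed_dataset %>%
    # group_by_() is deprecated; plain group_by() takes the column names directly
    group_by(wday, interval) %>%
    summarize(mean_steps = mean(steps, na.rm = TRUE)),
  # name the join keys explicitly instead of relying on the guessed ones
  by = c("wday", "interval")
)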
```
Let's now use what we have: for every row where `steps` is NA, substitute the computed mean value:
```{r}
imputed_dataset <- imputed_dataset %>%
  mutate(steps = ifelse(is.na(steps), mean_steps, steps)) %>%
  select(steps, date, interval)
```
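The same idea can also be written as one grouped `mutate()`, without the intermediate join. This is only a sketch on made-up data: the `toy_activity` table, its values, and the `weekday` column name are hypothetical, and it is an alternative formulation rather than the code this report runs:

```r
library(dplyr)
library(lubridate)

# Hypothetical toy data: two Mondays and one Tuesday, a single interval.
toy_activity <- tibble(
  date     = as.Date(c("2012-10-01", "2012-10-08", "2012-10-02")),
  interval = c(0, 0, 0),
  steps    = c(10, NA, 4)
)

toy_imputed <- toy_activity %>%
  # Group rows that share a day of the week and an interval.
  group_by(weekday = wday(date), interval) %>%
  # Replace each NA with the mean of the observed values in its group.
  mutate(steps = ifelse(is.na(steps), mean(steps, na.rm = TRUE), steps)) %>%
  ungroup() %>%
  select(steps, date, interval)

toy_imputed$steps  # 10 10 4
```

The missing Monday value is filled with the mean of the other Monday, while the Tuesday row is left untouched. If a weekday and interval combination had no observed values at all, the group mean would be `NaN` and the row would stay missing.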
Now to double-check that all the missing values have been fully imputed:
```{r}
print(
  paste(
    "Does the imputed dataset contain all days filled with data?",
    nrow(dataset) == (imputed_dataset %>% complete.cases() %>% sum())
  )
)
```
Let us now have a look at the plot we made earlier. This time let's use the dataset with imputed values:
```{r}
imputed_with_totals <-
  imputed_dataset %>%
  group_by(date) %>%
  summarize(total = sum(steps))
mean_imputed_total_steps <- imputed_with_totals$total %>% mean
median_imputed_total_steps <- imputed_with_totals$total %>% median
print(paste("Mean total steps for the imputed dataset:", mean_imputed_total_steps))
print(paste("Median total steps for the imputed dataset:", median_imputed_total_steps))
imputed_with_totals %>%
  ggplot(aes(total)) +
  geom_histogram(bins = 10) +
  labs(title = "Histogram of total number of steps per day") +
  xlab("Number of steps") +
  ylab("Frequency")
```
What impact did the imputation have on the mean and median estimation we got previously?
```{r}
print(
  paste(
    "Change in the mean total number of steps after imputing:",
    percent((mean_imputed_total_steps - mean_total_steps) / mean_total_steps)
  )
)
print(
  paste(
    "Change in the median total number of steps after imputing:",
    percent((median_imputed_total_steps - median_total_steps) / median_total_steps)
  )
)
```
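The change reported here is a relative change: the difference between the new and the old value, divided by the old value. A minimal sketch with made-up numbers (the `before` and `after` values are hypothetical and are not results from this dataset):

```r
library(scales)

# Hypothetical before/after values, chosen only to illustrate the formula.
before <- 10000
after  <- 10250

relative_change <- (after - before) / before

relative_change                           # 0.025
percent(relative_change, accuracy = 0.1)  # "2.5%"
```

A positive value means the statistic went up after imputing, a negative one means it went down.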
## Are there differences in activity patterns between weekdays and weekends?
Lastly, let's compare the mean number of steps per interval for each day type (weekday or weekend):
```{r}
imputed_dataset %>%
  mutate(
    # with lubridate's default week start, wday() is 1 for Sunday and 7 for Saturday
    daytype = ifelse(wday(date) %in% c(1, 7), "weekend", "weekday"),
    daytype = as.factor(daytype)
  ) %>%
  # group_by_() is deprecated; plain group_by() takes the column names directly
  group_by(daytype, interval) %>%
  summarize(mean_steps = mean(steps)) %>%
  ggplot(aes(interval, mean_steps)) +
  facet_grid(rows = vars(daytype)) +
  geom_line()
```
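The weekend check relies on lubridate's default numbering, where Sunday is 1 and Saturday is 7. That default can be changed through the `lubridate.week.start` option, so here is a sketch of a variant that pins the week start explicitly. The `toy_dates` vector is made up for illustration:

```r
library(lubridate)

# Saturday, Sunday and Monday from the second week of October 2012.
toy_dates <- as.Date(c("2012-10-06", "2012-10-07", "2012-10-08"))

# week_start = 1 makes Monday 1 and Sunday 7, regardless of the
# lubridate.week.start option, so 6 and 7 are always the weekend.
toy_daytype <- ifelse(wday(toy_dates, week_start = 1) >= 6, "weekend", "weekday")

toy_daytype  # "weekend" "weekend" "weekday"
```

On a default setup both versions should classify every date the same way.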