forked from rdpeng/RepData_PeerAssessment1
---
title: "PA1_template"
author: "Erin Stein"
date: "December 5, 2016"
output:
  html_document:
    keep_md: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## R Markdown for Reproducible Research: Peer Assessment 1
### Loading and preprocessing the data
Load the necessary packages.
```{r, echo = TRUE}
library(dplyr)
library(ggplot2)
```
Read in the activity data set and confirm the variables by viewing the first few rows and checking the structure to see the class of each variable. Because `read.csv()` does not parse dates (it reads `date` as a factor, or as character in R 4.0.0 and later), convert that column to the Date class first.
```{r, echo = TRUE}
unzip("activity.zip")
activity <- read.csv("activity.csv")
activity$date <- as.Date(activity$date)
head(activity)
str(activity)
```
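As a minimal sketch of the date conversion above, the following uses a few made-up rows in place of `activity.csv` (the values and the name `toy` are illustrative assumptions, not the real data) to show the `date` column changing class:

```r
# Made-up rows standing in for activity.csv; not the real data.
csv_text <- "steps,date,interval
NA,2012-10-01,0
5,2012-10-02,0
12,2012-10-02,5"

# stringsAsFactors = FALSE gives the same result across R versions.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy$date)   # character before conversion

# Spell out the format so parsing does not rely on guessing.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")
class(toy$date)   # Date after conversion
```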
### What is the mean total number of steps taken per day?
Group the activity data by date, then use the summarise() function to determine the number of total steps taken each day.
```{r, echo = TRUE}
activity_bydate <- group_by(activity, date)
sumsteps <- summarise(activity_bydate, sum(steps))
colnames(sumsteps) <- c("Date", "Total.Steps")
head(sumsteps)
```
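One detail is easier to see on a small case: `sum(steps)` without `na.rm = TRUE` returns NA for any day that contains missing values, so such days drop out of a later `mean(..., na.rm = TRUE)` instead of counting as zero-step days. A base R sketch with made-up numbers (the `toy` data frame is hypothetical, and `tapply()` stands in for the dplyr grouping used above):

```r
# Hypothetical data: day 1 is entirely missing, day 2 is complete.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)

# Default sum(): a day with missing values gets a total of NA.
daily_na <- tapply(toy$steps, toy$date, sum)
mean(daily_na, na.rm = TRUE)   # averages over day 2 only

# sum(na.rm = TRUE): the all-missing day becomes 0 and pulls the mean down.
daily_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
mean(daily_zero)
```

Keeping the NA totals, as the chunk above does, avoids treating unrecorded days as days with no activity.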
Plot a histogram of the number of total steps per day.
```{r, echo = TRUE}
g <- ggplot(sumsteps, aes(Total.Steps))
g + geom_histogram(colour = "black", fill = "aquamarine3", binwidth = 1000) +
  labs(title = "Histogram of Total Steps per Day", x = "Total Steps", y = "Count") +
  geom_vline(xintercept = mean(sumsteps$Total.Steps, na.rm = TRUE), lwd = 2, colour = "darkmagenta", linetype = "longdash") +
  annotate("text", label = "Mean", x = 12300, y = 8.5, colour = "darkmagenta") +
  geom_vline(xintercept = median(sumsteps$Total.Steps, na.rm = TRUE), lwd = 1, colour = "orange1") +
  annotate("text", label = "Median", x = 12500, y = 7.6, colour = "orange1")
```
Calculate the mean and median number of steps taken per day.
```{r, echo = TRUE}
meansteps <- mean(sumsteps$Total.Steps, na.rm = TRUE)
print(paste("Mean of", meansteps, "steps taken per day."))
mediansteps <- median(sumsteps$Total.Steps, na.rm = TRUE)
print(paste("Median of", mediansteps, "steps taken per day."))
```
### What is the average daily activity pattern?
Begin by grouping the data by interval.
```{r, echo = TRUE}
activity_byint <- group_by(activity, interval)
```
Next, use the summarise() function to determine the average step count per interval.
```{r, echo = TRUE}
avgsteps_int <- summarise(activity_byint, mean(steps, na.rm = TRUE))
colnames(avgsteps_int) <- c("Interval", "Average.Step.Count")
head(avgsteps_int)
```
Use ggplot2 to plot a time series, with the 5-minute intervals on the x-axis and the average number of steps in that interval on the y-axis.
```{r, echo = TRUE}
p <- ggplot(avgsteps_int, aes(Interval, Average.Step.Count))
p + geom_line(colour = "dodgerblue3", lwd = 0.8) +
  labs(title = "Average Step Count per 5-min Interval") +
  labs(x = "5-minute Interval", y = "Average Step Count")
```
To determine which interval contains the max average step count, use which.max().
```{r, echo = TRUE}
maxavgint <- avgsteps_int[which.max(avgsteps_int$Average.Step.Count),]
print(maxavgint)
print(paste("The max average step count of", maxavgint$Average.Step.Count, "occurs during the 5-min interval beginning at", maxavgint$Interval, "am."))
```
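The interval labels are easier to read as clock times. The sketch below assumes each label encodes hours times 100 plus minutes (consistent with reading 835 as 8:35 am above); `interval_to_clock` is a hypothetical helper, not part of the original analysis:

```r
# Hypothetical helper: turn an HHMM-style interval label into "HH:MM".
# Assumes the label is hours * 100 + minutes.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 835, 2355))
# "00:00" "08:35" "23:55"
```

Formatting the label this way would also remove the need to hard-code "am" in the printed message.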
### Imputing missing values
To calculate the number of rows missing data on step counts, simply sum the number of NAs that occur in the steps column of the original activity data set.
```{r, echo = TRUE}
sum(is.na(activity$steps))
```
We'll need to replace these 2304 missing values. Let's do so by replacing each one with the average step count for its 5-min interval, taken across the entire time period. First, join the activity data set with the average step count per interval data set. Then, replace every NA in the original steps column (steps.x) with the interval average from the joined column (steps.y). Finalize the data set by trimming it back down to the steps, date, and interval columns.
```{r, echo = TRUE}
colnames(avgsteps_int) <- c("interval", "steps")
activity2 <- full_join(activity, avgsteps_int, by = "interval")
index <- is.na(activity2$steps.x)
activity2 <- within(activity2, steps.x[index] <- steps.y[index])
activity2 <- select(activity2, steps.x, date, interval)
colnames(activity2)[1] <- "steps"
```
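The same interval-mean imputation can be checked by hand on a tiny example. This is a base R sketch on made-up numbers, using `ave()` in place of the join above; the `toy` data frame and the `steps_imputed` column are illustrative names only:

```r
# Hypothetical data: one missing value in each of two intervals.
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 8),
  interval = c(0, 0, 0, 5, 5, 5)
)

# Per-row mean of the row's interval, ignoring missing values.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; fill only the gaps with the interval mean.
toy$steps_imputed <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
toy$steps_imputed
# 3 4 2 6 7 8
```

Observed values are left untouched, so only the rows that were NA change.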
Now to plot a new histogram of the number of steps taken per day using the same group_by() and summarise() tools as before.
```{r, echo = TRUE}
activity2_bydate <- group_by(activity2, date)
sumsteps2 <- summarise(activity2_bydate, Total.Steps = sum(steps))
head(sumsteps2)
g2 <- ggplot(sumsteps2, aes(Total.Steps))
g2 + geom_histogram(colour = "black", fill = "lightsalmon2", binwidth = 1000) +
  labs(title = "Histogram of Total Steps per Day (Imputed)", x = "Total Steps", y = "Count") +
  geom_vline(xintercept = mean(sumsteps2$Total.Steps, na.rm = TRUE), lwd = 2, colour = "dodgerblue4", linetype = "longdash") +
  annotate("text", label = "Mean", x = 13100, y = 10, colour = "dodgerblue4") +
  geom_vline(xintercept = median(sumsteps2$Total.Steps, na.rm = TRUE), lwd = 1, colour = "red") +
  annotate("text", label = "Median", x = 13300, y = 9, colour = "red")
```
Continue by calculating the new mean and median based on this NA-free data set.
```{r, echo = TRUE}
meansteps2 <- mean(sumsteps2$Total.Steps, na.rm = TRUE)
print(paste("Mean of", meansteps2, "steps taken per day."))
mediansteps2 <- median(sumsteps2$Total.Steps, na.rm = TRUE)
print(paste("Median of", mediansteps2, "steps taken per day."))
```
Imputation appears to have a negligible impact on the mean and median for this data set: the median has increased by about 0.011%.
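A percentage like this is a relative change: (after - before) / before * 100. A small helper makes the arithmetic explicit; the name `pct_change` and the example numbers are illustrative and are not values from this analysis:

```r
# Hypothetical helper: relative change from `before` to `after`, in percent.
pct_change <- function(before, after) {
  (after - before) / before * 100
}

pct_change(200, 201)   # made-up inputs; gives 0.5
```

With the objects computed above, the corresponding call would be `pct_change(mediansteps, mediansteps2)`.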
### Are there differences in activity patterns between weekdays and weekends?
Let's first create a new column in the imputed data set that designates the weekday of the observation and then convert that into a factor column with two levels: Weekday and Weekend.
```{r, echo = TRUE}
activity2$weekday <- weekdays(activity2$date)
index2 <- activity2$weekday %in% c("Sunday", "Saturday")
activity2 <- within(activity2, weekday[index2] <- "Weekend")
activity2 <- within(activity2, weekday[!index2] <- "Weekday")
activity2$weekday <- as.factor(activity2$weekday)
```
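Note that `weekdays()` returns day names in the session's locale, so matching on "Saturday" and "Sunday" only works in an English locale. A locale-independent alternative, sketched here on a few example dates, uses the numeric day of the week from `as.POSIXlt()` (0 is Sunday, 6 is Saturday); the variable names are illustrative:

```r
# Example dates: a Friday, Saturday, Sunday, and Monday in October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is numeric (0 = Sunday ... 6 = Saturday), so no day names are needed.
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

day_type <- factor(ifelse(is_weekend, "Weekend", "Weekday"),
                   levels = c("Weekday", "Weekend"))
as.character(day_type)
# "Weekday" "Weekend" "Weekend" "Weekday"
```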
Now to finish with a panel plot displaying the Average Step Count per 5-min interval for the weekend observations vs the weekday observations. Begin by grouping the imputed data by both weekday factor level and interval.
```{r, echo = TRUE}
activity2_byint <- group_by(activity2, weekday, interval)
avgsteps_int2 <- summarise(activity2_byint, Avg.Step.Count = mean(steps, na.rm = TRUE))
```
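To see what grouping by two variables produces, here is a base R equivalent on made-up numbers, with `aggregate()` standing in for `group_by()` plus `summarise()`. The `toy` data frame is hypothetical; the result has one row per weekday/interval combination:

```r
# Hypothetical data: two day types, two intervals.
toy <- data.frame(
  weekday  = c("Weekday", "Weekday", "Weekend", "Weekend", "Weekday", "Weekend"),
  interval = c(0, 0, 0, 0, 5, 5),
  steps    = c(2, 4, 10, 20, 6, 8)
)

# Mean steps for every (weekday, interval) pair.
avg <- aggregate(steps ~ weekday + interval, data = toy, FUN = mean)
avg
```

The formula interface of `aggregate()` drops rows with missing values by default, which plays the same role here as `na.rm = TRUE` in the chunk above.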
Then, create a multi-panel plot, one for the weekday data, and one for the weekend data, that displays the average step count per 5-min interval during the day.
```{r, echo = TRUE}
q <- ggplot(avgsteps_int2, aes(interval, Avg.Step.Count))
q + geom_line(aes(color = weekday), lwd = 0.8) +
  facet_grid(weekday ~ .) +
  labs(title = "Average Step Count per 5-min Interval (Imputed)") +
  labs(x = "5-minute Interval", y = "Average Step Count") +
  theme(legend.position = "none")
```
We see that the weekday data has a much higher spike in average step count in the morning during commuter hours, whereas the weekend step counts stay at a higher level more consistently throughout the day. These patterns seem to fit typical workday vs weekend activity.