forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
259 lines (212 loc) · 8.51 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
---
title: "Reproducible Research: Peer Assessment 1"
output:
html_document:
keep_md: true
---
Adapted from [rdpeng's template](https://github.com/rdpeng/RepData_PeerAssessment1)
Compiled using the command:
```
knit2html("PA1_template.Rmd","PA1_template.html")
# N.B. not a knitr R block.
```
```{r setoptions,echo=TRUE}
# Some knitr housekeeping and libraries
library(knitr)
library(dplyr)
library(ggplot2)
library(scales)
opts_chunk$set(echo=TRUE)
```
## Loading and preprocessing the data
Data cover the range 2012-10-30 to 2012-11-30.
We loaded the included activity.zip (same as available from the assignment via coursera as repdata\_data\_activity.zip). The
interval labels were found to be 24hr clock times with leading
zeros ommitted. (If you notice plateaus in your interval plots the
gap from 55th minute of one hour, xx55, to 0th minute of the next,
yy00, is the reason.)
```{r}
# The activity.zip in source control matches exactly the
# repdata_data_activity.zip downloadable from Coursera.
unzip("activity.zip")
activity <- read.csv("activity.csv",
colClasses=c("numeric","Date","numeric"),
na.strings = "NA")
# This is really quite horrible. There are 24*60=1440 minutes
# in a day, 24*12 = 288 five minute intervals. But the five minute
# intervals range [0 2355], with 288 distinct values. It's actually
# an unformatted time. Since:
# 1. It spans a likely daylight savings change.
# 2. This course is running in October and daylight savings might
# hit a careless conversion
# 3. Plotting them as is introduces nonlinear gaps from xx55 -> yy00
# I'm just going to calculate as if all intervals are times on the
# first day of the period.
activity <- activity %>%
mutate(intervaltime=as.POSIXct(
sprintf("2012-10-01 %04d",interval), format="%Y-%m-%d %H%M")
)
# Define some helper functions
# Determine optimum breaks for a histogram from data x. If
# nbins not supplied then use one of the nclass. functions.
findBreaks <- function (x, nclass.meth=nclass.Sturges,
nbins) {
if (missing(nbins)) {
nbins <- nclass.meth(x)
}
breaks <- seq(from=min(x), to=max(x),
along.with = 0:nbins)
}
```
Sanity check, our date range is expected to be 2012-10-01 to
2012-11-30.
```{r}
c(min(as.Date(activity$date)), max(as.Date(activity$date)))
```
## What is mean total number of steps taken per day?
### Histograms of the total number of steps taken each day
```{r}
# Calculate the total number of steps taken per day
dailyActivity <- activity %>% group_by(date) %>%
summarise(steps=sum(steps, na.rm=TRUE))
# Going to plot two histograms, so use a function to do it
doHisto <- function (activity, breaks, title=NULL) {
# Adapt example from ggplot2 documentation
qplot(steps, data=dailyActivity, xlab="Daily steps",
breaks=breaks,
geom="histogram", fill=..count..) +
scale_fill_gradient("Count", low = "blue", high = "red") +
labs(title=title)
}
breaks <- findBreaks(dailyActivity$steps)
doHisto(dailyActivity, breaks,
"Total daily steps, missing values ignored")
# Though I sort of like 10...
breaks <- findBreaks(dailyActivity$steps,nbins=10)
doHisto(dailyActivity, breaks,
"Total daily steps, missing values ignored")
```
* Mean value of total steps per day:
```{r}
round(mean(dailyActivity$steps, na.rm=TRUE),3)
```
* Median value of total steps per day:
```{r}
median(dailyActivity$steps, na.rm=TRUE)
```
## What is the average daily activity pattern?
Steps at each interval, averaged over days. Oct/Nov 2012.
```{r}
# mean activity each interval
intervalActivity <- activity %>% group_by(intervaltime) %>%
summarise(avgSteps=mean(steps,na.rm=TRUE))
# It says 'type ="l"', so I suppose that's base plotting:
plot(avgSteps ~ intervaltime, data=intervalActivity, type="l",
ylab="Average steps", xlab="Interval time")
```
Interval with largest number of average steps daily is:
```{r}
topInterval <- intervalActivity[
which.max(intervalActivity$avgSteps),"intervaltime"]
format(topInterval,"%H:%M")
```
## Imputing missing values
* Total number of missing values in dataset:
```{r}
# "the total number of rows with NAs" (which does turn out to be
# the same as number of is.na(activity$steps), i.e. no rows with
# steps and missing date/interval)
incomplete <- (!complete.cases(activity))
sum(incomplete)
```
We impute missing values in the data by randomly sampling with replacement values from other intervals from the same time and weekday.
```{r}
# So what to do? Could do mean/median for each interval or day, or
# whole period. Could even set them all to pi.
# Instead: for each interval * day of the week, sample with
# replacement from the existing cases for that combination.
# While it's not the best solution for a single result, this
# is something that could be repeated and bootstrapped to
# estimate the uncertainty from our imputation.
# Bit neater than full names from weekdays()
activity$day <- strftime(activity$date,format="%a")
activityComplete <- activity %>% filter(!is.na(steps))
set.seed(1)
fillValue <- function(steps, xday, xinterval) {
if (is.na(steps)) {
sample(
(activityComplete %>%
filter(day== xday & interval == xinterval))$steps,
size=1
)
} else {
steps
}
}
# New variable, keep original data clean
activityImp <- activity %>% rowwise() %>%
mutate(steps = fillValue(steps,day,interval)) %>%
ungroup()
```
The following check should show missing values have been filled
with values taken from the original set of matching day and
interval.
```{r}
# Confirm this works, fill values different
merge (activity, activityImp, by=c("date","day","intervaltime",
"interval"),
suffixes=c(".original", ".imputed") ) %>%
select(date,day,interval,starts_with("steps")) %>%
filter(day=="Sun" & interval == 1800)
```
Resulting imputed step frequencies.
```{r}
dailyActivityImp <- activityImp %>% group_by(date) %>%
summarise(steps=sum(steps, na.rm=FALSE))
breaks <- findBreaks(dailyActivityImp$steps,nbins=10)
doHisto(dailyActivityImp, breaks,"Total daily steps after imputation")
```
* Mean of total daily steps after imputation:
```{r}
round(mean(dailyActivityImp$steps, na.rm=FALSE),3)
```
* Median of total daily steps after imputation:
```{r}
median(dailyActivityImp$steps, na.rm=FALSE)
```
The values are higher after imputation. In the earlier part of the assignment daily activity was formed by summing values over the day and ignoring na values. Filling them with something greater than 0 will increase the daily activity values and tend to increase the summary values. (Median however is only affected if any values increase past the previous current, it is not affected by changes in the ends of the range.)
## Are there differences in activity patterns between weekdays and weekends?
Plot of average interval activity profile for weekends and weekdays.
```{r}
# Add a weekday factor for plotting
activityImp$weekday = factor(
x = sapply(activityImp$day,
function(x) {
ifelse(x %in% c("Sat","Sun"),
"weekend","weekday")
} )
)
intervalActivityImp <- activityImp %>%
group_by(weekday, intervaltime) %>%
summarise(avgSteps=mean(steps,na.rm=TRUE))
activityImp <- activityImp %>% mutate(intervaltimelabel=format(intervaltime,"%H:%M"))
# The labels= trick wasn't at all obvious...
ggplot(data=intervalActivityImp) +
geom_line(aes(intervaltime,avgSteps)) +
facet_wrap(~weekday, ncol=1, as.table=TRUE) +
scale_y_continuous(name="Mean number of steps") +
scale_x_datetime(name= "Time",
labels = date_format("%H:%M")
)
```
Plot of average interval activity profile for weekends and weekdays with estimated confidence intervals
```{r}
ggplot(data=activityImp) +
stat_summary(aes(intervaltime,steps),fun.data=mean_cl_boot, geom="smooth") +
facet_wrap(~weekday, ncol=1, as.table=TRUE) +
scale_y_continuous(name="Mean number of steps") +
scale_x_datetime(name= "Time",
labels = date_format("%H:%M")
)
```
What would be nice would be to calculate the per-interval difference between the two and test that statistically (with appropriate multiple comparisons correction). However, not what's asked for. Looks like weekend activity is a little higher. If you accidentally plot points (as I originally did) weekend maximum activity is higher. There is more variability / higher frequency, but this could just be because we are averaging over a larger number of days. Time range of activity is about the same, but the early morning pattern is maybe a bit more consistent for weekdays.