---
title: "Reproducible Research Peer Assessment Part 1"
author: "Gran Ville Lintao"
date: "August 16, 2015"
output:
  html_document:
    keep_md: yes
---
## Introduction
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the "quantified self" movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This study makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval each day.
The following chunk loads the required libraries and sets the global chunk options:
```{r setoptions,echo=TRUE, message=FALSE}
library(plyr)
library(dplyr)
library(ggplot2)
knitr::opts_chunk$set(fig.width=6, fig.height=4, fig.path='figure/',
                      warning=FALSE, message=FALSE)
```
## Loading and preprocessing the data
1. Loading the data can be done via the following code:
```{r}
path2csv <- "./activity.csv"
mydf <- read.csv(path2csv)
```
2. This code converts the data frame to a dplyr table (tibble) for easier processing:
```{r}
mytbldf <- as_tibble(mydf)  # tbl_df() is deprecated in dplyr >= 1.0.0
```
## What is mean total number of steps taken per day?
1. The total number of steps taken per day can be calculated with the dplyr verbs `group_by` and `summarize`:
```{r}
by_day <- group_by(mytbldf, date)
sum_by_day <- summarize(by_day, sumsteps=sum(steps))
```
The totals can then be visualized with ggplot2, with one bar per day weighted by that day's total steps:
```{r}
# index the days as integers (1, 2, ...) so they display nicely in the plot;
# factor() makes this work whether date was read as character or as factor
days <- as.numeric(factor(sum_by_day$date))
sum_by_day <- cbind(sum_by_day, days)
# display a histogram plot
qplot(days, data=sum_by_day, weight=sumsteps, geom="histogram", binwidth=1)
```
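The plot above draws one bar per day. As a complementary view, a histogram of the daily totals themselves shows how those totals are distributed. The following is a minimal sketch under stated assumptions: `toy_daily` and its values are made up for illustration and are not the activity data, and the bin width of 2500 steps is an arbitrary choice.

```{r}
library(ggplot2)

# Toy daily totals (hypothetical values, not the activity data)
toy_daily <- data.frame(
  sumsteps = c(8200, 9900, 10400, 10800, 11300, 12700, 15100)
)

# Each bar counts how many days fall into a 2500-step bin
toy_hist <- ggplot(toy_daily, aes(x = sumsteps)) +
  geom_histogram(binwidth = 2500) +
  xlab("Total steps per day") +
  ylab("Number of days")
toy_hist
```

Applied to the real data, the same call would presumably take `sum_by_day` in place of `toy_daily`.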
2. The mean of the total number of steps per day is calculated as:
```{r}
meansteps <- mean(sum_by_day$sumsteps, na.rm=TRUE)
meansteps
```
While the median of the total number of steps per day is calculated as:
```{r}
mediansteps <- median(sum_by_day$sumsteps, na.rm=TRUE)
mediansteps
```
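Both summaries pass `na.rm=TRUE` because `sum(steps)` returns `NA` for any day that contains missing values, so such days are dropped rather than counted as zero. The sketch below illustrates the difference on two made-up days (`toy_day_a` and `toy_day_b` are hypothetical and not part of the activity data).

```{r}
# Toy data (hypothetical): one day with no recordings, one complete day
toy_day_a <- c(NA, NA, NA)
toy_day_b <- c(10, 20, 30)

# Approach used above: keep NA totals, then drop them when averaging
toy_totals_na <- c(sum(toy_day_a), sum(toy_day_b))
toy_mean_dropped <- mean(toy_totals_na, na.rm = TRUE)

# Alternative: remove NAs inside sum(), which turns the empty day into 0
toy_totals_zero <- c(sum(toy_day_a, na.rm = TRUE), sum(toy_day_b, na.rm = TRUE))
toy_mean_zeroed <- mean(toy_totals_zero)

c(dropped = toy_mean_dropped, zeroed = toy_mean_zeroed)
```

Counting an unrecorded day as zero steps halves the mean in this toy case, which is why the missing days are excluded instead.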
## What is the average daily activity pattern?
1. The average daily activity pattern can be shown with a time series plot of the 5-minute interval on the x-axis against the average number of steps, taken across all days, on the y-axis.
```{r}
by_intrv <- group_by(mytbldf, interval)
mean_by_intrv <- summarize(by_intrv, avesteps=mean(steps, na.rm=TRUE))
plot <- ggplot(mean_by_intrv, aes(interval, avesteps)) + geom_line()
plot + xlab("Interval") + ylab("Average Steps across all days")
```
2. The 5-minute interval that, on average across all days, contains the maximum number of steps can be found as follows:
```{r}
maxavestep <- max(mean_by_intrv$avesteps)
rowanswer <- subset(mean_by_intrv, avesteps == maxavestep)
```
The identifier of that interval is:
```{r echo=FALSE}
rowanswer$interval
```
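An equivalent lookup can be sketched with `which.max()`, which returns the position of the first maximum. The data frame `toy_means` below is a hypothetical stand-in for `mean_by_intrv`, with made-up values.

```{r}
# Toy interval means (hypothetical values, not the activity data)
toy_means <- data.frame(
  interval = c(0, 5, 10, 15),
  avesteps = c(1.5, 0.3, 7.2, 4.1)
)

# which.max() gives the row index of the largest average
toy_peak <- toy_means$interval[which.max(toy_means$avesteps)]
toy_peak
```

One design difference: `subset(..., avesteps == maxavestep)` returns every row that ties for the maximum, whereas `which.max()` returns only the first.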
## Imputing missing values
1. The total number of rows with NA steps in the data is:
```{r}
rowsWithNaSteps <- subset(mytbldf, is.na(steps))
totalRowsWithNas <- nrow(rowsWithNaSteps)
totalRowsWithNas
```
2. There are `r totalRowsWithNas` rows with NA values. An acceptable strategy for filling these missing values is to use the mean for each interval, calculated in the previous section.
3. The following code creates a copy of the data set in which each NA is replaced by the mean number of steps for that 5-minute interval, averaged across all days:
```{r}
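newmytbldf <- mytbldf
# store steps as double so that fractional interval means can be assigned
newmytbldf$steps <- as.numeric(newmytbldf$steps)
for (i in seq_len(nrow(newmytbldf))) {
  if (is.na(newmytbldf[i,]$steps)) {
    # look up the mean for this row's interval
    answer <- subset(mean_by_intrv, interval == newmytbldf[i,]$interval)
    newmytbldf[i,]$steps <- answer$avesteps
  }
}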
```
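The row-by-row loop can be slow on a data set with many rows. A vectorized alternative is sketched below using base R's `ave()`. This is an illustration on made-up data, not the method used in this report: `toy_steps` and its values are hypothetical.

```{r}
# Toy data (hypothetical values): two days, three intervals, two NAs
toy_steps <- data.frame(
  steps    = c(NA, 4, 6, 2, NA, 10),
  interval = c(0, 5, 10, 0, 5, 10)
)

# ave() returns, for every row, the mean of that row's interval group
toy_interval_mean <- ave(toy_steps$steps, toy_steps$interval,
                         FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval mean
toy_filled <- toy_steps
toy_missing <- is.na(toy_filled$steps)
toy_filled$steps[toy_missing] <- toy_interval_mean[toy_missing]
toy_filled$steps
```

Here interval 0 has the single observed value 2 and interval 5 the single observed value 4, so those values fill the two gaps.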
4. Next, we compare the plot, the mean, and the median with those from the first part, which still contained missing data:
```{r}
by_day2 <- group_by(newmytbldf, date)
sum_by_day2 <- summarize(by_day2, sumsteps=sum(steps))
# index the days as integers (1, 2, ...) so they display nicely in the plot;
# factor() makes this work whether date was read as character or as factor
days2 <- as.numeric(factor(sum_by_day2$date))
sum_by_day2 <- cbind(sum_by_day2, days2)
# display a histogram plot
qplot(days2, data=sum_by_day2, weight=sumsteps, geom="histogram", binwidth=1, xlab="Days", ylab="Total Steps Taken")
```
The mean total number of steps per day is:
```{r}
meansteps2 <- mean(sum_by_day2$sumsteps)
meansteps2
```
While the median of the total number of steps per day is calculated as:
```{r}
mediansteps2 <- median(sum_by_day2$sumsteps)
mediansteps2
```
Comparing the two plots visually, there is little difference except that the gaps left by days with missing data are now filled in. The mean is unchanged, and the median of the imputed data is higher by only about one step.
## Are there differences in activity patterns between weekdays and weekends?
1. The following code creates a new factor variable that identifies whether a row falls on a weekday or a weekend:
```{r}
cdays <- weekdays(as.Date(newmytbldf$date))
newmytbldf$cdays <- cdays
newmytbldf$daytype <- ""
for (i in 1:nrow(newmytbldf)) {
  if (newmytbldf[i,]$cdays == "Sunday" ||
      newmytbldf[i,]$cdays == "Saturday") {
    newmytbldf[i,]$daytype <- "weekend"
  } else {
    newmytbldf[i,]$daytype <- "weekday"
  }
}
newmytbldf$daytype <- as.factor(newmytbldf$daytype)
```
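Note that `weekdays()` returns day names in the session's locale, so the comparison with "Saturday" and "Sunday" assumes an English locale. A locale-independent, vectorized variant might look like the sketch below. The vector `toy_dates` is a made-up sample and all `toy_` names are hypothetical.

```{r}
# Toy dates (hypothetical sample): 2012-10-05 was a Friday
toy_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# "%u" formats the weekday as a number (1 = Monday, ..., 7 = Sunday),
# which does not depend on the language of the session
toy_daynum <- as.integer(format(toy_dates, "%u"))

# Days 6 and 7 are Saturday and Sunday
toy_daytype <- factor(ifelse(toy_daynum >= 6, "weekend", "weekday"),
                      levels = c("weekday", "weekend"))
toy_daytype
```

Because `ifelse()` works on the whole vector at once, no explicit loop over rows is needed.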
We then create a time series plot to investigate whether activity patterns differ between weekdays and weekends.
As the plot below shows, the highest peak in average steps occurs on weekdays, but on weekends more intervals appear to show relatively high step counts than on weekdays.
```{r}
dataSum <- ddply(newmytbldf, ~interval * daytype, summarise, meanSteps=mean(steps))
qplot(interval, meanSteps, data=dataSum,
      facets=.~daytype, geom="line",
      ylab="Average Steps taken",
      xlab="Interval")
```