forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
154 lines (102 loc) · 6.1 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
# Reproducible Research: Peer Assessment 1
It is now possible to collect a large amount of data about personal
movement using activity monitoring devices such as a
[Fitbit](http://www.fitbit.com), [Nike
Fuelband](http://www.nike.com/us/en_us/c/nikeplus-fuelband), or
[Jawbone Up](https://jawbone.com/up). These type of devices are part of
the "quantified self" movement -- a group of enthusiasts who take
measurements about themselves regularly to improve their health, to
find patterns in their behavior, or because they are tech geeks. But
these data remain under-utilized both because the raw data are hard to
obtain and there is a lack of statistical methods and software for
processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring
device. This device collects data at 5 minute intervals through out the
day. The data consists of two months of data from an anonymous
individual collected during the months of October and November, 2012
and include the number of steps taken in 5 minute intervals each day.
The data for this assignment can be downloaded here: [Activity monitoring data](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip) [52K]
The variables included in this dataset are:
* **steps**: Number of steps taking in a 5-minute interval (missing
values are coded as `NA`)
* **date**: The date on which the measurement was taken in YYYY-MM-DD
format
* **interval**: Identifier for the 5-minute interval in which
measurement was taken
The dataset is stored in a comma-separated-value (CSV) file and there
are a total of 17,568 observations in this
dataset.
## Loading and preprocessing the data
```{r load_process}
echo = TRUE
options(scipen=999)
data <- read.csv("activity.csv",
colClasses = c("integer","Date","integer"),
na.strings = "NA",
quote = "\"")
colnames(data) <- c("Steps","Day","Interval")
dataSumByDay <- aggregate(data$Steps,list(data$Day), function(x) sum(x,na.rm=TRUE))
colnames(dataSumByDay) <- c("Day", "Steps")
dataAvgByInterval <- aggregate(data$Steps,list(data$Interval), function(x) mean(x,na.rm=TRUE))
colnames(dataAvgByInterval) <- c("Interval", "AverageSteps")
```
The original dataset is loaded by the code above. Additionally, 2 summary datasets are derived from the original data: a summary by day and by 5-minute interval. Note, the aggregation functions will remove missing values.
## What is mean total number of steps taken per day?
```{r means}
hist(dataSumByDay$Steps,
breaks = 10,
xlab = "Steps Per Day",
main = "Activity")
meanPerDay <- mean(dataSumByDay$Steps, na.rm = TRUE)
medianPerDay <- median(dataSumByDay$Steps, na.rm = TRUE)
```
The histogram above shows the frequency of total steps per day over the 61 days of opbsrevation.
The measures of centrality are as follows:
**Mean** total number of steps taken per day: `r meanPerDay`
**Median** total number of steps taken per day: `r medianPerDay`
## What is the average daily activity pattern?
```{r avg_daily_pattern}
plot(dataAvgByInterval$Interval,dataAvgByInterval$AverageSteps,
main="Average Daily Activity Pattern",
type="l",
xlab="Interval",
ylab="Average Steps")
maxStepIntervals <- dataAvgByInterval[which(dataAvgByInterval$AverageSteps == max(dataAvgByInterval$AverageSteps, na.rm = TRUE)),]
```
On average across all the days in the dataset, the 5-minute interval that contains the maximum number of steps was found to be **`r maxStepIntervals[1,1]`** with a average value of **`r maxStepIntervals[1,2]`** steps taken.
## Imputing missing values
```{r imputation}
missingData <- sum(is.na(data$Steps))
toImpute <- data[is.na(data$Steps),]$Interval
imputedVals <- unlist(lapply(toImpute, FUN = function(i){
dataAvgByInterval[dataAvgByInterval$Interval==i,]$AverageSteps
}))
dataWImputedValues <- data
dataWImputedValues[is.na(dataWImputedValues$Steps),]$Steps <- imputedVals
dataSumByDayWImputedValues <- aggregate(dataWImputedValues$Steps,list(dataWImputedValues$Day), function(x) sum(x,na.rm=TRUE))
colnames(dataSumByDayWImputedValues) <- c("Day", "Steps")
hist(dataSumByDayWImputedValues$Steps,
breaks = 10,
xlab = "Steps Per Day",
main = "Activity (missing values imputed to mean by interval)")
meanPerDayImputed <- mean(dataSumByDayWImputedValues$Steps, na.rm = TRUE)
medianPerDayImputed <- median(dataSumByDayWImputedValues$Steps, na.rm = TRUE)
```
There were **`r missingData`** observations with missing data.
These values were replaced by the code above using a strategy of replacement by average for the same interval over the entire dataset less the missing observations.
The new values of centrality with imputed data are as follows:
**Mean** total number of steps taken per day: `r meanPerDayImputed`
**Median** total number of steps taken per day: `r medianPerDayImputed`
The most significant impact of the imputation strategy employed is that the median and mean values are now the same at ***`r medianPerDayImputed`*** steps taken per day.
## Are there differences in activity patterns between weekdays and weekends?
```{r compute_weekday_weekend}
dataWImputedValues$Weekdays <- ifelse(weekdays(dataWImputedValues$Day)=="Saturday"|weekdays(dataWImputedValues$Day)=="Sunday","Weekend","Weekday")
dataAvgByWeekdayIntervalWImputedValues <- aggregate(dataWImputedValues$Steps,list(dataWImputedValues$Weekdays, dataWImputedValues$Interval), function(x) mean(x,na.rm=TRUE))
colnames(dataAvgByWeekdayIntervalWImputedValues) <- c("Weekday","Interval", "AverageSteps")
library(ggplot2)
ggplot(data = dataAvgByWeekdayIntervalWImputedValues, aes(x = Interval, y = AverageSteps)) +
geom_line(colour = "red", size = .8) +
facet_wrap(~Weekday, nrow = 2, ncol = 1) +
theme_grey()
```
As demonstrated above, activity patterns vary between weekdays and weekend days. This suggests that the imputation strategy could be refined by using an average value for the interval for the subset of data for weekdays or weekend days accordingly.