forked from rdpeng/RepData_PeerAssessment1
PA1_template.Rmd
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
```{r setoptions, echo=TRUE}
#Assumes activity.csv is in the same directory as this R Markdown file.
data = read.csv("activity.csv", header=TRUE, sep=",", na.strings="NA")
#transform the date column to the Date class
data$date = as.Date(as.character(data$date), "%Y-%m-%d")
```
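A quick sanity check after loading can catch a wrong path or a misparsed date early. The sketch below is not part of the original analysis. It assumes the three columns `steps`, `date`, and `interval` used in the rest of this document, and it reads a tiny inline sample (made-up values, stored in the hypothetical name `toy`) instead of `activity.csv` so that it runs on its own.

```r
# Hypothetical three-row sample standing in for activity.csv (values are made up)
csvText = "steps,date,interval
NA,2012-10-01,0
5,2012-10-02,0
7,2012-10-02,5"
toy = read.csv(text = csvText, header = TRUE, na.strings = "NA")
# Convert the date column explicitly, as done for the real data above
toy$date = as.Date(as.character(toy$date), "%Y-%m-%d")
# Show the column names and the class of each column
str(toy)
# Stop early if the columns do not have the expected types
stopifnot(inherits(toy$date, "Date"), is.integer(toy$steps))
```

The same `str()` and `stopifnot()` calls could be pointed at `data` once the real file has been read.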
## What is mean total number of steps taken per day?
```{r part2}
#ignore the missing values
dataTidy = data[!is.na(data$steps), ]
## create histogram of the total number of steps taken each day
#calculate the total sum of steps w.r.t. the date
sumStep = aggregate(steps~date, dataTidy, sum)
#plot graph
hist(sumStep$steps, col="red", main="Total number of steps taken each day", xlab="Steps Per Day")
#disable scientific notation
options(scipen=999)
#calculate the mean and median total number of steps taken per day
meanStep = mean(sumStep$steps)
medianStep = median(sumStep$steps)
```
* The __mean__ total number of steps taken per day: _`r meanStep`_
* The __median__ total number of steps taken per day: _`r medianStep`_
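The choice to drop missing values before summing affects which days enter the mean and median. As a minimal sketch on made-up data (the names `toy`, `toyTotals`, and `toyTotalsZero` are hypothetical), the formula method of `aggregate` omits rows with `NA` by default, so a day with no recorded steps disappears from the result, whereas summing with `na.rm = TRUE` would keep that day with a total of 0.

```r
# Toy data (made-up values): day 1 is entirely missing, days 2 and 3 are complete
toy = data.frame(
  steps = c(NA, NA, 10, 20, 30, 50),
  date = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)
# The formula method drops rows with NA by default (na.action = na.omit),
# so the all-missing day does not appear in the result
toyTotals = aggregate(steps ~ date, toy, sum)
print(toyTotals)
# By contrast, tapply with na.rm = TRUE keeps that day with a total of 0
toyTotalsZero = tapply(toy$steps, toy$date, sum, na.rm = TRUE)
print(toyTotalsZero)
mean(toyTotals$steps)  # mean over the two observed days
mean(toyTotalsZero)    # mean over three days, pulled down by the 0
```

Treating an unrecorded day as zero steps would understate the mean, which is one reason to exclude such days as done above.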
## What is the average daily activity pattern?
```{r part3}
## create time series plot of the 5-minute interval and
## the average number of steps taken
#calculate the average of steps w.r.t. the time interval
avgStep = aggregate(steps~interval, dataTidy, mean)
#plot the average steps against the interval identifiers
plot(avgStep$interval, avgStep$steps, type="l", xlab="5-minute interval identifiers", ylab="Steps", main="Average number of steps across all days")
#check which 5-minute interval contains the maximum number of steps
maxIdx = which.max(avgStep$steps)
maxInterval = avgStep[maxIdx, 'interval']
maxAvgStep = avgStep[maxIdx, 'steps']
```
* On average across all the days in the dataset, the 5-minute interval __`r maxInterval`__ contains the maximum number of steps (_`r maxAvgStep`_ steps)
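The interval identifier reported above is easier to read as a time of day. The sketch below assumes that identifiers encode clock time as hour * 100 + minute (so 835 would mean 08:35), which is how they appear in this dataset but is not stated in this report. `intervalToClock` is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: format an interval identifier such as 835 as "08:35",
# assuming identifier = hour * 100 + minute
intervalToClock = function(x) {
  x = as.integer(x)
  sprintf("%02d:%02d", x %/% 100L, x %% 100L)
}
# Example identifiers chosen for illustration only
intervalToClock(c(0, 55, 100, 835, 2355))
# "00:00" "00:55" "01:00" "08:35" "23:55"
```

Under that assumption, `intervalToClock(maxInterval)` would give the clock time of the busiest interval.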
## Imputing missing values
```{r part4_0}
#calculate the total number of missing values in the dataset
totalNA = sum(is.na(data$steps))
```
* The total number of missing values in the dataset: __`r totalNA`__
To fill in the missing values in the dataset, I replaced each one with the mean of its 5-minute interval across all days.
```{r part4}
#fill in all of the missing values in the dataset
# STRATEGY: use the mean of the corresponding 5-minute interval
#get the NA index
naIdx = which(is.na(data$steps))
naInterval = data[naIdx, 'interval'] #get the corresponding intervals
#look up the average steps for each of those intervals
fillSteps = avgStep$steps[match(naInterval, avgStep$interval)]
#create a new dataset that is equal to the original dataset
# BUT with the missing data filled in
dataNew = data
dataNew[naIdx, 'steps'] = fillSteps #fill in missing data
#again, calculate the total sum of steps w.r.t. the date
sumStepNew = aggregate(steps~date, dataNew, sum)
#plot graph
hist(sumStepNew$steps, col="red", main="Total number of steps taken each day \n (new dataset filled missing data)", xlab="Total Number of Steps")
#calculate the mean and median total number of steps taken per day
meanStepNew = mean(sumStepNew$steps)
medianStepNew = median(sumStepNew$steps)
```
### After filling the missing values
* The __mean__ total number of steps taken per day: _`r meanStepNew`_
* The __median__ total number of steps taken per day: _`r medianStepNew`_
The mean number of steps taken per day is the same before and after filling in the missing data (`r meanStepNew`).
However, the median differs slightly (before filling missing data: `r medianStep`, after filling missing data: `r medianStepNew`). This is probably because the missing values were filled with 5-minute interval means: a day with no recorded data then receives a total close to the overall daily mean, which pulls the median toward the mean.
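The following sketch illustrates, on made-up data, how the mean can stay the same while the median shifts. It assumes the missing values cover a whole day, which is an assumption for the illustration and not something verified in this report. All names prefixed with `toy` are hypothetical.

```r
# Toy data (made-up values): four days, two intervals each; day 1 is all NA
toy = data.frame(
  steps = c(NA, NA, 10, 20, 15, 25, 35, 75),
  date = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03", "2012-10-04"), each = 2)),
  interval = rep(c(0, 5), times = 4)
)
# Interval means from the observed rows: interval 0 -> 20, interval 5 -> 40
toyAvg = aggregate(steps ~ interval, toy, mean)
# Fill each NA with the mean of its interval, looked up with match()
toyNew = toy
toyNaIdx = which(is.na(toy$steps))
toyNew$steps[toyNaIdx] = toyAvg$steps[match(toy$interval[toyNaIdx], toyAvg$interval)]
# Daily totals before filling (3 days) and after filling (4 days)
toyBefore = aggregate(steps ~ date, toy, sum)$steps
toyAfter = aggregate(steps ~ date, toyNew, sum)$steps
# The filled day totals 60, which equals the mean of the observed days
c(mean(toyBefore), mean(toyAfter))      # 60 60
c(median(toyBefore), median(toyAfter))  # 40 50
```

Adding a day whose total equals the existing mean leaves the mean unchanged but moves the median toward it.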
## Are there differences in activity patterns between weekdays and weekends?
```{r part5}
#create a new factor variable "dateIs" with two levels "weekday" and "weekend"
#NOTE: weekdays() returns locale-dependent names; this assumes an English locale
dataNew$dateIs = factor(ifelse(weekdays(dataNew$date) %in% c("Saturday", "Sunday"), "weekend", "weekday"))
#calculate the average of steps w.r.t. the time interval and dateIs
avgStepDateIs = aggregate(steps ~ interval + dateIs, data=dataNew, FUN=mean)
library(lattice)
xyplot( steps ~ interval | dateIs, data = avgStepDateIs, type="l", layout=c(1,2), xlab="Interval", ylab="Number of steps")
```
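Because `weekdays()` returns day names in the current locale, comparing against "Saturday" and "Sunday" only works in an English locale. One locale-independent alternative, sketched here on a few sample dates, uses the numeric day of the week from `as.POSIXlt`. This is a possible variation, not the method used above; the names `d`, `wday`, and `dayType` are hypothetical.

```r
# Sample dates: 2012-10-05 is a Friday, 2012-10-06 a Saturday,
# 2012-10-07 a Sunday, 2012-10-08 a Monday
d = as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, regardless of locale
wday = as.POSIXlt(d)$wday
dayType = factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"))
data.frame(date = d, wday = wday, dayType = dayType)
```

Applied to the analysis, the same expression with `dataNew$date` in place of `d` would produce the `dateIs` factor without depending on day names.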