---
title: 'Reproducible Research: Peer Assessment 1'
output:
  pdf_document: default
  html_document:
    keep_md: yes
---
## Loading and preprocessing the data
Before analysis, we prepare the data in three steps:

- unzip the archive
- read the extracted file as a CSV
- convert the `date` column from character to `Date`

The data source can be downloaded [here](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip). The pre-processing code is shown below.
``` {r chunk1,echo=TRUE}
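# Extract the archive into ./data and read the CSV it contains
unzip("activity.zip", exdir = "./data")
out_file <- "./data/activity.csv"
play_data <- read.csv(out_file, header = TRUE, sep = ",", na.strings = "NA",
                      stringsAsFactors = FALSE)
# Convert the date column from character to Date
play_data$date <- as.Date(play_data$date, format = "%Y-%m-%d")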
```
Now the data is *cleaned up* and ready for analysis. Recall that the data consists of three columns:

- **steps:** raw number of steps walked in a 5-minute interval. `NA` values exist.
- **date:** the year/month/day of the activity
- **interval:** a 5-minute interval from 00:00 to 23:55 in a 24-hour period
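
The pre-processing chunk assumes that `activity.zip` is already in the working directory. As a minimal sketch, the archive could be fetched from the link above only when it is missing. The helper name `fetch_if_missing` is hypothetical and not part of the original analysis, and the download call is left commented out so that nothing is fetched by accident.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
# Returns TRUE if a download was attempted, FALSE if the file was already there.
fetch_if_missing <- function(url, dest) {
  if (file.exists(dest)) {
    return(FALSE)
  }
  # mode = "wb" keeps the zip archive intact on Windows
  download.file(url, destfile = dest, mode = "wb")
  TRUE
}

data_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
# fetch_if_missing(data_url, "activity.zip")  # uncomment to fetch the archive
```

Checking for the file first keeps repeated knits from downloading the same archive again.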
## What is mean total number of steps taken per day?
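We first sum the steps taken on each day, then take the mean and median of those daily totals. `NA` values are left out of both the sums and the denominator of the mean: with the formula interface, the `aggregate` function drops rows whose `steps` value is `NA` before grouping, so a day with no recorded steps does not appear in the result at all. For example, if there are 10 days and 2 of them have only `NA` values, we would add up the 8 remaining daily totals and divide by 8 to get the mean. We summarize steps by date.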
```{r chunk2-plot1,echo=TRUE}
# Total steps per day; rows with NA steps are dropped by aggregate()
steps_count <- aggregate(steps ~ date, data = play_data, FUN = sum)
steps_mean <- as.integer(mean(steps_count$steps))
steps_median <- median(steps_count$steps)
print(paste("Mean steps per day:", steps_mean, "Median steps per day:", steps_median))
```
Now that we know the mean (`r steps_mean`) and median (`r steps_median`) number of steps *per day*, let's plot a histogram of the daily totals, with a vertical line at `steps_mean` marking the average number of steps taken *per day*.
``` {r chunk3-plot1,echo=TRUE,fig.align="center",fig.height=4}
par(mar = c(5, 4, 3, 1), las = 1)
hist(steps_count$steps, breaks = 20, col = "red", xlab = "Total Steps in a Day",
     ylab = "Frequency of Days", main = "Frequency of Total Steps")
rug(steps_count$steps)
# Mark the mean with a vertical line and label it
abline(v = steps_mean, col = "blue", lwd = 2)
text_x <- steps_mean
text_y <- 10
text(text_x, text_y, paste("Mean:", steps_mean), pos = 2)
```
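
To make the handling of missing values in this section concrete, here is a minimal sketch on a made-up three-day data set. The `toy` data frame and its values are illustrative assumptions, not part of the activity data. It shows that the formula interface of `aggregate` drops `NA` rows before summing, so a day with no recorded steps disappears from the totals and from the denominator of the mean.

```r
# Toy data (made up for illustration): three days, two readings each
toy <- data.frame(
  steps = c(10, NA, NA, NA, 5, 7),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# The formula method drops rows with NA before grouping (na.action = na.omit),
# so 2012-10-02, which has no recorded steps, is absent from the result
toy_totals <- aggregate(steps ~ date, data = toy, FUN = sum)
print(toy_totals)

# The mean is taken over the two remaining days only: (10 + 12) / 2
toy_mean <- mean(toy_totals$steps)
print(toy_mean)
```

Had the all-`NA` day been kept with a total of zero, the mean would have been pulled down, which is why the distinction matters for the histogram above.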
## What is the average daily activity pattern?
The average daily activity pattern describes how many steps one walks, on average, at each time of day (from `00:00` to `23:55`). Here, we are interested in averaging values over a common `interval` factor rather than a `date` factor. We continue with the variables from the previous section.
```{r chunk4-plot2,echo=TRUE}
# Mean steps for each 5-minute interval across all days (NA rows are dropped)
avgs_interval <- aggregate(steps ~ interval, data = play_data, FUN = mean)
max_average_steps_interval <- as.integer(max(avgs_interval$steps))
print(paste("Maximum average steps is:", max_average_steps_interval, "at interval",
            avgs_interval[which.max(avgs_interval$steps), 1]))
```
So, `interval` `r avgs_interval[which(avgs_interval$steps==max(avgs_interval$steps)),1]` has the highest average, about `r max_average_steps_interval` `steps`. Looks like this is a morning person! Keep in mind that these averages **ignore** `NA` values; the missing values are dealt with in the next section.
```{r chunk5-plot2, echo=TRUE,fig.align="center",fig.height=4}
plot(x=avgs_interval$interval,y=avgs_interval$steps,type="l",
xlab="5 Minute Interval 00:00 to 23:55",ylab="Average Step Count",
main="Average Steps Taken by Interval Over All Days Observed")
abline(h=max_average_steps_interval,col="red")
```
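
The `interval` values on the x-axis are easier to read as clock times. The sketch below assumes that each identifier encodes hours times 100 plus minutes (so `835` would mean 08:35), which should be checked against the data before relying on it. The helper name `interval_to_clock` is hypothetical.

```r
# Hypothetical helper: turn an interval identifier such as 835 into "08:35",
# assuming the identifier encodes hours * 100 + minutes
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock_labels <- interval_to_clock(c(0, 55, 100, 835, 2355))
print(clock_labels)
```

Under the same assumption, the identifiers jump from 55 to 100 at each hour boundary, so a line plot against the raw identifier spaces the points unevenly; converting to clock time or to minutes since midnight avoids that.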
## Imputing missing values
There are a number of `NA` values for steps. In fact, there are **`r sum(is.na(play_data$steps))`** `NA` values. Let's make a new data set, `clean_data`, in which each `NA` is replaced by the mean number of steps for that 5-minute interval. To clean the data, I chose a for-loop for ease of use.
```{r chunk6-plot3,echo=TRUE}
clean_data <- play_data
head(clean_data, 6)
# Get the mean of steps by interval
tapp_avgs <- tapply(play_data$steps, play_data$interval, mean, na.rm = TRUE)
# If NA exists, replace NA with mean value for current interval
for (i in seq_along(clean_data$steps)) {
  if (is.na(clean_data$steps[i])) {
    clean_data$steps[i] <- tapp_avgs[as.character(clean_data$interval[i])]
  }
}
head(clean_data, 6)
# How many NAs are left?
sum(is.na(clean_data))
```
Now that we have cleaned the data and replaced the `NA` values, let's plot again. For comparison, the original histogram is shown alongside the new one.
``` {r chunk7-plot3,echo=TRUE,fig.width=9}
par(mfrow = c(1, 2))
clean_steps_count <- aggregate(steps ~ date, data = clean_data, FUN = sum)
clean_steps_mean <- as.integer(mean(clean_steps_count$steps))
clean_steps_median <- as.integer(median(clean_steps_count$steps))
hist(clean_steps_count$steps, breaks = 20, col = "blue", xlab = "Total Steps in a Day",
     ylab = "Frequency of Days", main = "Frequency of Total Steps\n with NA Replacement")
rug(clean_steps_count$steps)
abline(v = clean_steps_mean, col = "red", lwd = 2)
# Show the old plot for a side-by-side comparison
hist(steps_count$steps, breaks = 20, col = "red", xlab = "Total Steps in a Day",
     ylab = "Frequency of Days", main = "Frequency of Total Steps\n without NA Replacement")
rug(steps_count$steps)
abline(v = steps_mean, col = "blue", lwd = 2)
```
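The new mean is `r clean_steps_mean` and the new median is `r clean_steps_median`, compared with a mean of `r steps_mean` and a median of `r steps_median` before the `NA` replacement.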
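
The interval-mean replacement used in this section can also be written without an explicit loop. The following is a minimal sketch on a made-up data frame (the `toy` values are illustrative assumptions) that uses base R's `ave` to compute the per-interval mean for every row and then fills only the missing entries. It is an alternative formulation, not the method used in the chunks above.

```r
# Toy data (made up for illustration): two intervals observed on three days
toy <- data.frame(
  steps    = c(NA, 4, 2, 8, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of the observed steps within each interval, repeated for every row
interval_means <- ave(toy$steps, toy$interval,
                      FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values are left untouched
missing <- is.na(toy$steps)
toy$steps[missing] <- interval_means[missing]
print(toy)
```

Because each missing value is replaced by the mean of its own interval, the per-interval averages stay the same after filling; only the daily totals change.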
## Are there differences in activity patterns between weekdays and weekends?
Using the `clean_data` data set, which has no `NA` values left, let's see whether the user is more active on weekdays (Monday-Friday) or on weekends!
``` {r chunk8-plot4, echo=TRUE}
clean_data$weekdays <- weekdays(clean_data$date)
# Classify by day number rather than by the order in which day names appear:
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, in any locale
is_weekend <- as.POSIXlt(clean_data$date)$wday %in% c(0, 6)
wkday <- which(!is_weekend)
wknd <- which(is_weekend)
weekday_count <- aggregate(steps ~ interval,mean,data=clean_data[wkday,])
weekday_count$wk <- "weekday"
weekend_count <- aggregate(steps ~ interval,mean,data=clean_data[wknd,])
weekend_count$wk <- "weekend"
nw_data <- rbind(weekday_count,weekend_count)
library(lattice)
xyplot( steps ~ interval | wk, data=nw_data,type="l",layout=c(2,1),
at = seq(min(nw_data$interval),max(nw_data$interval),max(nw_data$interval)/20),
main="Average Number of Steps Walked by Interval and Day Type")
```
So it looks like our person of interest is more active on the **average** weekend day than on the **average** weekday. By how much? Let's look!
``` {r chunk9-plot4,echo=TRUE}
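# Percent by which average weekday activity falls short of average weekend activity
pct_less <- round(100 * (sum(weekend_count$steps) - sum(weekday_count$steps)) /
                    sum(weekend_count$steps), digits = 1)
print(paste("This person is", pct_less, "percent less active on weekdays than weekends."))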
```
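
One locale-independent way to label dates as weekday or weekend is sketched below on four made-up dates. It relies on `as.POSIXlt()$wday`, which numbers days from 0 (Sunday) to 6 (Saturday), instead of on the day names returned by `weekdays()`. The variable names are illustrative assumptions.

```r
# Four consecutive dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday and 6 for Saturday, regardless of locale
day_type <- factor(
  ifelse(as.POSIXlt(dates)$wday %in% c(0, 6), "weekend", "weekday"),
  levels = c("weekday", "weekend")
)
print(data.frame(date = dates, day_type = day_type))
```

A factor with fixed levels keeps the panel order stable when it is used as a conditioning variable in a lattice plot.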