forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
81 lines (66 loc) · 2.98 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
# Reproducible Research: Peer Assessment 1
## Loading and preprocessing the data
The data for this report is found inside the activity.zip archive found in this repository. The following code will extract it, if necessary, and massage the data to the correct format.
```{r}
if(!file.exists('activity.csv')){
unzip('activity.zip')
}
activitydata <- read.csv('activity.csv', stringsAsFactors=F)
activitydata$date <- as.Date(activitydata$date)
```
## What is mean total number of steps taken per day?
Ignoring `NA` values a histogram of steps per day looks as follows.
```{r}
hist(activitydata$steps)
mean(activitydata$steps, na.rm=T)
median(activitydata$steps, na.rm=T)
```
The mean of steps daily is `r mean(activitydata$steps, na.rm=T)` and the median is `r median(activitydata$steps, na.rm=T)`
## What is the average daily activity pattern?
To view daily walking patterns it is helpful to view a plot of average steps over 5-minute intervals.
```{r}
aggregatedactivity <- aggregate(steps~interval,data=activitydata, FUN=mean)
plot(aggregatedactivity$interval, aggregatedactivity$steps, type='l', ylab="Average Number of Steps", xlab="Day Interval")
```
There is a large increase in steps per interval in the morning, with a steep decline, followed by a trailing off in the evening.
```{r}
indexOfMax <- which.max(aggregatedactivity$steps)
aggregatedactivity[indexOfMax,]
```
By looking for the row with the maximum number of steps, we can see the interval with the max number of steps is `r aggregatedactivity[indexOfMax,]$interval`
## Imputing missing values
By summing the number of missing `steps` values
```{r, results='hide'}
sum(is.na(activitydata$steps))
```
We can see there are `r sum(is.na(activitydata$steps))` missing values.
By setting `NA` values to the mean for that interval we can impute values. Imputed data is saved in `imputedActivity`
```{r}
imputedActivity <- activitydata
for(i in 1:nrow(aggregatedactivity)){
logicalVector <- (imputedActivity$interval==aggregatedactivity[i,1] & is.na(imputedActivity$steps))
imputedActivity[logicalVector, 1] <- aggregatedactivity[i,2]
}
```
Using code similar to above we can make a histogram of this imputed data and calculate mean and median.
```{r}
hist(imputedActivity$steps)
mean(imputedActivity$steps, na.rm=T)
median(imputedActivity$steps, na.rm=T)
```
The mean and median stay the same because imputing with the mean reinforces the average. The histogram shows there are now more recorded observations.
## Are there differences in activity patterns between weekdays and weekends?
People walk less on weekends.
```{r}
imputedActivity$dateType <- sapply(weekdays(imputedActivity$date), FUN=function(dayName){
if(grepl('^S',dayName)){
return("weekend")
}else{
return("weekday")
}
})
imputedActivity$dateType <- as.factor(imputedActivity$dateType)
aggregatedImputed <- aggregate(steps~interval+dateType, data=imputedActivity, FUN=mean)
library(ggplot2)
ggplot(aggregatedImputed, aes(x=interval, y=steps))+ geom_line() + facet_grid(dateType~.)
```