forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
101 lines (75 loc) · 3.21 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
---
title: "Reproducible Research: Peer Assessment 1"
output:
html_document:
keep_md: true
---
## Loading and preprocessing the data
```{r setoptions, echo=TRUE}
Sys.setlocale(category = 'LC_ALL', locale='en_US.UTF-8')
# unzip activity.zip, if not already unzipped
if (!file.exists('activity.csv'))
unzip(zipfile = 'activity.zip', exdir = '.')
# read data
data <- read.csv(file = 'activity.csv', header = TRUE, sep = ',')
data$date <- as.Date(as.character(data$date), '%Y-%m-%d')
tidy <- data[!is.na(data$steps), ]
```
## What is mean total number of steps taken per day?
```{r echo=TRUE}
d <- aggregate(steps~date, tidy, sum)
plot(d$date, d$steps, type="h", ylab = 'Total steps', xlab = 'Date',
main = 'Total steps in 2012')
```
- The *mean* total number of steps taken per day is **`r mean(d$steps, na.rm = T)`**
- The *median* total number of steps taken per day: **`r median(d$steps, na.rm = T)`**
## What is the average daily activity pattern?
```{r echo=TRUE}
dinterval <- aggregate(steps~interval, tidy, mean)
plot(dinterval, type='l', ylab='Average number of steps taken', xlab='interval of 5 minutes')
# find max index
index <- which.max(d$steps)
```
Maximun 5-minute interval is entry number **`r index`** with values **interval `r d[index, 1]`** and **steps mean `r d[index, 2]`**
## Imputing missing values
### Number of missing values in the dataset
```{r echo=TRUE}
numberOfNA <- nrow(data[is.na(data$steps),])
```
- Total number of missing values in the dataset is **`r numberOfNA`**
Filling the missing steps values by assigning the average for similar intervals.
```{r echo=TRUE}
fixed <- data
fixedNA <- is.na(fixed$steps)
fixed[fixedNA, 1] <- sapply(fixed[fixedNA, 3], function(x) dinterval[dinterval$interval == x, 2])
f <- aggregate(steps~date, fixed, sum)
plot(f$date, f$steps, type="h", ylab = 'Total steps', xlab = 'Date',
main = 'Total steps in 2012')
```
- The *mean* total number of steps taken per day is **`r mean(f$steps, na.rm=T)`**
- The *median* total number of steps taken per day: **`r median(f$steps, na.rm=T)`**
**What is the impact of imputing missing data on the estimates of the total daily number of steps?**
*There is no impact on the mean total number of steps. And there is a small impact on the median.*
## Are there differences in activity patterns between weekdays and weekends?
```{r echo=TRUE}
# helper funciton to determine if a date is a weekday or a weekend
isWeekend <- function(x) {
wd <- weekdays(x)
if (wd %in% c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'))
'weekday'
else
'weekend'
}
# add a new field to the dataset which indicate if the date is a weekday or a weekend
fixed$daytype <- sapply(fixed$date, isWeekend)
# calc steps mean
wdata <- aggregate(steps~interval+daytype, fixed, mean)
#weekenddata <- wdata[wdata$daytype=='weekend', c(1,3)]
#weekdaydata <- wdata[wdata$daytype=='weekday', c(1,3)]
#par(mfrow = c(2,1))
#plot(weekdaydata, type='l', ylab="Mean steps", xlab="Interval")
#plot(weekenddata, type='l', ylab="Mean steps", xlab="Interval")
library(lattice)
xyplot( steps ~ interval | daytype, wdata, type="l", xlab="Interval", ylab="Average of steps", layout=c(1,2))
```
It seems that there is more steps on weekends