forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
176 lines (121 loc) · 6.1 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
---
title: 'Repdata_Assignment: Course Project 1'
author: "Clara Hyunyoung Shin"
date: "February 21, 2016"
output: html_document
---
```{r setup, include=FALSE, echo = TRUE}
knitr::opts_chunk$set(echo = TRUE)
```
## 1. Code for reading in the dataset and/or processing the data
First, read in the data and store the raw data into `dat` parameter.
```{r reading in data, echo = TRUE}
setwd("~/Documents/PA1_template")
dat <- read.csv("activity.csv")
```
Let's create a subset called `dat1`, which does not contain any `NA`s.
There are 53 days out of 61 days which have actual steps records.
```{r subsetting into dat1, echo = TRUE}
dat1 <- subset(dat, steps != "NA")
c(length(unique(dat$date)), length(unique(dat1$date)))
```
Because `date` is a factor, change it into a date format.
```{r date factor, echo = TRUE}
dat1$date <- as.Date(dat1$date)
```
## 2. Histogram of the total number of steps taken each day
We will use `plyr` package for rearrange the data. Using `ddply` and `summarise` functions, we can summarize total steps by date. Let's store this new data into `dat2`.
```{r calculate total steps, echo = TRUE}
library(plyr)
dat2 <- ddply(dat1, "date", summarise, totalsteps = sum(steps))
head(dat2)
```
A barplot and a histogram are different. While a barplot depicts the steps by each day, histogram shows the distribution of the steps.
This is the visualization of total steps per day:
```{r barplot, echo = TRUE}
plot(dat2$date, dat2$totalsteps, type = "h", xlab = "Date", ylab = "Total Number of Steps")
```
And this is a **histogram**:
```{r histogram, echo = TRUE}
hist(dat2$totalsteps, breaks=10, xlab = "Total Steps Per Day", main = "Histograms of Total Steps Per Day")
```
## 3. Mean and median number of steps taken each day
The mean is **10766.19** and the median is **10765.00**.
```{r mean and median, echo = TRUE}
dat3 <- c(mean(dat2$totalsteps), median(dat2$totalsteps))
names(dat3) <- c("mean", "median")
dat3
```
## 4. Time series plot of the average number of steps taken
We will use `ddply` and `summarise` functions, so we can get summarized values of average steps by interval. Let's store this new data into `dat4`.
```{r average by interval, echo = TRUE}
dat4 <- ddply(dat1, "interval", summarise, avgsteps = mean(steps))
```
This is a time series plot of `dat4`, which is by 5-minute interval and average number of steps taken, averaged across all days.
```{r time series plot, echo = TRUE}
plot(dat4$interval, dat4$avgsteps, type = "l", xlab = "5-minute Interval", ylab = "Average Number of Steps")
```
## 5. The 5-minute interval that, on average, contains the maximum number of steps
The interval containing the maximum average number of steps is **835** minutes.
```{r which interval contain max, echo = TRUE}
dat4[which.max(dat4[, 2]), 1]
```
## 6. Code to describe and show a strategy for imputing missing data
First, we will get a subset of the original dataset, `dat`, which contains `NA` value. Let's store it into `dat5`. (Sorry for generic names! :p) There are **2304** rows of missing values.
```{r subset dat5, echo = TRUE}
dat5 <- dat[rowSums(is.na(dat)) > 0,]
nrow(dat5)
```
In section 4, we calculated average steps by interval and stored into `dat4`.
```{r dat4, echo = TRUE}
head(dat4)
```
We can use this data to impute the missing data! Instead of `NA` value, let's put `dat4`'s average steps by interval. There are 8 days of missing values. So, repeat the dataset 8 times.
```{r imputing data, echo = TRUE}
dat5$steps <- rep(dat4$avgsteps,8)
```
Then, combine two datasets - `dat1`, which we created first time, without any `NA` values, and `dat5`, data with `NA` values imputed with average values. And then sort by date!
```{r combine, echo = TRUE}
dat6 <- rbind(dat1, dat5)
dat6 <- dat6[with(dat6, order(date)), ]
dat6$steps <- as.numeric(dat6$steps)
```
## 7. Histogram of the total number of steps taken each day after missing values are imputed
Now we have `dat6`, an imputed data with complete date range. Let's summarize the total steps by date, using `ddply` and `summarise` functions. It is actually a repetition of what we did with `dat1` before.
```{r summarize dat6, echo = TRUE}
dat7 <- ddply(dat6, "date", summarise, totalsteps = sum(steps))
```
This is the histogram. The distribution looks similar, but overall values are increased a bit.
```{r histogram dat7, echo = TRUE}
hist(dat7$totalsteps, breaks=10, xlab = "Total Steps Per Day", main = "Histograms of Total Steps Per Day")
```
Using the same methods above, let's get the mean and median of the data. The mean is **10766.19** and the median is **10766.19**. The mean is same as the original data without missing values, but the median increased a little bit. It is actually same value as the mean. Coincidence?!
```{r mean and median dat8, echo = TRUE}
dat8 <- c(mean(dat7$totalsteps), median(dat7$totalsteps))
names(dat8) <- c("mean", "median")
dat8
```
## 8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends
First, I created a column called weekday using `weekdays` function. Then, created another column called day, having two levels, that if the weekday is Saturday or Sunday then it is weekend, else it is weekday.
```{r weekday, echo = TRUE}
dat6$weekday <- weekdays(dat6$date)
dat6$day[dat6$weekday == "Saturday" | dat6$weekday == "Sunday"] <- "weekend"
dat6$day[dat6$weekday == "Monday" | dat6$weekday == "Tuesday" |
dat6$weekday == "Wednesday" | dat6$weekday == "Thursday" |
dat6$weekday == "Friday"] <- "weekday"
```
Prior to make a panel plot, we need to summarize into average steps by interval and day.
```{r summarize dat9, echo = TRUE}
dat9 <- ddply(dat6, .(day, interval), summarise, avgsteps = mean(steps))
```
Let's use `lattice` package to make a panel plot. `xyplot` function helps you to do that!
```{r panel plot, echo = TRUE}
library(lattice)
xyplot(avgsteps ~ interval| factor(day),
data = dat9,
type = "l",
xlab = "Interval",
ylab = "Number of steps",
layout=c(1,2))
```
On weekdays, we could see a more interesting pattern. In the weekends, the time series had less extreme values and seem to be more evenly. :)