forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
132 lines (107 loc) · 4.6 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
---
title: "Peer Assignment 1 - Reproducible Research"
author: "Kiichi Takeuchi"
date: "July 16, 2014"
output: html_document
---
##Setup
First, setup your work directory path. For example, here is my path below. If you are using R Studio, you can also set it from Session > Set Working Directory > To Source File Location.
```{r pathsetup,echo=TRUE}
setwd("~/work/r/class/RepData_PeerAssessment1")
```
Below is the list of required R packages for this analysis:
* data.table
* ggplot2
* gridExtra
Download and extract Zip file if it was not extracted yet
```{r unzipping,echo=TRUE}
if (!file.exists("activity.zip")){
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip","activity.zip",method="curl")
}
if (!file.exists("activity.csv")){
unzip("activity.zip")
}
```
##Loading Data
Next, load the data file
```{r loading,echo=TRUE,results='hide'}
library(data.table)
data<-fread("activity.csv")
```
##Calculate Steps per Day and Histogram
```{r sum_per_day,echo=TRUE}
library(ggplot2)
sum_by_date <- data[,list(total=sum(steps,na.rm=T)),by=date]
```
Draw histogram of Steps per Day
```{r histgram_sum,echo=TRUE,fig.width=15,fig.height=7}
ggplot(sum_by_date, aes(x=total)) + geom_histogram()
```
##Mean and Median
```{r mean_median,echo=TRUE}
mean_val1<-mean(sum_by_date$total)
mean_val1
median_val1<-median(sum_by_date$total)
median_val1
```
##The average daily activity pattern
Time Series Plot would be generated as below. The 5-minute interval and the average number of steps taken, averaged across all days.
```{r interval, echo=TRUE,,fig.width=10,fig.height=5}
avg_interval <- data[,list(avg=mean(steps,na.rm=T)),by=interval]
head(avg_interval,5)
ggplot(data=avg_interval,aes(x=interval,y=avg)) + geom_line()
```
The max average steps and which interval is it?
```{r max_interval,echo=TRUE}
avg_interval[avg==max(avg_interval$avg),]
```
Total Number of NA (Missing Values)
```{r missing_na, echo=TRUE}
sum(is.na(data$steps))
```
##Imputing missing values
Here is the approach that I took to fill those missing values. I've used average value of each interval in order to replace NA in steps column. Before filling data, I merged original data table with average table by interval as the key. Make sure order by date and interval after the process.
```{r filling_na1, echo=TRUE}
data_avg<-merge(data,avg_interval,by="interval")[order(date,interval)]
head(data_avg,5)
```
After mergining tables, fill steps where it contains NA. I'm casting average steps as integer below. The second line is verification code. The number of NA should be zero.
```{r filling_na2, echo=TRUE}
data_f<-data # copy data first
data_f[is.na(data_f$steps),c('steps')]<-as.integer(data_avg[is.na(data_f$steps),avg])
head(data_f,5)
sum(is.na(data$steps))
```
Here is the same analysis that I did above but I use the new data without missing values: histogram, mean and median.
```{r histgram_sum_2,echo=TRUE,fig.width=15,fig.height=7}
sum_by_date_f <- data_f[,list(total=sum(steps,na.rm=T)),by=date]
ggplot(sum_by_date_f, aes(x=total)) + geom_histogram()
mean_val2<-mean(sum_by_date_f$total)
mean_val2
median_val2<-median(sum_by_date_f$total)
median_val2
```
The histogram has been filled and bars are more smooth than first data set. The mean value slightly increased from `r format(round(mean_val1, 2), nsmall = 2)` to `r format(round(mean_val2, 2), nsmall = 2)`. The median has been changed from `r format(round(median_val1, 2), nsmall = 2)` to `r `r format(round(median_val2, 2), nsmall = 2)`.
#Weekdays and Weekends
In this section, I will categorize date into two factors : weekday and weekend, and we explore the diffences.
```{r weekdays1, echo=TRUE}
days<-weekdays(as.Date(data_f$date,'%Y-%m-%d'))
data_f[,c('day_in_week')]<-days
data_f[,c('day_type')]<-factor(days=='Saturday' | days=='Sunday', labels=c('weekday','weekend'))
head(data_f,5)
```
Render weekend and weekday steps
```{r draw_weekend, echo=TRUE,fig.width=10,fig.height=10}
avg_weekend <- data_f[day_type == 'weekend',list(avg=mean(steps,na.rm=T)),by=interval]
g1<-ggplot(data=avg_weekend,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekends")
avg_weekday <- data_f[day_type == 'weekday',list(avg=mean(steps,na.rm=T)),by=interval]
g2<-ggplot(data=avg_weekday,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekdays")
library(gridExtra)
grid.arrange( g1, g2, nrow=2)
```
Patterns until 900 from the begining looks similar but after 1000, weekdays data shows less steps during daytime. This could be because of less activity during work hours.
Note:Generate .md document using knitr library
```{r gen,echo=TRUE}
#library(knitr)
#knit2html("PA1_template.Rmd")
```