---
title: "Peer Assignment 1 - Reproducible Research"
author: "Kiichi Takeuchi"
date: "July 16, 2014"
output: html_document
---
##Setup
First, set your working directory. My path is shown below as an example. If you are using RStudio, you can also set it from Session > Set Working Directory > To Source File Location.
```{r pathsetup,echo=TRUE}
setwd("~/work/r/class/RepData_PeerAssessment1")
```
Below is the list of required R packages for this analysis:

* data.table
* ggplot2

Download the zip file if it is not already present, and extract it if the CSV file has not been extracted yet.
```{r unzipping,echo=TRUE}
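# Download the archive only if it is missing; mode="wb" keeps the zip file intact on Windows
if (!file.exists("activity.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip","activity.zip",mode="wb")
}
# Extract only if the CSV is not already present
if (!file.exists("activity.csv")){
  unzip("activity.zip")
}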
```
##Loading Data
Next, load the data file.
```{r loading,echo=TRUE,results='hide'}
library(data.table)
data<-fread("activity.csv")
```
##Calculate Steps per Day and Histogram
```{r sum_per_day,echo=TRUE}
library(ggplot2)
sum_by_date <- data[,list(total=sum(steps,na.rm=TRUE)),by=date]
```
Draw the histogram of steps per day.
```{r histgram_sum,echo=TRUE,fig.width=15,fig.height=7}
ggplot(sum_by_date, aes(x=date, y=total)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,hjust = 1))
```
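
Note that the chart above draws one bar per date. A histogram in the strict sense bins the daily totals themselves and counts how many days fall into each bin. Here is a minimal base R sketch of that idea. The `toy_totals` values and the bin width of 5000 are made up for illustration and are not taken from activity.csv.

```r
# Hypothetical daily totals (illustration only, not the real data)
toy_totals <- c(0, 126, 8821, 10395, 11352, 12116, 13294, 15420, 21194)

# Bin the totals into 5000-step-wide bins; plot = FALSE returns the bins without drawing
h <- hist(toy_totals, breaks = seq(0, 25000, by = 5000), plot = FALSE)

# Number of days in each bin
print(h$counts)

# plot(h, main = "Total steps per day", xlab = "Steps") would draw the histogram
```

With the real data, the same call could be given `sum_by_date$total` instead of `toy_totals`.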
##Mean and Median
```{r mean_median,echo=TRUE}
mean_val1<-mean(sum_by_date$total)
mean_val1
median_val1<-median(sum_by_date$total)
median_val1
```
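
One thing to keep in mind when reading these numbers: `sum(steps, na.rm=TRUE)` returns 0 for a day whose values are all NA. If the data contains such days, they count as zero-step days and can pull the mean and median down. A small sketch with made-up values (`toy` is a hypothetical table, not the real data):

```r
# Hypothetical data: day "A" is fully recorded, day "B" has only missing values
toy <- data.frame(date  = c("A", "A", "B", "B"),
                  steps = c(10, 20, NA, NA))

# Total per day with na.rm = TRUE: the all-NA day becomes 0 rather than NA
totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
print(totals)

# Mean over all days versus mean over days with at least one recorded value
has_data <- tapply(!is.na(toy$steps), toy$date, any)
mean(totals)            # includes the artificial zero
mean(totals[has_data])  # ignores the all-NA day
```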
##The average daily activity pattern
The time series plot below shows the 5-minute interval on the x-axis and the number of steps taken, averaged across all days, on the y-axis.
```{r interval, echo=TRUE,fig.height=4}
avg_interval <- data[,list(avg=mean(steps,na.rm=TRUE)),by=interval]
head(avg_interval,5)
ggplot(data=avg_interval,aes(x=interval,y=avg)) + geom_line()
```
Which 5-minute interval has the maximum average number of steps?
```{r max_interval,echo=TRUE}
avg_interval[avg==max(avg_interval$avg),]
```
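
The interval identifiers appear to encode clock time as hour and minute (for example, 835 would mean 08:35). This reading is an assumption about the data set, not something stated above. Under that assumption, a hypothetical helper `to_clock()` could make the result easier to read:

```r
# Hypothetical helper: convert an interval id such as 835 to "08:35"
# (assumes the id is hour * 100 + minute)
to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

to_clock(c(0, 5, 835, 2355))
```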
Total number of missing values (NA) in the steps column:
```{r missing_na, echo=TRUE}
sum(is.na(data$steps))
```
##Imputing missing values
To fill the missing values, I replace each NA in the steps column with the average for that 5-minute interval. Before filling, I merge the original data table with the table of interval averages, using interval as the key. The merged table must then be ordered by date and interval so that its rows line up with the original data.
```{r filling_na1, echo=TRUE}
data_avg<-merge(data,avg_interval,by="interval")[order(date,interval)]
head(data_avg,5)
```
After merging the tables, fill in steps wherever it is NA. I cast the average steps to integer below. The last line is a verification: the number of NA values should be zero.
```{r filling_na2, echo=TRUE}
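data_f<-copy(data) # copy() makes a real copy; plain assignment of a data.table only copies the reference
data_f[is.na(data_f$steps),c('steps')]<-as.integer(data_avg[is.na(data_f$steps),avg])
head(data_f,5)
sum(is.na(data_f$steps)) # verify on the filled table; should be 0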
```
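
The same fill-in idea can be sketched in base R without a merge. This is an alternative illustration on made-up values, not the code used in this report; `toy`, `interval_mean`, and `steps_filled` are hypothetical names.

```r
# Hypothetical data: three days, two intervals, two missing values
toy <- data.frame(date     = rep(c("d1", "d2", "d3"), each = 2),
                  interval = rep(c(0, 5), times = 3),
                  steps    = c(NA, 10, 4, NA, 8, 20))

# Mean of the observed steps for each interval, repeated on every row of that interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; use the interval mean only where steps is NA
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

toy
sum(is.na(toy$steps_filled))  # counts any NA values that remain
```

Unlike the chunk above, this sketch does not cast the means to integer, so filled values may be fractional.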
Below, I repeat the analysis above (histogram, mean, and median) on the new data set with the missing values filled in.
```{r histgram_sum_2,echo=TRUE,fig.width=15,fig.height=7}
sum_by_date_f <- data_f[,list(total=sum(steps,na.rm=TRUE)),by=date]
ggplot(sum_by_date_f, aes(x=date, y=total)) + geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90,hjust = 1))
mean_val2<-mean(sum_by_date_f$total)
mean_val2
median_val2<-median(sum_by_date_f$total)
median_val2
```
With the missing values filled in, the gaps in the histogram are gone and the bars are smoother than in the first data set. The mean increased from `r format(round(mean_val1, 2), nsmall = 2)` to `r format(round(mean_val2, 2), nsmall = 2)`. The median changed from `r format(round(median_val1, 2), nsmall = 2)` to `r format(round(median_val2, 2), nsmall = 2)`.
##Weekdays and Weekends
In this section, I categorize each date as one of two factor levels, weekday or weekend, and explore the differences.
```{r weekdays1, echo=TRUE}
days<-weekdays(as.Date(data_f$date,'%Y-%m-%d'))
data_f[,c('day_in_week')]<-days
data_f[,c('day_type')]<-factor(days=='Saturday' | days=='Sunday', labels=c('weekday','weekend'))
head(data_f,5)
```
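
The comparison against 'Saturday' and 'Sunday' relies on `weekdays()` returning English day names, which depends on the locale. A locale-independent sketch uses the `wday` field of `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday. The name `day_type_of()` is a hypothetical helper, not part of the analysis above.

```r
# Hypothetical helper: classify dates without relying on locale-specific day names
day_type_of <- function(dates) {
  wday <- as.POSIXlt(as.Date(dates))$wday  # 0 = Sunday, ..., 6 = Saturday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# Friday, Saturday, Sunday, Monday
day_type_of(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
```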
Plot the average steps per interval on weekends.
```{r draw_weekend, echo=TRUE,fig.width=10,fig.height=5}
avg_weekend <- data_f[day_type == 'weekend',list(avg=mean(steps,na.rm=TRUE)),by=interval]
ggplot(data=avg_weekend,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekends")
```
Plot the average steps per interval on weekdays.
```{r draw_weekday, echo=TRUE,fig.width=10,fig.height=5}
avg_weekday <- data_f[day_type == 'weekday',list(avg=mean(steps,na.rm=TRUE)),by=interval]
ggplot(data=avg_weekday,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekdays")
```
The two patterns look similar from the start of the day until interval 900, but after interval 1000 the weekday data shows fewer steps during the daytime. This may be due to lower activity during work hours.
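
To put numbers on a comparison like this, the two averages can be lined up per interval and subtracted. A minimal base R sketch on made-up values (`toy` is hypothetical and does not reflect the real data):

```r
# Hypothetical data: two intervals, two observations per day type
toy <- data.frame(
  interval = rep(c(800, 1200), times = 4),
  day_type = rep(c("weekday", "weekend"), each = 4),
  steps    = c(60, 10, 80, 30, 20, 50, 40, 70)
)

# Average steps in a matrix: one row per interval, one column per day type
avg <- tapply(toy$steps, list(toy$interval, toy$day_type), mean)
print(avg)

# Positive values mean more steps on weekends for that interval
weekend_minus_weekday <- avg[, "weekend"] - avg[, "weekday"]
print(weekend_minus_weekday)
```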