PA1_template.Rmd

---
title: "Peer Assignment 1 - Reproducible Research"
author: "Kiichi Takeuchi"
date: "July 16, 2014"
output: html_document
---

##Setup
First, setup your work directory path. For example, here is my path below. If you are using R Studio, you can also set it from Session > Set Working Directory > To Source File Location.

```{r pathsetup,echo=TRUE}
setwd("~/work/r/class/RepData_PeerAssessment1")
```

Below is the list of required R packages for this analysis:
* data.table
* ggplot2
* gridExtra

Download and extract Zip file if it was not extracted yet
```{r unzipping,echo=TRUE}
if (!file.exists("activity.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip","activity.zip",method="curl")
}

if (!file.exists("activity.csv")){
  unzip("activity.zip")
}
```

##Loading Data
Next, load the data file
```{r loading,echo=TRUE,results='hide'}
library(data.table)
data<-fread("activity.csv")
```

##Calculate Steps per Day and Histogram
```{r sum_per_day,echo=TRUE}
library(ggplot2)
sum_by_date <- data[,list(total=sum(steps,na.rm=T)),by=date]
```

Draw histogram of Steps per Day
```{r histgram_sum,echo=TRUE,fig.width=15,fig.height=7}
ggplot(sum_by_date, aes(x=total)) + geom_histogram()
```

##Mean and Median
```{r mean_median,echo=TRUE}
mean_val1<-mean(sum_by_date$total)
mean_val1
median_val1<-median(sum_by_date$total)
median_val1
```

##The average daily activity pattern
Time Series Plot would be generated as below. The 5-minute interval and the average number of steps taken, averaged across all days.

```{r interval, echo=TRUE,,fig.width=10,fig.height=5}
avg_interval <- data[,list(avg=mean(steps,na.rm=T)),by=interval]
head(avg_interval,5)
ggplot(data=avg_interval,aes(x=interval,y=avg)) + geom_line()
```

The max average steps and which interval is it?
```{r max_interval,echo=TRUE}
avg_interval[avg==max(avg_interval$avg),]
```


Total Number of NA (Missing Values)
```{r missing_na, echo=TRUE}
sum(is.na(data$steps))
```

##Imputing missing values
Here is the approach that I took to fill those missing values. I've used average value of each interval in order to replace NA in steps column. Before filling data, I merged original data table with average table by interval as the key. Make sure order by date and interval after the process.
```{r filling_na1, echo=TRUE}
data_avg<-merge(data,avg_interval,by="interval")[order(date,interval)]
head(data_avg,5)
```
After mergining tables, fill steps where it contains NA. I'm casting average steps as integer below. The second line is verification code. The number of NA should be zero.
```{r filling_na2, echo=TRUE}
data_f<-data # copy data first
data_f[is.na(data_f$steps),c('steps')]<-as.integer(data_avg[is.na(data_f$steps),avg])
head(data_f,5)
sum(is.na(data$steps))
```

Here is the same analysis that I did above but I use the new data without missing values: histogram, mean and median.
```{r histgram_sum_2,echo=TRUE,fig.width=15,fig.height=7}
sum_by_date_f <- data_f[,list(total=sum(steps,na.rm=T)),by=date]
ggplot(sum_by_date_f, aes(x=total)) + geom_histogram()
mean_val2<-mean(sum_by_date_f$total)
mean_val2
median_val2<-median(sum_by_date_f$total)
median_val2
```

The histogram has been filled and bars are more smooth than first data set. The mean value slightly increased from `r format(round(mean_val1, 2), nsmall = 2)` to `r format(round(mean_val2, 2), nsmall = 2)`. The median has been changed from `r format(round(median_val1, 2), nsmall = 2)` to `r `r format(round(median_val2, 2), nsmall = 2)`. 


#Weekdays and Weekends
In this section, I will categorize date into two factors : weekday and weekend, and we explore the diffences.

```{r weekdays1, echo=TRUE}
days<-weekdays(as.Date(data_f$date,'%Y-%m-%d'))
data_f[,c('day_in_week')]<-days
data_f[,c('day_type')]<-factor(days=='Saturday' | days=='Sunday', labels=c('weekday','weekend'))
head(data_f,5)
```

Render weekend and weekday steps
```{r draw_weekend, echo=TRUE,fig.width=10,fig.height=10}
avg_weekend <- data_f[day_type == 'weekend',list(avg=mean(steps,na.rm=T)),by=interval]
g1<-ggplot(data=avg_weekend,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekends")

avg_weekday <- data_f[day_type == 'weekday',list(avg=mean(steps,na.rm=T)),by=interval]
g2<-ggplot(data=avg_weekday,aes(x=interval,y=avg)) + geom_line() + ggtitle("Average steps in Weekdays")
library(gridExtra)
grid.arrange( g1, g2, nrow=2)
```

Patterns until 900 from the begining looks similar but after 1000, weekdays data shows less steps during daytime. This could be because of less activity during work hours.


Note:Generate .md document using knitr library
```{r gen,echo=TRUE}
#library(knitr)
#knit2html("PA1_template.Rmd")
```