PA1_template.Rmd

# Reproducible Research: Peer Assessment 1


It is now possible to collect a large amount of data about personal
movement using activity monitoring devices such as a
[Fitbit](http://www.fitbit.com), [Nike
Fuelband](http://www.nike.com/us/en_us/c/nikeplus-fuelband), or
[Jawbone Up](https://jawbone.com/up). These type of devices are part of
the "quantified self" movement -- a group of enthusiasts who take
measurements about themselves regularly to improve their health, to
find patterns in their behavior, or because they are tech geeks. But
these data remain under-utilized both because the raw data are hard to
obtain and there is a lack of statistical methods and software for
processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring
device. This device collects data at 5 minute intervals through out the
day. The data consists of two months of data from an anonymous
individual collected during the months of October and November, 2012
and include the number of steps taken in 5 minute intervals each day.

The data for this assignment can be downloaded here: [Activity monitoring data](https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip) [52K]

The variables included in this dataset are:

* **steps**: Number of steps taking in a 5-minute interval (missing
    values are coded as `NA`)

* **date**: The date on which the measurement was taken in YYYY-MM-DD
    format

* **interval**: Identifier for the 5-minute interval in which
    measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and there
are a total of 17,568 observations in this
dataset.

## Loading and preprocessing the data

```{r load_process}
echo = TRUE
options(scipen=999)

data <- read.csv("activity.csv", 
                colClasses = c("integer","Date","integer"),
                na.strings = "NA", 
                quote = "\"")
colnames(data) <- c("Steps","Day","Interval")

dataSumByDay <- aggregate(data$Steps,list(data$Day), function(x) sum(x,na.rm=TRUE))
colnames(dataSumByDay) <- c("Day", "Steps")

dataAvgByInterval <- aggregate(data$Steps,list(data$Interval), function(x) mean(x,na.rm=TRUE))
colnames(dataAvgByInterval) <- c("Interval", "AverageSteps")

```

The original dataset is loaded by the code above. Additionally, 2 summary datasets are derived from the original data: a summary by day and by 5-minute interval. Note, the aggregation functions will remove missing values.  

## What is mean total number of steps taken per day?

```{r means}
hist(dataSumByDay$Steps, 
        breaks = 10,
        xlab = "Steps Per Day",
        main = "Activity")        
meanPerDay <- mean(dataSumByDay$Steps, na.rm = TRUE)
medianPerDay <- median(dataSumByDay$Steps, na.rm = TRUE)

```

The histogram above shows the frequency of total steps per day over the 61 days of opbsrevation.  

The measures of centrality are as follows:  

**Mean** total number of steps taken per day: `r meanPerDay`  
**Median** total number of steps taken per day: `r medianPerDay`

## What is the average daily activity pattern?

```{r avg_daily_pattern}

plot(dataAvgByInterval$Interval,dataAvgByInterval$AverageSteps, 
                main="Average Daily Activity Pattern",
                type="l",
                xlab="Interval",
                ylab="Average Steps")  

maxStepIntervals <- dataAvgByInterval[which(dataAvgByInterval$AverageSteps == max(dataAvgByInterval$AverageSteps, na.rm = TRUE)),]


```

On average across all the days in the dataset, the 5-minute interval that contains the maximum number of steps was found to be **`r maxStepIntervals[1,1]`** with a average value of **`r maxStepIntervals[1,2]`** steps taken.  

## Imputing missing values

```{r imputation}

missingData <- sum(is.na(data$Steps))

toImpute <- data[is.na(data$Steps),]$Interval
imputedVals <- unlist(lapply(toImpute, FUN = function(i){
                        dataAvgByInterval[dataAvgByInterval$Interval==i,]$AverageSteps
             }))

dataWImputedValues <- data
dataWImputedValues[is.na(dataWImputedValues$Steps),]$Steps <- imputedVals

dataSumByDayWImputedValues <- aggregate(dataWImputedValues$Steps,list(dataWImputedValues$Day), function(x) sum(x,na.rm=TRUE))
colnames(dataSumByDayWImputedValues) <- c("Day", "Steps")

hist(dataSumByDayWImputedValues$Steps,
        breaks = 10,
        xlab = "Steps Per Day",
        main = "Activity (missing values imputed to mean by interval)")
        
meanPerDayImputed <- mean(dataSumByDayWImputedValues$Steps, na.rm = TRUE)
medianPerDayImputed <- median(dataSumByDayWImputedValues$Steps, na.rm = TRUE)


```

There were **`r missingData`** observations with missing data.  

These values were replaced by the code above using a strategy of replacement by average for the same interval over the entire dataset less the missing observations. 

The new values of centrality with imputed data are as follows:  
**Mean** total number of steps taken per day: `r meanPerDayImputed`  
**Median** total number of steps taken per day: `r medianPerDayImputed`  

The most significant impact of the imputation strategy employed is that the median and mean values are now the same at ***`r medianPerDayImputed`*** steps taken per day.

## Are there differences in activity patterns between weekdays and weekends?


```{r compute_weekday_weekend}

dataWImputedValues$Weekdays <- ifelse(weekdays(dataWImputedValues$Day)=="Saturday"|weekdays(dataWImputedValues$Day)=="Sunday","Weekend","Weekday")

dataAvgByWeekdayIntervalWImputedValues <- aggregate(dataWImputedValues$Steps,list(dataWImputedValues$Weekdays, dataWImputedValues$Interval), function(x) mean(x,na.rm=TRUE))
colnames(dataAvgByWeekdayIntervalWImputedValues) <- c("Weekday","Interval", "AverageSteps")

library(ggplot2)

ggplot(data = dataAvgByWeekdayIntervalWImputedValues, aes(x = Interval, y = AverageSteps)) +
        geom_line(colour = "red", size = .8) +
        facet_wrap(~Weekday, nrow = 2, ncol = 1) +
        theme_grey()

```

As demonstrated above, activity patterns vary between weekdays and weekend days. This suggests that the imputation strategy could be refined by using an average value for the interval for the subset of data for weekdays or weekend days accordingly.