Reproducible Research: Peer Assessment 1

Sjors
2016-02-20

Load libraries

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(ggplot2)
library(lubridate)

Load the data

if (!file.exists("activity.csv")) {
  unzip("activity.zip")
}
activity <- read_csv("activity.csv")  # read_csv() already returns a tibble

Steps per day

Here we investigate the number of steps per day by calculating the total number of steps taken each day. Days with missing values are left out of the analysis.

Group the data by day and compute the total number of steps per day:

activity_by_day <- activity %>%
  group_by(date) %>%
  summarise(steps = sum(steps)) %>%
  filter(!is.na(steps))
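The order of the two steps above matters: `sum()` returns `NA` for any day that contains a missing value, and the filter then drops that day entirely. The following is a minimal base R sketch of that behaviour on a small, hypothetical data set (the `toy` values are made up for illustration and are not taken from `activity.csv`). It also shows what `na.rm = TRUE` would do instead.

```r
# Hypothetical toy data: three days with three intervals each
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
  steps = c(NA, NA, NA, 10, 20, 30, 5, NA, 15)
)

# sum() propagates NA, so a day with any missing value gets an NA total
totals_strict <- tapply(toy$steps, toy$date, sum)

# Dropping the NA totals keeps only the fully observed day
kept <- totals_strict[!is.na(totals_strict)]

# With na.rm = TRUE a fully missing day would count as 0 steps instead
totals_narm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(kept)
print(totals_narm)
```

Summing first and filtering afterwards avoids totals of 0 for days without measurements, which would otherwise pull the mean and median down.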

The histogram of the total number of steps taken each day:

qplot(steps, data = activity_by_day, binwidth = 800)

The mean of the distribution is:

mean(activity_by_day$steps)

## [1] 10766.19

And the median is:

median(activity_by_day$steps)

## [1] 10765

The daily activity?

What is the average daily activity pattern?

Here we investigate the average daily activity by calculating the average number of steps per 5-minute time interval across all days. Missing values are ignored.

Group the data by time interval and compute the average number of steps per interval:

activity_by_interval <- activity %>%
  filter(!is.na(steps)) %>%
  group_by(interval) %>%
  summarise(steps = mean(steps))

The time series looks like:

qplot(interval, steps, data = activity_by_interval,
      ylab = "Number of steps",
      geom = c("line"))

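The interval with the highest average number of steps is the following (the identifier encodes the time of day, so 835 stands for 08:35 rather than minute 835):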

row <- which.max(activity_by_interval$steps)
activity_by_interval[row,]$interval

## [1] 835
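The interval identifiers are easier to read as clock times. The sketch below assumes that the identifier encodes hours and minutes as `HHMM` (so the values jump from 55 to 100 at each full hour). The helper name `interval_to_clock` is hypothetical and not part of the original analysis.

```r
# Convert an HHMM-style interval identifier to a "HH:MM" string.
# Assumption: the hundreds encode the hour, the remainder the minute.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock_times <- interval_to_clock(c(0, 55, 100, 835, 2355))
print(clock_times)
```

Under that assumption the most active interval, 835, starts at 08:35 in the morning.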

Imputing missing values

There are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.

The number of rows with missing values is:

nrow(activity[is.na(activity$steps),])

## [1] 2304
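It can help to check how the missing values are distributed over the days. A day has 24 * 60 / 5 = 288 five-minute intervals, so if the values are missing in whole days, 2304 rows would correspond to 2304 / 288 = 8 days. The following base R sketch shows the check on a hypothetical toy data set (the `toy` values are made up for illustration).

```r
# Hypothetical toy data: three days with four intervals each,
# where the first day is missing entirely
toy <- data.frame(
  date     = rep(as.Date("2012-10-01") + 0:2, each = 4),
  interval = rep(c(0, 5, 10, 15), times = 3),
  steps    = c(NA, NA, NA, NA, 0, 10, 20, 30, 5, 0, 15, 40)
)

# Total number of rows with a missing step count
n_missing <- sum(is.na(toy$steps))

# Number of missing values per day
missing_per_day <- tapply(is.na(toy$steps), toy$date, sum)

print(n_missing)
print(missing_per_day)
```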

Our strategy for filling in the missing values is to use the average value for that 5-minute interval across all days. The following imputes the missing values in a new data frame:

# Look up the average number of steps for a given interval identifier
mean_steps <- function(x) {
  activity_by_interval$steps[activity_by_interval$interval == x]
}

# Work on a copy so the original data stays untouched
activity_mod <- activity
missing <- is.na(activity$steps)
activity_mod$steps[missing] <- sapply(activity$interval[missing], mean_steps)

activity_by_day <- activity_mod %>%
  group_by(date) %>%
  summarise(steps = sum(steps))
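The per-row lookup above can also be written as a single vectorised lookup. The following is a minimal base R sketch of that alternative on a hypothetical toy data set. It is not the approach used in this analysis, and the names `toy`, `interval_means` and `filled` are made up for illustration.

```r
# Hypothetical toy data: three days with three intervals each,
# where the first day is missing entirely
toy <- data.frame(
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(NA, NA, NA, 0, 10, 20, 4, 30, 40)
)

# Average steps per interval; the formula interface drops NA rows by default
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Replace each missing value with the mean of its interval in one step
missing <- is.na(toy$steps)
filled <- toy
lookup <- match(toy$interval[missing], interval_means$interval)
filled$steps[missing] <- interval_means$steps[lookup]

print(filled)
```

`match()` returns, for each missing row, the position of its interval in the table of means, so no function call per row is needed.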

The histogram of the total number of steps taken each day now looks like:

qplot(steps, data = activity_by_day, binwidth = 800)

It is very similar in shape to the one without the filled-in values. The difference is in the height of the histogram: with the missing values filled in, more days are included, so the histogram is higher.

The mean of the distribution is now:

mean(activity_by_day$steps)

## [1] 10766.19

And the median is:

median(activity_by_day$steps)

## [1] 10766.19
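The unchanged mean and the non-integer median can both be traced back to the imputation strategy. The sketch below assumes that the missing values span whole days and that the remaining days are complete, which is not verified here. Under that assumption each imputed day gets a total equal to the mean daily total, so adding those days leaves the mean unchanged and can move the median onto the mean. The totals are hypothetical.

```r
# Hypothetical daily totals for the days without missing values
complete_totals <- c(100, 200, 330)
mean_total <- mean(complete_totals)
median_before <- median(complete_totals)

# Assumption: every fully missing day is filled with the interval means,
# so its total equals the mean daily total
imputed_totals <- c(complete_totals, rep(mean_total, 2))

mean_after <- mean(imputed_totals)
median_after <- median(imputed_totals)

print(c(mean_before = mean_total, mean_after = mean_after))
print(c(median_before = median_before, median_after = median_after))
```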

Comparing weekdays and weekends

The following adds a weekday/weekend factor to the data set with the filled-in values and averages the steps per time interval for each group, to investigate whether there is a difference.

# wday() numbers the days from 1 (Sunday) to 7 (Saturday) by default
days <- c("weekend", rep("weekday", 5), "weekend")

activity_mod <- activity_mod %>%
  mutate(day = factor(days[wday(ymd(date, tz = "Europe/Amsterdam"))], 
                      levels = c("weekend", "weekday")))

activity_by_interval <- activity_mod %>%
  group_by(day, interval) %>%
  summarise(steps = mean(steps)) %>%
  arrange(desc(day))
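The classification above relies on `wday()` returning 1 for Sunday and 7 for Saturday, which is the lubridate default. As a cross-check, the same labels can be derived in base R. This is a minimal sketch on a few hypothetical dates, not the code used in this analysis.

```r
# Hypothetical dates: Friday through Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday counts from 0 (Sunday) to 6 (Saturday),
# independent of the locale
day_number <- as.POSIXlt(dates)$wday

# Label Saturday and Sunday as weekend, everything else as weekday
day_type <- factor(ifelse(day_number %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekend", "weekday"))

print(data.frame(dates, day_number, day_type))
```

Working with day numbers instead of day names keeps the result independent of the language settings of the machine.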

The time series split for weekdays and weekends looks like:

qplot(interval, steps, data = activity_by_interval, 
      ylab = "Number of steps",
      geom = c("line"), facets = day ~ .)

A clear difference is observed, especially in the morning.
