# read in dataset
df.raw <- read.csv("activity.csv")
# do some analysis - to save space on the report, output is omitted
# str(df.raw); summary(df.raw); table(df.raw$date)
Mean of the total number of steps per day
# we use the data.table package; it might be overkill for a relatively small
# dataset, but the syntax is so much cleaner
require(data.table)
## Loading required package: data.table
dt.raw<- data.table(df.raw)
setkeyv(dt.raw, c("date", "interval"))
# first, find the total steps for each day and save to a data.frame of one
# observation per day, summing over all 5 minute intervals to get total steps
# for each day
dfSum <- dt.raw[, sum(steps, na.rm=TRUE), by=date]
setnames(dfSum, c("date", "totalsteps"))
# draw a histogram of the total number of steps each day
hist(dfSum$totalsteps, breaks=20, main="Distribution of Total Steps per Day ",
xlab="Total Steps")
# Now we can find the mean of the total steps for each day
meanTotalSteps <- dfSum[, mean(totalsteps, na.rm=TRUE)]
meanTotalSteps<- as.integer(round(meanTotalSteps, 0))
# and the median of the total steps for each day
medianTotalSteps <- dfSum[, median(totalsteps, na.rm=TRUE)]
medianTotalSteps<- as.integer(round(medianTotalSteps, 0))
# while we are at it, let's exclude those days with zero activity
# find mean when we exclude the days of zero activity
meanTotalStepsExcZeros <- dfSum[totalsteps!=0, mean(totalsteps, na.rm=TRUE)]
meanTotalStepsExcZeros<- as.integer(round(meanTotalStepsExcZeros))
# find median when we exclude the days of zero activity
medianTotalStepsExcZeros <- dfSum[totalsteps!=0, median(totalsteps, na.rm=TRUE)]
medianTotalStepsExcZeros<- as.integer(round(medianTotalStepsExcZeros, 0))
The mean total number of steps for all days is: 9354 and the median total number of steps over all days is: 10395. If we were to ignore the days with zero activity, then the mean is 10766 and the median is 10765.
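The gap between the two means comes from how `sum(steps, na.rm=TRUE)` treats a day with no recorded values: it returns 0 rather than NA, so such days are counted as zero-step days. A minimal base R sketch on made-up data (the `toy` data frame and its values are hypothetical, not taken from activity.csv) shows the effect:

```r
# Hypothetical toy data: 3 days x 2 intervals; day "d2" is entirely missing
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, NA, NA, 30, 40)
)

# Daily totals with NAs dropped: the all-NA day sums to 0, not NA
totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean_all      <- mean(totals)               # zero-total day included
mean_exc_zero <- mean(totals[totals != 0])  # zero-total day excluded
```

Note that filtering on `totals != 0` would also drop a day on which zero steps were genuinely recorded, so this shortcut is only safe when every zero total corresponds to a fully missing day.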
Average Daily Activity Pattern
dfAve<-dt.raw[, mean(steps, na.rm=TRUE), by=interval]
setnames(dfAve, c("interval", "mean"))
with(dfAve, plot(y=mean, x=interval, type="l", xlab=" Time (HHMM - 24 hour clock)",
main="Average steps for each 5-min interval"))
The maximum average number of steps taken in any 5-minute interval was 206.1698, which occurred in the interval starting at 08:35.
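The code that produced these two figures is not shown above. One way to obtain them, sketched here in base R on a hypothetical stand-in for `dfAve` (the `ave_toy` values are made up, except that the peak row mirrors the figure quoted above), is `which.max` plus a little formatting of the HHMM interval label:

```r
# Hypothetical stand-in for dfAve: interval labels are HHMM integers
ave_toy <- data.frame(
  interval = c(830L, 835L, 840L),
  mean     = c(150, 206.1698, 180)
)

# Row holding the largest average
peak <- ave_toy[which.max(ave_toy$mean), ]

# Format the HHMM integer as "HH:MM"
peak_time <- sprintf("%02d:%02d", peak$interval %/% 100L, peak$interval %% 100L)
```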
Impute missing values
Here we impute any missing values for any 5 minute segment. The strategy will be, for a given 5 minute interval, calculate the mean across all days that have any value, including zeros, for that same 5 minute interval, and replace the NA's with that calculated mean. Repeat for all such intervals.
# First let's find the total number of missing values in the dataset
numberOfNAs <- sum(is.na(df.raw$steps)) # 2304
numberOfNAs
## [1] 2304
The total number of missing values is: 2304.
# Second, let's take a look at when the NA's occur
beforeImputing <- dt.raw[, list(sum(is.na(steps)), sum(steps, na.rm=TRUE)),
by=date]
# beforeImputing
# We see that the NA's are recorded for every 5 minute segment where this
# individual did not record any steps. Also that if the individual did not
# record activity for a segment, then that individual had no activity at all
# for the entire day
# IMPUTATION STRATEGY: fill in any missing values with the mean of that
# 5-minute segment
# do this in two steps.
# First create a new field that replicates the mean values for each interval
# R's default behavior is to take a shorter vector, in our case dfAve$mean
# and repeat it as often as necessary to fill in the larger target
dfMissingfilled <- dt.raw[, `:=`(newstepsfield, dfAve$mean), by=date]
# Second, if there is a NA in the field 'steps', replace it with the value
# from this new field
dfMissingfilled[, `:=`(steps, ifelse(is.na(steps), newstepsfield, steps))]
# now get rid of the newstepsfield, since we don't need it anymore
dfMissingfilled$newstepsfield <- NULL
afterImputing <- dfMissingfilled[, list(sum(is.na(steps)), sum(steps, na.rm=TRUE)),
by=date]
setnames(afterImputing, c("date", "numberofna", "totalsteps"))
# afterImputing
# take a quick look to compare the before and after; the only daily
# observations that should have been affected are the days that had all NA's
# and 0's for total number of steps. No need to save it, we are just looking
# at it
merge(beforeImputing, afterImputing)
# now find the new mean and median after imputing
# Find the mean of the total steps for each day
meanTotalStepsImputed <- afterImputing[, mean(totalsteps, na.rm=TRUE)]
meanTotalStepsImputed<- as.integer(round(meanTotalStepsImputed, 0))
# Find the median of the total steps for each day
medianTotalStepsImputed <- afterImputing[, median(totalsteps, na.rm=TRUE)]
medianTotalStepsImputed<- as.integer(round(medianTotalStepsImputed, 0))
hist(afterImputing$totalsteps, breaks=20, main="Distribution of total daily steps per day (after imputing)",
xlab="Total steps per day")
It appears that imputing with the mean of each 5-minute interval had a negligible effect on the daily mean and median when we compare the results after imputing with those obtained by excluding the days that had no activity to begin with. The original days with activity had a mean of 10766 and a median of 10765. After imputing, the mean is 10766 and the median is 10766.
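The near-identical means are expected rather than coincidental, assuming (as the code comments above observe) that missing values occur only as whole days. Each fully missing day is filled with the per-interval means, so its total equals the sum of those means, which is exactly the mean daily total of the observed days. A small base R sketch with hypothetical data illustrates this; it uses `ave` instead of the data.table code from the report, and the `toy` values are made up:

```r
# Hypothetical toy data: day "d2" is entirely missing
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, NA, NA, 30, 40)
)

# Mean of each interval across the days that have a value,
# repeated so it lines up row by row with toy$steps
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace each NA with the mean of its interval
toy$imputed <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

# Daily totals: observed days only vs. all days after imputing
observed_totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
observed_totals <- observed_totals[observed_totals != 0]
imputed_totals  <- tapply(toy$imputed, toy$date, sum)
```

In this toy case the imputed day lands exactly on the observed mean, which is also why the median can shift toward the mean once several such days are added.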
qplot(y=mean, x=interval, data=dfAve2, facets=indicator~., margins=FALSE,
labeller=label_value, main="Mean number of steps by each 5 minute interval",
xlab=" Time (HHMM - 24 hour clock)", geom="line") + theme_bw()