This is my solution for peer assessment 1.
Load the data in from the file. We'll just take the default column names and data types.
activityData <- read.csv("activity.csv")
For the next few steps we need the data aggregated per day.
dailyData <- aggregate(steps ~ date, activityData, FUN = sum)
hist(dailyData$steps, main = "Frequency of Total Steps per Day", xlab = "Steps")
myMean <- mean(dailyData$steps)
myMean
## [1] 10766
myMedian <- median(dailyData$steps)
myMedian
## [1] 10765
The mean steps per day is 10766 and the median is 10765.
Find the average (mean) number of steps for each time interval across all days, and graph it.
intervalAgg <- aggregate(steps ~ interval, activityData, FUN = mean)
plot(intervalAgg$interval, intervalAgg$steps, type = "l", ylab = "Avg. Steps",
xlab = "Interval")
maxInterval <- intervalAgg[intervalAgg$steps == max(intervalAgg$steps), ]
maxInterval
## interval steps
## 104 835 206.2
The interval with the highest average is 835 with an average number of steps of 206.1698.
Note that the date and interval columns are always populated so we only need to check the steps column for NA values.
navalues <- is.na(activityData$steps)
nacount <- sum(navalues)
nacount
## [1] 2304
There are 2304 records with NA steps.
Approach: For imputing the missing values, it will be reasonably accurate to populate each record with the mean value for that interval. There is probably a way to do this in a single line, replacing the values inline. But this is straightforward enough that I'm not going to try to optimize it more. This isn't a giant dataset where it would be worthwhile to highly optimize.
# Pull out only the rows we want to manipulate
naRows <- activityData[is.na(activityData), ]
# Do a join with the data.frame containing the mean values for each interval
mergedRecords <- merge(naRows, intervalAgg, by = "interval")
# Throw out the original Step value and replace it with the new mean value
# in column 4
newdata <- mergedRecords[, c(4, 3, 1)]
colnames(newdata) <- c("steps", "date", "interval")
# And finally assemble a new data.frame that combines the good data with
# this new modified data.
goodRecords <- na.omit(activityData)
fullData <- rbind(goodRecords, newdata)
Now aggregate it for each day, and create a histogram
fullDailyData <- aggregate(steps ~ date, fullData, FUN = sum)
hist(fullDailyData$steps, main = "Daily Steps", xlab = "Steps")
The this histogram looks quite similar to the histogram above. The only difference is that there are more records in the middle bucket. This makes sense, since we just manufactured a bunch of records based on mean values.
And now calculate the mean/median of the new data.
newMean <- mean(fullDailyData$steps)
newMedian <- median(fullDailyData$steps)
In this modified data, mean steps per day is 10766 and the median is 10766. This compares to values of 10766 and10765 in the above steps. The mean has not changed, but the median has changed slightly.
Conclusion: Setting the NA values to the median has a minor effect on the data when looking at the mean/median values. However, it does increase the shape of the histogram and make the data appear more dense than it would likely be if actual values were available.s
For this part of the assignment we'll continue using the fully populated data with the NA values imputed.
Find the weekday for each date, then figure out if that day is on a weekend or not.
# Add a new column to the data with name of the day of the week
fullData$weekday <- weekdays(as.POSIXlt(fullData$date))
# Add a new column with TRUE/FALSE values, if the day of the week is on a
# weekend then TRUE
t <- as.factor(fullData$weekday %in% c("Saturday", "Sunday"))
# Convert the TRUE/FALSE values to more user friendly values by overriding
# the label names in the factor
levels(t) <- c("weekday", "weekend")
# Add this column to the data.frame. The weekday column will be there too,
# but that isn't a big deal. There isn't any requirement to remove it.
fullData$weekend <- t
# Finally, do an aggregation on the steps for each interval, additionally
# segmented by the new weekend column
weekendAgg <- aggregate(steps ~ interval + weekend, fullData, FUN = mean)
head(weekendAgg)
## interval weekend steps
## 1 0 weekday 2.25115
## 2 5 weekday 0.44528
## 3 10 weekday 0.17317
## 4 15 weekday 0.19790
## 5 20 weekday 0.09895
## 6 25 weekday 1.59036
Now make a plot
library(lattice)
xyplot(steps ~ interval | weekend, weekendAgg, type = "l", layout = c(1, 2))