forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathPA1_template.Rmd
127 lines (107 loc) · 6.38 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
# Reproducible Research: Peer Assessment 1
## Introduction
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These type of devices are part of the "quantified self" movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized both because the raw data are hard to obtain and there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5 minute intervals through out the day. The data consists of two months of data from an anonymous individual collected during the months of October and November, 2012 and include the number of steps taken in 5 minute intervals each day.
## Loading and preprocessing the data
First, I'll load the data from a zipped csv file into a dataset called act:
```{r Load data}
suppressMessages(require(ggplot2))
suppressMessages(require(data.table))
suppressMessages(require(gridExtra))
fname <- "./activity.zip"
if(file.exists(fname)) {
fn <- unzip(fname,list=T)[,1]
unzip(fname)
act <- read.csv(fn)
file.remove(fn)
}
```
At this stage, I don't think it necessary to do any processing of the data, and here is a brief overall summary of the activity data I just loaded in:
```{r Summary}
summary(act)
```
## What is mean total number of steps taken per day?
Here is a histogram and a summary of the total numer of steps per day.
```{r Mean total steps per day, Histogram}
actbyday <- tapply(act$steps,act$date,sum)
hist(actbyday,breaks=20)
summary(actbyday)
```
The mean and median total number of steps taken per day are as follows, if not considering missing values:
```{r Mean total steps per day, mean and median}
mean(actbyday,na.rm=T)
median(actbyday,na.rm=T)
```
## What is the average daily activity pattern?
The following figure depicts a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):
```{r Average daily activity}
actbyint <- tapply(act$steps,act$interval,mean,na.rm = TRUE)
dt<-data.table(interval=as.integer(names(actbyint)),ave_steps=actbyint)
qplot(x=interval,y=ave_steps, data = dt, geom ="line")
```
Note that the following 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps:
```{r Interval with maximum number of steps on average}
dt[ave_steps==max(dt[,ave_steps])]$interval # interval of max ave_steps
```
## Imputing missing values
Now I calculate and report the total number of missing values in the dataset, i.e. total number of rows with NAs. Note that all missing values are in the steps column:
```{r Check missing values}
sum(with(act, is.na(steps)|is.na(date)|is.na(interval))) # total rows with NAs
with(act, c( sum(is.na(steps)), # total rows with NA steps
sum(is.na(date)), # no date is missing
sum(is.na(interval)) # no interval is missing
))
```
Now I fill in all of the missing values in the dataset with the mean for each 5-minute interval. A new dataset (actnew) that is equal to the original dataset but with the missing data filled in is created:
```{r Insert missing values}
actnew <- act
actnew[with(actnew,is.na(steps)),]$steps <- actbyint
```
Make sure the new dataset has the same mean for each 5-minute interval:
```{r}
sum(with(actnew, is.na(steps)|is.na(date)|is.na(interval))) # no missing data now
newactbyint <-tapply(actnew$steps,actnew$interval,mean,na.rm = F)
setequal(actbyint,newactbyint) # and averages are equal to old ones
```
The following is a histogram of the total number of steps taken each day and the mean and median total number of steps taken per day, using the new dataset.
```{r Histogram of total steps per day}
newactbyday <- tapply(actnew$steps,actnew$date,sum)
hist(newactbyday,breaks=20)
mean(newactbyday)
median(newactbyday)
```
Compared with those in the first part of the assignment, with the addition of missing data on the estimates of the total daily number of steps, the median value is slightly shifted up a little bit, but not the mean.
```{r difference in mean and median}
mean(newactbyday) - mean(actbyday,na.rm=T) # shift of mean
median(newactbyday) - median(actbyday,na.rm=T) # shift of median
```
## Are there differences in activity patterns between weekdays and weekends?
Now I create a new factor variable and add it to the dataset with two levels - "weekday" and "weekend" indicating whether a given date is a weekday or weekend day.
```{r Weekdays and weekends}
wkd <- as.factor(ifelse(weekdays(as.POSIXlt(actnew$date))
%in% c('Saturday','Sunday'),
"weekend",
"weekday"))
actnewwkd<- cbind(actnew,wkd)
```
Now I plot the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis), as shown below:
```{r Weekdays/Weelends plot}
dtnew <- with(actnewwkd,
tapply(steps, list("interval"=interval, "day"=wkd), mean))
dtnew <- as.data.table(cbind(interval=as.numeric(rownames(dtnew)),dtnew))
plot1<-qplot(data=dtnew,x=interval,y=weekday,geom="line") +
ggtitle('Weekday') +
xlab('') +
ylab('Number of Steps') +
coord_cartesian(ylim = c(0, 250))
plot2<-qplot(data=dtnew,x=interval,y=weekend,geom="line") +
ggtitle('Weekend') +
xlab('Interval') +
ylab('Number of Steps') +
coord_cartesian(ylim = c(0, 250))
grid.arrange(plot1, plot2, nrow=2,
main = "Number of Steps ~ Interval"
)
```
## Conclusion
The mean total number of steps taken per day of this anonymous person during the period of recording is 10766.The average daily activity mainly concentrated in the range of 5-min intervals between 500-2000, with peak activity at interval 835. There are apparent differences in activity patterns between weekdays and weekends, with increased activity in the interval range of about 1000-1700 on weekends, and decreased activity before 1000 on weekends.