forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
228 lines (207 loc) · 7.89 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
Reproducible Research: Peer Assessment 1
=============================================
```{r}
Sys.time()
```
This is my report for the assessment of "Reproductive Research."
## Loading and preprocessing the data
I loaded the dataset, "Activity monitoring data." I saved the data as "raw.data."
```{r}
raw.data<-read.csv("activity.csv")
head(raw.data)
nrow(raw.data)
```
This raw data has many NA values and I made a data frame that does not have NAs.
```{r}
data.no.NA<-subset(raw.data,is.na(raw.data[,"steps"])==FALSE)
head(data.no.NA)
```
The date on which the mesurement was taken is;
```{r}
date.names<-levels(factor(raw.data$date))
date.names<-paste(substring(date.names,6,7),"/",substring(date.names,9,10),sep="")
head(date.names,3)
tail(date.names,3)
```
5-minute interval in which mesurement was taken is;
```{r}
int.names<-levels(as.factor(raw.data$interval))
head(int.names,3)
tail(int.names,3)
```
In the following, I used "ggplot2" so that I can make elegant graphics.
```{r,warning=FALSE}
library(ggplot2)
```
## What is mean total number of steps taken per day?
I saved the total number of steps taken per day to "tsd," ignoring the missing values.
```{r,results='markup'}
total.step.day<-tapply(data.no.NA$steps,data.no.NA$date,sum)
tsd<-data.frame(date=date.names,tsd=total.step.day)
head(tsd)
```
And I made a histogram with ggplot2.
```{r tsd}
g<-ggplot(tsd,aes(x=tsd))
g<-g+geom_histogram(binwidth=1000)
g+labs(list(title="The total number of steps taken per day",
x="Total number of steps per each day",y="Frequency"))
```
The mean and median of total number of steps taken per day is in this summary.
```{r}
summary(tsd)
```
To see only mean and median, I made objects, meanA and medianA.
```{r}
#only mean
(meanA<-mean(tsd$tsd,na.rm=TRUE))
#only median
(medianA<-median(tsd$tsd,na.rm=TRUE))
```
## What is the average daily activity pattern?
I calculated the average number of steps taken for each intaval, with ignoring missing values. The values are averaged across all days.
```{r}
mean.int<-data.frame(interval=as.numeric(int.names),
mean=as.vector(tapply(data.no.NA$steps,
data.no.NA$interval,mean)))
```
I made a time siries plot of the average number of steps taken vs. the 5-minutes interval.
```{r daily activity pattern}
q<-ggplot(mean.int,aes(x=interval,y=mean))+geom_line()
q+labs(list(title="average daily activity pattern",x="interval",
y="the average number of steps taken for each intaval"))
```
The interval contains the maximum averaged number of steps is;
```{r}
sort.data<-mean.int[order(mean.int$mean,decreasing=TRUE),]
sort.data[1,"interval"]
```
## Imputing missing values
Because NAs sometimes introduce bias into analysis of data, I have to filling the misisng values in the dataset. The number of missing values in the dataset is;
```{r}
nrow(subset(raw.data,is.na(raw.data[,"steps"])==TRUE))
```
For estimating the NA values, I used the arithmetic mean number of means of the day and the interval. For example, I explain about the first row of raw.data:
```{r}
raw.data[1,]
```
The mean number of steps of 10/2 is;
```{r}
mean.day<-data.frame(date=date.names,
mean=as.vector(tapply(data.no.NA$steps,
data.no.NA$date,mean)))
mean.day[is.na(mean.day)]<-0
mean.day[1,]
```
the average number of steps taken for each intaval, with ignoring missing values;
```{r}
mean.int[is.na(mean.int)]<-0
mean.int[1,]
```
So, the NA value in the first row will be raplaced by the estimated number;
```{r}
(mean.day[1,2]+mean.int[1,2])/2
```
In the same way,
```{r}
replaced<-raw.data
for (i in 1:nrow(replaced)){
if(is.na(replaced[i,"steps"])==TRUE){
mean.today<-subset(mean.day,
mean.day[,"date"]==
paste(substring(replaced[i,"date"],6,7),"/",
substring(replaced[i,"date"],9,10),sep=""))
mean.the.int<-subset(mean.int,
mean.int[,"interval"]==replaced[i,"interval"])
replaced[i,"steps"]<-(mean.today[1,2]+mean.the.int[1,2])/2
}
}
head(replaced)
```
I use "replaced" in the following analysis. This dataset is a new dataset that is the same as the original dataset but with the missing cells filled in.
The total number of steps taken per day is;
```{r}
tsd2<-tapply(replaced$steps,replaced$date,sum)
tsd2<-data.frame(date.names,tsd2)
head(tsd2)
```
Here is the histogram of the new dataset;
```{r,tsd2}
p<-ggplot(tsd2,aes(x=tsd2))
p<-p+geom_histogram(binwidth=1000)
p+labs(list(title="The total number of steps taken per day (with missing data filled in)",
x="Total number of steps per each day",y="Frequency"))
```
And I calculated the mean and median;
```{r}
summary(tsd2)
meanB<-mean(tsd2$tsd2,na.rm=TRUE) #only mean
medianB<-median(tsd2$tsd2,na.rm=TRUE) #only median
```
To compare tsd and tsd2, I made multiple histograms in a screen.
```{r,hsit_total, warning=FALSE}
library(gridExtra)
g<-g+labs(list(title="The total number of steps taken per day",
x="Total number of steps per each day",y="Frequency"))
p<-p+labs(list(title="The total number of steps taken per day (with missing data filled in)",
x="Total number of steps per each day",y="Frequency"))
grid.arrange(g,p,nrow=2)
```
And I overayed multiple histograms.
```{r,fig_hist_last}
pg<-ggplot(data=tsd,aes(x=tsd))
pg<-pg+geom_histogram(binwidth=1000)
pg<-pg+geom_histogram(data=tsd2,aes(x=tsd2),binwidth=1000,alpha=0.5)
pg+labs(list(title="The difference of total number of steps taken per day with or without NAs",x="Total number of steps per each day",y="Frequency"))
```
The difference between tsd and tsd2 is in the gray zone.
This zone makes the differences of mean and median number.
```{r}
meanA-meanB
medianA-medianB
```
## Are there differences in activity patterns between weekdays and weekends?
First, I made a factor which has two levels-"weekday" and "weekend."
I set English as the language R prints with, and and added new column to the data frame "replaced."
```{r}
#print "Date" in English
Sys.setlocale("LC_TIME","English")
#to see weekdays
replaced$weekdays<-weekdays(as.Date(raw.data$date),abbreviate=TRUE)
#convert into weekday,weekend
replaced$weekdays[replaced$weekdays=="Sun"|
replaced$weekdays=="Sat"]<-"weekend"
replaced$weekdays[replaced$weekdays!="weekend"]<-"weekday"
```
Next, I calculated the mean numbers of steps taken, averaged across all weekday s or weekend days.
```{r}
end.data<-subset(replaced,replaced$weekdays=="weekend")
day.data<-subset(replaced,replaced$weekdays=="weekday")
end.int<-as.vector(tapply(end.data$steps,factor(end.data$interval),mean))
day.int<-as.vector(tapply(day.data$steps,factor(day.data$interval),mean))
week.int<-data.frame(interval=as.numeric(int.names),weekend=end.int,weekday=day.int)
head(week.int)
```
Last, I made a panel plot conteining a time series plot, using ggplot2 and gridExtra.
```{r,fig_week}
a<-ggplot(week.int,aes(x=interval,y=weekend))+geom_line(col="blue")
b<-ggplot(week.int,aes(x=interval,y=weekday))+geom_line(col="blue")
a<-a+labs(list(title="average daily activity pattern on weekends",
x="interval",y="average number of steps"))
b<-b+labs(list(title="average daily activity pattern on weekdays",
x="interval",y="average number of steps"))
grid.arrange(a,b,nrow=2)
```
## Citation
Data;
https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
Explanations for this assessment in GitHub;
https://github.com/rdpeng/RepData_PeerAssessment1.git
Tools of analysis;
R Core Team (2013). R: A language and environment for statistical
computing. R Foundation for Statistical Computing, Vienna, Austria. URL
http://www.R-project.org/.
H. Wickham. ggplot2: elegant graphics for data analysis. Springer New
York, 2009. URL http://had.co.nz/ggplot2/book
Baptiste Auguie (2012). gridExtra: functions in Grid graphics. R package
version 0.9.1. http://CRAN.R-project.org/package=gridExtra