forked from datalit/rworkshop
-
Notifications
You must be signed in to change notification settings - Fork 0
/
practiceCode.R
242 lines (208 loc) · 6.81 KB
/
practiceCode.R
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
##'
##' Outline and concepts for workshop
##' =================================
##'
##' This `<-` command basically takes anything on the right hand side
##' and puts it into the left hand side. Like an equation. This is
##' called variable assignment.
##'
df <- read.csv('train.csv')
##' `head` checks only the first few rows of the dataset.
head(df)
##' `dim` checks the dimensions of the dataset (number of rows and
##' number of columns, in that order).
dim(df)
##' `str` checks the structure of the dataset, showing what the `df`
##' object is, what each item (or column) in the dataset is, such as
##' numeric, factor, etc. This is pretty useful as it can give you a
##' quick overview of what the variables have been classified as and
##' if there are any problems.
str(df)
##' `class` checks only the type of object you are asking, not the
##' contents (unlike `str`).
class(df)
##' `summary` is an extremely useful command to check the basic
##' descriptive statistics (mean, median, range, count for factors).
##' I usually use this anytime I want to quickly look at my dataset,
##' to get a sense of it.
summary(df)
##' If you want to see specific columns or rows, you can use the `[`
##' command. The first number is the row `[row, ]`, and the second
##' number is the column `[, column]`. So together: `[row, column]`.
##'
df[1:2, 1:5]
##' Also, numbers can be put together like so:
1:10
-1:10
1:-10
-10:1
##' In dataframes, a negative number means remove:
df[1:2, ]
df[1:2, -2:-4]
##' You can use strings (real words) to select a specific column.
df[1:2, 'Age']
##' You can use the combine command `c()` to put two strings or
##' numbers together.
df[c(1, 4), c('Age', 'Sex')]
##' You can also subset the data using these commands:
head(df[df$Sex == 'male', ])
head(df[c(df$Sex == 'male', df$Age < 40), ])
##'
##' `dplyr` and `tidyr` approach
##' ----------------------------
##'
##' However, these are a bit complicated, and hard to read! There is
##' a better way. Install and/or load these packages:
##+ dplyrInstall, eval = FALSE
install.packages('dplyr')
install.packages('tidyr')
##+ loadDplyr, eval = TRUE
library(dplyr)
library(tidyr)
##' To do the same as the above, use:
df %>%
filter(Sex == 'male', Age < 40) %>%
## You can keep chaining
tbl_df() %>%
## ... and chaining
select(Sex, Age, Pclass, Parch) %>%
## ... and chaining
summary()
##' The extremely useful `%>%` chain, or pipe command, is just like in
##' the shell/terminal. It takes the output of the previous command
##' and inputs it into the next command. Otherwise, without the `%>%`
##' pipe, it looks like:
summary(select(tbl_df(filter(df, Sex == 'male', Age < 40)), Sex, Age, Pclass, Parch))
##' The pipe does this by basically making the output be named `.`, so
##' really, the pipe is doing this:
df %>% select(., Fare, Sex) %>%
filter(., Sex == 'male') %>%
select(., Fare) %>%
round(., 3) %>%
head(.)
##' `tbl_df` makes the dataframe also a tbl object, so that the outout
##' can be printed easily. The verbs for dplyr are:
##'
##' * select
##' * filter
##' * mutate
##' * summarise
##' * arrange
##' * group_by
##'
##' For more explanation of dplyr, check the documentation:
##' https://github.com/hadley/dplyr or run this command
##' `vignette('introduction', package = 'dplyr')`
##'
##' The `tbl_df()` function makes the printing prettier.
df <- tbl_df(df)
df %>% summary
df %>%
## subset the data by SibSp
filter(SibSp >= 2) %>%
## select only the relevant columns
select(Age, Sex, Survived, Cabin, Fare) %>%
## order the data (in descending) by Age
arrange(Age) %>%
## create a new column
mutate(d.Fare = cut(Fare, 3, labels = c('Low', 'Middle', 'High')))
##' To do even more interesting things, we can combine the dplyr
##' package with the tidyr package. The tidyr has basically two main
##' verbs:
##'
##' * gather
##' * spread
##'
df %>%
filter(SibSp >= 2) %>%
select(Age, Sex, Survived, Cabin, Fare) %>%
## convert the data into a very long format
gather(Measure, Value, -Sex) %>%
## make each summarise command run on the groups Sex and Measure
group_by(Sex, Measure) %>%
## create summary statistics, in this cause the sample in each group
## (Sex and Measure)
summarise(n = n()) %>%
## convert the data into a wide format
spread(Sex, n)
##' Check the content again:
df %>% summary
##' Compare the means of continuous variables of those who survived
##' and those who didn't.
prep.table <- df %>%
select(Survived, Age, Pclass, SibSp, Parch, Fare) %>%
gather(Measure, Value, -Survived) %>%
group_by(Survived, Measure) %>%
## remove missing values
na.omit() %>%
## create a summary statistic (means)
summarise(mean = mean(Value) %>% round(2)) %>%
spread(Survived, mean)
prep.table
##' This can be created into a markdown table, so that it can be
##' easily put into a manuscript or report. A very useful package is
##' called `pander` which allows you to create markdown tables.
##+ panderInstall, eval = FALSE
install.packages('pander')
##+ table, results = 'asis'
library(pander)
prep.table %>% pander(style = 'rmarkdown')
##' There is also join commands from dplyr:
##'
##' * left_join
##' * outer_join
##' * inner_join
##' * anti_join
##'
##' Code used in workshop
##' =====================
##'
##' Exact code used in the workshop
ds <- df
ds %>% select(., Sex, Cabin, Fare) %>%
filter(., Sex == 'female', Fare > 10)
filter(select(ds, Sex, Cabin, Fare),
Sex == 'female')
ds %>%
select(Sex, Cabin, Fare) %>%
mutate(newcol = 10 * Fare) %>%
arrange(Fare)
ds %>%
select(Sex, Age, Fare) %>%
group_by(Sex) %>%
na.omit() %>%
summarise(n = n(),
mean = mean(Age))
##' Remove a column using the `-` sign.
ds %>% select(-Age)
prep.table2 <- ds %>%
select(Sex, Fare, Age, Pclass, SibSp) %>%
mutate(AgeDouble = Age * 2) %>%
gather(Measure, Value, -Sex) %>%
na.omit() %>%
group_by(Sex, Measure) %>%
## create a column with the mean and standard deviation
summarise(meanSD = paste0(mean(Value) %>% round(2),
' (', sd(Value) %>% round(2),
')')) %>%
spread(Sex, meanSD)
prep.table2
##' Create a table, with a caption!
##+ tableTesting, results = 'asis'
pander(prep.table2, style = 'rmarkdown',
caption = 'Testing caption for this table!')
##' To use the `grid.table` function instead of `pander`, load the
##' `gridExtra` package. You may need to install first
##' (`install.packages('gridExtra')`).
library(gridExtra)
grid.table(prep.table2, rows = NULL)
##' A package that does a tutorial *within* R! I've heard good things
##' from it.
##+ swirlInstall, eval = FALSE
install.packages('swirl')
##' To see help files, you can run the `vignette` command.
##+ vignHelp, eval = FALSE
vignette('introduction', package = 'dplyr')
##+ knit, eval = FALSE, echo = FALSE
library(knitr)
spin('practiceCode.R')