---
layout: default
title: Model
---
# Model
Models are one of the most important tools for data scientists, because models describe relationships. Would you list out every value of a variable, or would you state the mean? Would you list out every pair of values, or would you state the function between variables?
### Outline
* *Section 1* will explain what models are and what they can do for you.
* *Section 2* will show you how to use R to build linear models, the most commonly used modeling tool. The section introduces R's model syntax, a general syntax that you can reuse with any of R's modeling functions.
* *Section 3* will teach you to build and interpret multivariate linear models, models that use more than one variable to make a prediction.
* *Section 4* will explain how to use categorical variables in your models and how to interpret the results.
* *Section 5* will present a logical way to extend models to non-linear settings.
### Prerequisites
To access the functions and data sets that we will use in the chapter, load the `ggplot2`, `dplyr`, `mgcv`, `splines`, and `broom` packages:
```{r}
# install.packages(c("ggplot2", "dplyr", "mgcv", "splines", "broom"))
library(ggplot2)
library(dplyr)
library(mgcv)
library(splines)
library(broom)
```
**Note: the current examples use a data set that will be replaced in later drafts.**
## What is a model?
1. A model is just a summary, like a mean, median, or variance (see the sketch after this list).
+ Example problem/data set
```{r echo = FALSE}
heights <- read.csv("data/heights.csv")
```
```{r}
head(heights)
```
2. As normally taught, modeling is a conflation of three subjects
+ Models as summaries
+ Hypothesis testing
+ Predictive modeling
3. This chapter shows how to build a model and use it as a summary. The methods for building a model apply to all three subjects.
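A minimal sketch of the model-as-summary idea (peeking ahead at `lm()`, which is covered below): an intercept-only model just recovers the mean of `earn`.
```{r}
# The intercept of an intercept-only model is the mean of the response.
mean(heights$earn, na.rm = TRUE)
coef(lm(earn ~ 1, data = heights))
```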
## How to build a model
1. Best fit
+ Best fit of what? A certain class of function.
+ But how do you know which class to use? Sometimes the data suggests a class; sometimes existing theory does. Ultimately, you'll never know for sure, and that's okay: good enough is good enough.
2. What does best fit mean?
+ It may or may not accurately describe the true relationship. Heck, there might not even be a true relationship. But it is the best guess given the data.
+ Example problem/data set
+ It does not mean causation exists. Causation is just one type of relationship, one that is difficult enough to define, let alone prove.
3. How do you find the best fit?
+ With an algorithm. There is an algorithm to fit each specific class of function. We will cover some of the most useful here.
4. How do you know how good the fit is?
+ Adjusted $R^{2}$ (see the sketch after this list)
5. Are we making assumptions when we fit a model?
+ No. Not unless you assume that you've selected the correct type of function (and I see no reason why you should assume that).
+ Assumptions come when you start hypothesis testing.
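A sketch of that check, again peeking ahead at `lm()`: `summary()` reports adjusted $R^{2}$, which you can also extract directly (`fit` is just a throwaway name here).
```{r}
# Adjusted R^2 for a simple linear fit of earn on height.
fit <- lm(earn ~ height, data = heights)
summary(fit)$adj.r.squared
```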
## Linear models
1. Linear models fit linear functions
2. How to fit in R
+ model syntax, which is reusable with all model functions
```{r}
earn ~ height
lm(earn ~ height, data = heights)
```
+ save model output
```{r}
hmod <- lm(earn ~ height, data = heights)
coef(hmod)
summary(hmod)
```
+ visualize
```{r}
ggplot(data = heights, mapping = aes(x = height, y = earn)) +
geom_point() +
geom_smooth(method = lm)
```
+ intercept or no intercept
```{r}
earn ~ 0 + height
lm(earn ~ height, data = heights)      # with an intercept (the default)
lm(earn ~ 0 + height, data = heights)  # without an intercept
```
3. How to interpret
+ extract information. Resid. Predict.
```{r eval = FALSE}
resid(hmod)
predict(hmod)
```
+ Interpret coefficient
4. How to use the results (with `broom`)
+ tidy. augment. glance.
```{r eval = FALSE}
tidy(hmod)
augment(hmod)
glance(hmod)
```
```{r}
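# Fit a separate linear model for each sex, then collect a one-row summary
# of each model with glance().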
heights %>%
group_by(sex) %>%
do(glance(lm(earn ~ height, data = .)))
```
## Categorical data
```{r}
smod <- lm(earn ~ sex, data = heights)
smod
```
1. Factors
```{r}
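# Reordering the factor levels makes "male" the reference level, which changes
# how the coefficients are reported (the fitted values stay the same).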
heights$sex <- factor(heights$sex, levels = c("male", "female"))
smod2 <- lm(earn ~ sex, data = heights)
smod
smod2
```
2. How to interpret
```{r}
coef(smod)
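# With a categorical predictor, lm() builds an indicator variable:
# (Intercept) is the predicted earn for the reference level of sex, and the
# other coefficient is the predicted difference for the remaining level.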
```
## Multiple variables
1. How to fit multivariate models in R
```{r}
mmod <- lm(earn ~ height + sex, data = heights)
mmod
```
2. How to interpret
```{r}
coef(mmod)
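# Each coefficient is the predicted change in earn associated with that
# variable, holding the other variable in the model constant.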
```
3. Interaction effects
```{r}
lm(earn ~ height + sex, data = heights)
lm(earn ~ height + sex + height:sex, data = heights)
lm(earn ~ height * sex, data = heights)
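# height * sex is shorthand for height + sex + height:sex,
# so the last two calls fit the same model.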
```
```{r}
lm(earn ~ height + ed, data = heights)
lm(earn ~ height * ed, data = heights)
```
4. Partition variance
+ Checking residuals
```{r}
m1 <- lm(earn ~ height, data = heights)
# plot histogram of residuals
# plot residuals vs. sex
m2 <- lm(earn ~ height + sex, data = heights)
# plot histogram of residuals
# plot residuals vs. education
m3 <- lm(earn ~ height + sex + ed, data = heights)
# plot histogram of residuals
m4 <- lm(earn ~ height + sex + race + ed + age,
data = heights)
# plot histogram of residuals
m5 <- lm(earn ~ ., data = heights)
```
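A minimal sketch of the residual checks noted in the comments above, using `broom::augment()`; the plot choices are placeholders, and the same pattern extends to plotting `.resid` against other candidate variables.
```{r}
# Distribution of residuals for m1.
augment(m1) %>%
  ggplot(aes(x = .resid)) +
  geom_histogram(bins = 30)
# Residuals against fitted values for m1.
augment(m1) %>%
  ggplot(aes(x = .fitted, y = .resid)) +
  geom_point()
```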
## Non-linear functions (recipes?)
0. Transformations
+ Log
```{r}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point()
```
```{r}
lm(log(price) ~ log(carat), data = diamonds)
# visualize model line
```
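One way to fill in the "visualize model line" placeholder above (a sketch): plot on the log scale and let `geom_smooth()` refit the same linear model.
```{r}
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
  geom_point() +
  geom_smooth(method = lm)
```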
+ Logit with `glm()`
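A minimal sketch of a logit model with `glm()`, using an artificial binary outcome (`high_earn`, a made-up indicator for earning more than the median) purely to illustrate the syntax; `heights2` is a throwaway name.
```{r}
# high_earn is an invented outcome, used only to show the glm() syntax.
heights2 <- mutate(heights, high_earn = earn > median(earn, na.rm = TRUE))
glm(high_earn ~ height, family = binomial, data = heights2)
```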
What if no handy transformation exists?
```{r}
ggplot(data = heights, mapping = aes(x = age, y = earn)) +
geom_point() +
geom_smooth() +
coord_cartesian(ylim = c(0, 50000))
```
1. Polynomials
+ How to fit
```{r}
lm(earn ~ poly(age, 3), data = heights)
ggplot(data = heights, mapping = aes(x = age, y = earn)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ poly(x, 3))
```
+ How to interpret (see the sketch below)
+ Strengths and Weaknesses
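A note on interpreting the polynomial fit (a sketch): `poly()` uses orthogonal polynomials by default, so the individual coefficients are hard to read directly; `raw = TRUE` reports coefficients on `age`, `age^2`, and `age^3` instead, while the fitted curve stays the same.
```{r}
lm(earn ~ poly(age, 3, raw = TRUE), data = heights)
```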
2. Splines
+ How to fit. Knots. Different types of splines.
```{r eval = FALSE}
bs(degree = 1) # linear splines
bs() # cubic splines
ns() # natural splines
```
```{r}
lm(earn ~ ns(age, knots = c(40, 60)), data = heights)
lm(earn ~ ns(age, df = 4), data = heights)
```
```{r}
lm(earn ~ ns(age, df = 6), data = heights)
ggplot(data = heights, mapping = aes(x = age, y = earn)) +
geom_point() +
geom_smooth(method = lm, formula = y ~ ns(x, df = 6)) +
coord_cartesian(ylim = c(0, 50000))
```
+ How to interpret
+ Strengths and weaknesses
3. Generalized additive models (GAMs)
+ How to fit
```{r}
gmod <- gam(earn ~ s(height), data = heights)
ggplot(data = heights, mapping = aes(x = age, y = earn)) +
geom_point() +
geom_smooth(method = gam, formula = y ~ s(x))
```
+ How to interpret
+ Strengths and weaknesses
```{r eval = FALSE}
# Linear z
gam(y ~ s(x) + z, data = df)
# Smooth x and smooth z
gam(y ~ s(x) + s(z), data = df)
# Smooth surface of x and z
# (a smooth function that takes both x and z)
gam(y ~ s(x, z), data = df)
```