# Regression with a Binary Dependent Variable {#rwabdv}
In this chapter, we discuss a special class of regression models that aim to explain a limited dependent variable. In particular, we consider models where the dependent variable is binary. We will see that in such models, the regression function can be interpreted as a conditional probability function of the binary dependent variable.
We review the following concepts:
- the linear probability model
- the Probit model
- the Logit model
- maximum likelihood estimation of nonlinear regression models
Of course, we will also see how to estimate the above models using `r ttcode("R")` and discuss an application where we examine whether there is racial discrimination in the U.S. mortgage market.
The following packages and their dependencies are needed to reproduce the code chunks presented throughout this chapter on your computer:
+ `r ttcode("AER")` [@R-AER]
+ `r ttcode("stargazer")` [@R-stargazer]
Check whether the following code chunk runs without any errors.
```{r, warning=FALSE, message=FALSE}
library(AER)
library(stargazer)
```
## Binary Dependent Variables and the Linear Probability Model
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.1">
<h3 class = "right"> Key Concept 11.1 </h3>
<h3 class = "left"> The Linear Probability Model </h3>
The linear regression model
$$Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i$$
with a binary dependent variable $Y_i$ is called the linear probability model. In the linear probability model we have $$E(Y\\vert X_1,X_2,\\dots,X_k) = P(Y=1\\vert X_1, X_2,\\dots, X_k)$$ where $$ P(Y = 1 \\vert X_1, X_2, \\dots, X_k) = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki}.$$
Thus, $\\beta_j$ can be interpreted as the change in the probability that $Y_i=1$, holding constant the other $k-1$ regressors. Just as in common multiple regression, the $\\beta_j$ can be estimated using OLS and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals.
In most linear probability models, $R^2$ has no meaningful interpretation since the regression line can never fit the data perfectly if the dependent variable is binary and the regressors are continuous. This can be seen in the application below.
It is *essential* to use robust standard errors since the $u_i$ in a linear probability model are always heteroskedastic.
Linear probability models are easily estimated in <tt>R</tt> using the function <tt>lm()</tt>.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Linear Probability Model]{11.1}
The linear regression model
$$Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i$$
with a binary dependent variable $Y_i$ is called the linear probability model. In the linear probability model we have $$E(Y\\vert X_1,X_2,\\dots,X_k) = P(Y=1\\vert X_1, X_2,\\dots, X_k)$$ where $$ P(Y = 1 \\vert X_1, X_2, \\dots, X_k) = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki}.$$
Thus, $\\beta_j$ can be interpreted as the change in the probability that $Y_i=1$, holding constant the other $k-1$ regressors. Just as in common multiple regression, the $\\beta_j$ can be estimated using OLS and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals.\\newline
In most linear probability models, $R^2$ has no meaningful interpretation since the regression line can never fit the data perfectly if the dependent variable is binary and the regressors are continuous. This can be seen in the application below.\\newline
It is \\textit{essential} to use robust standard errors since the $u_i$ in a linear probability model are always heteroskedastic.\\newline
Linear probability models are easily estimated in \\texttt{R} using the function \\texttt{lm()}.
\\end{keyconcepts}
')
```
#### Mortgage Data {-}
Following the book, we start by loading the data set `r ttcode("HMDA")` which provides data on mortgage applications filed in Boston in 1990.
```{r, warning=FALSE, message=FALSE}
# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)
```
We continue by inspecting the first few observations and computing summary statistics.
```{r}
# inspect the data
head(HMDA)
summary(HMDA)
```
The variable we are interested in modelling is `r ttcode("deny")`, an indicator for whether an applicant's mortgage application has been accepted (`r ttcode("deny = no")`) or denied (`r ttcode("deny = yes")`). A regressor that ought to have power in explaining whether a mortgage application has been denied is `r ttcode("pirat")`, the size of the anticipated total monthly loan payments relative to the applicant's income. It is straightforward to translate this into the simple regression model
\begin{align}
deny = \beta_0 + \beta_1 \times P/I\ ratio + u. (\#eq:denymod1)
\end{align}
We estimate this model just as any other linear regression model using `r ttcode("lm()")`. Before we do so, the variable `r ttcode("deny")` must be converted to a numeric variable using `r ttcode("as.numeric()")`, as `r ttcode("lm()")` does not accept a *dependent variable* of class `r ttcode("factor")`. Note that `as.numeric(HMDA$deny)` will turn `r ttcode("deny = no")` into `r ttcode("deny = 1")` and `r ttcode("deny = yes")` into `r ttcode("deny = 2")`, so using `r ttcode("as.numeric(HMDA$deny)-1")` we obtain the values `r ttcode("0")` and `r ttcode("1")`.
```{r}
# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
```
Next, we plot the data and the regression line to reproduce Figure 11.1 of the book.
```{r}
# plot the data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Scatterplot Mortgage Application Denial and the Payment-to-Income Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex = 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
       lwd = 1.8,
       col = "steelblue")
```
According to the estimated model, a payment-to-income ratio of $1$ is associated with an expected probability of mortgage application denial of roughly $50\%$. The model indicates that there is a positive relation between the payment-to-income ratio and the probability of a denied mortgage application, so individuals with a high ratio of loan payments to income are more likely to be rejected.
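As a quick plausibility check (this chunk is not part of the book's code), we can evaluate the fitted line at a payment-to-income ratio of $1$ using `r ttcode("predict()")`.
```{r}
# predicted denial probability for a P/I ratio of 1 (linear probability model)
predict(denymod1, newdata = data.frame(pirat = 1))
```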
We may use `r ttcode("coeftest()")` to obtain robust standard errors for both coefficient estimates.
```{r}
# print robust coefficient summary
coeftest(denymod1, vcov. = vcovHC, type = "HC1")
```
The estimated regression line is
\begin{align}
\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} P/I \ ratio. (\#eq:lpm)
\end{align}
The coefficient on $P/I \ ratio$ is statistically significantly different from $0$ at the $1\%$ level. Its estimate can be interpreted as follows: a 1 percentage point increase in $P/I \ ratio$ leads to an increase in the probability of a loan denial by $0.604 \cdot 0.01 = 0.00604 \approx 0.6\%$.
Following the book we augment the simple model \@ref(eq:denymod1) by an additional regressor $black$ which equals $1$ if the applicant is an African American and equals $0$ otherwise. Such a specification is the baseline for investigating if there is racial discrimination in the mortgage market: if being black has a significant (positive) influence on the probability of a loan denial when we control for factors that allow for an objective assessment of an applicant's creditworthiness, this is an indicator of discrimination.
```{r}
# rename the variable 'afam' for consistency
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"
# estimate the model
denymod2 <- lm(deny ~ pirat + black, data = HMDA)
coeftest(denymod2, vcov. = vcovHC)
```
The estimated regression function is
\begin{align}
\widehat{deny} =& \, -\underset{(0.029)}{0.091} + \underset{(0.089)}{0.559} P/I \ ratio + \underset{(0.025)}{0.177} black. (\#eq:denymod2)
\end{align}
The coefficient on $black$ is positive and significantly different from zero at the $0.01\%$ level. The interpretation is that, holding constant the $P/I \ ratio$, being black increases the probability of a mortgage application denial by about $17.7\%$. This finding is compatible with racial discrimination. However, it might be distorted by omitted variable bias, so discrimination could be a premature conclusion.
## Probit and Logit Regression {#palr}
The linear probability model has a major flaw: it assumes the conditional probability function to be linear. This does not restrict $P(Y=1\vert X_1,\dots,X_k)$ to lie between $0$ and $1$. We can easily see this in our reproduction of Figure 11.1 of the book: for $P/I \ ratio \geq 1.75$, \@ref(eq:lpm) predicts the probability of a mortgage application denial to be bigger than $1$. For applications with $P/I \ ratio$ close to $0$, the predicted probability of denial is even negative so that the model has no meaningful interpretation here.
This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable. Commonly used methods are Probit and Logit regression.
### Probit Regression {-}
In Probit regression, the cumulative standard normal distribution function $\Phi(\cdot)$ is used to model the regression function when the dependent variable is binary, that is, we assume
\begin{align}
E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 X). (\#eq:probitmodel)
\end{align}
$\beta_0 + \beta_1 X$ in \@ref(eq:probitmodel) plays the role of a quantile $z$. Remember that $$\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)$$ such that the Probit coefficient $\beta_1$ in \@ref(eq:probitmodel) is the change in $z$ associated with a one unit change in $X$. Although the effect on $z$ of a change in $X$ is linear, the link between $z$ and the dependent variable $Y$ is nonlinear since $\Phi$ is a nonlinear function of $X$.
Since the dependent variable is a nonlinear function of the regressors, the coefficient on $X$ has no simple interpretation. According to Key Concept 8.1, the expected change in the probability that $Y=1$ due to a change in $P/I \ ratio$ can be computed as follows:
1. Compute the predicted probability that $Y=1$ for the original value of $X$.
2. Compute the predicted probability that $Y=1$ for $X + \Delta X$.
3. Compute the difference between both predicted probabilities.
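The following minimal sketch (not taken from the book) illustrates these three steps for a simple Probit model with *hypothetical* coefficients `b0` and `b1`; further below, the same computation is carried out for an estimated model using `r ttcode("predict()")`.
```{r}
# illustrative sketch with hypothetical Probit coefficients b0 and b1
b0 <- -2    # hypothetical intercept
b1 <-  3    # hypothetical slope
x  <- 0.3   # original value of X
dx <- 0.1   # change in X
p_original <- pnorm(b0 + b1 * x)          # 1. predicted probability at x
p_new      <- pnorm(b0 + b1 * (x + dx))   # 2. predicted probability at x + dx
p_new - p_original                        # 3. difference in predicted probabilities
```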
Of course we can generalize \@ref(eq:probitmodel) to Probit regression with multiple regressors to mitigate the risk of facing omitted variable bias. Probit regression essentials are summarized in Key Concept 11.2.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.2">
<h3 class = "right"> Key Concept 11.2 </h3>
<h3 class = "left"> Probit Model, Predicted Probabilities and Estimated Effects</h3>
Assume that $Y$ is a binary variable. The model
$$ Y= \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k + u $$
with
$$P(Y = 1 \\vert X_1, X_2, \\dots ,X_k) = \\Phi(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)$$
is the population Probit model with multiple regressors $X_1, X_2, \\dots, X_k$ and $\\Phi(\\cdot)$ is the cumulative standard normal distribution function.
The predicted probability that $Y=1$ given $X_1, X_2, \\dots, X_k$ can be calculated in two steps:
1. Compute $z = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k$
2. Look up $\\Phi(z)$ by calling <tt>pnorm()</tt>.
$\\beta_j$ is the effect on $z$ of a one unit change in regressor $X_j$, holding constant all other $k-1$ regressors.
The effect on the predicted probability of a change in a regressor can be computed as in Key Concept 8.1.
In <tt>R</tt>, Probit models can be estimated using the function <tt>glm()</tt> from the package <tt>stats</tt>. Using the argument <tt>family</tt> we specify that we want to use a Probit link function.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Probit Model, Predicted Probabilities and Estimated Effects]{11.2}
Assume that $Y$ is a binary variable. The model
$$ Y= \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k + u $$
with
$$P(Y = 1 \\vert X_1, X_2, \\dots ,X_k) = \\Phi(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)$$
is the population Probit model with multiple regressors $X_1, X_2, \\dots, X_k$ and $\\Phi(\\cdot)$ is the cumulative standard normal distribution function.\\newline
The predicted probability that $Y=1$ given $X_1, X_2, \\dots, X_k$ can be calculated in two steps:\\newline
\\begin{enumerate}
\\item Compute $z = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k$
\\item Look up $\\Phi(z)$ by calling \\texttt{pnorm()}.
\\end{enumerate}\\vspace{0.5cm}
$\\beta_j$ is the effect on $z$ of a one unit change in regressor $X_j$, holding constant all other $k-1$ regressors.\\newline
The effect on the predicted probability of a change in a regressor can be computed as in Key Concept 8.1.\\newline
In \\texttt{R}, Probit models can be estimated using the function \\texttt{glm()} from the package \\texttt{stats}. Using the argument \\texttt{family} we specify that we want to use a Probit link function.
\\end{keyconcepts}
')
```
We now estimate a simple Probit model of the probability of a mortgage denial.
```{r}
# estimate the simple probit model
denyprobit <- glm(deny ~ pirat,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
```
The estimated model is
\begin{align}
\widehat{P(deny \vert P/I \ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} P/I \ ratio). (\#eq:denyprobit)
\end{align}
Just as in the linear probability model we find that the relation between the probability of denial and the payments-to-income ratio is positive and that the corresponding coefficient is highly significant.
The following code chunk reproduces Figure 11.2 of the book.
<div class="unfolded">
```{r, fig.align='center'}
# plot data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Probit Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
```
</div>
The estimated regression function has a stretched "S"-shape which is typical for the CDF of a continuous random variable with symmetric PDF like that of a normal random variable. The function is clearly nonlinear and flattens out for large and small values of $P/I \ ratio$. The functional form thus also ensures that the predicted conditional probabilities of a denial lie between $0$ and $1$.
We use `r ttcode("predict()")` to compute the predicted change in the denial probability when $P/I \ ratio$ is increased from $0.3$ to $0.4$.
```{r}
# 1. compute predictions for P/I ratio = 0.3, 0.4
predictions <- predict(denyprobit,
newdata = data.frame("pirat" = c(0.3, 0.4)),
type = "response")
# 2. Compute difference in probabilities
diff(predictions)
```
We find that an increase in the payment-to-income ratio from $0.3$ to $0.4$ is predicted to increase the probability of denial by approximately $6.2\%$.
We continue by using an augmented Probit model to estimate the effect of race on the probability of a mortgage application denial.
```{r}
denyprobit2 <- glm(deny ~ pirat + black,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
```
The estimated model equation is
\begin{align}
\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} P/I \ ratio + \underset{(0.08)}{0.71} black). (\#eq:denyprobit2)
\end{align}
All coefficients are highly significant, and the estimated coefficients on both the payments-to-income ratio and the indicator for African American descent are positive. Again, the coefficients are difficult to interpret but they indicate that, first, African Americans have a higher probability of denial than white applicants, holding constant the payments-to-income ratio, and second, applicants with a high payments-to-income ratio face a higher risk of being rejected.
How big is the estimated difference in denial probabilities between two hypothetical applicants with the same payments-to-income ratio? As before, we may use `r ttcode("predict()")` to compute this difference.
```{r}
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denyprobit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
# 2. compute difference in probabilities
diff(predictions)
```
In this case, the estimated difference in denial probabilities is about $15.8\%$.
### Logit Regression {-}
Key Concept 11.3 summarizes the Logit regression function.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.3">
<h3 class = "right"> Key Concept 11.3 </h3>
<h3 class = "left"> Logit Regression </h3>
The population Logit regression function is
\\begin{align*}
P(Y=1\\vert X_1, X_2, \\dots, X_k) =& \\, F(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k) \\\\
=& \\, \\frac{1}{1+e^{-(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)}}.
\\end{align*}
The idea is similar to Probit regression except that a different CDF is used: $$F(x) = \\frac{1}{1+e^{-x}}$$ is the CDF of a standard logistically distributed random variable.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Logit Regression]{11.3}
The population Logit regression function is
\\begin{align*}
P(Y=1\\vert X_1, X_2, \\dots, X_k) =& \\, F(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k) \\\\
=& \\, \\frac{1}{1+e^{-(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)}}.
\\end{align*}
The idea is similar to Probit regression except that a different CDF is used: $$F(x) = \\frac{1}{1+e^{-x}}$$ is the CDF of a standard logistically distributed random variable.
\\end{keyconcepts}
')
```
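A quick numerical check (not part of the book's code) confirms that `r ttcode("plogis()")`, the logistic CDF implemented in base `r ttcode("R")`, coincides with the formula $F(x) = 1/(1+e^{-x})$ stated in Key Concept 11.3.
```{r}
# the logistic CDF: plogis(x) equals 1 / (1 + exp(-x))
x <- seq(-4, 4, 0.5)
all.equal(plogis(x), 1 / (1 + exp(-x)))
```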
As for Probit regression, there is no simple interpretation of the model coefficients and it is best to consider predicted probabilities or differences in predicted probabilities. Here again, $t$-statistics and confidence intervals based on large sample normal approximations can be computed as usual.
It is fairly easy to estimate a Logit regression model using `r ttcode("R")`.
```{r}
denylogit <- glm(deny ~ pirat,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit, vcov. = vcovHC, type = "HC1")
```
The subsequent code chunk reproduces Figure 11.3 of the book.
<div class="unfolded">
```{r, fig.align='center'}
# plot data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c("Probit", "Logit"),
col = c("steelblue", "black"),
lty = c(1, 2))
```
</div>
Both models produce very similar estimates of the probability that a mortgage application will be denied depending on the applicant's payment-to-income ratio.
Following the book we extend the simple Logit model of mortgage denial with the additional regressor $black$.
```{r}
# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
```
We obtain
\begin{align}
\widehat{P(deny=1 \vert P/I \ ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} P/I \ ratio + \underset{(0.15)}{1.27} black). (\#eq:denylogit2)
\end{align}
As for the Probit model \@ref(eq:denyprobit2) all model coefficients are highly significant and we obtain positive estimates for the coefficients on $P/I \ ratio$ and $black$. For comparison we compute the predicted probability of denial for two hypothetical applicants that differ in race and have a $P/I \ ratio$ of $0.3$.
```{r}
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
predictions
# 2. Compute difference in probabilities
diff(predictions)
```
We find that the white applicant faces a denial probability of only $7.5\%$, while the African American applicant is rejected with a probability of $22.4\%$, a difference of $14.9\%$.
#### Comparison of the Models {-}
The Probit model and the Logit model deliver only approximations to the unknown population regression function $E(Y\vert X)$. It is not obvious how to decide which model to use in practice. The linear probability model has the clear drawback of not being able to capture the nonlinear nature of the population regression function and it may predict probabilities to lie outside the interval $[0,1]$. Probit and Logit models are harder to interpret but capture the nonlinearities better than the linear approach: both models produce predictions of probabilities that lie inside the interval $[0,1]$. Predictions of all three models are often close to each other. The book suggests using the method that is easiest to work with in the statistical software of choice. As we have seen, it is equally easy to estimate Probit and Logit models using `r ttcode("R")`. We can therefore give no general recommendation on which method to use.
## Estimation and Inference in the Logit and Probit Models
So far nothing has been said about *how* Logit and Probit models are estimated by statistical software. The reason why this is interesting is that both models are *nonlinear in the parameters* and thus cannot be estimated using OLS. Instead one relies on *maximum likelihood estimation* (MLE). Another approach is estimation by *nonlinear least squares* (NLS).
#### Nonlinear Least Squares {-}
Consider the multiple regression Probit model
\begin{align}
E(Y_i\vert X_{1i}, \dots, X_{ki}) = P(Y_i=1\vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). (\#eq:multprobit)
\end{align}
Similarly to OLS, NLS estimates the parameters $\beta_0,\beta_1,\dots,\beta_k$ by minimizing the sum of squared mistakes
$$\sum_{i=1}^n\left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.$$
NLS estimation is a consistent approach that produces estimates which are normally distributed in large samples. In `r ttcode("R")` there are functions like `r ttcode("nls()")` from package `r ttcode("stats")` which provide algorithms for solving nonlinear least squares problems.
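As a rough illustration (not part of the book's code), the following sketch fits the simple Probit model for `r ttcode("deny")` by NLS using `r ttcode("nls()")`; the starting values are ad-hoc assumptions and convergence is not guaranteed for arbitrary choices.
```{r}
# sketch: NLS estimation of the simple Probit model (starting values are assumptions)
deny_nls <- nls(deny ~ pnorm(b0 + b1 * pirat),
                data = HMDA,
                start = list(b0 = -2, b1 = 2))
summary(deny_nls)
```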
However, NLS is inefficient, meaning that there are estimation techniques with a smaller variance, which is why we will not dwell any further on this topic.
#### Maximum Likelihood Estimation {-}
In MLE we seek to estimate the unknown parameters by choosing them such that the likelihood of drawing the observed sample is maximized. This probability is measured by means of the likelihood function, the joint probability distribution of the data treated as a function of the unknown parameters. Put differently, the maximum likelihood estimates of the unknown parameters are the values that result in a model which is most likely to produce the observed data. It turns out that MLE is more efficient than NLS.
As maximum likelihood estimates are normally distributed in large samples, statistical inference for coefficients in nonlinear models like Logit and Probit regression can be made using the same tools that are used for linear regression models: we can compute $t$-statistics and confidence intervals.
Many software packages use an MLE algorithm for estimation of nonlinear models. The function `r ttcode("glm()")` uses an algorithm named *iteratively reweighted least squares*.
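To make the idea behind MLE concrete, the following sketch (not part of the book's code) maximizes the Probit log-likelihood of the simple model of mortgage denial numerically using `r ttcode("optim()")`; the starting values and optimizer settings are assumptions, and the resulting estimates should be close to those reported by `r ttcode("glm()")`.
```{r}
# sketch: log-likelihood of the simple Probit model deny ~ pirat
probit_loglik <- function(beta, y = HMDA$deny, x = HMDA$pirat) {
  p <- pnorm(beta[1] + beta[2] * x)        # predicted probabilities
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)     # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))   # Bernoulli log-likelihood
}
# maximize the log-likelihood numerically (starting values are an assumption)
mle <- optim(par = c(0, 0), fn = probit_loglik, control = list(fnscale = -1))
mle$par          # numerical MLE
coef(denyprobit) # coefficients obtained by glm()
```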
#### Measures of Fit {-}
It is important to be aware that the usual $R^2$ and $\bar{R}^2$ are *invalid* for nonlinear regression models. The reason for this is simple: both measures assume that the relation between the dependent and the explanatory variable(s) is linear. This obviously does not hold for Probit and Logit models. Thus $R^2$ need not lie between $0$ and $1$ and there is no meaningful interpretation. However, statistical software sometimes reports these measures anyway.
There are many measures of fit for nonlinear regression models and there is no consensus which one should be reported. The situation is even more complicated because there is no measure of fit that is generally meaningful. For models with a binary response variable like $deny$ one could use the following rule: \newline If $Y_i = 1$ and $\widehat{P(Y_i|X_{i1}, \dots, X_{ik})} > 0.5$ or if $Y_i = 0$ and $\widehat{P(Y_i|X_{i1}, \dots, X_{ik})} < 0.5$, consider $Y_i$ as correctly predicted. Otherwise $Y_i$ is said to be incorrectly predicted. The measure of fit is the share of correctly predicted observations. The downside of such an approach is that it does not mirror the quality of the prediction: whether $\widehat{P(Y_i = 1|X_{i1}, \dots, X_{ik})} = 0.51$ or $\widehat{P(Y_i = 1|X_{i1}, \dots, X_{ik})} = 0.99$ is not reflected; we just predict $Y_i = 1$.^[This is in contrast to the case of a numeric dependent variable where we use the squared errors for assessment of the quality of the prediction.]
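The following chunk (an illustration added here, not from the book) computes this fraction of correctly predicted outcomes for the augmented Probit model `r ttcode("denyprobit2")` using the $0.5$ threshold described above.
```{r}
# fraction of correctly predicted outcomes for the augmented Probit model
fitted_probs <- predict(denyprobit2, type = "response")
predicted_deny <- as.numeric(fitted_probs > 0.5)  # classify using the 0.5 threshold
mean(predicted_deny == HMDA$deny)                 # share of correct predictions
```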
An alternative to the latter is given by so-called pseudo-$R^2$ measures. In order to measure the quality of the fit, these measures compare the value of the maximized (log-)likelihood of the model with all regressors (the *full model*) to the likelihood of a model with no regressors (*null model*, regression on a constant).
For example, consider a Probit regression. The $\text{pseudo-}R^2$ is given by $$\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}$$ where $f^{max}_j \in [0,1]$ denotes the maximized likelihood for model $j$.
The reasoning behind this is that the maximized likelihood increases as additional regressors are added to the model, similarly to the decrease in $SSR$ when regressors are added in a linear regression model. If the full model has a similar maximized likelihood as the null model, the full model does not really improve upon a model that uses only the information in the dependent variable, so $\text{pseudo-}R^2 \approx 0$. If the full model fits the data very well, the maximized likelihood should be close to $1$ such that $\ln(f^{max}_{full}) \approx 0$ and $\text{pseudo-}R^2 \approx 1$. See Appendix 11.2 of the book for more on MLE and pseudo-$R^2$ measures.
`r ttcode("summary()")` does not report $\text{pseudo-}R^2$ for models estimated by `r ttcode("glm()")` but we can use the entries *residual deviance* (`r ttcode("deviance")`) and *null deviance* (`r ttcode("null.deviance")`) instead. These are computed as
$$\text{deviance} = -2 \times \left[\ln(f^{max}_{full}) - \ln(f^{max}_{saturated}) \right]$$
and
$$\text{null deviance} = -2 \times \left[\ln(f^{max}_{null}) - \ln(f^{max}_{saturated}) \right]$$
where $f^{max}_{saturated}$ is the maximized likelihood for a model which assumes that each observation has its own parameter (there are $n+1$ parameters to be estimated which leads to a perfect fit). For models with a binary dependent variable, it holds that $$\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1- \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.$$
We now compute the $\text{pseudo-}R^2$ for the augmented Probit model of mortgage denial.
```{r}
# compute pseudo-R2 for the probit model of mortgage denial
pseudoR2 <- 1 - (denyprobit2$deviance) / (denyprobit2$null.deviance)
pseudoR2
```
Another way to obtain the $\text{pseudo-}R^2$ is to estimate the null model using `r ttcode("glm()")` and extract the maximized log-likelihoods for both the null and the full model using the function `r ttcode("logLik()")`.
```{r}
# compute the null model
denyprobit_null <- glm(formula = deny ~ 1,
family = binomial(link = "probit"),
data = HMDA)
# compute the pseudo-R2 using 'logLik'
1 - logLik(denyprobit2)[1]/logLik(denyprobit_null)[1]
```
## Application to the Boston HMDA Data
Models \@ref(eq:denyprobit2) and \@ref(eq:denylogit2) indicate that denial rates are higher for African American applicants holding constant the payment-to-income ratio. Both results could be subject to omitted variable bias. In order to obtain a more trustworthy estimate of the effect of being black on the probability of a mortgage application denial we estimate a linear probability model as well as several Logit and Probit models. We thereby control for financial variables and additional applicant characteristics which are likely to influence the probability of denial and differ between black and white applicants.
Sample averages as shown in Table 11.1 of the book can be easily reproduced using the functions `r ttcode("mean()")` (as usual for numeric variables) and `r ttcode("prop.table()")` (for factor variables).
```{r}
# Mean P/I ratio
mean(HMDA$pirat)
# housing expense-to-income ratio
mean(HMDA$hirat)
# loan-to-value ratio
mean(HMDA$lvrat)
# consumer credit score
mean(as.numeric(HMDA$chist))
# mortgage credit score
mean(as.numeric(HMDA$mhist))
# public bad credit record
mean(as.numeric(HMDA$phist)-1)
# denied mortgage insurance
prop.table(table(HMDA$insurance))
# self-employed
prop.table(table(HMDA$selfemp))
# single
prop.table(table(HMDA$single))
# high school diploma
prop.table(table(HMDA$hschool))
# unemployment rate
mean(HMDA$unemp)
# condominium
prop.table(table(HMDA$condomin))
# black
prop.table(table(HMDA$black))
# deny
prop.table(table(HMDA$deny))
```
See Chapter 11.4 of the book or use `r ttcode("R")`'s help function for more on variables contained in the `r ttcode("HMDA")` dataset.
Before estimating the models we transform the loan-to-value ratio (`r ttcode("lvrat")`) into a factor variable, where
\begin{align*}
lvrat =
\begin{cases}
\text{low} & \text{if} \ \ lvrat < 0.8, \\
\text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\
\text{high} & \text{if} \ \ lvrat > 0.95
\end{cases}
\end{align*}
and convert both credit scores to numeric variables.
```{r}
# define low, medium and high loan-to-value ratio
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
         ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))
# convert credit scores to numeric
HMDA$mhist <- as.numeric(HMDA$mhist)
HMDA$chist <- as.numeric(HMDA$chist)
```
Next we reproduce the estimation results presented in Table 11.2 of the book.
```{r}
# estimate all 6 models for the denial probability
lpm_HMDA <- lm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp, data = HMDA)
logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "logit"),
data = HMDA)
probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist
+ phist + insurance + selfemp + single + hschool + unemp + condomin
+ I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5)
+ I(chist==6),
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
```
Just as in previous chapters, we store heteroskedasticity-robust standard errors of the coefficient estimators in a `r ttcode("list")` which is then used as the argument `r ttcode("se")` in `r ttcode("stargazer()")`.
```{r, eval=FALSE}
rob_se <- list(sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1,
probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
digits = 3,
type = "latex",
header = FALSE,
se = rob_se,
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)"))
```
<!--html_preserve-->
```{r hmdad, message=F, warning=F, results='asis', echo=F, eval=my_output == "html"}
library(stargazer)
rob_se <- list(
sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1")))
)
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1, probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
digits = 3,
type = "html",
se = rob_se,
header = FALSE,
dep.var.caption = "Dependent Variable: Mortgage Application Denial",
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)")
)
stargazer_html_title("HMDA Data: LPM, Probit and Logit Models", "hmdad")
```
<!--/html_preserve-->
```{r, message=F, warning=F, results='asis', echo=F, eval=my_output == "latex"}
library(stargazer)
rob_se <- list(
sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1")))
)
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1, probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
title = "\\label{tab:hmdad} HMDA Data: LPM, Probit and Logit Models",
digits = 3,
type = "latex",
float.env = "sidewaystable",
column.sep.width = "-5pt",
no.space = T,
single.row = T,
header = FALSE,
se = rob_se,
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)")
)
```
In Table \@ref(tab:hmdad), models (1), (2) and (3) are baseline specifications that include several financial control variables. They differ only in the way they model the denial probability. Model (1) is a linear probability model, model (2) is a Logit regression and model (3) uses the Probit approach.
In the linear model (1), the coefficients have direct interpretation. For example, an increase in the consumer credit score by $1$ unit is estimated to increase the probability of a loan denial by about $0.031$, that is, $3.1\%$. Having a high loan-to-value ratio is detrimental to credit approval: the coefficient for a loan-to-value ratio higher than $0.95$ is $0.189$, so clients with this property are estimated to face an almost $19\%$ higher risk of denial than those with a low loan-to-value ratio, ceteris paribus. The estimated coefficient on the race dummy is $0.084$, which indicates the denial probability for African Americans is $8.4\%$ higher than for white applicants with the same characteristics except for race. Apart from the housing expense-to-income ratio and the mortgage credit score, all coefficients are significant.
Models (2) and (3) provide similar evidence that there is racial discrimination in the U.S. mortgage market. All coefficients except for the housing expense-to-income ratio (which is not significantly different from zero) are significant at the $1\%$ level. As discussed above, the nonlinearity makes the interpretation of the coefficient estimates more difficult than for model (1). In order to make a statement about the effect of being black, we need to compute the estimated denial probability for two individuals that differ only in race. For the comparison we consider two individuals that share mean values for all numeric regressors. For the qualitative variables we assign the property that is most representative of the data at hand. For example, consider self-employment: we have seen that about $88\%$ of all individuals in the sample are not self-employed, so we set `r ttcode("selfemp = no")`. Using this approach, the estimated effect of being African American on the denial probability in the Logit model (2) is about $4\%$. The next code chunk shows how to apply this approach for models (1) to (6) using `r ttcode("R")`.
```{r}
# compute regressor values for an average black person
new <- data.frame(
  "pirat" = mean(HMDA$pirat),
  "hirat" = mean(HMDA$hirat),
  "lvrat" = "low",
  "chist" = mean(HMDA$chist),
  "mhist" = mean(HMDA$mhist),
  "phist" = "no",
  "insurance" = "no",
  "selfemp" = "no",
  "black" = c("no", "yes"),
  "single" = "no",
  "hschool" = "yes",
  "unemp" = mean(HMDA$unemp),
  "condomin" = "no")
# difference predicted by the LPM
predictions <- predict(lpm_HMDA, newdata = new)
diff(predictions)
# difference predicted by the logit model
predictions <- predict(logit_HMDA, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (3)
predictions <- predict(probit_HMDA_1, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (4)
predictions <- predict(probit_HMDA_2, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (5)
predictions <- predict(probit_HMDA_3, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (6)
predictions <- predict(probit_HMDA_4, newdata = new, type = "response")
diff(predictions)
```
The estimates of the impact on the denial probability of being black are similar for models (2) and (3). It is interesting that the magnitude of the estimated effects is much smaller than for the Probit and Logit models that do not control for financial characteristics (see Chapter \@ref(palr)). This indicates that these simple models produce biased estimates due to omitted variables.
Regressions (4) to (6) use regression specifications that include different applicant characteristics and credit rating indicator variables as well as interactions. However, most of the corresponding coefficients are not significant and the estimates of the coefficient on `r ttcode("black")` obtained for these models as well as the estimated difference in denial probabilities do not differ much from those obtained for the similar specifications (2) and (3).
An interesting question related to racial discrimination can be investigated using the Probit model (6), where the interactions `r ttcode("blackyes:pirat")` and `r ttcode("blackyes:hirat")` are added to model (4). If the coefficient on `r ttcode("blackyes:pirat")` was different from zero, the effect of the payment-to-income ratio on the denial probability would be different for black and white applicants. Similarly, a non-zero coefficient on `r ttcode("blackyes:hirat")` would indicate that loan officers weight the risk of bankruptcy associated with a high housing expense-to-income ratio differently for black and white mortgage applicants. We can test whether these coefficients are jointly significant at the $5\%$ level using an $F$-test.
```{r}
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
```
Since $p\text{-value} \approx 0.77$ for this test, the null cannot be rejected. Nonetheless, we can reject the hypothesis that there is no racial discrimination at all since the corresponding $F$-test has a $p\text{-value}$ of about $0.002$.
```{r}
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes=0", "blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
```
#### Summary {-}
Models (1) to (6) provide evidence that there is an effect of being African American on the probability of a mortgage application denial: in all specifications, the effect is estimated to be positive (ranging from $4\%$ to $5\%$) and is significantly different from zero at the $1\%$ level. While the linear probability model seems to slightly overestimate this effect, it still can be used as an approximation to an intrinsically nonlinear relationship.
See Chapters 11.4 and 11.5 of the book for a discussion of external and internal validity of this study and some concluding remarks on regression models where the dependent variable is binary.
## Exercises
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 1. Titanic Survival Data {-}
Chapter \\@ref(palr) presented three approaches to model the conditional expectation function of a binary dependent variable: the linear probability model as well as Probit and Logit regression.
The exercises in this Chapter use data on the fate of the passengers of the ocean liner *Titanic*. We aim to explain survival, a binary variable, by socioeconomic variables using the above approaches.
In this exercise we start with the aggregated data set <tt>Titanic</tt>. It is part of the package <tt>datasets</tt> which is part of base <tt>R</tt>. The following quote from the [description](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html) of the dataset motivates the attempt to predict the *probability* of survival:
*The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts — from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class — are reflected in the survival rates for various classes of passenger.*
**Instructions:**
+ Assign the <tt>Titanic</tt> data to <tt>Titanic_1</tt> and get an overview.
+ Visualize the conditional survival rates for travel class (<tt>Class</tt>), gender (<tt>Sex</tt>) and age (<tt>Age</tt>) using <tt>mosaicplot()</tt>.
<iframe src="DCL/ex11_1.html" frameborder="0" scrolling="no" style="width:100%;height:360px"></iframe>
</div>') } else {
cat("\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}")
}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 2. Titanic Survival Data --- Ctd. {-}
The <tt>Titanic</tt> data set from Exercise 1 is not useful for regression analysis because it is highly aggregated. In this exercise you will work with <tt>titanic.csv</tt> which is available under the URL https://stanford.io/2O9RUCF.
The columns of <tt>titanic.csv</tt> contain the following variables:
<tt>Survived</tt> --- The survived indicator
<tt>Pclass</tt> --- passenger class
<tt>Name</tt> --- passenger\'s Name
<tt>Sex</tt> --- passenger\'s gender
<tt>Age</tt> --- passengers\'s age
<tt>Siblings</tt> --- number of siblings aboard
<tt>Parents.Children.Aboard</tt> --- number of parents and children aboard
<tt>fare</tt> --- the fare paid in British pounds
**Instructions:**
+ Import the data from <tt>titanic.csv</tt> using the function <tt>read.csv2()</tt>. Save it to <tt>Titanic_2</tt>.
+ Assign the following column names to <tt>Titanic_2</tt>:
<tt>Survived, Class, Name, Sex, Age, Siblings, Parents</tt> and <tt>Fare</tt>.
+ Get an overview over the data set. Drop the column <tt>Name</tt>.
+ Attach the packages <tt>corrplot</tt> and <tt>dplyr</tt>. Check whether there is multicollinearity in the data using <tt>corrplot()</tt>.
<iframe src="DCL/ex11_2.html" frameborder="0" scrolling="no" style="width:100%;height:560px"></iframe>
**Hints:**
+ <tt>read_csv()</tt> guesses the column specification as well as the separators used in the <tt>.csv</tt> file. You should always check if the result is correct.
+ You may use <tt>select_if()</tt> from the <tt>dplyr</tt> package to select all numeric columns from the data set.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 3. Titanic Survival Data --- Survival Rates {-}
Contingency tables similar to those provided by the data set <tt>Titanic</tt> from Exercise 1 may shed some light on the distribution of survival conditional on possible determinants thereof, e.g., the passenger class. Contingency tables are easily created using the base <tt>R</tt> function <tt>table()</tt>.
**Instructions:**
+ Generate a contingency table for <tt>Survived</tt> and <tt>Class</tt> using <tt>table()</tt>. Save the table to <tt>t_abs</tt>.
+ <tt>t_abs</tt> reports absolute frequencies. Transform <tt>t_abs</tt> into a table which reports relative frequencies (relative to the total number of observations). Save the result to <tt>t_rel</tt>.
+ Visualize the relative frequencies in <tt>t_rel</tt> using <tt>barplot()</tt>. Use different colors to better distinguish between the survival and non-survival rates (it does not matter which colors you use).
<iframe src="DCL/ex11_4_3.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 4. Titanic Survival Data --- Conditional Distributions of <tt>Age</tt> {-}
Contingency tables are useful for summarizing the distribution of categorical variables like <tt>Survived</tt> and <tt>Class</tt> in Exercise 3. They are, however, not useful when the variable of interest takes on many different integer values (and they are even impossible to generate when the variable is continuous).
In this exercise you are asked to generate and visualize density estimates of the distribution of <tt>Age</tt> conditional on <tt>Survived</tt> to see whether there are indications of how age relates to the chance of survival (although the data set reports integers, we treat <tt>Age</tt> as a continuous variable here). For example, it is interesting to see if the \'women and children first\' policy was effective.
The data set <tt>Titanic_2</tt> from the previous exercises is available in your working environment.
**Instructions:**
+ Obtain kernel density estimates of the distributions of <tt>Age</tt> for both the survivors and the deceased.
+ Save the results to <tt>dens_age_surv</tt> (survived) and <tt>dens_age_died</tt> (died).
+ Plot both kernel density estimates (overlay them in a single plot!). Use different colors of your choice to make the estimates distinguishable.
<iframe src="DCL/ex11_5_4.html" frameborder="0" scrolling="no" style="width:100%;height:330px"></iframe>
**Hints:**
+ Kernel density estimates can be obtained using the function <tt>density()</tt>.
+ Use <tt>plot()</tt> and <tt>lines()</tt> to plot the density estimates.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 5. Titanic Survival Data --- A Linear Probability Model for <tt>Survival</tt> I {-}
How do socio-economic characteristics of the passengers impact the probability of survival? In particular, are there systematic differences between the three passenger classes? Do the data reflect the \'children and women first\' policy?
It is natural to start the analysis by estimating a simple linear probability model (LPM) like $$Survived_i = \\beta_0 + \\beta_1 Class2_i + \\beta_2 Class3_i + u_i$$ with dummy variables $Class2_i$ and $Class3_i$.
The data set <tt>Titanic_2</tt> from the previous exercises is available in your working environment.
**Instructions:**
+ Attach the <tt>AER</tt> package.
+ <tt>Class</tt> is of type <tt>int</tt> (integer). Convert <tt>Class</tt> to a factor variable.
+ Estimate the linear probability model and save the result to <tt>surv_mod</tt>.
+ Obtain a robust summary of the model coefficients.
+ Use <tt>surv_mod</tt> to predict the probability of survival for the three passenger classes.
<iframe src="DCL/ex11_lpm.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
**Hints:**
+ Linear probability models can be estimated using <tt>lm()</tt>.
+ Use <tt>predict()</tt> to obtain the predictions. Remember that a <tt>data.frame</tt> must be provided to the argument <tt>newdata</tt>.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 6. Titanic Survival Data --- A Linear Probability Model for <tt>Survival</tt> II {-}
Consider again the outcome from Exercise 5:
$$\\widehat{Survived}_i = \\underset{(0.03)}{0.63} - \\underset{(0.05)}{0.16} Class2_i - \\underset{(0.04)}{0.39} Class3_i $$
(The estimated coefficients in this model are related to the class specific sample means of <tt>Survived</tt>. You are asked to compute them below.)
The highly significant coefficients indicate that the probability of survival decreases with the passenger class, that is, passengers from a less luxurious class are less likely to survive.
This result could be affected by omitted variable bias arising from correlation of the passenger class with determinants of the probability of survival not included in the model. We therefore augment the model such that it includes all remaining variables as regressors.
The data set <tt>Titanic_2</tt> as well as the model <tt>surv_mod</tt> from the previous exercises are available in your working environment. The <tt>AER</tt> package is attached.
**Instructions:**
+ Use the model object <tt>surv_mod</tt> to obtain the class specific estimates for the probability of survival. Store them in <tt>surv_prob_c1</tt>, <tt>surv_prob_c2</tt> and <tt>surv_prob_c3</tt>.
+ Fit the augmented LPM and assign the result to the object <tt>LPM_mod</tt>.
+ Obtain a robust summary of the model coefficients.
<iframe src="DCL/ex11_lpm2.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
**Hint:**
+ Remember that the formula <tt>a ~ .</tt> specifies a regression of <tt>a</tt> on all other variables in the data set provided as the argument <tt>data</tt> in <tt>glm()</tt>.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 7. Titanic Survival Data --- Logistic Regression {-}