# Regression with a Binary Dependent Variable {#rwabdv}
In this chapter, we discuss a special class of regression models that aim to explain a limited dependent variable. In particular, we consider models where the dependent variable is binary. We will see that in such models, the regression function can be interpreted as a conditional probability function of the binary dependent variable.
We review the following concepts:
- the linear probability model
- the Probit model
- the Logit model
- maximum likelihood estimation of nonlinear regression models
Of course, we will also see how to estimate the above models using `r ttcode("R")` and discuss an application where we examine whether there is racial discrimination in the U.S. mortgage market.
The following packages and their dependencies are needed to reproduce the code chunks presented throughout this chapter on your computer:
+ `r ttcode("AER")` [@R-AER]
+ `r ttcode("stargazer")` [@R-stargazer]
Check whether the following code chunk runs without any errors.
```{r, warning=FALSE, message=FALSE}
library(AER)
library(stargazer)
```
## Binary Dependent Variables and the Linear Probability Model
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.1">
<h3 class = "right"> Key Concept 11.1 </h3>
<h3 class = "left"> The Linear Probability Model </h3>
The linear regression model
$$Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i$$
with a binary dependent variable $Y_i$ is called the linear probability model. In the linear probability model we have $$E(Y\\vert X_1,X_2,\\dots,X_k) = P(Y=1\\vert X_1, X_2,\\dots, X_k)$$ where $$ P(Y = 1 \\vert X_1, X_2, \\dots, X_k) = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki}.$$
Thus, $\\beta_j$ can be interpreted as the change in the probability that $Y_i=1$, holding constant the other $k-1$ regressors. Just as in common multiple regression, the $\\beta_j$ can be estimated using OLS and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals.
In most linear probability models, $R^2$ has no meaningful interpretation since the regression line can never fit the data perfectly if the dependent variable is binary and the regressors are continuous. This can be seen in the application below.
It is *essential* to use robust standard errors since the $u_i$ in a linear probability model are always heteroskedastic.
Linear probability models are easily estimated in <tt>R</tt> using the function <tt>lm()</tt>.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[The Linear Probability Model]{11.1}
The linear regression model
$$Y_i = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki} + u_i$$
with a binary dependent variable $Y_i$ is called the linear probability model. In the linear probability model we have $$E(Y\\vert X_1,X_2,\\dots,X_k) = P(Y=1\\vert X_1, X_2,\\dots, X_k)$$ where $$ P(Y = 1 \\vert X_1, X_2, \\dots, X_k) = \\beta_0 + \\beta_1 X_{1i} + \\beta_2 X_{2i} + \\dots + \\beta_k X_{ki}.$$
Thus, $\\beta_j$ can be interpreted as the change in the probability that $Y_i=1$, holding constant the other $k-1$ regressors. Just as in common multiple regression, the $\\beta_j$ can be estimated using OLS and the robust standard error formulas can be used for hypothesis testing and computation of confidence intervals.\\newline
In most linear probability models, $R^2$ has no meaningful interpretation since the regression line can never fit the data perfectly if the dependent variable is binary and the regressors are continuous. This can be seen in the application below.\\newline
It is \\textit{essential} to use robust standard errors since the $u_i$ in a linear probability model are always heteroskedastic.\\newline
Linear probability models are easily estimated in \\texttt{R} using the function \\texttt{lm()}.
\\end{keyconcepts}
')
```
#### Mortgage Data {-}
Following the book, we start by loading the data set `r ttcode("HMDA")` which provides data on mortgage applications filed in Boston in 1990.
```{r, warning=FALSE, message=FALSE}
# load `AER` package and attach the HMDA data
library(AER)
data(HMDA)
```
We continue by inspecting the first few observations and computing summary statistics.
```{r}
# inspect the data
head(HMDA)
summary(HMDA)
```
The variable we are interested in modelling is `r ttcode("deny")`, an indicator for whether an applicant's mortgage application has been accepted (`r ttcode("deny = no")`) or denied (`r ttcode("deny = yes")`). A regressor that ought to have power in explaining whether a mortgage application has been denied is `r ttcode("pirat")`, the size of the anticipated total monthly loan payments relative to the applicant's income. It is straightforward to translate this into the simple regression model
\begin{align}
deny = \beta_0 + \beta_1 \times P/I\ ratio + u. (\#eq:denymod1)
\end{align}
We estimate this model just as any other linear regression model using `r ttcode("lm()")`. Before we do so, the variable `r ttcode("deny")` must be converted to a numeric variable using `r ttcode("as.numeric()")`, as `r ttcode("lm()")` does not accept a *dependent variable* of class `r ttcode("factor")`. Note that `as.numeric(HMDA$deny)` will turn `r ttcode("deny = no")` into `r ttcode("deny = 1")` and `r ttcode("deny = yes")` into `r ttcode("deny = 2")`, so using `r ttcode("as.numeric(HMDA$deny)-1")` we obtain the values `r ttcode("0")` and `r ttcode("1")`.
```{r}
# convert 'deny' to numeric
HMDA$deny <- as.numeric(HMDA$deny) - 1
# estimate a simple linear probability model
denymod1 <- lm(deny ~ pirat, data = HMDA)
denymod1
```
Next, we plot the data and the regression line to reproduce Figure 11.1 of the book.
```{r}
# plot the data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Scatterplot Mortgage Application Denial and the Payment-to-Income Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.8)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex = 0.8, "Mortgage approved")
# add the estimated regression line
abline(denymod1,
       lwd = 1.8,
       col = "steelblue")
```
According to the estimated model, a payment-to-income ratio of $1$ is associated with an expected probability of mortgage application denial of roughly $50\%$. The model indicates that there is a positive relation between the payment-to-income ratio and the probability of a denied mortgage application, so individuals with a high ratio of loan payments to income are more likely to be rejected.
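As a quick plausibility check (this chunk is not part of the book's code), we can evaluate the fitted line at a payment-to-income ratio of $1$ using `r ttcode("predict()")`.
```{r}
# predicted denial probability for a P/I ratio of 1 (linear probability model)
predict(denymod1, newdata = data.frame(pirat = 1))
```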
We may use `r ttcode("coeftest()")` to obtain robust standard errors for both coefficient estimates.
```{r}
# print robust coefficient summary
coeftest(denymod1, vcov. = vcovHC, type = "HC1")
```
The estimated regression line is
\begin{align}
\widehat{deny} = -\underset{(0.032)}{0.080} + \underset{(0.098)}{0.604} P/I \ ratio. (\#eq:lpm)
\end{align}
The coefficient on $P/I \ ratio$ is statistically significantly different from $0$ at the $1\%$ level. Its estimate can be interpreted as follows: a 1 percentage point increase in $P/I \ ratio$ leads to an increase in the probability of a loan denial by $0.604 \cdot 0.01 = 0.00604 \approx 0.6\%$.
Following the book we augment the simple model \@ref(eq:denymod1) by an additional regressor $black$ which equals $1$ if the applicant is an African American and equals $0$ otherwise. Such a specification is the baseline for investigating if there is racial discrimination in the mortgage market: if being black has a significant (positive) influence on the probability of a loan denial when we control for factors that allow for an objective assessment of an applicant's creditworthiness, this is an indicator of discrimination.
```{r}
# rename the variable 'afam' for consistency
colnames(HMDA)[colnames(HMDA) == "afam"] <- "black"
# estimate the model
denymod2 <- lm(deny ~ pirat + black, data = HMDA)
coeftest(denymod2, vcov. = vcovHC)
```
The estimated regression function is
\begin{align}
\widehat{deny} =& \, -\underset{(0.029)}{0.091} + \underset{(0.089)}{0.559} P/I \ ratio + \underset{(0.025)}{0.177} black. (\#eq:denymod2)
\end{align}
The coefficient on $black$ is positive and significantly different from zero at the $0.01\%$ level. The interpretation is that, holding constant the $P/I \ ratio$, being black increases the probability of a mortgage application denial by about $17.7\%$. This finding is compatible with racial discrimination. However, it might be distorted by omitted variable bias, so discrimination could be a premature conclusion.
## Probit and Logit Regression {#palr}
The linear probability model has a major flaw: it assumes the conditional probability function to be linear. This does not restrict $P(Y=1\vert X_1,\dots,X_k)$ to lie between $0$ and $1$. We can easily see this in our reproduction of Figure 11.1 of the book: for $P/I \ ratio \geq 1.75$, \@ref(eq:lpm) predicts the probability of a mortgage application denial to be bigger than $1$. For applications with $P/I \ ratio$ close to $0$, the predicted probability of denial is even negative so that the model has no meaningful interpretation here.
This circumstance calls for an approach that uses a nonlinear function to model the conditional probability function of a binary dependent variable. Commonly used methods are Probit and Logit regression.
### Probit Regression {-}
In Probit regression, the cumulative standard normal distribution function $\Phi(\cdot)$ is used to model the regression function when the dependent variable is binary, that is, we assume
\begin{align}
E(Y\vert X) = P(Y=1\vert X) = \Phi(\beta_0 + \beta_1 X). (\#eq:probitmodel)
\end{align}
$\beta_0 + \beta_1 X$ in \@ref(eq:probitmodel) plays the role of a quantile $z$. Remember that $$\Phi(z) = P(Z \leq z) \ , \ Z \sim \mathcal{N}(0,1)$$ such that the Probit coefficient $\beta_1$ in \@ref(eq:probitmodel) is the change in $z$ associated with a one unit change in $X$. Although the effect on $z$ of a change in $X$ is linear, the link between $z$ and the dependent variable $Y$ is nonlinear since $\Phi$ is a nonlinear function of $X$.
Since the dependent variable is a nonlinear function of the regressors, the coefficient on $X$ has no simple interpretation. According to Key Concept 8.1, the expected change in the probability that $Y=1$ due to a change in $P/I \ ratio$ can be computed as follows:
1. Compute the predicted probability that $Y=1$ for the original value of $X$.
2. Compute the predicted probability that $Y=1$ for $X + \Delta X$.
3. Compute the difference between both predicted probabilities.
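The following minimal sketch (not taken from the book) illustrates these three steps for a simple Probit model with *hypothetical* coefficients `b0` and `b1`; further below, the same computation is carried out for an estimated model using `r ttcode("predict()")`.
```{r}
# illustrative sketch with hypothetical Probit coefficients b0 and b1
b0 <- -2    # hypothetical intercept
b1 <-  3    # hypothetical slope
x  <- 0.3   # original value of X
dx <- 0.1   # change in X
p_original <- pnorm(b0 + b1 * x)          # 1. predicted probability at x
p_new      <- pnorm(b0 + b1 * (x + dx))   # 2. predicted probability at x + dx
p_new - p_original                        # 3. difference in predicted probabilities
```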
Of course we can generalize \@ref(eq:probitmodel) to Probit regression with multiple regressors to mitigate the risk of facing omitted variable bias. Probit regression essentials are summarized in Key Concept 11.2.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.2">
<h3 class = "right"> Key Concept 11.2 </h3>
<h3 class = "left"> Probit Model, Predicted Probabilities and Estimated Effects</h3>
Assume that $Y$ is a binary variable. The model
$$ Y= \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k + u $$
with
$$P(Y = 1 \\vert X_1, X_2, \\dots ,X_k) = \\Phi(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)$$
is the population Probit model with multiple regressors $X_1, X_2, \\dots, X_k$ and $\\Phi(\\cdot)$ is the cumulative standard normal distribution function.
The predicted probability that $Y=1$ given $X_1, X_2, \\dots, X_k$ can be calculated in two steps:
1. Compute $z = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k$
2. Look up $\\Phi(z)$ by calling <tt>pnorm()</tt>.
$\\beta_j$ is the effect on $z$ of a one unit change in regressor $X_j$, holding constant all other $k-1$ regressors.
The effect on the predicted probability of a change in a regressor can be computed as in Key Concept 8.1.
In <tt>R</tt>, Probit models can be estimated using the function <tt>glm()</tt> from the package <tt>stats</tt>. Using the argument <tt>family</tt> we specify that we want to use a Probit link function.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Probit Model, Predicted Probabilities and Estimated Effects]{11.2}
Assume that $Y$ is a binary variable. The model
$$ Y= \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k + u $$
with
$$P(Y = 1 \\vert X_1, X_2, \\dots ,X_k) = \\Phi(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)$$
is the population Probit model with multiple regressors $X_1, X_2, \\dots, X_k$ and $\\Phi(\\cdot)$ is the cumulative standard normal distribution function.\\newline
The predicted probability that $Y=1$ given $X_1, X_2, \\dots, X_k$ can be calculated in two steps:\\newline
\\begin{enumerate}
\\item Compute $z = \\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k$
\\item Look up $\\Phi(z)$ by calling \\texttt{pnorm()}.
\\end{enumerate}\\vspace{0.5cm}
$\\beta_j$ is the effect on $z$ of a one unit change in regressor $X_j$, holding constant all other $k-1$ regressors.\\newline
The effect on the predicted probability of a change in a regressor can be computed as in Key Concept 8.1.\\newline
In \\texttt{R}, Probit models can be estimated using the function \\texttt{glm()} from the package \\texttt{stats}. Using the argument \\texttt{family} we specify that we want to use a Probit link function.
\\end{keyconcepts}
')
```
We now estimate a simple Probit model of the probability of a mortgage denial.
```{r}
# estimate the simple probit model
denyprobit <- glm(deny ~ pirat,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit, vcov. = vcovHC, type = "HC1")
```
The estimated model is
\begin{align}
\widehat{P(deny \vert P/I \ ratio)} = \Phi(-\underset{(0.19)}{2.19} + \underset{(0.54)}{2.97} P/I \ ratio). (\#eq:denyprobit)
\end{align}
Just as in the linear probability model we find that the relation between the probability of denial and the payments-to-income ratio is positive and that the corresponding coefficient is highly significant.
The following code chunk reproduces Figure 11.2 of the book.
<div class="unfolded">
```{r, fig.align='center'}
# plot data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Probit Model of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.85)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line
x <- seq(0, 3, 0.01)
y <- predict(denyprobit, list(pirat = x), type = "response")
lines(x, y, lwd = 1.5, col = "steelblue")
```
</div>
The estimated regression function has a stretched "S"-shape which is typical for the CDF of a continuous random variable with symmetric PDF like that of a normal random variable. The function is clearly nonlinear and flattens out for large and small values of $P/I \ ratio$. The functional form thus also ensures that the predicted conditional probabilities of a denial lie between $0$ and $1$.
We use `r ttcode("predict()")` to compute the predicted change in the denial probability when $P/I \ ratio$ is increased from $0.3$ to $0.4$.
```{r}
# 1. compute predictions for P/I ratio = 0.3, 0.4
predictions <- predict(denyprobit,
newdata = data.frame("pirat" = c(0.3, 0.4)),
type = "response")
# 2. Compute difference in probabilities
diff(predictions)
```
We find that an increase in the payment-to-income ratio from $0.3$ to $0.4$ is predicted to increase the probability of denial by approximately $6.2\%$.
We continue by using an augmented Probit model to estimate the effect of race on the probability of a mortgage application denial.
```{r}
denyprobit2 <- glm(deny ~ pirat + black,
family = binomial(link = "probit"),
data = HMDA)
coeftest(denyprobit2, vcov. = vcovHC, type = "HC1")
```
The estimated model equation is
\begin{align}
\widehat{P(deny\vert P/I \ ratio, black)} = \Phi (-\underset{(0.18)}{2.26} + \underset{(0.50)}{2.74} P/I \ ratio + \underset{(0.08)}{0.71} black). (\#eq:denyprobit2)
\end{align}
All coefficients are highly significant, and the estimated coefficients on both the payments-to-income ratio and the indicator for African American descent are positive. Again, the coefficients are difficult to interpret but they indicate that, first, African Americans have a higher probability of denial than white applicants, holding constant the payments-to-income ratio, and second, applicants with a high payments-to-income ratio face a higher risk of being rejected.
How big is the estimated difference in denial probabilities between two hypothetical applicants with the same payments-to-income ratio? As before, we may use `r ttcode("predict()")` to compute this difference.
```{r}
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denyprobit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
# 2. compute difference in probabilities
diff(predictions)
```
In this case, the estimated difference in denial probabilities is about $15.8\%$.
### Logit Regression {-}
Key Concept 11.3 summarizes the Logit regression function.
```{r, eval = my_output == "html", results='asis', echo=F, purl=F}
cat('
<div class = "keyconcept" id="KC11.3">
<h3 class = "right"> Key Concept 11.3 </h3>
<h3 class = "left"> Logit Regression </h3>
The population Logit regression function is
\\begin{align*}
P(Y=1\\vert X_1, X_2, \\dots, X_k) =& \\, F(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k) \\\\
=& \\, \\frac{1}{1+e^{-(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)}}.
\\end{align*}
The idea is similar to Probit regression except that a different CDF is used: $$F(x) = \\frac{1}{1+e^{-x}}$$ is the CDF of a standard logistically distributed random variable.
</div>
')
```
```{r, eval = my_output == "latex", results='asis', echo=F, purl=F}
cat('\\begin{keyconcepts}[Logit Regression]{11.3}
The population Logit regression function is
\\begin{align*}
P(Y=1\\vert X_1, X_2, \\dots, X_k) =& \\, F(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k) \\\\
=& \\, \\frac{1}{1+e^{-(\\beta_0 + \\beta_1 X_1 + \\beta_2 X_2 + \\dots + \\beta_k X_k)}}.
\\end{align*}
The idea is similar to Probit regression except that a different CDF is used: $$F(x) = \\frac{1}{1+e^{-x}}$$ is the CDF of a standard logistically distributed random variable.
\\end{keyconcepts}
')
```
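A quick numerical check (not part of the book's code) confirms that `r ttcode("plogis()")`, the logistic CDF implemented in base `r ttcode("R")`, coincides with the formula $F(x) = 1/(1+e^{-x})$ stated in Key Concept 11.3.
```{r}
# the logistic CDF: plogis(x) equals 1 / (1 + exp(-x))
x <- seq(-4, 4, 0.5)
all.equal(plogis(x), 1 / (1 + exp(-x)))
```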
As for Probit regression, there is no simple interpretation of the model coefficients and it is best to consider predicted probabilities or differences in predicted probabilities. Here again, $t$-statistics and confidence intervals based on large sample normal approximations can be computed as usual.
It is fairly easy to estimate a Logit regression model using `r ttcode("R")`.
```{r}
denylogit <- glm(deny ~ pirat,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit, vcov. = vcovHC, type = "HC1")
```
The subsequent code chunk reproduces Figure 11.3 of the book.
<div class="unfolded">
```{r, fig.align='center'}
# plot data
plot(x = HMDA$pirat,
     y = HMDA$deny,
     main = "Probit and Logit Models of the Probability of Denial, Given P/I Ratio",
     xlab = "P/I ratio",
     ylab = "Deny",
     pch = 20,
     ylim = c(-0.4, 1.4),
     cex.main = 0.9)
# add horizontal dashed lines and text
abline(h = 1, lty = 2, col = "darkred")
abline(h = 0, lty = 2, col = "darkred")
text(2.5, 0.9, cex = 0.8, "Mortgage denied")
text(2.5, -0.1, cex= 0.8, "Mortgage approved")
# add estimated regression line of Probit and Logit models
x <- seq(0, 3, 0.01)
y_probit <- predict(denyprobit, list(pirat = x), type = "response")
y_logit <- predict(denylogit, list(pirat = x), type = "response")
lines(x, y_probit, lwd = 1.5, col = "steelblue")
lines(x, y_logit, lwd = 1.5, col = "black", lty = 2)
# add a legend
legend("topleft",
horiz = TRUE,
legend = c("Probit", "Logit"),
col = c("steelblue", "black"),
lty = c(1, 2))
```
</div>
Both models produce very similar estimates of the probability that a mortgage application will be denied depending on the applicant's payment-to-income ratio.
Following the book we extend the simple Logit model of mortgage denial with the additional regressor $black$.
```{r}
# estimate a Logit regression with multiple regressors
denylogit2 <- glm(deny ~ pirat + black,
family = binomial(link = "logit"),
data = HMDA)
coeftest(denylogit2, vcov. = vcovHC, type = "HC1")
```
We obtain
\begin{align}
\widehat{P(deny=1 \vert P/I \ ratio, black)} = F(-\underset{(0.35)}{4.13} + \underset{(0.96)}{5.37} P/I \ ratio + \underset{(0.15)}{1.27} black). (\#eq:denylogit2)
\end{align}
As for the Probit model \@ref(eq:denyprobit2) all model coefficients are highly significant and we obtain positive estimates for the coefficients on $P/I \ ratio$ and $black$. For comparison we compute the predicted probability of denial for two hypothetical applicants that differ in race and have a $P/I \ ratio$ of $0.3$.
```{r}
# 1. compute predictions for P/I ratio = 0.3
predictions <- predict(denylogit2,
newdata = data.frame("black" = c("no", "yes"),
"pirat" = c(0.3, 0.3)),
type = "response")
predictions
# 2. Compute difference in probabilities
diff(predictions)
```
We find that the white applicant faces a denial probability of only $7.5\%$, while the African American applicant is rejected with a probability of $22.4\%$, a difference of $14.9\%$.
#### Comparison of the Models {-}
The Probit model and the Logit model deliver only approximations to the unknown population regression function $E(Y\vert X)$. It is not obvious how to decide which model to use in practice. The linear probability model has the clear drawback of not being able to capture the nonlinear nature of the population regression function and it may predict probabilities to lie outside the interval $[0,1]$. Probit and Logit models are harder to interpret but capture the nonlinearities better than the linear approach: both models produce predictions of probabilities that lie inside the interval $[0,1]$. Predictions of all three models are often close to each other. The book suggests using the method that is easiest to work with in the statistical software of choice. As we have seen, it is equally easy to estimate Probit and Logit models using `r ttcode("R")`. We can therefore give no general recommendation on which method to use.
## Estimation and Inference in the Logit and Probit Models
So far nothing has been said about *how* Logit and Probit models are estimated by statistical software. The reason why this is interesting is that both models are *nonlinear in the parameters* and thus cannot be estimated using OLS. Instead one relies on *maximum likelihood estimation* (MLE). Another approach is estimation by *nonlinear least squares* (NLS).
#### Nonlinear Least Squares {-}
Consider the multiple regression Probit model
\begin{align}
E(Y_i\vert X_{1i}, \dots, X_{ki}) = P(Y_i=1\vert X_{1i}, \dots, X_{ki}) = \Phi(\beta_0 + \beta_1 X_{1i} + \dots + \beta_k X_{ki}). (\#eq:multprobit)
\end{align}
Similarly to OLS, NLS estimates the parameters $\beta_0,\beta_1,\dots,\beta_k$ by minimizing the sum of squared mistakes
$$\sum_{i=1}^n\left[ Y_i - \Phi(b_0 + b_1 X_{1i} + \dots + b_k X_{ki}) \right]^2.$$
NLS estimation is a consistent approach that produces estimates which are normally distributed in large samples. In `r ttcode("R")` there are functions like `r ttcode("nls()")` from package `r ttcode("stats")` which provide algorithms for solving nonlinear least squares problems.
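As a rough illustration (not part of the book's code), the following sketch fits the simple Probit model for `r ttcode("deny")` by NLS using `r ttcode("nls()")`; the starting values are ad-hoc assumptions and convergence is not guaranteed for arbitrary choices.
```{r}
# sketch: NLS estimation of the simple Probit model (starting values are assumptions)
deny_nls <- nls(deny ~ pnorm(b0 + b1 * pirat),
                data = HMDA,
                start = list(b0 = -2, b1 = 2))
summary(deny_nls)
```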
However, NLS is inefficient, meaning that there are estimation techniques with a smaller variance, which is why we will not dwell any further on this topic.
#### Maximum Likelihood Estimation {-}
In MLE we seek to estimate the unknown parameters by choosing them such that the likelihood of drawing the observed sample is maximized. This probability is measured by means of the likelihood function, the joint probability distribution of the data treated as a function of the unknown parameters. Put differently, the maximum likelihood estimates of the unknown parameters are the values that result in a model which is most likely to produce the observed data. It turns out that MLE is more efficient than NLS.
As maximum likelihood estimates are normally distributed in large samples, statistical inference for coefficients in nonlinear models like Logit and Probit regression can be made using the same tools that are used for linear regression models: we can compute $t$-statistics and confidence intervals.
Many software packages use an MLE algorithm for estimation of nonlinear models. The function `r ttcode("glm()")` uses an algorithm named *iteratively reweighted least squares*.
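To make the idea behind MLE concrete, the following sketch (not part of the book's code) maximizes the Probit log-likelihood of the simple model of mortgage denial numerically using `r ttcode("optim()")`; the starting values and optimizer settings are assumptions, and the resulting estimates should be close to those reported by `r ttcode("glm()")`.
```{r}
# sketch: log-likelihood of the simple Probit model deny ~ pirat
probit_loglik <- function(beta, y = HMDA$deny, x = HMDA$pirat) {
  p <- pnorm(beta[1] + beta[2] * x)        # predicted probabilities
  p <- pmin(pmax(p, 1e-12), 1 - 1e-12)     # guard against log(0)
  sum(y * log(p) + (1 - y) * log(1 - p))   # Bernoulli log-likelihood
}
# maximize the log-likelihood numerically (starting values are an assumption)
mle <- optim(par = c(0, 0), fn = probit_loglik, control = list(fnscale = -1))
mle$par          # numerical MLE
coef(denyprobit) # coefficients obtained by glm()
```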
#### Measures of Fit {-}
It is important to be aware that the usual $R^2$ and $\bar{R}^2$ are *invalid* for nonlinear regression models. The reason for this is simple: both measures assume that the relation between the dependent and the explanatory variable(s) is linear. This obviously does not hold for Probit and Logit models. Thus $R^2$ need not lie between $0$ and $1$ and there is no meaningful interpretation. However, statistical software sometimes reports these measures anyway.
There are many measures of fit for nonlinear regression models and there is no consensus which one should be reported. The situation is even more complicated because there is no measure of fit that is generally meaningful. For models with a binary response variable like $deny$ one could use the following rule: \newline If $Y_i = 1$ and $\widehat{P(Y_i|X_{i1}, \dots, X_{ik})} > 0.5$ or if $Y_i = 0$ and $\widehat{P(Y_i|X_{i1}, \dots, X_{ik})} < 0.5$, consider $Y_i$ as correctly predicted. Otherwise $Y_i$ is said to be incorrectly predicted. The measure of fit is the share of correctly predicted observations. The downside of such an approach is that it does not mirror the quality of the prediction: whether $\widehat{P(Y_i = 1|X_{i1}, \dots, X_{ik})} = 0.51$ or $\widehat{P(Y_i = 1|X_{i1}, \dots, X_{ik})} = 0.99$ is not reflected; we just predict $Y_i = 1$.^[This is in contrast to the case of a numeric dependent variable where we use the squared errors for assessment of the quality of the prediction.]
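The following chunk (an illustration added here, not from the book) computes this fraction of correctly predicted outcomes for the augmented Probit model `r ttcode("denyprobit2")` using the $0.5$ threshold described above.
```{r}
# fraction of correctly predicted outcomes for the augmented Probit model
fitted_probs <- predict(denyprobit2, type = "response")
predicted_deny <- as.numeric(fitted_probs > 0.5)  # classify using the 0.5 threshold
mean(predicted_deny == HMDA$deny)                 # share of correct predictions
```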
An alternative to the latter is given by so-called pseudo-$R^2$ measures. In order to measure the quality of the fit, these measures compare the value of the maximized (log-)likelihood of the model with all regressors (the *full model*) to the likelihood of a model with no regressors (*null model*, regression on a constant).
For example, consider a Probit regression. The $\text{pseudo-}R^2$ is given by $$\text{pseudo-}R^2 = 1 - \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}$$ where $f^{max}_j \in [0,1]$ denotes the maximized likelihood for model $j$.
The reasoning behind this is that the maximized likelihood increases as additional regressors are added to the model, similarly to the decrease in $SSR$ when regressors are added in a linear regression model. If the full model has a similar maximized likelihood as the null model, the full model does not really improve upon a model that uses only the information in the dependent variable, so $\text{pseudo-}R^2 \approx 0$. If the full model fits the data very well, the maximized likelihood should be close to $1$ such that $\ln(f^{max}_{full}) \approx 0$ and $\text{pseudo-}R^2 \approx 1$. See Appendix 11.2 of the book for more on MLE and pseudo-$R^2$ measures.
`r ttcode("summary()")` does not report $\text{pseudo-}R^2$ for models estimated by `r ttcode("glm()")` but we can use the entries *residual deviance* (`r ttcode("deviance")`) and *null deviance* (`r ttcode("null.deviance")`) instead. These are computed as
$$\text{deviance} = -2 \times \left[\ln(f^{max}_{full}) - \ln(f^{max}_{saturated}) \right]$$
and
$$\text{null deviance} = -2 \times \left[\ln(f^{max}_{null}) - \ln(f^{max}_{saturated}) \right]$$
where $f^{max}_{saturated}$ is the maximized likelihood for a model which assumes that each observation has its own parameter (there are $n+1$ parameters to be estimated which leads to a perfect fit). For models with a binary dependent variable, it holds that $$\text{pseudo-}R^2 = 1 - \frac{\text{deviance}}{\text{null deviance}} = 1- \frac{\ln(f^{max}_{full})}{\ln(f^{max}_{null})}.$$
We now compute the $\text{pseudo-}R^2$ for the augmented Probit model of mortgage denial.
```{r}
# compute pseudo-R2 for the probit model of mortgage denial
pseudoR2 <- 1 - (denyprobit2$deviance) / (denyprobit2$null.deviance)
pseudoR2
```
Another way to obtain the $\text{pseudo-}R^2$ is to estimate the null model using `r ttcode("glm()")` and extract the maximized log-likelihoods for both the null and the full model using the function `r ttcode("logLik()")`.
```{r}
# compute the null model
denyprobit_null <- glm(formula = deny ~ 1,
family = binomial(link = "probit"),
data = HMDA)
# compute the pseudo-R2 using 'logLik'
1 - logLik(denyprobit2)[1]/logLik(denyprobit_null)[1]
```
## Application to the Boston HMDA Data
Models \@ref(eq:denyprobit2) and \@ref(eq:denylogit2) indicate that denial rates are higher for African American applicants holding constant the payment-to-income ratio. Both results could be subject to omitted variable bias. In order to obtain a more trustworthy estimate of the effect of being black on the probability of a mortgage application denial we estimate a linear probability model as well as several Logit and Probit models. We thereby control for financial variables and additional applicant characteristics which are likely to influence the probability of denial and differ between black and white applicants.
Sample averages as shown in Table 11.1 of the book can be easily reproduced using the functions `r ttcode("mean()")` (as usual for numeric variables) and `r ttcode("prop.table()")` (for factor variables).
```{r}
# Mean P/I ratio
mean(HMDA$pirat)
# housing expense-to-income ratio
mean(HMDA$hirat)
# loan-to-value ratio
mean(HMDA$lvrat)
# consumer credit score
mean(as.numeric(HMDA$chist))
# mortgage credit score
mean(as.numeric(HMDA$mhist))
# public bad credit record
mean(as.numeric(HMDA$phist)-1)
# denied mortgage insurance
prop.table(table(HMDA$insurance))
# self-employed
prop.table(table(HMDA$selfemp))
# single
prop.table(table(HMDA$single))
# high school diploma
prop.table(table(HMDA$hschool))
# unemployment rate
mean(HMDA$unemp)
# condominium
prop.table(table(HMDA$condomin))
# black
prop.table(table(HMDA$black))
# deny
prop.table(table(HMDA$deny))
```
See Chapter 11.4 of the book or use `r ttcode("R")`'s help function for more on variables contained in the `r ttcode("HMDA")` dataset.
Before estimating the models we transform the loan-to-value ratio (`r ttcode("lvrat")`) into a factor variable, where
\begin{align*}
lvrat =
\begin{cases}
\text{low} & \text{if} \ \ lvrat < 0.8, \\
\text{medium} & \text{if} \ \ 0.8 \leq lvrat \leq 0.95, \\
\text{high} & \text{if} \ \ lvrat > 0.95
\end{cases}
\end{align*}
and convert both credit scores to numeric variables.
```{r}
# define low, medium and high loan-to-value ratio
HMDA$lvrat <- factor(
  ifelse(HMDA$lvrat < 0.8, "low",
         ifelse(HMDA$lvrat >= 0.8 & HMDA$lvrat <= 0.95, "medium", "high")),
  levels = c("low", "medium", "high"))
# convert credit scores to numeric
HMDA$mhist <- as.numeric(HMDA$mhist)
HMDA$chist <- as.numeric(HMDA$chist)
```
Next we reproduce the estimation results presented in Table 11.2 of the book.
```{r}
# estimate all 6 models for the denial probability
lpm_HMDA <- lm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp, data = HMDA)
logit_HMDA <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "logit"),
data = HMDA)
probit_HMDA_1 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_2 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_3 <- glm(deny ~ black + pirat + hirat + lvrat + chist + mhist
+ phist + insurance + selfemp + single + hschool + unemp + condomin
+ I(mhist==3) + I(mhist==4) + I(chist==3) + I(chist==4) + I(chist==5)
+ I(chist==6),
family = binomial(link = "probit"),
data = HMDA)
probit_HMDA_4 <- glm(deny ~ black * (pirat + hirat) + lvrat + chist + mhist + phist
+ insurance + selfemp + single + hschool + unemp,
family = binomial(link = "probit"),
data = HMDA)
```
Just as in previous chapters, we store heteroskedasticity-robust standard errors of the coefficient estimators in a `r ttcode("list")` which is then used as the argument `r ttcode("se")` in `r ttcode("stargazer()")`.
```{r, eval=FALSE}
rob_se <- list(sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1"))))
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1,
probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
digits = 3,
type = "latex",
header = FALSE,
se = rob_se,
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)"))
```
<!--html_preserve-->
```{r hmdad, message=F, warning=F, results='asis', echo=F, eval=my_output == "html"}
library(stargazer)
rob_se <- list(
sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1")))
)
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1, probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
digits = 3,
type = "html",
se = rob_se,
header = FALSE,
dep.var.caption = "Dependent Variable: Mortgage Application Denial",
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)")
)
stargazer_html_title("HMDA Data: LPM, Probit and Logit Models", "hmdad")
```
<!--/html_preserve-->
```{r, message=F, warning=F, results='asis', echo=F, eval=my_output == "latex"}
library(stargazer)
rob_se <- list(
sqrt(diag(vcovHC(lpm_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(logit_HMDA, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_1, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_2, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_3, type = "HC1"))),
sqrt(diag(vcovHC(probit_HMDA_4, type = "HC1")))
)
stargazer(lpm_HMDA, logit_HMDA, probit_HMDA_1, probit_HMDA_2, probit_HMDA_3, probit_HMDA_4,
title = "\\label{tab:hmdad} HMDA Data: LPM, Probit and Logit Models",
digits = 3,
type = "latex",
float.env = "sidewaystable",
column.sep.width = "-5pt",
no.space = T,
single.row = T,
header = FALSE,
se = rob_se,
model.numbers = FALSE,
column.labels = c("(1)", "(2)", "(3)", "(4)", "(5)", "(6)")
)
```
In Table \@ref(tab:hmdad), models (1), (2) and (3) are baseline specifications that include several financial control variables. They differ only in the way they model the denial probability. Model (1) is a linear probability model, model (2) is a Logit regression and model (3) uses the Probit approach.
In the linear model (1), the coefficients have direct interpretation. For example, an increase in the consumer credit score by $1$ unit is estimated to increase the probability of a loan denial by about $0.031$, that is, $3.1\%$. Having a high loan-to-value ratio is detrimental to credit approval: the coefficient for a loan-to-value ratio higher than $0.95$ is $0.189$, so clients with this property are estimated to face an almost $19\%$ higher risk of denial than those with a low loan-to-value ratio, ceteris paribus. The estimated coefficient on the race dummy is $0.084$, which indicates the denial probability for African Americans is $8.4\%$ higher than for white applicants with the same characteristics except for race. Apart from the housing expense-to-income ratio and the mortgage credit score, all coefficients are significant.
Models (2) and (3) provide similar evidence that there is racial discrimination in the U.S. mortgage market. All coefficients except for the housing expense-to-income ratio (which is not significantly different from zero) are significant at the $1\%$ level. As discussed above, the nonlinearity makes the interpretation of the coefficient estimates more difficult than for model (1). In order to make a statement about the effect of being black, we need to compute the estimated denial probability for two individuals that differ only in race. For the comparison we consider two individuals that share mean values for all numeric regressors. For the qualitative variables we assign the property that is most representative of the data at hand. For example, consider self-employment: we have seen that about $88\%$ of all individuals in the sample are not self-employed, so we set `r ttcode("selfemp = no")`. Using this approach, the estimated effect of being African American on the denial probability in the Logit model (2) is about $4\%$. The next code chunk shows how to apply this approach for models (1) to (6) using `r ttcode("R")`.
```{r}
# compute regressor values for an average black person
new <- data.frame(
  "pirat" = mean(HMDA$pirat),
  "hirat" = mean(HMDA$hirat),
  "lvrat" = "low",
  "chist" = mean(HMDA$chist),
  "mhist" = mean(HMDA$mhist),
  "phist" = "no",
  "insurance" = "no",
  "selfemp" = "no",
  "black" = c("no", "yes"),
  "single" = "no",
  "hschool" = "yes",
  "unemp" = mean(HMDA$unemp),
  "condomin" = "no")
# difference predicted by the LPM
predictions <- predict(lpm_HMDA, newdata = new)
diff(predictions)
# difference predicted by the logit model
predictions <- predict(logit_HMDA, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (3)
predictions <- predict(probit_HMDA_1, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (4)
predictions <- predict(probit_HMDA_2, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (5)
predictions <- predict(probit_HMDA_3, newdata = new, type = "response")
diff(predictions)
# difference predicted by probit model (6)
predictions <- predict(probit_HMDA_4, newdata = new, type = "response")
diff(predictions)
```
The estimates of the impact on the denial probability of being black are similar for models (2) and (3). It is interesting that the magnitude of the estimated effects is much smaller than for the Probit and Logit models that do not control for financial characteristics (see Chapter \@ref(palr)). This indicates that these simple models produce biased estimates due to omitted variables.
Regressions (4) to (6) use regression specifications that include different applicant characteristics and credit rating indicator variables as well as interactions. However, most of the corresponding coefficients are not significant and the estimates of the coefficient on `r ttcode("black")` obtained for these models as well as the estimated difference in denial probabilities do not differ much from those obtained for the similar specifications (2) and (3).
An interesting question related to racial discrimination can be investigated using the Probit model (6), where the interactions `r ttcode("blackyes:pirat")` and `r ttcode("blackyes:hirat")` are added to model (4). If the coefficient on `r ttcode("blackyes:pirat")` was different from zero, the effect of the payment-to-income ratio on the denial probability would be different for black and white applicants. Similarly, a non-zero coefficient on `r ttcode("blackyes:hirat")` would indicate that loan officers weight the risk of bankruptcy associated with a high housing expense-to-income ratio differently for black and white mortgage applicants. We can test whether these coefficients are jointly significant at the $5\%$ level using an $F$-test.
```{r}
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
```
Since $p\text{-value} \approx 0.77$ for this test, the null cannot be rejected. Nonetheless, we can reject the hypothesis that there is no racial discrimination at all since the corresponding $F$-test has a $p\text{-value}$ of about $0.002$.
```{r}
linearHypothesis(probit_HMDA_4,
test = "F",
c("blackyes=0", "blackyes:pirat=0", "blackyes:hirat=0"),
vcov = vcovHC, type = "HC1")
```
#### Summary {-}
Models (1) to (6) provide evidence that there is an effect of being African American on the probability of a mortgage application denial: in all specifications, the effect is estimated to be positive (ranging from $4\%$ to $5\%$) and is significantly different from zero at the $1\%$ level. While the linear probability model seems to slightly overestimate this effect, it still can be used as an approximation to an intrinsically nonlinear relationship.
See Chapters 11.4 and 11.5 of the book for a discussion of external and internal validity of this study and some concluding remarks on regression models where the dependent variable is binary.
## Exercises
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 1. Titanic Survival Data {-}
Chapter \\@ref(palr) presented three approaches to model the conditional expectation function of a binary dependent variable: the linear probability model as well as Probit and Logit regression.
The exercises in this Chapter use data on the fate of the passengers of the ocean liner *Titanic*. We aim to explain survival, a binary variable, by socioeconomic variables using the above approaches.
In this exercise we start with the aggregated data set <tt>Titanic</tt>. It is part of the package <tt>datasets</tt> which is part of base <tt>R</tt>. The following quote from the [description](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/Titanic.html) of the dataset motivates the attempt to predict the *probability* of survival:
*The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts — from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in the third class — are reflected in the survival rates for various classes of passenger.*
**Instructions:**
+ Assign the <tt>Titanic</tt> data to <tt>Titanic_1</tt> and get an overview.
+ Visualize the conditional survival rates for travel class (<tt>Class</tt>), gender (<tt>Sex</tt>) and age (<tt>Age</tt>) using <tt>mosaicplot()</tt>.
<iframe src="DCL/ex11_1.html" frameborder="0" scrolling="no" style="width:100%;height:360px"></iframe>
</div>') } else {
cat("\\begin{center}\\textit{This interactive part of the book is only available in the HTML version.}\\end{center}")
}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 2. Titanic Survival Data --- Ctd. {-}
The <tt>Titanic</tt> data set from Exercise 1 is not useful for regression analysis because it is highly aggregated. In this exercise you will work with <tt>titanic.csv</tt> which is available under the URL https://stanford.io/2O9RUCF.
The columns of <tt>titanic.csv</tt> contain the following variables:
<tt>Survived</tt> --- The survived indicator
<tt>Pclass</tt> --- passenger class
<tt>Name</tt> --- passenger\'s Name
<tt>Sex</tt> --- passenger\'s gender
<tt>Age</tt> --- passengers\'s age
<tt>Siblings</tt> --- number of siblings aboard
<tt>Parents.Children.Aboard</tt> --- number of parents and children aboard
<tt>fare</tt> --- the fare paid in British pounds
**Instructions:**
+ Import the data from <tt>titanic.csv</tt> using the function <tt>read.csv2()</tt>. Save it to <tt>Titanic_2</tt>.
+ Assign the following column names to <tt>Titanic_2</tt>:
<tt>Survived, Class, Name, Sex, Age, Siblings, Parents</tt> and <tt>Fare</tt>.
+ Get an overview over the data set. Drop the column <tt>Name</tt>.
+ Attach the packages <tt>corrplot</tt> and <tt>dplyr</tt>. Check whether there is multicollinearity in the data using <tt>corrplot()</tt>.
<iframe src="DCL/ex11_2.html" frameborder="0" scrolling="no" style="width:100%;height:560px"></iframe>
**Hints:**
+ <tt>read_csv()</tt> guesses the column specification as well as the separators used in the <tt>.csv</tt> file. You should always check if the result is correct.
+ You may use <tt>select_if()</tt> from the <tt>dplyr</tt> package to select all numeric columns from the data set.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 3. Titanic Survival Data --- Survival Rates {-}
Contingency tables similar to those provided by the data set <tt>Titanic</tt> from Exercise 1 may shed some light on the distribution of survival conditional on possible determinants thereof, e.g., the passenger class. Contingency tables are easily created using the base <tt>R</tt> function <tt>table()</tt>.
**Instructions:**
+ Generate a contingency table for <tt>Survived</tt> and <tt>Class</tt> using <tt>table()</tt>. Save the table to <tt>t_abs</tt>.
+ <tt>t_abs</tt> reports absolute frequencies. Transform <tt>t_abs</tt> into a table which reports relative frequencies (relative to the total number of observations). Save the result to <tt>t_rel</tt>.
+ Visualize the relative frequencies in <tt>t_rel</tt> using <tt>barplot()</tt>. Use different colors to better distinguish between the survival and non-survival rates (it does not matter which colors you use).
<iframe src="DCL/ex11_4_3.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 4. Titanic Survival Data --- Conditional Distributions of <tt>Age</tt> {-}
Contingency tables are useful for summarizing the distribution of categorical variables like <tt>Survived</tt> and <tt>Class</tt> in Exercise 3. They are, however, not useful when the variable of interest takes on many different integer values (and they are even impossible to generate when the variable is continuous).
In this exercise you are asked to generate and visualize density estimates of the distribution of <tt>Age</tt> conditional on <tt>Survived</tt> to see whether there are indications of how age relates to the chance of survival (although the data set reports integers, we treat <tt>Age</tt> as a continuous variable here). For example, it is interesting to see if the \'women and children first\' policy was effective.
The data set <tt>Titanic_2</tt> from the previous exercises is available in your working environment.
**Instructions:**
+ Obtain kernel density estimates of the distributions of <tt>Age</tt> for both the survivors and the deceased.
+ Save the results to <tt>dens_age_surv</tt> (survived) and <tt>dens_age_died</tt> (died).
+ Plot both kernel density estimates (overlay them in a single plot!). Use different colors of your choice to make the estimates distinguishable.
<iframe src="DCL/ex11_5_4.html" frameborder="0" scrolling="no" style="width:100%;height:330px"></iframe>
**Hints:**
+ Kernel density estimates can be obtained using the function <tt>density()</tt>.
+ Use <tt>plot()</tt> and <tt>lines()</tt> to plot the density estimates.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 5. Titanic Survival Data --- A Linear Probability Model for <tt>Survival</tt> I {-}
How do socio-economic characteristics of the passengers impact the probability of survival? In particular, are there systematic differences between the three passenger classes? Do the data reflect the \'children and women first\' policy?
It is natural to start the analysis by estimating a simple linear probability model (LPM) like $$Survived_i = \\beta_0 + \\beta_1 Class2_i + \\beta_2 Class3_i + u_i$$ with dummy variables $Class2_i$ and $Class3_i$.
The data set <tt>Titanic_2</tt> from the previous exercises is available in your working environment.
**Instructions:**
+ Attach the <tt>AER</tt> package.
+ <tt>Class</tt> is of type <tt>int</tt> (integer). Convert <tt>Class</tt> to a factor variable.
+ Estimate the linear probability model and save the result to <tt>surv_mod</tt>.
+ Obtain a robust summary of the model coefficients.
+ Use <tt>surv_mod</tt> to predict the probability of survival for the three passenger classes.
<iframe src="DCL/ex11_lpm.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
**Hints:**
+ Linear probability models can be estimated using <tt>lm()</tt>.
+ Use <tt>predict()</tt> to obtain the predictions. Remember that a <tt>data.frame</tt> must be provided to the argument <tt>newdata</tt>.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 6. Titanic Survival Data --- A Linear Probability Model for <tt>Survival</tt> II {-}
Consider again the outcome from Exercise 5:
$$\\widehat{Survived}_i = \\underset{(0.03)}{0.63} - \\underset{(0.05)}{0.16} Class2_i - \\underset{(0.04)}{0.39} Class3_i $$
(The estimated coefficients in this model are related to the class specific sample means of <tt>Survived</tt>. You are asked to compute them below.)
The highly significant coefficients indicate that the probability of survival decreases with the passenger class, that is, passengers from a less luxurious class are less likely to survive.
This result could be affected by omitted variable bias arising from correlation of the passenger class with determinants of the probability of survival not included in the model. We therefore augment the model such that it includes all remaining variables as regressors.
The data set <tt>Titanic_2</tt> as well as the model <tt>surv_mod</tt> from the previous exercises are available in your working environment. The <tt>AER</tt> package is attached.
**Instructions:**
+ Use the model object <tt>surv_mod</tt> to obtain the class specific estimates for the probability of survival. Store them in <tt>surv_prob_c1</tt>, <tt>surv_prob_c2</tt> and <tt>surv_prob_c3</tt>.
+ Fit the augmented LPM and assign the result to the object <tt>LPM_mod</tt>.
+ Obtain a robust summary of the model coefficients.
<iframe src="DCL/ex11_lpm2.html" frameborder="0" scrolling="no" style="width:100%;height:320px"></iframe>
**Hint:**
+ Remember that the formula <tt>a ~ .</tt> specifies a regression of <tt>a</tt> on all other variables in the data set provided as the argument <tt>data</tt> in <tt>glm()</tt>.
</div>')}
```
```{r, echo=F, purl=F, results='asis'}
if (my_output=="html") {
cat('
<div class = "DCexercise">
#### 7. Titanic Survival Data --- Logistic Regression {-}