-
Notifications
You must be signed in to change notification settings - Fork 0
/
final_project.Rmd
188 lines (141 loc) · 12.7 KB
/
final_project.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
---
title: "R Notebook"
output: html_notebook
---
**Part 1: Econometrics**
Maximum possible points: 14
\
The libraries you will need for the project are the following:
```{r include = FALSE}
library(tidyverse) # ggplot(), %>%, mutate(), and friends
library(scales) # Format numbers with functions like comma(), percent(), and dollar()
library(broom) # Convert models to data frames
library(stargazer) # Makes nice tables
library(lmtest) # To test linear regression models
library(AER) # Applied econometrics in R
library(rdrobust) # For robust nonparametric regression discontinuity
library(rddensity) # For nonparametric regression discontinuity density tests
library(dplyr)
library(data.table)
# Make all figures use specific dimensions by default
knitr::opts_chunk$set(fig.align = "center", retina = 2,
fig.width = 7, fig.height = 4.2)
```
\
You will use each of the four econometric approaches seen in class for estimating causal effects to measure the effect of HISP on household health expenditures. **Don't worry about conducting in-depth baseline checks and robustness checks.** The purpose is not to have an in-depth understanding of the variables of the database, but to be able to use several methods and to explain the results. Also remember that, when reporting the solutions, a numeric answer is not enough! We highly recommend to provide an **intuitive argument** to your answers.
\
**OLS Regression**
\
**Preliminary step A:** Create a dataset based on `HISP` that only includes observations from after the intervention (`round == "After"`). Note that you will use this database for all the **OLS and IV** exercises.
```{r include = FALSE}
HISP_after <- read.csv("HISP.csv") %>% filter(round == "After")
```
\
**1.** Build a regression model that estimates the effect of HISP enrollment on health expenditures. You'll need to use the `enrolled_rp` variable instead of `enrolled`, since we're measuring enrollment after the promotion campaign. Report the regression results in a table. What does this model tell us about the effect of enrolling in HISP?
```{r}
model1 <- lm(health_expenditures ~ enrolled_rp, data = HISP_after) #linear model m1
stargazer(model1, type = "text") #to check what my model1 gives
```
There is a significant negative effect of enrollment on health expenditures. It means that when households are in HISP following random promotion, they spend less money on health expenditures.\
**2.** Build the same regression model but now include the following control variables: `age_hh`, `age_sp`, `educ_hh`, `educ_sp`, `female_hh`, `indigenous`, `hhsize`,`dirtfloor`, `bathroom`, `land`, `hospital_distance`, `park`, and `sports`. Report the results in a table.
```{r}
model2 <- lm(health_expenditures ~ enrolled_rp + age_hh + age_sp + educ_hh + educ_sp + female_hh + indigenous + hhsize + dirtfloor + bathroom + land + hospital_distance + park + sports, data = HISP_after)
stargazer(model2, type = "text")
```
1. Compared to the model in question **1**, is the coefficient of enrollment underestimated or overestimated in Q1?\
When controlling for other variables, enrolled_rp is lower in the second model (`-9.815` vs `-12.708`) so it is overestimated in Q1.
2. Interpret the coefficient of `age_hh`
It means that if the "head of the household" is one year older, the out of pocket health expenditures per person per year increases by `0.074` dollars.
3. What is the estimated effect in health expenditures for a household with a private bathroom, holding all other variables constant?
Holding all over variables constant, the household' health_expenditures increases by `0.687` dollars.
\
**Instrumental Variables Regression**
*Remark*: For the IV exercises, we will use the database created in **Preliminary Step A**, that only includes observations from after the intervention (`round == "After"`).
\
Consider the following model:
\
health_expenditures $=\beta_{0}+\beta_{1}$ enrolled_rp $+\beta_{2}$ age_hh $+\beta_{3}$ age_sp $+\beta_{4}$ educ_hh $+\beta_{5}$ educ_sp $+\beta_{6}$ female_hh $+$ $\beta_{7}$ indigenous $+\beta_{8}$ hhsize $+\beta_{9}$ dirtfloor $+\beta_{10}$ bathroom $+\beta_{11}$ land $+\beta_{12}$ hospital_distance $+\beta_{13}$ park $+\beta_{14}$ sports
\
**3.** Is there a possible endogeneity in one of the variables of the model? If so, why?
Yes, the variable sport could be endogeneous with health_expenditures. Depending on
\
**4.** If there is endogeneity in one of the variables, find a possible instrument in the data and discuss if it is suitable to correct it.
\
**5.** Run a 2SLS regression model using the previous instrument. Report the regression results in a table. After removing the endogeneity issue, what is the causal effect of enrollment in the HISP?
\
## **Difference-in-Differences Regression**
We can estimate the causal effect using a difference-in-difference approach. We have data indicating if households were enrolled in the program (`enrolled`) and data indicating if they were surveyed before or after the intervention (`round`), which means we can find the differences between enrolled/not enrolled before and after the program.
*Remark* Since we do not have enough data before the program started, we assume that these two groups share parallel trends before the treatment. Hence, it is fine to perform a Diff-in-Diff estimation.
\
**Preliminary step B:** Make a new dataset based on `HISP` that only includes observations from the localities that were randomly chosen for treatment (`treatment_locality == "Treatment"`). *Remark*: Use this dataset for the **Diff-in-Diff and RDD** exercises.
```{r}
HISP_treatment <- read.csv("HISP.csv") %>% filter(treatment_locality == "Treatment")
distinct(HISP_treatment,treatment_locality) #just to check
distinct(HISP_treatment,round)
```
\
**6.** Obtain the average of `health_expenditures`, `age_hh`, `age_sp`, `educ_hh`, `educ_sp`, `female_hh`, `indigenous`, `hhsize`,`dirtfloor`, `bathroom`, `land`,`hospital_distance`, `park`,`sports` for every time period in both the treatment and control groups (i.e. enrolled and not enrolled), and report them in a table. Analyze the differences in the variables for both treatment and control groups.
```{r}
HISP_treatment <- HISP_treatment %>%
mutate(enrolled_dummy = ifelse(enrolled == "Enrolled", 1, 0), round_dummy = ifelse(round == "After", 1, 0))
#if enrolled, enrolled_dummy = 1; if round=after, round_dummy=1
```
```{r}
HISP_treatment %>%
group_by(round,enrolled) %>%
summarise(mean(health_expenditures), mean(age_hh), mean(age_sp), mean(educ_hh), mean(educ_sp), mean(female_hh), mean(indigenous), mean(hhsize), mean(dirtfloor), mean(bathroom), mean(land), mean(land), mean(hospital_distance), mean(park), mean(sports))
```
There appears to be little changes in the variables except for the `health_expenditures` variable and `age`. Concerning `age`, people not enrolled in the HISP program are older, which could be explained by the fact that, the older households are the more money they usually have, until retirement. (https://www.federalreserve.gov/publications/files/scf20.pdf, page 7).
However, we can notice that being enrolled in the HISP program has a significant impact on health expenditures, and even more after intervention. Indeed, people not enrolled in the HISP program after intervention have about three times the health expenditures of not enrolled households (`22.304911` vs `7.840179`)
\
**7.** Run a regression model that estimates the difference-in-difference effect of being enrolled in the HISP program. Report the regression results in a table. What is the causal effect of HISP on health expenditures?
```{r}
modeldif <- lm(health_expenditures ~ round_dummy + enrolled_dummy + round_dummy*enrolled_dummy, data = HISP_treatment)
stargazer(modeldif, type = "text")
```
The difference-in-difference estimate is `-8.1629` (enrolled_dummy:round_dummy’ row). It means that the enrolling in the HISP has a negative impact on health expenditure: the health expenditures decreased by `8,163%`
\
```{r}
HISP_treatment$enrolled <- as.factor(HISP_treatment$enrolled)
HISP_treatment$round <- as.factor(HISP_treatment$round)
ggplot(HISP_treatment, mapping = aes(x = round, y = health_expenditures, color = enrolled)) +
geom_point(size = 2, alpha = 0.4)
```
```{r}
cdplot(enrolled ~ health_expenditures , data=HISP_treatment)
```
```{r}
cdplot(round ~ health_expenditures , data=HISP_treatment)
```
**8.** Run a second model that estimates the difference-in-difference effect, but control for the following variables: `age_hh`, `age_sp`, `educ_hh`, `educ_sp`, `female_hh`, `indigenous`, `hhsize`,`dirtfloor`, `bathroom`, `land`,`hospital_distance`, `park`, `sports`. Report the regression results in a table. How does the causal effect change?
```{r}
modeldif2 <- lm(health_expenditures ~ enrolled_dummy + round_dummy + enrolled_dummy * round_dummy + age_hh + age_sp + educ_hh + educ_sp + female_hh + indigenous + hhsize + dirtfloor + bathroom + land + hospital_distance + park + sports, data = HISP_treatment)
summary(modeldif2)
```
By controlling for other variables, we see no notable change in `health_expenditures` - it is still about -`8.16x` - nor in the other variables.
\
**Summary**
\
**9.** Summarize the results from the three methods (OLS, IV and Diff-in-Diff) in the following table, and discuss the advantages and disadvantages of one method of your choice.
```{r}
Method = c('OLS', 'OLS', 'IV', 'Diff-in-Diff', 'Diff-in-Diff')
Including_control_variables = c("No","Yes","Yes","No","Yes")
Estimate = c(model1$coefficients[2],model2$coefficients[2],"0",modeldif$coefficients[4],modeldif2$coefficients[17])
cbind.data.frame(Method,Including_control_variables,Estimate)
```
The diff-in-diff method is useful because it enables to estimate the impact of the enrollment in the HISP program as if it was planned as an experiment (control and treatment groups), doing ex-post analysis on available data.
\
**Regression Discontinuity Design**
Eligibility for the HISP is determined by income. Households that have an income of less than 58 on a standardized 1-100 scale (`poverty_index`) qualify for the program and are automatically enrolled. Because we have an arbitrary cutoff in a running variable, we can use regression discontinuity to measure the effect of the program on health expenditures.
\
**10.** Why choosing the bandwidth is important in regression discontinuity? What are the pros and cons of choosing a small/large bandwidth
RD is about comparing two groups that are very similar except for the treatment because the treatment depends discontinuously on the cutoff (`poverty_index` here) (Green et al., 2009). Therefore the bandwidth must have some proximity to the cutoff (local strategy), so that the differences observed at each group are attributable to the treatment and not to a difference in the characteristics of the treated groups. For instance, we can assume there is little to no difference of characteristics between households who have a `poverty_index` of 58 and households who have a `poverty_index` of 57 even if the latter are not eligible for the HISP program. The `health_expenditures` comparison on either side of 57 has high internal validity.
A smaller bandwidth facilitate local linear regression but it may generate estimates that are too uncertain to be useful.
There is a tradeoff to find between choosing a smaller bandwidth which allows to reduce bias (less data) but eventually omits valuable cases, and a larger bandwidth which allows the increase precision (there is more data) but will eventually take outliers: the results could be explained not only by the RD but by the characteristics of the households.
Literature gives insights about the optimal bandwidth choice (Imbens et al., 2009)
\
**11.** Suppose that you are including all households with poverty_index in between 53 and 63 (bandwidth=5). Before running an equation, what would you check to make sure that the regression discontinuity method is valid in this case, and why?
We could start by choosing a wider bandwidth=10 for instance, decreasing the bandwidth by iteration, and check for the distribution of predetermined characteristics of households by plotting them on a graph: they should be identical on either side of the cutoff, the smaller the bandwidth we chose. We should have a visual evidence of a "jump" at the cuttoff score (57) and test its statistical significance. We can also try to place placebo cutoffs and see if there is still a jump.
The plotting can be done by using bins.
We should check if the bandwidth=5 is the bandwidth that minimizes the mean square error between actual and estimated treatment effects because the bias-variance tradeoff is captured in it the MSE of the estimator. (see https://www.mattblackwell.org/files/teaching/s11-rdd-handout.pdf, slide 52)
\