# Heteroskedasticity and Robust Regression
## Prerequisites
`r rpkg("VGAM")` is needed for the Laplace distribution; the other packages are used for data manipulation, plotting, and model fitting in this chapter.
```{r message=FALSE}
library("VGAM")
library("tidyverse")
library("rstan")
library("loo")
```
## Linear Regression with Student-t Distributed Errors
Like OLS, Bayesian linear regression with normally distributed errors is sensitive to outliers, because the normal distribution has thin tails.
The following plots compare the normal, Double Exponential (Laplace), and Student-t (df = 4) distributions, all with mean 0 and scale 1, in terms of the surprise ($-\log(p)$) at each point; higher surprise corresponds to a lower log-likelihood.
Both the Student-t and Double Exponential distributions have surprise values well below the normal in the tails of the range $(-6, 6)$.[^tailareas]
This means that outliers impose less of a penalty on the log-posterior in models using these distributions, and the regression line needs to move less to accommodate those observations, since the error distribution does not consider them unusual.
```{r}
z <- seq(-6, 6, length.out = 100)
bind_rows(
  tibble(z = z,
         p = dnorm(z, 0, 1),
         distr = "Normal"),
  tibble(z = z,
         p = dt(z, 4),
         distr = "Student-t (df = 4)"),
  tibble(z = z,
         p = VGAM::dlaplace(z, 0, 1),
         distr = "Double Exponential")) %>%
  mutate(`-log(p)` = -log(p)) %>%
  ggplot(aes(x = z, y = `-log(p)`, colour = distr)) +
  geom_line()
```
The same comparison on the density scale shows the heavier tails directly:
```{r}
z <- seq(-6, 6, length.out = 100)
bind_rows(
  tibble(z = z,
         p = dnorm(z, 0, 1),
         distr = "Normal"),
  tibble(z = z,
         p = dt(z, 4),
         distr = "Student-t (df = 4)"),
  tibble(z = z,
         p = VGAM::dlaplace(z, 0, 1),
         distr = "Double Exponential")) %>%
  ggplot(aes(x = z, y = p, colour = distr)) +
  geom_line()
```
[^tailareas]: The Double Exponential distribution still has thinner tails than the Student-t at more extreme values.
```{r include=FALSE}
mod_t <- stan_model("stan/lm_student_t.stan")
```
```{r cache=FALSE}
mod_t
```
```{r}
unionization <- read_tsv("data/western1995/unionization.tsv",
  col_types = cols(
    country = col_character(),
    union_density = col_double(),
    left_government = col_double(),
    labor_force_size = col_number(),
    econ_conc = col_double()
  ))
mod_data <- preprocess_lm(union_density ~ left_government + log(labor_force_size) + econ_conc,
                          data = unionization)
mod_data <- within(mod_data, {
  b_loc <- 0
  b_scale <- 1000
  sigma_scale <- sd(y)
})
```
The `max_treedepth` control parameter needed to be increased because some runs were hitting the maximum tree depth.
This is likely due to the heavy tails of the Student-t distribution.
```{r}
mod_t_fit <- sampling(mod_t, data = mod_data, control = list(max_treedepth = 11))
```
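To verify this, we can count how many post-warmup iterations saturated the tree depth; a sketch using `get_sampler_params()` from rstan (the `treedepth__` column records each iteration's tree depth):
```{r}
sampler_params <- get_sampler_params(mod_t_fit, inc_warmup = FALSE)
# Iterations (summed across chains) that hit the max_treedepth limit of 11
sum(sapply(sampler_params, function(x) sum(x[, "treedepth__"] >= 11)))
```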
```{r}
summary(mod_t_fit, pars = c("b"))$summary
```
Compare those results to those from a model with normally distributed errors:
```{r include=FALSE}
mod_normal <- stan_model("stan/lm.stan")
```
```{r cache=FALSE}
mod_normal
```
```{r}
mod_normal_fit <- sampling(mod_normal, data = mod_data)
```
```{r}
summary(mod_normal_fit, pars = c("b"))$summary
```
### Double Exponential (Laplace) Errors
An alternative form of "robust" regression is to use the Double Exponential (Laplace) distribution for the errors.
This is equivalent to median regression (least absolute deviations), in which the regression line estimates the conditional median (50th percentile) of the response rather than its mean.
```{r include=FALSE}
mod_dbl_exp <- stan_model("stan/lm_dbl_exp.stan")
```
```{r cache=FALSE}
mod_dbl_exp
```
```{r include=FALSE}
mod_dbl_exp_fit <- sampling(mod_dbl_exp, data = mod_data)
```
```{r}
summary(mod_dbl_exp_fit, pars = c("b"))$summary
```
Model comparison using PSIS-LOO:
```{r}
loo_normal <- loo(extract_log_lik(mod_normal_fit, "log_lik"))
loo_t <- loo(extract_log_lik(mod_t_fit, "log_lik"))
loo_dbl_exp <- loo(extract_log_lik(mod_dbl_exp_fit, "log_lik"))
```
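The three models can then be compared on expected log predictive density; a minimal sketch, assuming loo version 2.0 or later (earlier versions use `compare()` instead):
```{r}
# Models are listed relative to the one with the highest elpd.
loo_compare(loo_normal, loo_t, loo_dbl_exp)
```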
## Heteroskedasticity
In applied regression, heteroskedasticity-consistent (HC) or "robust" standard errors are often used.
However, there is no direct translation of HC standard errors into a regression model in a Bayesian setting. The sandwich method of estimating HC errors uses the same point estimates for the regression coefficients as OLS, but estimates the standard errors of those coefficients in a second stage from the OLS residuals.
Disregarding differences between frequentist and Bayesian inference, a direct translation of that method could not be fully Bayesian, since the coefficients and standard errors would not be estimated jointly.
In a linear normal regression model with heteroskedasticity, each observation has its own scale parameter, $\sigma_i$,
$$
\begin{aligned}[t]
y_i &\sim \dnorm(X \beta, \sigma_i) .
\end{aligned}
$$
Without proper priors, this model is not identified, meaning that the posterior distribution is improper.
To estimate this model, we have to apply some model to the scale terms, $\sigma_i$.
In fact, homoskedasticity can be viewed as the simplest such model: it assumes that all $\sigma_i = \sigma$.
A more general model of $\sigma_i$ should encode whatever information the analyst has about the scale terms.
This can be a distribution over the scale terms, or a function of covariates that explains why observations differ in scale.
### Covariates
A simple model of heteroskedasticity arises when the observations can be split into groups. Suppose the observations are partitioned into groups $k = 1, \dots, K$, and let $k[i]$ be the group of observation $i$; then
$$
\sigma_i = \sigma_{k[i]}
$$
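As a minimal sketch with hypothetical parameter values, data from this grouped model could be simulated as:
```{r}
# Each observation inherits the scale of its group: sigma_i = sigma_{k[i]}.
set.seed(1234)
K <- 3
sigma_k <- c(0.5, 1, 2)                 # hypothetical group scales
k <- sample(seq_len(K), 100, replace = TRUE)
x <- rnorm(100)
y <- 1 + 2 * x + rnorm(100, sd = sigma_k[k])
```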
Another choice would be to model the scale term with a regression model, for example,
$$
\log(\sigma_i) \sim \dnorm(X \gamma, \tau)
$$
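A similar sketch, again with hypothetical coefficients, for the log-linear scale model (omitting the $\tau$ noise term for simplicity):
```{r}
# log(sigma_i) is a linear function of the covariate.
set.seed(5678)
x <- rnorm(100)
sigma_i <- exp(-0.5 + 0.75 * x)         # hypothetical gamma values
y <- 1 + 2 * x + rnorm(100, sd = sigma_i)
```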
### Student-t Error
The Student-t distribution of error terms from the [Robust Regression] chapter is also a model of heteroskedasticity.
A reparameterization that will be used quite often is to rewrite a normal distribution with unequal scale parameters as the product of a common global scale parameter ($\sigma$) and observation-specific local scale parameters, $\lambda_i$,[^globalmixture]
$$
y_i \sim \dnorm(X\beta, \lambda_i \sigma) .
$$
If the local variance parameters have an inverse-gamma distribution,
$$
\lambda_i^2 \sim \dinvgamma(\nu / 2, \nu / 2)
$$
then the above is equivalent to a regression with Student-t distributed errors with $\nu$ degrees of freedom,
$$
y_i \sim \dt{\nu}(X \beta, \sigma) .
$$
[^globalmixture]: See [this](http://www.sumsar.net/blog/2013/12/t-as-a-mixture-of-normals/) for a visualization of a Student-t distribution as a mixture of normal distributions, and [this](https://www.johndcook.com/t_normal_mixture.pdf) for a derivation of the Student-t distribution as a mixture of normal distributions. This scale mixture of normals representation will also be used with shrinkage priors on the regression coefficients.
**Example:** Simulate a Student-t distribution with $\nu$ degrees of freedom as a scale mixture of normals. For $s \in 1, \dots, S$,

1. Simulate $z_s \sim \dgamma(\nu / 2, \nu / 2)$.
2. Simulate $x_s \sim \dnorm(0, 1 / \sqrt{z_s})$; then $x_s$ is a draw from $\dt{\nu}(0, 1)$.

When using R, ensure that you are using the correct parameterization of the gamma distribution. **Left to reader**
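A minimal sketch of one solution, useful for checking your answer (note that `rgamma()` takes `shape` and `rate` arguments, matching the parameterization above):
```{r}
# Simulate Student-t draws as a scale mixture of normals.
nu <- 4
S <- 1e5
z <- rgamma(S, shape = nu / 2, rate = nu / 2)
x <- rnorm(S) / sqrt(z)
# Compare against direct draws from the t distribution.
qqplot(rt(S, df = nu), x, xlab = "rt draws", ylab = "scale-mixture draws")
abline(0, 1)
```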
## References
### Robust regression
- See @GelmanHill2007a [sec 6.6], @BDA3 [ch 17]
- @Stan2016a [Sec 8.4] for the Stan example using a Student-t distribution
### Heteroskedasticity
- @BDA3 [Sec. 14.7] for models with unequal variances and correlations.
- @Stan2016a reparameterizes the Student t distribution as a mixture of gamma distributions in Stan.
### Quantile regression
- @BenoitPoel2017a
- @YuZhang2005a for the three-parameter asymmetric Laplace distribution