
Commit

fixing formatting
jrnold committed Apr 15, 2018
1 parent 33a341a commit f0a4d28
Showing 38 changed files with 3,356 additions and 2,174 deletions.
15 changes: 15 additions & 0 deletions .remarkrc
@@ -0,0 +1,15 @@
{
"plugins": [
"remark-preset-lint-recommended",
"remark-preset-lint-consistent",
"remark-preset-lint-markdown-style-guide",
"remark-frontmatter",
["remark-lint-file-extension", false],
["remark-lint-maximum-line-length", 300],
["remark-lint-no-shortcut-reference-link", false],
["remark-lint-list-item-indent", "tab-size"],
["remark-lint-no-undefined-references", false],
["remark-lint-emphasis-marker", false],
["remark-lint-fenced-code-flag", false]
]
}
413 changes: 391 additions & 22 deletions bayesian-computation.Rmd

Large diffs are not rendered by default.

48 changes: 13 additions & 35 deletions counts.Rmd
@@ -7,8 +7,8 @@
$$

The two distributions most commonly used to model this are

- Poisson
- Negative Binomial

## Poisson

@@ -17,8 +17,6 @@
$$
y_i \sim \dpois(\lambda_i) ,
$$



## Negative Binomial

The [Negative Binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution) model is also used for unbounded count data,
@@ -41,8 +39,7 @@
$$
\end{aligned}
$$


**Important:** The negative binomial distribution has many different parameterizations.
An alternative parameterization of the negative binomial uses the mean and an over-dispersion parameter.
$$
y_i \sim \dnbinalt(\mu_i, \phi)
@@ -59,32 +56,18 @@
$$

In Stan, there are multiple parameterizations of the negative binomial:

- `neg_binomial_lpdf(y | alpha, beta)` with shape parameter `alpha` and inverse scale parameter `beta`.
- `neg_binomial_2_lpdf(y | mu, phi)` with mean `mu` and over-dispersion parameter `phi`.
- `neg_binomial_2_log_lpdf(y | eta, phi)` with log-mean `eta` and over-dispersion parameter `phi`.

Also, `rstanarm` supports Poisson and [negative binomial models](https://cran.r-project.org/web/packages/rstanarm/vignettes/count.html).
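To connect the two parameterizations, `neg_binomial_2(mu, phi)` corresponds to `neg_binomial(alpha = phi, beta = phi / mu)`, giving mean $\mu$ and variance $\mu + \mu^2/\phi$. The following pure-Python sketch (illustrative only, not Stan or R code from the text) checks those moments using the gamma-Poisson mixture representation of the negative binomial:

```python
import math
import random

def rpois(lam, rng):
    """Poisson draw via Knuth's algorithm (adequate for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def rnbinom2(mu, phi, rng):
    """neg_binomial_2(mu, phi) draw as a gamma-Poisson mixture:
    lam ~ Gamma(shape = phi, scale = mu / phi), then y ~ Poisson(lam)."""
    lam = rng.gammavariate(phi, mu / phi)
    return rpois(lam, rng)

rng = random.Random(1)
mu, phi = 5.0, 2.0
draws = [rnbinom2(mu, phi, rng) for _ in range(40000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# mean should be near mu = 5; variance near mu + mu^2 / phi = 17.5
```

The over-dispersion is visible directly: the variance exceeds the mean by $\mu^2/\phi$, and as $\phi \to \infty$ the distribution approaches the Poisson.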

### Example: Number of Number o

```{r}
```

TODO

### References

For general references on count models see

- @BDA3 [Ch. 16]
- @GelmanHill2007a [pp. 109-116]
- @McElreath2016a [Ch. 10]
- @Fox2016a [Ch. 14]


where $y_i \in \{0, 1, 2, \dots\}$.

The parameter $\lambda_i \in (0, \infty)$ is both the mean and variance of the distribution.
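As a quick numerical check of this property (a pure-Python sketch added for illustration, not part of the chapter's R/Stan code), sampling from a Poisson with Knuth's algorithm shows the sample mean and variance both approximating $\lambda$:

```python
import math
import random

def rpois(lam, rng):
    """Draw from Poisson(lam) via Knuth's algorithm (fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

rng = random.Random(42)
lam = 4.0
draws = [rpois(lam, rng) for _ in range(20000)]
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / len(draws)
# Both mean and var should be close to lam = 4.0
```

This equality of mean and variance is exactly what count data often violate, which motivates the negative binomial model below.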

### Link functions
@@ -104,14 +87,14 @@
$$

In Stan, the Poisson distribution has two implementations:

- `r stanfunc("poisson_lpdf")`
- `r stanfunc("poisson_log_lpdf")`

The latter parameterizes the Poisson in terms of the log mean (a log link), which improves numerical stability.
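The stability gain can be seen in a small pure-Python analogue (illustrative only; this is not Stan's implementation): computing the log-density through $\lambda = \exp(\eta)$ fails when $\exp(\eta)$ underflows to zero, while working directly on the log scale does not.

```python
import math

def pois_lpmf_naive(y, eta):
    """Log-density via lambda = exp(eta); breaks when exp(eta) underflows to 0."""
    lam = math.exp(eta)
    return y * math.log(lam) - lam - math.lgamma(y + 1)

def pois_lpmf_log(y, eta):
    """Log-density directly in terms of the log-mean eta (cf. poisson_log_lpdf)."""
    return y * eta - math.exp(eta) - math.lgamma(y + 1)

# Moderate eta: the two agree.
assert abs(pois_lpmf_naive(3, 1.2) - pois_lpmf_log(3, 1.2)) < 1e-9

# Extreme eta: exp(-800) underflows to 0.0, so math.log(lam) raises an error,
# while the log-scale version still returns the correct value.
try:
    pois_lpmf_naive(0, -800.0)
except ValueError:
    pass  # log(0.0) is undefined
val = pois_lpmf_log(0, -800.0)  # equals 0 up to floating point
```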

In **rstanarm** use `r rdoc("rstanarm", "stan_glm")` for a Poisson GLM.

- **rstanarm** vignette on [count models](https://cran.r-project.org/web/packages/rstanarm/vignettes/count.html).

## Example: Bilateral Sanctions

@@ -139,8 +122,6 @@
sanction_data$K <- ncol(sanction_data$X)
fit_sanction_pois <- sampling(mod_poisson1, data = sanction_data)
```



## Negative Binomial

The [Negative Binomial](https://en.wikipedia.org/wiki/Negative_binomial_distribution) model is also used for unbounded count data,
@@ -163,7 +144,6 @@
$$
\end{aligned}
$$


**Important:** The negative binomial distribution has many different parameterizations.
An alternative parameterization of the negative binomial uses the mean and an over-dispersion parameter.
$$
@@ -179,22 +159,20 @@
$$

In Stan, there are multiple parameterizations of the negative binomial:

- `neg_binomial_lpdf(y | alpha, beta)` with shape parameter `alpha` and inverse scale parameter `beta`.
- `neg_binomial_2_lpdf(y | mu, phi)` with mean `mu` and over-dispersion parameter `phi`.
- `neg_binomial_2_log_lpdf(y | eta, phi)` with log-mean `eta` and over-dispersion parameter `phi`.

Also, `rstanarm` supports Poisson and [negative binomial models](https://cran.r-project.org/web/packages/rstanarm/vignettes/count.html).

### Example: Economic Sanctions II {#ex-econ-sanctions-2}

Continuing the [economic sanctions example](#ex-econ-sanctions-2) of @Martin1992a.


```{r results='hide'}
mod_negbin1 <- stan_model("stan/negbin1.stan")
```


```{r results='hide'}
fit_sanction_nb <- sampling(mod_negbin1, data = sanction_data,
control = list(adapt_delta = 0.95))
105 changes: 64 additions & 41 deletions diagnostics.Rmd
@@ -1,14 +1,15 @@
---
output: html_document
editor_options:
chunk_output_type: console
---

# MCMC Diagnostics

There are two parts of checking a Bayesian model:

1. diagnostics: Is the sampler working? Is it adequately approximating the specified posterior distribution $p(\theta | D)$?
1. model fit: Does the model adequately represent the data?

This chapter covers the former.

@@ -21,12 +22,11 @@
library("rstan")
library("tidyverse")
```


## Reparameterize Models

1. Reduce correlation between parameters (e.g. see `mcmc_pairs`).
1. Put parameters on the same scale. The samplers work best when all parameters are roughly on the same scale, e.g. $\approx 1$. Try to avoid situations where parameters are orders of magnitude apart, e.g. 1e-5 and 1e+10.
1. Increase the informativeness of priors. If priors are too uninformative, the posterior distribution may have wide tails that hamper sampling. One way of thinking about it is that the model is only "weakly identified" and requires either more data or more informative priors to estimate.

## Convergence Diagnostics

@@ -50,25 +50,22 @@
See @Stan2016a [Sec. 28.2] for the equations used to calculate these.

**TODO:** Examples of passing and non-passing $\hat{R}$ chains using fake data generated from known functions with a given autocorrelation.


**Rule of Thumb:** The $\hat{R}$ values for all parameters should be less than 1.1 [source](https://cran.r-project.org/web/packages/rstanarm/vignettes/rstanarm.html).
Note that **all** parameters must show convergence.
This is a necessary but not sufficient condition for convergence.
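To make the statistic concrete, here is a minimal split-$\hat{R}$ sketch in pure Python (an illustration of the idea; rstan's production implementation differs in details such as rank-normalization in newer versions). Each chain is split in half and between-half variance is compared to within-half variance:

```python
import math
import random

def split_rhat(chains):
    """Split R-hat: split each chain in half, compare between/within variance."""
    halves = []
    for c in chains:
        n = len(c) // 2
        halves.append(c[:n])
        halves.append(c[n:2 * n])
    m, n = len(halves), len(halves[0])
    means = [sum(h) / n for h in halves]
    grand = sum(means) / m
    b = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-half
    w = sum(sum((x - mu) ** 2 for x in h) / (n - 1)
            for h, mu in zip(halves, means)) / m               # within-half
    var_plus = (n - 1) / n * w + b / n
    return math.sqrt(var_plus / w)

rng = random.Random(0)
mixed = [[rng.gauss(0, 1) for _ in range(1000)] for _ in range(4)]
stuck = [ch[:] for ch in mixed]
stuck[0] = [x + 5 for x in stuck[0]]  # one chain stuck in a different region

# Well-mixed chains give R-hat near 1; the shifted set gives R-hat well above 1.1.
```

When one chain explores a different region, the between-chain variance inflates `var_plus` relative to `w`, pushing $\hat{R}$ above 1.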


### References

- @BDA3 [p. 267]
- @Stan2016a [Ch. 28] for how Stan calculates $\hat{R}$, autocorrelations, and ESS
- @GelmanRubin1992a, which introduces the $\hat{R}$ statistic

## Autocorrelation, Effective Sample Size, and MCSE

MCMC samples are dependent. This does not affect the validity of inference about the posterior as long as the sampler has time to explore the posterior distribution, but it does affect the efficiency of the sampler.

In other words, a highly autocorrelated MCMC sampler requires more samples to produce the same level of Monte Carlo error for an estimate.


### Effective Sample Size

The effective sample size (ESS) measures the amount by which autocorrelation in samples increases uncertainty (standard errors) relative to an independent sample.
@@ -92,8 +89,11 @@
$$
\hat{n}_{eff} = \frac{mn}{1 + 2 \sum_{t = 1}^T \hat{\rho}_t}
$$

- See also @Stan2016a [Sec. 28.4], @Geyer2011a, and @BDA3 [Sec. 11.5].
- This isn't the only way to calculate the effective sample size. The `r rpkg("coda")` package function `r rdoc("coda", "effectiveSize")` uses a different method. The differences are due to how the autocorrelation is calculated.
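A minimal version of the formula above in pure Python (an assumption-laden sketch: it truncates the sum at the first negative autocorrelation estimate, which is close in spirit to, but not identical with, rstan's method):

```python
import random

def ess(x):
    """Effective sample size of a single chain from its autocorrelations."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    acc = 0.0
    for t in range(1, n):
        rho = sum((x[i] - mean) * (x[i + t] - mean) for i in range(n - t)) / (n * var)
        if rho < 0:          # truncate at the first negative estimate
            break
        acc += rho
    return n / (1 + 2 * acc)

rng = random.Random(3)
iid = [rng.gauss(0, 1) for _ in range(2000)]

ar = [rng.gauss(0, 1)]
for _ in range(1999):        # AR(1) chain with coefficient 0.9
    ar.append(0.9 * ar[-1] + rng.gauss(0, 1))

# The iid series keeps an ESS near n = 2000; the AR(1) series loses most of it
# (theoretically about (1 - 0.9) / (1 + 0.9), i.e. ~5% of n).
```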

**Example:** Comparison of the effective sample sizes for data generated with various levels of autocorrelation.
The package `rstan` does not directly expose the function it uses to calculate ESS, so this `ess` function does so (for a single chain).
@@ -112,7 +112,7 @@
sims <- map_df(c(0, 0.5, 0.75, 0.99),
function(ar) {
tibble(ar = ar,
y = if (ar == 0) {
rnorm(n)
} else {
as.numeric(arima.sim(list(ar = ar), n))
},
@@ -137,15 +137,15 @@
Due to the autocorrelation, the reduction in the number of effective samples will be less than the reduction in the number of samples.

Both of these will produce 1,000 samples from the posterior, but the effective sample size of $B$ will be greater than that of $A$, since after thinning the autocorrelation in $B$ will be lower.

- *A* Generating 1,000 samples after convergence and save all of them
- *B* Generating 10,000 samples after convergence and save every 10th sample

In this case, A produces 10,000 samples, and B produces 1,000.
The effective sample size of A will be higher than B.
However, due to autocorrelation, the proportional reduction in the effective sample size in B will be less than the thinning: $N_{eff}(A) / N_{eff}(B) < 10$.

- *A* Generating 10,000 samples after convergence and save all of them
- *B* Generating 10,000 samples after convergence and save every 10th sample

Thinning trades off sample size for memory, and due to autocorrelation in samples, loss in effective sample size is less than the loss in sample size.
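For the idealized AR(1) case this trade-off can be computed exactly: a chain with autocorrelation $\rho$ has $n_{eff} \approx n(1-\rho)/(1+\rho)$, and thinning by $k$ leaves an AR(1) chain with autocorrelation $\rho^k$. A small sketch (stated only for this idealized case, not general MCMC output):

```python
def ess_ar1(n, rho):
    """Approximate effective sample size of n draws from an AR(1) chain."""
    return n * (1 - rho) / (1 + rho)

full = ess_ar1(10_000, 0.9)          # keep all 10,000 draws
thinned = ess_ar1(1_000, 0.9 ** 10)  # keep every 10th draw

# Raw draws drop 10x, but effective draws drop far less, because thinning
# also cuts the autocorrelation (0.9 -> 0.9^10, roughly 0.35).
ratio = full / thinned
```

Here `ratio` is only slightly above 1, matching the claim that $N_{eff}(A) / N_{eff}(B) < 10$ even though $A$ stores ten times as many draws.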

@@ -172,22 +172,22 @@
map_df(c(1, 2, 4, 8, 16, 32), thin_ess,
x = arima.sim(list(ar = .9), 4096))
```

- @BDA3 [pp. 282-283]
- @Stan2016a [pp. 354-355]

### Traceplots

Trace plots are a time series of sampler iterations, e.g. as produced by `r rdoc("bayesplot", "mcmc_trace")`.

These can, but **should not**, be used to assess convergence,

- such visual inspection is 'notoriously unreliable' [@BDA3, p. 285]
- it cannot scale to many parameters

Trace plots may be useful for diagnosing convergence problems *after* $\hat{R}$ or $n_{eff}$ indicates problems. Some possible issues to check in these plots are

- multimodality (the traceplot jumps between different distributions)
- wide posterior tails (the traceplot shows regions where the sampler will reach and have difficulty returning to the main distribution)

### Monte Carlo Standard Error (MCSE)

@@ -219,10 +219,9 @@
See @FlegalHaranJones2008a and the `r rpkg("mcmcse")` package for methods to calculate MCSE.

The estimation of standard errors for quantiles, as would be used in is more complicated. See the package `r rpkg("mcmcse")` for Monte Carlo standard errors of quantiles (though calculated in a different method than **rstan**).
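For the posterior mean, the calculation reduces to the posterior standard deviation divided by the square root of the effective sample size. A hedged pure-Python sketch (illustrative; **mcmcse** and rstan differ in how they estimate $n_{eff}$):

```python
import math
import random

def mcse_mean(x, n_eff):
    """Monte Carlo standard error of the mean: sd / sqrt(n_eff)."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return sd / math.sqrt(n_eff)

rng = random.Random(5)
draws = [rng.gauss(0, 1) for _ in range(10_000)]

# For independent draws n_eff = n, so the MCSE is about 1 / sqrt(10000) = 0.01.
se = mcse_mean(draws, n_eff=len(draws))
```

With autocorrelated draws, the same formula applies but with the (smaller) effective sample size, so the MCSE grows accordingly.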

- @BDA3 [Sec. 10.5]
- @FlegalHaranJones2008a
- [Talk by Geyer on MCSE](http://www.stat.umn.edu/geyer/mcmc/talk/mcmc.pdf)

## HMC-NUTS Specific Diagnostics

@@ -231,42 +230,66 @@
This is unusual, as most Bayesian sampling methods do not give any indication of when they fail.

Three specific HMC-NUTS diagnostics are

1. divergent transitions
1. maximum tree-depth
1. Bayesian fraction of missing information

The general way to fix these issues is to manually adjust the HMC-NUTS sampler parameters:

1. Step size: the length of the steps to take
1. Tree depth: the number of steps to take

During the warmup period, Stan tunes these values; however, the auto-tuned parameters may not always be optimal.
The other alternative is to reparameterize the model.


### Divergent transitions

The details of the HMC are technical and can be found **TODO**. The gist of the problem is that Stan is using a discrete approximation of a continuous function when integrating.
If the step sizes are too large, the discrete approximation does not work.
Helpfully, when the approximation is poor the sampler does not fail silently; instead, it signals the problem by reporting "divergent transitions".

If there are too many divergent transitions, then the sampler is not drawing samples from the entire posterior and inferences will be biased.

Reduce the step size. This can be done by increasing the `adapt_delta` parameter.
This is the target average proposal acceptance probability in the adaptation, which is used to determine the step size during warmup.
A higher desired acceptance probability (closer to 1) reduces the step size. A smaller step size means that more steps are required to explore the posterior distribution.

See @Stan2016a [p. 380].

### Maximum Tree-depth

NUTS is an intelligent method to select the number of steps to take in each iteration. However, there is still a maximum number of steps that NUTS will try.
If the sampler is often hitting the maximum number of steps, it means that the optimal number of steps to take in each iteration is higher than the maximum.
While divergent transitions bias inference, a too-small maximum tree-depth only affects efficiency.
The sampler is still exploring the posterior distribution, but the exploration will be slower and the autocorrelation higher (effective sample size lower) than if the maximum tree-depth were set higher.

Increase the maximum tree-depth.

### Bayesian Fraction of Missing Information

This is rather technical. See @Betancourt2016a.

## Debugging Bayesian Computation

See @BDA3 [p. 270], ...

1. Pick a reasonable value for the "true" parameter vector $\theta^* \sim p(\theta)$. This should be a random draw from the prior, but if the prior on $\theta$ is uninformative, then any reasonable value of $\theta$ should work.
1. Draw all other parameters that are conditional on $\theta^*$.
1. Simulate a large fake dataset $y^{fake}$ from the data distribution $p(y | \theta^*)$.
1. Calculate the posterior $p(\theta | y^{fake})$.
1. Compare the posterior distribution $p(\theta | y^{fake})$ to the true value $\theta^*$. For example, there should be a 50% probability that the 50% posterior interval of $p(\theta | y^{fake})$ contains $\theta^*$.
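The steps above can be sketched end-to-end with a conjugate model, where the posterior is available in closed form. This is a hypothetical normal-normal example added for illustration, not from the text: prior $\theta \sim N(0, 1)$, data $y_i \sim N(\theta, 1)$, so the posterior is $N(n\bar{y}/(n+1), 1/(n+1))$, and the 50% posterior interval should cover $\theta^*$ about half the time:

```python
import math
import random
from statistics import NormalDist

z50 = NormalDist().inv_cdf(0.75)  # half-width multiplier for a 50% interval

rng = random.Random(7)
n, sims, covered = 20, 2000, 0
for _ in range(sims):
    theta_star = rng.gauss(0, 1)                      # 1. draw "true" value from prior
    y = [rng.gauss(theta_star, 1) for _ in range(n)]  # 3. simulate fake data
    post_mean = n * (sum(y) / n) / (n + 1)            # 4. conjugate posterior
    post_sd = math.sqrt(1 / (n + 1))
    lo = post_mean - z50 * post_sd                    # 5. 50% posterior interval
    hi = post_mean + z50 * post_sd
    covered += lo <= theta_star <= hi

coverage = covered / sims  # should be close to 0.50
```

With an MCMC-based posterior the same check applies, substituting posterior quantiles of the draws for the closed-form interval; coverage far from 50% indicates a bug in the model or the computation.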

Notes

- for a large number of parameters: calculate the proportion of 50% credible intervals that contain the true value.

- residual plot: for all scalar parameters $\theta_j$, calculate the residual $\E[\theta_j | y^{fake}] - \theta^*_j$. The residuals should have mean 0.

- for models with few parameters, perform many fake data simulations.

- convergence and posterior predictive tests may also indicate issues with the model.

- it could be either a problem with the model, or a problem in computation.
- simplify the model by removing parameters or setting them to fixed values

- Start with a simple model and build complexity. This is useful both because the simple model may be "good enough" for your purposes, and because adding one part at a time makes it easier to catch and fix bugs.
6 changes: 3 additions & 3 deletions distributions.Rmd
@@ -11,8 +11,8 @@
The [Probability Distributions](https://cran.r-project.org/web/views/Distributio

There are a few variations of this chart online:

- [Univariate Distribution Relationships](http://www.math.wm.edu/~leemis/chart/UDR/UDR.html)
- [Diagram of distribution relationships](https://www.johndcook.com/blog/distribution_chart/)

<!--
### Beta Distribution
@@ -66,7 +66,7 @@
Shape $k > 0$ and scale $\theta > 0$,
$$
\dgamma(x | k, \theta) = \frac{1}{\Gamma(k) \theta^k}x^{k - 1}\exp(- x / \theta) .
$$
with
$$
\begin{aligned}[t]
\mu = \E[X] &= k \theta \\
