Skip to content

Commit

Permalink
some fixes + merge branch 'master' of github.com:avehtari/BDA_R_demos
Browse files Browse the repository at this point in the history
  • Loading branch information
avehtari committed Nov 12, 2017
2 parents dd5aacb + a90c571 commit fee62cd
Show file tree
Hide file tree
Showing 5 changed files with 321 additions and 309 deletions.
48 changes: 25 additions & 23 deletions demos_rstan/geomagnetic.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -3,13 +3,13 @@ title: "Extreme value analysis and user defined probability functions in Stan"
author: "Aki Vehtari"
date: "`r format(Sys.Date())`"
output:
pdf_document: default
html_document:
keep_md: yes
pdf_document: default
html_notebook: default
---

This notebook demonstrates how to implement user defined probability functions in Stan language. As a an example I use generalized Pareto distribution (GPD) and geomagnetic storm data by World Data Center for Geomagnetism.
This notebook demonstrates how to implement user defined probability functions in Stan language. As an example I use the generalized Pareto distribution (GPD) to model geomagnetic storm data from the World Data Center for Geomagnetism.

Load some libraries:
```{r, comment=NA, results='hide', message=FALSE, warning=FALSE}
Expand All @@ -30,18 +30,19 @@ Read the data. This file has magnitudes of 373 geomagnetic storms which lasted l
```{r}
# file preview shows a header row
d <- read.csv("geomagnetic_tail_data.csv", header = FALSE)
# dst are the absolute magnitudes
colnames(d) <- "dst"
d <- d %>% mutate(dst = abs(dst)) %>% arrange(dst)
```

Compute empirical complementary cumulative distribution function.
Compute the empirical complementary cumulative distribution function
```{r}
n <- dim(d)[1]
d$ccdf <- seq(n,1,-1)/n
head(d)
```

Plot just the data and empirical ccdf
Plot just empirical ccdf against the measured magnitudes
```{r}
ggplot() +
geom_point(aes(dst, ccdf), data = d, size = 1, colour = "blue") +
Expand All @@ -55,30 +56,29 @@ ggplot() +
theme_bw()
```

The largest event in the data is the March 1989 geomagnetic storm also known as Quebec blackout event (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). Now we are interested in estimating the probability of observing a same magnitude event in the future which would be helpful, for example, for insurance companies (for a short term geomagnetic storm predictions there are very elaborate models using observations closer to sun, too). Extreme value theory says that many distributions have a tail which is well modelled with generalized Pareto distribution given the cutoff point is far enough in the tail. For a more detailed model we could also take into account that geomagnetic storms are temporally correlated and tend to appear in bursts.
The largest event in the data is the March 1989 geomagnetic storm, also known as Quebec blackout event (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). Now we are interested in estimating the probability of observing a same magnitude event in the future which would be helpful, for example, for insurance companies (for a short term geomagnetic storm predictions there are very elaborate models using observations closer to sun, too). Extreme value theory says that many distributions have a tail which is well modelled with generalized Pareto distribution given the cutoff point is far enough in the tail. For a more detailed model we could also take into account that geomagnetic storms are temporally correlated and tend to appear in bursts.

Generalized Pareto distribution is defined as
The Generalized Pareto distribution is defined as
$$
p(y|u,\sigma,k)=
\begin{cases}
\frac{1}{\sigma}\left(1+k\left(\frac{y-u}{\sigma}\right)\right)^{-\frac{1}{k}-1}, & k\neq 0 \\
\frac{1}{\sigma}\exp\left(\frac{y-u}{\sigma}\right), & k = 0,
\end{cases}
$$
where $u$ is a lower bound parameter, $y$ is restricted to the range
$(u,\infty)$, $\sigma$ is a scale parameter, and $k$ is a shape parameter (see, e.g., https://en.wikipedia.org/wiki/Generalized_Pareto_distribution for cdf and random number generation).
where $u$ is a lower bound parameter, $\sigma$ is a scale parameter, $k$ is a shape parameter, and $y$ is the data restricted to the range $(u, \inf)$ (see, e.g., https://en.wikipedia.org/wiki/Generalized_Pareto_distribution for cdf and random number generation).

For probability distributions implemented in Stan math library there are functions
For probability distributions implemented in the Stan math library there are functions

* _lpdf (or _lpmf): log probability density (mass) function
* _cdf: cumulative distribution function
* _lcdf: log cumulative distribution function
* _lccdf: log complementary cumulative distribution function
* _rng: generate random variables from the distribution

The user defined functions can have only one function signature unlike the functions implemented in C++ in Stan math library. In this example, I have chosen the function signatures so that the behavior is as close as possible to most usual ways to call these functions. The main feature is that all these function return a scalar real value (except _rng would return scalar integer for a discrete integer valued distribution).
Unlike the functions implemented in the C++ Stan math library, the user defined functions can have only signature. In this example, I have chosen the function signatures so that the behavior is as close as possible to most usual ways to call these functions. The main feature is that all these function return a scalar real value (except _rng would return scalar integer for a discrete integer valued distribution).

For generalized Pareto distribution we implement
For the Generalized Pareto distribution we implement

* real **gpareto_lpdf**(vector *y* | real *ymin*, real *k*, real *sigma*)
* real **gpareto_cdf**(vector *y* | real *ymin*, real *k*, real *sigma*)
Expand All @@ -90,19 +90,19 @@ As we can define only one function signature for user defined functions, I have

The whole code for functions, the basic model and the generated quantities for posterior predictive checking (ppc), leave-one-out cross-validation (loo), and prediction of rare events is shown below. I will next go through some practical details.

```{r}
```{r, comment=NA}
writeLines(readLines("gpareto.stan"))
```

For each function we do basic argument checking. In addition of invalid values due to user errors, due to the limited accuracy of the floating point presentation of the values, sometimes we may get invalid parameters. For example, due to the underflow, we may get sigma equal to 0, even if we have declared it as `real<lower=0> sigma;`. For these cases, it is useful to use `reject` statement, and the corresponding MCMC proposal will be rejected and the sampling can continue.
For each function we do basic argument checking. In addition of invalid values due to user errors, due to the limited accuracy of the floating point presentation of the values, sometimes we may get invalid parameters. For example, due to the underflow, we may get sigma equal to 0, even if we have declared it as `real<lower=0> sigma;`. For these latter cases, it is useful to use `reject` statement so the corresponding MCMC proposal will be rejected and the sampling can continue.
```
if (k<0 && max(y-ymin)/sigma > -inv_k)
reject("k<0 and max(y-ymin)/sigma > -1/k; found k, sigma =", k, sigma)
if (sigma<=0)
reject("sigma<=0; found sigma =", sigma)
```

Stan documentation warns about the use of `fabs` and conditional evaluation as they may lead to discontinuous energy or gradient. In GPD we need to handle the special case of $k \rightarrow 0$. The following will cause discontinuity, but the magnitude of the discontinuity is the same order as the accuracy of the floating point presentation, and thus we are not
Stan documentation warns about use of `fabs` and conditional evaluation as they may lead to discontinuous energy or gradient. In GPD we need to handle a special case of $k \rightarrow 0$. The following will cause discontinuity in theory, but the magnitude of the discontinuity is smaller than the accuracy of the floating point presentation, and thus this should not cause any additional numerical instability.
```
if (fabs(k) > 1e-15)
return -(1+inv_k)*sum(log1p((y-ymin) * (k/sigma))) -N*log(sigma);
Expand Down Expand Up @@ -196,11 +196,11 @@ sprintf('True sigma=%.2f, k=%.2f, estimated sigmahat=%.2f, khat=%.2f', sigma, k,
Next we use the defined Stan model to analyse the distribution of the largest geomagnetic storms.

Fit the Stan model
```{r}
```{r, warning=FALSE}
yt<-append(10^seq(2,3,.01),850)
ds<-list(ymin=100, N=n, y=d$dst, Nt=length(yt), yt=yt)
fit_gpd <- stan(file='gpareto.stan', data=ds, refresh=0,
chains=4, seed=100) #850
chains=4, seed=100)
```

Run the usual diagnostics (see [Robust Statistical Workflow with RStan](http://mc-stan.org/users/documentation/case-studies/rstan_workflow.html))
Expand All @@ -209,9 +209,9 @@ check_treedepth(fit_gpd)
check_energy(fit_gpd)
check_div(fit_gpd)
```
Diagnostics do not find anything alarming.
The diagnostics do not reveal anything alarming.

Plot the model fit with 90% posterior interval. The largest observed magnitude in the data corresponds to Quebec blackout in 1989 (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). The vertical dashed line at 850 shows the estimated magnitude of Carrington event in 1859 (https://en.wikipedia.org/wiki/Solar_storm_of_1859).
Plot the model fit with 90% posterior interval. The largest observed magnitude in the data corresponds to Quebec blackout in 1989 (https://en.wikipedia.org/wiki/March_1989_geomagnetic_storm). The vertical dashed line at 850 shows the estimated magnitude of the Carrington event in 1859 (https://en.wikipedia.org/wiki/Solar_storm_of_1859).
```{r}
gpd_params <- rstan::extract(fit_gpd)
mu <- apply(t(gpd_params$predccdf), 1, quantile, c(0.05, 0.5, 0.95)) %>%
Expand All @@ -232,7 +232,7 @@ ggplot() +
theme_bw()
```

For additional model checking plot 1) kernel density estimate of log(dst) and posterior predictive replicates, 2) max log magnitude (Quebec event) and histogram of maximums of posterior predictive replicates, 3) leave-one-out cross-validation probability-integral-transformation, and 4) khat values from the PSIS-LOO. None of these diagnostics can find problems in the model fit.
For additional model checking plot 1) kernel density estimate of log(dst) and posterior predictive replicates, 2) max log magnitude (Quebec event) and histogram of maximums of posterior predictive replicates, 3) leave-one-out cross-validation probability-integral-transformation, and 4) khat values from the PSIS-LOO. None of these diagnostics reveal problems in the model fit.
```{r}
ppc1 <- ppc_dens_overlay(log(d$dst),log(gpd_params$yrep[1:50,])) + labs(x="log(absolute dst)")
ppc2 <- ppc_stat(log(d$dst), log(gpd_params$yrep), stat = "max") + labs(x="max(log(absolute dst))")
Expand All @@ -244,17 +244,19 @@ ppc3 <- ppc_loo_pit(log(d$dst), log(gpd_params$yrep), lw=psis$lw_smooth)
grid.arrange(ppc1,ppc2,ppc3,pkhats,ncol=2)
```

We compute also PSIS-LOO estimate, although this is not much useful without alternative model to compare.
We compute also PSIS-LOO estimate, although this is not much use without an alternative model to compare against.
```{r}
(loo_gpd<-loo(gpd_params$log_lik))
```

Here the analysis was simplified by assuming the geomagnetic storm events to be independent in time although they appear in bursts. Generalized Pareto distribution is valid fit also for correlated data, but for predicting the probability of Quebec blackout magnitude event in the future we should take into account also the correlation. Using the simplified independence assumption and assuming same number of events larger than 100 in the next 57 years, we can compute
Here the analysis was simplified by assuming the geomagnetic storm events to be independent in time although they appear in bursts. The Generalized Pareto distribution can be used for correlated data, but we should really take the correlation into account for predicting the magnitude of storms in the future . Based on the simplified independence assumption though, and assuming same number of events larger than 100 in the next 57 years, we can compute
```{r}
round(mean(1-(1-gpd_params$predccdf[,78])^(373)),2)
round(mean(1-(1-gpd_params$predccdf[,102])^(373)),2)
```
That is, the probability that we would observe Quebec blackout magnitude event in the next 57 years is 80% and the probability of observing Carrington level event in the next 57 years is 40%. Now go read the effects of these geomagnetic storms.
That is, the probability that we would observe Quebec blackout magnitude event in the next 57 years is 80% and the probability of observing Carrington level event in the next 57 years is 40%.

Now go read about the effects of these geomagnetic storms!

<br />

Expand All @@ -268,7 +270,7 @@ sessionInfo()

### Appendix: Acknowledgments

I thank Ben Goodrich for useful comments on the draft of this notebook.
I thank Ben Goodrich and Ben Bales for useful comments on the draft of this notebook.

### Appendix: Licenses

Expand Down
285 changes: 142 additions & 143 deletions demos_rstan/geomagnetic.html

Large diffs are not rendered by default.

Loading

0 comments on commit fee62cd

Please sign in to comment.