rewrite the modeling stuff
jannes-m committed Apr 12, 2022
1 parent e42fe33 commit 22560b7
Showing 1 changed file (15-eco.Rmd) with 72 additions and 84 deletions.
Often ordinations\index{ordination} using presence-absence data yield better results.
Ordination techniques such as NMDS\index{NMDS} require at least one observation per site.
Hence, we need to dismiss all sites in which no species were found.

```{r 15-eco-11, eval=FALSE}
# presence-absence matrix
pa = decostand(comm, "pa") # 100 rows (sites), 69 columns (species)
# keep only sites in which at least one species was found
pa = pa[rowSums(pa) != 0, ]
```

The resulting output matrix serves as input for the NMDS\index{NMDS}.
NMDS\index{NMDS} is an iterative procedure trying to make the ordinated space more similar to the input matrix in each step.
To make sure that the algorithm converges, we set the number of steps to 500 (`try` parameter).

```{r 15-eco-12, eval=FALSE, message=FALSE}
set.seed(25072018)
nmds = metaMDS(comm = pa, k = 4, try = 500)
nmds$stress
#> 0.08831395
```
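A common rule of thumb (often attributed to Clarke's 1993 guidelines; the thresholds below are indicative, not hard limits, and are our addition) judges the quality of the fit via the stress value:

```{r, eval=FALSE}
# indicative stress thresholds: < 0.05 excellent, < 0.1 good,
# < 0.2 usable with caution, > 0.2 questionable
cut(0.089, breaks = c(0, 0.05, 0.1, 0.2, 1),
    labels = c("excellent", "good", "usable with caution", "questionable"))
```

By this convention, our stress of 0.09 indicates a good fit.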

```{r 15-eco-13, eval=FALSE, echo=FALSE}
saveRDS(nmds, "extdata/15-nmds.rds")
```

```{r 15-eco-14, include=FALSE, eval=FALSE}
nmds = readRDS("extdata/15-nmds.rds")
```

```{r xy-nmds, echo=FALSE}
knitr::include_graphics("figures/xy-nmds-1.png")
```

```{r 15-eco-15, eval=FALSE, echo=FALSE}
# scores and rotated scores in one figure
p1 = xyplot(scores(rotnmds)[, 2] ~ scores(rotnmds)[, 1], pch = 16,
col = "lightblue", xlim = c(-3, 2), ylim = c(-2, 2),
```

Here, we briefly introduce decision trees and bagging, since they form the basis of random forests\index{random forest}.
We refer the reader to @james_introduction_2013 for a more detailed description of random forests\index{random forest} and related techniques.

To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS\index{NMDS} scores to the field observations (`random_points`).
We will also use the resulting data frame for the **mlr3**\index{mlr3 (package)} modeling later on.

```{r 15-eco-16, eval=FALSE}
# construct response-predictor matrix
# id- and response variable
rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1])
rp = inner_join(random_points, rp, by = "id")
```
Decision trees split the predictor space into a number of regions.
To illustrate this, we apply a decision tree to our data using the scores of the first NMDS\index{NMDS} axis as the response (`sc`) and altitude (`dem`) as the only predictor.

```{r 15-eco-17, eval=FALSE}
library("tree")
tree_mo = tree(sc ~ dem, data = rp)
plot(tree_mo)
text(tree_mo, pretty = 0)
```

```{r 15-eco-18, echo=FALSE, eval=FALSE}
library("tree")
tree_mo = tree(sc ~ dem, data = rp)
```

```{r 15-eco-19, eval=FALSE, echo=FALSE}
png("figures/15_tree.png", width = 1100, height = 700, units = "px", res = 300)
par(mar = rep(1, 4))
plot(tree_mo)
```

<!--
for each split of the tree—in other words, that bagging should be done.
@james_introduction_2013
-->

### **mlr3** building blocks

The code in this section largely follows the steps we have introduced in Section \@ref(svm).
The only differences are the following:

1. The response variable is numeric, hence a regression\index{regression} task will replace the classification\index{classification} task of Section \@ref(svm).
1. Instead of the AUROC\index{AUROC}, which can only be used for categorical response variables, we will use the root mean squared error (RMSE\index{RMSE}) as the performance measure.
1. We use a random forest\index{random forest} model instead of a support vector machine\index{SVM} which naturally goes along with different hyperparameters\index{hyperparameter}.
1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see Exercises).
Instead we show how to tune hyperparameters\index{hyperparameter} for (spatial) predictions.
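For reference, the RMSE aggregates prediction errors as follows, where $y_i$ are the observed and $\hat{y}_i$ the predicted values of the $n$ observations:

$$
\operatorname{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}
$$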

Remember that 125,500 models were necessary to retrieve bias-reduced performance estimates when using 100-repeated 5-fold spatial cross-validation\index{cross-validation!spatial CV} and a random search of 50 iterations (see Section \@ref(svm)).
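One way to arrive at this figure (assuming, as a plausible breakdown, that each of the 50 random-search iterations was itself assessed with a 5-fold tuning-level CV, plus one final fit per outer fold) is:

```{r, eval=FALSE}
# outer level: 100 repetitions x 5 folds
# inner level: 50 random-search iterations x 5 tuning folds + 1 final fit
100 * 5 * (50 * 5 + 1)
#> [1] 125500
```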
This means that the inner hyperparameter\index{hyperparameter} tuning level is no longer needed.
Therefore, we tune the hyperparameters\index{hyperparameter} for a good spatial prediction on the complete dataset via a 5-fold spatial CV\index{cross-validation!spatial CV} with one repetition.
<!-- If we used more than one repetition (say 2) we would retrieve multiple optimal tuned hyperparameter combinations (say 2) -->

Having already constructed the input variables (`rp`), we are all set for specifying the **mlr3**\index{mlr3 (package)} building blocks (task, learner, and resampling).
For specifying a spatial task, we again use the **mlr3spatiotempcv** package (Section \@ref(spatial-cv-with-mlr3)).
Since our response (`sc`) is numeric, we use a regression\index{regression} task.

```{r 15-eco-20, eval=FALSE}
# create task
task = TaskRegrST$new(id = "mongon", backend = dplyr::select(rp, -id, -spri),
target = "sc")
```

Using an `sf` object as the backend automatically provides the geometry information needed for the spatial partitioning later on.
Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling.
Next, we construct a random forest\index{random forest} learner from the **ranger** package.

```{r 15-eco-21, eval=FALSE}
lrn_rf = lrn("regr.ranger", predict_type = "response")
```
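To see which hyperparameters the learner exposes, and their current defaults, one can print its parameter set (`$param_set` is the documented **mlr3** accessor):

```{r, eval=FALSE}
# overview of all tunable hyperparameters of the ranger learner
lrn_rf$param_set
```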

In contrast to, for example, support vector machines\index{SVM} (see Section \@ref(svm)), random forests often already show good performance when used with the default values of their hyperparameters (which may be one reason for their popularity).
Still, tuning often moderately improves model results, and thus is worth the effort [@probst_hyperparameters_2018].

In random forests\index{random forest}, the hyperparameters\index{hyperparameter} `mtry`, `min.node.size` and `sample.fraction` determine the degree of randomness, and should be tuned [@probst_hyperparameters_2018].
`mtry` indicates how many predictor variables should be used in each tree.
If all predictors are used, then this corresponds in fact to bagging (see beginning of Section \@ref(modeling-the-floristic-gradient)).
Naturally, as trees and computing time become larger, the lower the `min.node.size`.
Hyperparameter\index{hyperparameter} combinations will be selected randomly but should fall inside specific tuning limits (`ps()`).
`mtry` should range between 1 and the number of predictors
<!-- (`r # ncol(rp) - 1`), -->
(4), `sample.fraction` should range between 0.2 and 0.9, and `min.node.size` should range between 1 and 10.

```{r 15-eco-23, eval=FALSE}
search_space = ps(
mtry = p_int(lower = 1, upper = ncol(task$data()) - 1),
sample.fraction = p_dbl(lower = 0.2, upper = 0.9),
min.node.size = p_int(lower = 1, upper = 10)
)
```

Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function.
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr3)).
Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`).
In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`search_space`) to find the optimal hyperparameter\index{hyperparameter} combination.

```{r 15-eco-22, eval=FALSE}
at = AutoTuner$new(
learner = lrn_rf,
# spatial partitioning
resampling = rsmp("spcv_coords", folds = 5),
# performance measure
measure = msr("regr.rmse"),
# specify 50 iterations
terminator = trm("evals", n_evals = 50),
# predefined hyperparameter search space
search_space = search_space,
# specify random search
tuner = tnr("random_search")
)
```

The performance measure is the root mean squared error (RMSE\index{RMSE}).

```{r 15-eco-24, eval=FALSE}
# hyperparameter tuning
set.seed(0412022)
at$train(task)
#>...
#> INFO [11:39:31.375] [mlr3] Finished benchmark
#> INFO [11:39:31.427] [bbotk] Result of batch 50:
#> INFO [11:39:31.432] [bbotk] mtry sample.fraction min.node.size regr.rmse warnings errors runtime_learners
#> INFO [11:39:31.432] [bbotk] 2 0.2059293 10 0.5118128 0 0 0.149
#> INFO [11:39:31.432] [bbotk] uhash
#> INFO [11:39:31.432] [bbotk] 81dfa6f8-0109-410d-bdb5-a1490e7caf8e
#> INFO [11:39:31.448] [bbotk] Finished optimizing after 50 evaluation(s)
#> INFO [11:39:31.451] [bbotk] Result:
#> INFO [11:39:31.455] [bbotk] mtry sample.fraction min.node.size learner_param_vals x_domain regr.rmse
#> INFO [11:39:31.455] [bbotk] 4 0.8999753 7 <list[4]> <list[3]> 0.3751501
```
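After training, the winning configuration is stored in the `AutoTuner` object; a quick way to inspect it (field names as documented in **mlr3tuning**) is:

```{r, eval=FALSE}
# best hyperparameter combination and its cross-validated RMSE
at$tuning_result
# the tuned learner used for subsequent predictions
at$learner
```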

```{r 15-eco-25, eval=FALSE, echo=FALSE}
saveRDS(at, "extdata/15-tune.rds")
```

```{r 15-eco-26, echo=FALSE, eval=FALSE}
tune = readRDS("extdata/15-tune.rds")
```

An `mtry` of 4, a `sample.fraction` of 0.887, and a `min.node.size` of 10 represent the best hyperparameter\index{hyperparameter} combination.
An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of
<!-- `r round(tune$tuning_result$regr.rmse, 2)` -->
0.38
is relatively good when considering the range of the response variable which is
<!-- `r # round(diff(range(rp$sc)), 2)` -->
3.04
The tuned hyperparameters\index{hyperparameter} can now be used for the prediction.

```{r 15-eco-27, eval=FALSE}
# predicting using the best hyperparameter combination
at$predict(task)
```

The last step is to apply the model to the spatially available predictors, i.e., to the raster stack\index{raster!stack}.
So far, `terra::predict()` does not support the output of **ranger** models, hence, we will have to program the prediction ourselves.
First, we convert `ep` into a prediction data frame; second, this data frame serves as input for the `predict.ranger()` function; and third, we put the predicted values back into a `SpatRaster`\index{raster} (see Section \@ref(raster-subsetting) and Figure \@ref(fig:rf-pred)).
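These three steps might be sketched as follows; note that this is a minimal sketch assuming the underlying **ranger** model can be extracted from the trained `AutoTuner` (e.g., via `at$learner$model`) and that `ep` fits into memory:

```{r, eval=FALSE}
# 1. convert the predictor raster into a data frame (one row per cell)
new_data = as.data.frame(as.matrix(ep))
# 2. apply the fitted ranger model; predict.ranger() returns an object
#    whose $predictions element holds the predicted values
pred = predict(at$learner$model, data = new_data)
# 3. put the predicted values back into a raster template
pred_raster = ep[[1]]
pred_raster[] = pred$predictions
```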

```{r 15-eco-28, eval=FALSE}
terra::predict(ep, model = at)
# convert raster stack into a data frame
new_data = as.data.frame(as.matrix(ep))
# apply the model to the data frame
```
