Skip to content

Commit

Permalink
Merge
Browse files Browse the repository at this point in the history
  • Loading branch information
Robinlovelace committed Apr 24, 2022
2 parents 5ac54d0 + 6a2ea4a commit 3b7dd43
Show file tree
Hide file tree
Showing 10 changed files with 226 additions and 230 deletions.
162 changes: 72 additions & 90 deletions 12-spatial-cv.Rmd

Large diffs are not rendered by default.

92 changes: 51 additions & 41 deletions 15-eco.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ In it you will also make use of R's\index{R} interfaces to dedicated GIS\index{G

The chapter uses the following packages:

```{r 15-eco-1, message=FALSE, eval=FALSE}
```{r 15-eco-1, message=FALSE}
library(data.table)
library(dplyr)
library(mlr3)
Expand All @@ -31,7 +31,7 @@ To do so, we will bring together concepts presented in previous chapters and eve
Fog oases are one of the most fascinating vegetation formations we have ever encountered.
These formations, locally termed *lomas*, develop on mountains along the coastal deserts of Peru and Chile.^[Similar vegetation formations develop also in other parts of the world, e.g., in Namibia and along the coasts of Yemen and Oman [@galletti_land_2016].]
The deserts' extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases.
Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austal winter.
Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austral winter.
This results in green southern-facing mountain slopes along the coastal strip of Peru (Figure \@ref(fig:study-area-mongon)).
This fog, which develops below the temperature inversion caused by the cold Humboldt current in austral winter, provides the name for this habitat.
Every few years, the El Niño phenomenon brings torrential rainfall to this sun-baked environment [@dillon_lomas_2003].
Expand Down Expand Up @@ -70,7 +70,7 @@ To guarantee an optimal prediction, it is advisable to tune beforehand the hyper

All the data needed for the subsequent analyses is available via the **spDataLarge** package.

```{r 15-eco-2, eval=FALSE}
```{r 15-eco-2}
# spatial vector objects
data("study_area", "random_points", "comm", package = "spDataLarge")
# spatial raster objects
Expand Down Expand Up @@ -229,13 +229,13 @@ ep$carea = log10(ep$carea)

As a convenience to the reader, we have added `ep` to **spDataLarge**:

```{r 15-eco-9, eval=FALSE}
```{r 15-eco-9, cache.lazy=FALSE}
ep = terra::rast(system.file("raster/ep.tif", package = "spDataLarge"))
```

Finally, we can extract the terrain attributes to our field observations (see also Section \@ref(raster-extraction)).

```{r 15-eco-10, eval=FALSE}
```{r 15-eco-10, cache=TRUE, cache.lazy=FALSE, message=FALSE, warning=FALSE}
random_points[, names(ep)] =
# terra::extract adds automatically a for our purposes unnecessary ID column
terra::extract(ep, terra::vect(random_points)) |>
Expand Down Expand Up @@ -272,7 +272,7 @@ Often ordinations\index{ordination} using presence-absence data yield better res
Ordination techniques such as NMDS\index{NMDS} require at least one observation per site.
Hence, we need to dismiss all sites in which no species were found.

```{r 15-eco-11, eval=FALSE}
```{r 15-eco-11}
# presence-absence matrix
pa = vegan::decostand(comm, "pa") # 100 rows (sites), 69 columns (species)
# keep only sites in which at least one species was found
Expand Down Expand Up @@ -303,7 +303,7 @@ nmds$stress
saveRDS(nmds, "extdata/15-nmds.rds")
```

```{r 15-eco-14, include=FALSE, eval=FALSE}
```{r 15-eco-14, include=FALSE}
nmds = readRDS("extdata/15-nmds.rds")
```

Expand All @@ -326,7 +326,13 @@ plot(y = sc[, 1], x = elev, xlab = "elevation in m",
ylab = "First NMDS axis", cex.lab = 0.8, cex.axis = 0.8)
```

```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%"}
```{r xy-nmds, fig.cap="Plotting the first NMDS axis against altitude.", fig.scap = "First NMDS axis against altitude plot.", fig.asp=1, out.width="60%", message=FALSE, echo=FALSE}
elev = dplyr::filter(random_points, id %in% rownames(pa)) |>
dplyr::pull(dem)
# rotating NMDS in accordance with altitude (proxy for humidity)
rotnmds = vegan::MDSrotate(nmds, elev)
# extracting the first two axes
sc = vegan::scores(rotnmds, choices = 1:2)
knitr::include_graphics("figures/15_xy_nmds.png")
```

Expand Down Expand Up @@ -392,15 +398,22 @@ We refer the reader to @james_introduction_2013 for a more detailed description

To introduce decision trees by example, we first construct a response-predictor matrix by joining the rotated NMDS\index{NMDS} scores to the field observations (`random_points`).
We will also use the resulting data frame for the **mlr3**\index{mlr3 (package)} modeling later on.

```{r 15-eco-16, eval=FALSE}
<!-- JM: build process stops telling us that sc[, 1] causes the problem though I really don't know why... -->
```{r 15-eco-16, message=FALSE, eval=FALSE}
# construct response-predictor matrix
# id- and response variable
rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1])
# join the predictors (dem, ndvi and terrain attributes)
rp = inner_join(random_points, rp, by = "id")
```

```{r attach-rp, echo=FALSE}
# rp = data.frame(id = as.numeric(rownames(sc)), sc = sc[, 1])
# rp = inner_join(random_points, rp, by = "id")
# saveRDS(rp, "extdata/15-rp.rds")
rp = readRDS("extdata/15-rp.rds")
```

Decision trees split the predictor space into a number of regions.
To illustrate this, we apply a decision tree to our data using the scores of the first NMDS\index{NMDS} axis as the response (`sc`) and altitude (`dem`) as the only predictor.

Expand All @@ -424,7 +437,7 @@ text(tree_mo, pretty = 0)
dev.off()
```

```{r tree, echo=FALSE, fig.cap="Simple example of a decision tree with three internal nodes and four terminal nodes.", fig.scap="Simple example of a decision tree."}
```{r tree, echo=FALSE, fig.cap="Simple example of a decision tree with three internal nodes and four terminal nodes.", out.width="60%", fig.scap="Simple example of a decision tree."}
knitr::include_graphics("figures/15_tree.png")
```

Expand Down Expand Up @@ -475,20 +488,20 @@ Therefore, we tune the hyperparameters\index{hyperparameter} for a good spatial
<!-- If we used more than one repetition (say 2) we would retrieve multiple optimal tuned hyperparameter combinations (say 2) -->

Having already constructed the input variables (`rp`), we are all set for specifying the **mlr3**\index{mlr3 (package)} building blocks (task, learner, and resampling).
For specifying a spatial task, we use again the **mlr3spatiotempcv** [@schratz_mlr3spatiotempcv_2021] package (section \@ref(spatial-cv-with-mlr3)).
For specifying a spatial task, we use again the **mlr3spatiotempcv** package [@schratz_mlr3spatiotempcv_2021 & Section \@ref(spatial-cv-with-mlr3)].
Since our response (`sc`) is numeric, we use a regression\index{regression} task.

```{r 15-eco-20, eval=FALSE}
```{r 15-eco-20}
# create task
task = mlr3spatiotempcv::TaskRegrST$new(
id = "mongon", backend = dplyr::select(rp, -id, -spri), target = "sc")
```

Using an `sf` object as the backend automatically provides the geometry information needed for the spatial partitioning later on.
Additionally, we got rid of the columns `id` and `spri` since these variables should not be used as predictors in the modeling.
Next, we go on to contruct the a random forest\index{random forest} learner from the **ranger** package.
Next, we go on to construct the a random forest\index{random forest} learner from the **ranger** package.

```{r 15-eco-21, eval=FALSE}
```{r 15-eco-21}
lrn_rf = lrn("regr.ranger", predict_type = "response")
```

Expand All @@ -507,7 +520,7 @@ Hyperparameter\index{hyperparameter} combinations will be selected randomly but
<!-- (`r # ncol(rp) - 1`), -->
(4), `sample.fraction` should range between 0.2 and 0.9 and `min.node.size` should range between 1 and 10.

```{r 14-eco-23, eval=FALSE}
```{r 15-eco-22}
# specifying the search space
search_space = paradox::ps(
mtry = paradox::p_int(lower = 1, upper = ncol(task$data()) - 1),
Expand All @@ -517,13 +530,13 @@ search_space = paradox::ps(
```

Having defined the search space, we are all set for specifying our tuning via the `AutoTuner()` function.
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr)).
Since we deal with geographic data, we will again make use of spatial cross-validation to tune the hyperparameters\index{hyperparameter} (see Sections \@ref(intro-cv) and \@ref(spatial-cv-with-mlr3)).
Specifically, we will use a five-fold spatial partitioning with only one repetition (`rsmp()`).
In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination.
In each of these spatial partitions, we run 50 models (`trm()`) while using randomly selected hyperparameter configurations (`tnr()`) within predefined limits (`seach_space`) to find the optimal hyperparameter\index{hyperparameter} combination [see also Section \@ref(svm) and https://mlr3book.mlr-org.com/optimization.html#autotuner, @becker_mlr3_2022].
The performance measure is the root mean squared error (RMSE\index{RMSE}).

```{r 15-eco-22, eval=FALSE}
at = mlr3tuning::AutoTuner$new(
```{r 15-eco-23}
autotuner_rf = mlr3tuning::AutoTuner$new(
learner = lrn_rf,
# spatial partitioning
resampling = mlr3::rsmp("spcv_coords", folds = 5),
Expand All @@ -540,10 +553,10 @@ at = mlr3tuning::AutoTuner$new(

Calling the `train()`-method of the `AutoTuner`-object finally runs the hyperparameter\index{hyperparameter} tuning, and will find the optimal hyperparameter\index{hyperparameter} combination for the specified parameters.

```{r 14-eco-24, eval=FALSE}
# hyperparamter tuning
```{r 15-eco-24, eval=FALSE}
# hyperparameter tuning
set.seed(0412022)
at$train(task)
autotuner_rf$train(task)
#>...
#> INFO [11:39:31.375] [mlr3] Finished benchmark
#> INFO [11:39:31.427] [bbotk] Result of batch 50:
Expand All @@ -558,40 +571,36 @@ at$train(task)
```

```{r 15-eco-25, eval=FALSE, echo=FALSE}
saveRDS(at, "extdata/15-tune.rds")
saveRDS(autotuner_rf, "extdata/15-tune.rds")
```

```{r 15-eco-26, echo=FALSE, eval=FALSE}
tune = readRDS("extdata/15-tune.rds")
```{r 15-eco-26, echo=FALSE, eval=TRUE, cache.lazy=FALSE}
autotuner_rf = readRDS("extdata/15-tune.rds")
```

An `mtry` of 4, a `sample.fraction` of 0.9, and a `min.node.size` of 7 represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of
<!-- `r # round(tune$tuning_result$regr.rmse, 2)` -->
0.38
An `mtry` of `r autotuner_rf$tuning_result$mtry`, a `sample.fraction` of `r round(autotuner_rf$tuning_result$sample.fraction, 2)`, and a `min.node.size` of `r autotuner_rf$tuning_result$min.node.size` represent the best hyperparameter\index{hyperparameter} combination.
An RMSE\index{RMSE} of `r round(autotuner_rf$tuning_result$regr.rmse, 2)`
is relatively good when considering the range of the response variable which is
<!-- `r # round(diff(range(rp$sc)), 2)` -->
3.04
(`diff(range(rp$sc))`).
`r round(diff(range(rp$sc)), 2)` (`diff(range(rp$sc))`).

### Predictive mapping

The tuned hyperparameters\index{hyperparameter} can now be used for the prediction.
To do so, we only need to run the `predict` method of our fitted `AutoTuner` object.

```{r 15-eco-27, eval=FALSE}
```{r 15-eco-27, eval=TRUE, cache.lazy=FALSE}
# predicting using the best hyperparameter combination
at$predict(task)
autotuner_rf$predict(task)
```

The `predict` method will apply the model to all observations used in the modeling.
Given a multilayer `SpatRaster` containing rasters named as the predictors used in the modeling, `terra::predict()` will also make spatial predictions, i.e., predict to new data.

```{r 15-eco-28, eval=FALSE}
pred = terra::predict(ep, model = at, fun = predict)
```{r 15-eco-28, eval=TRUE, cache.lazy=FALSE}
pred = terra::predict(ep, model = autotuner_rf, fun = predict)
```

```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", fig.width = 10, fig.height = 10, fig.scap="Predictive mapping of the floristic gradient."}
```{r rf-pred, echo=FALSE, fig.cap="Predictive mapping of the floristic gradient clearly revealing distinct vegetation belts.", out.width="60%", fig.scap="Predictive mapping of the floristic gradient."}
# # restrict the prediction to your study area
# pred = terra::mask(pred, terra::vect(study_area)) |>
# terra::trim()
Expand Down Expand Up @@ -627,17 +636,18 @@ knitr::include_graphics("figures/15_rf_pred.png")

In case, `terra::predict()` does not support a model algorithm, you can still make the predictions manually.

```{r 15-eco-29, eval=FALSE}
```{r 15-eco-29, cache.lazy=FALSE}
newdata = as.data.frame(as.matrix(ep))
colSums(is.na(newdata)) # 0 NAs
# but assuming there were 0s results in a more generic approach
ind = rowSums(is.na(newdata)) == 0
tmp = at$predict_newdata(newdata = newdata[ind, ], task = task)
tmp = autotuner_rf$predict_newdata(newdata = newdata[ind, ], task = task)
newdata[ind, "pred"] = data.table::as.data.table(tmp)[["response"]]
pred_2 = ep$dem
# now fill the raster with the predicted values
pred_2[] = newdata$pred
identical(values(pred), values(pred_2)) # TRUE
# check if terra and our manual prediction is the same
all(values(pred - pred_2) == 0)
```

The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)).
Expand Down
18 changes: 15 additions & 3 deletions _12-ex.Rmd
Original file line number Diff line number Diff line change
@@ -1,4 +1,6 @@
```{asis, message=FALSE}
The solutions assume the following packages are attached (other packages will be attached when needed):
```

```{r 12-ex-e0, message=FALSE, warning=FALSE}
library(dplyr)
Expand All @@ -16,6 +18,7 @@ library(tmap)
```

E1. Compute the following terrain attributes from the `elev` dataset loaded with `terra::rast(system.file("raster/ta.tif", package = "spDataLarge"))$elev` with the help of R-GIS bridges (see this [Chapter](https://geocompr.robinlovelace.net/gis.html#gis)):

- Slope
- Plan curvature
- Profile curvature
Expand Down Expand Up @@ -116,9 +119,11 @@ tm_shape(hs, bbox = bbx) +
legend.title.size = 0.9)
```

E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv).
E4. Compute a 100-repeated 5-fold non-spatial cross-validation and spatial CV based on the GLM learner and compare the AUROC values from both resampling strategies with the help of boxplots (see this [Figure](https://geocompr.robinlovelace.net/spatial-cv.html#fig:boxplot-cv)).

Hint: You need to specify a non-spatial resampling strategy.
Another hint: You might want to Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking).

Another hint: You might want to solve Excercises 4 to 6 in one go with the help of `mlr3::benchmark()` and `mlr3::benchmark_grid()` (for more information, please refer to https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking).
When doing so, keep in mind that the computation can take very long, probably several days.
This, of course, depends on your system.
Computation time will be shorter the more RAM and cores you have at your disposal.
Expand All @@ -140,11 +145,14 @@ task = TaskClassifST$new(
# construct learners (for all subsequent exercises)
# GLM
lrn_glm = lrn("classif.log_reg", predict_type = "prob")
lrn_glm$fallback = lrn("classif.featureless", predict_type = "prob")
# SVM
# construct SVM learner (using ksvm function from the kernlab package)
lrn_ksvm = lrn("classif.ksvm", predict_type = "prob", kernel = "rbfdot",
type = "C-svc")
lrn_ksvm$fallback = lrn("classif.featureless", predict_type = "prob")
# specify nested resampling and adjust learner accordingly
# five spatially disjoint partitions
tune_level = rsmp("spcv_coords", folds = 5)
Expand All @@ -167,6 +175,7 @@ at_ksvm = AutoTuner$new(
# QDA
lrn_qda = lrn("classif.qda", predict_type = "prob")
lrn_qda$fallback = lrn("classif.featureless", predict_type = "prob")
# SVM without tuning hyperparameters
vals = lrn_ksvm$param_set$values
Expand Down Expand Up @@ -194,7 +203,10 @@ library(future)
future::plan(list("sequential", "multisession"),
workers = floor(availableCores() / 2))
set.seed(021522)
bmr = benchmark(grid, store_backends = FALSE)
bmr = benchmark(grid,
store_backends = FALSE,
store_models = FALSE,
encapsulate = "evaluate")
# stop parallelization
future:::ClusterRegistry("stop")
# save your result, e.g. to
Expand Down
12 changes: 8 additions & 4 deletions _15-ex.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@ E1. Run a NMDS\index{NMDS} using the percentage data of the community matrix.
Report the stress value and compare it to the stress value as retrieved from the NMDS using presence-absence data.
What might explain the observed difference?

```{r 15-ex-e1, eval=FALSE}
```{r 15-ex-e1, message=FALSE}
data("comm", package = "spDataLarge")
pa = decostand(comm, "pa")
pa = pa[rowSums(pa) != 0, ]
Expand All @@ -35,6 +35,7 @@ nmds_pa$stress
nmds_per$stress
```

```{asis, message=FALSE}
The NMDS using the presence-absence values yields a better result (`nmds_pa$stress`) than the one using percentage data (`nmds_per$stress`).
This might seem surprising at first sight.
On the other hand, the percentage matrix contains both more information and more noise.
Expand All @@ -47,6 +48,7 @@ The point here is that percentage data as specified during a field campaign migh
This again introduces noise which in turn will worsen the ordination result.
Still, it is a valuable information if one species had a higher frequency or coverage in one plot than another compared to just presence-absence data.
One compromise would be to use a categorical scale such as the Londo scale.
```

E2. Compute all the predictor rasters\index{raster} we have used in the chapter (catchment slope, catchment area), and put them into a `SpatRaster`-object.
Add `dem` and `ndvi` to it.
Expand All @@ -55,7 +57,7 @@ Finally, construct a response-predictor matrix.
The scores of the first NMDS\index{NMDS} axis (which were the result when using the presence-absence community matrix) rotated in accordance with elevation represent the response variable, and should be joined to `random_points` (use an inner join).
To complete the response-predictor matrix, extract the values of the environmental predictor raster object to `random_points`.

```{r 15-ex-e2, eval=FALSE}
```{r 15-ex-e2}
# first compute the terrain attributes we have also used in the chapter
library(dplyr)
library(terra)
Expand Down Expand Up @@ -118,7 +120,7 @@ Parallelize\index{parallelization} the tuning level (see Section \@ref(svm)).
Report the mean RMSE\index{RMSE} and use a boxplot to visualize all retrieved RMSEs.
Please not that this exercise is best solved using the mlr3 functions `benchmark_grid()` and `benchmark()` (see https://mlr3book.mlr-org.com/perf-eval-cmp.html#benchmarking for more information).

```{r 15-ex-e3, eval=FALSE}
```{r 15-ex-e3, message=FALSE}
library(dplyr)
library(future)
library(mlr3)
Expand Down Expand Up @@ -190,7 +192,7 @@ tictoc::toc()
# stop parallelization
future:::ClusterRegistry("stop")
# save your result, e.g. to
saveRDS(bmr, file = "extdata/15_bmr.rds")
# saveRDS(bmr, file = "extdata/15_bmr.rds")
# mean RMSE
bmr$aggregate(measures = msr("regr.rmse"))
Expand All @@ -213,5 +215,7 @@ ggplot(data = d, mapping = aes(x = learner_id, y = regr.rmse)) +
labs(y = "RMSE", x = "model")
```

```{asis, message=FALSE}
In fact, `lm` performs at least as good the random forest model, and thus should be preferred since it is much easier to understand and computationally much less demanding (no need for fitting hyperparameters).
But keep in mind that the used dataset is small in terms of observations and predictors and that the response-predictor relationships are also relatively linear.
```
Loading

0 comments on commit 3b7dd43

Please sign in to comment.