This repository contains the code and data for the paper "Random forests with spatial proxies for environmental modelling: opportunities and pitfalls" by Carles Milà, Marvin Ludwig, Edzer Pebesma, Cathryn Tonne, and Hanna Meyer. The manuscript has been published in the journal Geoscientific Model Development. The content of this repository is organised according to the structure of the article, i.e. a simulation study and two case studies.
The code to run the simulation study is organized as follows:
- sim_analysis.R: Script that contains the code to run the simulation study. It uses functions defined in sim_functions.R and sim_utils.R.
- fig_simulations.R: Script that contains the code to generate the figures with the results of the simulation study.
- fig_others.R: Script that contains the code to generate the examples of random fields with different autocorrelation ranges, as well as the station maps for the case studies.
The files containing the results of the simulation study can be found here.
The code to run the case studies is organized as follows:
- case_prepare.R: Script that runs the pre-processing steps necessary to clean the station data for the air temperature and pollution case studies and generate the stack of predictors.
- case_RFsp.R: Script that generates the distance-to-sampling points spatial proxies necessary to run RFsp for air temperature and pollution.
- case_extract.R: Script that stacks the generated predictors and the spatial proxies and extracts their values at the station locations for modelling.
- case_modtemp.R: Script that runs the air temperature analysis. It uses functions defined in case_functions.R.
- case_modpm25.R: Script that runs the air pollution analysis. It uses functions defined in case_functions.R.
- figtab_temp.R: Script that generates the figures and tables related to the air temperature analysis.
- figtab_pm25.R: Script that generates the figures and tables related to the air pollution analysis.
- fig_others.R: Script that contains the code to generate the study area figure with the station locations (Figure 2).
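For readers unfamiliar with the distance-to-sampling-points proxies mentioned above, the idea can be sketched in a few lines of base R. This is only a minimal illustration, not the code in case_RFsp.R: the station coordinates, the grid, and the names `stations`, `grid`, `dist_proxies` and `proxy_stack` are all hypothetical, and plain Euclidean distance on projected coordinates is assumed.

```r
# Hypothetical station coordinates in projected units (not the real station data)
stations <- data.frame(x = c(0, 3, 10), y = c(0, 4, 10))

# Hypothetical prediction grid covering the stations
grid <- expand.grid(x = seq(0, 10, by = 5), y = seq(0, 10, by = 5))

# One proxy column per station: Euclidean distance from every grid cell to that station
dist_proxies <- sapply(seq_len(nrow(stations)), function(i) {
  sqrt((grid$x - stations$x[i])^2 + (grid$y - stations$y[i])^2)
})
colnames(dist_proxies) <- paste0("dist_", seq_len(nrow(stations)))

# Bind the proxies to the grid so they can be used alongside other predictors
proxy_stack <- cbind(grid, dist_proxies)
```

Each sampling point contributes one layer, so the number of proxy predictors grows with the number of stations. This is consistent with the size of the distance-to-sample stack noted further below.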
The data for the case studies is available here and includes:
- Station data for temperature and air pollution.
- Predictor stack: predictors.tif.
- Study area boundaries.
Due to file size constraints, we could not upload the original data and intermediate files after preprocessing. Nonetheless, the included data listed above should be enough to run the analysis and to generate the tables and figures of the results. Similarly, we could not upload the stack of distance-to-sample fields required to run RFsp; these can be generated by running the corresponding script.
The files containing the results of the case studies can be found here.
Most of the scripts included in the repository can be run locally. However, we ran several steps on an HPC cluster because of long runtimes and/or data size constraints. These scripts are labelled accordingly in the script title and in their first lines.