The `h2oEnsemble` R package provides functionality to create ensembles from the base learning algorithms that are accessible via the `h2o` R package. This type of ensemble learning is called "super learning," "stacked regression," or "stacking." The Super Learner algorithm learns the optimal combination of the base learner fits. The 2007 article "Super Learner" showed that the super learner ensemble represents an asymptotically optimal system for learning.
The `h2oEnsemble` package can be installed using either of the following methods.
- Clone the main h2o repository and install the package:

```
git clone https://github.com/h2oai/h2o.git
R CMD INSTALL h2o/R/ensemble/h2oEnsemble-package
```
- Install in R using `devtools::install_github`:

```r
library(devtools)
install_github("h2oai/h2o-2/R/ensemble/h2oEnsemble-package")
```
- An example of how to train and test an ensemble is in the `h2o.ensemble` function documentation in the `h2oEnsemble` package.
- The ensemble is defined by its set of base learning algorithms and the metalearning algorithm. Algorithm wrapper functions are used to specify these algorithms; a minimal sketch follows below.
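For illustration, here is a minimal sketch of specifying a base learner library and a metalearner by wrapper name and fitting the ensemble. The frame `train` and the column names are hypothetical placeholders, and argument names (e.g. `data`, `cvControl`) follow the `h2o.ensemble` documentation and may differ across package versions:

```r
library(h2oEnsemble)

# Base learner library: names of wrapper functions from wrappers.R.
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper",
             "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")

# Metalearner: any learner wrapper, e.g. SL.nnls from SuperLearner_wrappers.R.
metalearner <- "SL.nnls"

# Hypothetical column names; `train` is assumed to be a data set
# already loaded into the H2O cluster.
y <- "response"
x <- c("x1", "x2", "x3")

fit <- h2o.ensemble(x = x, y = y, data = train,
                    family = "binomial",
                    learner = learner,
                    metalearner = metalearner,
                    cvControl = list(V = 5))
```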
- The ensemble fit is an object of class `"h2o.ensemble"`; however, this is just an R list (see the inspection sketch below).
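Since the fit is a plain R list, standard inspection tools apply (continuing the sketch above; the exact component names are version-dependent):

```r
class(fit)               # "h2o.ensemble"
str(fit, max.level = 1)  # top-level components: base learner fits, metalearner fit, etc.
```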
- See the `example_twoClass_higgs.R` script or the `h2o.ensemble` R documentation for an example.
- The ensemble works by using wrapper functions (located in the `wrappers.R` file in the package). These wrapper functions are used to specify the base learner and metalearner algorithms for the ensemble.
- This methodology of using wrapper functions is modeled after the SuperLearner and subsemble ensemble learning packages. The use of wrapper functions makes the ensemble code cleaner by providing a unified interface.
- Often it is a good idea to include variants of one algorithm/function by specifying different tuning parameters for different base learners. There is an example of how to create new variants of the wrapper functions in the `create_h2o_wrappers.R` script; a sketch is also shown below.
- The wrapper functions must have unique names.
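For example, here is a sketch of a GBM wrapper variant with different tuning parameters. The variant name is hypothetical, and the tuning-parameter names may differ between h2o versions:

```r
# A variant of the GBM wrapper with more, deeper trees; note the unique name.
h2o.gbm.deep <- function(..., n.trees = 500, interaction.depth = 5) {
  h2o.gbm.wrapper(..., n.trees = n.trees, interaction.depth = interaction.depth)
}

# Include both the default wrapper and the variant in the base learner library.
learner <- c("h2o.gbm.wrapper", "h2o.gbm.deep")
```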
- Historically, techniques like non-negative least squares (NNLS) have been used to find the optimal weighted combination of the base learners; however, any supervised learning algorithm can be used as a metalearner.
- If your base learning library includes several versions of a particular algorithm (with different tuning parameters), then you should consider using a metalearner that can perform well in the presence of correlated predictors, such as ridge regression (see the sketch below).
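One way to do this, sketched below, is a glmnet-based wrapper with `alpha = 0` (a pure L2 penalty, i.e. ridge regression), built on the `SL.glmnet` function from the SuperLearner package. The wrapper name here is hypothetical:

```r
library(SuperLearner)

# Ridge-regression metalearner: SL.glmnet with alpha = 0 (L2 penalty only).
SL.glmnet.ridge <- function(..., alpha = 0) SL.glmnet(..., alpha = alpha)

metalearner <- "SL.glmnet.ridge"
```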
- Any learner wrapper can be specified as the metalearner, for example the `SL.nnls` function (in the `SuperLearner_wrappers.R` script) or the `SL.glm` function (included in the SuperLearner R package).
- The ensembles that use an h2o-based metalearner have suboptimal performance, which will be addressed in a future release.
- This package is incompatible with R 3.0.0-3.1.0 due to a parser bug in R; upgrade to R 3.1.1 or greater to resolve the issue. The package may work on versions earlier than 3.0.0, but it has not been tested on them.
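You can verify the running R version before installing:

```r
getRversion() >= "3.1.1"  # should be TRUE before installing this package
```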
- Sometimes while executing `h2o.ensemble`, the code hangs due to a communication issue with H2O. This seems to happen mostly when using the R interpreter interactively, so if you run into this issue, try running your script using `R CMD BATCH` or `Rscript`, or simply kill R and restart the interpreter. The output looks something like this:

```
GET /Cloud.json HTTP/1.1
Host: localhost:54321
Accept: */*
```
- The `h2o.*` algorithms are currently performing suboptimally as metalearners, and work is being done to address this issue. As a result, we currently support the `SL.*`-based metalearners from the SuperLearner R package. This means that a matrix of size `n x L`, where `L` is the number of base learners and `n` is the number of training observations, must be pulled into R memory for the metalearning step (a rough size estimate is sketched below). This is a memory bottleneck that will be addressed in a future release; the plan is to use H2O/Java-based metalearners to avoid having to pull data back into R.
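As a rough illustration of the memory cost, the observation and learner counts below are made up:

```r
n <- 1e6  # number of training observations (assumed)
L <- 4    # number of base learners (assumed)

# The level-one matrix holds n x L doubles at 8 bytes each:
n * L * 8 / 1024^2  # ~30.5 MB pulled into R memory
```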
- When using an `h2o.deeplearning` model as a base learner, it is not possible to reproduce ensemble model results exactly, even when the `seed` argument of `h2o.ensemble` is set, if your H2O cluster uses multiple cores. This is because `h2o.deeplearning` results are only reproducible when trained on a single core (see the sketch below).
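If exact reproducibility matters more than speed, one option (sketched below) is to restrict the H2O cluster to a single core at startup; note that the `nthreads` argument to `h2o.init` is only available in newer h2o releases, so this is an assumption about your installed version:

```r
library(h2o)

# Start (or restart) H2O on a single core so that h2o.deeplearning
# base learners, and therefore the ensemble, are reproducible.
h2o.init(nthreads = 1)
```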
- The multicore and snow cluster functionality is not working (see the `parallel` option of the `h2o.ensemble` function). There is a conflict between the R parallel functionality and the H2O parallel functionality. The `h2o.*` base learning algorithms will use all available cores, so even when the `h2o.ensemble` function is executed with the default `parallel = "seq"` option, you will be training in parallel. The `parallel` argument was intended to parallelize the cross-validation and base learning steps, but this functionality either needs to be re-architected to work in concert with H2O parallelism or removed in a future release.
- Currently, the `h2o.ensemble` function outputs a list object that makes up the ensemble "model". This R object can be serialized to disk using the base R `save` function. However, if you save the ensemble model to disk and later use it to generate predictions on a test set with a new H2O cluster instance (with a different cluster IP address), this will not work. It can be fixed by updating the cluster IP address in the saved object with the new one. The model saving process will probably be modified in the future to serialize each of the individual H2O base models using the `h2o::saveModel` function, so that the saved H2O base models are accessible individually. Currently, the ensemble fit is stored as a single R list object which contains all the base learner fits, the metalearner fit, and a few other pieces of data; a sketch of the current save/load workflow follows below.
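Here is a sketch of the current save/load workflow; the internal structure of the fit object (and therefore how to patch the stored cluster address) is version-dependent:

```r
# Serialize the ensemble "model" (an R list) to disk.
save(fit, file = "h2o_ensemble_fit.rda")

# In a later R session, possibly against a new H2O cluster:
load("h2o_ensemble_fit.rda")
# If the new cluster has a different IP address/port, the connection
# information stored inside `fit` must be updated before predicting.
```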
Benchmarking code is available here: https://github.com/ledell/h2oEnsemble-benchmarks