All notable changes to `grf` will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- Minor patch release for CRAN Solaris compatibility. #1011
IMPORTANT: Some of these changes might cause small differences in results compared to previous releases, even if the same random seed is used.
- Unify the interface to ATE-type estimators: 1) `average_treatment_effect` is the new entry point for all ATE summaries, meaning `average_late` and `average_partial_effect` are removed. 2) This function now targets population-type quantities for all forests, meaning some confidence intervals may be slightly wider than before. 3) Some ad-hoc normalization schemes are removed, but can be manually specified through the `debiasing.weights` argument. #723 (A usage sketch follows this list.)
- Remove all `tune_***_forest` functions, restricting the tuning interface to the pre-existing `tune.parameters` argument in all tuning-compatible forests. #790
- Remove the optional `orthog.boosting` argument in `causal_forest`, since tailored `m(x)` estimates can be passed through the existing `Y.hat` argument. #892
- Remove support for sparse `X` in forest training (as the internal C++ implementation did not leverage sparsity in `X` beyond storage mode). To train a forest with sparse data, do `forest(as.matrix(X), Y)`. #939
- Remove `custom_forest`. As a template for getting started with a custom GRF estimator, consider using an existing simple forest, like `regression_forest`, as a scaffold. For more details, see the GRF developing document. #870
- Rename `get_sample_weights` to `get_forest_weights`. #894
- Return `quantile_forest` predictions in the `predictions` attribute of a new output list, in order to conform with the GRF convention of returning point predictions as `predict(forest)$predictions`. #822
- Change the way optional sample weights (passed through the `sample.weights` argument) interact with the GRF forest weights. When forming estimates according to equations (2) and (3) in the GRF paper, sample weights now enter through `alpha_i(x)' = alpha_i(x) * sample.weight_i`. In addition, `causal_forest` and `instrumental_forest` now take sample weights into account in the relabeling step (#752). Sample weights are also explicitly disabled for local linear forests (#841). #796
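As an illustration of the unified interface, here is a minimal sketch of estimating an ATE through `average_treatment_effect`. The simulated data and variable names are purely illustrative:

```r
library(grf)

# Simulated data: binary treatment with a constant effect of 1.
n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] + W + rnorm(n)

# average_treatment_effect is now the single entry point for ATE summaries.
cf <- causal_forest(X, Y, W)
average_treatment_effect(cf, target.sample = "all")
```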
- Add `causal_survival_forest` for estimating conditional average treatment effects with right-censored data. #660
- Add `multi_arm_causal_forest`, an extension of `causal_forest` to multiple categorical treatments `W`, and optionally multiple responses `Y`. #748 (A usage sketch follows this list.)
- Add `multi_regression_forest` for estimating several conditional mean functions mu_i(x) = E[Y_i | X = x]. #742
- Add `probability_forest` for estimating conditional class probabilities P[Y = k | X = x]. #711
- Add `get_scores`, which returns doubly robust scores for a number of estimands. #732
- Add `get_leaf_node`, a utility function which, given a GRF tree object, returns the leaf node a test sample falls into. #739
- Add a `vcov.type` standard error option to `test_calibration` and `best_linear_projection`. On large datasets with clusters, setting this option to `"HC0"` or `"HC1"` will significantly speed up the computation.
- Add optional Nelson-Aalen estimates of the survival function. #685
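A minimal sketch of `multi_arm_causal_forest`, where the treatment is passed as a factor. The three-arm setup and variable names are hypothetical:

```r
library(grf)

# Simulated data with a three-arm categorical treatment.
n <- 2000
p <- 5
X <- matrix(rnorm(n * p), n, p)
W <- as.factor(sample(c("control", "drug.A", "drug.B"), n, replace = TRUE))
Y <- X[, 1] + (W == "drug.A") + 2 * (W == "drug.B") + rnorm(n)

mc.forest <- multi_arm_causal_forest(X, Y, W)

# Predicted effects are contrasts relative to the first factor level ("control").
head(predict(mc.forest)$predictions[, , 1])
average_treatment_effect(mc.forest)
```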
- Add a docstring example to `survival_forest` on how to calculate concordance with the optional `survival` package. #956
- Fix the output name in `average_treatment_effect` when `method = "TMLE"`. #864
- Fix pointwise variance estimates in (the very unlikely) zero-variance case. #907
- Fix `survival_forest` test set predictions with sample weights. #969
- Make forest tuning respect the `seed` argument when drawing a random grid of parameter values, allowing reproducibility without an explicit `set.seed` before training. #704 (A sketch follows this list.)
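A small sketch of the reproducibility this enables, on hypothetical simulated data:

```r
library(grf)

n <- 500
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

# With a fixed seed, tuning now draws the same random parameter grid,
# so repeated runs produce identical forests.
f1 <- regression_forest(X, Y, tune.parameters = "all", seed = 42)
f2 <- regression_forest(X, Y, tune.parameters = "all", seed = 42)
all.equal(predict(f1)$predictions, predict(f2)$predictions)  # TRUE
```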
- Add Survival Forest functionality (`survival_forest`). #647 (A usage sketch follows this list.)
- Add an optional `debiasing.weights` argument to `average_partial_effect`. #637
- Add an optional `compute.oob.predictions` argument to Quantile Forest. #665
- Fix a performance regression in `DefaultPredictionCollector`. This improves prediction speed for forests such as Quantile Forest. #650
- Predict with training quantiles by default. #668
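A minimal sketch of the new survival forest on simulated right-censored data; the data-generating process and variable names are illustrative only:

```r
library(grf)

# Simulated right-censored survival data.
n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
failure.time <- rexp(n, rate = exp(-X[, 1]))
censor.time <- rexp(n, rate = 0.5)
Y <- pmin(failure.time, censor.time)          # observed time
D <- as.integer(failure.time <= censor.time)  # event indicator (1 = failure)

s.forest <- survival_forest(X, Y, D)

# Each row of $predictions is an estimated survival curve, evaluated
# at the time grid in $failure.times.
s.pred <- predict(s.forest)
dim(s.pred$predictions)
```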
IMPORTANT: These changes might cause small differences in results compared to previous releases, even if the same random seed is used.
- Performance improvement: remove an unnecessary splitting rule loop. Note: this may cause very small differences from earlier versions because it changes the order in which potential splits are evaluated. #592
- Add support for missing values in the covariates `X` with MIA ('missing incorporated in attributes') splitting. #612 (A sketch follows this list.)
- Add local linear splitting. An experimental option `enable.ll.split` fits a forest with splits based on ridge residuals, as opposed to standard CART splits. Note: local linear tuning does not take the new splits into account. #603
- Add sample-weighted splitting. Previously, if a user passed `sample.weights`, the weights were only used for prediction. Now they are used in splitting as well. Note: this will make results fitted with sample weights different from previous versions. #590
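A minimal sketch of training on covariates with missing entries; the simulated data and names are hypothetical:

```r
library(grf)

n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

# Introduce missing entries in the covariates: MIA splitting handles
# these natively, with no imputation step required.
X[sample(length(X), 200)] <- NA

r.forest <- regression_forest(X, Y)
head(predict(r.forest)$predictions)
```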
- Remove a superfluous predict call in tuning. #597
- Fix `average_partial_effect` calibration in the case of low-variation `W.hat`. #611
- Update `best_linear_projection` to handle non-binary treatment. #615
- Add an error message in case summary functions are passed a subset that refers to too few distinct units. #629
- Fix a bug where the nodes of the printed trees would be in the wrong order. #587
IMPORTANT: These changes might cause small differences in results compared to previous releases, even if the same random seed is used.
- Rename `prune.empty.leaves` to `honesty.prune.leaves`. #529
- Simplify the forest tuning API. Previously, tuning was enabled during forest training by setting the option `tune.parameters = TRUE`. All relevant parameters were tuned by default, except for those that were explicitly passed to the forest (like `min.node.size = 100`). Now the option `tune.parameters` directly takes a list of parameters to tune, for example `tune.parameters = c("min.node.size", "mtry")`, or `tune.parameters = "all"`. #534
- Change how data points are weighted in cluster-robust estimation. Previously, each cluster was given equal weight when training the forest and computing estimates. Now, each point is weighted equally regardless of its cluster size. This behavior can be controlled through a new option `equalize.cluster.weights`, which defaults to `FALSE` but can be set to `TRUE` to match the old behavior of weighting clusters equally. The old option `samples.per.cluster` has been removed. #545 (A sketch follows this list.)
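A minimal sketch of the new clustering option, on hypothetical simulated data:

```r
library(grf)

n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)
clust <- sample(1:50, n, replace = TRUE)  # 50 clusters of varying sizes

# New default: every sample carries equal weight, regardless of cluster size.
f.new <- regression_forest(X, Y, clusters = clust)

# Old behavior: every cluster carries equal weight.
f.old <- regression_forest(X, Y, clusters = clust, equalize.cluster.weights = TRUE)
```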
- Improve the performance of `get_tree`. #528
- Add support for tuning instrumental forests (currently marked 'experimental'). #547
- Introduce optimizations to tree splitting. These improvements lead to a small speed-up in forest training. #560, #561
- Add `best_linear_projection`, a doubly robust estimate of the best linear projection of the conditional average treatment effect onto a set of covariates. #574 (A sketch follows this list.)
- Speed up forest prediction by introducing additional parallelization. #566, #576
- Allow the data matrix `X` to be a data frame. #540
- When merging forests, validate that all forests were trained on the same data. #543
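A minimal sketch of `best_linear_projection`, projecting the CATE onto the first two covariates; the simulated data and names are illustrative:

```r
library(grf)

n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] + W * (1 + X[, 2]) + rnorm(n)

cf <- causal_forest(X, Y, W)

# Doubly robust linear summary of the CATE in terms of the first two covariates.
best_linear_projection(cf, X[, 1:2])
```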
- Fix a major performance issue in `get_sample_weights`. #578
IMPORTANT: These changes might cause small differences in results compared to previous releases, even if the same random seed is used.
- Ensure forest estimates are consistent across platforms. #469, #492
- The number of trees used for orthogonalization was changed from `min(500, num.trees)` to `max(50, num.trees / 4)`. #439
- Solidify the parameter tuning procedure. If the optimization procedure fails, or if the selected parameters perform worse than defaults, we now return default parameters instead. #455
- Introduce the parameters `honesty.fraction` and `prune.empty.leaves` to help mitigate the effect of honesty on small datasets, and tune over them when `tune.parameters = TRUE`. #456, #484
- Add variance estimates for local linear forests. #442
- Include information about leaf samples in plotting and printing. #460
- Add an example of saving a plot with DiagrammeRsvg. #478
- Support average effect estimates for instrumental forests (ACLATE). #490
- Performance improvements to forest training. #514
IMPORTANT: These changes might cause small differences in results compared to previous releases, even if the same random seed is used.
- Fix two bugs in the termination criterion for tree splitting:
  - Remove the purity condition on outcomes during splitting. For all tree types, we used to stop splitting if all outcomes in a leaf are the same. This behavior does not make sense for causal forests (which incorporate other observations besides the outcome), so it was removed. #362
  - Stop splitting if the objective can no longer be improved. With this change, `causal_forest` may split slightly less aggressively. #415
- In out-of-bag prediction, return the Monte Carlo error alongside the debiased error. #327
- Allow for passing a factor for the `cluster` parameter. #329
- Support taking a union of forests through the `merge_forests` method. #347 (A sketch follows this list.)
- Include a summary of the parameter tuning procedure in the forest object. #419
- Add experimental support for sample weighting to regression, causal, and instrumental forests. #376, #418
- Add an experimental new forest type, `boosted_regression_forest`, which applies boosting to regression forests. Allow boosting to be used during orthogonalization through the `orthog.boosting` parameter. #388
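A minimal sketch of merging forests, e.g. after training in separate batches; the simulated data and names are hypothetical:

```r
library(grf)

n <- 1000
p <- 5
X <- matrix(rnorm(n * p), n, p)
Y <- X[, 1] + rnorm(n)

# Train two forests on the same data, then pool their trees.
f1 <- regression_forest(X, Y, num.trees = 100)
f2 <- regression_forest(X, Y, num.trees = 100)
big.forest <- merge_forests(list(f1, f2))

# The merged forest predicts using all 200 trees.
head(predict(big.forest)$predictions)
```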
- Improve input data validation. #354, #378, #430
- Improve the `test_calibration` function by switching to one-sided p-values. #370
- For custom forests, fix a bug in OOB prediction where the train and test datasets were switched. #372
- Decrease memory usage during training and out-of-bag prediction. #408, #412
- Allow roxygen to autogenerate the `NAMESPACE` file. #423, #428
- Add support for confidence intervals in local linear regression forests.
- Allow `samples_per_cluster` to be larger than the smallest cluster size.
- Make sure average effect estimation doesn't error on data with a single feature.
- Fix a bug in local linear prediction where the penalty wasn't properly calculated.
- Fix two issues in causal forest tuning that could lead to unstable results.
- Ensure that the ATE and APE functions correctly account for cluster membership.
- Add basic support for tree plotting (through the `plot` method).
- Add the method `test_calibration`, which performs an omnibus test for the presence of heterogeneity via calibration.
- For local linear regression forests, add support for selecting the value of `ll.lambda` through cross-validation.
- Introduce a training option `honesty.fraction` that can be used to specify the fraction of data that should be used in selecting splits vs. performing estimation. Note that this parameter is only relevant when honesty is enabled (the default).
- Start a practical guide to the `grf` algorithm (https://github.com/grf-labs/grf/blob/master/REFERENCE.md).
- In `average_treatment_effect` and `average_partial_effect`, add an option `subset` to support estimating the treatment effect over a subsample of the data. (A sketch follows this list.)
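A minimal sketch of the `subset` option, estimating the ATE over only the samples with a positive first covariate; the simulated data and names are illustrative:

```r
library(grf)

n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] + W + rnorm(n)

cf <- causal_forest(X, Y, W)

# Estimate the average treatment effect over the subsample where X1 > 0.
average_treatment_effect(cf, subset = X[, 1] > 0)
```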
- Fix a bug in random sampling where features listed earlier in the data matrix were more likely to be selected for splitting.
- Make sure that the sample indices returned in `get_tree` are 1-indexed, as opposed to 0-indexed.
- Replace the `local.linear` option with `linear.correction.variables`, which allows for a subset of variables to be considered during local linear regression.
- Update the `causal_forest` interface to allow the user to specify `Y.hat` and `W.hat`. Note that these options supersede the `precompute.nuisance` parameter, which has been removed. To recreate the behavior of `precompute.nuisance = TRUE`, `NULL` can be provided for `Y.hat` and `W.hat`; for `precompute.nuisance = FALSE`, `Y.hat` and `W.hat` should be 0. (A sketch follows below.)
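A minimal sketch of supplying custom nuisance estimates, here fitted with separate regression forests; the simulated data and names are hypothetical:

```r
library(grf)

n <- 2000
p <- 10
X <- matrix(rnorm(n * p), n, p)
W <- rbinom(n, 1, 0.5)
Y <- X[, 1] + W + rnorm(n)

# Fit the nuisance components E[Y | X] and E[W | X] separately...
Y.hat <- predict(regression_forest(X, Y))$predictions
W.hat <- predict(regression_forest(X, W))$predictions

# ...and supply them directly, instead of relying on internal orthogonalization.
cf <- causal_forest(X, Y, W, Y.hat = Y.hat, W.hat = W.hat)
```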
- Add a causal forest example with variable selection and parameter tuning.
- Adjust the defaults for the causal forest tuning algorithm.
- Prevent tuning down to a `min.node.size` of 0.
- Debiased error criterion for measuring the out-of-bag accuracy of a forest using only a few trees.
- Automated tuning via cross-validation for regression and causal forests.
- Estimation of average partial effects with a continuous treatment.
- Overlap-weighted average treatment effects.
- Cluster-robust standard errors for regression and causal forests, and average effect estimates (contributed by @lminer).
- Locally linear prediction in regression forests (contributed by @rinafriedberg).
- Regularize splits in causal/instrumental forests via a variance penalty.
- Avoid causal forest leaves with all treated or all control samples (controlled via `stabilize.splits = TRUE`).
- Store in-bag rather than out-of-bag samples to save memory.
- Only support sampling with replacement (as some features are ambiguously defined with bootstrapping).
- Fix a bug in IV confidence interval construction (#209).
- Create a simple method for variable importance based on split frequency and depth.
- Add support for sparse data matrices of type 'dgCMatrix'.
- Use RcppEigen for the R package, as opposed to the bundled Eigen source.
- Fix a few places where we were still using the old default for `mtry`. This issue was causing poor performance for even moderately large numbers of features.
- Update the default for `mtry` to `sqrt(p) + 20`.
- Fix an issue where `split_frequencies` fails when `p = 1`.
- Use a Solaris-compatible version of `std::sqrt`.
- Fix an out-of-bounds error when there are fewer trees than threads.
- Fix a bug in the `get_tree` function where the same tree was always returned.
- Add an experimental regularized version of the regression splitting rule.
- Several bugfixes for CRAN compatibility.
- Add a `regression.splits` option to quantile forests to allow emulating the approach in Meinshausen (2006).
First official (beta) release. The package currently supports:
- standard regression forests
- causal, instrumental, and quantile forests
- confidence intervals for causal, instrumental, and regression forests
- training 'honest' versions of the above forests