Last update: May 2019.
NGBoost is a Python library that uses boosting for probabilistic forecasting in classification, regression, and survival tasks. It is built on top of Jax and Scikit-Learn and is designed to be scalable and modular with respect to the choice of proper scoring rule, distribution, and base learner.
We predict a parametric conditional distribution of an outcome $y$ given covariates $x$ using a combination of base learners [1]:

$$ y \mid x \sim P_\theta(x), \qquad \theta = \theta^{(0)} - \eta \sum_{m=1}^{M} \rho^{(m)} \cdot f^{(m)}(x). $$

In the training process, we first fit a base learner to predict $\theta^{(0)}$, the parameters of the marginal distribution. Then we iteratively fit base learners $f^{(m)}$ to the gradients of the proper scoring rule with respect to $\theta$, calculating the corresponding scaling parameters $\rho^{(m)}$ via line search and taking steps of size $\eta$.
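To make the procedure concrete, the following is a simplified sketch of this boosting loop for a Normal distribution with $\theta = (\mu, \log\sigma)$ and the MLE score. It is illustrative only and not the library's implementation: it uses plain (rather than natural) gradients and a fixed scaling in place of the line search.

```python
# Simplified sketch of the boosting loop described above (not the library's code).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_ngboost_sketch(X, y, n_stages=100, eta=0.1):
    # theta^(0): parameters of the marginal Normal distribution
    theta0 = np.array([y.mean(), np.log(y.std())])
    theta = np.tile(theta0, (len(y), 1))
    learners, scalings = [], []
    for _ in range(n_stages):
        mu, sigma = theta[:, 0], np.exp(theta[:, 1])
        # gradients of the MLE score (negative log-likelihood) w.r.t. (mu, log sigma)
        grad = np.column_stack([(mu - y) / sigma ** 2,
                                1.0 - (y - mu) ** 2 / sigma ** 2])
        f = DecisionTreeRegressor(max_depth=3).fit(X, grad)   # base learner
        rho = 1.0   # fixed scaling here; the library chooses rho by line search
        theta -= eta * rho * f.predict(X)
        learners.append(f)
        scalings.append(rho)
    return theta0, learners, scalings
```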
Proper scoring rules are objective functions for forecasting that, when minimized, naturally yield calibrated predictions [2]. We provide support for the maximum likelihood score (MLE) and the continuous ranked probability score (CRPS), as well as their analogs in the classification and survival contexts. For parameters $\theta$ and an observed outcome $y$, these proper scoring rules are defined as

$$ \mathcal{L}_{\mathrm{MLE}}(\theta, y) = -\log P_\theta(y), \qquad \mathcal{L}_{\mathrm{CRPS}}(\theta, y) = \int_{-\infty}^{y} F_\theta(z)^2 \, dz + \int_{y}^{\infty} \big(1 - F_\theta(z)\big)^2 \, dz, $$

where $F_\theta$ denotes the CDF of $P_\theta$.
When the model is well-specified, both scoring rules recover the true model. See [3] for a comprehensive discussion comparing the robustness of these scoring rules to model mis-specification. The choice of proper scoring rule implies a choice of divergence between distributions [4]: the training process minimizes the implied divergence between the empirical distribution of the training data and the modeled distribution. It turns out that the implied divergences are the familiar KL divergence for MLE and the Cramér divergence for CRPS¹.
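To make the definitions concrete, the snippet below (illustrative only, not part of the library) evaluates both scores for a Normal predictive distribution, using the closed-form CRPS of the Normal [2]:

```python
# Evaluate the two scoring rules for a Normal predictive distribution N(mu, sigma^2).
import numpy as np
from scipy.stats import norm

def mle_score(mu, sigma, y):
    # negative log-likelihood of the observation y
    return -norm.logpdf(y, loc=mu, scale=sigma)

def crps_score(mu, sigma, y):
    # closed-form CRPS for the Normal distribution (Gneiting & Raftery, 2007)
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

print(mle_score(0.0, 1.0, 0.5), crps_score(0.0, 1.0, 0.5))
```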
We model the conditional distribution of the outcome as a parametric probability distribution. As a concrete example, consider heteroskedastic regression with a Normal distribution:

$$ y \mid x \sim \mathcal{N}\!\big(\mu(x), \sigma^2(x)\big), \qquad \theta = (\mu, \log\sigma). $$

Each parameter $\mu(x)$ and $\log\sigma(x)$ is learned via rounds of gradient boosting.
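For reference, the gradients of the MLE score under this parameterization, which the base learners are fit to (up to the natural gradient correction described below), are

$$ \frac{\partial}{\partial \mu}\big[-\log P_\theta(y)\big] = \frac{\mu - y}{\sigma^2}, \qquad \frac{\partial}{\partial \log\sigma}\big[-\log P_\theta(y)\big] = 1 - \frac{(y - \mu)^2}{\sigma^2}. $$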
The choice of parametric distribution implies assumptions about the noise-generating process. Even when the model is otherwise well-specified, miscalibration arises when the assumed noise-generating process does not match the true data-generating process [3]. In particular, a true noise distribution that has heavier (or lighter) tails than the assumed noise distribution will result in W-shaped (or M-shaped) probability integral transform histograms.
Any choice of base learner compatible with the Scikit-Learn API (specifically, implementing the `fit` and `predict` functions) may be used. We recommend heavily regularized decision trees or linear base learners, in the spirit of ensembling a set of weak learners. This is also motivated by the empirical success of tree-based gradient boosting methods on certain modalities (such as the tabular datasets common in Kaggle competitions).
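For example, a heavily regularized tree could be supplied as the base learner. The sketch below assumes the NGBRegressor interface with a `Base` argument; check it against the installed version:

```python
# Sketch: swapping in a shallow, regularized tree as the base learner.
from sklearn.tree import DecisionTreeRegressor
from ngboost import NGBRegressor

shallow_tree = DecisionTreeRegressor(criterion="friedman_mse", max_depth=3)
ngb = NGBRegressor(Base=shallow_tree, n_estimators=500, learning_rate=0.01)
```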
The natural gradient [5] is typically motivated as the direction of steepest descent in parameter space. By leveraging information geometry, it results in gradient descent steps that are invariant to choice of distribution parameterization.
For the MLE scoring rule, a natural gradient descent step is defined as

$$ \theta \leftarrow \theta - \eta\, \mathcal{I}(\theta)^{-1} \nabla_\theta \mathcal{L}_{\mathrm{MLE}}(\theta, y), $$

where the Fisher information matrix is defined as

$$ \mathcal{I}(\theta) = \mathbb{E}_{y \sim P_\theta}\!\left[ \nabla_\theta \log P_\theta(y)\, \nabla_\theta \log P_\theta(y)^{\top} \right]. $$
For exponential family distributions, it turns out the Fisher information matrix is equivalent to the Hessian in the natural parameter space, and so a natural gradient step is equivalent to a Newton-Raphson step. For any other choice of parameterization, however, the MLE score is non-convex and the Hessian is not positive semi-definite, so a direct Newton-Raphson step is not recommended. However, for exponential families the Fisher information matrix turns out to be equivalent to the Generalized Gauss-Newton matrix regardless of the choice of parameterization. We can therefore interpret natural gradient descent as a Newton-Raphson step that uses a positive semi-definite approximation to the Hessian [6].

For the CRPS scoring rule, a natural gradient descent step is defined analogously, with the Fisher information replaced by the Riemannian metric induced by the Cramér divergence:

$$ \theta \leftarrow \theta - \eta\, \mathcal{I}_{C}(\theta)^{-1} \nabla_\theta \mathcal{L}_{\mathrm{CRPS}}(\theta, y), \qquad \mathcal{I}_{C}(\theta) = 2 \int_{-\infty}^{\infty} \nabla_\theta F_\theta(z)\, \nabla_\theta F_\theta(z)^{\top}\, dz. $$
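As a concrete and standard illustration, for the Normal distribution in the $(\mu, \log\sigma)$ parameterization used above, the Fisher information matrix is diagonal:

$$ \mathcal{I}(\mu, \log\sigma) = \begin{pmatrix} 1/\sigma^{2} & 0 \\ 0 & 2 \end{pmatrix}, $$

so the natural gradient simply rescales the $\mu$-component of the gradient by $\sigma^{2}$ and halves the $\log\sigma$-component.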
For heteroskedastic prediction tasks in particular, the use of the natural gradient significantly improves the speed of the training process. In the example below we fit the marginal distribution of a set of i.i.d. observations $y_i \sim \mathcal{N}(\mu, \sigma^2)$, parameterizing the distribution with $\theta = (\mu, \log\sigma)$.
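A minimal sketch of this experiment (illustrative only, not the library's code), using the closed-form Fisher information given above:

```python
# Fit the marginal N(mu, sigma^2) by natural gradient descent on the MLE score,
# parameterized as theta = (mu, log sigma).
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=3.0, size=1000)   # i.i.d. observations

def grad_nll(theta, y):
    # gradient of the average negative log-likelihood w.r.t. (mu, log sigma)
    mu, log_sigma = theta
    sigma2 = np.exp(2 * log_sigma)
    return np.array([np.mean((mu - y) / sigma2),
                     np.mean(1.0 - (y - mu) ** 2 / sigma2)])

def fisher(theta):
    # Fisher information of N(mu, sigma^2) in (mu, log sigma) coordinates
    return np.diag([np.exp(-2 * theta[1]), 2.0])

theta = np.array([0.0, 0.0])
for _ in range(500):
    nat_grad = np.linalg.solve(fisher(theta), grad_nll(theta, y))
    theta -= 0.1 * nat_grad                     # natural gradient step

print(theta)   # approximately [2.0, log(3.0) ~ 1.10]
```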
Installation:
pip3 install ngboost
Below we show an example of fitting a linear model using 1-dimensional covariates.
# Todo.
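In the meantime, a minimal sketch of such a fit, assuming the NGBRegressor interface (the exact API of the version described here may differ):

```python
# Sketch: fit NGBoost on a 1-dimensional covariate with heteroskedastic noise.
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 1))                          # 1-D covariate
y = 3.0 * X[:, 0] + rng.normal(scale=0.5 + 0.5 * np.abs(X[:, 0]))  # heteroskedastic noise

ngb = NGBRegressor(Dist=Normal, n_estimators=500, learning_rate=0.01)
ngb.fit(X, y)

y_point = ngb.predict(X)     # point predictions (mean of the fitted Normal)
y_dist = ngb.pred_dist(X)    # full predictive distributions, e.g. for intervals
```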
The above examples result in the following prediction intervals.
For further details, see the `examples/` folder.
¹ While outside the scope of univariate regression tasks, we note that the multivariate generalization of the Cramér divergence is the energy distance [7,8], defined as

$$ \mathcal{E}(P, Q) = 2\, \mathbb{E}\lVert X - Y \rVert - \mathbb{E}\lVert X - X' \rVert - \mathbb{E}\lVert Y - Y' \rVert, $$

where the expectations are taken over independent draws of the random variables $X, X' \sim P$ and $Y, Y' \sim Q$. The multivariate generalization of the KL divergence is a straightforward extension of the univariate case.

[1] J. H. Friedman, Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics, 29 (2001) 1189–1232.
[2] T. Gneiting & A. E. Raftery, Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association, 102 (2007) 359–378.
[3] M. Gebetsberger, J. W. Messner, G. J. Mayr, & A. Zeileis, Estimation Methods for Nonhomogeneous Regression Models: Minimum Continuous Ranked Probability Score versus Maximum Likelihood. Monthly Weather Review, 146 (2018) 4323–4338. https://doi.org/10.1175/MWR-D-17-0364.1.
[4] A. P. Dawid, The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59 (2007) 77–93. https://doi.org/10.1007/s10463-006-0099-8.
[5] S. Amari, Natural Gradient Works Efficiently in Learning. Neural Computation, 10 (1998) 251–276.
[6] J. Martens, New insights and perspectives on the natural gradient method (2014).
[7] G. J. Székely & M. L. Rizzo, Energy statistics: A class of statistics based on distances. Journal of Statistical Planning and Inference, 143 (2013) 1249–1272. https://doi.org/10.1016/j.jspi.2013.03.018.
[8] M. G. Bellemare, I. Danihelka, W. Dabney, S. Mohamed, B. Lakshminarayanan, S. Hoyer, & R. Munos, The Cramer Distance as a Solution to Biased Wasserstein Gradients (2017).
This library is available under the MIT License.