Density estimation using Gaussian mixtures in the presence of noisy, heterogeneous and incomplete data
Extreme-deconvolution (XD) is a general algorithm to infer a d-dimensional distribution function from a set of heterogeneous, noisy observations or samples. It is fast, flexible, and treats the data's individual uncertainties properly, to get the best description possible of the underlying distribution. It performs well over the full range of density estimation, from small data sets with only tens of samples per dimension, to large data sets with millions of data points.
The extreme-deconvolution algorithm is available here as a dynamic C-library that your programs can link to, or through Python, R, or IDL wrappers that allow you to call the fast underlying C-code in your high-level applications with minimal overhead.
- 2016/11/04: Check out the XDGMM wrapper class, which allows fitting, model selection, resampling, and conditioning of the Gaussian mixture model. Also check out the astroML alternative implementation of the XD algorithm (Python only).
- 2015/08/17: An interface in R was added by @gaow. See the updated "Installation notes" and "Usage notes" below.
- 2014/01/17: Code was migrated to github; previous news refers to the googlecode version.
- 2011/06/12: Version 1.3 released. This version incorporates OpenMP multiprocessing support and other improvements since version 1.2 (see other _News_ below).
- 2011/01/29: The code now has multiprocessing support using OpenMP and is automatically compiled with OpenMP. Use the environment variable OMP_NUM_THREADS to control the number of threads used. Only available in the trunk at this time. If you want to compile the code without OpenMP, remove the -fopenmp references in the Makefile and copy the src/_omp.h file to src/omp.h.
- 2010/06/11: Some addons were added: Use these small IDL programs to use the output from the core extreme-deconvolution code (e.g., calculate uncertainties on the best-fit parameters, calculate membership probabilities, perform a KS test with these membership probabilities). Download them here or check-out the code.
- 2010/04/01: The "weight" option was added to give the data points different weights in the calculation of the log likelihood (log likelihood = weight * log p(data|model) ). Only available in the trunk at this time.
- 2010/03/05: Version 1.2 released. The "projection" matrices are now optional and the data-uncertainties can be diagonal (i.e., uncorrelated).
GSL: The GNU Scientific Library.
The latest release of the code is extreme-deconvolution-1.4. Support for R is currently only available in the GitHub repository and requires you to get the latest version.
Get the latest version by checking out the git repository:
git clone https://github.com/jobovy/extreme-deconvolution.git
(to only install the python library, see the INSTALL_PYTHON file)
If you download the last version from the download tab, do:
tar xvzf extreme-deconvolution-1.3.tar.gz cd extreme-deconvolution-1.3/ make
To install the library do:
sudo make install
or:
make install INSTALL_DIR=/path/to/install/dir/
To install the IDL wrapper do:
make idlwrapper
Add INSTALL_DIR=/path/to/install/dir/ if you used this to install the library
To install the python wrapper do:
make pywrapper
Add INSTALL_DIR=/path/to/install/dir/ if you used this to install the library
To install the R package do:
make make rpackage R CMD INSTALL ExtremeDeconvolution_1.3.tar.gz
Fix the version number as needed. Note that options for compiling
packages in R are specified through the Makevars
file, which
should typically be located at ~/.R/Makevars
. For example, if you
need to override the default C compiler to gcc-4.9
, you would add
line CC=gcc-4.9
to the Makevars file before building the
package. (You also need to make sure that the proper CC
is set in
the main Makefile
as well.) For more details on customzing R package
installation, see here.
Alternatively, you may find that it is more convenient to use the
install.packages() function in R to install the package. In that
case, replace the second step (R CMD INSTALL ...
) with the
following call within your R environment
install.packages(pkgs = "ExtremeDeconvolution_1.3.tar.gz",repos = NULL)
This assumes that the R working directory is the same as the root of this git repository.
To test whether the code and the python wrapper is working do:
make testpy
To test whether the code and the IDL wrapper is working do (requires IDL and the IDL-wrapper to be installed):
make testidl
Clean up intermediate files:
make clean
Examples of use of the code are in the IDL example code in examples/fit_tf.pro and in the python doctest in py/extreme_deconvolution.py.
In python you would typically do something like:
from extreme_deconvolution import extreme_deconvolution #Set up your arrays: ydata has the data, ycovar the uncertainty covariances #initamp, initmean, and initcovar are initial guesses #get help on their shapes and other options using ?extreme_deconvolution #Run the code extreme_deconvolution(ydata,ycovar,initamp,initmean,initcovar) #initamp, initmean, and initcovar are now updated to their best fit values
In IDL this becomes:
;;Set up arrays and the number of Gaussians K=2 ;;K Gaussians ;;Run the code projected_gauss_mixtures_c, K, ydata, ycovar, initamp, initmean, initcovar, /quiet ;;initamp, initmean, and initcovar are now updated to their best fit values
In R:
library(ExtremeDeconvolution) ?extreme_deconvolution
`make` returns "file was built for unsupported file format which is not the architecture being linked (i386)" errors (or x86_64)
XD is trying to compile as a 32 (or 64) bit library while your GSL or OpenMP libraries were compiled as 64 (or 32) bit libraries. You can force XD to compile as a particular architecture by adding the ARCH option to make, e.g.:
make ARCH=x86_64
I do not have/want OpenMP
You can disable OpenMP support by removing the -fopenmp and -lgomp references in the Makefile. If you are installing the R package, also remove the -fopenmp and -lgomp references in r/src/Makefile.
Problems with clang
On Macs with OS X >= 10.9, gcc is no longer the default compiler, which is instead clang (although confusingly, gcc points to clang!). Clang does not have support for OpenMP (yet) and the code will therefore only run on a single CPU. To use the OpenMP parallelized version of the code, install gcc yourself and make sure that the Makefile is using it (using the CC variable). One recommended option on a Mac is to install gcc with openmp using Homebrew; e.g., after installing Homebrew on your Mac, run
brew install [email protected]
, then setCC=gcc-4.9
in the Makefile for this repository.
Thanks to Gao Wang and Peter Carbonetto for the R interface and Daniela Carollo, Joe Hennawi, Sergey Koposov, and Leonidas Moustakas for bug reports and fixes.
The algorithm that the code implements is described in the paper Extreme deconvolution: inferring complete distribution functions from noisy, heterogeneous and incomplete observations; a copy of the latest draft of this paper is included in the "doc/" directory of the repository or source archive. If you use the code, please cite this paper, e.g.:
Extreme deconvolution: inferring complete distribution functions from noisy, heterogeneous and incomplete observations Jo Bovy, David W. Hogg, & Sam T. Roweis, Ann. Appl. Stat. 5, 2B, 1657 (2011)
The velocity distribution of nearby stars (paper):
The metallicity distribution of nearby stars in the Milky Way disk as a mixture of a thin and thick disk (from this paper):
Quasar colors as a function of redshift (from this paper):
- The Velocity Distribution of Nearby Stars from Hipparcos Data. I. The Significance of the Moving Groups, Bovy, Jo, Hogg, David W., & Roweis, Sam T., 2009, Astrophys. J. 700, 1794 2009ApJ...700.1794B
- The Velocity Distribution of Nearby Stars from Hipparcos data II. The Nature of the Low-velocity Moving Groups, Bovy, Jo & Hogg, David W., 2010, Astrophys. J. 717, 617 2010ApJ...717..617B
- Think Outside the Color Box: Probabilistic Target Selection and the SDSS-XDQSO Quasar Targeting Catalog, Bovy, Jo, et al., 2011, Astrophys. J. 729, 141 2011ApJ...729..141B
- Carbon-Enhanced Metal-Poor Stars in the Inner and Outer Halo Components of the Milky Way, Carollo, Daniela, et al., 2012, Astrophys. J. 744, 195 2012ApJ...744..195C
- Photometric Redshifts and Quasar Probabilities from a Single, Data-driven Generative Model, Bovy, Jo, et al., 2012, Astrophys. J. 749, 41 2012ApJ...749...41B
- The Stellar Metallicity Distribution Function of the Galactic Halo from SDSS Photometry, An, Deokkeun, et al., 2013, Astrophys. J. 763, 65 2013ApJ...763...65A
- Sagittarius Stream Three-dimensional Kinematics from Sloan Digital Sky Survey Stripe 82, Koposov, Sergey, Belokurov, Vasily, & Wyn Evans, N., 2013, Astrophys. J. 766, 79 2013ApJ...766...79K
- Your paper here? email