Some implementations of data mining algorithms in Python.
These are not "inventions", but merely implementations of algorithms from the literature that either were not available in Python, or for which I needed my own implementation so that I could build upon them.
The repository is currently organized in families:
- ensemble
- neuralnet
- svm
- quantile.
There are also utilities for:
- preprocessing
- timeseries
- scoring metrics.
Notice the structure is organized by family, not by function such as classification or ranking. Inside neuralnet, for example, you can find implementations for the different functions. The exception is quantile regression, which has its own directory.
A good number of them were developed while writing:
- R. Cruz, K. Fernandes, J. S. Cardoso, and J. F. P. Costa. Tackling Class Imbalance with Ranking. In International Joint Conference on Neural Networks (IJCNN). IEEE, 2016. They were written under the supervision of Kelwin Fernandes and Jaime S. Cardoso.
For preprocessing, I have:
- smote: SMOTE is a well-known oversampling technique that generates new synthetic samples when you have too few observations of one class; I have implemented SMOTE and the MSMOTE variation (see the sketch after this list)
- metacost: a clever method by Pedro Domingos that adds cost support to any classifier by relabeling the training classes
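For illustration, here is a minimal sketch of the SMOTE idea, not the repository's own code; the function name and parameters are my own.

```python
# Minimal SMOTE sketch (illustrative only, not the repository's implementation):
# for each synthetic sample, pick a minority observation, pick one of its
# k nearest minority neighbours, and interpolate a new point between them.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sketch(X_minority, n_synthetic, k=5, random_state=0):
    rng = np.random.default_rng(random_state)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)
    idx = idx[:, 1:]  # drop the first neighbour (the point itself)
    synthetic = np.empty((n_synthetic, X_minority.shape[1]))
    for s in range(n_synthetic):
        i = rng.integers(len(X_minority))  # a random minority sample
        j = rng.choice(idx[i])             # one of its nearest neighbours
        gap = rng.random()                 # interpolation factor in [0, 1)
        synthetic[s] = X_minority[i] + gap * (X_minority[j] - X_minority[i])
    return synthetic
```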
I work mostly on classification, but most of these could be adapted for regression problems as well.
For classification, I have:
- bagging: a random forest implementation only
- boosting: AdaBoost and gradient boosting implementations (with a couple of different loss functions for the latter)
- extreme-learning: an extreme learning machine (ELM) model
- multiclass: one-vs-all and multiordinal ensembles, which turn binary classifiers into multiclass models (a one-vs-all sketch follows this list)
- neuralnet: a simple neural network implemented in pure Python and in C++ with Python bindings, with both batch and online training
- svm: dual and primal implementations of SVM.
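To illustrate the one-vs-all idea, here is a minimal sketch; the class name and API are my own and may differ from the repository's.

```python
# Hypothetical one-vs-all sketch: train one binary classifier per class and
# predict the class whose classifier gives the highest score.
import numpy as np
from sklearn.base import clone

class OneVsAllSketch:
    def __init__(self, binary_estimator):
        self.binary_estimator = binary_estimator

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # one "class k vs the rest" binary problem per class
        self.estimators_ = [
            clone(self.binary_estimator).fit(X, (y == k).astype(int))
            for k in self.classes_
        ]
        return self

    def predict(self, X):
        # decision_function gives a real-valued score per binary model
        scores = np.column_stack(
            [est.decision_function(X) for est in self.estimators_]
        )
        return self.classes_[np.argmax(scores, axis=1)]

# usage, e.g.: OneVsAllSketch(LinearSVC()).fit(X_train, y_train).predict(X_test)
```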
Ranking models are used to produce a ranked list, for instance in search engines.
The models I have implemented are known as "pairwise scoring rankers": they are trained on pairs, but can produce a ranking score for each individual observation. This score is only meaningful when compared with the score of another observation (the sketch after the list below illustrates the pairwise reduction).
- GBRank: adaptation of gradient boosting for ranking
- RankBoost: adaptation of AdaBoost for ranking
- RankNet: adaptation of a neural network for ranking (I also have a C++ implementation in the classification folder)
- RankSVM: adaptation of SVM with a linear kernel for ranking
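To make the pairwise idea concrete, here is a hedged sketch of the classic reduction behind a linear RankSVM; the function name and details are my own, and the repository's implementation may differ.

```python
# Pairwise-reduction sketch: build pairwise differences x_i - x_j with label
# sign(y_i - y_j), fit a linear classifier on them, and use its weight vector
# as a per-observation scoring function.
import numpy as np
from sklearn.svm import LinearSVC

def fit_pairwise_ranker(X, y):
    Xd, yd = [], []
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:                 # only pairs with a clear order
                Xd.append(X[i] - X[j])
                yd.append(1)
                Xd.append(X[j] - X[i])      # add the mirrored pair as well
                yd.append(-1)
    svm = LinearSVC().fit(np.asarray(Xd), np.asarray(yd))
    return svm.coef_.ravel()                # weight vector w

# scores = X_test @ w; only meaningful relative to each other
```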
Quantile models, instead of predicting the average expected value, predict the expected value for a given quantile: for instance, what the median prediction is, or what the lowest-10% value you can expect, et cetera.
I have here both classification and regression models:
- QBag: a simple bagging adaptation for quantiles (sketched after this list)
- QBC and QBR: gradient boosting adaptations for quantiles
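As a rough illustration of the bagging-for-quantiles idea, here is my own guess at the mechanism, not QBag's actual code: fit an ordinary bagging ensemble and take an empirical quantile of the per-member predictions instead of their mean.

```python
# Rough sketch (assumption: the quantile is taken over per-member predictions).
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def predict_quantile(X_train, y_train, X_test, q=0.1, n_estimators=100):
    bag = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n_estimators)
    bag.fit(X_train, y_train)
    # one prediction per ensemble member, then the q-quantile across members
    per_member = np.stack([est.predict(X_test) for est in bag.estimators_])
    return np.quantile(per_member, q, axis=0)
```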
And that's it!
For timeseries, these are some simple, but cumbersome, methods that are sorely missing from Python packages and are always a pain to implement.
- GrowingWindow and SlidingWindow: timeseries cross-validation methods (a growing-window sketch follows this list)
- delay: a function that adds a delay (lag) to a time-indexed variable, for use in an autoregressive model
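Here is a minimal sketch of both utilities, with my own function names and signatures; the repository's API may differ.

```python
import numpy as np

def growing_window_splits(n_samples, n_folds=5):
    # each fold trains on all observations up to a point and tests on the
    # block that immediately follows, so the training window keeps growing
    fold_size = n_samples // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_idx = np.arange(0, k * fold_size)
        test_idx = np.arange(k * fold_size, (k + 1) * fold_size)
        yield train_idx, test_idx

def delay(y, lag=1):
    # shift a time-indexed series by `lag` steps (lag >= 1); the first `lag`
    # positions, which have no past value, become NaN
    y = np.asarray(y, dtype=float)
    return np.concatenate([np.full(lag, np.nan), y[:-lag]])
```

Each fold's test block comes strictly after its training window, so no information leaks from the future into training.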
Scoring metrics missing from sklearn:
- pinball: MAE can be used as a score for the median; this is its generalization to other quantiles (sketched below, together with MMAE/AMAE)
- MMAE and AMAE: scoring functions for imbalanced ordinal contexts (the maximum and average MAE across classes, respectively, independent of class frequency)
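A hedged sketch of these metrics as I understand them; this is my own formulation, not necessarily the repository's exact code.

```python
import numpy as np

def pinball(y_true, y_pred, q=0.5):
    # pinball (quantile) loss; with q=0.5 it reduces to half the MAE
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return np.mean(np.maximum(q * diff, (q - 1) * diff))

def amae_mmae(y_true, y_pred):
    # MAE computed per (numeric, ordinal) class, then averaged (AMAE) or
    # maximized (MMAE), so each class weighs the same regardless of frequency
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(np.abs(y_true[y_true == k] - y_pred[y_true == k]))
                 for k in np.unique(y_true)]
    return np.mean(per_class), np.max(per_class)
```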
I meant to have some test files to unit-test the various algorithms, but I will probably never get around to it. :) Please let me know if you use any of these, and whether you ran into problems.
(C) 2016 Ricardo Cruz under the GPLv3