Java library and command-line application for converting Scikit-Learn pipelines to PMML.
- Supported Estimator and Transformer types:
- Clustering:
- Composite Estimators:
- Matrix Decomposition:
- Discriminant Analysis:
- Dummies:
- Ensemble Methods:
ensemble.AdaBoostRegressor
ensemble.BaggingClassifier
ensemble.BaggingRegressor
ensemble.ExtraTreesClassifier
ensemble.ExtraTreesRegressor
ensemble.GradientBoostingClassifier
ensemble.GradientBoostingRegressor
ensemble.HistGradientBoostingClassifier
ensemble.HistGradientBoostingRegressor
ensemble.IsolationForest
ensemble.RandomForestClassifier
ensemble.RandomForestRegressor
ensemble.StackingClassifier
ensemble.StackingRegressor
ensemble.VotingClassifier
ensemble.VotingRegressor
- Feature Extraction:
- Feature Selection:
feature_selection.GenericUnivariateSelect (only via sklearn2pmml.SelectorProxy)
feature_selection.RFE (only via sklearn2pmml.SelectorProxy)
feature_selection.RFECV (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFdr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFpr (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectFromModel (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectFwe (only via sklearn2pmml.SelectorProxy)
feature_selection.SelectKBest (either directly or via sklearn2pmml.SelectorProxy)
feature_selection.SelectPercentile (only via sklearn2pmml.SelectorProxy)
feature_selection.VarianceThreshold (only via sklearn2pmml.SelectorProxy)
- Impute:
- Isotonic regression:
- Generalized Linear Models:
linear_model.ARDRegression
linear_model.BayesianRidge
linear_model.ElasticNet
linear_model.ElasticNetCV
linear_model.GammaRegressor
linear_model.HuberRegressor
linear_model.Lars
linear_model.LarsCV
linear_model.Lasso
linear_model.LassoCV
linear_model.LassoLars
linear_model.LassoLarsCV
linear_model.LinearRegression
linear_model.LogisticRegression
linear_model.LogisticRegressionCV
linear_model.OrthogonalMatchingPursuit
linear_model.OrthogonalMatchingPursuitCV
linear_model.PoissonRegressor
linear_model.Ridge
linear_model.RidgeCV
linear_model.RidgeClassifier
linear_model.RidgeClassifierCV
linear_model.SGDClassifier
linear_model.SGDRegressor
linear_model.TheilSenRegressor
- Model Selection:
- Multiclass classification:
- Naive Bayes:
- Nearest Neighbors:
- Pipelines:
- Neural network models:
- Preprocessing and Normalization:
preprocessing.Binarizer
preprocessing.FunctionTransformer
preprocessing.Imputer
preprocessing.LabelBinarizer
preprocessing.LabelEncoder
preprocessing.MaxAbsScaler
preprocessing.MinMaxScaler
preprocessing.OneHotEncoder
preprocessing.OrdinalEncoder
preprocessing.PolynomialFeatures
preprocessing.RobustScaler
preprocessing.StandardScaler
- Support Vector Machines:
- Decision Trees:
- Supported third-party Estimator and Transformer types:
- Category Encoders:
- H2O.ai:
- Imbalanced-Learn (imblearn):
imblearn.combine.SMOTEENN
imblearn.combine.SMOTETomek
imblearn.ensemble.BalancedBaggingClassifier
imblearn.ensemble.BalancedRandomForestClassifier
imblearn.over_sampling.ADASYN
imblearn.over_sampling.BorderlineSMOTE
imblearn.over_sampling.KMeansSMOTE
imblearn.over_sampling.RandomOverSampler
imblearn.over_sampling.SMOTE
imblearn.over_sampling.SMOTENC
imblearn.over_sampling.SVMSMOTE
imblearn.pipeline.Pipeline
imblearn.under_sampling.AllKNN
imblearn.under_sampling.ClusterCentroids
imblearn.under_sampling.CondensedNearestNeighbour
imblearn.under_sampling.EditedNearestNeighbours
imblearn.under_sampling.InstanceHardnessThreshold
imblearn.under_sampling.NearMiss
imblearn.under_sampling.NeighbourhoodCleaningRule
imblearn.under_sampling.OneSidedSelection
imblearn.under_sampling.RandomUnderSampler
imblearn.under_sampling.RepeatedEditedNearestNeighbours
imblearn.under_sampling.TomekLinks
- LightGBM:
- SkLearn2PMML:
sklearn2pmml.EstimatorProxy
sklearn2pmml.SelectorProxy
sklearn2pmml.decoration.Alias
sklearn2pmml.decoration.CategoricalDomain
sklearn2pmml.decoration.ContinuousDomain
sklearn2pmml.decoration.DateDomain
sklearn2pmml.decoration.DateTimeDomain
sklearn2pmml.decoration.MultiDomain
sklearn2pmml.decoration.OrdinalDomain
sklearn2pmml.ensemble.GBDTLMRegressor
- The GBDT side: All Scikit-Learn decision tree ensemble regressors, LGBMRegressor, XGBRegressor, XGBRFRegressor.
- The LM side: A Scikit-Learn linear regressor (e.g. ElasticNet, LinearRegression, SGDRegressor).
sklearn2pmml.ensemble.GBDTLRClassifier
- The GBDT side: All Scikit-Learn decision tree ensemble classifiers, LGBMClassifier, XGBClassifier, XGBRFClassifier.
- The LR side: A Scikit-Learn binary linear classifier (e.g. LinearSVC, LogisticRegression, SGDClassifier).
sklearn2pmml.ensemble.SelectFirstClassifier
sklearn2pmml.ensemble.SelectFirstRegressor
sklearn2pmml.feature_selection.SelectUnique
sklearn2pmml.pipeline.PMMLPipeline
sklearn2pmml.preprocessing.Aggregator
sklearn2pmml.preprocessing.CastTransformer
sklearn2pmml.preprocessing.ConcatTransformer
sklearn2pmml.preprocessing.CutTransformer
sklearn2pmml.preprocessing.DaysSinceYearTransformer
sklearn2pmml.preprocessing.ExpressionTransformer
- Ternary conditional expression <expression_true> if <condition> else <expression_false>.
- Array indexing expressions X[<column index>] and X[<column name>].
- String concatenation expressions.
- String slicing expressions <str>[<start>:<stop>].
- Arithmetic operators +, -, *, / and %.
- Identity comparison operators is None and is not None.
- Comparison operators in <list>, not in <list>, <=, <, ==, !=, > and >=.
- Logical operators and, or and not.
- Value missingness check functions pandas.isnull and pandas.notnull.
- Numpy universal functions.
- String functions lower, upper and strip.
- String length function len(<str>).
sklearn2pmml.preprocessing.IdentityTransformer
sklearn2pmml.preprocessing.LookupTransformer
sklearn2pmml.preprocessing.MatchesTransformer
sklearn2pmml.preprocessing.MultiLookupTransformer
sklearn2pmml.preprocessing.PMMLLabelBinarizer
sklearn2pmml.preprocessing.PMMLLabelEncoder
sklearn2pmml.preprocessing.PowerFunctionTransformer
sklearn2pmml.preprocessing.ReplaceTransformer
sklearn2pmml.preprocessing.SecondsSinceMidnightTransformer
sklearn2pmml.preprocessing.SecondsSinceYearTransformer
sklearn2pmml.preprocessing.StringNormalizer
sklearn2pmml.preprocessing.SubstringTransformer
sklearn2pmml.preprocessing.WordCountTransformer
sklearn2pmml.preprocessing.h2o.H2OFrameCreator
sklearn2pmml.preprocessing.scipy.BSplineTransformer
sklearn2pmml.ruleset.RuleSetClassifier
- Sklearn-Pandas:
sklearn_pandas.CategoricalImputer
sklearn_pandas.DataFrameMapper
- TPOT:
tpot.builtins.stacking_estimator.StackingEstimator
- XGBoost:
- Production quality:
- Complete test coverage.
- Fully compliant with the JPMML-Evaluator library.
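The expression constructs listed above for sklearn2pmml.preprocessing.ExpressionTransformer are ordinary Python syntax. The sketch below evaluates a few of them with plain `eval` against a single row, purely to illustrate how they read. It does not use or reproduce the library itself. The row contents, the dictionary keys and the method-call spelling of `strip` and `lower` are assumptions made for this illustration.

```python
# Plain-Python illustration of the expression syntax; no sklearn2pmml involved.
# Assumption: X stands for one input row, modelled here as a dict so that
# column-name indexing (X[<column name>]) can be shown.
X = {"name": "  Iris Setosa ", "length": 5.1, "width": 0.0, "note": None}

# Hypothetical expressions, one per construct family from the list above.
expressions = {
    # Ternary conditional guarding a division by zero
    "ratio": "X['length'] / X['width'] if X['width'] != 0 else -1",
    # String functions followed by string slicing
    "slice": "X['name'].strip().lower()[0:4]",
    # Identity comparison against None
    "has_note": "X['note'] is not None",
    # Membership comparison against a list
    "in_list": "X['name'].strip() in ['Iris Setosa', 'Iris Virginica']",
    # String length function
    "name_len": "len(X['name'].strip())",
}

# Evaluate every expression with X bound to the row.
results = {key: eval(expr, {}, {"X": X}) for key, expr in expressions.items()}

for key, value in results.items():
    print(key, "=", value)
```

Writing the feature logic as a single expression string keeps it within the listed constructs. Anything outside that list may not convert to PMML.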
- Python 2.7, 3.4 or newer.
- scikit-learn 0.16.0 or newer.
- sklearn-pandas 0.0.10 or newer.
- sklearn2pmml 0.14.0 or newer.
Validating Python installation:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib, sklearn, sklearn_pandas, sklearn2pmml
print(sklearn.__version__)
print(joblib.__version__)
print(sklearn_pandas.__version__)
print(sklearn2pmml.__version__)
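The printed versions still have to be compared by hand against the minimums listed above. A minimal sketch of automating that comparison follows. The helper names `version_tuple`, `meets_minimum` and `check_installation` are hypothetical and not part of any of the listed packages. The version parsing is deliberately simple: it keeps the leading numeric components and drops suffixes such as `.post1`.

```python
import re

# Minimum versions quoted from the prerequisites above.
MINIMUMS = {
    "sklearn": "0.16.0",
    "sklearn_pandas": "0.0.10",
    "sklearn2pmml": "0.14.0",
}

def version_tuple(version):
    """Turn '0.22.2.post1' into (0, 22, 2); stop at the first non-numeric piece."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(installed, minimum):
    """Compare two version strings after padding them to equal length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a = a + (0,) * (width - len(a))
    b = b + (0,) * (width - len(b))
    return a >= b

def check_installation(minimums=MINIMUMS):
    """Return a {module name: status} report for the given minimum versions."""
    report = {}
    for module_name, minimum in minimums.items():
        try:
            module = __import__(module_name)
        except ImportError:
            report[module_name] = "not installed"
            continue
        installed = getattr(module, "__version__", "0")
        report[module_name] = "ok" if meets_minimum(installed, minimum) else "too old"
    return report

if __name__ == "__main__":
    for name, status in check_installation().items():
        print(name, status)
```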
- Java 1.8 or newer.
Enter the project root directory and build using Apache Maven:
mvn clean install
The build produces an executable uber-JAR file target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar.
A typical workflow can be summarized as follows:
- Use Python to train a model.
- Serialize the model in pickle data format to a file in a local filesystem.
- Use the JPMML-SkLearn command-line converter application to turn the pickle file into a PMML file.
Loading data into a pandas.DataFrame object:
import pandas
df = pandas.read_csv("Iris.csv")
iris_X = df[df.columns.difference(["Species"])]
iris_y = df["Species"]
First, creating a sklearn_pandas.DataFrameMapper object, which performs column-oriented feature engineering and selection work:
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import StandardScaler
from sklearn2pmml.decoration import ContinuousDomain
column_preprocessor = DataFrameMapper([
(["Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"], [ContinuousDomain(), StandardScaler()])
])
Second, creating Transformer and Selector objects, which perform table-oriented feature engineering and selection work:
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import Pipeline
from sklearn2pmml import SelectorProxy
table_preprocessor = Pipeline([
("pca", PCA(n_components = 3)),
("selector", SelectorProxy(SelectKBest(k = 2)))
])
Please note that stateless Scikit-Learn selector objects need to be wrapped into an sklearn2pmml.SelectorProxy object.
Third, creating an Estimator object:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier(min_samples_leaf = 5)
Combining the above objects into a sklearn2pmml.pipeline.PMMLPipeline object, and running the experiment:
from sklearn2pmml.pipeline import PMMLPipeline
pipeline = PMMLPipeline([
("columns", column_preprocessor),
("table", table_preprocessor),
("classifier", classifier)
])
pipeline.fit(iris_X, iris_y)
Embedding model verification data:
pipeline.verify(iris_X.sample(n = 15))
Storing the fitted PMMLPipeline object in pickle data format:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
joblib.dump(pipeline, "pipeline.pkl.z", compress = 9)
Please see the test script file main.py for more classification (binary and multi-class) and regression workflows.
Converting the pipeline pickle file pipeline.pkl.z to a PMML file pipeline.pmml:
java -jar target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar --pkl-input pipeline.pkl.z --pmml-output pipeline.pmml
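The conversion step can also be launched from the same Python script that trains and pickles the pipeline. The following is a minimal sketch, assuming `java` is on the PATH and the uber-JAR has been built as described above. The helper names `build_convert_command` and `convert` are hypothetical. Only the `--pkl-input` and `--pmml-output` flags come from the command shown above.

```python
import subprocess

def build_convert_command(jar_path, pkl_path, pmml_path):
    """Assemble the converter command line as an argument list (no shell quoting needed)."""
    return [
        "java", "-jar", jar_path,
        "--pkl-input", pkl_path,
        "--pmml-output", pmml_path,
    ]

def convert(jar_path, pkl_path, pmml_path):
    """Run the converter; check=True raises CalledProcessError on a non-zero exit code."""
    subprocess.run(build_convert_command(jar_path, pkl_path, pmml_path), check=True)

if __name__ == "__main__":
    convert(
        "target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar",
        "pipeline.pkl.z",
        "pipeline.pmml",
    )
```

Passing the arguments as a list avoids shell interpretation of file names that contain spaces.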
Getting help:
java -jar target/jpmml-sklearn-executable-1.6-SNAPSHOT.jar --help
Up-to-date:
- Converting Scikit-Learn based Imbalanced-Learn (imblearn) pipelines to PMML documents
- Extending Scikit-Learn with date and datetime features
- Extending Scikit-Learn with feature specifications
- Converting logistic regression models to PMML documents
- Stacking Scikit-Learn, LightGBM and XGBoost models
- Converting Scikit-Learn hyperparameter-tuned pipelines to PMML documents
- Extending Scikit-Learn with GBDT plus LR ensemble (GBDT+LR) model type
- Converting Scikit-Learn based TPOT automated machine learning (AutoML) pipelines to PMML documents
- Converting Scikit-Learn based LightGBM pipelines to PMML documents
- Extending Scikit-Learn with business rules (BR) model type
JPMML-SkLearn is licensed under the terms and conditions of the GNU Affero General Public License, Version 3.0.
If you would like to use JPMML-SkLearn in a proprietary software project, then it is possible to enter into a licensing agreement which makes JPMML-SkLearn available under the terms and conditions of the BSD 3-Clause License instead.
JPMML-SkLearn is developed and maintained by Openscoring Ltd, Estonia.
Interested in using Java PMML API software in your company? Please contact [email protected]