Pandas dataframe goes in, XGBoost model results come out
Feature engineering (creating new features and selectively removing unwanted ones) is the most creative and fun part of training a model; what follows is usually a standard data-processing flow. Once you're done engineering your features, xgbmagic runs that standard XGBoost workflow on your pandas dataframe automatically. It:
- performs one-hot encoding for categorical features,
- drops uninformative features (no variability, too many missing values...)
- trains the model
- plots feature importances, ranked from most to least important.
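For a sense of what this automates, here is a minimal hand-rolled sketch of the same steps using plain pandas and xgboost (this is not xgbmagic's implementation; the file name, 'TARGET' column, and 'city' categorical column are hypothetical):

```python
import pandas as pd
import xgboost as xgb

df = pd.read_csv('train.csv')              # hypothetical training file
y = df.pop('TARGET')                       # hypothetical target column

# one-hot encode a categorical feature
df = pd.get_dummies(df, columns=['city'])  # hypothetical categorical column

# drop uninformative features: constant columns, mostly-missing columns
df = df.loc[:, df.nunique() > 1]
df = df.loc[:, df.isnull().mean() < 0.9]

# train the model and plot feature importances (requires matplotlib)
model = xgb.XGBClassifier(n_estimators=500)
model.fit(df, y)
xgb.plot_importance(model)
```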
Iterate faster with smaller samples! Improve accuracy with ensemble learning (bagging)!
- allows training on a random subsample - e.g. if you'd like to iterate on a subset of your data
- allows training multiple times on different subsamples, then returns the average/consensus of all the models as the prediction (bagging; see the sketch after this list)
- detect highly correlated columns and remove the redundant ones
- remove categorical features with too many possible category values (to drop unhelpful features like names and IDs)
- parameter tuning with GridSearchCV
- allow custom values for more parameters
- for classification problems, convert labels to consecutive integers starting from 0 (in case they're floats or strings)
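For intuition, here is a rough sketch of the subsample-and-average (bagging) idea, independent of how xgbmagic implements it; the column name and parameter values are illustrative:

```python
import pandas as pd
import xgboost as xgb

def bagged_predict(train_df, test_df, target='TARGET',
                   sample_fraction=0.2, n_samples=5):
    """Train one model per random subsample and average their predictions."""
    preds = []
    for seed in range(n_samples):
        # draw a random subsample of the training data
        sample = train_df.sample(frac=sample_fraction, random_state=seed)
        X, y = sample.drop(columns=[target]), sample[target]
        model = xgb.XGBClassifier(n_estimators=200)
        model.fit(X, y)
        # test_df must contain the same feature columns as X
        preds.append(model.predict_proba(test_df)[:, 1])
    # consensus prediction: the mean of the individual model outputs
    return sum(preds) / len(preds)
```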
Install xgboost first:

```bash
git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
cd python-package; sudo python setup.py install
```
Then install xgbmagic:

```bash
pip install xgbmagic
```
Input parameters for xgbmagic.Xgb():
- df (DataFrame): pandas dataframe
- target_column (string): name of column containing the target parameter
- id_column (string): name of column containing IDs
- target_type (string): 'binary' for binary targets (classification), 'multiclass' for multiple classes (classification), 'linear' for continuous targets (linear regression)
- categorical_columns (list of strings): a list of names of columns containing categorical data
- numeric_columns (list of strings): a list of names of columns containing numerical data
- drop_columns (list of strings): a list of names of columns to drop
- verbose (boolean): verbosity of printouts. True = verbose
- sample_fraction (float, 0-1): if this is not 1, a subsample of the data will be used to train the model
- n_samples (int): if this is more than 1, this number of models will be trained, each on a random subsample of the data (subsample size set by sample_fraction). Note that if n_samples * sample_fraction is greater than 1, the number of models is capped by how many subsamples the stated sample_fraction allows (e.g. if sample_fraction = 0.2, at most 5 models will be trained)
- prefix (string): prefix for names of the output files
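As a quick illustration of the subsampling parameters listed above (using the same hypothetical 'TARGET' and 'ID' columns as the full example further down), this trains five models, each on a random 20% subsample:

```python
import pandas as pd
import xgbmagic

df = pd.read_csv('train.csv')
# train 5 models, each on a random 20% subsample of the data
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID', target_type='binary',
                   sample_fraction=0.2, n_samples=5)
xgb.train()
```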
predict(test_df, return_multi_outputs, return_mean_std)
- return_multi_outputs (boolean): if True, returns one set of outputs per model trained on a distinct subsample
- return_mean_std (boolean): if True, returns a tuple with the mean and standard deviation of the model outputs. If both return_multi_outputs and return_mean_std are set to True, return_multi_outputs takes precedence
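Continuing the sketch above (xgb is the xgbmagic.Xgb instance trained on five subsamples), the two flags can be used like this:

```python
test_df = pd.read_csv('test.csv')

# one set of predictions per subsample-trained model
per_model_outputs = xgb.predict(test_df, return_multi_outputs=True)

# or the mean and standard deviation across the models' outputs
mean_pred, std_pred = xgb.predict(test_df, return_mean_std=True)
```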
Example usage:

```python
import xgbmagic
import pandas as pd

# read the training data
df = pd.read_csv('train.csv')

# for logistic regression, target_type is 'binary'
target_type = 'binary'

# set columns that are categorical, numeric, and to be dropped here
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID', target_type=target_type,
                   categorical_columns=[], drop_columns=[], numeric_columns=[],
                   num_training_rounds=500, verbose=1, early_stopping_rounds=50)
xgb.train()

# use the model to predict values for the test set
test_df = pd.read_csv('test.csv')
print(xgb.feature_importance())
output = xgb.predict(test_df)

# write to csv
xgb.write_csv('output-xgbmagic.csv')

# save model
xgb.save('xgbmodel.pkl')

# load model (joblib is a standalone package; sklearn.externals.joblib has been removed from recent scikit-learn versions)
import joblib
xgb = joblib.load('xgbmodel.pkl')
```
Please report issues and feedback here