The goal of colino is to provide supervised feature selection steps to be used with the tidymodels recipes package. The overall focus of the package is on filter-based feature selection methods. Permutation score methods that use a model can be considered a special case of filter approaches.
Note - colino is the new package name and replaces the preliminary 'recipeselectors' name. Colino will be submitted to CRAN once some additional steps and documentation have been finalized.
devtools::install_github("stevenpawley/colino")
The following feature selection methods are implemented:
- step_select_infgain provides Information Gain feature selection. This step requires the FSelectorRcpp package to be installed and can be used for both classification and regression problems. For regression, the target variable is discretized using equal frequency binning.
- step_select_mrmr provides Maximum Relevance Minimum Redundancy (mRMR) feature selection. This step requires the praznik package to be installed. This step can be used for classification and regression problems. Similar to information gain, binning is used when the target variable is continuous.
- step_select_roc provides ROC-based feature selection based on each predictor's relationship with the response outcome, measured using a Receiver Operating Characteristic curve.
- step_select_xtab provides feature selection of categorical predictors using statistical association for numeric response outcomes.
- step_select_aov provides feature selection of categorical predictors using the ANOVA F-test for numeric response outcomes.
- step_select_vip provides model-based selection using feature importance scores or coefficients. This method allows a parsnip model specification to be used to select a subset of features based on the model's feature importances or coefficients. See below for details.
- step_select_boruta provides a Boruta feature selection step. This step can be used for classification and regression problems.
- step_select_carscore provides a CAR score feature selection step for regression models. This step requires the care package to be installed.
- step_select_relief provides a Relief-based feature selection step for classification and regression models. This step requires the FSelectorRcpp package to be installed.
- step_select_forests, step_select_tree, and step_select_linear provide model-based methods of selecting a subset of features based on the model's feature importance scores or coefficients.
- step_select_fcbf provides the Fast Correlation Based Filter method of Yu and Liu (2003), "Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution". The method is implemented in the Bioconductor package 'FCBF', which can be installed using BiocManager::install("FCBF").
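As a rough sketch of how one of these filter steps might be added to a recipe, the example below uses step_select_infgain on the iris data. The outcome and top_p arguments are assumed to behave as in the step_select_vip example later in this document, and the FSelectorRcpp package is assumed to be installed; check the step's help page for the exact signature.

```r
library(recipes)
library(colino)

data(iris)

# Rank the four numeric predictors by information gain against the outcome
# and keep the two best-scoring ones (assumes FSelectorRcpp is installed)
rec <- recipe(Species ~ ., data = iris) %>%
  step_select_infgain(all_predictors(), outcome = "Species", top_p = 2)

prepped <- prep(rec, training = iris)

# Processed training data: the retained predictors plus the outcome column
selected <- bake(prepped, new_data = NULL)
print(names(selected))
```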
Three parameters are used to filter features within the step_select_ functions in colino:
- top_p selects the number of best-scoring features to retain. This is intuitive, but you do not always know how many features are present in your recipe if preceding recipe steps have added or removed features.
- threshold selects the percentile of best-scoring features. For example, threshold = 0.9 retains only predictors with scores in the top 90th percentile, and a smaller threshold value selects more features.
- cutoff is a new argument that can be used to select features based on their absolute feature scores. For example, if a step_select_ method is based on the p-values of features, then cutoff can be used to threshold the features in p-value units. This requires knowledge of the range of values that a particular method produces.
Note that top_p and threshold are mutually exclusive, but either can be used in
conjunction with cutoff to select the top-ranked features together with any
features whose filter scores meet the cutoff. For example, you can require at
least three features to be included by using top_p = 3 but also include any
other features that meet the cutoff criterion, e.g., cutoff = 0.01 if a method
uses p-value units.
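The interaction of the three arguments can be illustrated with a small base R sketch. It does not call colino and is not the package's implementation: the score vector is made up, higher scores are assumed to be better, and the quantile rule used for the threshold case is only one plausible reading of "top 90th percentile".

```r
# Hypothetical filter scores for five features (higher = better is assumed)
scores <- c(a = 0.90, b = 0.50, c = 0.30, d = 0.10, e = 0.05)

# top_p-style selection: keep the two best-scoring features
keep_top_p <- names(sort(scores, decreasing = TRUE))[seq_len(2)]

# threshold-style selection: keep features at or above the 90th percentile
q90 <- quantile(scores, probs = 0.9, names = FALSE)
keep_threshold <- names(scores)[scores >= q90]

# cutoff-style selection: keep features whose raw score is at least 0.3
keep_cutoff <- names(scores)[scores >= 0.3]

# combining top_p with cutoff: the union of both selections
keep_combined <- union(keep_top_p, keep_cutoff)

print(keep_top_p)
print(keep_threshold)
print(keep_cutoff)
print(keep_combined)
```

For a method that reports p-values, where smaller values are better, the comparisons in this sketch would run in the opposite direction.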
Most step_select_ steps have top_p, threshold and cutoff available, but a few
methods such as Boruta and FCBF do not rank the features and only provide a
list of rejected features. These methods typically have none of these
arguments, or only cutoff.
The step_select_vip step is designed to work with the parsnip package and
requires a base model specification that provides a method of ranking the
importance of features, such as feature importance scores or coefficients, with
one score per feature. The base model is specified in the step using the model
parameter.
Although step_select_vip allows a diverse range of models to be used as the
ranking algorithm, and potentially allows new models to be implemented, a
limitation is that the hyperparameters of the ranking model cannot be tuned. As
an alternative, step_select_linear, step_select_tree and step_select_forests
provide steps specific to these types of models, where the hyperparameters of
the ranking model can be tuned using the same tuning arguments as parsnip.
The parsnip package does not currently contain a method of pulling feature
importance scores from models that support them. The colino package provides a
generic function pull_importances for this purpose that accepts a fitted
parsnip model and returns a tibble with two columns, 'feature' and 'importance':
library(parsnip)
library(colino)

# fit a boosted tree model (requires the xgboost package)
model <- boost_tree(mode = "classification") %>%
  set_engine("xgboost")
model_fit <- model %>%
  fit(Species ~ ., iris)
pull_importances(model_fit)
Most of the models and 'engines' that provide feature importances are
implemented. In addition, h2o models are supported using the agua package. Use
methods(pull_importances) to list models that are currently implemented.
If you need to pull the feature importance scores from a model that is not
currently supported in this package, you can add a method to the
pull_importances generic function that returns a two-column tibble:
pull_importances._ranger <- function(object, scaled = FALSE, ...) {
  scores <- ranger::importance(object$fit)

  # create a tibble with 'feature' and 'importance' columns
  scores <- tibble::tibble(
    feature = names(scores),
    importance = as.numeric(scores)
  )

  # optionally rescale the importance scores
  if (scaled) {
    scores$importance <- scales::rescale(scores$importance)
  }

  scores
}
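The method name above ends in ._ranger because that is the class parsnip assigns to a fitted ranger model. A short sketch for finding the class to target for another engine is shown below; it only uses parsnip and assumes the ranger package is installed. The object name rf_fit is chosen for illustration.

```r
library(parsnip)

# Fit a small random forest through parsnip (assumes ranger is installed)
rf_fit <- rand_forest(mode = "classification", trees = 50) %>%
  set_engine("ranger", importance = "permutation") %>%
  fit(Species ~ ., data = iris)

# The class vector shows which suffix a pull_importances method would need
fit_classes <- class(rf_fit)
print(fit_classes)

# The underlying engine object, which the method reads, is stored in $fit
print(class(rf_fit$fit))
```

The engine-specific entry in the printed class vector is the suffix to use when naming a new pull_importances method.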
An example of using the step_select_vip function:
library(parsnip)
library(recipes)
library(magrittr)
library(colino)

# load the example iris dataset
data(iris)

# define a base model to use for feature importances
base_model <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "permutation")

# create a preprocessing recipe
rec <- iris %>%
  recipe(Species ~ .) %>%
  step_select_vip(all_predictors(), model = base_model, top_p = 2,
                  outcome = "Species")

prepped <- prep(rec)

# create a model specification
clf <- decision_tree(mode = "classification") %>%
  set_engine("rpart")

# fit the model on the processed training data with the selected features
clf_fitted <- clf %>%
  fit(Species ~ ., juice(prepped))
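To check which features a prepped recipe retained, and to apply the same selection to other data, the standard recipes function bake can be used. The sketch below rebuilds the recipe from the example above so that it runs on its own; it assumes the ranger package is installed, and the object names train_selected and new_selected are chosen for illustration.

```r
library(parsnip)
library(recipes)
library(colino)

data(iris)

base_model <- rand_forest(mode = "classification") %>%
  set_engine("ranger", importance = "permutation")

prepped <- recipe(Species ~ ., data = iris) %>%
  step_select_vip(all_predictors(), model = base_model, top_p = 2,
                  outcome = "Species") %>%
  prep()

# Columns kept in the processed training set; with top_p = 2 this should be
# two predictors plus the outcome
train_selected <- bake(prepped, new_data = NULL)
print(names(train_selected))

# The same selection is applied to any new data passed to bake()
new_selected <- bake(prepped, new_data = head(iris))
print(names(new_selected))
```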