
VariablesSelection

Variable selection: detecting important features among a large number of input variables.

The file VariableSelection.py provides a class "FeatureImportance" that bundles the most important feature selection models, making them convenient for users to call.

All 14 methods are: LASSO, ElasticNet, SCAD, Knockoff, RandomForest, AdaBoost, GradientBoosting, ExtraTrees, LassoNet, GradientLearning, GroupLasso, DeepLIFT, Layer-WiseRelevancePropagation, and SHAP.

Among these algorithms, LASSO, ElasticNet, SCAD, and GroupLasso are based on linear models; RandomForest, AdaBoost, GradientBoosting, and ExtraTrees are tree ensemble models; LassoNet and Layer-WiseRelevancePropagation combine neural networks with feature selection.
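The two families score features differently: linear models expose a coefficient per feature, while tree ensembles expose impurity-based importances. A minimal sketch of this contrast, using scikit-learn stand-ins (not this repository's own wrappers) on synthetic data where only the first two features matter:

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only the first two features drive the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=200)

# Linear route: nonzero LASSO coefficients mark the selected features.
lasso = Lasso(alpha=0.1).fit(X, y)
linear_scores = np.abs(lasso.coef_)

# Tree-ensemble route: impurity-based importances play the same role.
trees = ExtraTreesRegressor(n_estimators=100, random_state=0).fit(X, y)
tree_scores = trees.feature_importances_

print(np.argsort(linear_scores)[-2:])  # indices of the two strongest features
print(np.argsort(tree_scores)[-2:])
```

Both routes should rank features 0 and 1 on top here; the library's methods return analogous per-feature scores through a unified interface.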

Paper Links

Required Package Versions

knockpy==1.3.0
lassonet==0.0.14
numpy==1.24.4
group-lasso==1.5.0
matplotlib==3.7.2
torch==2.0.1
shap==0.42.1
statsmodels==0.13.5
captum==0.6.0

How to Use the Methods

Regardless of the method used, first instantiate the 'FeatureImportance' class.

To use LASSO, ElasticNet, SCAD, RandomForest, AdaBoost, ExtraTrees, GroupLasso:

filter=FeatureImportance(x,y,test_ratio=0.2,threshold=0,wanted_num=2,task='regression',scarler=None,times=10)
coef, total=filter.GetCoefficient1(filter.ExtraTreesModel,max_depth=5,estimator_num=100)   

To use GradientLearning, SHAP, Layer-WiseRelevancePropagation, DeepLIFT, Knockoff:

filter=FeatureImportance(x,y,test_ratio=0.001,threshold=0,wanted_num=2,task='regression',scarler=None,times=10)
coef, total=filter.GetCoefficient2(filter_fun=filter.GradientLearningFilter,eps=0.25,l1_lamda=0.5,kernel_type="Gaussian")

To use LassoNet:

filter=FeatureImportance(x,y,test_ratio=0.2,threshold=0,wanted_num=2,task='regression',scarler=None,times=10)
coef, total=filter.LassoNetModel(hidden_dims=(64,),M=10,plot=True)

'coef' is the importance score of each feature, and 'total' is the number of times the feature was chosen across all the experiments.

Additionally, we provide a C++ version of the gradient learning algorithm.

This file uses the Armadillo package; to build and run it, enter the following commands in a console:

 g++ gradientLearning.cpp -o gradientLearning -std=c++11 -O2 -larmadillo
./gradientLearning

Example

# create data
import numpy as np
n=200
p=50
xita=0.25
w=np.random.normal(loc=1,scale=1,size=(n,p))
u=np.random.normal(loc=1,scale=1,size=(n,p))
x=(w+xita*u)/(1+xita)
y=((2*x[:,0]-1)*(2*x[:,1]-1)).reshape((-1,1))

#execute feature selection 
filter=FeatureImportance(x,y,test_ratio=0.2,threshold=0,wanted_num=2,task='regression',scarler='MinMaxScaler',times=20)
coef, total=filter.GetCoefficient2(filter.SHAP,hidden_num=(12,),plot=True)

In filter.GetCoefficient1 or filter.GetCoefficient2, you need to pass a feature selection method of the 'FeatureImportance' class as the first parameter; the other parameters depend on the chosen feature selection method.
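The pattern above is a higher-order function: the driver receives a bound method and forwards any extra keyword arguments to it. A minimal self-contained sketch of the idea (class and method names here are illustrative, not the library's actual code):

```python
import numpy as np

class TinyFeatureImportance:
    """Toy illustration of the 'pass the selector as first argument' pattern."""

    def __init__(self, x, times=5, wanted_num=2):
        self.x = x
        self.times, self.wanted_num = times, wanted_num

    def variance_filter(self, **kwargs):
        # Stand-in selection method: score features by their variance.
        return np.var(self.x, axis=0)

    def get_coefficient(self, filter_fun, **kwargs):
        # Extra keyword arguments are forwarded to whichever method was passed.
        coef = np.zeros(self.x.shape[1])
        total = np.zeros(self.x.shape[1])
        for _ in range(self.times):
            scores = filter_fun(**kwargs)
            coef += scores
            total[np.argsort(scores)[-self.wanted_num:]] += 1
        return coef / self.times, total

np.random.seed(0)
x = np.random.normal(scale=[1.0, 3.0, 0.1], size=(100, 3))
fi = TinyFeatureImportance(x)
coef, total = fi.get_coefficient(fi.variance_filter)
```

In the real library the same shape applies: `filter.GetCoefficient2(filter.SHAP, hidden_num=(12,), plot=True)` passes `filter.SHAP` as `filter_fun` and forwards `hidden_num` and `plot` to it.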

Visualization

If the parameter 'plot' in filter.GetCoefficient1 or filter.GetCoefficient2 is set to True, the results are plotted.

The results of Knockoff can be visualized as: [figure]

The results of SHAP can be visualized as: [figure]

Visualization of LassoNet's hyperparameter tuning process: [figure]
