The "Linear Model Selection.ipynb" notebook discusses 1) subset selection, 2) stepwise regression, and 3) shrinkage methods (the lasso and ridge regression) for choosing among linear models. These methods tell us which predictors a regression should include for the best model fit.
The notebook uses school-level achievement and characteristics data from New York, available on Kaggle. We use model selection methods to find the important determinants of school-level math achievement.
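As a minimal sketch of how shrinkage performs variable selection, the snippet below fits scikit-learn's `Lasso` to synthetic data (not the New York schools data) in which only the first three predictors matter; the penalty `alpha=0.1` is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
# Only the first three predictors actually affect the response
beta = np.array([3.0, -2.0, 1.5] + [0.0] * (p - 3))
y = X @ beta + rng.normal(scale=0.5, size=n)

# The L1 penalty shrinks small coefficients exactly to zero,
# so the nonzero entries of coef_ are the "selected" predictors
lasso = Lasso(alpha=0.1).fit(X, y)
selected = np.flatnonzero(lasso.coef_ != 0)
print("selected predictors:", selected)
```

With a strong enough penalty, the irrelevant predictors drop out, which is exactly the model-selection behavior the notebook exploits.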
- NumPy
- pandas
- Matplotlib
- scikit-learn (used only for the lasso)
All dependencies can be installed with pip.
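For example, a typical setup command (assuming a standard Python environment is already available):

```shell
pip install numpy pandas matplotlib scikit-learn
```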
If you are interested in improving this notebook further, here are some ideas:
- Implement the lasso from scratch in NumPy (without relying on scikit-learn)
- Perform cross-validation to find the optimal tuning parameter for the lasso, then compare the model it selects to the one selected by forward selection.
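As a starting point for the cross-validation idea, scikit-learn's `LassoCV` chooses the penalty by k-fold cross-validation over an automatically generated grid. This is a sketch on synthetic data; the fold count and the data-generating setup are illustrative assumptions, not part of the notebook:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 150, 8
X = rng.normal(size=(n, p))
# True model uses only predictors 0, 2, and 6
beta = np.array([2.0, 0.0, -1.0, 0.0, 0.0, 0.0, 0.5, 0.0])
y = X @ beta + rng.normal(scale=0.3, size=n)

# 5-fold CV picks the penalty minimizing held-out mean squared error
cv_lasso = LassoCV(cv=5).fit(X, y)
print("chosen alpha:", cv_lasso.alpha_)
print("selected predictors:", np.flatnonzero(cv_lasso.coef_ != 0))
```

The nonzero entries of `cv_lasso.coef_` give the lasso-selected model, which can then be set side by side with the variables chosen by forward selection.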