forked from prasunkgupta/DataScienceCourse
Commit de9f853 (1 parent: cd7a685)
Showing 1 changed file with 294 additions and 0 deletions.
@@ -0,0 +1,294 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"In this lab we'll demonstrate several common techniques and helpful tools used in a model-building process:\n",
"\n",
"- Use Sklearn to generate polynomial features and rescale them\n",
"- Create folds for cross-validation\n",
"- Perform a grid search to optimize hyper-parameters using cross-validation\n",
"- Create pipelines to perform grid search in less code\n",
"- Improve upon a baseline model incrementally by adding in more complexity\n",
"\n",
"This lab will require using several Sklearn classes. It would be helpful to refer to the appropriate documentation:\n",
"- http://scikit-learn.org/stable/modules/preprocessing.html\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV\n",
"- http://scikit-learn.org/stable/modules/pipeline.html#pipeline\n",
"\n",
"Also, here is a helpful tutorial that explains how to use much of the above:\n",
"- https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/\n",
"\n",
"As always, let's first load in the data.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.grid_search import GridSearchCV\n", | ||
"from sklearn.cross_validation import KFold\n", | ||
"cwd = os.getcwd()\n", | ||
"datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n", | ||
"\n", | ||
"data = pd.read_csv(datadir + 'Cell2Cell_data.csv', header=0, sep=',')\n", | ||
"\n", | ||
"#Randomly sort the data:\n", | ||
"data = data.sample(frac = 1)\n", | ||
"data.columns" | ||
] | ||
}, | ||
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we're going to prep the data. From prior analysis (the Churn Case Study) we learned that we can drop a few variables, as they are either highly redundant or don't carry a strong relationship with the outcome.\n",
"\n",
"After dropping them, we'll use the Sklearn KFold class to set up cross-validation fold indexes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Prior analysis (from the Churn Case Study) has shown that we can drop a few redundant variables\n",
"#Dropping them also speeds up later calculations\n",
"dropvar_list = ['incalls', 'creditcd', 'marryyes', 'travel', 'pcown']\n",
"data_subset = data.drop(dropvar_list, axis = 1)\n",
"\n",
"#Set up X and Y\n",
"X = data_subset.drop('churndep', axis = 1)\n",
"Y = data_subset['churndep']\n",
"\n",
"#Use KFold to create 4 folds\n",
"kfolds = KFold(#Student put code here)\n"
]
},
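{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you get stuck, the next cell is a minimal sketch of one possible answer (not the official solution). It assumes the sklearn.model_selection version of KFold; the shuffle and random_state values are arbitrary choices added for reproducibility."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Example sketch: 4 folds, shuffled, with a fixed seed (assumed values)\n",
"from sklearn.model_selection import KFold\n",
"\n",
"kfolds_example = KFold(n_splits=4, shuffle=True, random_state=42)\n",
"kfolds_example"
]
},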
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let's use cross-validation to build a baseline model. We'll use logistic regression with no feature pre-processing, and look at both L1 and L2 regularization with different weights. We can do this very succinctly with Sklearn's GridSearchCV class."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#1st, set up a parameter grid\n",
"param_grid_lr = {'C':#Student put code here, \n",
" 'penalty':#Student put code here}\n",
"\n",
"#2nd, call the GridSearchCV class, use LogisticRegression and 'neg_log_loss' for scoring\n",
"lr_grid_search = GridSearchCV(#Student put code here) \n",
"lr_grid_search.fit(X, Y)\n",
"\n",
"#3rd, get the score of the best model and print it\n",
"best_1 = #Student put code here\n",
"print(best_1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Next let's look at the best estimator chosen to see what the parameters were\n",
"lr_grid_search.#Student put code here"
]
},
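{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a hedged sketch of one way to complete the two cells above (not the official solution). The C values in the grid are arbitrary assumptions, 'neg_log_loss' is the current name of the log-loss scorer, and solver='liblinear' is specified only because it supports both L1 and L2 penalties."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Example sketch (assumed grid values, not the official solution)\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"param_grid_lr_example = {'C': [0.01, 0.1, 1, 10, 100],\n",
"                         'penalty': ['l1', 'l2']}\n",
"\n",
"#liblinear supports both penalties; score with negative log loss on the 4 folds\n",
"lr_grid_search_example = GridSearchCV(LogisticRegression(solver='liblinear'),\n",
"                                      param_grid_lr_example,\n",
"                                      scoring='neg_log_loss',\n",
"                                      cv=kfolds_example)\n",
"lr_grid_search_example.fit(X, Y)\n",
"\n",
"print(lr_grid_search_example.best_score_)\n",
"lr_grid_search_example.best_estimator_"
]
},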
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see if we can beat this by standardizing the features. We'll approach this using the GridSearchCV class but also build a pipeline. Later we'll extend the pipeline to allow for feature engineering as well."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"#Create a set of steps. All but the last step must be a transformer (something that processes data); \n",
"#the last step should be an estimator.\n",
"#Build a list of steps, where the first is StandardScaler and the second is LogisticRegression.\n",
"\n",
"steps = [('scaler', #Student put code here,\n",
" ('lr', #Student put code here)]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline = Pipeline(#Student put code here)\n",
"\n",
"#Now set up the parameter grid, paying close attention to the step__parameter naming convention here\n",
"parameters_scaler = dict(lr__C = #Student put code here,\n",
" lr__penalty = #Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_scaler = GridSearchCV(#Student put code here)\n",
" \n",
"#Don't forget to fit this GridSearchCV pipeline\n",
"#Student put code here\n",
"\n",
"#Again, print the score of the best model\n",
"best_2 = #Student put code here\n",
"print(best_2)"
]
},
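{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, a minimal sketch of one way to fill in the cell above (not the official solution). It reuses the same assumed C grid and the kfolds_example object from the earlier sketches."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Example sketch of the scaling pipeline (assumed grid values)\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"steps_example = [('scaler', StandardScaler()),\n",
"                 ('lr', LogisticRegression(solver='liblinear'))]\n",
"\n",
"pipeline_example = Pipeline(steps_example)\n",
"\n",
"#Parameter names follow the step__parameter convention\n",
"parameters_scaler_example = dict(lr__C = [0.01, 0.1, 1, 10, 100],\n",
"                                 lr__penalty = ['l1', 'l2'])\n",
"\n",
"lr_grid_search_scaler_example = GridSearchCV(pipeline_example,\n",
"                                             parameters_scaler_example,\n",
"                                             scoring='neg_log_loss',\n",
"                                             cv=kfolds_example)\n",
"lr_grid_search_scaler_example.fit(X, Y)\n",
"print(lr_grid_search_scaler_example.best_score_)"
]
},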
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's see the model after scaling. Did the optimal parameters change?\n",
"lr_grid_search_scaler.best_estimator_.steps[-1][1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've built a pipeline estimator that performs feature scaling and then logistic regression, let's add a feature engineering step to it. We'll then again use GridSearchCV to find an optimal parameter configuration and see if we can beat our best score above."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"#Create a set of steps. All but the last step must be a transformer (something that processes data). \n",
"# Step 1 - PolynomialFeatures\n",
"# Step 2 - StandardScaler\n",
"# Step 3 - LogisticRegression\n",
"\n",
"steps_poly = [#Student put code here]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline_poly = #Student put code here\n",
"\n",
"#Now set up a new parameter grid. Use the same parameters used above for logistic regression, \n",
"#but add polynomial features up to degree 2, with and without interactions. \n",
"parameters_poly = dict(#Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_poly = #Student put code here\n",
"lr_grid_search_poly.fit(X, Y)\n",
"\n",
"best_3 = lr_grid_search_poly.best_score_\n",
"print(best_3)"
]
},
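{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible way to complete the cell above (a hedged sketch, not the official solution). The degree and interaction_only values are one interpretation of 'degree 2 with and without interactions' via the PolynomialFeatures parameters; the rest of the grid repeats the earlier assumptions."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Example sketch of the polynomial-feature pipeline (assumed grid values)\n",
"from sklearn.preprocessing import PolynomialFeatures, StandardScaler\n",
"from sklearn.pipeline import Pipeline\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import GridSearchCV\n",
"\n",
"steps_poly_example = [('poly', PolynomialFeatures()),\n",
"                      ('scaler', StandardScaler()),\n",
"                      ('lr', LogisticRegression(solver='liblinear'))]\n",
"\n",
"pipeline_poly_example = Pipeline(steps_poly_example)\n",
"\n",
"#Degree-2 features, tried with interaction terms only and with the full expansion\n",
"parameters_poly_example = dict(poly__degree = [2],\n",
"                               poly__interaction_only = [True, False],\n",
"                               lr__C = [0.01, 0.1, 1, 10, 100],\n",
"                               lr__penalty = ['l1', 'l2'])\n",
"\n",
"lr_grid_search_poly_example = GridSearchCV(pipeline_poly_example,\n",
"                                           parameters_poly_example,\n",
"                                           scoring='neg_log_loss',\n",
"                                           cv=kfolds_example)\n",
"lr_grid_search_poly_example.fit(X, Y)\n",
"print(lr_grid_search_poly_example.best_score_)"
]
},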
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's look at the best estimator, stepwise\n",
"lr_grid_search_poly.best_estimator_.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's make a bar chart to plot the results."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"%matplotlib inline\n",
"\n",
"#Flip the sign of the (negative) log-loss scores so that lower bars mean better models\n",
"results = -1 * np.array([best_1, best_2, best_3])\n",
"labs = ['LR', 'Scaler-LR', 'Poly-Scaler-LR']\n",
"\n",
"fig = plt.figure(facecolor = 'w', figsize = (12, 6))\n",
"ax = plt.subplot(111)\n",
"\n",
"width = 0.5\n",
"ind = np.arange(3)\n",
"rec = ax.bar(ind + width, results, width, color='r')\n",
"\n",
"ax.set_xticks(ind + width)\n",
"ax.set_xticklabels(labs, size = 14)\n",
"ax.set_ylim([0.6, 0.7])\n",
"\n",
"#Draw a horizontal reference line at the best (lowest) log loss\n",
"plt.plot(np.arange(4), min(results) * np.ones(4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}