Commit de9f853 (parent cd7a685)
briandalessandro committed Oct 23, 2016
Showing 1 changed file with 294 additions and 0 deletions: ipython/Labs_Student/Lab_Sklearn_Magic_Student.ipynb

@@ -0,0 +1,294 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"In this lab we'll demonstrate several common techniques and helpful tools used in a model building process:\n",
"\n",
"- Use Sklearn to generate polynomial features and rescale them\n",
"- Create folds for cross-validation\n",
"- Perform a grid search to optimize hyper-parameters using cross-validation\n",
"- Create pipelines to perform grids search in less code\n",
"- Improve upon a baseline model incrementally by adding in more complexity\n",
"\n",
"This lab will require using several Sklearn classes. It would be helpful to refer to appropriate documentation:\n",
"- http://scikit-learn.org/stable/modules/preprocessing.html\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV\n",
"- http://scikit-learn.org/stable/modules/pipeline.html#pipeline\n",
"\n",
"Also, here is a helpful tutorial that explains how to use much of the above:\n",
"- https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/\n",
"\n",
"Like always, let's first load in the data.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.grid_search import GridSearchCV\n",
"from sklearn.cross_validation import KFold\n",
"cwd = os.getcwd()\n",
"datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
"\n",
"data = pd.read_csv(datadir + 'Cell2Cell_data.csv', header=0, sep=',')\n",
"\n",
"#Randomly sort the data:\n",
"data = data.sample(frac = 1)\n",
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we're going to prep the data. From prior analysis (Churn Case Study) we learned that we can drop a few variables, as they are either highly redundant or don't carry a strong relationship with the outcome.\n",
"\n",
"After dropping, we're going to use the SkLearn KFold class to set up cross validation fold indexes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Prior analysis (from Churn Case study) has shown that we can drop a few redundant variables\n",
"#We want to drop a few to speed up later calculations\n",
"dropvar_list = ['incalls', 'creditcd', 'marryyes', 'travel', 'pcown']\n",
"data_subset = data.drop(dropvar_list, 1)\n",
"\n",
"#Set up X and Y\n",
"X = data_subset.drop('churndep', 1)\n",
"Y = data_subset['churndep']\n",
"\n",
"#Use Kfold to create 4 folds\n",
"kfolds = KFold(#Student -nput code here)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let's use cross-validation to build a baseline model. We're going to use LR with no feature pre-processing. We're going to look at both L1 and L2 regularization with different weights. We can do this very succinctly with SkLearns GridSearchCV package."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#1st, set up a paramater grid\n",
"param_grid_lr = {'C':#Student put code here, \n",
" 'penalty':#Student put code here}\n",
"\n",
"#2nd, call the GridSearchCV class, use LogisticRegression and 'log_loss' for scoring\n",
"lr_grid_search = GridSearchCV(#Student put code here) \n",
"lr_grid_search.fit(X, Y)\n",
"\n",
"#3rd, get the score of the best model and print it\n",
"best_1 = #Student put code here\n",
"print(best_1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Next let's look at the best-estimator chosen to see what the parameters were\n",
"lr_grid_search.#Student put code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see if we can beat this by standardizing the features. We'll approach this using the GridSearchCV class but also build a pipeline. Later we'll extend the pipeline to allow for feature engineering as well."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
"#Build a list of steps, where the first is StandardScaler and the second is LogisticRegression\n",
"#The last step should be an estimator.\n",
"\n",
"steps = [('scaler', #Student put code here,\n",
" ('lr', #Student put code here)]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline = Pipeline(#Student put code here)\n",
"\n",
"#Now set up the parameter grid, paying close to the correct convention here\n",
"parameters_scaler = dict(lr__C = #Student put code here,\n",
" lr__penalty = #Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_scaler = GridSearchCV(#Student put code here)\n",
" \n",
"#Don't forget to fit this GridSearchCV pipeline\n",
"#Student put code here\n",
"\n",
"#Again, print the score of the best model\n",
"best_2 = #Student put code here\n",
"print(best_2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's see the model after scaling. Did the optimal parameters change?\n",
"lr_grid_search_scaler.best_estimator_.steps[-1][1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've built a pipeline estimator that performs feature scaling and then logistic regression, let's add to it a feature engineering step. We'll then again use GridSearchCV to find an optimal parameter configuration and see if we can beat our best score above."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
"# Step 1 - PolynomialFeatures\n",
"# Step 2 - StandardScaler\n",
"# Step 3 - LogisticRegression\n",
"\n",
"steps_poly = [#Student put code here]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline_poly = #Student put code here\n",
"\n",
"#Now set up a new parameter grid, use the same paramaters used above for logistic regression, \n",
"#but add polynomial features up to degree 2 with and without interactions. \n",
"parameters_poly = dict(#Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_poly = #Student put code here\n",
"lr_grid_search_poly.fit(X, Y)\n",
"\n",
"best_3 = lr_grid_search_poly.best_score_\n",
"print(best_3)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's look at the best estimator, stepwise\n",
"lr_grid_search_poly.best_estimator_.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now make a bar chart to plot results"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"results = -1 * np.array([best_1, best_2, best_3])\n",
"labs = ['LR', 'Scaler-LR', 'Poly-Scaler-LR']\n",
"\n",
"fig = plt.figure(facecolor = 'w', figsize = (12, 6))\n",
"ax = plt.subplot(111)\n",
"\n",
"width = 0.5\n",
"ind = np.arange(3)\n",
"rec = ax.bar(ind + width, results, width, color='r')\n",
"\n",
"ax.set_xticks(ind + width)\n",
"ax.set_xticklabels(labs, size = 14)\n",
"ax.set_ylim([0.6, 0.7])\n",
"\n",
"plt.plot(np.arange(4), min(results) * np.ones(4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
