Commit de9f853 (parent cd7a685)
briandalessandro committed Oct 23, 2016
Showing 1 changed file with 294 additions and 0 deletions: ipython/Labs_Student/Lab_Sklearn_Magic_Student.ipynb

@@ -0,0 +1,294 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"In this lab we'll demonstrate several common techniques and helpful tools used in a model building process:\n",
"\n",
"- Use Sklearn to generate polynomial features and rescale them\n",
"- Create folds for cross-validation\n",
"- Perform a grid search to optimize hyper-parameters using cross-validation\n",
"- Create pipelines to perform grids search in less code\n",
"- Improve upon a baseline model incrementally by adding in more complexity\n",
"\n",
"This lab will require using several Sklearn classes. It would be helpful to refer to appropriate documentation:\n",
"- http://scikit-learn.org/stable/modules/preprocessing.html\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html#sklearn.preprocessing.PolynomialFeatures\n",
"- http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV\n",
"- http://scikit-learn.org/stable/modules/pipeline.html#pipeline\n",
"\n",
"Also, here is a helpful tutorial that explains how to use much of the above:\n",
"- https://civisanalytics.com/blog/data-science/2016/01/06/workflows-python-using-pipeline-gridsearchcv-for-compact-code/\n",
"\n",
"Like always, let's first load in the data.\n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import os\n",
"import pandas as pd\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.grid_search import GridSearchCV\n",
"from sklearn.cross_validation import KFold\n",
"cwd = os.getcwd()\n",
"datadir = '/'.join(cwd.split('/')[0:-1]) + '/data/'\n",
"\n",
"data = pd.read_csv(datadir + 'Cell2Cell_data.csv', header=0, sep=',')\n",
"\n",
"#Randomly sort the data:\n",
"data = data.sample(frac = 1)\n",
"data.columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next we're going to prep the data. From prior analysis (Churn Case Study) we learned that we can drop a few variables, as they are either highly redundant or don't carry a strong relationship with the outcome.\n",
"\n",
"After dropping, we're going to use the SkLearn KFold class to set up cross validation fold indexes."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Prior analysis (from Churn Case study) has shown that we can drop a few redundant variables\n",
"#We want to drop a few to speed up later calculations\n",
"dropvar_list = ['incalls', 'creditcd', 'marryyes', 'travel', 'pcown']\n",
"data_subset = data.drop(dropvar_list, 1)\n",
"\n",
"#Set up X and Y\n",
"X = data_subset.drop('churndep', 1)\n",
"Y = data_subset['churndep']\n",
"\n",
"#Use Kfold to create 4 folds\n",
"kfolds = KFold(#Student -nput code here)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next let's use cross-validation to build a baseline model. We're going to use LR with no feature pre-processing. We're going to look at both L1 and L2 regularization with different weights. We can do this very succinctly with SkLearns GridSearchCV package."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#1st, set up a paramater grid\n",
"param_grid_lr = {'C':#Student put code here, \n",
" 'penalty':#Student put code here}\n",
"\n",
"#2nd, call the GridSearchCV class, use LogisticRegression and 'log_loss' for scoring\n",
"lr_grid_search = GridSearchCV(#Student put code here) \n",
"lr_grid_search.fit(X, Y)\n",
"\n",
"#3rd, get the score of the best model and print it\n",
"best_1 = #Student put code here\n",
"print(best_1)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Next let's look at the best-estimator chosen to see what the parameters were\n",
"lr_grid_search.#Student put code here"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now let's see if we can beat this by standardizing the features. We'll approach this using the GridSearchCV class but also build a pipeline. Later we'll extend the pipeline to allow for feature engineering as well."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
"#Build a list of steps, where the first is StandardScaler and the second is LogisticRegression\n",
"#The last step should be an estimator.\n",
"\n",
"steps = [('scaler', #Student put code here,\n",
" ('lr', #Student put code here)]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline = Pipeline(#Student put code here)\n",
"\n",
"#Now set up the parameter grid, paying close to the correct convention here\n",
"parameters_scaler = dict(lr__C = #Student put code here,\n",
" lr__penalty = #Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_scaler = GridSearchCV(#Student put code here)\n",
" \n",
"#Don't forget to fit this GridSearchCV pipeline\n",
"#Student put code here\n",
"\n",
"#Again, print the score of the best model\n",
"best_2 = #Student put code here\n",
"print(best_2)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's see the model after scaling. Did the optimal parameters change?\n",
"lr_grid_search_scaler.best_estimator_.steps[-1][1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we've built a pipeline estimator that performs feature scaling and then logistic regression, let's add to it a feature engineering step. We'll then again use GridSearchCV to find an optimal parameter configuration and see if we can beat our best score above."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.preprocessing import PolynomialFeatures\n",
"\n",
"#Create a set of steps. All but the last step is a transformer (something that processes data). \n",
"# Step 1 - PolynomialFeatures\n",
"# Step 2 - StandardScaler\n",
"# Step 3 - LogisticRegression\n",
"\n",
"steps_poly = [#Student put code here]\n",
"\n",
"#Now set up the pipeline\n",
"pipeline_poly = #Student put code here\n",
"\n",
"#Now set up a new parameter grid, use the same paramaters used above for logistic regression, \n",
"#but add polynomial features up to degree 2 with and without interactions. \n",
"parameters_poly = dict(#Student put code here)\n",
"\n",
"#Now run another grid search\n",
"lr_grid_search_poly = #Student put code here\n",
"lr_grid_search_poly.fit(X, Y)\n",
"\n",
"best_3 = lr_grid_search_poly.best_score_\n",
"print(best_3)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"#Let's look at the best estimator, stepwise\n",
"lr_grid_search_poly.best_estimator_.steps"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now make a bar chart to plot results"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import numpy as np\n",
"results = -1 * np.array([best_1, best_2, best_3])\n",
"labs = ['LR', 'Scaler-LR', 'Poly-Scaler-LR']\n",
"\n",
"fig = plt.figure(facecolor = 'w', figsize = (12, 6))\n",
"ax = plt.subplot(111)\n",
"\n",
"width = 0.5\n",
"ind = np.arange(3)\n",
"rec = ax.bar(ind + width, results, width, color='r')\n",
"\n",
"ax.set_xticks(ind + width)\n",
"ax.set_xticklabels(labs, size = 14)\n",
"ax.set_ylim([0.6, 0.7])\n",
"\n",
"plt.plot(np.arange(4), min(results) * np.ones(4))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python [py35]",
"language": "python",
"name": "Python [py35]"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
