forked from justmarkham/DAT8
Commit 7f84dae (1 parent: 3da688c)
Showing 7 changed files with 1,392 additions and 7 deletions.
@@ -505,8 +505,6 @@ Tuesday | Thursday
* [Unboxing the Random Forest Classifier](http://nerds.airbnb.com/unboxing-the-random-forest-classifier/) describes a way to interpret the inner workings of Random Forests beyond just feature importances.
* [Understanding Random Forests: From Theory to Practice](http://arxiv.org/pdf/1407.7502v3.pdf) is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.

<!--
-----

### Class 19: Advanced scikit-learn and Clustering

@@ -517,13 +515,17 @@ Tuesday | Thursday
* K-means: [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [visualization 1](http://tech.nitoyon.com/en/blog/2013/11/07/k-means/), [visualization 2](http://www.naftaliharris.com/blog/visualizing-k-means-clustering/)
* DBSCAN: [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), [visualization](http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
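
For quick orientation, here is a minimal sketch of calling both clustering estimators (the toy data and parameter values are illustrative, not taken from the class notebook):

```python
# minimal sketch: K-means vs. DBSCAN on a toy 2-D dataset
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 9.0]])

# K-means requires choosing the number of clusters up front
km_labels = KMeans(n_clusters=2, random_state=1).fit_predict(X)

# DBSCAN uses a neighborhood radius (eps) and a minimum sample count instead;
# points that belong to no cluster are labeled -1 (noise)
db_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(km_labels)
print(db_labels)
```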

**Homework:**
* Reread [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html). (The "answers" to the [guiding questions](homework/09_bias_variance.md) have been posted and may be helpful to you.)
* **Optional:** Watch these two excellent (and related) videos from Caltech's Learning From Data course: [bias-variance tradeoff](http://work.caltech.edu/library/081.html) (15 minutes) and [regularization](http://work.caltech.edu/library/121.html) (8 minutes).

**scikit-learn Resources:**
* This is a longer example of [feature scaling](https://github.com/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb) in scikit-learn, with additional discussion of the types of scaling you can use.
* [Practical Data Science in Python](http://radimrehurek.com/data_science_python/) is a long and well-written notebook that includes scikit-learn pipelining, plotting a learning curve, and pickling a model.
* [Practical Data Science in Python](http://radimrehurek.com/data_science_python/) is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
* To learn how to use [GridSearchCV and RandomizedSearchCV](http://scikit-learn.org/stable/modules/grid_search.html) for parameter tuning, watch [How to find the best model parameters in scikit-learn](https://www.youtube.com/watch?v=Gol_qOgRqfA) (28 minutes) or read the [associated notebook](https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb).
* Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of [tutorials and examples](https://github.com/rasbt/pattern_classification), a library of machine learning [tools and extensions](http://rasbt.github.io/mlxtend/), a new [book](https://github.com/rasbt/python-machine-learning-book), and a semi-active [blog](http://sebastianraschka.com/blog/).
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/[email protected]/index.html) that is often much more useful than Stack Overflow for researching a particular function.
* If you forget how to use a particular function in scikit-learn, don't forget that this repository is fully searchable!
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/[email protected]/index.html) that is often much more useful than Stack Overflow for researching functions and asking questions.
* If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!

**Clustering Resources:**
* For a very thorough introduction to clustering, read chapter 8 (69 pages) of [Introduction to Data Mining](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) (available as a free download), or browse through the chapter 8 slides.

@@ -532,18 +534,32 @@ Tuesday | Thursday
* An Introduction to Statistical Learning has useful videos on [K-means clustering](https://www.youtube.com/watch?v=aIybuNt9ps4&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2) (17 minutes) and [hierarchical clustering](https://www.youtube.com/watch?v=Tuuc9Y06tAc&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2) (15 minutes).
* This is an excellent interactive visualization of [hierarchical clustering](https://joyofdata.shinyapps.io/hclust-shiny/).
* This is a nice animated explanation of [mean shift clustering](http://spin.atomicobject.com/2015/05/26/mean-shift-clustering/).
* The [K-modes algorithm](http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf) can be used for clustering datasets of categorical features without converting them to numerical values.
* Fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).
* The [K-modes algorithm](http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf) can be used for clustering datasets of categorical features without converting them to numerical values. Here is a [Python implementation](https://github.com/nicodv/kmodes).
* Here are some fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).

<!--
-----

### Class 20: Regularization and Regular Expressions
* Regularization ([notebook](notebooks/20_regularization.ipynb))
* Regression: [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), [RidgeCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html), [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html), [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html)
* Classification: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* Regular expressions

**Homework:**
* Your final project is due next week!
* **Optional:** Make your final submissions to our Kaggle competition! It closes at 6:30pm ET on Tuesday 10/27.
* **Optional:** Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).

**Regularization Resources:**
* The scikit-learn user guide for [Generalized Linear Models](http://scikit-learn.org/stable/modules/linear_model.html) explains different variations of regularization.
* Section 6.2 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (14 pages) introduces both ridge regression and lasso regression. Or, watch the related videos on [ridge regression](https://www.youtube.com/watch?v=cSKzqb0EKS0&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=6) (13 minutes) and [lasso regression](https://www.youtube.com/watch?v=A5I1G1MfUmA&index=7&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI) (15 minutes).
* For more details on lasso regression, read Tibshirani's [original paper](http://statweb.stanford.edu/~tibs/lasso/lasso.pdf) from 1996.
* For a math-ier explanation of regularization, watch the last four videos (30 minutes) from week 3 of Andrew Ng's [machine learning course](https://www.coursera.org/learn/machine-learning/home/info), or read the [related lecture notes](http://www.holehouse.org/mlclass/07_Regularization.html) compiled by a student.
* This [notebook](https://github.com/luispedro/PenalizedRegression/blob/master/PenalizedRegression.ipynb) from chapter 7 of [Building Machine Learning Systems with Python](https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python) has a nice long example of regularized linear regression.
* There are some special considerations when using dummy encoding for categorical features with a regularized model. This [Cross Validated Q&A](https://stats.stackexchange.com/questions/69568/whether-to-rescale-indicator-binary-dummy-predictors-for-lasso) debates whether the dummy variables should be standardized (along with the rest of the features), and a comment on this [blog post](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models) recommends that the baseline level should not be dropped.
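
Below is a minimal sketch of fitting the regularized estimators linked above on a synthetic regression problem (the data, alpha values, and other settings are illustrative assumptions, not course code):

```python
# minimal sketch: ridge vs. lasso regression on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# illustrative data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=1)

# alpha controls the strength of the penalty (larger = more regularization)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# lasso tends to zero out the coefficients of uninformative features,
# while ridge only shrinks them toward zero
print(ridge.coef_)
print(lasso.coef_)
```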
-----
### Class 21: Course Review and Final Project Presentation
@@ -0,0 +1,208 @@
# # Advanced scikit-learn

# ## Agenda
#
# - StandardScaler
# - Pipeline (bonus content)

# ## StandardScaler
#
# ### What is the problem we're trying to solve?

# fake data
import pandas as pd
train = pd.DataFrame({'id':[0,1,2], 'length':[0.9,0.3,0.6], 'mass':[0.1,0.2,0.8], 'rings':[40,50,60]})
test = pd.DataFrame({'length':[0.59], 'mass':[0.79], 'rings':[54]})

# training data
train

# testing data
test

# define X and y
feature_cols = ['length', 'mass', 'rings']
X = train[feature_cols]
y = train.id

# KNN with K=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# what "should" it predict?
knn.predict(test)

# allow plots to appear in the notebook
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (5, 5)

# create a "colors" array for plotting
import numpy as np
colors = np.array(['red', 'green', 'blue'])

# scatter plot of training data, colored by id (0=red, 1=green, 2=blue)
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)

# testing data
plt.scatter(test.mass, test.rings, c='white', s=50)

# add labels
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How we interpret the data')

# adjust the x-limits
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)
plt.scatter(test.mass, test.rings, c='white', s=50)
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How KNN interprets the data')
plt.xlim(0, 30)
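# note: on this scale, distance is dominated by the 'rings' feature, so the
# white test point sits closest to the green training sample (rings=50, id=1),
# even though its length and mass look most similar to the blue sample (id=2)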

# ### How does StandardScaler solve the problem?
#
# [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is used for the "standardization" of features, also known as "center and scale" or "z-score normalization".
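# That is, each value is transformed as z = (x - feature mean) / feature standard deviation, computed column by column.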

# standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# original values
X.values

# standardized values
X_scaled

# figure out how it standardized
print scaler.mean_
print scaler.std_

# manually standardize
(X.values - scaler.mean_) / scaler.std_

# ### Applying StandardScaler to a real dataset
#
# - Wine dataset from the UCI Machine Learning Repository: [data](http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data), [data dictionary](http://archive.ics.uci.edu/ml/datasets/Wine)
# - **Goal:** Predict the origin of wine using chemical analysis

# read three columns from the dataset into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
col_names = ['label', 'color', 'proline']
wine = pd.read_csv(url, header=None, names=col_names, usecols=[0, 10, 13])

wine.head()

wine.describe()

# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label

# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# check that it standardized properly
print X_train_scaled[:, 0].mean()
print X_train_scaled[:, 0].std()
print X_train_scaled[:, 1].mean()
print X_train_scaled[:, 1].std()

# standardize X_test
X_test_scaled = scaler.transform(X_test)

# is this right?
print X_test_scaled[:, 0].mean()
print X_test_scaled[:, 0].std()
print X_test_scaled[:, 1].mean()
print X_test_scaled[:, 1].std()
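# yes: these means and standard deviations are not exactly 0 and 1 because the
# scaler was fit on X_train only, and X_test was transformed using the training
# set's means and standard deviations, which is the correct procedure (it keeps
# information from the testing set out of the preprocessing step)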

# KNN accuracy on original data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)

# KNN accuracy on scaled data
knn.fit(X_train_scaled, y_train)
y_pred_class = knn.predict(X_test_scaled)
print metrics.accuracy_score(y_test, y_pred_class)

# ## Pipeline (bonus content)
#
# ### What is the problem we're trying to solve?

# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label

# proper cross-validation on the original (unscaled) data
knn = KNeighborsClassifier(n_neighbors=3)
from sklearn.cross_validation import cross_val_score
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()

# why is this improper cross-validation on the scaled data?
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy').mean()
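# improper because the scaler was fit on the entire dataset, so every
# cross-validation fold was scaled using statistics that include the rows held
# out as that fold's testing data (information leaks into the training process)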

# ### How does Pipeline solve the problem?
#
# [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) is used for chaining steps together:

# fix the cross-validation process using Pipeline
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
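# within each cross-validation fold, the pipeline refits StandardScaler on the
# training portion only and then transforms the testing portion, so no
# information leaks from the held-out data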

# Pipeline can also be used with [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) for parameter searching:

# search for an optimal n_neighbors value using GridSearchCV
neighbors_range = range(1, 21)
param_grid = dict(kneighborsclassifier__n_neighbors=neighbors_range)
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print grid.best_score_
print grid.best_params_
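
# as a further sketch (illustrative, not part of the class walkthrough),
# RandomizedSearchCV from the same sklearn.grid_search module can sample
# parameter values instead of trying every one; n_iter and random_state here
# are arbitrary choices
from sklearn.grid_search import RandomizedSearchCV
param_dist = dict(kneighborsclassifier__n_neighbors=neighbors_range)
rand = RandomizedSearchCV(pipe, param_dist, cv=5, scoring='accuracy', n_iter=10, random_state=1)
rand.fit(X, y)
print rand.best_score_
print rand.best_params_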