forked from justmarkham/DAT8
Commit 7f84dae (1 parent: 3da688c)
Showing 7 changed files with 1,392 additions and 7 deletions.
@@ -505,8 +505,6 @@ Tuesday | Thursday
* [Unboxing the Random Forest Classifier](http://nerds.airbnb.com/unboxing-the-random-forest-classifier/) describes a way to interpret the inner workings of Random Forests beyond just feature importances.
* [Understanding Random Forests: From Theory to Practice](http://arxiv.org/pdf/1407.7502v3.pdf) is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.

<!--
-----

### Class 19: Advanced scikit-learn and Clustering

@@ -517,13 +515,17 @@ Tuesday | Thursday
* K-means: [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [visualization 1](http://tech.nitoyon.com/en/blog/2013/11/07/k-means/), [visualization 2](http://www.naftaliharris.com/blog/visualizing-k-means-clustering/)
* DBSCAN: [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html), [visualization](http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/)
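
For quick orientation, here is a minimal sketch of calling both clustering estimators (the toy data and parameter values are illustrative, not taken from the class notebook):

```python
# minimal sketch: K-means vs. DBSCAN on a toy 2-D dataset
import numpy as np
from sklearn.cluster import KMeans, DBSCAN

X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8], [9.0, 9.0]])

# K-means requires choosing the number of clusters up front
km_labels = KMeans(n_clusters=2, random_state=1).fit_predict(X)

# DBSCAN uses a neighborhood radius (eps) and a minimum sample count instead;
# points that belong to no cluster are labeled -1 (noise)
db_labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(X)

print(km_labels)
print(db_labels)
```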

**Homework:**
* Reread [Understanding the Bias-Variance Tradeoff](http://scott.fortmann-roe.com/docs/BiasVariance.html). (The "answers" to the [guiding questions](homework/09_bias_variance.md) have been posted and may be helpful to you.)
* **Optional:** Watch these two excellent (and related) videos from Caltech's Learning From Data course: [bias-variance tradeoff](http://work.caltech.edu/library/081.html) (15 minutes) and [regularization](http://work.caltech.edu/library/121.html) (8 minutes).

**scikit-learn Resources:**
* This is a longer example of [feature scaling](https://github.com/rasbt/pattern_classification/blob/master/preprocessing/about_standardization_normalization.ipynb) in scikit-learn, with additional discussion of the types of scaling you can use.
* [Practical Data Science in Python](http://radimrehurek.com/data_science_python/) is a long and well-written notebook that includes scikit-learn pipelining, plotting a learning curve, and pickling a model.
* [Practical Data Science in Python](http://radimrehurek.com/data_science_python/) is a long and well-written notebook that uses a few advanced scikit-learn features: pipelining, plotting a learning curve, and pickling a model.
* To learn how to use [GridSearchCV and RandomizedSearchCV](http://scikit-learn.org/stable/modules/grid_search.html) for parameter tuning, watch [How to find the best model parameters in scikit-learn](https://www.youtube.com/watch?v=Gol_qOgRqfA) (28 minutes) or read the [associated notebook](https://github.com/justmarkham/scikit-learn-videos/blob/master/08_grid_search.ipynb).
* Sebastian Raschka has a number of excellent resources for scikit-learn users, including a repository of [tutorials and examples](https://github.com/rasbt/pattern_classification), a library of machine learning [tools and extensions](http://rasbt.github.io/mlxtend/), a new [book](https://github.com/rasbt/python-machine-learning-book), and a semi-active [blog](http://sebastianraschka.com/blog/).
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/[email protected]/index.html) that is often much more useful than Stack Overflow for researching a particular function.
* If you forget how to use a particular function in scikit-learn, don't forget that this repository is fully searchable!
* scikit-learn has an incredibly active [mailing list](https://www.mail-archive.com/[email protected]/index.html) that is often much more useful than Stack Overflow for researching functions and asking questions.
* If you forget how to use a particular scikit-learn function that we have used in class, don't forget that this repository is fully searchable!

**Clustering Resources:**
* For a very thorough introduction to clustering, read chapter 8 (69 pages) of [Introduction to Data Mining](http://www-users.cs.umn.edu/~kumar/dmbook/index.php) (available as a free download), or browse through the chapter 8 slides.

@@ -532,18 +534,32 @@ Tuesday | Thursday
* An Introduction to Statistical Learning has useful videos on [K-means clustering](https://www.youtube.com/watch?v=aIybuNt9ps4&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2) (17 minutes) and [hierarchical clustering](https://www.youtube.com/watch?v=Tuuc9Y06tAc&list=PL5-da3qGB5IBC-MneTc9oBZz0C6kNJ-f2) (15 minutes).
* This is an excellent interactive visualization of [hierarchical clustering](https://joyofdata.shinyapps.io/hclust-shiny/).
* This is a nice animated explanation of [mean shift clustering](http://spin.atomicobject.com/2015/05/26/mean-shift-clustering/).
* The [K-modes algorithm](http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf) can be used for clustering datasets of categorical features without converting them to numerical values.
* Fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).
* The [K-modes algorithm](http://www.cs.ust.hk/~qyang/Teaching/537/Papers/huang98extensions.pdf) can be used for clustering datasets of categorical features without converting them to numerical values. Here is a [Python implementation](https://github.com/nicodv/kmodes).
* Here are some fun examples of clustering: [A Statistical Analysis of the Work of Bob Ross](http://fivethirtyeight.com/features/a-statistical-analysis-of-the-work-of-bob-ross/) (with [data and Python code](https://github.com/fivethirtyeight/data/tree/master/bob-ross)), [How a Math Genius Hacked OkCupid to Find True Love](http://www.wired.com/2014/01/how-to-hack-okcupid/all/), and [characteristics of your zip code](http://www.esri.com/landing-pages/tapestry/).

<!--
-----

### Class 20: Regularization and Regular Expressions
* Regularization ([notebook](notebooks/20_regularization.ipynb))
* Regression: [Ridge](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html), [RidgeCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.RidgeCV.html), [Lasso](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html), [LassoCV](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html)
* Classification: [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html)
* Regular expressions

**Homework:**
* Your final project is due next week!
* **Optional:** Make your final submissions to our Kaggle competition! It closes at 6:30pm ET on Tuesday 10/27.
* **Optional:** Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: [A Few Useful Things to Know about Machine Learning](http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf).

**Regularization Resources:**
* The scikit-learn user guide for [Generalized Linear Models](http://scikit-learn.org/stable/modules/linear_model.html) explains different variations of regularization.
* Section 6.2 of [An Introduction to Statistical Learning](http://www-bcf.usc.edu/~gareth/ISL/) (14 pages) introduces both ridge regression and lasso regression. Or, watch the related videos on [ridge regression](https://www.youtube.com/watch?v=cSKzqb0EKS0&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI&index=6) (13 minutes) and [lasso regression](https://www.youtube.com/watch?v=A5I1G1MfUmA&index=7&list=PL5-da3qGB5IB-Xdpj_uXJpLGiRfv9UVXI) (15 minutes).
* For more details on lasso regression, read Tibshirani's [original paper](http://statweb.stanford.edu/~tibs/lasso/lasso.pdf) from 1996.
* For a math-ier explanation of regularization, watch the last four videos (30 minutes) from week 3 of Andrew Ng's [machine learning course](https://www.coursera.org/learn/machine-learning/home/info), or read the [related lecture notes](http://www.holehouse.org/mlclass/07_Regularization.html) compiled by a student.
* This [notebook](https://github.com/luispedro/PenalizedRegression/blob/master/PenalizedRegression.ipynb) from chapter 7 of [Building Machine Learning Systems with Python](https://www.packtpub.com/big-data-and-business-intelligence/building-machine-learning-systems-python) has a nice long example of regularized linear regression.
* There are some special considerations when using dummy encoding for categorical features with a regularized model. This [Cross Validated Q&A](https://stats.stackexchange.com/questions/69568/whether-to-rescale-indicator-binary-dummy-predictors-for-lasso) debates whether the dummy variables should be standardized (along with the rest of the features), and a comment on this [blog post](http://appliedpredictivemodeling.com/blog/2013/10/23/the-basics-of-encoding-categorical-data-for-predictive-models) recommends that the baseline level should not be dropped.
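
Below is a minimal sketch of fitting the regularized estimators linked above on a synthetic regression problem (the data, alpha values, and other settings are illustrative assumptions, not course code):

```python
# minimal sketch: ridge vs. lasso regression on synthetic data
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# illustrative data: 10 features, only 3 of which are informative
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=10, random_state=1)

# alpha controls the strength of the penalty (larger = more regularization)
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# lasso tends to zero out the coefficients of uninformative features,
# while ridge only shrinks them toward zero
print(ridge.coef_)
print(lasso.coef_)
```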
-----
### Class 21: Course Review and Final Project Presentation
@@ -0,0 +1,208 @@
# # Advanced scikit-learn

# ## Agenda
#
# - StandardScaler
# - Pipeline (bonus content)

# ## StandardScaler
#
# ### What is the problem we're trying to solve?

# fake data
import pandas as pd
train = pd.DataFrame({'id':[0,1,2], 'length':[0.9,0.3,0.6], 'mass':[0.1,0.2,0.8], 'rings':[40,50,60]})
test = pd.DataFrame({'length':[0.59], 'mass':[0.79], 'rings':[54]})

# training data
train

# testing data
test

# define X and y
feature_cols = ['length', 'mass', 'rings']
X = train[feature_cols]
y = train.id

# KNN with K=1
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X, y)

# what "should" it predict?
knn.predict(test)

# allow plots to appear in the notebook
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 14
plt.rcParams['figure.figsize'] = (5, 5)

# create a "colors" array for plotting
import numpy as np
colors = np.array(['red', 'green', 'blue'])

# scatter plot of training data, colored by id (0=red, 1=green, 2=blue)
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)

# testing data
plt.scatter(test.mass, test.rings, c='white', s=50)

# add labels
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How we interpret the data')

# adjust the x-limits
plt.scatter(train.mass, train.rings, c=colors[train.id], s=50)
plt.scatter(test.mass, test.rings, c='white', s=50)
plt.xlabel('mass')
plt.ylabel('rings')
plt.title('How KNN interprets the data')
plt.xlim(0, 30)
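# note: on this scale, distance is dominated by the 'rings' feature, so the
# white test point sits closest to the green training sample (rings=50, id=1),
# even though its length and mass look most similar to the blue sample (id=2)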

# ### How does StandardScaler solve the problem?
#
# [StandardScaler](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) is used for the "standardization" of features, also known as "center and scale" or "z-score normalization".
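# That is, each value is transformed as z = (x - feature mean) / feature standard deviation, computed column by column.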

# standardize the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)
X_scaled = scaler.transform(X)

# original values
X.values

# standardized values
X_scaled

# figure out how it standardized
print scaler.mean_
print scaler.std_

# manually standardize
(X.values - scaler.mean_) / scaler.std_

# ### Applying StandardScaler to a real dataset
#
# - Wine dataset from the UCI Machine Learning Repository: [data](http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data), [data dictionary](http://archive.ics.uci.edu/ml/datasets/Wine)
# - **Goal:** Predict the origin of wine using chemical analysis

# read three columns from the dataset into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data'
col_names = ['label', 'color', 'proline']
wine = pd.read_csv(url, header=None, names=col_names, usecols=[0, 10, 13])

wine.head()

wine.describe()

# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label

# split into training and testing sets
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# standardize X_train
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)

# check that it standardized properly
print X_train_scaled[:, 0].mean()
print X_train_scaled[:, 0].std()
print X_train_scaled[:, 1].mean()
print X_train_scaled[:, 1].std()

# standardize X_test
X_test_scaled = scaler.transform(X_test)

# is this right?
print X_test_scaled[:, 0].mean()
print X_test_scaled[:, 0].std()
print X_test_scaled[:, 1].mean()
print X_test_scaled[:, 1].std()
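# yes: these means and standard deviations are not exactly 0 and 1 because the
# scaler was fit on X_train only, and X_test was transformed using the training
# set's means and standard deviations, which is the correct procedure (it keeps
# information from the testing set out of the preprocessing step)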

# KNN accuracy on original data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)

# KNN accuracy on scaled data
knn.fit(X_train_scaled, y_train)
y_pred_class = knn.predict(X_test_scaled)
print metrics.accuracy_score(y_test, y_pred_class)

# ## Pipeline (bonus content)
#
# ### What is the problem we're trying to solve?

# define X and y
feature_cols = ['color', 'proline']
X = wine[feature_cols]
y = wine.label

# proper cross-validation on the original (unscaled) data
knn = KNeighborsClassifier(n_neighbors=3)
from sklearn.cross_validation import cross_val_score
cross_val_score(knn, X, y, cv=5, scoring='accuracy').mean()

# why is this improper cross-validation on the scaled data?
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
cross_val_score(knn, X_scaled, y, cv=5, scoring='accuracy').mean()
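# improper because the scaler was fit on the entire dataset, so every
# cross-validation fold was scaled using statistics that include the rows held
# out as that fold's testing data (information leaks into the training process)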

# ### How does Pipeline solve the problem?
#
# [Pipeline](http://scikit-learn.org/stable/modules/pipeline.html) is used for chaining steps together:

# fix the cross-validation process using Pipeline
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()
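# within each cross-validation fold, the pipeline refits StandardScaler on the
# training portion only and then transforms the testing portion, so no
# information leaks from the held-out data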

# Pipeline can also be used with [GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.grid_search.GridSearchCV.html) for parameter searching:

# search for an optimal n_neighbors value using GridSearchCV
neighbors_range = range(1, 21)
param_grid = dict(kneighborsclassifier__n_neighbors=neighbors_range)
from sklearn.grid_search import GridSearchCV
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy')
grid.fit(X, y)
print grid.best_score_
print grid.best_params_
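
# as a further sketch (illustrative, not part of the class walkthrough),
# RandomizedSearchCV from the same sklearn.grid_search module can sample
# parameter values instead of trying every one; n_iter and random_state here
# are arbitrary choices
from sklearn.grid_search import RandomizedSearchCV
param_dist = dict(kneighborsclassifier__n_neighbors=neighbors_range)
rand = RandomizedSearchCV(pipe, param_dist, cv=5, scoring='accuracy', n_iter=10, random_state=1)
rand.fit(X, y)
print rand.best_score_
print rand.best_params_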