add all materials for classes 13 and 14

yaoweihu · Sep 25, 2015 · 8c508fb
1 parent f520232
commit 8c508fb
Showing 22 changed files with 2,625 additions and 2 deletions.
diff --git a/README.md b/README.md
@@ -53,6 +53,8 @@ Tuesday | Thursday
 
 ### [Comparison of machine learning models](other/model_comparison.md)
 
+### [Comparison of model evaluation procedures and metrics](other/model_evaluation_comparison.md)
+
 -----
 
 ### Class 1: Introduction to Data Science
@@ -350,8 +352,6 @@ Tuesday | Thursday
 * My [simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) may be useful to you as a reference.
 * This notebook (from another DAT course) explains [how to calculate "expected value"](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb) from a confusion matrix by treating it as a cost-benefit matrix.
 
-<!--
-
 -----
 
 ### Class 13: Advanced Model Evaluation
@@ -416,6 +416,8 @@ Tuesday | Thursday
 * If you enjoyed Paul Graham's article, you can read [his follow-up article](http://www.paulgraham.com/better.html) on how he improved his spam filter and this [related paper](http://www.merl.com/publications/docs/TR2004-091.pdf) about state-of-the-art spam filtering in 2004.
 * Yelp has found that Naive Bayes is more effective than Mechanical Turks at [categorizing businesses](http://engineeringblog.yelp.com/2011/02/towards-building-a-high-quality-workforce-with-mechanical-turk.html).
 
+<!--
+
 -----
 
 ### Class 15: Natural Language Processing

diff --git a/code/13_advanced_model_evaluation_nb.py b/code/13_advanced_model_evaluation_nb.py
@@ -0,0 +1,198 @@
+# # Data Preparation and Advanced Model Evaluation
+
+# ## Agenda
+# 
+# **Data preparation**
+# 
+# - Handling missing values
+# - Handling categorical features (review)
+# 
+# **Advanced model evaluation**
+# 
+# - ROC curves and AUC
+# - Bonus: ROC curve is only sensitive to rank order of predicted probabilities
+# - Cross-validation
+
+# ## Part 1: Handling missing values
+
+# scikit-learn models expect that all values are **numeric** and **hold meaning**. Thus, most scikit-learn models do not accept missing values.
+
+# read the Titanic data
+import pandas as pd
+url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
+titanic = pd.read_csv(url, index_col='PassengerId')
+
+
+# check for missing values
+titanic.isnull().sum()
+
+
+# One possible strategy is to **drop missing values**:
+
+# drop rows with any missing values
+titanic.dropna().shape
+
+
+# drop rows where Age is missing
+titanic[titanic.Age.notnull()].shape
+
+
+# Sometimes a better strategy is to **impute missing values**:
+
+# mean Age
+titanic.Age.mean()
+
+
+# median Age
+titanic.Age.median()
+
+
+# most frequent Age
+titanic.Age.value_counts().head(1).index
+
+
+# fill missing values for Age with the median age
+titanic['Age'] = titanic.Age.fillna(titanic.Age.median())
+
+
+# Another strategy would be to build a **KNN model** just to impute missing values. How would we do that?
+# 
+# If values are missing from a categorical feature, we could treat the missing values as **another category**. Why might that make sense?
+# 
+# How do we **choose** between all of these strategies?
+
+# ## Part 2: Handling categorical features (Review)
+
+# How do we include a categorical feature in our model?
+# 
+# - **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3)
+# - **Unordered categories:** use dummy encoding (0/1)
+
+titanic.head(10)
+
+
+# encode Sex_Female feature
+titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})
+
+
+# create a DataFrame of dummy variables
+embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
+embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)
+
+# concatenate the original DataFrame and the dummy DataFrame
+titanic = pd.concat([titanic, embarked_dummies], axis=1)
+
+
+titanic.head(1)
+
+
+# - How do we **interpret** the encoding for Embarked?
+# - Why didn't we just encode Embarked using a **single feature** (C=0, Q=1, S=2)?
+# - Does it matter which category we choose to define as the **baseline**?
+# - Why do we only need **two dummy variables** for Embarked?
+
+# define X and y
+feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
+X = titanic[feature_cols]
+y = titanic.Survived
+
+# train/test split
+from sklearn.model_selection import train_test_split
+X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
+
+# train a logistic regression model
+from sklearn.linear_model import LogisticRegression
+logreg = LogisticRegression(C=1e9)
+logreg.fit(X_train, y_train)
+
+# make predictions for testing set
+y_pred_class = logreg.predict(X_test)
+
+# calculate testing accuracy
+from sklearn import metrics
+print(metrics.accuracy_score(y_test, y_pred_class))
+
+
+# ## Part 3: ROC curves and AUC
+
+# predict probability of survival
+y_pred_prob = logreg.predict_proba(X_test)[:, 1]
+
+
+import matplotlib.pyplot as plt
+
+
+# plot ROC curve
+fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
+plt.plot(fpr, tpr)
+plt.xlim([0.0, 1.0])
+plt.ylim([0.0, 1.0])
+plt.xlabel('False Positive Rate (1 - Specificity)')
+plt.ylabel('True Positive Rate (Sensitivity)')
+
+
+# calculate AUC
+print(metrics.roc_auc_score(y_test, y_pred_prob))
+
+
+# Besides allowing you to calculate AUC, seeing the ROC curve can help you to choose a threshold that **balances sensitivity and specificity** in a way that makes sense for the particular context.
+
+# histogram of predicted probabilities grouped by actual response value
+df = pd.DataFrame({'probability':y_pred_prob, 'actual':y_test})
+df.hist(column='probability', by='actual', sharex=True, sharey=True)
+
+
+# What would have happened if you had used **y_pred_class** instead of **y_pred_prob** when drawing the ROC curve or calculating AUC?
+
+# ROC curve using y_pred_class - WRONG!
+fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_class)
+plt.plot(fpr, tpr)
+
+
+# AUC using y_pred_class - WRONG!
+print(metrics.roc_auc_score(y_test, y_pred_class))
+
+
+# If you use **y_pred_class**, it will interpret the zeros and ones as predicted probabilities of 0% and 100%.
+
+# ## Bonus: ROC curve is only sensitive to rank order of predicted probabilities
+
+# print the first 10 predicted probabilities
+y_pred_prob[:10]
+
+
+# take the square root of predicted probabilities (to make them all bigger)
+import numpy as np
+y_pred_prob_new = np.sqrt(y_pred_prob)
+
+# print the modified predicted probabilities
+y_pred_prob_new[:10]
+
+
+# histogram of predicted probabilities has changed
+df = pd.DataFrame({'probability':y_pred_prob_new, 'actual':y_test})
+df.hist(column='probability', by='actual', sharex=True, sharey=True)
+
+
+# ROC curve did not change
+fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob_new)
+plt.plot(fpr, tpr)
+
+
+# AUC did not change
+print(metrics.roc_auc_score(y_test, y_pred_prob_new))
+
+
+# ## Part 4: Cross-validation
+
+# calculate cross-validated AUC
+from sklearn.model_selection import cross_val_score
+cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
+
+
+# add Fare to the model
+feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S', 'Fare']
+X = titanic[feature_cols]
+
+# recalculate AUC
+cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
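
The bonus section of `13_advanced_model_evaluation_nb.py` states that the ROC curve (and therefore AUC) depends only on the rank order of the predicted probabilities. A minimal sketch of why, using only the standard library: AUC equals the fraction of (positive, negative) pairs in which the positive observation receives the higher score, with ties counted as half. The helper `auc` and the toy labels and probabilities below are hypothetical and are not taken from the Titanic data. This pairwise formulation is also not how `metrics.roc_auc_score` is implemented, although it should give the same value.

```python
import math

def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# hypothetical actual classes and predicted probabilities
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.20, 0.30]

# the square root makes every probability bigger but preserves their order
probs_sqrt = [math.sqrt(p) for p in probs]

# hard class predictions at a 0.5 threshold (the "WRONG" input for AUC)
classes = [1 if p >= 0.5 else 0 for p in probs]

auc_original = auc(y_true, probs)
auc_sqrt = auc(y_true, probs_sqrt)
auc_classes = auc(y_true, classes)

print(auc_original, auc_sqrt, auc_classes)
```

Because only comparisons between scores enter the calculation, any strictly increasing transformation leaves the result unchanged. Passing class predictions collapses the scores to two values, which creates many ties and discards ranking information.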
diff --git a/code/13_bank_exercise_nb.py b/code/13_bank_exercise_nb.py
@@ -0,0 +1,20 @@
+# # Exercise with bank marketing data
+
+# ## Introduction
+# 
+# - Data from the UCI Machine Learning Repository: [data](https://github.com/justmarkham/DAT8/blob/master/data/bank-additional.csv), [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
+# - **Goal:** Predict whether a customer will purchase a bank product marketed over the phone
+# - `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website
+
+# ## Step 1: Read the data into Pandas
+
+# ## Step 2: Prepare at least three features
+# 
+# - Include both numeric and categorical features
+# - Choose features that you think might be related to the response (based on intuition or exploration)
+# - Think about how to handle missing values (encoded as "unknown")
+
+# ## Step 3: Model building
+# 
+# - Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features
+# - Try to increase the AUC by selecting different sets of features
diff --git a/code/14_bayes_theorem_iris_nb.py b/code/14_bayes_theorem_iris_nb.py
@@ -0,0 +1,102 @@
+# # Applying Bayes' theorem to iris classification
+# 
+# Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?
+
+# ## Preparing the data
+# 
+# We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer:
+
+import pandas as pd
+import numpy as np
+
+
+# read the iris data into a DataFrame
+url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
+col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
+iris = pd.read_csv(url, header=None, names=col_names)
+iris.head()
+
+
+# apply the ceiling function to the numeric columns
+iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
+iris.head()
+
+
+# ## Deciding how to make a prediction
+# 
+# Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?
+
+# show all observations with features: 7, 3, 5, 2
+iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]
+
+
+# count the species for these observations
+iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()
+
+
+# count the species for all observations
+iris.species.value_counts()
+
+
+# Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?
+# 
+# $$P(species \ | \ 7352)$$
+# 
+# We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:
+# 
+# $$P(setosa \ | \ 7352)$$
+# $$P(versicolor \ | \ 7352)$$
+# $$P(virginica \ | \ 7352)$$
+
+# ## Calculating the probability of each species
+# 
+# **Bayes' theorem** gives us a way to calculate these conditional probabilities.
+# 
+# Let's start with **versicolor**:
+# 
+# $$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$
+# 
+# We can calculate each of the terms on the right side of the equation:
+# 
+# $$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$
+# 
+# $$P(versicolor) = \frac {50} {150} = 0.33$$
+# 
+# $$P(7352) = \frac {17} {150} = 0.11$$
+# 
+# Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:
+# 
+# $$P(versicolor \ | \ 7352) = \frac {\frac {13} {50} \times \frac {50} {150}} {\frac {17} {150}} = \frac {13} {17} \approx 0.76$$
+# 
+# Let's repeat this process for **virginica** and **setosa**:
+# 
+# $$P(virginica \ | \ 7352) = \frac {0.08 \times 0.33} {0.11} = 0.24$$
+# 
+# $$P(setosa \ | \ 7352) = \frac {0 \times 0.33} {0.11} = 0$$
+# 
+# We predict that the iris is a versicolor, since that species had the **highest conditional probability**.
+
+# ## Summary
+# 
+# 1. We framed a **classification problem** as three conditional probability problems.
+# 2. We used **Bayes' theorem** to calculate those conditional probabilities.
+# 3. We made a **prediction** by choosing the species with the highest conditional probability.
+
+# ## Bonus: The intuition behind Bayes' theorem
+# 
+# Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:
+# 
+# Pretend that **more of the existing versicolors had measurements of 7352:**
+# 
+# - $P(7352 \ | \ versicolor)$ would increase, thus increasing the numerator.
+# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase.
+# 
+# Pretend that **most of the existing irises were versicolor:**
+# 
+# - $P(versicolor)$ would increase, thus increasing the numerator.
+# - It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase.
+# 
+# Pretend that **17 of the setosas had measurements of 7352:**
+# 
+# - $P(7352)$ would double, thus doubling the denominator.
+# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half.
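
The Bayes' theorem calculation in `14_bayes_theorem_iris_nb.py` can be reproduced with a short script. This is a sketch under stated assumptions: the counts are the ones quoted in the worked example (13 versicolor and 4 virginica among the 17 irises measuring 7, 3, 5, 2, with 50 irises per species), hard-coded here rather than recomputed from the dataset. The names `matches`, `totals`, and `posterior` are hypothetical.

```python
from fractions import Fraction

# counts from the worked example: irises per species with measurements 7, 3, 5, 2
matches = {'setosa': 0, 'versicolor': 13, 'virginica': 4}

# number of irises of each species in the full dataset
totals = {'setosa': 50, 'versicolor': 50, 'virginica': 50}

n_all = sum(totals.values())     # all observations
n_match = sum(matches.values())  # observations with measurements 7, 3, 5, 2

def posterior(species):
    """P(species | 7352) via Bayes' theorem, using exact fractions."""
    likelihood = Fraction(matches[species], totals[species])  # P(7352 | species)
    prior = Fraction(totals[species], n_all)                  # P(species)
    evidence = Fraction(n_match, n_all)                       # P(7352)
    return likelihood * prior / evidence

posteriors = {species: posterior(species) for species in totals}

# predict the species with the highest conditional probability
prediction = max(posteriors, key=posteriors.get)

for species, prob in posteriors.items():
    print(species, prob, round(float(prob), 2))
print('prediction:', prediction)
```

Exact fractions avoid the small drift that comes from rounding the intermediate terms to two decimals: 0.26 × 0.33 / 0.11 evaluates to 0.78, whereas the exact ratio 13/17 rounds to 0.76.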