add all materials for classes 13 and 14
justmarkham committed Sep 25, 2015
1 parent f520232 commit 8c508fb
Showing 22 changed files with 2,625 additions and 2 deletions.
6 changes: 4 additions & 2 deletions README.md
@@ -53,6 +53,8 @@ Tuesday | Thursday

### [Comparison of machine learning models](other/model_comparison.md)

### [Comparison of model evaluation procedures and metrics](other/model_evaluation_comparison.md)

-----

### Class 1: Introduction to Data Science
@@ -350,8 +352,6 @@ Tuesday | Thursday
* My [simple guide to confusion matrix terminology](http://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/) may be useful to you as a reference.
* This notebook (from another DAT course) explains [how to calculate "expected value"](https://github.com/podopie/DAT18NYC/blob/master/classes/13-expected_value_cost_benefit_analysis.ipynb) from a confusion matrix by treating it as a cost-benefit matrix.

<!--
-----

### Class 13: Advanced Model Evaluation
@@ -416,6 +416,8 @@ Tuesday | Thursday
* If you enjoyed Paul Graham's article, you can read [his follow-up article](http://www.paulgraham.com/better.html) on how he improved his spam filter and this [related paper](http://www.merl.com/publications/docs/TR2004-091.pdf) about state-of-the-art spam filtering in 2004.
* Yelp has found that Naive Bayes is more effective than Mechanical Turks at [categorizing businesses](http://engineeringblog.yelp.com/2011/02/towards-building-a-high-quality-workforce-with-mechanical-turk.html).

<!--
-----
### Class 15: Natural Language Processing
198 changes: 198 additions & 0 deletions code/13_advanced_model_evaluation_nb.py
@@ -0,0 +1,198 @@
# # Data Preparation and Advanced Model Evaluation

# ## Agenda
#
# **Data preparation**
#
# - Handling missing values
# - Handling categorical features (review)
#
# **Advanced model evaluation**
#
# - ROC curves and AUC
# - Bonus: ROC curve is only sensitive to rank order of predicted probabilities
# - Cross-validation

# ## Part 1: Handling missing values

# scikit-learn models expect that all values are **numeric** and **hold meaning**. Thus, missing values are not allowed by scikit-learn.
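
# A minimal sketch (toy data, invented for illustration) of what happens if you
# try anyway: scikit-learn raises a ValueError when the input contains NaNs.
import numpy as np
from sklearn.linear_model import LogisticRegression
X_toy = np.array([[1.0, np.nan], [2.0, 3.0], [4.0, 5.0]])
y_toy = np.array([0, 1, 0])
try:
    LogisticRegression().fit(X_toy, y_toy)
except ValueError as e:
    print e
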

# read the Titanic data
import pandas as pd
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv'
titanic = pd.read_csv(url, index_col='PassengerId')


# check for missing values
titanic.isnull().sum()


# One possible strategy is to **drop missing values**:

# drop rows with any missing values
titanic.dropna().shape


# drop rows where Age is missing
titanic[titanic.Age.notnull()].shape


# Sometimes a better strategy is to **impute missing values**:

# mean Age
titanic.Age.mean()


# median Age
titanic.Age.median()


# most frequent Age
titanic.Age.value_counts().head(1).index


# fill missing values for Age with the median age
titanic.Age.fillna(titanic.Age.median(), inplace=True)


# Another strategy would be to build a **KNN model** just to impute missing values. How would we do that? (One rough approach is sketched below.)
#
# If values are missing from a categorical feature, we could treat the missing values as **another category**. Why might that make sense?
#
# How do we **choose** between all of these strategies?
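
# A rough sketch of the KNN idea (an illustration, not the course's prescribed
# approach): train on rows where Age is known, then predict Age for the rest.
# Age was already filled in above, so we re-read the data; the choice of Pclass
# and Fare as predictors is arbitrary.
from sklearn.neighbors import KNeighborsRegressor
titanic_raw = pd.read_csv(url, index_col='PassengerId')
age_known = titanic_raw[titanic_raw.Age.notnull()]
age_unknown = titanic_raw[titanic_raw.Age.isnull()]
knn = KNeighborsRegressor(n_neighbors=5)
knn.fit(age_known[['Pclass', 'Fare']], age_known.Age)
knn.predict(age_unknown[['Pclass', 'Fare']])
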

# ## Part 2: Handling categorical features (Review)

# How do we include a categorical feature in our model?
#
# - **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3; see the sketch below)
# - **Unordered categories:** use dummy encoding (0/1)
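
# tiny hypothetical sketch of the ordered case (the Size column is invented for
# illustration; it is not in the Titanic data)
demo = pd.DataFrame({'Size': ['small', 'large', 'medium', 'small']})
demo['Size_num'] = demo.Size.map({'small':1, 'medium':2, 'large':3})
demo
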

titanic.head(10)


# encode Sex_Female feature
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1})


# create a DataFrame of dummy variables
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked')
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True)

# concatenate the original DataFrame and the dummy DataFrame
titanic = pd.concat([titanic, embarked_dummies], axis=1)


titanic.head(1)


# - How do we **interpret** the encoding for Embarked?
# - Why didn't we just encode Embarked using a **single feature** (C=0, Q=1, S=2)?
# - Does it matter which category we choose to define as the **baseline**?
# - Why do we only need **two dummy variables** for Embarked?

# define X and y
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S']
X = titanic[feature_cols]
y = titanic.Survived

# train/test split
from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# train a logistic regression model
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9)
logreg.fit(X_train, y_train)

# make predictions for testing set
y_pred_class = logreg.predict(X_test)

# calculate testing accuracy
from sklearn import metrics
print metrics.accuracy_score(y_test, y_pred_class)


# ## Part 3: ROC curves and AUC

# predict probability of survival
y_pred_prob = logreg.predict_proba(X_test)[:, 1]


import matplotlib.pyplot as plt


# plot ROC curve
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.xlabel('False Positive Rate (1 - Specificity)')
plt.ylabel('True Positive Rate (Sensitivity)')


# calculate AUC
print metrics.roc_auc_score(y_test, y_pred_prob)


# Besides allowing you to calculate AUC, seeing the ROC curve can help you to choose a threshold that **balances sensitivity and specificity** in a way that makes sense for the particular context.
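
# One way to act on this (a sketch, not part of the original notebook): use the
# fpr, tpr, and thresholds arrays returned by roc_curve to inspect candidate
# thresholds. Note that roc_curve returns the thresholds in decreasing order.
def evaluate_threshold(threshold):
    print 'Sensitivity:', tpr[thresholds > threshold][-1]
    print 'Specificity:', 1 - fpr[thresholds > threshold][-1]

evaluate_threshold(0.5)
evaluate_threshold(0.3)
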

# histogram of predicted probabilities grouped by actual response value
df = pd.DataFrame({'probability':y_pred_prob, 'actual':y_test})
df.hist(column='probability', by='actual', sharex=True, sharey=True)


# What would have happened if you had used **y_pred_class** instead of **y_pred_prob** when drawing the ROC curve or calculating AUC?

# ROC curve using y_pred_class - WRONG!
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_class)
plt.plot(fpr, tpr)


# AUC using y_pred_class - WRONG!
print metrics.roc_auc_score(y_test, y_pred_class)


# If you use **y_pred_class**, `roc_curve` and `roc_auc_score` will interpret the zeros and ones as predicted probabilities of 0% and 100%, so the ROC curve collapses to a single intermediate point (plus the two endpoints).

# ## Bonus: ROC curve is only sensitive to rank order of predicted probabilities

# print the first 10 predicted probabilities
y_pred_prob[:10]


# take the square root of predicted probabilities (to make them all bigger)
import numpy as np
y_pred_prob_new = np.sqrt(y_pred_prob)

# print the modified predicted probabilities
y_pred_prob_new[:10]


# histogram of predicted probabilities has changed
df = pd.DataFrame({'probability':y_pred_prob_new, 'actual':y_test})
df.hist(column='probability', by='actual', sharex=True, sharey=True)


# ROC curve did not change
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob_new)
plt.plot(fpr, tpr)


# AUC did not change
print metrics.roc_auc_score(y_test, y_pred_prob_new)


# ## Part 4: Cross-validation

# calculate cross-validated AUC
from sklearn.cross_validation import cross_val_score
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()


# add Fare to the model
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S', 'Fare']
X = titanic[feature_cols]

# recalculate AUC
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
20 changes: 20 additions & 0 deletions code/13_bank_exercise_nb.py
@@ -0,0 +1,20 @@
# # Exercise with bank marketing data

# ## Introduction
#
# - Data from the UCI Machine Learning Repository: [data](https://github.com/justmarkham/DAT8/blob/master/data/bank-additional.csv), [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing)
# - **Goal:** Predict whether a customer will purchase a bank product marketed over the phone
# - `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website

# ## Step 1: Read the data into Pandas

# ## Step 2: Prepare at least three features
#
# - Include both numeric and categorical features
# - Choose features that you think might be related to the response (based on intuition or exploration)
# - Think about how to handle missing values (encoded as "unknown")

# ## Step 3: Model building
#
# - Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features (one possible starting point is sketched below)
# - Try to increase the AUC by selecting different sets of features
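
# ## One possible starting point
#
# A sketch only, not the official solution. It assumes the repo copy of
# `bank-additional.csv` keeps the UCI semicolon delimiter, and the feature
# choices below are illustrative rather than recommended.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import cross_val_score

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bank-additional.csv'
bank = pd.read_csv(url, sep=';')

# encode the response as 0/1
bank['outcome'] = bank.y.map({'no':0, 'yes':1})

# one way to handle "unknown": treat it as its own signal rather than dropping it
bank['default_flag'] = (bank.default != 'no').astype(int)

# evaluate a logistic regression model using cross-validated AUC
feature_cols = ['age', 'campaign', 'default_flag']
X = bank[feature_cols]
y = bank.outcome
logreg = LogisticRegression(C=1e9)
print cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean()
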
102 changes: 102 additions & 0 deletions code/14_bayes_theorem_iris_nb.py
@@ -0,0 +1,102 @@
# # Applying Bayes' theorem to iris classification
#
# Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris?

# ## Preparing the data
#
# We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer:

import pandas as pd
import numpy as np


# read the iris data into a DataFrame
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)
iris.head()


# apply the ceiling function to the numeric columns
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil)
iris.head()


# ## Deciding how to make a prediction
#
# Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species?

# show all observations with features: 7, 3, 5, 2
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)]


# count the species for these observations
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts()


# count the species for all observations
iris.species.value_counts()


# Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2?
#
# $$P(species \ | \ 7352)$$
#
# We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**:
#
# $$P(setosa \ | \ 7352)$$
# $$P(versicolor \ | \ 7352)$$
# $$P(virginica \ | \ 7352)$$

# ## Calculating the probability of each species
#
# **Bayes' theorem** gives us a way to calculate these conditional probabilities.
#
# Let's start with **versicolor**:
#
# $$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$
#
# We can calculate each of the terms on the right side of the equation:
#
# $$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$
#
# $$P(versicolor) = \frac {50} {150} = 0.33$$
#
# $$P(7352) = \frac {17} {150} = 0.11$$
#
# Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is:
#
# $$P(versicolor \ | \ 7352) = \frac {0.26 \times 0.33} {0.11} = 0.76$$
#
# Let's repeat this process for **virginica** and **setosa**:
#
# $$P(virginica \ | \ 7352) = \frac {0.08 \times 0.33} {0.11} = 0.24$$
#
# $$P(setosa \ | \ 7352) = \frac {0 \times 0.33} {0.11} = 0$$
#
# We predict that the iris is a versicolor, since that species had the **highest conditional probability**.
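
# A quick check (a sketch, not in the original notebook) that reproduces these
# numbers directly from the data:

is_7352 = (iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)
p_x_given_versicolor = is_7352[iris.species=='Iris-versicolor'].mean()  # 13/50 = 0.26
p_versicolor = (iris.species=='Iris-versicolor').mean()                 # 50/150 = 0.33
p_x = is_7352.mean()                                                    # 17/150 = 0.11
print p_x_given_versicolor * p_versicolor / p_x                         # 13/17, about 0.76
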

# ## Summary
#
# 1. We framed a **classification problem** as three conditional probability problems.
# 2. We used **Bayes' theorem** to calculate those conditional probabilities.
# 3. We made a **prediction** by choosing the species with the highest conditional probability.

# ## Bonus: The intuition behind Bayes' theorem
#
# Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense:
#
# Pretend that **more of the existing versicolors had measurements of 7352:**
#
# - $P(7352 \ | \ versicolor)$ would increase, thus increasing the numerator.
# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase.
#
# Pretend that **most of the existing irises were versicolor:**
#
# - $P(versicolor)$ would increase, thus increasing the numerator.
# - It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase.
#
# Pretend that **17 of the setosas had measurements of 7352:**
#
# - $P(7352)$ would double, thus doubling the denominator.
# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half.
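#
# As a quick arithmetic check of that last point: $P(7352)$ would become $\frac {34} {150} \approx 0.23$, and so
#
# $$P(versicolor \ | \ 7352) = \frac {0.26 \times 0.33} {0.23} \approx 0.38$$
#
# which is about half of the original 0.76.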