forked from justmarkham/DAT8
Commit
add all materials for classes 13 and 14
1 parent f520232, commit 8c508fb
Showing 22 changed files with 2,625 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
# # Data Preparation and Advanced Model Evaluation | ||
|
||
# ## Agenda | ||
# | ||
# **Data preparation** | ||
# | ||
# - Handling missing values | ||
# - Handling categorical features (review) | ||
# | ||
# **Advanced model evaluation** | ||
# | ||
# - ROC curves and AUC | ||
# - Bonus: ROC curve is only sensitive to rank order of predicted probabilities | ||
# - Cross-validation | ||
|
||
# ## Part 1: Handling missing values | ||
|
||
# scikit-learn models expect that all values are **numeric** and **hold meaning**. Thus, most scikit-learn models do not accept missing values. | ||
|
||
# read the Titanic data | ||
import pandas as pd | ||
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/titanic.csv' | ||
titanic = pd.read_csv(url, index_col='PassengerId') | ||
|
||
|
||
# check for missing values | ||
titanic.isnull().sum() | ||
|
||
|
||
# One possible strategy is to **drop missing values**: | ||
|
||
# drop rows with any missing values | ||
titanic.dropna().shape | ||
|
||
|
||
# drop rows where Age is missing | ||
titanic[titanic.Age.notnull()].shape | ||
|
||
|
||
# Sometimes a better strategy is to **impute missing values**: | ||
|
||
# mean Age | ||
titanic.Age.mean() | ||
|
||
|
||
# median Age | ||
titanic.Age.median() | ||
|
||
|
||
# most frequent Age | ||
titanic.Age.value_counts().head(1).index | ||
|
||
|
||
# fill missing values for Age with the median age | ||
titanic['Age'] = titanic.Age.fillna(titanic.Age.median()) | ||
|
||
|
||
# Another strategy would be to build a **KNN model** just to impute missing values. How would we do that? | ||
# | ||
# If values are missing from a categorical feature, we could treat the missing values as **another category**. Why might that make sense? | ||
# | ||
# How do we **choose** between all of these strategies? | ||
|
||
# ## Part 2: Handling categorical features (Review) | ||
|
||
# How do we include a categorical feature in our model? | ||
# | ||
# - **Ordered categories:** transform them to sensible numeric values (example: small=1, medium=2, large=3) | ||
# - **Unordered categories:** use dummy encoding (0/1) | ||
|
||
titanic.head(10) | ||
|
||
|
||
# encode Sex_Female feature | ||
titanic['Sex_Female'] = titanic.Sex.map({'male':0, 'female':1}) | ||
|
||
|
||
# create a DataFrame of dummy variables | ||
embarked_dummies = pd.get_dummies(titanic.Embarked, prefix='Embarked') | ||
embarked_dummies.drop(embarked_dummies.columns[0], axis=1, inplace=True) | ||
|
||
# concatenate the original DataFrame and the dummy DataFrame | ||
titanic = pd.concat([titanic, embarked_dummies], axis=1) | ||
|
||
|
||
titanic.head(1) | ||
|
||
|
||
# - How do we **interpret** the encoding for Embarked? | ||
# - Why didn't we just encode Embarked using a **single feature** (C=0, Q=1, S=2)? | ||
# - Does it matter which category we choose to define as the **baseline**? | ||
# - Why do we only need **two dummy variables** for Embarked? | ||
|
||
# define X and y | ||
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S'] | ||
X = titanic[feature_cols] | ||
y = titanic.Survived | ||
|
||
# train/test split | ||
from sklearn.model_selection import train_test_split | ||
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) | ||
|
||
# train a logistic regression model | ||
from sklearn.linear_model import LogisticRegression | ||
logreg = LogisticRegression(C=1e9) | ||
logreg.fit(X_train, y_train) | ||
|
||
# make predictions for testing set | ||
y_pred_class = logreg.predict(X_test) | ||
|
||
# calculate testing accuracy | ||
from sklearn import metrics | ||
print(metrics.accuracy_score(y_test, y_pred_class)) | ||
|
||
|
||
# ## Part 3: ROC curves and AUC | ||
|
||
# predict probability of survival | ||
y_pred_prob = logreg.predict_proba(X_test)[:, 1] | ||
|
||
|
||
import matplotlib.pyplot as plt | ||
|
||
|
||
# plot ROC curve | ||
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob) | ||
plt.plot(fpr, tpr) | ||
plt.xlim([0.0, 1.0]) | ||
plt.ylim([0.0, 1.0]) | ||
plt.xlabel('False Positive Rate (1 - Specificity)') | ||
plt.ylabel('True Positive Rate (Sensitivity)') | ||
|
||
|
||
# calculate AUC | ||
print(metrics.roc_auc_score(y_test, y_pred_prob)) | ||
|
||
|
||
# Besides allowing you to calculate AUC, seeing the ROC curve can help you to choose a threshold that **balances sensitivity and specificity** in a way that makes sense for the particular context. | ||
|
||
# histogram of predicted probabilities grouped by actual response value | ||
df = pd.DataFrame({'probability':y_pred_prob, 'actual':y_test}) | ||
df.hist(column='probability', by='actual', sharex=True, sharey=True) | ||
|
||
|
||
# What would have happened if you had used **y_pred_class** instead of **y_pred_prob** when drawing the ROC curve or calculating AUC? | ||
|
||
# ROC curve using y_pred_class - WRONG! | ||
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_class) | ||
plt.plot(fpr, tpr) | ||
|
||
|
||
# AUC using y_pred_class - WRONG! | ||
print(metrics.roc_auc_score(y_test, y_pred_class)) | ||
|
||
|
||
# If you use **y_pred_class**, it will interpret the zeros and ones as predicted probabilities of 0% and 100%, so the ROC curve collapses to a single point joined to the corners (0, 0) and (1, 1). | ||
|
||
# ## Bonus: ROC curve is only sensitive to rank order of predicted probabilities | ||
|
||
# print the first 10 predicted probabilities | ||
y_pred_prob[:10] | ||
|
||
|
||
# take the square root of predicted probabilities (to make them all bigger) | ||
import numpy as np | ||
y_pred_prob_new = np.sqrt(y_pred_prob) | ||
|
||
# print the modified predicted probabilities | ||
y_pred_prob_new[:10] | ||
|
||
|
||
# histogram of predicted probabilities has changed | ||
df = pd.DataFrame({'probability':y_pred_prob_new, 'actual':y_test}) | ||
df.hist(column='probability', by='actual', sharex=True, sharey=True) | ||
|
||
|
||
# ROC curve did not change | ||
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob_new) | ||
plt.plot(fpr, tpr) | ||
|
||
|
||
# AUC did not change | ||
print(metrics.roc_auc_score(y_test, y_pred_prob_new)) | ||
|
||
|
||
# ## Part 4: Cross-validation | ||
|
||
# calculate cross-validated AUC | ||
from sklearn.model_selection import cross_val_score | ||
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean() | ||
|
||
|
||
# add Fare to the model | ||
feature_cols = ['Pclass', 'Parch', 'Age', 'Sex_Female', 'Embarked_Q', 'Embarked_S', 'Fare'] | ||
X = titanic[feature_cols] | ||
|
||
# recalculate AUC | ||
cross_val_score(logreg, X, y, cv=10, scoring='roc_auc').mean() |
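The Bonus section's claim that AUC depends only on the rank order of the predicted probabilities can be checked without scikit-learn. The sketch below is an illustration under stated assumptions, not scikit-learn's implementation: `pairwise_auc` is a hypothetical helper that computes AUC as the fraction of (positive, negative) pairs in which the positive gets the higher score, and the labels and probabilities are made up rather than taken from the Titanic data.

```python
import math


def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Hypothetical helper for illustration.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# made-up labels and predicted probabilities (assumption, not course data)
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
probs = [0.10, 0.35, 0.40, 0.45, 0.60, 0.80, 0.65, 0.90]

# AUC on the original probabilities
auc_original = pairwise_auc(y_true, probs)

# AUC after a monotonic transformation (square root), as in the Bonus section
auc_sqrt = pairwise_auc(y_true, [math.sqrt(p) for p in probs])

print(auc_original, auc_sqrt)
```

Because the square root preserves the ordering of the scores, every pair comparison has the same outcome before and after the transformation, so the two values agree.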
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
# # Exercise with bank marketing data | ||
|
||
# ## Introduction | ||
# | ||
# - Data from the UCI Machine Learning Repository: [data](https://github.com/justmarkham/DAT8/blob/master/data/bank-additional.csv), [data dictionary](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) | ||
# - **Goal:** Predict whether a customer will purchase a bank product marketed over the phone | ||
# - `bank-additional.csv` is already in our repo, so there is no need to download the data from the UCI website | ||
|
||
# ## Step 1: Read the data into Pandas | ||
|
||
# ## Step 2: Prepare at least three features | ||
# | ||
# - Include both numeric and categorical features | ||
# - Choose features that you think might be related to the response (based on intuition or exploration) | ||
# - Think about how to handle missing values (encoded as "unknown") | ||
|
||
# ## Step 3: Model building | ||
# | ||
# - Use cross-validation to evaluate the AUC of a logistic regression model with your chosen features | ||
# - Try to increase the AUC by selecting different sets of features |
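For Step 2, one option is to keep "unknown" as its own category when dummy encoding, rather than dropping or imputing those rows. The following is a minimal pure-Python sketch of that idea, not the pandas implementation: `dummy_encode` is a hypothetical helper, and the `marital` values are illustrative rather than read from `bank-additional.csv`.

```python
def dummy_encode(values, prefix, baseline):
    """Return {column name: list of 0/1 flags}, one column per non-baseline category.

    Hypothetical helper for illustration only.
    """
    columns = {}
    for category in sorted(set(values)):
        if category == baseline:
            # the baseline is implied when all other flags are 0
            continue
        columns[prefix + '_' + category] = [1 if v == category else 0 for v in values]
    return columns


# illustrative values (assumption); "unknown" marks a missing entry
marital = ['married', 'single', 'unknown', 'divorced', 'married']

# "unknown" is kept as a category, so it gets its own 0/1 column
encoded = dummy_encode(marital, prefix='marital', baseline='divorced')

for name, flags in encoded.items():
    print(name, flags)
```

With four categories and one baseline, three dummy columns are enough: a row of all zeros means the baseline category.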
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,102 @@ | ||
# # Applying Bayes' theorem to iris classification | ||
# | ||
# Can **Bayes' theorem** help us to solve a **classification problem**, namely predicting the species of an iris? | ||
|
||
# ## Preparing the data | ||
# | ||
# We'll read the iris data into a DataFrame, and **round up** all of the measurements to the next integer: | ||
|
||
import pandas as pd | ||
import numpy as np | ||
|
||
|
||
# read the iris data into a DataFrame | ||
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data' | ||
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'] | ||
iris = pd.read_csv(url, header=None, names=col_names) | ||
iris.head() | ||
|
||
|
||
# apply the ceiling function to the numeric columns | ||
iris.loc[:, 'sepal_length':'petal_width'] = iris.loc[:, 'sepal_length':'petal_width'].apply(np.ceil) | ||
iris.head() | ||
|
||
|
||
# ## Deciding how to make a prediction | ||
# | ||
# Let's say that I have an **out-of-sample iris** with the following measurements: **7, 3, 5, 2**. How might I predict the species? | ||
|
||
# show all observations with features: 7, 3, 5, 2 | ||
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)] | ||
|
||
|
||
# count the species for these observations | ||
iris[(iris.sepal_length==7) & (iris.sepal_width==3) & (iris.petal_length==5) & (iris.petal_width==2)].species.value_counts() | ||
|
||
|
||
# count the species for all observations | ||
iris.species.value_counts() | ||
|
||
|
||
# Let's frame this as a **conditional probability problem**: What is the probability of some particular species, given the measurements 7, 3, 5, and 2? | ||
# | ||
# $$P(species \ | \ 7352)$$ | ||
# | ||
# We could calculate the conditional probability for **each of the three species**, and then predict the species with the **highest probability**: | ||
# | ||
# $$P(setosa \ | \ 7352)$$ | ||
# $$P(versicolor \ | \ 7352)$$ | ||
# $$P(virginica \ | \ 7352)$$ | ||
|
||
# ## Calculating the probability of each species | ||
# | ||
# **Bayes' theorem** gives us a way to calculate these conditional probabilities. | ||
# | ||
# Let's start with **versicolor**: | ||
# | ||
# $$P(versicolor \ | \ 7352) = \frac {P(7352 \ | \ versicolor) \times P(versicolor)} {P(7352)}$$ | ||
# | ||
# We can calculate each of the terms on the right side of the equation: | ||
# | ||
# $$P(7352 \ | \ versicolor) = \frac {13} {50} = 0.26$$ | ||
# | ||
# $$P(versicolor) = \frac {50} {150} = 0.33$$ | ||
# | ||
# $$P(7352) = \frac {17} {150} = 0.11$$ | ||
# | ||
# Therefore, Bayes' theorem says the **probability of versicolor given these measurements** is: | ||
# | ||
# $$P(versicolor \ | \ 7352) = \frac {13/50 \times 50/150} {17/150} = \frac {13} {17} \approx 0.76$$ | ||
# | ||
# Let's repeat this process for **virginica** and **setosa**: | ||
# | ||
# $$P(virginica \ | \ 7352) = \frac {0.08 \times 0.33} {0.11} = 0.24$$ | ||
# | ||
# $$P(setosa \ | \ 7352) = \frac {0 \times 0.33} {0.11} = 0$$ | ||
# | ||
# We predict that the iris is a versicolor, since that species had the **highest conditional probability**. | ||
|
||
# ## Summary | ||
# | ||
# 1. We framed a **classification problem** as three conditional probability problems. | ||
# 2. We used **Bayes' theorem** to calculate those conditional probabilities. | ||
# 3. We made a **prediction** by choosing the species with the highest conditional probability. | ||
|
||
# ## Bonus: The intuition behind Bayes' theorem | ||
# | ||
# Let's make some hypothetical adjustments to the data, to demonstrate how Bayes' theorem makes intuitive sense: | ||
# | ||
# Pretend that **more of the existing versicolors had measurements of 7352:** | ||
# | ||
# - $P(7352 \ | \ versicolor)$ would increase, thus increasing the numerator. | ||
# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would also increase. | ||
# | ||
# Pretend that **most of the existing irises were versicolor:** | ||
# | ||
# - $P(versicolor)$ would increase, thus increasing the numerator. | ||
# - It would make sense that the probability of any iris being a versicolor (regardless of measurements) would also increase. | ||
# | ||
# Pretend that **17 of the setosas had measurements of 7352:** | ||
# | ||
# - $P(7352)$ would double, thus doubling the denominator. | ||
# - It would make sense that given an iris with measurements of 7352, the probability of it being a versicolor would be cut in half. |
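The Bayes' theorem calculation above can be reproduced from the counts quoted in the text. This is a sketch under assumptions: the counts are hard-coded from the worked example (13 versicolor and 17 irises in total with measurements 7352, 50 irises per species; the 4 virginica matches are inferred from 0.08 = 4/50), and the variable names are illustrative.

```python
from fractions import Fraction

# counts taken from the worked example (hard-coded, not recomputed from the data)
class_totals = {'setosa': 50, 'versicolor': 50, 'virginica': 50}
matches = {'setosa': 0, 'versicolor': 13, 'virginica': 4}  # irises measuring 7352

n_total = sum(class_totals.values())  # 150 irises overall
n_match = sum(matches.values())       # 17 irises measuring 7352

posteriors = {}
for species in class_totals:
    likelihood = Fraction(matches[species], class_totals[species])  # P(7352 | species)
    prior = Fraction(class_totals[species], n_total)                # P(species)
    evidence = Fraction(n_match, n_total)                           # P(7352)
    posteriors[species] = likelihood * prior / evidence             # Bayes' theorem

# predict the species with the highest conditional probability
prediction = max(posteriors, key=posteriors.get)

for species, p in posteriors.items():
    print(species, p, round(float(p), 2))
print('prediction:', prediction)
```

Using exact fractions avoids rounding the intermediate terms: the posterior for versicolor is 13/17 and for virginica 4/17, which round to the 0.76 and 0.24 shown above.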