# # K-nearest neighbors and scikit-learn
# ## Agenda
#
# 1. Review of the iris dataset
# 2. Human learning on the iris dataset
# 3. K-nearest neighbors (KNN) classification
# 4. Review of supervised learning
# 5. Benefits and drawbacks of scikit-learn
# 6. Requirements for working with data in scikit-learn
# 7. scikit-learn's 4-step modeling pattern
# 8. Tuning a KNN model
# 9. Comparing KNN with other models
#
# ## Lesson goals
#
# 1. Learn how the modeling process works
# 2. Learn how scikit-learn works
# 3. Learn how KNN works
# ## Review of the iris dataset
# read the iris data into a DataFrame
import pandas as pd
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
col_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
iris = pd.read_csv(url, header=None, names=col_names)
iris.head()
# ### Terminology
#
# - **150 observations** (n=150): each observation is one iris flower
# - **4 features** (p=4): sepal length, sepal width, petal length, and petal width
# - **Response**: iris species
# - **Classification problem** since response is categorical
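# a quick check of these numbers: .shape returns (observations, columns), and
# .value_counts() confirms there are 50 flowers of each species
print(iris.shape)
print(iris.species.value_counts())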
# ## Human learning on the iris dataset
#
# How did we (as humans) predict the species of an iris?
#
# 1. We observed that the different species had (somewhat) dissimilar measurements.
# 2. We focused on features that seemed to correlate with the response.
# 3. We created a set of rules (using those features) to predict the species of an unknown iris.
#
# We assumed that if an **unknown iris** has measurements similar to **previous irises**, then its species is most likely the same as those previous irises.
# allow plots to appear in the notebook
import matplotlib.pyplot as plt
# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (6, 4)
plt.rcParams['font.size'] = 14
# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})
# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold)
# create a scatter plot of SEPAL LENGTH versus SEPAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='sepal_length', y='sepal_width', c='species_num', colormap=cmap_bold)
# ## K-nearest neighbors (KNN) classification
# 1. Pick a value for K.
# 2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
#     - Euclidean distance is often used as the distance metric, but other metrics are allowed.
# 3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
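# To build intuition, here is a minimal from-scratch sketch of those three
# steps (assumes NumPy arrays for the training data; below we use
# scikit-learn's implementation instead):
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, K=5):
    # step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # indices of the K smallest distances
    nearest = np.argsort(distances)[:K]
    # step 3: most popular response value among the K nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]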
# ### KNN classification map for iris (K=1)
#
# ![1NN classification map](images/iris_01nn_map.png)
# ### KNN classification map for iris (K=5)
#
# ![5NN classification map](images/iris_05nn_map.png)
# ### KNN classification map for iris (K=15)
#
# ![15NN classification map](images/iris_15nn_map.png)
# ### KNN classification map for iris (K=50)
#
# ![50NN classification map](images/iris_50nn_map.png)
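# A sketch of how maps like these can be generated (assumes the iris
# DataFrame and cmap_bold from above; KNeighborsClassifier is introduced
# properly in the modeling section below):
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
X2 = iris[['petal_length', 'petal_width']].values
y2 = iris.species_num.values
knn_map = KNeighborsClassifier(n_neighbors=15).fit(X2, y2)
# predict the species at every point on a fine grid, then shade the regions
xx, yy = np.meshgrid(np.linspace(0, 8, 200), np.linspace(0, 3, 200))
Z = knn_map.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.pcolormesh(xx, yy, Z, cmap=cmap_bold, alpha=0.3)
plt.scatter(X2[:, 0], X2[:, 1], c=y2, cmap=cmap_bold)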
# **Question:** What's the "best" value for K in this case?
#
# **Answer:** The value which produces the most accurate predictions on **unseen data**. We want to create a model that generalizes!
# ## Review of supervised learning
#
# ![Supervised learning diagram](images/supervised_learning.png)
# ## Benefits and drawbacks of scikit-learn
#
# **Benefits:**
#
# - Consistent interface to machine learning models
# - Provides many tuning parameters but with sensible defaults
# - Exceptional documentation
# - Rich set of functionality for companion tasks
# - Active community for development and support
#
# **Potential drawbacks:**
#
# - Harder (than R) to get started with machine learning
# - Less emphasis (than R) on model interpretability
#
# Ben Lorica: [Six reasons why I recommend scikit-learn](http://radar.oreilly.com/2013/12/six-reasons-why-i-recommend-scikit-learn.html)
# ## Requirements for working with data in scikit-learn
#
# 1. Features and response should be **separate objects**
# 2. Features and response should be entirely **numeric**
# 3. Features and response should be **NumPy arrays** (or easily converted to NumPy arrays)
# 4. Features and response should have **specific shapes** (outlined below)
iris.head()
# store feature matrix in "X"
feature_cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
X = iris[feature_cols]
# alternative ways to create "X"
X = iris.drop(['species', 'species_num'], axis=1)
X = iris.loc[:, 'sepal_length':'petal_width']
X = iris.iloc[:, 0:4]
# store response vector in "y"
y = iris.species_num
# check X's type
print(type(X))
print(type(X.values))
# check y's type
print(type(y))
print(type(y.values))
# check X's shape (n = number of observations, p = number of features)
print(X.shape)
# check y's shape (single dimension with length n)
print(y.shape)
# ## scikit-learn's 4-step modeling pattern
# **Step 1:** Import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
# **Step 2:** "Instantiate" the "estimator"
#
# - "Estimator" is scikit-learn's term for "model"
# - "Instantiate" means "make an instance of"
# make an instance of a KNeighborsClassifier object
knn = KNeighborsClassifier(n_neighbors=1)
type(knn)
# - Created an object that "knows" how to do K-nearest neighbors classification, and is just waiting for data
# - Name of the object does not matter
# - Can specify tuning parameters (aka "hyperparameters") during this step
# - All parameters not specified are set to their defaults
print(knn)
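# get_params() (available on every scikit-learn estimator) lists each
# hyperparameter and its current value, which makes the defaults visible
knn.get_params()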
# **Step 3:** Fit the model with data (aka "model training")
#
# - Model is "learning" the relationship between X and y in our "training data"
# - Process through which learning occurs varies by model
# - Occurs in-place
knn.fit(X, y)
# - Once a model has been fit with data, it's called a "fitted model"
# **Step 4:** Predict the response for a new observation
#
# - New observations are called "out-of-sample" data
# - Uses the information it learned during the model training process
# note that predict expects a 2D array-like (one row per observation)
knn.predict([[3, 5, 4, 2]])
# - Returns a NumPy array, and we keep track of what the numbers "mean"
# - Can predict for multiple observations at once
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)
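# one way to keep track of what the numbers "mean": invert the
# species-to-number mapping we created earlier (a small sketch)
num_to_species = {0: 'Iris-setosa', 1: 'Iris-versicolor', 2: 'Iris-virginica'}
[num_to_species[p] for p in knn.predict(X_new)]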
# ## Tuning a KNN model
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)
# fit the model with data
knn.fit(X, y)
# predict the response for new observations
knn.predict(X_new)
# **Question:** Which model produced the correct predictions for the two unknown irises?
#
# **Answer:** We don't know, because these are **out-of-sample observations**, meaning that we don't know the true response values. Our goal with supervised learning is to build models that generalize to out-of-sample data. However, we can't truly measure how well our models will perform on out-of-sample data.
#
# **Question:** Does that mean that we have to guess how well our models are likely to do?
#
# **Answer:** Thankfully, no. In the next class, we'll discuss **model evaluation procedures**, which allow us to use our existing labeled data to estimate how well our models are likely to perform on out-of-sample data. These procedures will help us to tune our models and choose between different types of models.
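# as a preview, a minimal sketch of one such procedure (train/test split):
# hold out some labeled data, fit on the rest, and measure accuracy on the
# held-out set
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
knn5 = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print(accuracy_score(y_test, knn5.predict(X_test)))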
# calculate predicted probabilities of class membership
knn.predict_proba(X_new)
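# the columns of predict_proba line up with knn.classes_; for KNN with K=5,
# each probability is simply the fraction of the 5 nearest neighbors that
# belong to that class (so every value is a multiple of 0.2)
knn.classes_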
# ## Comparing KNN with other models
# **Advantages of KNN:**
#
# - Simple to understand and explain
# - Model training is fast
# - Can be used for classification and regression
#
# **Disadvantages of KNN:**
#
# - Must store all of the training data
# - Prediction phase can be slow when n is large
# - Sensitive to irrelevant features
# - Sensitive to the scale of the data (one common remedy is sketched below)
# - Accuracy is (generally) not competitive with the best supervised learning methods
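# a sketch of one common remedy for scale sensitivity: standardize the
# features (mean 0, variance 1) before computing distances, with a Pipeline
# keeping both steps together
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
knn_scaled = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn_scaled.fit(X, y)
knn_scaled.predict(X_new)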