add materials for classes 17 and 18
justmarkham committed Oct 9, 2015
1 parent 4e969ab commit d7cc9d0
Showing 8 changed files with 3,154 additions and 2 deletions.
5 changes: 3 additions & 2 deletions README.md
@@ -468,8 +468,6 @@ Tuesday | Thursday
* These examples may help you to better understand the process of feature engineering: predicting the number of [passengers at a train station](https://medium.com/@chris_bour/french-largest-data-science-challenge-ever-organized-shows-the-unreasonable-effectiveness-of-open-8399705a20ef), identifying [fraudulent users of an online store](https://docs.google.com/presentation/d/1UdI5NY-mlHyseiRVbpTLyvbrHxY8RciHp5Vc-ZLrwmU/edit#slide=id.p), identifying [bots in an online auction](https://www.kaggle.com/c/facebook-recruiting-iv-human-or-bot/forums/t/14628/share-your-secret-sauce), predicting who will [subscribe to the next season of an orchestra](http://blog.kaggle.com/2015/01/05/kaggle-inclass-stanfords-getting-a-handel-on-data-science-winners-report/), and evaluating the [quality of e-commerce search engine results](http://blog.kaggle.com/2015/07/22/crowdflower-winners-interview-3rd-place-team-quartet/).
* [Our perfect submission](https://www.kaggle.com/c/restaurant-revenue-prediction/forums/t/13950/our-perfect-submission) is a fun read about how great performance on the [public leaderboard](https://www.kaggle.com/c/restaurant-revenue-prediction/leaderboard/public) does not guarantee that a model will generalize to new data.

<!--
-----

### Class 17: Decision Trees
@@ -503,8 +501,11 @@ Tuesday | Thursday
* [Not Even the People Who Write Algorithms Really Know How They Work](http://www.theatlantic.com/technology/archive/2015/09/not-even-the-people-who-write-algorithms-really-know-how-they-work/406099/) argues that the decreased interpretability of state-of-the-art machine learning models has a negative impact on society.
* For an intuitive explanation of Random Forests, read Edwin Chen's answer to [How do random forests work in layman's terms?](http://www.quora.com/Random-Forests/How-do-random-forests-work-in-laymans-terms/answer/Edwin-Chen-1)
* [Large Scale Decision Forests: Lessons Learned](http://blog.siftscience.com/blog/2015/large-scale-decision-forests-lessons-learned) is an excellent post from Sift Science about their custom implementation of Random Forests.
* [Unboxing the Random Forest Classifier](http://nerds.airbnb.com/unboxing-the-random-forest-classifier/) describes a way to interpret the inner workings of Random Forests beyond just feature importances.
* [Understanding Random Forests: From Theory to Practice](http://arxiv.org/pdf/1407.7502v3.pdf) is an in-depth academic analysis of Random Forests, including details of its implementation in scikit-learn.

<!--
-----
### Class 19: Advanced scikit-learn and Clustering
68 changes: 68 additions & 0 deletions code/17_bikeshare_exercise_nb.py
@@ -0,0 +1,68 @@
# # Exercise with Capital Bikeshare data

# ## Introduction
#
# - Capital Bikeshare dataset from Kaggle: [data](https://github.com/justmarkham/DAT8/blob/master/data/bikeshare.csv), [data dictionary](https://www.kaggle.com/c/bike-sharing-demand/data)
# - Each observation represents the bikeshare rentals initiated during a given hour of a given day

import pandas as pd
import numpy as np
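# note: sklearn.cross_validation was removed in scikit-learn 0.20; model_selection is its replacement
from sklearn.model_selection import cross_val_score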
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz


# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)


# "count" is also a DataFrame method, so "bikes.count" would return the method instead of the column; rename the column to avoid the clash
bikes.rename(columns={'count':'total'}, inplace=True)


# create "hour" as its own feature
bikes['hour'] = bikes.index.hour


bikes.head()


bikes.tail()


# - **hour** ranges from 0 (midnight) through 23 (11pm)
# - **workingday** is either 0 (weekend or holiday) or 1 (non-holiday weekday)

# ## Task 1
#
# Run these two `groupby` statements and figure out what they tell you about the data.

bikes.groupby('workingday').total.mean()


bikes.groupby('hour').total.mean()


# ## Task 2
#
# Run this plotting code, and make sure you understand the output. Then, split this plot into two plots conditioned on "workingday". (In other words, one plot should display the hourly trend for "workingday=0", and the other should display the hourly trend for "workingday=1".)

bikes.groupby('hour').total.mean().plot()
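
# One possible way to condition on "workingday" is sketched below. It is not the
# course's reference solution. To stay runnable offline it uses "make_toy_bikes",
# a hypothetical helper that builds a small synthetic stand-in for the real data;
# the same filtering and grouping should apply to the real "bikes" DataFrame.

```python
import numpy as np
import pandas as pd


def make_toy_bikes(n_days=28):
    """Hypothetical synthetic stand-in for the bikeshare data (not the real dataset)."""
    hour = np.tile(np.arange(24), n_days)
    day = np.repeat(np.arange(n_days), 24)
    workingday = (day % 7 < 5).astype(int)
    # assumed pattern: commute peaks on working days, a midday bump otherwise
    commute = np.isin(hour, [8, 17]) & (workingday == 1)
    midday = (hour >= 10) & (hour <= 16) & (workingday == 0)
    total = 20 + 100 * commute + 60 * midday
    return pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})


toy = make_toy_bikes()

# filter on "workingday" first, then compute the hourly mean for each subset
weekend = toy[toy.workingday == 0].groupby('hour').total.mean()
weekday = toy[toy.workingday == 1].groupby('hour').total.mean()

# with matplotlib installed, each Series can be drawn as its own plot:
# weekend.plot(title='workingday=0')
# weekday.plot(title='workingday=1')
```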


# ## Task 3
#
# Fit a linear regression model to the entire dataset, using "total" as the response and "hour" and "workingday" as the only features. Then, print the coefficients and interpret them. What are the limitations of linear regression in this instance?
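#
# A minimal sketch of the fitting step follows, assuming a hypothetical synthetic
# stand-in ("make_toy_bikes") instead of the real data. The coefficient values it
# prints describe the toy data only, not the Capital Bikeshare dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def make_toy_bikes(n_days=28):
    """Hypothetical synthetic stand-in for the bikeshare data (not the real dataset)."""
    hour = np.tile(np.arange(24), n_days)
    day = np.repeat(np.arange(n_days), 24)
    workingday = (day % 7 < 5).astype(int)
    commute = np.isin(hour, [8, 17]) & (workingday == 1)
    midday = (hour >= 10) & (hour <= 16) & (workingday == 0)
    total = 20 + 100 * commute + 60 * midday
    return pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})


toy = make_toy_bikes()

# define the feature matrix and the response
feature_cols = ['hour', 'workingday']
X = toy[feature_cols]
y = toy.total

# fit the model and pair each coefficient with its feature name
linreg = LinearRegression()
linreg.fit(X, y)
coefs = dict(zip(feature_cols, linreg.coef_))
print(linreg.intercept_, coefs)
```

# Each coefficient is the estimated change in "total" for a one-unit increase in
# that feature while the other feature is held fixed. A single slope for "hour"
# cannot follow a trend that rises and falls within a day, which is one
# limitation to consider when answering the question above.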

# ## Task 4
#
# Use 10-fold cross-validation to calculate the RMSE for the linear regression model.
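#
# The cross-validation step might look like the sketch below. Assumptions: a
# recent scikit-learn, where "cross_val_score" lives in "sklearn.model_selection"
# and the scoring string is "neg_mean_squared_error", and a hypothetical
# synthetic stand-in ("make_toy_bikes") in place of the real data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def make_toy_bikes(n_days=28):
    """Hypothetical synthetic stand-in for the bikeshare data (not the real dataset)."""
    hour = np.tile(np.arange(24), n_days)
    day = np.repeat(np.arange(n_days), 24)
    workingday = (day % 7 < 5).astype(int)
    commute = np.isin(hour, [8, 17]) & (workingday == 1)
    midday = (hour >= 10) & (hour <= 16) & (workingday == 0)
    total = 20 + 100 * commute + 60 * midday
    return pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})


toy = make_toy_bikes()
X = toy[['hour', 'workingday']]
y = toy.total

# scikit-learn reports negated MSE so that larger scores are always better
scores = cross_val_score(LinearRegression(), X, y, cv=10,
                         scoring='neg_mean_squared_error')

# flip the sign, take the square root per fold, then average across folds
linreg_rmse = np.mean(np.sqrt(-scores))
print(linreg_rmse)
```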

# ## Task 5
#
# Use 10-fold cross-validation to evaluate a decision tree model with those same features (using any "max_depth" you choose).
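#
# The same evaluation can be sketched for a decision tree. The "max_depth=7" and
# "random_state=1" values are arbitrary choices for illustration, and
# "make_toy_bikes" is a hypothetical synthetic stand-in for the real data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor


def make_toy_bikes(n_days=28):
    """Hypothetical synthetic stand-in for the bikeshare data (not the real dataset)."""
    hour = np.tile(np.arange(24), n_days)
    day = np.repeat(np.arange(n_days), 24)
    workingday = (day % 7 < 5).astype(int)
    commute = np.isin(hour, [8, 17]) & (workingday == 1)
    midday = (hour >= 10) & (hour <= 16) & (workingday == 0)
    total = 20 + 100 * commute + 60 * midday
    return pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})


toy = make_toy_bikes()
X = toy[['hour', 'workingday']]
y = toy.total

# max_depth limits how many successive splits the tree may make
treereg = DecisionTreeRegressor(max_depth=7, random_state=1)
scores = cross_val_score(treereg, X, y, cv=10,
                         scoring='neg_mean_squared_error')

# same RMSE calculation as for the linear model, so the two are comparable
tree_rmse = np.mean(np.sqrt(-scores))
print(tree_rmse)
```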

# ## Task 6
#
# Fit a decision tree model to the entire dataset using "max_depth=3", and create a tree diagram using Graphviz. Then, figure out what each leaf represents. What did the decision tree learn that a linear regression model could not learn?
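#
# A sketch of the fitting and export step, assuming a recent scikit-learn in
# which "export_graphviz" returns the DOT source as a string when
# "out_file=None". The data again comes from "make_toy_bikes", a hypothetical
# synthetic stand-in, so the resulting tree will differ from the real one.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor, export_graphviz


def make_toy_bikes(n_days=28):
    """Hypothetical synthetic stand-in for the bikeshare data (not the real dataset)."""
    hour = np.tile(np.arange(24), n_days)
    day = np.repeat(np.arange(n_days), 24)
    workingday = (day % 7 < 5).astype(int)
    commute = np.isin(hour, [8, 17]) & (workingday == 1)
    midday = (hour >= 10) & (hour <= 16) & (workingday == 0)
    total = 20 + 100 * commute + 60 * midday
    return pd.DataFrame({'hour': hour, 'workingday': workingday, 'total': total})


toy = make_toy_bikes()
feature_cols = ['hour', 'workingday']
X = toy[feature_cols]
y = toy.total

# fit a shallow tree to the entire dataset
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)

# produce Graphviz DOT source; with Graphviz installed, save it to a file and
# render it from the command line: dot -Tpng tree.dot -o tree.png
dot_source = export_graphviz(treereg, out_file=None, feature_names=feature_cols)

# a tree of depth 3 has at most 2**3 = 8 leaves
n_leaves = treereg.get_n_leaves()
print(n_leaves)
```

# Each leaf predicts the mean "total" of the training observations that reach it,
# so reading the split conditions along the path to a leaf shows which
# combination of "hour" and "workingday" that leaf represents.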