GitHub - ajchan11/DAT8 at 7cbb74b9fa59023462a5242be414414fd7759137


DAT8 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15).

Course Project

Tuesday	Thursday
8/18: Introduction to Data Science	8/20: Command Line and Version Control
8/25: Data Reading and Cleaning	8/27: Exploratory Data Analysis
9/1: Visualization (Project Discussion Deadline)	9/3: Machine Learning (Project Question and Dataset Due)
9/8: Getting Data	9/10: K-Nearest Neighbors
9/15: Basic Model Evaluation	9/17: Linear Regression
9/22: First Project Presentation	9/24: Logistic Regression
9/29: Advanced Model Evaluation	10/1: Naive Bayes and Text Data
10/6: Natural Language Processing	10/8: Kaggle Competition, Draft Paper Due
10/13: Decision Trees	10/15: Ensembling
10/20: Regularization and Clustering, Peer Review Due	10/22: Course Review and Bonus Topics
10/27: Bonus Topics and Final Project Presentation	10/29: Final Project Presentation

Python Resources

Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
DataQuest: Uses interactive exercises to teach Python in the context of data science.
Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
Introduction to Python: A series of IPython notebooks that do a great job explaining core Python concepts and data structures.
Python for Informatics: A very beginner-oriented book, with associated slides and videos.
A Crash Course in Python for Scientists: Read through the Overview section for a very quick introduction to Python.
Python Quick Reference Guide: My beginner-oriented guide that demonstrates Python concepts through short, well-commented examples.
Beginner and intermediate workshop code: Useful for review and reference.
Python Tutor: Allows you to visualize the execution of Python code.

Submission Forms

Comparison of machine learning models

Class 1: Introduction to Data Science

Course overview (slides)
Introduction to data science (slides)
Discuss the course project: requirements and example projects
Types of data (slides) and public data sources
Welcome from General Assembly staff

Homework:

Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows).
Read through this command line reference, and complete the pre-class exercise at the bottom. (There's nothing you need to submit once you're done.)
Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub, or read sections 1.1 through 2.2 of Pro Git.
If your laptop has any setup issues, please work with us to resolve them by Thursday. If your laptop has not yet been checked, you should come early on Thursday, or just walk through the setup checklist yourself (and let us know you have done so).

Resources:

For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
Quora has a data science topic FAQ with lots of interesting Q&A.
Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.

Class 2: Command Line and Version Control

Slack tour
Review the command line pre-class exercise (code)
Git and GitHub (slides)
Intermediate command line

Homework:

Complete the command line homework assignment with the Chipotle data.
Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:
- Introduction to Python does a great job explaining Python essentials and includes tons of example code.
- If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
- If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
- If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
- If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
To give you a framework for thinking about your project, watch What is machine learning, and how does it work? (10 minutes). (This is the IPython notebook shown in the video.) Alternatively, read A Visual Introduction to Machine Learning, which focuses on a specific machine learning model called decision trees.
Optional: Browse through some more example student projects, which may help to inspire your own project!

Git and Markdown Resources:

Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.

Command Line Resources:

If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

Class 3: Data Reading and Cleaning

Git and GitHub assorted tips (slides)
Review command line homework (solution)
Python:
- Spyder interface
- Looping exercise
- Lesson on file reading with airline safety data (code, data, article)
- Data cleaning exercise
- Walkthrough of Python homework with Chipotle data (code, data, article)

Homework:

Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. You have until Tuesday (9/1) to complete this assignment. (Note: Pandas, which is covered in class 4, should not be used for this assignment.)
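
Since Pandas is off limits for this assignment, here is a minimal sketch of reading a tab-separated file with the standard library `csv` module. The column names and rows are made up for illustration (they are not the actual homework data), and an in-memory string stands in for a real file.

```python
import csv
import io

# Hypothetical tab-separated sample standing in for a real data file.
SAMPLE_TSV = (
    "order_id\tquantity\titem_name\titem_price\n"
    "1\t1\tChips\t$2.39\n"
    "1\t2\tSoda\t$1.09\n"
    "2\t1\tBurrito\t$8.75\n"
)

def read_rows(file_obj):
    """Return (header, rows) from a tab-separated file-like object."""
    reader = csv.reader(file_obj, delimiter="\t")
    header = next(reader)          # first line holds the column names
    rows = [row for row in reader]  # every remaining line is a list of strings
    return header, rows

header, rows = read_rows(io.StringIO(SAMPLE_TSV))

# Clean the price column: strip the dollar sign and convert to float.
prices = [float(row[3].lstrip("$")) for row in rows]

# Add up the prices within each order (assumes item_price is a line total).
order_totals = {}
for row, price in zip(rows, prices):
    order_id = int(row[0])
    order_totals[order_id] = order_totals.get(order_id, 0.0) + price

print(header)
print(order_totals)
```

With a file on disk, the same function should work by passing it an open file object, for example `open("some_file.tsv", newline="")` (the file name is a placeholder).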

Resources:

Want to understand Python's comprehensions? Think in Excel or SQL may be helpful if you are still confused by list comprehensions.
My code isn't working is a great flowchart explaining how to debug Python errors.
PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
If you want to understand Python at a deeper level, Ned Batchelder's Loop Like A Native and Python Names and Values are excellent presentations.
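
If list comprehensions are still confusing, it may help to see one next to the loop it replaces. The sketch below uses made-up values and is only meant to show the pattern.

```python
# A for loop that collects the squares of the even numbers in a list.
nums = [1, 2, 3, 4, 5, 6]
squares_loop = []
for n in nums:
    if n % 2 == 0:
        squares_loop.append(n ** 2)

# The equivalent list comprehension: expression first, then the loop, then the filter.
squares_comp = [n ** 2 for n in nums if n % 2 == 0]

# A dictionary comprehension follows the same pattern with key: value pairs.
lengths = {word: len(word) for word in ["git", "pandas", "numpy"]}

print(squares_comp)
print(lengths)
```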

Class 4: Exploratory Data Analysis

Pandas (code):
- MovieLens 100k movie ratings (data, data dictionary, website)
- Alcohol consumption by country (data, article)
- Reports of UFO sightings (data, website)
Project question exercise

Homework:

The deadline for discussing your project ideas with an instructor is Tuesday (9/1), and your project question write-up is due Thursday (9/3).
Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
Read Anscombe's Quartet, and Why Summary Statistics Don't Tell the Whole Story for a classic example of why visualization is useful.
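
To see the point of the Anscombe's Quartet reading in numbers, the sketch below computes a few summary statistics for the four datasets using only the standard library. The values are the quartet as commonly published; the means come out essentially identical and the variances of y nearly so, even though the four scatter plots look nothing alike.

```python
from statistics import mean, variance

# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# For each dataset: mean of x, mean of y (2 decimals), sample variance of y (1 decimal).
summary = {}
for name, (xs, ys) in quartet.items():
    summary[name] = (mean(xs), round(mean(ys), 2), round(variance(ys), 1))
    print(name, summary[name])
```

Plotting each dataset (for example with the Pandas plotting tools covered in class 5) is what reveals the differences that these numbers hide.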

Resources:

Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.

Class 5: Visualization

Python homework with the Chipotle data due (solution, detailed explanation)
Part 2 of Exploratory Data Analysis with Pandas (code)
Visualization with Pandas and Matplotlib (code, notebook)

Homework:

Your project question write-up is due on Thursday.
Complete the Pandas homework assignment with the IMDb data. You have until Tuesday (9/8) to complete this assignment.
If you're not using Anaconda, install the Jupyter Notebook (formerly known as the IPython Notebook) using pip. (The Jupyter or IPython Notebook is included with Anaconda.)

Pandas Resources:

To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
This is a nice, short tutorial on pivot tables in Pandas.
For working with geospatial data in Python, GeoPandas looks promising. This tutorial uses GeoPandas (and scikit-learn) to build a "linguistic street map" of Singapore.

Visualization Resources:

Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
Read Overview of Python Visualization Tools for a useful comparison of Matplotlib, Pandas, Seaborn, ggplot, Bokeh, Pygal, and Plotly.
To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are nice one-page references, and the interactive R Graph Catalog has handy filtering capabilities.
This PowerPoint presentation from Columbia's Data Mining class contains lots of good advice for properly using different types of visualizations.
Harvard's Data Science course includes an excellent lecture on Visualization Goals, Data Types, and Statistical Graphs (83 minutes), for which the slides are also available.

Class 6: Machine Learning

Part 2 of Visualization with Pandas and Matplotlib (code, notebook)
Brief introduction to the Jupyter/IPython Notebook
"Human learning" exercise:
- Iris dataset hosted by the UCI Machine Learning Repository
- Iris photo
- Notebook
Introduction to machine learning (slides)

Homework:

Optional: Complete the bonus exercise listed in the human learning notebook. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/8).
If you're not using Anaconda, install requests and Beautiful Soup 4 using pip. (Both of these packages are included with Anaconda.)

Machine Learning Resources:

For a very quick summary of the key points about machine learning, watch What is machine learning, and how does it work? (10 minutes) or read the associated notebook.
For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
The Learning Paradigms video (13 minutes) from Caltech's Learning From Data course provides a nice comparison of supervised versus unsupervised learning, as well as an introduction to "reinforcement learning".
Real-World Active Learning is a readable and thorough introduction to "active learning", a variation of machine learning in which humans label only the most "important" observations.
For a preview of some of the machine learning content we will cover during the course, read Sebastian Raschka's overview of the supervised learning process.
Data Science, Machine Learning, and Statistics: What is in a Name? discusses the differences between these (and other) terms.
The Emoji Translation Project is a really fun application of machine learning.
Look up the characteristics of your zip code, and then read about the 67 distinct segments in detail.

IPython Notebook Resources:

For a recap of the IPython Notebook introduction (and a preview of scikit-learn), watch scikit-learn and the IPython Notebook (15 minutes) or read the associated notebook.
If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
This Reddit discussion compares the relative strengths of the IPython Notebook and Spyder.

Class 7: Getting Data

Pandas homework with the IMDb data due (solution)
Optional "human learning" exercise with the iris data due (solution)
APIs (code)
- OMDb API
Web scraping (code)

Homework:

Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/15).
Optional: If you're not using Anaconda, install Seaborn using pip. If you're using Anaconda, install Seaborn by running conda install seaborn at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.)

API Resources:

This Python script to query the U.S. Census API was created by a former DAT student. It's a bit more complicated than the example we used in class, it's very well commented, and it may provide a useful framework for writing your own code to query APIs.
Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
The Data Science Toolkit is a collection of location-based and text-related APIs.
API Integration in Python provides a very readable introduction to REST APIs.
Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.

Web Scraping Resources:

The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, a former DAT student's well-commented notebook on scraping Craigslist, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
robotstxt.org has a concise explanation of how to write (and read) the robots.txt file.
import.io and Kimono claim to allow you to scrape websites without writing any code.
How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
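
Before scraping a site, it can be worth checking its robots.txt rules from code. The sketch below uses the standard library's `urllib.robotparser`; the rules, the user agent name, and the URLs are hypothetical, and the rules are supplied as text so that no network request is made.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt, supplied as lines so no network call is needed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(ROBOTS_TXT)

# Ask whether a given user agent may fetch a given URL under these rules.
allowed = parser.can_fetch("my-scraper", "https://example.com/listings")
blocked = parser.can_fetch("my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, `set_url()` followed by `read()` downloads the real file instead of parsing a string.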

Class 8: K-Nearest Neighbors

Brief review of Pandas (notebook)
K-nearest neighbors and scikit-learn (notebook)
Exercise with NBA player data (notebook, data, data dictionary)
Exploring the bias-variance tradeoff (notebook)

Homework:

Reading assignment on the bias-variance tradeoff
Read Kevin's introduction to reproducibility, read Jeff Leek's guide to creating a reproducible analysis, and watch this related Colbert Report video (8 minutes).
Work on your project... your first project presentation is in less than two weeks!

KNN Resources:

For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
KNN supports distance metrics other than Euclidean distance, such as Mahalanobis distance, which takes the scale of the data into account.
A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
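
To make the KNN idea concrete, here is a from-scratch sketch of classification by majority vote among the nearest neighbors, using Euclidean distance. It is an illustration only (scikit-learn's implementation is what we use in class), and the data, labels, and function names are made up.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Straight-line distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort the training points by distance to the query and keep the k closest.
    neighbors = sorted(
        zip(train_X, train_y),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical measurements (two features per observation).
train_X = [(1.0, 1.1), (1.2, 0.9), (0.8, 1.0), (3.0, 3.2), (3.1, 2.9), (2.9, 3.0)]
train_y = ["small", "small", "small", "large", "large", "large"]

pred_a = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
pred_b = knn_predict(train_X, train_y, (3.0, 3.0), k=3)
print(pred_a, pred_b)
```

Plain Euclidean distance treats every feature on its raw scale, so a feature with large values dominates the vote. That is one reason to standardize features or consider another metric, as noted above.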

Seaborn Resources:

To get started with Seaborn for visualization, the official website has a series of detailed tutorials and an example gallery.
Data visualization with Seaborn is a quick tour of some of the popular types of Seaborn plots.
Visualizing Google Forms Data with Seaborn and How to Create NBA Shot Charts in Python are both good examples of Seaborn usage on real-world data.

Class 9: Basic Model Evaluation

Optional web scraping homework due (solution)
Reproducibility
- Discuss assigned readings: introduction, Colbert Report video, cabs article, Tweet, creating a reproducible analysis
- Examples: Classic rock, student project 1, student project 2
Discuss the reading assignment on the bias-variance tradeoff
Model evaluation using train/test split (notebook)
Exploring the scikit-learn documentation: module reference, user guide, class and function documentation

Homework:

Watch Data science in Python (35 minutes) for an introduction to linear regression (and a review of other course content), or at the very least, read through the associated notebook.
Optional: For another introduction to linear regression, watch The Easiest Introduction to Regression Analysis (14 minutes).

Model Evaluation Resources:

For a recap of some of the key points from today's lesson, watch Comparing machine learning models in scikit-learn (27 minutes).
For another explanation of training error versus testing error, the bias-variance tradeoff, and train/test split (also known as the "validation set approach"), watch Hastie and Tibshirani's video on estimating prediction error (12 minutes, starting at 2:34).
Caltech's Learning From Data course includes a fantastic video on visualizing bias and variance (15 minutes).
Random Test/Train Split is Not Always Enough explains why random train/test split may not be a suitable model evaluation procedure if your data has a significant time element.
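
The difference between a random train/test split and a split that respects time can be sketched in a few lines. This is a standard-library illustration with hypothetical data and function names, not the scikit-learn function used in the notebook.

```python
import random

def random_split(rows, test_fraction=0.25, seed=1):
    """Shuffle a copy of the rows, then hold out a fraction for testing."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def time_ordered_split(rows, test_fraction=0.25):
    """Train on the earliest rows and test on the most recent ones."""
    n_test = int(len(rows) * test_fraction)
    cutoff = len(rows) - n_test
    return rows[:cutoff], rows[cutoff:]

# Hypothetical observations, already sorted by day.
rows = [{"day": d, "value": d * 2} for d in range(1, 13)]

train_r, test_r = random_split(rows)
train_t, test_t = time_ordered_split(rows)

print("random test days:", sorted(r["day"] for r in test_r))
print("time-ordered test days:", [r["day"] for r in test_t])
```

With the random split, the model may be tested on days that fall between days it was trained on. With the time-ordered split, every test day comes after every training day, which is closer to how a model would be used to predict the future.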

Reproducibility Resources:

Software development skills for data scientists discusses the importance of writing functions and proper code comments (among other skills), which are highly useful for creating a reproducible analysis.
Data science done well looks easy - and that is a big problem for data scientists explains how a reproducible analysis demonstrates all of the work that goes into proper data science.

Class 10: Linear Regression

Machine learning exercise (article)
Linear regression (notebook)
- Capital Bikeshare dataset used in a Kaggle competition
- Data dictionary
Feature engineering example: Predicting User Engagement in Corporate Collaboration Network

Homework:

Your first project presentation is on Tuesday (9/22)! Please submit a link to your project repository (with slides, code, data, and visualizations) by 6pm on Tuesday.
Complete the homework assignment with the Yelp data. This is due on Thursday (9/24).

Linear Regression Resources:

To go much more in-depth on linear regression, read Chapter 3 of An Introduction to Statistical Learning. Alternatively, watch the related videos or read my quick reference guide to the key points in that chapter.
This introduction to linear regression is more detailed and mathematically thorough, and includes lots of good advice.
This is a relatively quick post on the assumptions of linear regression.
Setosa has an interactive visualization of linear regression.
For a brief introduction to confidence intervals, hypothesis testing, p-values, and R-squared, as well as a comparison between scikit-learn code and Statsmodels code, read my DAT7 lesson on linear regression.
Here is a useful explanation of confidence intervals from Quora.
Hypothesis Testing: The Basics provides a nice overview of the topic, and John Rauser's talk on Statistics Without the Agonizing Pain (12 minutes) gives a great explanation of how the null hypothesis is rejected.
In early 2015, a scientific journal banned the use of p-values:
- Scientific American has a nice summary of the ban.
- This response to the ban in Nature argues that "decisions that are made earlier in data analysis have a much greater impact on results".
- Andrew Gelman has a readable paper in which he argues that "it's easy to find a p < .05 comparison even if nothing is going on, if you look hard enough".
- Science Isn't Broken includes a neat tool that allows you to "p-hack" your way to "statistically significant" results.
Accurately Measuring Model Prediction Error compares adjusted R-squared, AIC and BIC, train/test split, and cross-validation.
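
For intuition about what a linear regression fit and R-squared actually compute, here is a from-scratch sketch for a single feature using the least squares formulas. The data and function names are hypothetical, and in practice scikit-learn or Statsmodels would do this for you.

```python
def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def r_squared(xs, ys, intercept, slope):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data that is close to, but not exactly on, a straight line.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = fit_line(xs, ys)
r2 = r_squared(xs, ys, intercept, slope)
print(round(intercept, 2), round(slope, 2), round(r2, 4))
```

Note that R-squared computed on the training data, as here, can only stay the same or go up as features are added, which is why the resources above compare it with train/test split and cross-validation.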

Other Resources:

Section 3.3.1 of An Introduction to Statistical Learning (4 pages) has a great explanation of dummy encoding for categorical features.
Kaggle has some nice visualizations of the bikeshare data we used today.
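
Dummy encoding, as described in the reading above, turns a categorical feature with k levels into k - 1 binary columns, with the dropped level acting as the baseline. The sketch below is a plain-Python illustration; the `dummy_encode` helper, the `is_` column prefix, and the example values are hypothetical.

```python
def dummy_encode(values, drop_first=True):
    """Return one dict of 0/1 indicator columns per input value."""
    levels = sorted(set(values))
    # Dropping the first level makes it the baseline (all indicators are 0).
    kept = levels[1:] if drop_first else levels
    return [{f"is_{level}": int(v == level) for level in kept} for v in values]

# Hypothetical categorical feature.
seasons = ["spring", "summer", "fall", "winter", "summer"]
encoded = dummy_encode(seasons)

for season, row in zip(seasons, encoded):
    print(season, row)
```

Here "fall" sorts first and becomes the baseline, so a fall observation has zeros in all three columns. Each regression coefficient is then interpreted relative to fall.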

Class 11: First Project Presentation

Project presentations!

Homework:

Watch Rahul Patwari's videos on probability (5 minutes) and odds (8 minutes) if you're not comfortable with either of those terms.
Read these excellent articles from BetterExplained: An Intuitive Guide To Exponential Functions & e and Demystifying the Natural Logarithm (ln).
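
If the relationship between probability, odds, and the natural logarithm is unfamiliar, this short sketch shows the conversions. The function names are made up for illustration.

```python
import math

def odds_from_probability(p):
    """Odds are the probability of an event divided by the probability of its absence."""
    return p / (1 - p)

def probability_from_odds(odds):
    """Invert the conversion above."""
    return odds / (1 + odds)

def log_odds(p):
    """Natural log of the odds."""
    return math.log(odds_from_probability(p))

p = 0.8
odds = odds_from_probability(p)      # roughly 4, i.e. "4 to 1"
back = probability_from_odds(odds)   # roughly 0.8 again

print(odds, back)
print(log_odds(0.5))                 # even odds of 1 give a log-odds of zero
print(math.exp(math.log(odds)))      # exp undoes the natural log
```

Probabilities are confined to the range 0 to 1, odds run from 0 upward, and log-odds cover the whole number line.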
