Course materials for General Assembly's Data Science course in Washington, DC (6/1/15 - 8/12/15).
Instructor: Kevin Markham
Monday | Wednesday |
---|---|
6/1: Introduction to Data Science | 6/3: Command Line and Version Control |
6/8: Data Reading and Cleaning | 6/10: Exploratory Data Analysis |
6/15: Visualization | 6/17: Machine Learning |
6/22: Getting Data, Project Discussion Deadline | 6/24: K-Nearest Neighbors, Project Question and Dataset Due |
6/29: Model Evaluation Part 1 | 7/1: Linear Regression |
7/6: Logistic Regression | 7/8: Model Evaluation Part 2 |
7/13: First Project Presentation | 7/15: Naive Bayes and Text Data |
7/20: Natural Language Processing | 7/22: Kaggle Competition |
7/27: Decision Trees, Draft Paper Due | 7/29: Ensembling |
8/3: Clustering, Peer Review Due | 8/5: Course Review |
8/10: Final Project Presentation | 8/12: Final Project Presentation |
- Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
- DataQuest: Similar interface to Codecademy, but focused on teaching Python in the context of data science.
- Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
- A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
- Python for Informatics: A very beginner-oriented book, with associated slides and videos.
- Beginner and intermediate workshop code: Useful for review and reference.
- Python 2.7x Reference Guide: Kevin's beginner-oriented guide that demonstrates a ton of Python concepts through short, well-commented examples.
- Python Tutor: Allows you to visualize the execution of Python code.
- Welcome from General Assembly staff
- Course overview (slides)
- Introduction to data science (slides)
- Discuss the course project
- Types of data (slides) and public data sources
- Wrap up: Slack tour, submission forms
Homework:
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows), and then browse through this command line reference.
- Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub.
- If your laptop has any setup issues, please work with us to resolve them by Wednesday.
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Command line exercise (code)
- Git and GitHub (slides)
- Intermediate command line
- Wrap up: Course schedule, office hours
Homework:
- Complete the homework exercise listed in the command line introduction. Create a Markdown document that includes your answers and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form.
- Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (up through the "dictionaries" section), you should spend some time this weekend practicing Python. Here are my recommended resources:
- If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
- If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
- If you have more time, try these much longer lessons from DataQuest: "Find the US city with the lowest crime rate" and "Discover weather patterns in LA".
- If you've already mastered these topics and want more of a challenge, try solving the second Python Challenge and send me your code in Slack.
- If there are specific Python topics you want me to cover next week, send me a Slack message.
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
Command Line Resources:
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.
Homework:
- Complete the homework assignment with the Chipotle data, and add a commented Python script to your GitHub repo. If you are unable to complete a part, try writing some pseudocode instead! You have until Monday to complete this assignment.
Resources:
- PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.
- Pandas (code):
Homework:
- Complete "Exercise Three" from today's Pandas script. Note: You do not need to submit this assignment.
- Read How Software in Half of NYC Cabs Generates $5.2 Million a Year in Extra Tips for an excellent example of exploratory data analysis.
Resources:
- Browsing or searching the Pandas API Reference is an excellent way to locate a function even if you don't know its exact name.
- What I do when I get a new data set as told through tweets is a fun (yet enlightening) look at the process of exploratory data analysis.
- Part 2 of Exploratory Data Analysis with Pandas (code)
- Visualization with Pandas and Matplotlib (code)
Homework:
- Complete the homework assignment with the IMDb data, and add a Python script to your GitHub repo. This assignment is due next Monday.
Pandas Resources:
- To learn more Pandas, review this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
- If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis by the creator of Pandas.
- Here are examples of different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
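As a quick illustration of the join types mentioned above, here is a minimal sketch using `pd.merge`. The two DataFrames and their column names (`key`, `left_val`, `right_val`) are made up for this example and do not come from the course materials.

```python
import pandas as pd

# Two small, made-up DataFrames that share a "key" column (illustrative data only)
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Inner join: keep only the keys found in both DataFrames
inner = pd.merge(left, right, on="key", how="inner")

# Left join: keep every row of `left`, filling unmatched columns with NaN
left_join = pd.merge(left, right, on="key", how="left")

# Outer join: keep every key found in either DataFrame
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(left_join)
print(outer)
```

Comparing the row counts of the three results is usually the fastest way to check that a merge did what you intended.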
Visualization Resources:
- Watch Look at Your Data (18 minutes) for an excellent example of why visualization is useful for understanding your data.
- For more on Pandas plotting, read this notebook or the visualization page from the official Pandas documentation.
- To learn how to customize your plots further, browse through this notebook on matplotlib or this similar notebook.
- To explore different types of visualizations and when to use them, Choosing a Good Chart and The Graphic Continuum are nice one-page references, and the interactive R Graph Catalog has handy filtering capabilities.
- This PowerPoint presentation from Columbia's Data Mining class contains lots of good advice for properly using different types of visualizations.
- Review Python homework with the Chipotle data (solution, detailed explanation)
- Grouped box plots and grouped histograms (code)
- Human learning exercise:
- Iris dataset hosted by the UCI Machine Learning Repository
- Iris photo
- Solution
- Introduction to machine learning (slides)
- Course project:
- Example projects
- Project question exercise
Homework:
- Your deadline for discussing your project ideas with an instructor is Monday, and your project question and dataset are due Wednesday.
Resources:
- For a very quick summary of the key points about machine learning, watch What is machine learning, and how does it work? (10 minutes) or read the associated notebook.
- For a more in-depth introduction to machine learning, read section 2.1 (14 pages) of Hastie and Tibshirani's excellent book, An Introduction to Statistical Learning. (It's a free PDF download!)
- For a really nice comparison of supervised versus unsupervised learning, plus an introduction to reinforcement learning, watch this video (13 minutes) from Caltech's Learning From Data course.
- For a preview of some of the machine learning content we will cover during the course, read Sebastian Raschka's overview of the supervised learning process.
- The Emoji Translation Project is a really fun application of machine learning.
- Look up the characteristics of your zip code, and then read about the 67 distinct segments in detail.
Homework:
- Your project question and dataset are due on Wednesday.
- Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future!
- If you're not using Anaconda, install the IPython Notebook using `pip`. (The IPython Notebook comes with Anaconda.)
- If you're not using Anaconda, install Seaborn using `pip`. If you're using Anaconda, install Seaborn by running `conda install seaborn` at the command line.
- Watch this brief introduction to scikit-learn and the IPython Notebook (15 minutes), and try to follow along with the Notebook introduction on your own computer.
- Read Kevin's introduction to reproducibility, read Jeff Leek's guide to creating a reproducible analysis, and watch this related Colbert Report video (8 minutes).
API Resources:
- Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
- API Integration in Python provides a very readable introduction to REST APIs.
- Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
Web Scraping Resources:
- The Beautiful Soup documentation is incredibly thorough, but is hard to use as a reference guide. However, the section on specifying a parser may be helpful if Beautiful Soup appears to be parsing a page incorrectly.
- For more Beautiful Soup examples and tutorials, see Web Scraping 101 with Python, Alex's well-commented notebook on scraping Craigslist, this notebook from Stanford's Text As Data course, and this notebook and associated video from Harvard's Data Science course.
- For a much longer web scraping tutorial covering Beautiful Soup, lxml, XPath, and Selenium, watch Web Scraping with Python (3 hours 23 minutes) from PyCon 2014. The slides and code are also available.
- For more complex web scraping projects, Scrapy is a popular application framework that works with Python. It has excellent documentation, and here's a tutorial with detailed slides and code.
- robotstxt.org has a concise explanation of how to write (and read) the `robots.txt` file.
- import.io and Kimono claim to allow you to scrape websites without writing any code.
- How a Math Genius Hacked OkCupid to Find True Love and How Netflix Reverse Engineered Hollywood are two fun examples of how web scraping has been used to build interesting datasets.
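To make the `robots.txt` point above concrete, here is a minimal sketch of checking whether a scraper may fetch a URL, using the standard library's `urllib.robotparser`. It assumes Python 3 (in Python 2 the module is named `robotparser`). The `robots.txt` contents, the `my-scraper` user agent, and the `example.com` URLs are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt for illustration; real sites publish theirs at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

# Parse the rules from the text above rather than downloading them
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether a hypothetical scraper may fetch each URL
allowed = parser.can_fetch("my-scraper", "http://example.com/listings/page1.html")
blocked = parser.can_fetch("my-scraper", "http://example.com/private/data.html")

print(allowed, blocked)
```

For a live site, `set_url()` followed by `read()` downloads and parses the file in place of the `parse()` call.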
- Optional web scraping homework due (solution)
- Reproducibility
- Discuss assigned readings: introduction, Colbert Report video, cabs article, Tweet, creating a reproducible analysis
- Examples: Classic rock, student project 1, student project 2
- Machine learning exercise (article)
- Brief introduction to the IPython Notebook
- K-nearest neighbors and scikit-learn (notebook)
- Exploring the bias-variance tradeoff (notebook)
Homework:
- Reading assignment on the bias-variance tradeoff
- Browse through the scikit-learn documentation for KNN to get a sense of how it's organized: user guide, module reference, class documentation
- Work on your project... your first project presentation is in less than three weeks!
- Optional: Read the Teaching Assistant Evaluation dataset into Pandas, create the X and y objects, and go through scikit-learn's 4-step modeling process. (There's no need to submit your code unless you have a question or would like feedback!)
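For reference, here is a minimal sketch of the modeling process on the iris dataset rather than the homework dataset. It assumes the four steps are import, instantiate, fit, and predict. The value `n_neighbors=5` and the measurements in `new_observation` are arbitrary choices for illustration.

```python
# Step 1: import the class you plan to use
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris

# Load the data and create the feature matrix X and response vector y
iris = load_iris()
X = iris.data
y = iris.target

# Step 2: instantiate the estimator (n_neighbors=5 is an arbitrary choice)
knn = KNeighborsClassifier(n_neighbors=5)

# Step 3: fit the model with data
knn.fit(X, y)

# Step 4: predict the response for a new observation (made-up measurements)
new_observation = [[3, 5, 4, 2]]
prediction = knn.predict(new_observation)

print(iris.target_names[prediction])
```

The same pattern should carry over to other scikit-learn estimators: only the import and instantiation lines change.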
KNN Resources:
- For a recap of the key points about KNN and scikit-learn, watch Getting started in scikit-learn with the famous iris dataset (15 minutes) and Training a machine learning model with scikit-learn (20 minutes).
- A Detailed Introduction to KNN is a bit dense, but provides a more thorough introduction to KNN and its applications.
- This lecture on Image Classification shows how KNN could be used for detecting similar images, and also touches on topics we will cover in future classes (hyperparameter tuning and cross-validation).
- Some applications for which KNN is well-suited are object recognition, satellite image enhancement, document categorization, and gene expression analysis.
Reproducibility Resources:
- Software development skills for data scientists discusses the importance of writing functions and proper code comments (among other skills), which are highly useful for creating a reproducible analysis.
- Data science done well looks easy - and that is a big problem for data scientists explains how a reproducible analysis demonstrates all of the work that goes into proper data science.
Other Resources:
- If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
- To get started with Seaborn for visualization, the official website has a series of tutorials and an example gallery.