Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15).
Instructor: Kevin Markham
Tuesday | Thursday |
---|---|
8/18: Introduction to Data Science | 8/20: Command Line and Version Control |
8/25: Data Reading and Cleaning | 8/27: Exploratory Data Analysis |
9/1: Visualization, Project Discussion Deadline | 9/3: Machine Learning, Project Question and Dataset Due |
9/8: Getting Data | 9/10: K-Nearest Neighbors |
9/15: Basic Model Evaluation | 9/17: Linear Regression |
9/22: First Project Presentation | 9/24: Logistic Regression |
9/29: Advanced Model Evaluation | 10/1: Naive Bayes and Text Data |
10/6: Natural Language Processing | 10/8: Kaggle Competition, Draft Paper Due |
10/13: Decision Trees | 10/15: Ensembling |
10/20: Regularization and Clustering, Peer Review Due | 10/22: Course Review |
10/27: Final Project Presentation | 10/29: Final Project Presentation |
- Install Git.
- Create an account on the GitHub website.
    - It is not necessary to download "GitHub for Windows" or "GitHub for Mac".
- Install the Anaconda distribution of Python 2.7.x.
    - If you choose not to use Anaconda, here is a list of the Python packages you will need to install during the course.
- We would like to check the setup of your laptop before the course begins:
    - You can have your laptop checked before the intermediate Python workshop on Tuesday 8/11 (5:30pm-6:30pm), at the 15th & K Starbucks on Saturday 8/15 (1pm-3pm), or before class on Tuesday 8/18 (5:30pm-6:30pm).
    - Alternatively, you can walk through the setup checklist yourself.
- Once you receive an email invitation from Slack, join our "DAT8 team" and add your photo.
- Practice Python using the resources below.
    - Codecademy's Python course: Good beginner material, including tons of in-browser exercises.
    - DataQuest: Teaches Python in the context of data science.
    - Google's Python Class: Slightly more advanced, including hours of useful lecture videos and downloadable exercises (with solutions).
    - A Crash Course in Python for Scientists: Read through the Overview section for a quick introduction to Python.
    - Python for Informatics: A very beginner-oriented book, with associated slides and videos.
    - Python Quick Reference Guide: My beginner-oriented guide that demonstrates Python concepts through short, well-commented examples.
    - Beginner and intermediate workshop code: Useful for review and reference.
    - Python Tutor: Allows you to visualize the execution of Python code.
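
If you want to self-check the Python part of the setup described above, a minimal sketch like the following may help. It simply tries to import each package and reports which ones are missing. The package names in `ASSUMED_PACKAGES` are an assumption for illustration only (the course's actual package list is the one linked above), and `check_packages` is a hypothetical helper, not part of the official setup checklist.

```python
import importlib

# ASSUMPTION: illustrative module names only; replace with the course's
# actual package list if you are not using Anaconda.
ASSUMED_PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def check_packages(names):
    """Return a dict mapping each module name to True if it imports, else False."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = True
        except ImportError:
            # The module is not installed (or not importable) in this environment.
            results[name] = False
    return results


if __name__ == "__main__":
    # Print one status line per package, sorted by name.
    for name, ok in sorted(check_packages(ASSUMED_PACKAGES).items()):
        print("%-12s %s" % (name, "OK" if ok else "MISSING"))
```

The script runs under both Python 2.7 and Python 3. Any package reported as `MISSING` would need to be installed before class, and this check does not replace having your laptop looked at in person.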
- Welcome from General Assembly staff
- Course overview (slides)
- Introduction to data science (slides)
- Discuss the course project: requirements and example projects
- Types of data (slides) and public data sources
- Wrap up: Slack tour, submission forms
Homework:
- Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows).
- Read through this command line reference, and complete the pre-class exercise at the bottom. (There's nothing you need to submit once you're done.)
- Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub.
- If your laptop has any setup issues, please work with us to resolve them by Thursday.
Resources:
- For a useful look at the different types of data scientists, read Analyzing the Analyzers (32 pages).
- For some thoughts on what it's like to be a data scientist, read these short posts from Win-Vector and Datascope Analytics.
- Quora has a data science topic FAQ with lots of interesting Q&A.
- Keep up with local data-related events through the Data Community DC event calendar or weekly newsletter.
- Review the command line pre-class exercise (code)
- Git and GitHub (slides)
- Intermediate command line
- Wrap up: Course schedule, office hours
Homework:
- Complete the homework exercise listed in the command line introduction:
    - Create a Markdown file that includes your answers and the code you used to arrive at those answers.
    - Add this file to a GitHub repo that you'll use for all of your coursework.
    - Submit a link to your repo using the homework submission form.
- Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (excluding the "requests" and "APIs" sections), you should spend some time this weekend practicing Python:
    - If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
    - If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
    - If you have more time, try missions 2 and 3 from DataQuest's Learning Python course.
    - If you've already mastered these topics and want more of a challenge, try solving Python Challenge number 1 (decoding a message) and send me your code in Slack.
- To help you think about your own project, watch What is machine learning, and how does it work? (10 minutes) and browse through some more example student projects.
Git and Markdown Resources:
- Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
- If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
- If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
- GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
- Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.
Command Line Resources:
- If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
- If you want to do more at the command line with CSV files, try out csvkit, which can be installed via `pip`.