- June 22: Discuss project ideas with instructional team
- June 24: Project question and dataset
  - Project question by Jason Knobloch
  - Project question by Jennifer Lambert
  - Project question by Alex Lee
- July 13: First project presentation
  - First presentation by Chandler McCann
  - First presentation by Nathan Danielsen
- July 27: Draft paper
- August 3: Peer review
- August 10/12: Final project presentation and paper
  - Final presentation by Austin Brown
  - Final paper by Kerry Jones
The final project should represent significant original work applying data science techniques to an interesting problem. Final projects are individual attainments, but you should be talking frequently with your instructor and classmates about them.
Address a data-related problem in your professional field or a field you're passionate about. If you have a strong interest in the subject matter, you'll create a better project and it will be a lot more fun for you!
Here's a collection of past projects from GA Data Science students that may help to stimulate your thinking. You're welcome to use public data or private data, though with private data, you'll have to be careful about what you release. Competing in a Kaggle competition is also a project option, in which case the data will be provided for you.
By June 22, you should talk with a member of the instructional team about your project idea(s). We can help you choose between different ideas, advise you on the appropriate scope for your project, and check that your project question can reasonably be answered using the data science tools and techniques taught in the course. (There is nothing you have to turn in for this milestone.)
Create a GitHub repository for your project. It should include a short write-up that answers these questions:
- What is the question you hope to answer?
- What data are you planning to use to answer that question?
- What do you know about the data so far?
- Why did you choose this topic?
You'll be giving a short presentation to the class about the work you have done so far, as well as your plans for the project going forward. Your presentation should use slides (or a similar format). Your slides, code, data, and visualizations should be included in your GitHub repository. Here are some questions that you should address in your presentation:
- What data have you gathered, and how did you gather it?
- Which areas of the data have you cleaned, and which areas still need cleaning?
- What steps have you taken to explore the data?
- What insights have you gained from your exploration?
- Will you be able to answer your question with this data, or do you need to gather more data (or adjust your question)?
- How might you use modeling to answer your question?
A draft of your project paper is due, along with the data, well-commented code, and visualizations. It should be written with a technical audience in mind. Your paper should include the following components:
- Problem statement and hypothesis
- Description of your data set and how it was obtained
- Description of any pre-processing steps you took
- What you learned from exploring the data, including visualizations
- How you chose which features to use in your analysis
- Details of your modeling process, including how you selected your models and validated them
- Your challenges and successes
- Possible extensions or business applications of your project
- Conclusions and key learnings
Your peers and the instructional team will be providing feedback. However, the paper should stand on its own and should not depend upon the reader remembering your first presentation. The easier your paper is to follow, the more useful the feedback you will receive! Likewise, if your reviewers can actually run your code on the provided data, they will be able to give you better feedback on your code.
You will provide project feedback to two of your peers, according to the peer review guidelines.
Your project repository on GitHub should contain the following:
- Project paper: any format (PDF, Markdown, etc.)
- Presentation slides: any format (PDF, PowerPoint, Google Slides, etc.)
- Code: commented Python scripts, and any other code you used in the project
- Visualizations: integrated into your paper and/or slides
- Data: data files in "raw" or "processed" format
- Data dictionary (aka "code book"): description of each variable, including units
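If you are unsure how to start the data dictionary, one option is to generate a skeleton from your data file and then fill in the units and descriptions by hand. The following is a minimal sketch using only the Python standard library. The function names (`infer_type`, `data_dictionary_skeleton`, `to_markdown`) and the sample columns are hypothetical, and the type guessing is deliberately crude, so treat the output as a starting point rather than a finished code book.

```python
import csv
import io


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v.strip() != ""]
    if not non_empty:
        return "empty"
    for caster, name in ((int, "integer"), (float, "float")):
        try:
            for v in non_empty:
                caster(v)
            return name
        except ValueError:
            continue
    return "text"


def data_dictionary_skeleton(csv_text):
    """Return one entry per column; units and description are left as TODO."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    rows = list(reader)
    entries = []
    for name in fieldnames:
        values = [(row.get(name) or "") for row in rows]
        example = next((v for v in values if v.strip() != ""), "")
        entries.append({
            "variable": name,
            "type": infer_type(values),
            "example": example,
            "units": "TODO",
            "description": "TODO",
        })
    return entries


def to_markdown(entries):
    """Render the entries as a Markdown table for the project repository."""
    header = ["variable", "type", "example", "units", "description"]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    for entry in entries:
        lines.append("| " + " | ".join(entry[key] for key in header) + " |")
    return "\n".join(lines)


# Hypothetical sample data, for illustration only.
sample = "age,height_cm,city\n34,170.5,Boston\n29,,Denver\n"
entries = data_dictionary_skeleton(sample)
print(to_markdown(entries))
```

To use it on a real file, read the file contents into a string and pass that string to `data_dictionary_skeleton`, then replace each `TODO` with the actual units and meaning of the variable.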
While your project paper should focus on a technical audience, your presentation should be suitable for a non-technical audience. Focus on creating an engaging, clear, and informative presentation that tells the story of your project, rather than trying to include every last detail.
Note: If it's not practical to include your entire dataset in your GitHub repository, you should link to your data source and provide a sample of the data. (GitHub blocks files larger than 100 MB and recommends keeping repositories under 1 GB.) If your data is private, you can either include an "anonymized" version of your data or create a private GitHub repository.
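One possible way to prepare a small, shareable sample is sketched below using only the Python standard library. It draws a reproducible random sample of rows and replaces identifier columns with a truncated salted hash. The function names, the `email` column, and the salt value are hypothetical. Note also that hashing is pseudonymization rather than true anonymization: values drawn from a small set can be recovered by guessing, the salt must be kept out of the repository, and other columns may still identify people, so review the result before publishing it.

```python
import csv
import hashlib
import io
import random


def pseudonymize(value, salt):
    """Replace a value with the first 12 hex characters of a salted SHA-256 hash."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def sample_and_mask(csv_text, id_columns, n, salt, seed=0):
    """Return CSV text with up to n randomly sampled rows and masked id columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    rows = list(reader)

    # A fixed seed makes the sample reproducible for reviewers.
    rng = random.Random(seed)
    picked = rng.sample(rows, min(n, len(rows)))

    for row in picked:
        for col in id_columns:
            row[col] = pseudonymize(row[col], salt)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(picked)
    return out.getvalue()


# Hypothetical data and salt, for illustration only.
raw = "email,age\na@x.com,30\nb@x.com,41\nc@x.com,25\n"
result = sample_and_mask(raw, id_columns=["email"], n=2, salt="keep-this-secret")
print(result)
```

The same salt maps the same identifier to the same token, so joins across files still work after masking, provided every file is processed with that salt.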