STAT GR5243/GU4243

Applied Data Science

Department of Statistics, Columbia University

Course Information

Prerequisites

The prerequisites for this course include working knowledge of statistics and probability, data mining, statistical modeling, and machine learning. Prior advanced programming experience in R or Python is required.

Description

This course combines the knowledge and skills covered in a statistics curriculum with topics and projects in data science. Programming will be covered using existing tools, mostly in R, although students may use tools from other languages. Computing best practices will be taught through test-driven development, version control, and collaboration. Students finish the class with a portfolio on GitHub and a deeper understanding of several core statistical/machine-learning algorithms.

This is a project-based, hands-on course in data science. No formal instruction on statistics, data science, or machine learning will be given. Project cycles run every 2-3 weeks, each built around a mini group data project. Groups will be formed randomly, and project products will be peer-reviewed in addition to being evaluated by the instructional team.

Course organization

This course will have a total of five project cycles, of which Project 5 is optional and free-topic. For Projects 1-4, each project cycle follows a sequence of four types of activities.

a. Dataset release, introduction to the data science problem, individual exercises, team forming

b. Lecture/tutorial

c. Brainstorming, live hacking, code sharing

d. Team presentation, peer reviews, within-team peer reviews

Except for Project 1, students will work in randomly formed teams of 5. For a meaningful experience in data science, students are expected to collaborate and work together at every stage of a project. Code sharing and brainstorming are great opportunities to learn from each other.

The five project cycles for this course are as follows (topics are subject to change):

  1. [Individual] R notebook for exploratory data analysis
  2. Shiny app for interactive data visualization project.
  3. Predictive analytics of images.
  4. Algorithms implementation, evaluation, and reproducibility challenge.
  5. [optional] Free topic.

Below is the tentative schedule we will follow for Spring 2020.

  • Week 1 (Jan 22): 1a+1b
  • Week 2 (Jan 29): 1c
  • Week 3 (Feb 5): 1d+2a
  • Week 4 (Feb 12): 2b+2c
  • Week 5 (Feb 19): 2c
  • Week 6 (Feb 26): 2d+3a
  • Week 7 (Mar 4): 3b+3c
  • Week 8 (Mar 11): 3b+3c
  • Spring Break
  • Week 9 (Mar 25): 3d+4a
  • Week 10 (Apr 1): 4b+4c
  • Week 11 (Apr 8): 4b+4c
  • Week 12 (Apr 15): 4d+5c
  • Week 13 (Apr 22): 5c
  • Week 14 (Apr 29): 5d

Evaluation

Students' performance will be evaluated based on

  • [85%] Project products (instructor-reviewed and/or peer-reviewed, averaged over the 4 projects with the highest grades). Each project description will have explicit grading rubrics.

  • [15%] Individual participation (based on individual tasks and instructors' observation).

    A note on participation evaluation.

    In addition to individual tasks such as peer reviews, for each project, we will enforce formal evaluation of participation as follows.

    • Each project needs to show clear collaboration and task assignments in Piazza discussion using the group discussion function.

    • Teams should try to use GitHub to coordinate code sharing and project development throughout the project. GitHub activities will be used as part of participation evaluation.

    • Students should participate actively in class discussion and Piazza discussion.

    • We will give a participation score for each project cycle, the average of which will account for 15% of your final grade. Participation will be graded on the following curve.

      • A (1.8-2): project leader, or major contributor who contributes substantially in every stage of the project and in class discussions.
      • A- (1.5-1.8): major contributor who contributed substantially to two stages of the project and some discussions. This is what most students receive for their participation.
      • B+ (1.2-1.5): average participation; participates in the discussion at every stage and contributes substantially to at least one stage of the project and some discussions.
      • B (1-1.2) or lower: below average performance.
    • This is to ensure a positive learning process for all of us.
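The 85%/15% weighting above can be illustrated with a small calculation. This is only a minimal sketch under stated assumptions, not the instructors' actual grading procedure: it assumes project grades are on a 0-100 scale, and that the 0-2 participation score is converted to a percentage by dividing by its maximum of 2. The function name `final_grade` and its parameters are hypothetical.

```python
def final_grade(project_scores, participation_scores, max_participation=2.0):
    """Combine project and participation scores into a 0-100 course grade.

    Assumptions (not stated in the syllabus):
      - project_scores are on a 0-100 scale;
      - participation_scores are on the 0-2 scale from the curve above and
        are rescaled linearly to 0-100.
    """
    if len(project_scores) < 4:
        raise ValueError("at least 4 project scores are needed")
    if not participation_scores:
        raise ValueError("at least 1 participation score is needed")

    # Average over the 4 projects with the highest grades (85% of the total).
    best_four = sorted(project_scores, reverse=True)[:4]
    project_avg = sum(best_four) / 4

    # Average participation across project cycles, rescaled to 0-100 (15%).
    participation_avg = sum(participation_scores) / len(participation_scores)
    participation_pct = 100 * participation_avg / max_participation

    return 0.85 * project_avg + 0.15 * participation_pct


if __name__ == "__main__":
    # Five projects completed; the lowest grade (60) is dropped.
    print(final_grade([90, 80, 70, 100, 60], [2.0, 2.0, 2.0, 2.0]))
```

Under these assumptions, completing the optional Project 5 can only help: it replaces the lowest of the other four grades when it scores higher and is ignored otherwise.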

Communication

Project grades are managed in Courseworks. We will be using the discussion/announcement tools in Piazza (accessible from Courseworks) for our online class communication and discussion. The system is well suited to exchanging ideas, discussing plans, and getting answers and help quickly from the instructional team and classmates. Rather than emailing questions to the teaching staff, we encourage you to post your questions online.

Textbook

There is no single required text. As part of this course, we will learn from what we can find online and in academic papers. Here are several recommended reference books.

  • Zumel and Mount (2014) Practical data science with R.
  • Segaran (2007) Programming collective intelligence: building smart web 2.0 applications.
  • Tufte (2001) The visual display of quantitative information.
  • Fung (2013) Numbersense: how to use big data to your advantage.
  • Wickham (2017) R for Data Science http://r4ds.had.co.nz/

Class policy

  • We learn together through projects. Please stay positive and congenial. Share what you know with your peers and also learn from them.

  • Working towards deadlines can be stressful. Remember, emails and online posts do NOT carry tone. Be mindful of how you phrase your questions, comments, inquiries, and suggestions. Also be generous and forgiving when reading them.

  • Academic integrity is the cornerstone of meaningful teaching and learning. It is especially important for our project-based course. Remember that what matters more is how much you learn, not what grade you get. In your project, document the references and resources that have been incorporated into your work and credit them appropriately. Plagiarism is one of the most likely forms of cheating in this course.

  • Be a good team member and contribute to each project as much as you can. Don't underestimate the efforts of your teammates. Something that seems simple may not be that simple.

  • Emails related to learning and projects will be redirected to our discussion board.

  • Students are expected to check email at least once every 12 hours during the week and every 24 hours over the weekend. Students should make sure not to miss any important class-related announcements sent by email or posted on Courseworks. Emails will be delivered to students' official UNI addresses. It is the students' responsibility to ensure that these emails are properly forwarded if they choose to use an alternative email address.