G5243/GU4243 Spring 2017

Applied Data Science

Department of Statistics, Columbia University

Course Information

Prerequisites

The prerequisites for this course include working knowledge of statistics and probability, data mining, statistical modeling, and machine learning. Prior advanced programming experience in R or Python is required.

Description

This course incorporates knowledge and skills covered in a statistical curriculum with topics and projects in data science. Programming will be covered using existing tools in R, while students can use tools from other languages. Computing best practices will be taught using test-driven development, version control, and collaboration. Students finish the class with a portfolio on GitHub and a deeper understanding of several core statistical/machine-learning algorithms.

This is a project-based, hands-on course in data science. No formal instruction on statistics, data science, or machine learning will be given. Each project cycle runs 2-3 weeks and centers on a mini data project. Groups will be formed randomly and project products will be peer-reviewed, in addition to evaluation by the instructional team.

Course organization

This course will have a total of five project cycles. Each project cycle follows a sequence of four types of activities.

A. Dataset release, introduction to data science problem, individual exercises, team forming

B. Lecture/tutorial

C. Brainstorming, live hacking, code sharing

D. Team presentation, peer reviews, within-team peer reviews

Students will be working in teams of 5 students that will be randomly formed. For a meaningful experience in data science, students are expected to collaborate and work together on all the stages of a project. Code sharing and brainstorming are great opportunities to learn from each other.
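As an illustration of random team formation, here is a minimal Python sketch. It is not the instructors' actual procedure: the function name `form_teams`, the 23-student roster, and the seed are all hypothetical, and the syllabus does not say how a roster that does not divide evenly by 5 is handled (this sketch simply leaves the last team smaller).

```python
import random


def form_teams(students, team_size=5, seed=None):
    """Shuffle the roster and cut it into consecutive teams of `team_size`.

    The last team is smaller when the roster size is not a multiple
    of `team_size` (an assumption; the syllabus does not cover this case).
    """
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    roster = list(students)    # copy so the caller's list is not reordered
    rng.shuffle(roster)
    return [roster[i:i + team_size] for i in range(0, len(roster), team_size)]


# Hypothetical roster of 23 students.
roster = [f"student_{i:02d}" for i in range(1, 24)]
teams = form_teams(roster, team_size=5, seed=2017)

for number, team in enumerate(teams, start=1):
    print(f"Team {number}: {', '.join(team)}")
```

Calling `form_teams` again with a different seed for each project cycle would give every cycle a fresh random grouping.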

The five project cycles are:

  1. [Individual] R notebook project.
  2. Open data visualization project.
  3. Predictive analytics of images.
  4. Relational (network) data analysis.
  5. Free topic (multiple data sources will be provided).

Below is a tentative schedule we will follow.

  • Week 1 (1/20): 1a+1b
  • Week 2 (1/27): 1c
  • Week 3 (2/3): 1d+2a
  • Week 4 (2/10): 2b+2c
  • Week 5 (2/17): 2c
  • Week 6 (2/24): 2d+3a
  • Week 7 (3/3): 3b+3c
  • Week 8 (3/10): 3b+3c
  • Spring break week
  • Week 9 (3/24): 3c+4a
  • Week 10 (3/31): 4b+4c
  • Week 11 (4/7): 4b+4c
  • Week 12 (4/14): 4d
  • Week 13 (4/21): 5c
  • Week 14 (4/28): 5d

Evaluation

Students' performance will be evaluated based on:

  • Project products (instructor-reviewed and/or peer-reviewed, averaged over 5 projects) 90%

  • Participation (instructors' observation) 10%

    A note on participation evaluation.

    This semester, we will enforce formal evaluation of participation as follows.

    • Each project needs to show clear collaboration and task assignments on GitHub using GitHub's features such as issues and projects. (We will provide a tutorial on how to use these features in week 1.)

    • Teams should use GitHub to coordinate code sharing and project development throughout the project.

    • Students should participate actively in class discussion and Piazza discussion.

    • We will give a participation score for each project cycle, the average of which will contribute 10% of your final grade. Participation will be graded according to the following scale.

      • A: project leader, major contributor who contributes substantially in every stage of the project and class discussions.
      • A-: major contributor who contributes substantially to two stages of the project and some discussions.
      • B+: average participation; participates in the discussion at every stage and contributes substantially in at least one stage of the project and some discussions.
      • B or lower: below average performance.
    • This is to ensure a positive learning process for all of us.
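The weighting above can be sketched in Python as follows. This is only an illustration under stated assumptions: the syllabus gives the 90%/10% split and says both components are averaged over the five project cycles, but the 0-100 numeric scale, the function name `final_score`, and the example scores are hypothetical. In particular, the syllabus does not say how participation letter grades (A, A-, B+, ...) map to numbers.

```python
from statistics import mean

PROJECT_WEIGHT = 0.90        # project products, averaged over 5 projects
PARTICIPATION_WEIGHT = 0.10  # participation, averaged over 5 project cycles
NUM_CYCLES = 5


def final_score(project_scores, participation_scores):
    """Combine per-cycle scores into one weighted course score.

    Both inputs are assumed to be on the same 0-100 scale
    (an assumption; the syllabus does not define a numeric scale).
    """
    if len(project_scores) != NUM_CYCLES or len(participation_scores) != NUM_CYCLES:
        raise ValueError(f"expected {NUM_CYCLES} scores for each component")
    return (PROJECT_WEIGHT * mean(project_scores)
            + PARTICIPATION_WEIGHT * mean(participation_scores))


# Hypothetical scores for one student.
projects = [88, 92, 85, 90, 95]          # averages to 90
participation = [100, 90, 90, 100, 90]   # averages to 94

print(round(final_score(projects, participation), 2))  # 0.9 * 90 + 0.1 * 94 = 90.4
```

Because participation carries only 10% of the weight, a 10-point change in the average participation score moves the final score by 1 point.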

Communication

Project grades are managed in Courseworks. We will be using the discussion/announcement tools in Courseworks (via Piazza) for our online class communication and discussion. The system is designed to get you help quickly and efficiently from classmates, the TA, and me. Rather than emailing questions to the teaching staff, I encourage you to post your questions online.

Textbook

There is no single required text. As part of this course, we will learn from what we can find online and in academic papers. Here are a few recommended reference books.

  • Zumel and Mount (2014) Practical data science with R.
  • Segaran (2007) Programming collective intelligence: building smart web 2.0 applications.
  • Tufte (2001) The visual display of quantitative information.
  • Fung (2013) Numbersense: how to use big data to your advantage.

Class policy

  • We learn together through projects. Please stay positive and congenial. Share what you know with your peers and also learn from them.

  • Working towards deadlines can be stressful. Remember, emails or online posts do not have tones. Be mindful about how you phrase your questions, comments, inquiries and suggestions. Also be generous when reading them.

  • Academic integrity is the cornerstone of meaningful teaching and learning. It is especially important for our project-based course. Remember, what matters more is how much you learn, not what grade you get. In your project, document the references and resources that have been incorporated into your project and credit them appropriately. Plagiarism is one of the most likely forms of cheating in this course. Here are some tips to avoid plagiarism.

  • Be a good team member and contribute to each project as much as you can. Don't underestimate the efforts of your teammates. Something that seems simple may not be that simple.

  • Emails related to learning and projects shall be redirected to our discussion board.

  • Students are expected to check email at least once every 12 hours during the week and every 24 hours over the weekend. Students should make sure not to miss any important class-related announcements sent by email or posted on Courseworks. Emails will be delivered to the students' official UNI email addresses. It is the students' responsibility to ensure that these emails are properly forwarded if they choose to use an alternative email address.