change course description
ganyuan94 committed Jan 13, 2022
1 parent ded7e66 commit 033ae99
Showing 54 changed files with 94,217 additions and 26 deletions.
Binary file modified CourseInfo/.DS_Store
Binary file not shown.
47 changes: 24 additions & 23 deletions CourseInfo/G5243_ADS.md
@@ -8,17 +8,15 @@
* Classes: Wednesdays 6:10pm-8:55pm.
* Instructor: Ying Liu. <[email protected]> [(@yingliug)](https://github.com/yingliug)
* Office hours: after class
* TA: Diane Lu. <[email protected]> [(@mathdiane)](http://github.com/mathdiane)
<!--- * Office hours: Mondays 6:00 pm to 8:00 pm on 10th Floor Lounge of SSW --->
* TA: Gan Yuan. <[email protected]> [(@Simon-YG)](https://github.com/Simon-YG)
* Office hours: Mondays 10:00 am to 12:00 pm on Zoom
* Contact preference: through Piazza


* Course websites (all accessible via courseworks or github):
* Grades and basic course info on **Courseworks**: <http://courseworks2.columbia.edu>
* Discussion board on **Piazza**: <https://piazza.com/class/kt4rwf8dd0s70b>
* Discussion board on **Piazza**: <https://piazza.com/class/ky52am2xrzq5ul>
* Course materials and repositories on **GitHub**: <http://tzstatsads.github.io> or <https://github.com/TZstatsADS/ADS_Teaching>

#### Prerequisites
The prerequisites for this course include working knowledge of statistics and probability, data mining, statistical modeling, and machine learning. Prior **advanced** programming experience in R or Python is required.

@@ -46,22 +44,25 @@ We will have a total of four project cycles for this course (topics are subject
2. Shiny app for interactive data visualization project.
3. Predictive analytics of images.
4. Algorithms implementation, evaluation, and reproducibility challenge.

Below is the tentative schedule we will follow for Fall 2021.

+ Week 1 (Sep 15): 1a+1b
+ Week 2 (Sep 22): 1c
+ Week 3 (Sep 29): 1d+2a
+ Week 4 (Oct 6): 2b+2c
+ Week 5 (Oct 13): 2b+2c
+ Week 6 (Oct 20): 2d+3a
+ Week 7 (Oct 27): 3b+3c
+ Week 8 (Nov 3): 3b+3c
+ Week 9 (Nov 10): 3d+4a
+ Week 10 (Nov 17): 4b+4c
+ Thanksgiving Break
+ Week 11 (Dec 1): 4b+4c
+ Week 12 (Dec 8): 4d
5. [optional] *Free topic*.

Below is the tentative schedule we will follow for Spring 2022.

+ Week 1 (Jan 19): 1a+1b
+ Week 2 (Jan 26): 1c
+ Week 3 (Feb 2): 1d+2a
+ Week 4 (Feb 9): 2b+2c
+ Week 5 (Feb 16): 2b+2c
+ Week 6 (Feb 23): 2d+3a
+ Week 7 (Mar 2): 3b+3c
+ Week 8 (Mar 9): 3b+3c
+ Spring Break
+ Week 9 (Mar 16): 3d+4a
+ Week 10 (Mar 23): 4b+4c
+ Week 11 (Mar 30): 4b+4c
+ Week 12 (Apr 6): 4d+5c
+ Week 13 (Apr 13): 5c
+ Week 14 (Apr 20): 5d

#### Evaluation

@@ -107,5 +108,5 @@ There is not a single required text. As part of this course, we will learn from
* Be a good team member and contribute to each project as much as you can. Don't underestimate the efforts of your teammates. Something that seems simple may not be that simple.

* Emails related to learning and projects shall be redirected to our discussion board.

* Students are [expected](http://policylibrary.columbia.edu/student-email-communication-policy) to check emails at least once every 12 hours during the week and every 24 hours over the weekend. Students should make sure not to miss any important class-related announcements sent by emails or posted on Courseworks. Emails will be delivered to the students' official UNI. It is the students' responsibility to ensure that these emails are properly forwarded if they choose to use an alternative email address.
111 changes: 111 additions & 0 deletions CourseInfo/G5243_Fall_2021_ADS.md
@@ -0,0 +1,111 @@
### STAT GR5243/GU4243
### Applied Data Science

#### Department of Statistics, Columbia University

#### Course Information

* Classes: Wednesdays 6:10pm-8:55pm.
* Instructor: Ying Liu. <[email protected]> [(@yingliug)](https://github.com/yingliug)
* Office hours: after class
* TA: Diane Lu. <[email protected]> [(@mathdiane)](http://github.com/mathdiane)
<!--- * Office hours: Mondays 6:00 pm to 8:00 pm on 10th Floor Lounge of SSW --->
* Office hours: Mondays 10:00 am to 12:00 pm on Zoom
* Contact preference: through Piazza

* Course websites (all accessible via courseworks or github):
* Grades and basic course info on **Courseworks**: <http://courseworks2.columbia.edu>
* Discussion board on **Piazza**: <https://piazza.com/class/kt4rwf8dd0s70b>
* Course materials and repositories on **GitHub**: <http://tzstatsads.github.io> or <https://github.com/TZstatsADS/ADS_Teaching>

#### Prerequisites
The prerequisites for this course include working knowledge of statistics and probability, data mining, statistical modeling, and machine learning. Prior **advanced** programming experience in R or Python is required.

#### Description
This course incorporates knowledge and skills covered in a statistical curriculum with topics and projects in data science. Programming will be covered using existing tools mostly in R, while students can use tools from other languages. Computing best practices will be taught using test-driven development, version control, and collaboration. Students finish the class with a portfolio on GitHub and a deeper understanding of several core statistical/machine-learning algorithms.

This course will be a project-based hands-on course in data science. **No formal instruction on statistics, data science, or machine learning will be given**. Project cycles run every 2-3 weeks, where we will have mini-group data projects. Groups will be formed **randomly** and project products will be peer-reviewed, in addition to evaluation by the instructional team.

#### Course organization
This course will have a total of *four* project cycles. Each project cycle follows a sequence of four types of activities.

**a**. Dataset release, introduction to the data science problem, individual exercises, team forming

**b**. Lecture/tutorial

**c**. Brainstorming, live hacking, code sharing

**d**. Team presentation, peer reviews, within-team peer reviews

Except for project 1, students will be working in teams of 5 that will be randomly formed. For a meaningful experience in data science, students are expected to collaborate and work together on all the stages of a project. Code sharing and brainstorming are great opportunities to learn from each other.

We will have a total of four project cycles for this course (topics are subject to change):

1. [Individual] R notebook for exploratory data analysis
2. Shiny app for interactive data visualization project.
3. Predictive analytics of images.
4. Algorithms implementation, evaluation, and reproducibility challenge.

Below is the tentative schedule we will follow for Fall 2021.

+ Week 1 (Sep 15): 1a+1b
+ Week 2 (Sep 22): 1c
+ Week 3 (Sep 29): 1d+2a
+ Week 4 (Oct 6): 2b+2c
+ Week 5 (Oct 13): 2b+2c
+ Week 6 (Oct 20): 2d+3a
+ Week 7 (Oct 27): 3b+3c
+ Week 8 (Nov 3): 3b+3c
+ Week 9 (Nov 10): 3d+4a
+ Week 10 (Nov 17): 4b+4c
+ Thanksgiving Break
+ Week 11 (Dec 1): 4b+4c
+ Week 12 (Dec 8): 4d

#### Evaluation

Students' performance will be evaluated based on

* [85%] Project products (instructor-reviewed and/or peer-reviewed, averaged over 4 projects). Each project description will have explicit grading rubrics.
* [15%] Individual participation (based on individual tasks and instructors' observation).
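
The weighting above can be sketched in a few lines. This is only an illustration: the 85/15 split comes from the list above, while the function name `final_score` and the assumption that both components have already been converted to a common 0-100 scale are hypothetical and not part of the syllabus.

```python
# Weights are taken from the evaluation list above.
PROJECT_WEIGHT = 0.85
PARTICIPATION_WEIGHT = 0.15


def final_score(project_scores, participation_scores):
    """Combine per-project scores into a single course score.

    Assumption (not stated in the syllabus): both inputs are already
    expressed on the same 0-100 scale, one entry per project cycle.
    """
    if not project_scores or not participation_scores:
        raise ValueError("need at least one score of each kind")
    # Each component is averaged over the project cycles first.
    project_avg = sum(project_scores) / len(project_scores)
    participation_avg = sum(participation_scores) / len(participation_scores)
    return PROJECT_WEIGHT * project_avg + PARTICIPATION_WEIGHT * participation_avg
```

Averaging each component before weighting mirrors the "averaged over 4 projects" wording; how the instructional team actually rescales participation scores is not specified here.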

##### A note on participation evaluation.
In addition to individual tasks such as peer reviews, for each project, we will enforce formal evaluation of participation as follows.

* Each project needs to show clear collaboration and task assignments in Piazza discussion using the group discussion function.
* Teams should try to use GitHub to coordinate code sharing and project development throughout the project. GitHub activities will be used as part of participation evaluation.
* Students should participate actively in class discussion and piazza discussion.
* We will give a participation score for each project cycle, the average of which will contribute 15% of your final grade. Participation will be graded on the following curve.

* A (1.8-2): project leader or major contributor who contributes substantially in every stage of the project and class discussions.
* A- (1.5-1.8): major contributor who contributes substantially to two stages of the project and some discussions. *This is what most students receive for their participation.*
* B+ (1.2-1.5): average participation; participates in the discussion at every stage and contributes substantially in at least one stage of the project and some discussions.
* B (1-1.2) or lower: below average performance.
* This is to ensure a positive learning process for all of us.
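
As a rough illustration of the curve above, the bands can be written as a small lookup. The listed ranges share their endpoints (1.8 appears in both A and A-, for example), so this sketch assumes a boundary score falls into the higher band; that tie-breaking rule and the function name `participation_band` are assumptions, not course policy.

```python
def participation_band(score):
    """Map a participation score on the 0-2 scale to a letter band.

    Assumption: a score exactly on a shared boundary (1.8, 1.5, 1.2)
    is placed in the higher band.
    """
    if not 0 <= score <= 2:
        raise ValueError("participation score must be between 0 and 2")
    if score >= 1.8:
        return "A"
    if score >= 1.5:
        return "A-"
    if score >= 1.2:
        return "B+"
    return "B or lower"
```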

#### Communication
Project grades are managed in Courseworks. We will be using the discussion/announcement tools in Piazza (accessible from Courseworks) for our online class communication and discussion. The system is well suited to exchanging ideas, discussing plans, and getting answers and help quickly and efficiently from the instructional team and classmates. Rather than emailing questions to the teaching staff, we encourage you to post your questions online.

#### Textbook
There is not a single required text. As part of this course, we will learn from what we can find online and in academic papers. Here are a couple of recommended reference books.

+ Mount and Zumel (2014) Practical data science with R.
+ Segaran (2007) Programming collective intelligence: building smart web 2.0 applications.
+ Tufte (2001) The visual display of quantitative information.
+ Fung (2013) Numbersense: how to use big data to your advantage.
+ Wickham (2017) R for Data Science http://r4ds.had.co.nz/

#### Class policy

* We learn together through projects. Please stay positive and congenial. Share what you know with your peers and also learn from them.

* Working towards deadlines can be stressful. Remember, emails or online posts do NOT have tones. Be mindful about how you phrase your questions, comments, inquiries, and suggestions. Also be generous and forgiving when reading them.

* **Academic Integrity** is the cornerstone of meaningful teaching and learning. It is especially important for our project-based course. Remember, what matters more is how much you learn, not what grade you will get. In your project, document references and resources that have been incorporated into your project and credit them appropriately. Plagiarism is one of the most likely forms of cheating in this course.

* Be a good team member and contribute to each project as much as you can. Don't underestimate the efforts of your teammates. Something that seems simple may not be that simple.

* Emails related to learning and projects shall be redirected to our discussion board.

* Students are [expected](http://policylibrary.columbia.edu/student-email-communication-policy) to check emails at least once every 12 hours during the week and every 24 hours over the weekend. Students should make sure not to miss any important class-related announcements sent by emails or posted on Courseworks. Emails will be delivered to the students' official UNI. It is the students' responsibility to ensure that these emails are properly forwarded if they choose to use an alternative email address.
5 changes: 2 additions & 3 deletions CourseInfo/G5243_Spring_2021_ADS.md
@@ -13,12 +13,11 @@
* Office hours: Mondays 2:00 pm to 2:30 pm and 8:00 pm to 9:30 pm via Zoom (can join through coursework "Zoom Class Sessions")
* Contact preference: through Piazza

* Course websites (all accessible via courseworks or github):
* Grades and basic course info on **Courseworks**: <http://courseworks2.columbia.edu>
* Discussion board on **Piazza**: <https://piazza.com/class/kjn957avsba7p3>
* Course materials and repositories on **GitHub**: <http://tzstatsads.github.io> or <https://github.com/TZstatsADS/ADS_Teaching>

#### Prerequisites
The prerequisites for this course include working knowledge of statistics and probability, data mining, statistical modeling, and machine learning. Prior **advanced** programming experience in R or Python is required.

@@ -109,5 +108,5 @@ There is not a single required text. As part of this course, we will learn from
* Be a good team member and contribute to each project as much as you can. Don't underestimate the efforts of your teammates. Something that seems simple may not be that simple.

* Emails related to learning and projects shall be redirected to our discussion board.

* Students are [expected](http://policylibrary.columbia.edu/student-email-communication-policy) to check emails at least once every 12 hours during the week and every 24 hours over the weekend. Students should make sure not to miss any important class-related announcements sent by emails or posted on Courseworks. Emails will be delivered to the students' official UNI. It is the students' responsibility to ensure that these emails are properly forwarded if they choose to use an alternative email address.
28 changes: 28 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/README.md
@@ -0,0 +1,28 @@
# Applied Data Science @ Columbia
## Fall 2021
## Project 1: A "data story" on the history of philosophy

<img src="figs/100126-the-glass.jpeg" width="500">

### [Project Description](doc/)
This is the first and only *individual* (as opposed to *team*) project this semester.

Term: Fall 2021

+ Project title: Lorem ipsum dolor sit amet
+ This project is conducted by [your name]

+ Project summary: [a short summary] Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

Following [suggestions](http://nicercode.github.io/blog/2013-04-05-projects/) by [RICH FITZJOHN](http://nicercode.github.io/about/#Team) (@richfitz), this folder is organized as follows.

```
proj/
├── lib/
├── data/
├── doc/
├── figs/
└── output/
```

Please see each subfolder for a README file.
6 changes: 6 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/data/README.md
@@ -0,0 +1,6 @@
# ADS Project 1: R Notebook on the history of philosophy

### Data folder

The data directory contains data used in the analysis. This is treated as read only; in particular, the R/Python files are never allowed to write to the files in here. Depending on the project, these might be csv files or a database, and the directory itself may have subdirectories.
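
One way to respect the read-only rule in code is to derive every write location from the `output` folder instead of from the raw file's own location. The sketch below is an assumption about how that could look in Python; the helper name `output_path_for` and the `processed_` file prefix are hypothetical and not part of the starter codes.

```python
from pathlib import Path


def output_path_for(raw_file, project_root):
    """Return where a processed copy of `raw_file` should be written.

    Raw files live under <project_root>/data and are only ever read;
    results go under <project_root>/output, which is created if missing.
    """
    root = Path(project_root).resolve()
    raw = Path(raw_file).resolve()
    data_dir = root / "data"
    # Refuse inputs that are not raw data files of this project.
    if data_dir not in raw.parents:
        raise ValueError(f"{raw} is not inside {data_dir}")
    out_dir = root / "output"
    out_dir.mkdir(parents=True, exist_ok=True)
    return out_dir / f"processed_{raw.name}"
```

Routing all writes through a helper like this keeps the analysis code from ever opening a file under `data/` for writing.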

80 changes: 80 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/doc/Proj1_desc.md
@@ -0,0 +1,80 @@
## Applied Data Science @ Columbia
## STAT GR5243/GU4243 Fall 2021
### Project 1 An R Notebook "Data Story" on the history of philosophy

<img src="../figs/100126-the-glass.jpeg" width="400">

The goal of this project is to write a data story on philosophy using the dataset for the [Philosophy Data Project](http://philosophydata.com/index.html). Applying data mining, statistical analysis and visualization, students should derive interesting findings in this collection of philosophy texts and write a "data story" that can be shared with a **general audience**.

### Datasets

+ The data sets can be found at https://www.kaggle.com/kouroshalizadeh/history-of-philosophy.

### Challenge

In this project you will carry out an **exploratory data analysis (EDA)** of philosophy texts and write a blog on interesting findings from your analysis (i.e., a *data story*).

You are tasked with exploring the text corpus using tools from data mining, statistical analysis, visualization, etc., all available in `R` or `Python`, and writing a blog post using an `R` or `Python` notebook. Your blog should be in the form of a `data story` on interesting trends and patterns identified by your analysis of these philosophy texts.
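
As one possible starting point for exploring a text corpus, a word-frequency count is often the first thing to try. The following is a minimal sketch using only the Python standard library; the function name `top_words`, the simple regular-expression tokenizer, and the optional stopword set are illustrative assumptions, and a real analysis would likely use a proper text-mining package.

```python
import re
from collections import Counter


def top_words(sentences, n=10, stopwords=frozenset()):
    """Return the n most frequent lowercase word tokens in `sentences`.

    `sentences` is any iterable of strings; tokens are runs of letters
    and apostrophes, which is a deliberately crude tokenizer.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = re.findall(r"[a-z']+", sentence.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts.most_common(n)
```

Comparing such counts across authors or schools is one simple way to surface a pattern worth telling a story about.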

Even though this is an individual project, you are **encouraged** to discuss with your classmates and exchange ideas.

### Project organization

A link to initiate a *GitHub starter codes repo* will be posted on piazza for you to start your own project.

#### Suggested workflow
This is a relatively short project. We only have about two weeks of working time.

1. [wk1] Week 1 is the **data processing and mining** week. Read the data description and **project requirement**, browse the data, study the R notebooks in the starter codes, think about what to do, and try out different tools you find related to this task.
2. [wk1] Try out ideas on a **subset** of the data set to get a sense of the computational burden of this project.
3. [wk2] Explore data for interesting trends and start writing your data story.
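
For the "try ideas on a subset" step, one option is to draw a fixed-size random sample while streaming the CSV file, so the whole data set never has to be loaded. The sketch below uses reservoir sampling with the standard library; the function name `sample_rows` is hypothetical, and it assumes the data is a UTF-8 CSV file with a header row.

```python
import csv
import random


def sample_rows(csv_path, k, seed=0):
    """Return up to k rows chosen uniformly at random (reservoir sampling).

    The file is read once, row by row; each row is a dict keyed by the
    header. A fixed seed makes the subset reproducible.
    """
    rng = random.Random(seed)
    sample = []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            if i < k:
                # Fill the reservoir with the first k rows.
                sample.append(row)
            else:
                # Replace an existing entry with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    sample[j] = row
    return sample
```

Timing an analysis on a few thousand sampled rows gives a quick estimate of how long the full run might take.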

#### Submission
You should produce an R or Python notebook (rmd and html files) in your GitHub project folder, where you should write a story or a blog post on the history of philosophy based on your data analysis. Your story, especially the *main takeaways*, should be **supported by** your results and appropriate visualization.

Your story should NOT be a laundry list of all analyses you have tried on the data or how you solved a technical issue in your analysis, no matter how fascinating that might be.

#### Repository requirement

The final repo should be under our class github organization (TZStatsADS) and be organized according to the structure of the starter codes.

```
proj/
├── data/
├── doc/
├── figs/
├── lib/
├── output/
└── README
```
- The `data` folder contains the raw data of this project. These data should NOT be processed inside this folder. Processed data should be saved to `output` folder. This is to ensure that the raw data will not be altered.
- The `doc` folder should have documentation for this project, presentation files and other supporting materials.
- The `figs` folder contains figure files produced during the project and running of the codes.
- The `lib` folder (sometimes called `dev`) contains computation codes for your data analysis. Make sure your README.md explains which programs are found in this folder.
- The `output` folder is the holding place for intermediate and final computational results.

The root README.md should contain your name and an abstract of your findings.
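
Before submitting, the layout described above can be checked with a short script. This is only a sketch: the folder names come from the structure listed in this section, while the function name `missing_items` and the choice to accept any file whose name starts with `README` are assumptions.

```python
from pathlib import Path

# Folder names taken from the required repository structure.
REQUIRED_DIRS = ("data", "doc", "figs", "lib", "output")


def missing_items(repo_root):
    """List required folders (and the root README) absent from repo_root."""
    root = Path(repo_root)
    missing = [d for d in REQUIRED_DIRS if not (root / d).is_dir()]
    # Accept README, README.md, README.txt, and so on.
    if not any(root.glob("README*")):
        missing.append("README")
    return missing
```

An empty list means the top-level structure matches; it does not check what is inside each folder.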

### Useful resources

##### R packages
* R [tidyverse](https://www.tidyverse.org/) packages
* R [DT](http://www.htmlwidgets.org/showcase_datatables.html) package
* R [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
* [Rcharts](https://www.r-graph-gallery.com/interactive-charts.html), quick interactive plots
* [htmlwidgets](http://www.htmlwidgets.org/), javascript library adaptation in R.

##### Project tools
* A brief [guide](http://rogerdudler.github.io/git-guide/) to git.
* Putting your project on [GitHub](https://guides.github.com/introduction/getting-your-project-on-github/).

##### Example
+ [A good "data story"](https://drhagen.com/blog/the-missing-11th-of-the-month/)

##### Tutorials

For this project we will give **tutorials** and give comments on:

- GitHub
- R notebook
- Example on sentiment analysis and topic modeling
5 changes: 5 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/doc/README.md
@@ -0,0 +1,5 @@
# ADS Project 1: R Notebook on the history of philosophy

### Doc folder

The doc directory contains the report or presentation files. It can have subfolders.
5 changes: 5 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/figs/README.md
@@ -0,0 +1,5 @@
# ADS Project 1: R Notebook on the history of philosophy

### Figs folder

The figs directory contains the figures. This directory only contains generated files; that is, one should always be able to delete the contents and regenerate them.
6 changes: 6 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/lib/README.md
@@ -0,0 +1,6 @@
# ADS Project 1: R Notebook on the history of philosophy

### Code dev/lib Folder

The lib directory contains various files with function definitions and computation codes for your data analysis.

6 changes: 6 additions & 0 deletions Projects_StarterCodes/Project1-RNotebook/output/README.md
@@ -0,0 +1,6 @@
# ADS Project 1: R Notebook on the history of philosophy

### Output folder

The output directory contains analysis output, processed datasets, logs, or other processed things.
