Merge branch 'master' of https://github.com/TZstatsADS/ADS_Teaching

minhdng · Sep 7, 2020 · 8dafc47
2 parents 02cf93b + 502763c
commit 8dafc47
Showing 771 changed files with 414,962 additions and 3,771 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/7-Spring2019/README.md b/7-Spring2019/README.md
@@ -20,6 +20,8 @@
 + Data Story Examples: [Example 1](http://www.columbia.edu/~hl3099/proj1_report.html) ([repo](https://github.com/TZstatsADS/Spring2018-Project1-Hongyu-Li)), [Example 2](https://github.com/TZstatsADS/fall2017-project1-duanshiqi).
 + Discussion and Q&A
 
+[Finished student projects](https://github.com/TZstatsADS?q=Spring2019-Proj1&type=&language=) 
+
 ----
 ##### Shortcuts: [Project 1](#project-cycle-1-individual-r-notebook-for-exploratory-data-analysis) | [Project 3](#project-cycle-3-predictive-modeling) | [Project 4](#project-cycle-4-algorithm-implementation-and-evaluation)
 
@@ -52,6 +54,8 @@
 #### Week 6 (Feb 27)
 + Project 2 presentations
 
+[Finished student projects](https://github.com/TZstatsADS?q=Spring2019-Proj2&type=&language=) 
+
 ----
 ##### Shortcuts: [Project 1](#project-cycle-1-individual-r-notebook-for-exploratory-data-analysis) | [Project 2](#project-cycle-2-shiny-app-development) | [Project 4](#project-cycle-4-algorithm-implementation-and-evaluation)
 
@@ -84,6 +88,8 @@
 #### Week 9 (Mar 27)
 + Project 3 submission and presentations
 
+[Finished student projects](https://github.com/TZstatsADS?q=Spring2019-Proj3&type=&language=) 
+
 ----
 ##### Shortcuts: [Project 1](#project-cycle-1-individual-r-notebook-for-exploratory-data-analysis) | [Project 2](#project-cycle-2-shiny-app-development) | [Project 3](#project-cycle-3-predictive-modeling) 
 
@@ -101,6 +107,8 @@
 + Project 4 presentations
 + Project 5 discussions
 
+[Finished student projects](https://github.com/TZstatsADS?q=Spring2019-Proj4&type=&language=) 
+
 ----
 ##### Shortcuts: [Project 1](#project-cycle-1-individual-r-notebook-for-exploratory-data-analysis) | [Project 2](#project-cycle-2-shiny-app-development) | [Project 3](#project-cycle-3-predictive-modeling) | [Project 4](#project-cycle-4-algorithm-implementation-and-evaluation)
 
@@ -110,10 +118,9 @@
 + [Project 3 summary](/Tutorials/wk12-project3summary/)
 + Project 5 discussions
 
-----
-### Project cycle 5: 
-
 #### Week 14 (May 1)
 + Project 5 Presentations
 
+[Finished student projects](https://github.com/TZstatsADS?q=Spring2019-Proj5&type=&language=) 
+
 ##### Shortcuts: [Project 1](#project-cycle-1-individual-r-notebook-for-exploratory-data-analysis) | [Project 2](#project-cycle-2-shiny-app-development) | [Project 3](#project-cycle-3-predictive-modeling) | [Project 4](#project-cycle-4-algorithm-implementation-and-evaluation)
diff --git a/8-Fall2019/.DS_Store b/8-Fall2019/.DS_Store
diff --git a/8-Fall2019/Projects_StarterCodes/.DS_Store b/8-Fall2019/Projects_StarterCodes/.DS_Store
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/.DS_Store b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/.DS_Store
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/README.md b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/README.md
@@ -0,0 +1,28 @@
+# Applied Data Science @ Columbia
+## Fall 2019
+## Project 1: A "data story" on the songs of our times
+
+<img src="figs/title1.jpeg" width="500">
+
+### [Project Description](doc/Proj1_desc.md)
+This is the first and only *individual* (as opposed to *team*) project this semester. 
+
+Term: Fall 2019
+
++ Project title: Lorem ipsum dolor sit amet
++ This project is conducted by [your name]
+
++ Project summary: [a short summary] Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
+
+Following [suggestions](http://nicercode.github.io/blog/2013-04-05-projects/) by [RICH FITZJOHN](http://nicercode.github.io/about/#Team) (@richfitz), this folder is organized as follows.
+
+```
+proj/
+├── lib/
+├── data/
+├── doc/
+├── figs/
+└── output/
+```
+
+Please see each subfolder for a README file.
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/README.md b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/README.md
@@ -0,0 +1,5 @@
+# ADS Project 1: R Notebook on Lyrics Analysis
+### Data folder
+
+The data directory contains data used in the analysis. This is treated as read-only; in particular, the R/Python files are never allowed to write to the files in here. Depending on the project, these might be CSV files or a database, and the directory itself may have subdirectories.
+
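+The read-only convention above can be sketched as follows. This is a minimal illustration, not part of the starter codes: the file names `toy_artists.csv` and `toy_artists_clean.rds` are hypothetical, and a temporary folder stands in for `proj/` so the snippet can run anywhere.
+
+```r
+# Sketch with hypothetical file names; a temporary folder stands in for proj/.
+proj <- file.path(tempdir(), "proj")
+dir.create(file.path(proj, "data"), recursive = TRUE, showWarnings = FALSE)
+dir.create(file.path(proj, "output"), showWarnings = FALSE)
+
+# Setup only: a stand-in for a raw file that would already exist in data/.
+raw_path <- file.path(proj, "data", "toy_artists.csv")
+write.csv(data.frame(Artist = c(" abba", "Queen ")), raw_path, row.names = FALSE)
+
+# Read from data/ ...
+raw <- read.csv(raw_path, stringsAsFactors = FALSE)
+# ... clean in memory ...
+clean <- transform(raw, Artist = trimws(Artist))
+# ... and write the result to output/, leaving the raw file untouched.
+out_path <- file.path(proj, "output", "toy_artists_clean.rds")
+saveRDS(clean, out_path)
+```
+
+In a real project the setup step disappears: analysis code only ever reads from `data/` and writes to `output/`.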
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/artists.csv b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/artists.csv
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/lyrics.rdata b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/lyrics.rdata
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Lyrics_ShinyApp.Rmd b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Lyrics_ShinyApp.Rmd
@@ -0,0 +1,209 @@
+---
+title: "Lyrics ShinyApp"
+author: "Chengliang Tang, Arpita Shah, Yujie Wang and Tian Zheng"
+output: html_notebook
+runtime: shiny
+---
+
+"lyrics_filter.csv" is a filtered corpus of 380,000+ song lyrics from MetroLyrics. You can read more about it on [Kaggle](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics).
+
+"info_artist.csv" provides background information on all the artists. This information was scraped from [LyricsFreak](https://www.lyricsfreak.com/).
+
+Here, we explore these data sets and try to find interesting patterns.
+
+### Load all the required libraries
+
+From the packages' descriptions:
+
++ `tidyverse` is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures;
++ `tidytext` allows text mining using 'dplyr', 'ggplot2', and other tidy tools;
++ `plotly` allows plotting interactive graphs;
++ `DT` provides an R interface to the JavaScript library DataTables;
++ `tm` is a framework for text mining applications within R;
++ `scales` maps data to aesthetics, and provides methods for automatically determining breaks and labels for axes and legends;
++ `data.table` is a package for fast aggregation of large data;
++ `wordcloud2` provides an HTML5 interface to wordcloud for data visualization;
++ `gridExtra` contains miscellaneous functions for "grid" graphics;
++ `ngram` is for constructing n-grams (“tokenizing”), as well as generating new text based on the n-gram structure of a given text input (“babbling”);
++ `shiny` is an R package that makes it easy to build interactive web apps straight from R.
+
+```{r load libraries, warning=FALSE, message=FALSE}
+library(tidyverse)
+library(tidytext)
+library(plotly)
+library(DT)
+library(tm)
+library(data.table)
+library(scales)
+library(wordcloud2)
+library(gridExtra)
+library(ngram)
+library(shiny) 
+```
+
+
+### Load the processed lyrics data along with artist information
+
+We use the processed data and artist information for our analysis.
+
+```{r load data, warning=FALSE, message=FALSE}
+# load lyrics data
+load('../output/processed_lyrics.RData') 
+# load artist information
+dt_artist <- fread('../data/artists.csv') 
+```
+
+### Preparations for visualization
+```{r}
+lyrics_list <- c("Folk", "R&B", "Electronic", "Jazz", "Indie", "Country", "Rock", "Metal", "Pop", "Hip-Hop", "Other")
+time_list <- c("1970s", "1980s", "1990s", "2000s", "2010s")
+corpus <- VCorpus(VectorSource(dt_lyrics$stemmedwords))
+word_tibble <- tidy(corpus) %>%
+  select(text) %>%
+  mutate(id = row_number()) %>%
+  unnest_tokens(word, text)
+```
+
+
+
+### Specify the user interface for the R Shiny app
+```{r}
+# Define UI for the app: word clouds, bigram bar charts, and a data table ----
+ui <- navbarPage(strong("Lyrics Analysis"),
+  tabPanel("Overview",
+    titlePanel("Most frequent words"),
+    # Sidebar layout with input and output definitions ----
+    sidebarLayout(
+      # Sidebar panel for inputs ----
+      sidebarPanel(
+        sliderInput(inputId = "nwords1",
+                    label = "Number of terms in the first word cloud:",
+                    min = 5, max = 100, value = 50),
+        selectInput('genre1', 'Genre of the first word cloud', 
+                    lyrics_list, selected='Folk')
+
+    ),
+    # Main panel for displaying outputs ----
+    mainPanel(
+      wordcloud2Output(outputId = "WC1", height = "300")
+    )
+  ),
+  hr(),
+  sidebarLayout(
+      # Sidebar panel for inputs ----
+      sidebarPanel(
+        sliderInput(inputId = "nwords2",
+                    label = "Number of terms in the second word cloud:",
+                    min = 5, max = 100, value = 50),
+        selectInput('genre2', 'Genre of the second word cloud', 
+                    lyrics_list, selected='Metal')
+    ),
+    # Main panel for displaying outputs ----
+    mainPanel(
+      wordcloud2Output(outputId = "WC2", height = "300")
+    )
+  )
+           ),
+  tabPanel("Time Variation",
+           # Sidebar layout with input and output definitions ----
+          sidebarLayout(
+            # Sidebar panel for inputs ----
+            sidebarPanel(
+              selectInput('decade1', 'Selected decade for the first plot:', 
+                          time_list, selected='1970s'),
+              selectInput('decade2', 'Selected decade for the second plot:', 
+                          time_list, selected='1980s'),
+              numericInput(inputId = "topBigrams",
+                                          label = "Number of top pairs to view:",
+                                          min = 1,
+                                          max = 20,
+                                          value = 10)
+      
+          ),
+          # Main panel for displaying outputs ----
+          mainPanel(
+            fluidRow(
+              column(5,
+                     plotlyOutput("bigram1")),
+              column(5,
+                     plotlyOutput("bigram2"))
+            )
+          )
+        )
+           ),
+  tabPanel("Data", 
+           DT::dataTableOutput("table"))
+)
+```
+
+
+### Develop the server for the R Shiny app
+This Shiny app visualizes a summary of the data and displays the data table itself.
+
+```{r}
+# Define server logic required for ui ----
+server <- function(input, output) {
+  output$WC1 <- renderWordcloud2({
+    count(filter(word_tibble, id %in% which(dt_lyrics$genre == input$genre1)), word, sort = TRUE) %>%
+      slice(1:input$nwords1) %>%
+      wordcloud2(size=0.6, rotateRatio=0.2)
+  })
+  output$WC2 <- renderWordcloud2({
+    count(filter(word_tibble, id %in% which(dt_lyrics$genre == input$genre2)), word, sort = TRUE) %>%
+      slice(1:input$nwords2) %>%
+      wordcloud2(size=0.6, rotateRatio=0.2)
+  })
+  output$bigram1 <- renderPlotly({
+    year_start <- as.integer(substr(input$decade1, 1, 4))
+    dt_sub <- filter(dt_lyrics, year>=year_start) %>%
+      filter(year<(year_start+10))
+    lyric_bigrams <- dt_sub %>%
+      unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2)
+    bigram_counts <- lyric_bigrams %>%
+      separate(bigram, c("word1", "word2"), sep = " ") %>%
+      count(word1, word2, sort = TRUE)
+    combined_words <- apply(bigram_counts[c(1, 2)], 1, paste , collapse = " " )[1:input$topBigrams]
+    x_names <- factor(combined_words, levels = rev(combined_words))
+    plot_ly(
+      x = bigram_counts$n[1:input$topBigrams],
+      y = x_names,
+      name = "Bigram",
+      type = "bar",
+      orientation = 'h'
+    )
+  })
+  output$bigram2 <- renderPlotly({
+    year_start <- as.integer(substr(input$decade2, 1, 4))
+    dt_sub <- filter(dt_lyrics, year>=year_start) %>%
+      filter(year<(year_start+10))
+    lyric_bigrams <- dt_sub %>%
+      unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2)
+    bigram_counts <- lyric_bigrams %>%
+      separate(bigram, c("word1", "word2"), sep = " ") %>%
+      count(word1, word2, sort = TRUE)
+    combined_words <- apply(bigram_counts[c(1, 2)], 1, paste , collapse = " " )[1:input$topBigrams]
+    x_names <- factor(combined_words, levels = rev(combined_words))
+    plot_ly(
+      x = bigram_counts$n[1:input$topBigrams],
+      y = x_names,
+      name = "Bigram",
+      type = "bar",
+      orientation = 'h'
+    )
+  })
+  output$table <- DT::renderDataTable({
+    DT::datatable(dt_lyrics)
+  })
+}
+```
+
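+The two bigram outputs in the server above repeat the same steps for `decade1` and `decade2`. One possible way to factor that logic out is sketched below. The helper `top_bigrams()` and the `toy` data frame are hypothetical and not part of the starter codes; the sketch assumes a data frame with `year` and `stemmedwords` columns, as in `dt_lyrics`. The block uses a plain `r` fence so it is shown but not evaluated when the notebook runs.
+
+```r
+library(dplyr)
+library(tidytext)
+
+# Hypothetical helper: the `k` most frequent bigrams among songs from one decade.
+# `lyrics` is assumed to have the columns `year` and `stemmedwords`.
+top_bigrams <- function(lyrics, decade, k = 10) {
+  year_start <- as.integer(substr(decade, 1, 4))  # "1970s" -> 1970
+  lyrics %>%
+    filter(year >= year_start, year < year_start + 10) %>%
+    unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2) %>%
+    filter(!is.na(bigram)) %>%   # texts with fewer than two words yield NA
+    count(bigram, sort = TRUE) %>%
+    head(k)
+}
+
+# Toy data for illustration only.
+toy <- data.frame(
+  year = c(1975, 1978, 1984),
+  stemmedwords = c("love love song", "love song night", "dance night"),
+  stringsAsFactors = FALSE
+)
+top_70s <- top_bigrams(toy, "1970s", k = 2)
+```
+
+Each `renderPlotly()` call could then pass `top_bigrams(dt_lyrics, input$decade1, input$topBigrams)` to `plot_ly()`, so the filtering and counting live in one place.
+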
+### Run the R Shiny app
+
+```{r shiny app, warning=FALSE, message=FALSE}
+shinyApp(ui, server)
+```
+
+
+
+
+
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Proj1_desc.md b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Proj1_desc.md
@@ -0,0 +1,91 @@
+## Applied Data Science @ Columbia
+## STAT GR5243/GU4243 Fall 2019 
+### Project 1 An R/Python Notebook "Data Story" on Song Lyrics
+
+<img src="../figs/title2.jpg" width="400">
+
+Has a song ever really *spoken* to you? Do you like some of your favorite songs for their lyrics? When you think of a particular music genre (e.g., classic rock), do you expect certain *topics* or *sentiments* in the lyrics? 
+
+The goal of this project is to look deeper into the patterns and characteristics of different types of song lyrics. Applying tools from natural language processing and text mining, students should derive interesting findings in this collection of song lyrics and write a "data story" that can be shared with a general audience. 
+
+### Datasets
+
++ "lyrics.csv" ([Download](https://www.dropbox.com/s/3tfv5v73z0ec8vr/lyrics.csv?dl=0)) is a filtered corpus of 100,000+ song lyrics from MetroLyrics. Available features are song name, year, artist, genre, and lyrics. You can find the complete 380,000+ song lyrics data on [Kaggle](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics). A ```lyrics.RData``` file is also provided in the [\data folder](../data/).
+
++ "artists.csv" (in the [\data folder](../data/)) provides additional background information on all the artists. This information was scraped from [LyricsFreak](https://www.lyricsfreak.com/) by the ADS instructional team. For singers, a detailed biography is provided. For bands, the available information includes members, year established, and location. **The use of this data set for this project is optional.**
+
+### Challenge 
+
+In this project you will carry out an **exploratory data analysis (EDA)** of the corpus of song lyrics and write a blog on interesting findings from the provided data sets (i.e., a *data story*).
+
+You are tasked to explore the texts using tools from text mining and natural language processing, such as sentiment analysis and topic modeling, all available in `R`, and to write a blog post using an `R` Notebook. Your blog post should take the form of a `data story` on interesting trends and patterns identified by your analysis of these song lyrics. 
+
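+As a starting point for the kind of sentiment analysis mentioned above, here is a minimal sketch using `tidytext` and its bundled Bing lexicon. The `toy` data frame and its column names are made up for illustration; the real lyrics data would take its place. This is one possible approach, not a required method.
+
+```r
+library(dplyr)
+library(tidytext)
+
+# Toy data for illustration only; column names are assumptions.
+toy <- data.frame(
+  song = c("a", "b"),
+  lyrics = c("I love this happy day", "sad and lonely night, I cry"),
+  stringsAsFactors = FALSE
+)
+
+sentiment_by_song <- toy %>%
+  unnest_tokens(word, lyrics) %>%                     # one row per word
+  inner_join(get_sentiments("bing"), by = "word") %>% # keep words found in the lexicon
+  count(song, sentiment)                              # positive/negative counts per song
+```
+
+Words missing from the lexicon are silently dropped by the join, so the counts reflect only the words the lexicon knows about.
+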
+Even though this is an individual project, you are **encouraged** to discuss with your classmates online and exchange ideas. 
+
+### Project organization
+
+A link to initiate a *GitHub starter codes repo* will be posted on Piazza for you to start your own project. 
+
+#### Suggested workflow
+This is a relatively short project. We only have about two weeks of working time. In the starter codes, we provide two basic data processing R notebooks to get you started. 
+
+`Text_processing.rmd` cleans the text data while `Lyrics_ShinyApp.rmd` constructs a Shiny app to quickly explore the data. 
+
+1. [wk1] Week 1 is the **data processing and mining** week. Read the data description and **project requirements**, browse the data, study the R notebooks in the starter codes, think about what to do, and try out different tools you find relevant to this task.
+2. [wk1] Try out ideas on a **subset** of the data set to get a sense of the computational burden of this project. 
+3. [wk2] Explore data for interesting trends and start writing your data story. 
+
+#### Submission
+You should produce an R notebook (rmd and html files) in your GitHub project folder, where you should write a story or a blog post on song lyrics based on your data analysis. Your story should be supported by your results and appropriate visualization.
+
+#### Repository requirements
+
+The final repo should be under our class GitHub organization (TZstatsADS) and be organized according to the structure of the starter codes. 
+
+```
+proj/
+├── data/
+├── doc/
+├── figs/
+├── lib/
+├── output/
+└── README
+```
+- The `data` folder contains the raw data of this project. These data should NOT be processed inside this folder. Processed data should be saved to `output` folder. This is to ensure that the raw data will not be altered. 
+- The `doc` folder should hold documentation for this project, presentation files, and other supporting materials. 
+- The `figs` folder contains figure files produced during the project and running of the codes. 
+- The `lib` folder contains computation codes for your data analysis. Make sure your README.md is informative about the programs found in this folder. 
+- The `output` folder is the holding place for intermediate and final computational results.
+
+The root README.md should contain your name and an abstract of your findings. 
+
+### Useful resources
+
+##### R packages
+* R [tidyverse](https://www.tidyverse.org/) packages
+* R [tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html)
+* [Text Mining with `R`](https://www.tidytextmining.com/)
+* R [DT](http://www.htmlwidgets.org/showcase_datatables.html) package
+* R [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html)
+* [Rcharts](https://www.r-graph-gallery.com/interactive-charts.html), quick interactive plots
+* [htmlwidgets](http://www.htmlwidgets.org/), javascript library adaptation in R. 
+
+##### Project tools
+* A brief [guide](http://rogerdudler.github.io/git-guide/) to git.
+* Putting your project on [GitHub](https://guides.github.com/introduction/getting-your-project-on-github/).
+
+##### Examples
++ [Topic modeling](https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf)
++ [Clustering](http://www.statmethods.net/advstats/cluster.html)
++ [Sentiment analysis of Trump's tweets](https://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/)
+
+##### Tutorials
+
+For this project we will give **tutorials** and give comments on:
+
+- GitHub
+- R notebook
+- Example on sentiment analysis and topic modeling
+
+
+
diff --git a/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/README.md b/8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/README.md
@@ -0,0 +1,5 @@
+# ADS Project 1:  R Notebook on Lyrics Analysis
+
+### Doc folder
+
+The doc directory contains the report or presentation files. It can have subfolders.