forked from TZstatsADS/ADS_Teaching
Commit
Merge branch 'master' of https://github.com/TZstatsADS/ADS_Teaching
Showing 771 changed files with 414,962 additions and 3,771 deletions.
28 changes: 28 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
# Applied Data Science @ Columbia | ||
## Fall 2019 | ||
## Project 1: A "data story" on the songs of our times | ||
|
||
<img src="figs/title1.jpeg" width="500"> | ||
|
||
### [Project Description](doc/Proj1_desc.md) | ||
This is the first and only *individual* (as opposed to *team*) project this semester. | ||
|
||
Term: Fall 2019 | ||
|
||
+ Project title: Lorem ipsum dolor sit amet | ||
+ This project is conducted by [your name] | ||
|
||
+ Project summary: [a short summary] Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. | ||
|
||
Following [suggestions](http://nicercode.github.io/blog/2013-04-05-projects/) by [Rich FitzJohn](http://nicercode.github.io/about/#Team) (@richfitz), this folder is organized as follows. | ||
|
||
``` | ||
proj/ | ||
├── lib/ | ||
├── data/ | ||
├── doc/ | ||
├── figs/ | ||
└── output/ | ||
``` | ||
|
||
Please see each subfolder for a README file. |
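
As an illustration of the layout above, the following minimal sketch creates the five subfolders with base R. The helper name `create_project_dirs` and the use of a temporary directory are assumptions made for this example; they are not part of the starter codes.

```r
# Hypothetical helper (not part of the starter codes): create the five
# project subfolders under a given root directory.
create_project_dirs <- function(root = "proj") {
  subdirs <- c("lib", "data", "doc", "figs", "output")
  paths <- file.path(root, subdirs)
  for (p in paths) {
    # recursive = TRUE also creates the root if it does not exist yet
    dir.create(p, recursive = TRUE, showWarnings = FALSE)
  }
  invisible(paths)
}

# Demonstrate in a temporary location so that nothing real is touched.
root <- file.path(tempdir(), "proj")
created <- create_project_dirs(root)
print(basename(created))
```

Keeping the folder names in one vector makes it easy to check a cloned repo against the expected structure.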
5 changes: 5 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# ADS Project 1: R Notebook on Lyrics Analysis | ||
### Data folder | ||
|
||
The data directory contains data used in the analysis. This is treated as read only; in particular, the R/Python files are never allowed to write to the files in here. Depending on the project, these might be CSV files or a database, and the directory itself may have subdirectories. | ||
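
The read-only convention can be sketched as follows: scripts read from `data/` and write every derived file to `output/`. This is a toy example under stated assumptions; the file names `toy_lyrics.csv` and `toy_lyrics_processed.csv` are hypothetical, and the raw file is created here only so that the example is self-contained.

```r
# Set up a throwaway project skeleton in a temporary directory.
root <- file.path(tempdir(), "proj_demo")
data_dir <- file.path(root, "data")
output_dir <- file.path(root, "output")
dir.create(data_dir, recursive = TRUE, showWarnings = FALSE)
dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

# Set-up only: in a real project the raw file already exists in data/
# and analysis code never writes there.
raw_path <- file.path(data_dir, "toy_lyrics.csv")
write.csv(
  data.frame(song = c("a", "b"), lyrics = c("Hello World", "Good Bye")),
  raw_path, row.names = FALSE
)

# Analysis step: read from data/, transform in memory, write to output/.
raw <- read.csv(raw_path, stringsAsFactors = FALSE)
processed <- transform(raw, lyrics = tolower(lyrics))
out_path <- file.path(output_dir, "toy_lyrics_processed.csv")
write.csv(processed, out_path, row.names = FALSE)
```

Because the raw file is never modified, every result in `output/` can be regenerated from scratch.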
|
12,858 changes: 12,858 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/data/artists.csv
Large diffs are not rendered by default.
209 changes: 209 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Lyrics_ShinyApp.Rmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,209 @@ | ||
--- | ||
title: "Lyrics ShinyApp" | ||
author: "Chengliang Tang, Arpita Shah, Yujie Wang and Tian Zheng" | ||
output: html_notebook | ||
runtime: shiny | ||
--- | ||
|
||
"lyrics_filter.csv" is a filtered corpus of 380,000+ song lyrics from MetroLyrics. You can read more about it on [Kaggle](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics). | ||
|
||
"info_artist.csv" provides background information on all the artists. This information was scraped from [LyricsFreak](https://www.lyricsfreak.com/). | ||
|
||
Here, we explore these data sets and try to find interesting patterns. | ||
|
||
### Load all the required libraries | ||
|
||
From the packages' descriptions: | ||
|
||
+ `tidyverse` is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures; | ||
+ `tidytext` allows text mining using 'dplyr', 'ggplot2', and other tidy tools; | ||
+ `plotly` allows plotting interactive graphs; | ||
+ `DT` provides an R interface to the JavaScript library DataTables; | ||
+ `tm` is a framework for text mining applications within R; | ||
+ `scales` maps data to aesthetics and provides methods for automatically determining breaks and labels for axes and legends; | ||
+ `data.table` is a package for fast aggregation of large data; | ||
+ `wordcloud2` provides an HTML5 interface to wordcloud for data visualization; | ||
+ `gridExtra` contains miscellaneous functions for "grid" graphics; | ||
+ `ngram` is for constructing n-grams (“tokenizing”), as well as generating new text based on the n-gram structure of a given text input (“babbling”); | ||
+ `shiny` makes it easy to build interactive web apps straight from R. | ||
|
||
```{r load libraries, warning=FALSE, message=FALSE} | ||
library(tidyverse) | ||
library(tidytext) | ||
library(plotly) | ||
library(DT) | ||
library(tm) | ||
library(data.table) | ||
library(scales) | ||
library(wordcloud2) | ||
library(gridExtra) | ||
library(ngram) | ||
library(shiny) | ||
``` | ||
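
Before running the chunk above, it can help to check which of the listed packages are missing. The sketch below uses only base R; the variable names are chosen for this example, and the install call is left commented out so that nothing is installed without intent.

```r
# Packages loaded by the notebook, as listed above.
required <- c("tidyverse", "tidytext", "plotly", "DT", "tm", "data.table",
              "scales", "wordcloud2", "gridExtra", "ngram", "shiny")

# Compare against what is installed in the current library paths.
to_install <- setdiff(required, rownames(installed.packages()))

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are installed.")
}
```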
|
||
|
||
### Load the processed lyrics data along with artist information | ||
|
||
We use the processed data and artist information for our analysis. | ||
|
||
```{r load data, warning=FALSE, message=FALSE} | ||
# load lyrics data | ||
load('../output/processed_lyrics.RData') | ||
# load artist information | ||
dt_artist <- fread('../data/artists.csv') | ||
``` | ||
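
`load()` restores objects under the names they were saved with, so it is not obvious from the call which variables appear. Later chunks refer to `dt_lyrics`, which presumably comes from the `.RData` file. The sketch below shows one way to check what a file defines, using a toy object (`dt_demo`, a hypothetical name) saved to a temporary file.

```r
# Save a toy object to a temporary .RData file.
tmp <- tempfile(fileext = ".RData")
dt_demo <- data.frame(song = "x", genre = "Rock", stringsAsFactors = FALSE)
save(dt_demo, file = tmp)
rm(dt_demo)

# load() invisibly returns the names of the objects it restored.
loaded <- load(tmp)
print(loaded)

# Fail early if an expected object is missing.
stopifnot("dt_demo" %in% loaded)
str(dt_demo)
```

The same check applied to `'../output/processed_lyrics.RData'` would confirm that `dt_lyrics` is present before the rest of the notebook runs.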
|
||
### Preparations for visualization | ||
```{r} | ||
lyrics_list <- c("Folk", "R&B", "Electronic", "Jazz", "Indie", "Country", "Rock", "Metal", "Pop", "Hip-Hop", "Other") | ||
time_list <- c("1970s", "1980s", "1990s", "2000s", "2010s") | ||
corpus <- VCorpus(VectorSource(dt_lyrics$stemmedwords)) | ||
word_tibble <- tidy(corpus) %>% | ||
select(text) %>% | ||
mutate(id = row_number()) %>% | ||
unnest_tokens(word, text) | ||
``` | ||
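
To see what the tokenization step produces, here is a minimal sketch on a toy data frame. It calls `unnest_tokens()` directly on a data frame rather than going through `VCorpus` and `tidy()` as the chunk above does, and the toy texts are invented for illustration.

```r
library(dplyr)
library(tidytext)

# Two toy "songs"; the id column plays the role of the row number above.
toy <- data.frame(
  id = 1:2,
  text = c("love love night", "night road"),
  stringsAsFactors = FALSE
)

# One row per (document, word) pair.
toy_words <- toy %>%
  unnest_tokens(word, text)

# Word frequencies across the toy corpus.
toy_counts <- toy_words %>%
  count(word, sort = TRUE)

print(toy_words)
print(toy_counts)
```

Because each token keeps its document `id`, filtering by genre or decade reduces to filtering on the ids of the matching rows.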
|
||
|
||
|
||
### Specify the user interface for the R Shiny app | ||
```{r} | ||
# Define the UI: word clouds by genre, top bigrams by decade, and the data table ---- | ||
ui <- navbarPage(strong("Lyrics Analysis"), | ||
  tabPanel("Overview", | ||
    titlePanel("Most frequent words"), | ||
    # Sidebar layout with input and output definitions ---- | ||
    sidebarLayout( | ||
      # Sidebar panel for inputs ---- | ||
      sidebarPanel( | ||
        sliderInput(inputId = "nwords1", | ||
                    label = "Number of terms in the first word cloud:", | ||
                    min = 5, max = 100, value = 50), | ||
        selectInput('genre1', 'Genre of the first word cloud', | ||
                    lyrics_list, selected='Folk') | ||
      ), | ||
      # Main panel for displaying outputs ---- | ||
      mainPanel( | ||
        wordcloud2Output(outputId = "WC1", height = "300") | ||
      ) | ||
    ), | ||
    hr(), | ||
    sidebarLayout( | ||
      # Sidebar panel for inputs ---- | ||
      sidebarPanel( | ||
        sliderInput(inputId = "nwords2", | ||
                    label = "Number of terms in the second word cloud:", | ||
                    min = 5, max = 100, value = 50), | ||
        selectInput('genre2', 'Genre of the second word cloud', | ||
                    lyrics_list, selected='Metal') | ||
      ), | ||
      # Main panel for displaying outputs ---- | ||
      mainPanel( | ||
        wordcloud2Output(outputId = "WC2", height = "300") | ||
      ) | ||
    ) | ||
  ), | ||
  tabPanel("Time Variation", | ||
    # Sidebar layout with input and output definitions ---- | ||
    sidebarLayout( | ||
      # Sidebar panel for inputs ---- | ||
      sidebarPanel( | ||
        selectInput('decade1', 'Selected decade for the first plot:', | ||
                    time_list, selected='1970s'), | ||
        selectInput('decade2', 'Selected decade for the second plot:', | ||
                    time_list, selected='1980s'), | ||
        numericInput(inputId = "topBigrams", | ||
                     label = "Number of top pairs to view:", | ||
                     min = 1, | ||
                     max = 20, | ||
                     value = 10) | ||
      ), | ||
      # Main panel for displaying outputs ---- | ||
      mainPanel( | ||
        fluidRow( | ||
          column(5, | ||
                 plotlyOutput("bigram1")), | ||
          column(5, | ||
                 plotlyOutput("bigram2")) | ||
        ) | ||
      ) | ||
    ) | ||
  ), | ||
  tabPanel("Data", | ||
           DT::dataTableOutput("table")) | ||
) | ||
``` | ||
|
||
|
||
### Develop the server for the R Shiny app | ||
This Shiny app visualizes summaries of the data and displays the data table itself. | ||
|
||
```{r} | ||
# Define the server logic required by the UI ---- | ||
server <- function(input, output) { | ||
  output$WC1 <- renderWordcloud2({ | ||
    count(filter(word_tibble, id %in% which(dt_lyrics$genre == input$genre1)), word, sort = TRUE) %>% | ||
      slice(1:input$nwords1) %>% | ||
      wordcloud2(size=0.6, rotateRatio=0.2) | ||
  }) | ||
  output$WC2 <- renderWordcloud2({ | ||
    count(filter(word_tibble, id %in% which(dt_lyrics$genre == input$genre2)), word, sort = TRUE) %>% | ||
      slice(1:input$nwords2) %>% | ||
      wordcloud2(size=0.6, rotateRatio=0.2) | ||
  }) | ||
  output$bigram1 <- renderPlotly({ | ||
    year_start <- as.integer(substr(input$decade1, 1, 4)) | ||
    dt_sub <- filter(dt_lyrics, year>=year_start) %>% | ||
      filter(year<(year_start+10)) | ||
    lyric_bigrams <- dt_sub %>% | ||
      unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2) | ||
    bigram_counts <- lyric_bigrams %>% | ||
      separate(bigram, c("word1", "word2"), sep = " ") %>% | ||
      count(word1, word2, sort = TRUE) | ||
    combined_words <- apply(bigram_counts[c(1, 2)], 1, paste , collapse = " " )[1:input$topBigrams] | ||
    x_names <- factor(combined_words, levels = rev(combined_words)) | ||
    plot_ly( | ||
      x = bigram_counts$n[1:input$topBigrams], | ||
      y = x_names, | ||
      name = "Bigram", | ||
      type = "bar", | ||
      orientation = 'h' | ||
    ) | ||
  }) | ||
  output$bigram2 <- renderPlotly({ | ||
    year_start <- as.integer(substr(input$decade2, 1, 4)) | ||
    dt_sub <- filter(dt_lyrics, year>=year_start) %>% | ||
      filter(year<(year_start+10)) | ||
    lyric_bigrams <- dt_sub %>% | ||
      unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2) | ||
    bigram_counts <- lyric_bigrams %>% | ||
      separate(bigram, c("word1", "word2"), sep = " ") %>% | ||
      count(word1, word2, sort = TRUE) | ||
    combined_words <- apply(bigram_counts[c(1, 2)], 1, paste , collapse = " " )[1:input$topBigrams] | ||
    x_names <- factor(combined_words, levels = rev(combined_words)) | ||
    plot_ly( | ||
      x = bigram_counts$n[1:input$topBigrams], | ||
      y = x_names, | ||
      name = "Bigram", | ||
      type = "bar", | ||
      orientation = 'h' | ||
    ) | ||
  }) | ||
  output$table <- DT::renderDataTable({ | ||
    DT::datatable(dt_lyrics) | ||
  }) | ||
} | ||
``` | ||
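
The two bigram outputs above share the same counting logic, so that logic can be tested outside Shiny. The sketch below is one possible refactoring, not the authors' code: the helper names `decade_start` and `top_bigrams` are hypothetical, the toy data are invented, and `head()` is used instead of `[1:n]` so that asking for more bigrams than exist does not produce `NA` rows.

```r
library(dplyr)
library(tidytext)

# Hypothetical helper: turn a label such as "1970s" into the integer 1970.
decade_start <- function(label) as.integer(substr(label, 1, 4))

# Hypothetical helper: the n_top most frequent bigrams within one decade.
# Assumes columns `year` and `stemmedwords`, as used in the server code.
top_bigrams <- function(lyrics, year_start, n_top = 10) {
  lyrics %>%
    filter(year >= year_start, year < year_start + 10) %>%
    unnest_tokens(bigram, stemmedwords, token = "ngrams", n = 2) %>%
    filter(!is.na(bigram)) %>%   # texts with a single word yield NA
    count(bigram, sort = TRUE) %>%
    head(n_top)
}

# Invented toy data with the two columns the helper needs.
toy <- data.frame(
  year = c(1975, 1978, 1985),
  stemmedwords = c("love night love night", "love night road", "city light"),
  stringsAsFactors = FALSE
)

res <- top_bigrams(toy, decade_start("1970s"), n_top = 2)
print(res)
```

Inside the app, each `renderPlotly()` block could call such a helper with `input$decade1` or `input$decade2` and pass the result to `plot_ly()`.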
|
||
### Run the R Shiny app | ||
|
||
```{r shiny app, warning=FALSE, message=FALSE} | ||
shinyApp(ui, server) | ||
``` | ||
|
||
|
||
|
||
|
||
|
91 changes: 91 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/Proj1_desc.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,91 @@ | ||
## Applied Data Science @ Columbia | ||
## STAT GR5243/GU4243 Fall 2019 | ||
### Project 1: An R/Python Notebook "Data Story" on Song Lyrics | ||
|
||
<img src="../figs/title2.jpg" width="400"> | ||
|
||
Have you ever felt that a song was really *speaking* to you? Do you like some of your favorite songs for their lyrics? When you think of a particular music genre (e.g., classic rock), do you expect certain *topics* or *sentiments* in the lyrics? | ||
|
||
The goal of this project is to look deeper into the patterns and characteristics of different types of song lyrics. Applying tools from natural language processing and text mining, students should derive interesting findings in this collection of song lyrics and write a "data story" that can be shared with a general audience. | ||
|
||
### Datasets | ||
|
||
+ "lyrics.csv" ([Download](https://www.dropbox.com/s/3tfv5v73z0ec8vr/lyrics.csv?dl=0)) is a filtered corpus of 100,000+ song lyrics from MetroLyrics. Available features are song name, year, artist, genre, and lyrics. You can find the complete data set of 380,000+ song lyrics on [Kaggle](https://www.kaggle.com/gyani95/380000-lyrics-from-metrolyrics). A `lyrics.RData` file is also provided in the [`data` folder](../data/). | ||
|
||
+ "artists.csv" (in the [`data` folder](../data/)) provides additional background information on all the artists. This information was scraped from [LyricsFreak](https://www.lyricsfreak.com/) by the ADS instructional team. For singers, a detailed biography is provided. For bands, the available information includes members, year established, and location. **The use of this data set for this project is optional.** | ||
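
Since the available features include year and genre, a natural first look at the corpus is a cross-tabulation of songs by decade and genre. The sketch below uses invented toy rows and assumes columns named `year` and `genre`; the column names in the real file should be checked before reusing it.

```r
# Invented toy rows standing in for the lyrics data.
toy <- data.frame(
  year = c(1975, 1982, 1989, 2004, 2011),
  genre = c("Rock", "Pop", "Rock", "Hip-Hop", "Pop"),
  stringsAsFactors = FALSE
)

# Derive a decade label such as "1980s" from the year.
toy$decade <- paste0(floor(toy$year / 10) * 10, "s")

# Songs per decade and genre.
decade_by_genre <- table(toy$decade, toy$genre)
print(decade_by_genre)
```

A table like this quickly shows which decade and genre combinations have too few songs to support a comparison.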
|
||
### Challenge | ||
|
||
In this project you will carry out an **exploratory data analysis (EDA)** of the corpus of song lyrics and write a blog post on interesting findings from the provided data sets (i.e., a *data story*). | ||
|
||
You are tasked with exploring the texts using tools from text mining and natural language processing, such as sentiment analysis and topic modeling, all of which are available in `R`, and with writing a blog post as an `R` Notebook. The post should take the form of a *data story* about interesting trends and patterns identified by your analysis of these song lyrics. | ||
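
As one example of the tools mentioned above, the sketch below scores toy lyrics with a sentiment lexicon. It assumes the `tidytext` and `dplyr` packages and uses the Bing lexicon through `get_sentiments("bing")`; the toy lyrics and column names are invented, and this is only one of several reasonable approaches.

```r
library(dplyr)
library(tidytext)

# Invented toy lyrics for two songs.
toy <- data.frame(
  song = c("A", "B"),
  lyrics = c("I love this happy day", "sad and lonely night"),
  stringsAsFactors = FALSE
)

# Tokenize, keep only words found in the lexicon, and count by sentiment.
sentiment_counts <- toy %>%
  unnest_tokens(word, lyrics) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(song, sentiment)

print(sentiment_counts)
```

Words that do not appear in the lexicon are dropped by the join, so the counts describe only the matched part of each song.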
|
||
Even though this is an individual project, you are **encouraged** to discuss with your classmates online and exchange ideas. | ||
|
||
### Project organization | ||
|
||
A link to initiate a *GitHub starter codes repo* will be posted on Piazza for you to start your own project. | ||
|
||
#### Suggested workflow | ||
This is a relatively short project, with only about two weeks of working time. The starter codes include two basic data processing R notebooks to get you started. | ||
|
||
`Text_processing.rmd` cleans the text data, while `Lyrics_ShinyApp.rmd` constructs a Shiny app to quickly explore the data. | ||
|
||
1. [wk1] Week 1 is the **data processing and mining** week. Read the data description and the **project requirements**, browse the data, study the R notebooks in the starter codes, think about what to do, and try out different tools that you find relevant to this task. | ||
2. [wk1] Try out ideas on a **subset** of the data set to get a sense of the computational burden of this project. | ||
3. [wk2] Explore data for interesting trends and start writing your data story. | ||
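
The suggestion to work on a subset first can be sketched with base R: draw a fixed-size random sample, run the step of interest on it, and time it. The toy data, the sample size of 100, and the helper `count_tokens` are all assumptions made for this example.

```r
set.seed(2019)  # make the random subset reproducible

# Invented toy corpus: 1000 "songs" of 20 one-letter words each.
n_songs <- 1000
toy <- data.frame(
  id = seq_len(n_songs),
  lyrics = replicate(n_songs, paste(sample(letters, 20, replace = TRUE), collapse = " ")),
  stringsAsFactors = FALSE
)

# Draw a random subset of rows without replacement.
subset_size <- 100
toy_subset <- toy[sample(nrow(toy), subset_size), ]

# Hypothetical processing step: count whitespace-separated tokens.
count_tokens <- function(df) sum(lengths(strsplit(df$lyrics, " ", fixed = TRUE)))

# Time the step on the subset before scaling up to the full data.
timing <- system.time(n_tokens <- count_tokens(toy_subset))
print(timing["elapsed"])
```

Timing the same step on a few increasing subset sizes gives a rough idea of how long the full corpus will take.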
|
||
#### Submission | ||
You should produce an R notebook (rmd and html files) in your GitHub project folder, in which you write a story or a blog post on song lyrics based on your data analysis. Your story should be supported by your results and appropriate visualizations. | ||
|
||
#### Repository requirements | ||
|
||
The final repo should be under our class GitHub organization (TZStatsADS) and be organized according to the structure of the starter codes. | ||
|
||
``` | ||
proj/ | ||
├── data/ | ||
├── doc/ | ||
├── figs/ | ||
├── lib/ | ||
├── output/ | ||
└── README | ||
``` | ||
- The `data` folder contains the raw data of this project. These data should NOT be processed inside this folder. Processed data should be saved to the `output` folder. This is to ensure that the raw data will not be altered. | ||
- The `doc` folder should have documentation for this project, presentation files, and other supporting materials. | ||
- The `figs` folder contains figure files produced during the project and when running the codes. | ||
- The `lib` folder contains computation codes for your data analysis. Make sure your README.md is informative about which programs are found in this folder. | ||
- The `output` folder is the holding place for intermediate and final computational results. | ||
|
||
The root README.md should contain your name and an abstract of your findings. | ||
|
||
### Useful resources | ||
|
||
##### R packages | ||
* R [tidyverse](https://www.tidyverse.org/) packages | ||
* R [tidytext](https://cran.r-project.org/web/packages/tidytext/vignettes/tidytext.html) | ||
* [Text Mining with `R`](https://www.tidytextmining.com/) | ||
* R [DT](http://www.htmlwidgets.org/showcase_datatables.html) package | ||
* R [tibble](https://cran.r-project.org/web/packages/tibble/vignettes/tibble.html) | ||
* [Rcharts](https://www.r-graph-gallery.com/interactive-charts.html), quick interactive plots | ||
* [htmlwidgets](http://www.htmlwidgets.org/), javascript library adaptation in R. | ||
|
||
##### Project tools | ||
* A brief [guide](http://rogerdudler.github.io/git-guide/) to git. | ||
* Putting your project on [GitHub](https://guides.github.com/introduction/getting-your-project-on-github/). | ||
|
||
##### Examples | ||
+ [Topic modeling](https://cran.r-project.org/web/packages/topicmodels/vignettes/topicmodels.pdf) | ||
+ [Clustering](http://www.statmethods.net/advstats/cluster.html) | ||
+ [Sentiment analysis of Trump's tweets](https://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/) | ||
|
||
##### Tutorials | ||
|
||
For this project we will give **tutorials** and comments on: | ||
|
||
- GitHub | ||
- R notebook | ||
- Example on sentiment analysis and topic modeling | ||
|
||
|
||
|
5 changes: 5 additions & 0 deletions
8-Fall2019/Projects_StarterCodes/Project1-RNotebook/doc/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,5 @@ | ||
# ADS Project 1: R Notebook on Lyrics Analysis | ||
|
||
### Doc folder | ||
|
||
The doc directory contains the report or presentation files. It can have subfolders. |