forked from allandproust/ADS_Teaching
-
Notifications
You must be signed in to change notification settings - Fork 0
/
InteractiveWordCloud.Rmd
162 lines (132 loc) · 5.46 KB
/
InteractiveWordCloud.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
---
title: 'Tutorial (week 2): Interactive Word Cloud'
runtime: shiny
output:
html_document: default
html_notebook: default
---
This is an *Interactive* [R Markdown](http://rmarkdown.rstudio.com) Notebook. It generates an HTML notebook that would allow users to interactively explore your analysis results.
We will use the presidential inaugural speech word clouds as examples.
# Step 0 - Install and load libraries
```{r, message=FALSE, warning=FALSE}
packages.used=c("tm", "wordcloud", "RColorBrewer",
"dplyr", "tidytext")
# check packages that need to be installed.
packages.needed=setdiff(packages.used,
intersect(installed.packages()[,1],
packages.used))
# install additional packages
if(length(packages.needed)>0){
install.packages(packages.needed, dependencies = TRUE,
repos='http://cran.us.r-project.org')
}
library(tm)
library(wordcloud)
library(RColorBrewer)
library(dplyr)
library(tidytext)
```
This notebook was prepared with the following environmental settings.
```{r}
print(R.version)
```
# Step 1 - Read in the speeches
```{r}
folder.path="../data/inaugurals/"
speeches=list.files(path = folder.path, pattern = "*.txt")
prex.out=substr(speeches, 6, nchar(speeches)-4)
ff.all<-Corpus(DirSource(folder.path))
```
# Step 2 - Text processing
See [Basic Text Mining in R](https://rstudio-pubs-static.s3.amazonaws.com/132792_864e3813b0ec47cb95c7e1e2e2ad83e7.html) for a more comprehensive discussion.
For the speeches, we remove extra white space, convert all letters to the lower case, remove [stop words](https://github.com/arc12/Text-Mining-Weak-Signals/wiki/Standard-set-of-english-stopwords), remove empty words due to formatting errors, and remove punctuation. Then we compute the [Document-Term Matrix (DTM)](https://en.wikipedia.org/wiki/Document-term_matrix).
```{r}
ff.all<-tm_map(ff.all, stripWhitespace)
ff.all<-tm_map(ff.all, content_transformer(tolower))
ff.all<-tm_map(ff.all, removeWords, stopwords("english"))
ff.all<-tm_map(ff.all, removeWords, character(0))
ff.all<-tm_map(ff.all, removePunctuation)
tdm.all<-TermDocumentMatrix(ff.all)
tdm.tidy=tidy(tdm.all)
tdm.overall=summarise(group_by(tdm.tidy, term), sum(count))
```
# Step 3 - Inspect an overall wordcloud
```{r, fig.height=6, fig.width=6}
wordcloud(tdm.overall$term, tdm.overall$`sum(count)`,
scale=c(5,0.5),
max.words=100,
min.freq=1,
random.order=FALSE,
rot.per=0.3,
use.r.layout=T,
random.color=FALSE,
colors=brewer.pal(9,"Blues"))
```
# Step 4 - compute TF-IDF weighted document-term matrices for individual speeches.
As we would like to identify interesting words for each inaugural speech, we use [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) to weigh each term within each speech. It highlights terms that are more specific for a particular speech.
```{r}
dtm <- DocumentTermMatrix(ff.all,
control = list(weighting = function(x)
weightTfIdf(x,
normalize =FALSE),
stopwords = TRUE))
ff.dtm=tidy(dtm)
```
# Step 5- Interactive visualize important words in individual speeches
```{r, warning=FALSE}
library(shiny)
shinyApp(
ui = fluidPage(
fluidRow(style = "padding-bottom: 20px;",
column(4, selectInput('speech1', 'Speech 1',
speeches,
selected=speeches[5])),
column(4, selectInput('speech2', 'Speech 2', speeches,
selected=speeches[9])),
column(4, sliderInput('nwords', 'Number of words', 3,
min = 20, max = 200, value=100, step = 20))
),
fluidRow(
plotOutput('wordclouds', height = "400px")
)
),
server = function(input, output, session) {
# Combine the selected variables into a new data frame
selectedData <- reactive({
list(dtm.term1=ff.dtm$term[ff.dtm$document==as.character(input$speech1)],
dtm.count1=ff.dtm$count[ff.dtm$document==as.character(input$speech1)],
dtm.term2=ff.dtm$term[ff.dtm$document==as.character(input$speech2)],
dtm.count2=ff.dtm$count[ff.dtm$document==as.character(input$speech2)])
})
output$wordclouds <- renderPlot(height = 400, {
par(mfrow=c(1,2), mar = c(0, 0, 3, 0))
wordcloud(selectedData()$dtm.term1,
selectedData()$dtm.count1,
scale=c(4,0.5),
max.words=input$nwords,
min.freq=1,
random.order=FALSE,
rot.per=0,
use.r.layout=FALSE,
random.color=FALSE,
colors=brewer.pal(10,"Blues"),
main=input$speech1)
wordcloud(selectedData()$dtm.term2,
selectedData()$dtm.count2,
scale=c(4,0.5),
max.words=input$nwords,
min.freq=1,
random.order=FALSE,
rot.per=0,
use.r.layout=FALSE,
random.color=FALSE,
colors=brewer.pal(10,"Blues"),
main=input$speech2)
})
},
options = list(height = 600)
)
```
# Further readings
+ [Text mining with `tidytext`](http://tidytextmining.com/).
+ [Basic Text Mining in R](https://rstudio-pubs-static.s3.amazonaws.com/132792_864e3813b0ec47cb95c7e1e2e2ad83e7.html)