Reproducing co-occurrence networks in Rstudio #1081

lansetaowa · 2023-02-27T02:24:09Z

To whom it may concern,

I'm very grateful that you published such user-friendly software for text analysis. It has truly helped me and my team build many beautiful and meaningful analyses and visualizations.

At this moment, as a moderate R user (but a heavy Python user in recent years), I'm looking to reproduce the co-occurrence networks in R, because this will give us more flexibility to tweak anything we'd like. However, after carefully learning how the input data for a co-occurrence network is calculated and how to graph networks with the igraph library, I'm stuck: I get a similar graph, but it is always a bit off.

Since I'm a data scientist, not a programmer, I can't read the Java script posted on this site for the network. I wonder if you can provide more details about the questions I have below, about generating a similar network.

Here are some crucial steps I take:

1. Generate a word count and a word co-occurrence matrix for the keywords that I selected.
2. Calculate the distance (or similarity) based on the co-occurrence matrix, using whichever measure I choose, such as Jaccard or cosine.
3. Generate node and edge information from the above input. In my case, the node information contains "name" and word frequency, and the edge information contains "from", "to" and "weight". Currently I'm using distance as the weight, that is, 1 minus the similarity.
4. Plot a network graph using functions from the igraph package.
5. Tweak details of the graph as I like.
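
As a rough illustration of steps 1 to 3, here is a minimal base R sketch that builds a co-occurrence matrix from a binary document-term matrix and turns it into Jaccard similarities plus node and edge tables. The toy data and the variable names (`dtm`, `present`, `nodes`, `edges`) are made up for illustration, and counting co-occurrence per document is an assumption, not necessarily how KH Coder defines it.

```r
# Toy document-term matrix: rows are documents, columns are selected keywords.
# The counts are invented for illustration only.
dtm <- matrix(
  c(1, 1, 0,
    1, 0, 1,
    1, 1, 0,
    0, 1, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), c("tea", "milk", "sugar"))
)

# Step 1: binarise (a word either appears in a document or it does not),
# then count documents per word and documents shared by each word pair
present  <- (dtm > 0) * 1
doc_freq <- colSums(present)
cooc     <- crossprod(present)            # same as t(present) %*% present

# Step 2: Jaccard similarity = shared documents / documents containing either word
union_n <- outer(doc_freq, doc_freq, "+") - cooc
jaccard <- cooc / union_n
diag(jaccard) <- 0                        # ignore self-pairs

# Step 3: node and edge tables in the "name" / "from", "to", "weight" shape
nodes <- data.frame(name = colnames(present), freq = as.numeric(doc_freq))
idx   <- which(upper.tri(jaccard) & jaccard > 0, arr.ind = TRUE)
edges <- data.frame(
  from   = rownames(jaccard)[idx[, "row"]],
  to     = colnames(jaccard)[idx[, "col"]],
  weight = jaccard[idx]                   # similarity; use 1 - weight for a distance
)
print(edges)
```

The `edges` and `nodes` tables can be passed to `graph_from_data_frame()` as `d` and `vertices` respectively.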

Below are some problems I couldn't figure out and hope you can shed some light on:

1. What is the cluster_ method that you use to generate subgraphs? There are so many cluster_ methods out there; I tried many and always got results where words of the same cluster are scattered everywhere rather than next to each other.
2. What layout style do you use by default? I was guessing "layout_on_sphere", but I kept getting overlapping nodes. Or do you have a good way to avoid overlapping nodes?
3. Did you use a Jaccard/cosine, etc. distance or similarity as the input data for edges? I'm guessing similarity but want to make sure.
4. What settings do you specify to get solid lines for edges within the same subgraph, but dashed lines for edges between different subgraphs?
5. In KH Coder, I found that the color option "Subgraph: random walks" suits my plotting needs most of the time. Can you let me know which parameters I should set to get the same effect?

Below is some simplified code I used to generate my graph:
```r
library(igraph)
library(readxl)

# Note: despite its name, df_nodes is used as the edge list (from, to, weight);
# df_vertice is the node table (name, freq)
df_nodes <- read_excel('.../info_nodes.xlsx')
df_vertice <- read_excel('.../info_vertex.xlsx')

# turning networks into igraph objects
net <- graph_from_data_frame(d = df_nodes, vertices = df_vertice, directed = FALSE)
class(net)

# keep the most significant links only
# (this keeps the LARGEST weights, so weight should be a similarity, not a distance)
cut_edge <- quantile(df_nodes$weight, 0.95)
net.sp <- delete_edges(net, E(net)[weight < cut_edge])

# view the attributes of the remaining network
edge_density(net.sp)
ecount(net.sp)

V(net.sp)$freq
V(net.sp)$name

# trying to make subgraphs
# (clusters are computed on the full network; net.sp keeps the same vertices in the same order)
net_clust <- cluster_fluid_communities(net, no.of.communities = 7)
V(net.sp)$type <- net_clust$membership

# plotting and coloring different subgraphs with different colors
colors <- adjustcolor(c("gray50", "tomato", "gold", "yellowgreen", "blue", "lightblue", "orange"), alpha.f = 0.6)
plot(net.sp,
     edge.arrow.size = 0.2, edge.curved = 0.2, edge.color = 'lightblue',
     vertex.size = log(V(net.sp)$freq + 1) * 2,
     vertex.color = colors[V(net.sp)$type], vertex.label.color = 'black',
     vertex.label = V(net.sp)$name, vertex.label.cex = 0.8,
     layout = layout_on_sphere(net.sp))
```

That is all I have at this time. Hope I can hear from you soon, and thank you a lot in advance for anything you share!

Sincerely and best,
Saining Zhang

ko-ichi-h (Maintainer) · 2023-02-27T04:44:02Z

Hello,

I'm really happy to hear that KH Coder was useful for you!

> What is the cluster_ method that you use to generate subgraphs? There are so many cluster_ methods out there; I tried many and always got results where words of the same cluster are scattered everywhere rather than next to each other.

"fastgreedy.community" (default) or "walktrap.community" (Random walks) function of igraph.

> What layout style do you use by default? I was guessing "layout_on_sphere", but I kept getting overlapping nodes. Or do you have a good way to avoid overlapping nodes?

"layout.fruchterman.reingold" of igraph and "wordlayout" of wordcloud.

> Did you use a Jaccard/cosine, etc. distance or similarity as the input data for edges? I'm guessing similarity but want to make sure.

Jaccard is default. But you can choose Cosine, Dice, etc using KH Coder's interface.

> What settings do you specify to get solid lines for edges within the same subgraph, but dashed lines for edges between different subgraphs?
> In KH Coder, I found that the color option "Subgraph: random walks" suits my plotting needs most of the time. Can you let me know which parameters I should set to get the same effect?

Could you "Save" KH Coder's co-occurrence network as an "R Source" file and look into it?
https://twitter.com/khcoder/status/1626520026768613377/photo/1
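
For readers who want a starting point before opening the saved R source, here is a minimal igraph sketch that combines the pieces mentioned above: community detection, a Fruchterman-Reingold layout, and solid versus dashed edges. It is not KH Coder's own code. It uses the current igraph names `cluster_fast_greedy`, `cluster_walktrap` and `layout_with_fr`, which are the newer spellings of the functions named in this reply. The toy edge list, the colours, the treatment of `weight` as a similarity, and the use of `crossing()` to pick the dashed edges are all assumptions made for illustration. Label repositioning with `wordlayout` is not shown.

```r
library(igraph)

# Toy edge list (made-up similarities): two tight triangles joined by one weak link
edges <- data.frame(
  from   = c("a", "b", "a", "d", "e", "d", "c"),
  to     = c("b", "c", "c", "e", "f", "f", "d"),
  weight = c(0.8, 0.8, 0.8, 0.8, 0.8, 0.8, 0.1)
)
nodes <- data.frame(name = c("a", "b", "c", "d", "e", "f"),
                    freq = c(30, 12, 20, 25, 8, 15))
g <- graph_from_data_frame(edges, directed = FALSE, vertices = nodes)

# Community detection: fast greedy modularity, or random walks (walktrap)
comm_fg <- cluster_fast_greedy(g)
comm_wt <- cluster_walktrap(g)
grp     <- as.vector(membership(comm_fg))   # one community id per vertex

# Force-directed layout; fix the seed so the picture is repeatable
set.seed(42)
coords <- layout_with_fr(g)

# TRUE for edges whose endpoints sit in different communities
between   <- crossing(comm_fg, g)
node_cols <- adjustcolor(c("tomato", "gold", "lightblue"), alpha.f = 0.6)

plot(g,
     layout = coords,
     vertex.size = log(V(g)$freq + 1) * 5,
     vertex.color = node_cols[grp],
     vertex.label.color = "black",
     edge.lty = ifelse(between, 2, 1),      # 1 = solid, 2 = dashed
     edge.color = "gray50")
```

Swapping `comm_fg` for `comm_wt` gives the random-walks variant. According to the igraph documentation, `layout_with_fr` pulls strongly weighted pairs closer together, so passing a distance as `weight` would tend to place dissimilar words next to each other; a similarity is the safer choice for this sketch.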

The code is not clean at all, but it may help. In my experience, I had to turn off the "R Diagnostics" feature of RStudio, which was too heavy for the code KH Coder generates. Also, igraph's output changes when igraph's version changes, so if you want exactly the same results, you should run the R that comes with KH Coder in the deps\R folder.

Best,

lansetaowa (Author) · 2023-02-28T03:20:27Z

Hi Mr. Higuchi Koichi,

Thank you a lot for your timely reply! Your answers helped me a lot.

I found the Perl script "network.pm" and managed to extract R code from it. It's pretty long, so I'll spend some time reading it. I'll try my best to comprehend, but apparently you are much more proficient and knowledgeable in R and statistics than I am, so I'll probably nudge you some time in the future again.

Also, FYI, the projects I'm working on analyze corpora from several Chinese social media platforms, such as Red, Tiktok and Weibo, and provide insights to help some Japanese consumer goods brands thrive in China. You must be happy that your program is contributing to the success of your country's brands overseas!

Best,
Saining

ko-ichi-h (Maintainer) · 2023-02-28T05:19:54Z

Hi,

Wow, did you get R code from Perl source code?! Yes, that will work.

But just for confirmation, and for others using KH Coder, I'd like to note that almost all plots in KH Coder can be saved as R code by pressing the "Save" button.

Any additional questions are welcome.

Best,

lansetaowa (Author) · 2023-02-28T08:19:35Z

Hi Mr. Koichi,

This is really helpful. I never noticed that R source code can be saved.

When I examined the source code saved from a co-occurrence network, I found that it starts with two matrices constructed from numeric vectors. One is "d", which looks like a word count matrix with each verbatim as an observation (I already knew how to make it), and the other is named "doc_length_mtr". Since those numbers are generated from my raw input somehow, I couldn't figure out what the second matrix is. It contains two columns, "length_c" and "length_w".

Can you please let me know how you come up with doc_length_mtr? I don't need the code; an explanation of the calculation behind the scenes is good enough for me.

Thanks a lot!
Saining Zhang

lansetaowa (Author) · 2023-02-28T08:28:54Z

Hi Mr. Koichi,

I might have just figured it out... but can you confirm?

- length_c stands for the length of the corpus, like how many words make up the whole post
- length_w stands for the number of parts after word segmentation is performed

Thank you a lot!

ko-ichi-h (Maintainer) · 2023-02-28T08:33:43Z

Hi,

When you choose Jaccard, I think that matrix is not used at all, so guessing its meaning may be difficult...

It's the length of each document in words (length_w) and in characters (length_c). The length in words is used to standardize term frequency when you select Cosine.
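
To make the role of the word length concrete, here is a minimal base R sketch of one way to standardize term frequency by document length and then compute cosine similarity between words. The exact formula KH Coder uses is not stated in this thread, so dividing each count by the document's length in words is an assumption, and the counts and lengths below are invented.

```r
# Made-up word counts: rows are documents, columns are words
d <- matrix(c(2, 0, 1,
              4, 2, 0,
              0, 1, 1),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("doc", 1:3), c("tea", "milk", "sugar")))

# Hypothetical document lengths in words (the role of the length_w column)
length_w <- c(10, 20, 5)

# Assumed standardization: divide each row of counts by that document's length
tf <- sweep(d, 1, length_w, "/")

# Cosine similarity between word columns
norms  <- sqrt(colSums(tf^2))
cosine <- crossprod(tf) / outer(norms, norms)
print(round(cosine, 3))
```

Without the division, the long second document would dominate the "tea" column simply because it contains more words overall.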

lansetaowa (Author) · 2023-03-01T02:14:05Z

Thank you for the quick reply. Sorry I wasn't so specific. You are right. I was using Cosine distance to generate the chart.

Feeding such a big matrix of raw numbers into RStudio crashed the IDE every time, so I was trying to figure out how to generate those matrices myself, hence the questions above. Thank you again~

ko-ichi-h (Maintainer) · 2023-03-01T03:16:23Z

> Feeding such a big matrix from raw numbers into Rstudio crashed the IDE all the time

1. In the RStudio menu, go to [Tools] [Global Options...].
2. Click [Code] in the left column.
3. Click the [Diagnostics] tab at the top right.
4. Uncheck all checkboxes in the "R Diagnostics" section.
5. Click [OK].

After applying this setting, RStudio stopped crashing in my environment.

Best,
