forked from CWTSLeiden/CSSS
2 changed files
with
327 additions
and
1 deletion.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,326 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Preparing files for VOSviewer overlays" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"In this notebook we will load some files from Web of Science, parse them, and use them to prepare advanced overlays map in VOSviewer. Many of the operations you have already seen earlier during the summer school." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"As usual we will start by importing the relevant packages. We will need the `pandas` pacakge, and we will call it `pd` again, and additionally we need the `csv` package for some options, and finally, we also need the `glob` package to easily find the relevant files." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pandas as pd\n", | ||
"import csv\n", | ||
"import glob" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will start by reading in all files. We already did this in an earlier notebook, here below we repeat this." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"files = sorted(glob.glob('data-files/wos/tab-delimited/*.txt'))\n", | ||
"publications_df = pd.concat(pd.read_csv(f, sep='\\t', quoting=csv.QUOTE_NONE, \n", | ||
" usecols=range(68), index_col='UT') for f in files)\n", | ||
"publications_df = publications_df.sort_index()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will now prepare files manually for VOSviewer. We will have to prepare two files: \n", | ||
" 1. a so-called corpus file that contains all text for each document.\n", | ||
" 2. a so-called scores file that contains \"scores\" for each document." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Corpus file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will now first prepare the corpus file. We will concatenate the title and abstract together for this purpose. VOSviewer will simply consider each line in the corpus file a document, and will simply consider all text when creating a term map. In other words, you can apply this to any type of file." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"publications_df['text'] = publications_df['TI'] + '. ' + publications_df['AB']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We have added the additional full stop (`.`) to make sure that VOSviewer is able to parse the sentences correctly." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Since VOSviewer expects a document at each line, we need to make sure that the titles and abstract are all on a single line. In more technical terms: they cannot contain any newlines, which are represented by a combination of special characters, and this depends on the platform you are using. We will simply remove all possible newline characters as follows:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"publications_df['text'] = publications_df['text'].str.replace('\\n', '').replace('\\r', '');" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we write the text for each document to a corpus file." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"publications_df['text'].to_csv('corpus.txt', index=False, header=False)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Scores file" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we have to determine what type of scores we want to project as overlays in VOSviewer. We will show how to do this using journals, you can repeat the exercise on countries." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Scores in VOSviewer work as follows. For each score it will calculate the average of the scores in documents that match a specific term. It will then color the terms in the term map according to the average of these scores. This can then highlight certain parts of the map showing where this score is particularly high or low. The objective now is to show this for journals, highlighting what part of the map is particularly relevant to a certain journal." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"We will do this for each journal separately. At the moment, the journal is contained in the field `SO`." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"publications_df['SO']" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You may remember that you can get group the dataframe by the journal to get an overview per journal." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"publications_df.groupby('SO').size().sort_values(ascending=False)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Now we would like to translate the `SO` column in such a way that VOSviewer can show a separate overlay for each journal. For those of you are familiar with statistics, we will do this using so-called \"dummy\" variables. That is, for each journal, we will create a new column, and indicate whether the publication is from that journal (Yes, `1`) or not (No, `0`). If VOSviewer then takes the average, this comes down to showing the percentage of publications with a certain term that are publishing in that journal. Fortunately, this is implemented in `pandas`, so we can easily do that." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"journal_scores_df = publications_df['SO'].str.get_dummies()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"If we now look at scores_df, you will see many column names that represent the journal, and only `0` or `1` in each entry." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"journal_scores_df.head()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"VOSviewer wants a specific column name for scores. In particular, it should be called `Score<...>`. We therefore change the column names" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"journal_scores_df.columns = ['Score<{}>'.format(c) for c in journal_scores_df.columns]" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"Finally, we then write the dataframe to a scores files, which should be tab-delimited." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"journal_scores_df.to_csv('scores.txt', sep='\\t', index=None)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## VOSviewer" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"You can now create a term map in VOSviewer using the two files you produced `corpus.txt` and `scores.txt`. To create a term map based on these files, choose \"Create a map based on text data\" in VOSviewer, and then select \"Read data from VOSviewer files.\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"# Exercise Document type" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<div class=\"alert alert-info\">\n", | ||
" Now repeat the same exercise but using the document type <code>DT</code>.\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<div class=\"alert alert-info\">\n", | ||
" Create the term map in VOSviewer with the document type score file. Does the category of \"Meeting Abstract\" show a particular pattern? Why (not)? Can you explain your observation?\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"<div class=\"alert alert-info\">\n", | ||
" You probably now have two different dataframes. You then cannot see the document type overlay at the same time as the journal overlay. Could you try to combine the two dataframes? (Hint: check out the <code>concat</code> function we encountered earlier.)\n", | ||
"</div>" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.8.3" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
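
The averaging described in the "Scores file" section of the notebook can be illustrated without VOSviewer. The sketch below is a simplified, plain-Python illustration and not VOSviewer's implementation: the three documents and the journal names are invented, the helper names `dummy_scores` and `average_score_per_term` are hypothetical, and a term is assumed to match a document when it occurs as a lowercase substring, whereas VOSviewer identifies terms in its own way.

```python
# Toy documents as (text, journal) pairs. Invented for illustration only.
documents = [
    ("Citation analysis of journals. We study citation counts.", "JOURNAL A"),
    ("Peer review in funding. Citation data are not used.", "JOURNAL B"),
    ("Peer review and open access.", "JOURNAL B"),
]


def dummy_scores(labels):
    """Return one 0/1 list per distinct label, keyed as 'Score<label>'."""
    names = sorted(set(labels))
    return {
        "Score<{}>".format(name): [1 if label == name else 0 for label in labels]
        for name in names
    }


def average_score_per_term(texts, scores, terms):
    """Average a score over the documents whose text contains each term.

    Assumption: a term matches when it occurs as a lowercase substring.
    Terms that match no document are left out of the result.
    """
    result = {}
    for term in terms:
        matching = [s for text, s in zip(texts, scores) if term in text.lower()]
        if matching:
            result[term] = sum(matching) / len(matching)
    return result


texts = [text for text, _ in documents]
journals = [journal for _, journal in documents]

scores = dummy_scores(journals)
averages = average_score_per_term(
    texts, scores["Score<JOURNAL B>"], ["citation", "peer review"]
)
print(averages)
```

Here `citation` occurs in one publication from each journal, so its average for `Score<JOURNAL B>` is 0.5, while `peer review` occurs only in publications from `JOURNAL B`, giving 1.0. This is why the average of a 0/1 column can be read as the share of matching publications that belong to that journal.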