This repository contains scripts for creating three separate paper recommendation systems for the journal Digital Humanities Quarterly (DHQ):
- Keyword-based Recommendations: Recommendations are based on the DHQ Classification Scheme, an editor-assigned controlled vocabulary comprising 88 terms, such as `#gender` and `#machine_learning`.
- BM25 Recommendations: Recommendations are based on the full text (i.e., a concatenation of title, abstract, and body text without references) using the BM25 algorithm.
- SPECTER2-based Recommendations: Recommendations are generated using the hidden states from SPECTER2, based on the paper's title and abstract.
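To make the BM25 idea concrete, here is a minimal sketch of Okapi BM25 scoring in plain Python. It is not the code used by this repository: the function name `bm25_scores`, the parameter defaults `k1=1.5` and `b=0.75`, the whitespace-style token lists, and the toy corpus are all assumptions chosen for illustration. Each distinct query term is counted once, which is a simplification.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, corpus_tokens, k1=1.5, b=0.75):
    """Score every tokenized document in corpus_tokens against query_tokens (Okapi BM25)."""
    n_docs = len(corpus_tokens)
    avg_len = sum(len(doc) for doc in corpus_tokens) / n_docs

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for doc in corpus_tokens:
        df.update(set(doc))

    scores = []
    for doc in corpus_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):  # assumption: each distinct query term counts once
            if term not in tf:
                continue
            # Smoothed inverse document frequency (always positive).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical toy corpus: each inner list stands in for one paper's tokenized full text.
toy_corpus = [
    ["gender", "text", "analysis"],
    ["machine", "learning", "text"],
    ["archive", "maps"],
]
print(bm25_scores(["machine", "learning"], toy_corpus))
```

In a recommendation setting, one paper's full text would serve as the query and every other paper would be scored against it; the highest-scoring papers become the recommendations.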
To get the most up-to-date recommendations after changes have been introduced to the dhq-journal repository, click Run workflow at GitHub Actions.
The workflow also updates the TSV files automatically every Sunday at midnight if the dhq-journal repository has changed.
The ten most similar article IDs for each of the systems are documented in
Manual or automatic recommendation works as follows:
- Initialize the official DHQ repository as a submodule.
- Extract relevant elements from DHQ papers in TEI format, with the keyword-based recommendation system primarily focusing on `dhq_keywords`, and the full text-based recommendation system extracting the title, abstract, and body text as well. Papers in the editorial process are not considered.
- Construct a similarity matrix for generating recommendations.
- Retrieve the most similar papers from the similarity matrix, utilizing a random seed to handle ties.
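The last two steps can be sketched as follows for the keyword-based case. This is an illustrative sketch under stated assumptions, not the repository's implementation: Jaccard overlap is assumed as the similarity measure, the function names (`jaccard`, `similarity_matrix`, `top_k_similar`), the default seed, and the paper IDs are hypothetical. Ties are handled by shuffling candidates with a seeded generator and then applying a stable sort, so equally similar papers appear in a random but reproducible order.

```python
import random


def jaccard(a, b):
    """Jaccard overlap between two keyword sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(keyword_sets):
    """Return sorted paper IDs and a square matrix of pairwise similarities."""
    ids = sorted(keyword_sets)
    matrix = [[jaccard(keyword_sets[i], keyword_sets[j]) for j in ids] for i in ids]
    return ids, matrix


def top_k_similar(ids, matrix, paper_id, k=10, seed=42):
    """Return the k most similar paper IDs, breaking ties reproducibly."""
    row = matrix[ids.index(paper_id)]
    candidates = [(other, score) for other, score in zip(ids, row) if other != paper_id]
    # Shuffle with a fixed seed, then rely on the stable sort below:
    # papers with equal scores keep their shuffled relative order.
    random.Random(seed).shuffle(candidates)
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in candidates[:k]]


# Hypothetical paper IDs and keywords, for illustration only.
keywords = {
    "000001": {"gender", "machine_learning"},
    "000002": {"gender", "machine_learning"},
    "000003": {"gender"},
    "000004": {"maps"},
}
ids, matrix = similarity_matrix(keywords)
print(top_k_similar(ids, matrix, "000001", k=2))
```

Fixing the seed means that repeated runs over unchanged data yield the same recommendations even when several papers are equally similar.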
To reproduce the recommendations on your own machine (not recommended for production), use the following commands:

# clone the repository and navigate into the directory
git clone https://github.com/Wang-Haining/DHQ-similar-papers.git
cd DHQ-similar-papers
# initialize and update submodules (dhq-journal)
git submodule update --init --remote
# set up a virtual environment
python3.10 -m venv .venv
source .venv/bin/activate
# install dependencies
python -m pip install -r requirements.txt
# execute the keyword-based recommendation
python -m run_kwd_recs
# execute the BM25 recommendation
python -m run_bm25_recs
# execute the SPECTER2 recommendation
python -m run_spctr_recs
MIT
The Digital Humanities Quarterly Data Analytics Team
Please open an issue for any suggestions. Thank you!
- v0.0.1:
  - Implemented the keyword-based recommendation system.
- v0.0.2:
  - Implemented the full text-based recommendation system.
  - Included logic for removing papers in the editorial process.
  - Refactored the keyword-based recommendation system.
  - Updated data files for both systems.
- v0.0.3:
  - Merged Ben's SPECTER method.
  - Added CI pipeline with Actions.
  - Improved module/var naming.
  - Updated data files.
- v0.0.4:
  - Ignored remembrance pieces in recommendations.
  - Added unit tests.
  - Dumped `annoy` for spctr method.
  - Added a rule to Actions to run tests before commit.
- v0.0.5:
  - Streamlined utilities.
  - Added a pipeline for recs recalculation regardless of submodule updates.