This code is a modification of the nice arxiv-sanity project. It provides a web interface to browse and find similar papers from top conferences in Computer Vision, Machine Learning and Artificial Intelligence. This code is currently running live at similarpapers.com.
If you just want to download the source code, then you can clone this repository with:
git clone https://github.com/hmorimitsu/similarpapers.git
However, if you want to download the source code and the metadata of the conference papers, then clone recursively with:
git clone https://github.com/hmorimitsu/similarpapers.git --recurse-submodules
It is mostly similar to the arxiv-sanity code. The explanation below is based on the arxiv-sanity one, with some modifications.
There are two large parts of the code:
Indexing code. Downloads the most recent papers from the available conferences, extracts all text, and creates tfidf vectors based on the content of each paper. This code is therefore concerned with the backend scraping and computation: building up a database of papers, calculating content vectors, creating thumbnails, computing paper similarities, etc.
User interface. Then there is a web server (based on Flask) that allows searching through the database and filtering papers by similarity, etc.
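To illustrate what the content vectors are, here is a minimal sketch of computing bigram tfidf vectors with scikit-learn. The example documents are hypothetical; the actual settings (vocabulary size, tokenization, and so on) are defined in analyze.py and may differ.

```python
# Minimal sketch of the "content vector" idea: bigram tfidf vectors
# computed from the extracted text of each paper. The real settings
# live in analyze.py; the documents below are made-up examples.
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; in the pipeline these come from data/txt.
docs = [
    "deep convolutional networks for image classification",
    "recurrent neural networks for language modeling",
    "graph neural networks for molecule property prediction",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
X = vectorizer.fit_transform(docs)  # sparse matrix: one row per paper
print(X.shape)                      # (number of papers, number of unigram/bigram features)
```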
You will need numpy, feedparser (to process xml files), scikit-learn (for the tfidf vectorizer), flask (for serving the results), and flask_limiter, as well as dateutil and scipy. Most of these are easy to get through pip, e.g.:
$ virtualenv env # optional: use virtualenv
$ source env/bin/activate # optional: use virtualenv
$ pip install -r requirements.txt
You will also need pdftotext, which on Ubuntu you can install with sudo apt-get install poppler-utils.
The processing pipeline requires you to run a series of scripts, and at this stage I really encourage you to manually inspect each script, as they may contain various inline settings you might want to change. In order, the processing pipeline is:
- Run the fetcher for the selected conference. Some fetchers are available in the directory fetchers, and you may create your own for other conferences. All fetchers append their data to a file db.p. You may run fetchers one after another and they will all write to the same db.p file without destroying previous data; however, you cannot run multiple fetchers at the same time. You can also interrupt a fetcher and restart it, and it should skip papers that are already in the database.
- Run download_pdfs.py, which iterates over all papers in the parsed pickle and downloads their pdfs into the folder data/pdf.
- Run parse_pdf_to_text.py to export all text from the pdfs to files in data/txt.
- Run analyze.py to compute tfidf vectors for all documents based on bigrams. This saves the tfidf.p, tfidf_meta.p and sim_dict.p pickle files (a rough sketch of this step is shown after this list).
- Run make_cache.py for various preprocessing so that the server starts faster.
- Run the flask server with serve.py. Visit localhost:5000 and enjoy sane viewing of papers!
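As noted in the analyze.py step above, paper similarities are derived from the tfidf vectors. The sketch below shows one common way of doing this with cosine similarity and a lookup dictionary; the paper ids and the exact contents and format of sim_dict.p are assumptions here and should be checked against analyze.py.

```python
# Sketch of turning tfidf vectors into a "most similar papers" lookup,
# roughly the kind of information a file like sim_dict.p could hold.
# The actual structure of sim_dict.p is an assumption; see analyze.py.
import pickle

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical paper ids -> extracted text; in the pipeline this comes from data/txt.
docs = {
    "paper_a": "deep convolutional networks for image classification",
    "paper_b": "image classification with convolutional features",
    "paper_c": "reinforcement learning for robotic control",
}

ids = list(docs)
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform([docs[i] for i in ids])
sims = cosine_similarity(X)  # (num_papers, num_papers) similarity matrix

# Keep, for each paper, the ids of its most similar papers.
sim_dict = {}
for i, pid in enumerate(ids):
    order = np.argsort(-sims[i])  # indices sorted by decreasing similarity
    sim_dict[pid] = [ids[j] for j in order if j != i][:10]

with open("sim_dict_example.p", "wb") as f:  # deliberately not the pipeline's sim_dict.p
    pickle.dump(sim_dict, f)
```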
If you'd like to run the flask server online (e.g. on AWS), you can probably use tornado and run python serve.py --prod, like arxiv-sanity does. However, I have not tried this myself.
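In the original arxiv-sanity, the --prod flag wraps the Flask app in a tornado HTTP server. Assuming serve.py here does the same and exposes a Flask object named app listening on port 5000 (both assumptions, check serve.py), the pattern looks roughly like this:

```python
# Minimal sketch of serving a Flask app with tornado.
# The module name `serve`, the object name `app` and the port 5000 are
# assumptions; check the top of serve.py for the actual names and settings.
from tornado.httpserver import HTTPServer
from tornado.ioloop import IOLoop
from tornado.wsgi import WSGIContainer

from serve import app  # hypothetical import; adjust to the real module/object

http_server = HTTPServer(WSGIContainer(app))
http_server.listen(5000)
IOLoop.current().start()
```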
The way similarpapers.com is served is by using Dokku with Gunicorn. If you want to serve in this way, I suggest you follow this tutorial and adapt it accordingly. This code should run without any modifications.
You also want to create a secret_key.txt file and fill it with random text (see the top of serve.py).
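If you need a way to generate that random text, a minimal sketch is below. It assumes serve.py simply reads the raw contents of secret_key.txt, which you should verify at the top of that file.

```python
# Write a random hex string into secret_key.txt for Flask's session signing.
# Assumes serve.py reads the file's raw contents; verify this in serve.py.
import secrets

with open("secret_key.txt", "w") as f:
    f.write(secrets.token_hex(32))
```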
The live site is not currently set up for automatic updates. Instead, I run the pipeline on a local machine to update all the databases whenever a new conference comes up, and then upload the processed databases to the website.