LilyPads

This is the source code for the LilyPads application. LilyPads was presented at IVAPP 2020; see How to Cite or the NOTICE file for how to reference the publication and the code.

Installation

Building requires npm, sass, and pip. To install the prerequisites, run the following commands:

$ npm install
$ python3 -m venv env
$ source env/bin/activate
(env) $ python setup.py install

Dataset Creation

Input Dataset Format

Input datasets are CSV files with the following columns:

| Column Name | Description |
| --- | --- |
| Index | An index number unique to each row. |
| Date | The publication date, in YYYY-mm-dd format. |
| Title (Newspaper) | Title of the newspaper. |
| Location | Location of publication, as text. |
| Search term | Search term used to find the article (optional). |
| Text | The full text of the article. |
| Language | Language of the document. |
| Corpus | Corpus this was extracted from (optional). |
| Link | URL to the source. |
| place_id | ID of place (see Geolocations). |
| translated | Translated full text (optional); used for the word cloud if present. |
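For illustration, a file in this format could be generated with Python's csv module. This is only a sketch; all field values below are made-up examples:

```python
import csv

# Column order as described above; "Search term" and "Corpus" are optional,
# but the columns should still be present.
COLUMNS = ["Index", "Date", "Title (Newspaper)", "Location", "Search term",
           "Text", "Language", "Corpus", "Link", "place_id", "translated"]

rows = [{
    "Index": 0,
    "Date": "1865-04-15",                      # YYYY-mm-dd
    "Title (Newspaper)": "The Daily Picayune",
    "Location": "New Orleans",
    "Search term": "telegraph",
    "Text": "Full article text goes here.",
    "Language": "en",                          # see also Stop Words below
    "Corpus": "example-corpus",
    "Link": "https://example.com/article/0",
    "place_id": "place-1",                     # must exist in the geolocations data
    "translated": "",                          # empty if no translation is used
}]

# Place the resulting file in the data/ directory for dataset creation.
with open("example.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```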

Stop Words

Stop words are words that can be filtered out for text analysis because they carry little meaning by themselves. Examples are "the", "a", "for". In the context of OCR (optical character recognition), mis-scanned artifacts can also be considered stop words. Stop words can be provided to the conversion program by placing a newline-separated text file of such words in data/stopwords/stopwords.<lang>.txt, where <lang> is the ISO 639-1 code for the language in question (e.g., stopwords.en.txt).
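As a hypothetical example of the file format and its effect, the following sketch writes a small English stop word list and filters tokens with it (the words and tokens are invented; the real file would live at data/stopwords/stopwords.en.txt):

```python
# A tiny stop word list; "tlie" stands in for an OCR artifact of "the".
stopwords = {"the", "a", "for", "tlie"}

# One word per line, as the conversion program expects.
with open("stopwords.en.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(stopwords)))

# Load the list back and filter a token stream with it.
with open("stopwords.en.txt", encoding="utf-8") as f:
    loaded = {line.strip() for line in f if line.strip()}

tokens = "the telegraph line for tlie city".split()
filtered = [t for t in tokens if t not in loaded]
# filtered == ["telegraph", "line", "city"]
```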

Geolocations

The program needs to look up the geographical location of places of publication. Each article has a place_id field that references one such location. The geographical data must be placed as GZIP-ed JSON files in the folder data/geolocations.d/ (e.g., data/geolocations.d/geo.json.gz). Each JSON in that directory is a dictionary with the following structure (place-1 is an example for a place_id):

{
  "place-1": {
    "formatted_address": "New Orleans",
    "geometry": {
      "location": {
        "lat": 29.9510658,
        "lng": -90.0715323
      }
    },
    "place_id": "place-1"
  },
  ...
}
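A file with this structure could be produced as follows; this is a sketch using the example entry from above, writing the GZIP-ed JSON that belongs in data/geolocations.d/:

```python
import gzip
import json

geolocations = {
    "place-1": {
        "formatted_address": "New Orleans",
        "geometry": {"location": {"lat": 29.9510658, "lng": -90.0715323}},
        "place_id": "place-1",
    },
}

# Write as GZIP-ed JSON; place the file in data/geolocations.d/.
with gzip.open("geo.json.gz", "wt", encoding="utf-8") as f:
    json.dump(geolocations, f, indent=2)
```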

Metadata

Each dataset must have an associated metadata file. That file has the same file name as the dataset file, but with the ending .meta.json (e.g., dataset.csv -> dataset.meta.json). The JSON contains at least a name field with the name of the dataset, and a roles field with an array of strings, which are the roles that may view the dataset (see User Management). The metadata can contain additional information as required, such as copyright statements or creation dates. All of that information is included in the generated datasets.
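A minimal metadata file could be written like this; the field values beyond name and roles are made-up examples of the optional extra information:

```python
import json

metadata = {
    "name": "Example dataset",            # required
    "roles": ["historians", "admin"],     # required: roles allowed to view it
    "copyright": "Example copyright statement",   # optional extra info
    "created": "2020-02-27",                      # optional extra info
}

# For a dataset stored as dataset.csv, the metadata file is dataset.meta.json.
with open("dataset.meta.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, indent=2)
```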

Creating the Datasets

The datasets can be built using the Makefile in the data/ directory. The Makefile looks for CSV files in the data/ directory and creates a dataset file from each of them. The Makefile in the root directory will also call the Makefile in the data/ directory.

User Management

The server expects an SQLite3 database in the working directory with the following structure:

CREATE TABLE users (
  id          TEXT PRIMARY KEY,
  password    TEXT,
  expires     DATE,
  roles       TEXT DEFAULT '',
  comment     TEXT DEFAULT NULL
);

The id field is the user's login name, and the password field is a hashed password entry as generated by htpasswd(1). The expires field specifies when an account expires; the user cannot log in after that date. The roles field is a comma-separated list of roles the user is part of. A user is only allowed to see and load datasets that share at least one of these roles.
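Such a database could be set up with Python's sqlite3 module, for example. Note that the password value below is a placeholder, not a real hash; the actual value must come from htpasswd(1):

```python
import sqlite3

conn = sqlite3.connect("users.db")
conn.execute("""CREATE TABLE IF NOT EXISTS users (
  id          TEXT PRIMARY KEY,
  password    TEXT,
  expires     DATE,
  roles       TEXT DEFAULT '',
  comment     TEXT DEFAULT NULL
)""")

# The password must be a hash as produced by htpasswd(1), e.g.:
#   htpasswd -nB alice
# "$placeholder$hash" below is NOT a valid hash.
conn.execute(
    "INSERT INTO users (id, password, expires, roles, comment) VALUES (?, ?, ?, ?, ?)",
    ("alice", "$placeholder$hash", "2030-12-31", "historians,admin", "demo account"),
)
conn.commit()

# A user may load a dataset if their roles intersect the dataset's "roles" list.
roles = conn.execute("SELECT roles FROM users WHERE id = ?", ("alice",)).fetchone()[0]
user_roles = set(roles.split(","))
conn.close()
```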

Build and Deployment

The entire project (JavaScript, assets, CSS, datasets) can be built using the provided Makefile:

$ make
$ # or, for a production build
$ make prod

The built project can then be bundled into a Python wheel file for deployment as follows:

$ source env/bin/activate
(env) $ python setup.py bdist_wheel

This will generate a .whl file in dist/, which can be copied to the appropriate location. To install it there, create a virtual environment and install it using pip:

$ python3 -m venv env
$ source env/bin/activate
(env) $ pip install --upgrade path/to/wheelfile.whl

The server can be started using gunicorn:

$ source env/bin/activate
(env) $ gunicorn -b localhost:8000 lilypads:app

See also the host.sh file for an example. This would be the appropriate place to pass SSL certificates to the gunicorn process. Alternatively, the server can be hosted behind an nginx or Apache httpd reverse proxy.

How to Cite

Max Franke, Markus John, Moritz Knabben, Jana Keck, Tanja Blascheck, and Steffen Koch. LilyPads: Exploring the Spatiotemporal Dissemination of Historical Newspaper Articles. In Proceedings of the 15th International Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP, pp. 17--28. SciTePress, 2020. DOI:10.5220/0008871400170028.

BibTeX:

@inproceedings{franke2020lilypads,
 author = {Franke, Max and John, Markus and Knabben, Moritz and Keck, Jana and Blascheck, Tanja and Koch, Steffen},
 title = {LilyPads: Exploring the Spatiotemporal Dissemination of Historical Newspaper Articles},
 booktitle = {Proceedings of the 15th International Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP},
 publisher = {SciTePress},
 year = {2020},
 month = {2},
 pages = {17--28},
 doi = {10.5220/0008871400170028},
 isbn = {978-989-758-402-2},
 organization = {INSTICC}
}