This is the source code for the LilyPads application.
LilyPads was presented at IVAPP 2020. See How to Cite below or the NOTICE file for how to reference the publication and the code.
To build, `npm`, `sass`, and `pip` are required. To install the prerequisites, run the following commands:
$ npm install
$ python3 -m venv env
$ source env/bin/activate
(env) $ python setup.py install
Input datasets are CSV files with the following columns:

| Column Name | Description |
|---|---|
| Index | An index number unique to each row. |
| Date | The publication date, in `YYYY-mm-dd` format. |
| Title (Newspaper) | Title of the newspaper. |
| Location | Location of publication, as text. |
| Search term | Search term used to find the article (optional). |
| Text | The full text of the article. |
| Language | Language of the document. |
| Corpus | Corpus this was extracted from (optional). |
| Link | URL to the source. |
| `place_id` | ID of the place (see Geolocations). |
| `translated` | Translated full text (optional); provide it if the translation should be used for the word cloud. |
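For example, a single data row might look like this (all values are hypothetical):

```csv
Index,Date,Title (Newspaper),Location,Search term,Text,Language,Corpus,Link,place_id,translated
1,1871-10-09,Chicago Tribune,"Chicago, IL",fire,"The city is in ruins ...",en,,https://example.com/article/1,place-1,
```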
Stop words are words that can be filtered out for text analysis because they carry little meaning by themselves.
Examples are "the", "a", "for".
In the context of OCR (optical character recognition), mis-scanned artifacts can also be considered stop words.
Stop words can be provided to the conversion program by placing a newline-separated text file of such words in `data/stopwords/stopwords.<lang>.txt`, where `<lang>` is the ISO 639-1 code for the language in question (e.g., `stopwords.en.txt`).
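A minimal `data/stopwords/stopwords.en.txt` could look like this:

```
the
a
for
```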
The program needs to look up the geographical location of places of publication. Each article has a `place_id` field that references one such location. The geographical data must be placed as GZIP-ed JSON files in the folder `data/geolocations.d/` (e.g., `data/geolocations.d/geo.json.gz`). Each JSON file in that directory is a dictionary with the following structure (`place-1` is an example `place_id`):
{
  "place-1": {
    "formatted_address": "New Orleans",
    "geometry": {
      "location": {
        "lat": 29.9510658,
        "lng": -90.0715323
      }
    },
    "place_id": "place-1"
  },
  ...
}
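To produce the GZIP-ed file from a plain JSON file, `gzip` can be used (this assumes a `geo.json` already placed in that folder):

```
$ gzip data/geolocations.d/geo.json
```

This replaces `geo.json` with `geo.json.gz` in place.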
Each dataset must have an associated metadata file. That file has the same file name as the dataset file, but with the ending `.meta.json` (e.g., `dataset.csv` -> `dataset.meta.json`). The JSON contains at least a `name` field with the name of the dataset, and a `roles` field with an array of strings, which are the roles that may view the dataset (see User Management). The metadata can contain additional information as required, such as copyright statements, creation dates, etc. All that information will be included in the generated datasets.
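A minimal `dataset.meta.json` might look like this (the role names and the extra `copyright` field are illustrative):

```json
{
  "name": "Example Newspaper Dataset",
  "roles": ["admin", "historians"],
  "copyright": "Example Archive, 2020"
}
```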
The datasets can be built using the `Makefile` in the `data/` directory. The `Makefile` looks for CSV files in the `data/` directory and creates a dataset file from each of them. The `Makefile` in the root directory will also call the `Makefile` in the `data/` directory.
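For example, to build only the datasets (assuming GNU make):

```
$ make -C data/
```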
The server expects an `SQLite3` database in the working directory with the following structure:
CREATE TABLE users (
  id TEXT PRIMARY KEY,
  password TEXT,
  expires DATE,
  roles TEXT DEFAULT '',
  comment TEXT DEFAULT NULL
);
The `id` field is the user's login name; the `password` is a hashed password entry as generated by `htpasswd(1)`. The `expires` field can be used to specify when an account expires; the user cannot log in after that date. The `roles` field is a comma-separated list of roles the user is part of. A user is only allowed to see and load datasets which share at least one of these roles.
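For example, a database with one user could be set up as follows (the database file name `users.db` and the role name are assumptions specific to this sketch; the password hash comes from `htpasswd`):

```
$ sqlite3 users.db "CREATE TABLE users (id TEXT PRIMARY KEY, password TEXT, expires DATE, roles TEXT DEFAULT '', comment TEXT DEFAULT NULL);"
$ htpasswd -nB alice          # prints a line of the form alice:<hash>
$ sqlite3 users.db "INSERT INTO users (id, password, expires, roles) VALUES ('alice', '<hash>', '2030-01-01', 'historians');"
```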
The entire project (JavaScript, assets, CSS, datasets) can be built using the provided `Makefile`:
$ make
$ # or, for a production build
$ make prod
The built project can then be bundled into a Python wheel file for deployment as follows:
$ source env/bin/activate
(env) $ python setup.py bdist_wheel
This will generate a `.whl` file in `dist/`, which can be copied to the appropriate location.
To install it there, create a virtual environment and install the wheel using `pip`:
$ python3 -m venv env
$ source env/bin/activate
(env) $ pip install --upgrade path/to/wheelfile.whl
The server can be started using `gunicorn`:
$ source env/bin/activate
(env) $ gunicorn -b localhost:8000 lilypads:app
See also the `host.sh` file for an example. This would be the appropriate place to pass SSL certificates to the `gunicorn` process.
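For instance, `gunicorn` can serve TLS directly via its `--certfile` and `--keyfile` options (the certificate paths below are placeholders):

```
(env) $ gunicorn -b 0.0.0.0:443 --certfile=/path/to/cert.pem --keyfile=/path/to/key.pem lilypads:app
```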
Another possibility is to host the server behind an `nginx` or Apache `httpd` web server.
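As a sketch, an `nginx` reverse-proxy configuration for this setup could look like the following (the server name is a placeholder, and gunicorn is assumed to listen on `localhost:8000` as above):

```nginx
server {
    listen 80;
    server_name lilypads.example.com;

    location / {
        # Forward all requests to the gunicorn process
        proxy_pass http://localhost:8000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```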
How to Cite

Max Franke, Markus John, Moritz Knabben, Jana Keck, Tanja Blascheck, and Steffen Koch. LilyPads: Exploring the Spatiotemporal Dissemination of Historical Newspaper Articles. In Proceedings of the 15th International Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP, pp. 17–28. SciTePress, 2020. DOI: 10.5220/0008871400170028.
BibTeX:
@inproceedings{franke2020lilypads,
  author       = {Franke, Max and John, Markus and Knabben, Moritz and Keck, Jana and Blascheck, Tanja and Koch, Steffen},
  title        = {LilyPads: Exploring the Spatiotemporal Dissemination of Historical Newspaper Articles},
  booktitle    = {Proceedings of the 15th International Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 3: IVAPP},
  publisher    = {SciTePress},
  year         = {2020},
  month        = {2},
  pages        = {17--28},
  doi          = {10.5220/0008871400170028},
  isbn         = {978-989-758-402-2},
  organization = {INSTICC}
}