A simple REST API for witnessing the web using the Scoop web archiving capture engine.
This first iteration is built around:
Note: this application has only been tested on UNIX-like systems.
The following shortcut will:
- Install all Python dependencies using
poetry
- Install Scoop and related dependencies using
npm
bash install.sh
Depending on your system configuration, the following step might be needed to install Playwright's system-level dependencies:
sudo npx playwright install-deps chromium
Copy scoop_witness_api/config.example.py
as scoop_witness_api/config.py
and adjust as needed.
If you would like to use an alternative way of providing configuration, see:
- https://flask.palletsprojects.com/en/2.3.x/config/#configuring-from-data-files
- https://flask.palletsprojects.com/en/2.3.x/config/#configuring-from-environment-variables
... and update
scoop_witness_api/__init__.py
accordingly.
The following command creates and initializes the database tables for the application to use.
poetry run flask create-tables
The API is being authentication. Access keys can be created with the create-access-key
command.
poetry run flask create-access-key --label="My key"
The following command starts the development server on port 5000.
poetry run flask run
The following command starts the capture process.
poetry run flask start-capture-process
# Capture process runs until interrupted
Alternatively, this command can be used to run parallel capture processes:
poetry run flask start-parallel-capture-processes
More details in the CLI and API sections of this document.
The application's settings are defined globally using Flask's configuration system.
All available options listed detailed in config.example.py, which needs to be copied and edited as config.py
.
config.example.py can be edited in place and used as-is, or replaced with another file / method of storing configuration data that Flask supports.
Be sure to edit __init__.py
accordingly if you choose to use a different method of providing settings to Flask.
With few exceptions -- all related to input/output --, all of the CLI options available for Scoop can be configured and tweaked in config.example.py.
This application was built using Flask for both its REST API and CLI components.
Custom commands were created as a way to operate the application and administer it.
Listing available commands
poetry run flask --help`
# Sub-commands also have a help menu:
poetry run flask create-access-key --help
create-tables
poetry run flask create-tables
Creates a new SQLite database if needed and populates it with tables.
create-access-key
poetry run flask create-access-key --label "John Doe"
Creates a new API access key. Said access key will only be displayed once, as a result of this command.
cancel-access-key
poetry run flask cancel-access-key --id_access_key 1
Makes a given access key inoperable.
status
poetry run flask status
Lists access key ids, as well as pending and started captures.
start-capture-process
poetry run flask start-capture-process
Starts a capture process. Runs until it is manually interrupted with SIGINT (Ctrl + C).
This process:
- Picks a pending capture request from the database, if any
- Marks it as started
- Uses Scoop to complete the capture
- Store results
- Starts over / waits for a new request to come in
The --proxy-port
option allows to specify on which port the proxy should run on:
poetry run flask start-capture-process --proxy-port 9905
start-parallel-capture-processes
poetry run flask start-parallel-capture-processes
Starts parallel capture processes, the number of which is determined at application configuration level.
cleanup
poetry run flask cleanup
Removes "expired" files from storage.
Shelf-life is determined by TEMPORARY_STORAGE_EXPIRATION
at application configuration level.
This command should ideally be run on a scheduler.
inspect-capture
poetry run flask inspect-capture --id_capture "8130d6fe-4adb-4142-a685-00a64bb6ff29"
Returns full details about a given capture as JSON. Can be used by administrators to inspect logs.
Note:
Unless specified otherwise, every capture-related object returned by the API is generated using capture_to_dict()
.
[GET] /
Simple "ping" route to ensure the API is running. Returns HTTP 200 and an empty body.
[POST] /capture
Creates a capture request.
Authentication: Requires a valid access key, passed via the Access-Key
header.
Accepts JSON body with the following properties:
url
: URL to capture (required)callback_url
: URL to be called once capture is complete (optional). This URL will receive a JSON object describing the capture request and its current status.
Returns HTTP 200 and capture info.
The capture request will be rejected if the capture server is over capacity, as defined by the MAX_PENDING_CAPTURES
setting in config.py
.
Sample request:
{
"url": "https://lil.law.harvard.edu",
}
Sample response:
{
"callback_url": null,
"created_timestamp": "Wed, 28 Jun 2023 16:30:28 GMT",
"ended_timestamp": null,
"follow": "https://scoop-witness-api.host/capture/5234bb37-58a8-4071-a65c-0f7815da5202",
"id_capture": "5234bb37-58a8-4071-a65c-0f7815da5202",
"started_timestamp": null,
"status": "pending",
"url": "https://lil.law.harvard.edu"
}
The follow
property is a direct link to [GET] /capture/<id_capture>
, described below.
[GET] /capture/<id_capture>
Returns information about a specific capture.
Authentication: Requires a valid access key, passed via the Access-Key
header. Access is limited to captures initiated using said access key.
Sample response:
{
"artifacts": [
"https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/archive.wacz",
"https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/provenance-summary.html",
"https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/screenshot.png",
"https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/lil.law.harvard.edu.pem",
"https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/analytics.lil.tools.pem"
],
"callback_url": null,
"created_timestamp": "Wed, 28 Jun 2023 16:30:28 GMT",
"ended_timestamp": "Wed, 28 Jun 2023 16:30:45 GMT",
"id_capture": "2eb7145f-dd8e-4354-bf06-6afc6015c446",
"started_timestamp": "Wed, 28 Jun 2023 16:30:30 GMT",
"status": "success",
"temporary_playback_url": "https://replayweb.page/?source=https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/archive.wacz",
"url": "https://lil.law.harvard.edu"
}
The entries under artifacts
are direct links to [GET] /artifact/<id_capture>/<filename>
.
temporary_playback_url
allows for checking the resulting WACZ against replayweb.page.
[GET] /artifact/<id_capture>/<filename>
Allows for accessing and downloading artifacts generated as part of the capture process.
This route is not access-controlled.
Files are only stored temporarily (see cleanup
CLI command).
Flask applications can be deployed in many different ways, therefore this section will focus mostly on what is specific about this project:
- The Flask application itself should be run using a production-ready WSGI server such as gunicorn, and ideally put behind a reverse proxy.
- The
start-parallel-capture-processes
command should run continually in a dedicated process. - The
cleanup
command should be run on a scheduler, for example every 5 minutes.
The default settings assume that Scoop runs in headful mode, which generally yields better results.
Running in headful mode requires that a window system, if none is available:
- You may consider switching to headless mode (
--headless true
) - Or use
xvfb-run
to provide a simulated X environment to thestart-parallel-capture-processes
command:
xvfb-run --auto-servernum -- flask start-parallel-capture-processes
This project uses pytest.
The test suite creates a "throw-away" database for the duration of the test session.
The test suite will also create a temporary storage folder that gets deleted at the end of the test suite.
# Running tests
poetry run pytest -v
# Run linter
poetry run flake8
# Run auto formatter
poetry run black scoop_witness_api
# Bump app version
poetry version patch