This repository has been archived by the owner on Oct 8, 2024. It is now read-only.

harvard-lil/scoop-witness-api

Scoop Witness API 🍨

A simple REST API for witnessing the web using the Scoop web archiving capture engine.

This first iteration is built around:


Summary


Getting Started

Note: this application has only been tested on UNIX-like systems.

1. Machine-level dependencies

2. Project-level dependencies

The following shortcut will:

  • Install all Python dependencies using poetry
  • Install Scoop and related dependencies using npm

bash install.sh

Depending on your system configuration, the following step might be needed to install Playwright's system-level dependencies:

sudo npx playwright install-deps chromium

3. Creating a config file

Copy scoop_witness_api/config.example.py to scoop_witness_api/config.py and adjust as needed.

If you would like to use an alternative way of providing configuration, see:

4. Setting up the database

The following command creates and initializes the database tables for the application to use.

poetry run flask create-tables

5. Creating an access key

The API requires authentication. Access keys can be created with the create-access-key command.

poetry run flask create-access-key --label="My key"

6. Starting the server

The following command starts the development server on port 5000.

poetry run flask run 

7. Starting the capture process

The following command starts the capture process.

poetry run flask start-capture-process
# Capture process runs until interrupted

Alternatively, this command can be used to run parallel capture processes:

poetry run flask start-parallel-capture-processes

More details in the CLI and API sections of this document.

👆 Back to the summary


Configuration

The application's settings are defined globally using Flask's configuration system.

All available options are listed and described in config.example.py, which needs to be copied to config.py and edited.

config.example.py can be edited in place and used as-is, or replaced with another file / method of storing configuration data that Flask supports.

Be sure to edit __init__.py accordingly if you choose to use a different method of providing settings to Flask.

With a few exceptions, all related to input/output, every CLI option available for Scoop can be configured in config.example.py.

👆 Back to the summary


CLI

This application was built using Flask for both its REST API and CLI components.

Custom commands were created as a way to operate the application and administer it.

Listing available commands
poetry run flask --help
# Sub-commands also have a help menu:
poetry run flask create-access-key --help

create-tables
poetry run flask create-tables

Creates a new SQLite database if needed and populates it with tables.

create-access-key
poetry run flask create-access-key --label "John Doe"

Creates a new API access key. The key is displayed only once, in the output of this command.

cancel-access-key
poetry run flask cancel-access-key --id_access_key 1

Makes a given access key inoperable.

status
poetry run flask status

Lists access key ids, as well as pending and started captures.

start-capture-process
poetry run flask start-capture-process

Starts a capture process. Runs until it is manually interrupted with SIGINT (Ctrl + C).

This process:

  • Picks a pending capture request from the database, if any
  • Marks it as started
  • Uses Scoop to complete the capture
  • Stores the results
  • Starts over / waits for a new request to come in
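
The loop above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the real command works against the database and drives Scoop, whereas this sketch uses an in-memory list of dicts. The names process_next, run_scoop and store_results, as well as the "failed" status, are hypothetical and introduced here for the example.

```python
def process_next(captures, run_scoop, store_results):
    """Handle at most one pending capture and return it, or None if idle.

    `captures` is a list of dicts standing in for database rows;
    `run_scoop` and `store_results` are hypothetical callables.
    """
    # Pick a pending capture request, if any.
    pending = next((c for c in captures if c["status"] == "pending"), None)
    if pending is None:
        return None

    # Mark it as started before doing any work.
    pending["status"] = "started"

    try:
        # Complete the capture (the real application uses Scoop here).
        result = run_scoop(pending["url"])
    except Exception:
        # Assumption: a capture that raises ends up with a "failed" status.
        pending["status"] = "failed"
        return pending

    # Store the results, then mark the capture as successful.
    store_results(pending["id_capture"], result)
    pending["status"] = "success"
    return pending
```

A long-running process would call process_next repeatedly and sleep briefly whenever it returns None.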

The --proxy-port option specifies the port on which the proxy should run:

poetry run flask start-capture-process --proxy-port 9905

start-parallel-capture-processes
poetry run flask start-parallel-capture-processes

Starts parallel capture processes, the number of which is determined at application configuration level.

cleanup
poetry run flask cleanup

Removes "expired" files from storage. Shelf-life is determined by TEMPORARY_STORAGE_EXPIRATION at application configuration level.

This command should ideally be run on a scheduler.
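
To make the idea of "expired" files concrete, here is a minimal sketch of an age-based cleanup. It is not the cleanup command's actual code. It assumes the expiration is expressed in seconds and compared against each file's modification time, and remove_expired_files is a hypothetical helper name.

```python
import time
from pathlib import Path


def remove_expired_files(storage_dir, max_age_seconds, now=None):
    """Delete files older than `max_age_seconds` and return their paths.

    Assumption: age is measured from the file's modification time.
    `now` can be injected to make the function easy to test.
    """
    now = time.time() if now is None else now
    removed = []
    # Materialize the listing first so deletions do not disturb the walk.
    for path in sorted(Path(storage_dir).rglob("*")):
        if path.is_file() and now - path.stat().st_mtime > max_age_seconds:
            path.unlink()
            removed.append(path)
    return removed
```

Calling remove_expired_files("/path/to/storage", 3600) would remove files last modified more than an hour ago; the path and the value are placeholders.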

inspect-capture
poetry run flask inspect-capture --id_capture "8130d6fe-4adb-4142-a685-00a64bb6ff29"

Returns full details about a given capture as JSON. Can be used by administrators to inspect logs.

👆 Back to the summary


API

Note: Unless specified otherwise, every capture-related object returned by the API is generated using capture_to_dict().

[GET] /

Simple "ping" route to ensure the API is running. Returns HTTP 200 and an empty body.
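
As a quick health check, the ping route can be called from Python's standard library. This is a minimal sketch: API_BASE_URL assumes the development server on port 5000, and build_ping_request and api_is_up are hypothetical helper names.

```python
import urllib.request

# Assumption: the development server from "Getting Started" on port 5000.
API_BASE_URL = "http://localhost:5000"


def build_ping_request(base_url: str = API_BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a GET request for the ping route."""
    return urllib.request.Request(base_url.rstrip("/") + "/", method="GET")


def api_is_up(base_url: str = API_BASE_URL, timeout: float = 5.0) -> bool:
    """Return True if the ping route answers with HTTP 200."""
    try:
        request = build_ping_request(base_url)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and timeouts are all subclasses of OSError.
        return False
```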

[POST] /capture

Creates a capture request.

Authentication: Requires a valid access key, passed via the Access-Key header.

Accepts JSON body with the following properties:

  • url: URL to capture (required)
  • callback_url: URL to be called once capture is complete (optional). This URL will receive a JSON object describing the capture request and its current status.

Returns HTTP 200 and capture info.

The capture request will be rejected if the capture server is over capacity, as defined by the MAX_PENDING_CAPTURES setting in config.py.

Sample request:

{
  "url": "https://lil.law.harvard.edu"
}

Sample response:

{
  "callback_url": null,
  "created_timestamp": "Wed, 28 Jun 2023 16:30:28 GMT",
  "ended_timestamp": null,
  "follow": "https://scoop-witness-api.host/capture/5234bb37-58a8-4071-a65c-0f7815da5202",
  "id_capture": "5234bb37-58a8-4071-a65c-0f7815da5202",
  "started_timestamp": null,
  "status": "pending",
  "url": "https://lil.law.harvard.edu"
}

The follow property is a direct link to [GET] /capture/<id_capture>, described below.
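
A client could submit a capture request along these lines. This is a sketch using only the standard library: the base URL and access key are placeholders, the helper names are hypothetical, and sending a Content-Type: application/json header is an assumption about what the server expects for a JSON body.

```python
import json
import urllib.request

API_BASE_URL = "http://localhost:5000"  # assumption: local development server
ACCESS_KEY = "replace-with-your-access-key"  # placeholder, not a real key


def build_capture_request(url, callback_url=None,
                          base_url=API_BASE_URL, access_key=ACCESS_KEY):
    """Build (but do not send) a POST /capture request."""
    payload = {"url": url}
    if callback_url is not None:
        payload["callback_url"] = callback_url
    return urllib.request.Request(
        base_url.rstrip("/") + "/capture",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Access-Key": access_key,
            "Content-Type": "application/json",  # assumption, see above
        },
        method="POST",
    )


def request_capture(url, callback_url=None):
    """Send the request and return the decoded capture info."""
    request = build_capture_request(url, callback_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

The dict returned by request_capture should contain the follow URL shown in the sample response, which can then be polled for status.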

[GET] /capture/<id_capture>

Returns information about a specific capture.

Authentication: Requires a valid access key, passed via the Access-Key header. Access is limited to captures initiated using said access key.

Sample response:

{
  "artifacts": [
    "https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/archive.wacz",
    "https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/provenance-summary.html",
    "https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/screenshot.png",
    "https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/lil.law.harvard.edu.pem",
    "https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/analytics.lil.tools.pem"
  ],
  "callback_url": null,
  "created_timestamp": "Wed, 28 Jun 2023 16:30:28 GMT",
  "ended_timestamp": "Wed, 28 Jun 2023 16:30:45 GMT",
  "id_capture": "2eb7145f-dd8e-4354-bf06-6afc6015c446",
  "started_timestamp": "Wed, 28 Jun 2023 16:30:30 GMT",
  "status": "success",
  "temporary_playback_url": "https://replayweb.page/?source=https://scoop-witness-api.host/artifact/2eb7145f-dd8e-4354-bf06-6afc6015c446/archive.wacz",
  "url": "https://lil.law.harvard.edu"
}

The entries under artifacts are direct links to [GET] /artifact/<id_capture>/<filename>.

temporary_playback_url allows for checking the resulting WACZ against replayweb.page.
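
When no callback_url is used, a client can poll this route until the capture is done. The sketch below keeps the HTTP call out of the picture: fetch_capture is a hypothetical zero-argument callable returning the decoded JSON for one capture. The statuses "pending" and "success" come from the samples in this document and "started" from the status command; treating every other value as final is an assumption.

```python
import time

# Statuses considered "still in progress" (see the assumptions above).
IN_PROGRESS_STATUSES = {"pending", "started"}


def is_finished(capture: dict) -> bool:
    """Return True once the capture has left the in-progress statuses."""
    return capture.get("status") not in IN_PROGRESS_STATUSES


def wait_for_capture(fetch_capture, interval=2.0, max_attempts=30,
                     sleep=time.sleep):
    """Poll `fetch_capture` until the capture is finished, then return it.

    Raises TimeoutError if it is still in progress after `max_attempts`.
    """
    for _ in range(max_attempts):
        capture = fetch_capture()
        if is_finished(capture):
            return capture
        sleep(interval)
    raise TimeoutError(f"capture still in progress after {max_attempts} attempts")
```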

[GET] /artifact/<id_capture>/<filename>

Allows for accessing and downloading artifacts generated as part of the capture process.

This route is not access-controlled.

Files are only stored temporarily (see cleanup CLI command).
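
Because artifacts are only kept temporarily, a client will probably want to download them soon after a capture succeeds. A minimal sketch, assuming the capture dict has the shape shown in the sample response; the helper names are hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def artifact_filename(artifact_url: str) -> str:
    """Return the last path segment of an artifact URL, e.g. "archive.wacz"."""
    return PurePosixPath(urlsplit(artifact_url).path).name


def find_wacz(capture: dict):
    """Return the URL of the first .wacz artifact in a capture dict, or None."""
    for artifact_url in capture.get("artifacts", []):
        if artifact_filename(artifact_url).endswith(".wacz"):
            return artifact_url
    return None


def download_artifact(artifact_url: str, directory: str = ".") -> Path:
    """Stream one artifact into `directory` and return the local path."""
    destination = Path(directory) / artifact_filename(artifact_url)
    with urllib.request.urlopen(artifact_url, timeout=60) as response, \
            open(destination, "wb") as output:
        shutil.copyfileobj(response, output)
    return destination
```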

👆 Back to the summary


Deployment

Flask applications can be deployed in many different ways, so this section focuses on what is specific to this project:

  • The Flask application itself should be run using a production-ready WSGI server such as gunicorn, and ideally put behind a reverse proxy.
  • The start-parallel-capture-processes command should run continually in a dedicated process.
  • The cleanup command should be run on a scheduler, for example every 5 minutes.
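
In production, a system scheduler such as cron or a systemd timer is the usual way to run cleanup periodically. Where neither is available, a small Python loop can play the same role. This is only a sketch: run_every and run_cleanup are hypothetical helpers, and the command assumes poetry is on the PATH of the process.

```python
import subprocess
import time

# The cleanup command as documented in the CLI section.
CLEANUP_COMMAND = ["poetry", "run", "flask", "cleanup"]


def run_cleanup() -> int:
    """Run the cleanup command once and return its exit code."""
    return subprocess.run(CLEANUP_COMMAND, check=False).returncode


def run_every(task, interval_seconds, iterations=None, sleep=time.sleep):
    """Call `task`, wait `interval_seconds`, and repeat.

    Runs forever when `iterations` is None; `sleep` can be injected for tests.
    """
    count = 0
    while iterations is None or count < iterations:
        task()
        count += 1
        sleep(interval_seconds)
```

Calling run_every(run_cleanup, interval_seconds=5 * 60) would match the "every 5 minutes" suggestion above.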

Running in headful mode

The default settings assume that Scoop runs in headful mode, which generally yields better results.

Running in headful mode requires a window system. If none is available:

  • You may consider switching to headless mode (--headless true)
  • Or use xvfb-run to provide a simulated X environment to the start-parallel-capture-processes command:

xvfb-run --auto-servernum -- flask start-parallel-capture-processes

👆 Back to the summary


Tests and linting

This project uses pytest.

The test suite creates a "throw-away" database for the duration of the test session.

The test suite will also create a temporary storage folder that gets deleted at the end of the test suite.

# Running tests
poetry run pytest -v 

# Run linter
poetry run flake8

# Run auto formatter
poetry run black scoop_witness_api

# Bump app version
poetry version patch

👆 Back to the summary
