All the ETLs are written in Python. Each ETL is one file in the covid19-tools/covid19-etl/etl/
folder. This file contains a class inheriting from the base ETL class.
The ETL modules are automatically imported by this code. The main function assumes that every ETL inherits from the base class: it calls the files_to_submissions function and then the submit_metadata function. The existence of these functions is enforced in the tests.
All the ETLs are initialized with a base_url, an access_token and an s3_bucket, even if an individual ETL doesn't need all of them.
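The flow described above can be sketched roughly as follows. This is a minimal illustration and not the repository's actual main function: the BaseETL stand-in, the DemoETL class, the run_etl helper and the example URL are hypothetical names introduced here.

```python
class BaseETL:
    """Stand-in for the base ETL class (hypothetical, for illustration only)."""

    def __init__(self, base_url, access_token, s3_bucket):
        self.base_url = base_url
        self.access_token = access_token
        self.s3_bucket = s3_bucket

    def files_to_submissions(self):
        raise NotImplementedError

    def submit_metadata(self):
        raise NotImplementedError


class DemoETL(BaseETL):
    """Hypothetical ETL that only records which steps ran."""

    def __init__(self, base_url, access_token, s3_bucket):
        super().__init__(base_url, access_token, s3_bucket)
        self.steps = []

    def files_to_submissions(self):
        self.steps.append("files_to_submissions")

    def submit_metadata(self):
        self.steps.append("submit_metadata")


def run_etl(etl_class, base_url, access_token, s3_bucket=None):
    """Instantiate an ETL with all three arguments, then call both entry points in order."""
    etl = etl_class(base_url, access_token, s3_bucket)
    etl.files_to_submissions()
    etl.submit_metadata()
    return etl


# This ETL does not upload to S3, so the bucket is left unset.
etl = run_etl(DemoETL, "https://example.org", "fake-token")
print(etl.steps)
```

Because every ETL accepts the same three arguments, the caller can treat all of them uniformly, even when a given ETL ignores some of the values.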
In the root of the covid19-tools repo:
pip install -r covid19-etl-requirements.txt
export ACCESS_TOKEN=<access token>
JOB_NAME=<name of the ETL to run> S3_BUCKET=<bucket> python covid19-etl/main.py
- JOB_NAME is required.
- ACCESS_TOKEN is required. If the ETL you are running does not need an access token, use a fake value.
- S3_BUCKET is optional, but ETLs that upload files to S3 need it.
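The requirements on these environment variables can be expressed as a small validation step. The sketch below is an assumption about how such a check could look, not the code in covid19-etl/main.py; the read_settings helper and the keys of the dictionary it returns are hypothetical.

```python
import os


def read_settings(env=os.environ):
    """Collect the ETL settings from a mapping of environment variables."""
    # JOB_NAME and ACCESS_TOKEN are required; an empty value counts as missing.
    missing = [name for name in ("JOB_NAME", "ACCESS_TOKEN") if not env.get(name)]
    if missing:
        raise ValueError("Missing required variable(s): " + ", ".join(missing))
    return {
        "job_name": env["JOB_NAME"],
        "access_token": env["ACCESS_TOKEN"],
        # S3_BUCKET is optional, so it defaults to None when unset.
        "s3_bucket": env.get("S3_BUCKET"),
    }


# A plain dict stands in for the real environment here.
settings = read_settings({"JOB_NAME": "demo", "ACCESS_TOKEN": "fake"})
print(settings)
```

Passing the environment as an argument keeps the check easy to exercise without modifying the real process environment.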
- Create a file in the covid19-tools/covid19-etl/etl/ folder. The file name should be <ETL identifier (lowercase)>.py.
- In this file, create an ETL child class as follows (replace the class name):
from etl import base

class <ETL identifier (uppercase)>(base.BaseETL):
    def __init__(self, base_url, access_token, s3_bucket):
        self.base_url = base_url
        self.access_token = access_token
        self.s3_bucket = s3_bucket

    def files_to_submissions(self):
        # ETL code that reads from the data source
        # and generates the data to submit/upload
        pass

    def submit_metadata(self):
        # submit/upload the transformed data
        pass
- Write the files_to_submissions and submit_metadata functions.
- In the root of the covid19-tools repo, you can run the following to make sure the format is correct:
pip install -r covid19-etl-requirements.txt
pip install -r test-requirements.txt
pip install pytest~=3.6
pytest -vv covid19-etl/tests
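As a rough idea of what a format check of this kind might verify, the sketch below tests whether an ETL class provides both required functions. It is an assumption for illustration and not the content of covid19-etl/tests; BaseETL here is a bare stand-in, and GoodETL, IncompleteETL and missing_methods are hypothetical names.

```python
REQUIRED_METHODS = ("files_to_submissions", "submit_metadata")


class BaseETL:
    """Bare stand-in for the base ETL class (hypothetical)."""


class GoodETL(BaseETL):
    def files_to_submissions(self):
        pass

    def submit_metadata(self):
        pass


class IncompleteETL(BaseETL):
    # submit_metadata is deliberately left out.
    def files_to_submissions(self):
        pass


def missing_methods(etl_class):
    """Return the names of required methods that the class does not provide."""
    return [
        name
        for name in REQUIRED_METHODS
        if not callable(getattr(etl_class, name, None))
    ]


print(missing_methods(GoodETL))
print(missing_methods(IncompleteETL))
```

A check like this fails early on an incomplete ETL, before the main function tries to call a function that does not exist.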