# Tools to work with the COVID-19 Data Commons

| Jira | Dataset | Source | Scheduled / One-time |
| --- | --- | --- | --- |
| COV-24 | Johns Hopkins Data | here | Scheduled |
| COV-12 | IDPH County-level data | here, here and here | Scheduled |
| COV-79 | IDPH Zipcode data | here | Scheduled |
| COV-273 | IDPH Facility data | here (JSON) | Scheduled |
| COV-345 | IDPH Hospital data | here (JSON) | Scheduled |
| COV-1014 | IDPH Regional ICU Capacity | here (JSON) | Scheduled |
| COV-1019 | IDPH Hospital Utilization | here (JSON) | Scheduled |
| COV-720 | IDPH Vaccine | here | Scheduled |
| COV-925 | IDPH Vaccine to S3 | IDPH Vaccine data | Scheduled |
| COV-18 | nCOV2019 | here | One-time |
| COV-34, COV-454 | CTP | here and here | Scheduled |
| COV-97 | DS4C | Kaggle | One-time |
| COV-126 | DSCI | Kaggle | One-time |
| COV-172 | DSFSI | here | One-time |
| COV-170 | CCMap | here | One-time |
| COV-192 | OWID2 | here | Scheduled |
| COV-237 | Chicago Neighborhoods Data | here (JSON) | Scheduled |
| COV-361 | NPI-PRO | here | One-time |
| COV-220 | COXRAY | Kaggle | One-time |
| COV-422 | SSR | Controlled data | One-time (for now) |
| COV-434 | STOPLIGHT | here | Scheduled |
| COV-450 | VAC-TRACKER | here | Scheduled |
| COV-453 | CHESTX-RAY8 | here | One-time |
| COV-521 | ATLAS | here | One-time |
| COV-465 | NCBI-FILE | bucket | Scheduled |
| COV-482 | NCBI-MANIFEST | bucket | Scheduled |
| COV-465 | NCBI | bucket | Scheduled |
| COV-532 | COM-MOBILITY | here | Scheduled |
| COV-1151 | CITYOFCHICAGO | here | Scheduled |

## Deployment

To deploy the daily/weekly ETLs, set up the following in the adminVM crontab:

```
crontab -e
```

And add the following:

```
USER=<username with submission access>
S3_BUCKET=<name of bucket to upload data to>

# The sample format for a new job is as follows. Please create a new job for an ETL
# in the same format and make sure its execution does not overlap with any other job
# (just to avoid causing overload).

0   6   *   *   *    (if [ -f $HOME/cloud-automation/files/scripts/covid19-etl-job.sh ]; then JOB_NAME=jhu bash $HOME/cloud-automation/files/scripts/covid19-etl-job.sh; else echo "no covid19-etl-job.sh"; fi) > $HOME/covid19-etl-$JOB_NAME-cronjob.log 2>&1
```


Note: The time in adminVM is in UTC.

## Special instructions

### COXRAY

This is a local-only ETL; it requires the data to be available locally. Before running the ETL, download the data, which is available here (a Kaggle account is required). The contents of the archive should go into the folder `./data` (this can be changed via `COXRAY_DATA_PATH` in `coxray.py` and `coxray_file.py`), resulting in the following structure:

```
covid19-tools
...
├── data
│   ├── annotations
│   │   └── ...
│   ├── images
│   │   └── ...
│   └── metadata.csv
...
```

The ETL consists of two parts: `COXRAY_FILE` for file upload and `COXRAY` for metadata submission.

`COXRAY_FILE` should run first: it uploads the files. `COXRAY` should run after `COXRAY_FILE`: it creates the clinical data and links it to the files in indexd.

### CHESTX-RAY8

This is a local-only ETL; it requires the data to be available locally. Before running the ETL, clone the data repository, which is available here, into the folder `./data` (this can be changed via `CHESTXRAY8_DATA_PATH` in `chestxray8.py`), resulting in the following structure:

```
covid19-tools
...
├── data
│   ├── COVID-19
│   │   ├── X-Ray Image DataSet
│   │   │   ├── No_findings
│   │   │   └── Pneumonia
...
```
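
Both local-only ETLs above depend on the data being laid out exactly as shown, so it can be worth sanity-checking the layout before kicking off a run. The following is a minimal sketch (not part of the repo), assuming the default `./data` location and the folder names from the trees above:

```python
from pathlib import Path

# Hypothetical pre-flight check for the local data layouts described above.
# Adjust DATA if COXRAY_DATA_PATH or CHESTXRAY8_DATA_PATH were changed.
DATA = Path("data")

COXRAY_EXPECTED = [
    DATA / "annotations",
    DATA / "images",
    DATA / "metadata.csv",
]

CHESTXRAY8_EXPECTED = [
    DATA / "COVID-19" / "X-Ray Image DataSet" / "No_findings",
    DATA / "COVID-19" / "X-Ray Image DataSet" / "Pneumonia",
]

def check(label, expected):
    missing = [str(p) for p in expected if not p.exists()]
    if missing:
        print(f"{label}: missing {missing}")
    else:
        print(f"{label}: layout looks OK")

check("COXRAY", COXRAY_EXPECTED)
check("CHESTX-RAY8", CHESTXRAY8_EXPECTED)
```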

### NCBI

There are three NCBI ETL processes:

- `NCBI_MANIFEST`: index virus sequence object data in indexd.
- `NCBI_FILE`: split the clinical metadata into multiple files by accession number, and index the files in indexd.
- `NCBI`: submit NCBI clinical data to the graph by creating metadata records for the files indexed by `NCBI_FILE`.

While either `NCBI_MANIFEST` or `NCBI_FILE` can run first, `NCBI` needs to run last because it needs the indexd information from the other two. It is common for object data to become available before the associated metadata, so the `NCBI_MANIFEST` job might index files for which we don't have metadata yet; in that case, the files are not linked to the graph.

The input data for `NCBI_MANIFEST` is available in the public bucket `sra-pub-sars-cov2`.

The input data for `NCBI` and `NCBI_FILE` is available in the public bucket `sra-pub-sars-cov2-metadata-us-east-1`, with the following structure:

```
covid19-tools
...
├── sra-pub-sars-cov2-metadata-us-east-1
│   ├── contigs
│   │   └── contigs.json
│   └── peptides
│       └── peptides.json
...
```
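
Since the bucket is public, its contents can be read without AWS credentials. Below is a minimal sketch using boto3 with unsigned requests; the object key is assumed from the bucket layout shown above:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: the bucket is public, so no credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Object key assumed from the bucket structure above.
obj = s3.get_object(
    Bucket="sra-pub-sars-cov2-metadata-us-east-1",
    Key="contigs/contigs.json",
)

# Peek at the first line instead of reading the whole (potentially large) file.
print(next(obj["Body"].iter_lines())[:200])
```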

Deployment: the NCBI ETL needs a Google Cloud setup to access the BigQuery public table. For Gen3, the credential needs to be put in `Gen3Secrets/g3auto/covid19-etl/default.json`.
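
As a minimal sketch of that setup (assuming `default.json` is a standard service-account key file; the table name below is a placeholder, since the actual public table is configured in the ETL code):

```python
from google.cloud import bigquery

# Assumes default.json is a standard Google service-account key file.
client = bigquery.Client.from_service_account_json(
    "Gen3Secrets/g3auto/covid19-etl/default.json"
)

# Placeholder table name; the ETL queries the actual NCBI public table.
query = "SELECT * FROM `some_project.some_dataset.some_table` LIMIT 5"
for row in client.query(query).result():
    print(dict(row))
```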

Notes:

- An accession number is expected to be in the format `[SDE]RR\d+`: `SRR` for data submitted to NCBI, `ERR` for EMBL-EBI (European Molecular Biology Laboratory), and `DRR` for DDBJ (DNA Data Bank of Japan). A regex check is sketched below.
- The `NCBI_MANIFEST` ETL uses the `last_submission_identifier` field of the `project` node to keep track of the last submission datetime. This prevents the ETL from checking and re-indexing files that were already indexed.
- A virus sequence run taxonomy record without a matching submitter ID in virus sequence links to CMC only; otherwise it links to both CMC and virus sequence.
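
For illustration, the accession format from the first note can be validated with a simple regular expression. This sketch is not from the repo:

```python
import re

# Accession numbers look like SRR/ERR/DRR followed by digits, per the note above.
ACCESSION_RE = re.compile(r"^[SDE]RR\d+$")

for accession in ["SRR12345678", "ERR4082025", "DRR000001", "XRR123", "SRR"]:
    print(accession, bool(ACCESSION_RE.match(accession)))
```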