# Tools to work with the COVID-19 Data Commons

| Jira | Dataset | Source | Scheduled / One-time |
| --- | --- | --- | --- |
| COV-24 | Johns Hopkins Data | here | Scheduled |
| COV-12 | IDPH County-level data | here, here and here | Scheduled |
| COV-79 | IDPH Zipcode data | here | Scheduled |
| COV-273 | IDPH Facility data | here (JSON) | Scheduled |
| COV-345 | IDPH Hospital data | here (JSON) | Scheduled |
| COV-1014 | IDPH Regional ICU Capacity | here (JSON) | Scheduled |
| COV-1019 | IDPH Hospital Utilization | here (JSON) | Scheduled |
| COV-720 | IDPH Vaccine | here | Scheduled |
| COV-925 | IDPH Vaccine to S3 | IDPH Vaccine data | Scheduled |
| COV-18 | nCOV2019 | here | One-time |
| COV-34, COV-454 | CTP | here and here | Scheduled |
| COV-97 | DS4C | Kaggle | One-time |
| COV-126 | DSCI | Kaggle | One-time |
| COV-172 | DSFSI | here | One-time |
| COV-170 | CCMap | here | One-time |
| COV-192 | OWID2 | here | Scheduled |
| COV-237 | Chicago Neighborhoods Data | here (JSON) | Scheduled |
| COV-361 | NPI-PRO | here | One-time |
| COV-220 | COXRAY | Kaggle | One-time |
| COV-422 | SSR | Controlled data | One-time (for now) |
| COV-434 | STOPLIGHT | here | Scheduled |
| COV-450 | VAC-TRACKER | here | Scheduled |
| COV-453 | CHESTX-RAY8 | here | One-time |
| COV-521 | ATLAS | here | One-time |
| COV-465 | NCBI-FILE | bucket | Scheduled |
| COV-482 | NCBI-MANIFEST | bucket | Scheduled |
| COV-465 | NCBI | bucket | Scheduled |
| COV-532 | COM-MOBILITY | here | Scheduled |
| COV-1151 | CITYOFCHICAGO | here | Scheduled |

## Deployment

To deploy the daily/weekly ETLs, set up the following in the adminVM crontab:

```
crontab -e
```

And add the following:

```
USER=<username with submission access>
S3_BUCKET=<name of bucket to upload data to>

# The sample format for a new job is as follows. Please create a new job for an ETL
# in the same format and make sure its execution does not overlap with any other job
# (just to avoid causing overload).

0   6   *   *   *    (if [ -f $HOME/cloud-automation/files/scripts/covid19-etl-job.sh ]; then JOB_NAME=jhu bash $HOME/cloud-automation/files/scripts/covid19-etl-job.sh; else echo "no covid19-etl-job.sh"; fi) > $HOME/covid19-etl-$JOB_NAME-cronjob.log 2>&1
```


Note: The time in adminVM is in UTC.

## Special instructions

### COXRAY

This is a local-only ETL; it requires the data to be available locally. Before running the ETL, download the data, which is available here (a Kaggle account is required). The contents of the archive should go into the folder `./data` (this can be changed via `COXRAY_DATA_PATH` in `coxray.py` and `coxray_file.py`), resulting in the following structure:

```
covid19-tools
...
├── data
│   ├── annotations
│   │   └── ...
│   ├── images
│   │   └── ...
│   └── metadata.csv
...
```

The ETL consists of two parts: `COXRAY_FILE` for file upload and `COXRAY` for metadata submission.

`COXRAY_FILE` should run first: it uploads the files. `COXRAY` should run after `COXRAY_FILE`: it creates the clinical data and links it to the files in indexd.

### CHESTX-RAY8

This is a local-only ETL; it requires the data to be available locally. Before running the ETL, clone the data repository, which is available here, into the folder `./data` (this can be changed via `CHESTXRAY8_DATA_PATH` in `chestxray8.py`), resulting in the following structure:

```
covid19-tools
...
├── data
│   ├── COVID-19
│   │   ├── X-Ray Image DataSet
│   │   │   ├── No_findings
│   │   │   └── Pneumonia
...
```
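
Both local-only ETLs above depend on the data being laid out exactly as shown, so it can be worth sanity-checking the layout before kicking off a run. The following is a minimal sketch (not part of the repo), assuming the default `./data` location and the folder names from the trees above:

```python
from pathlib import Path

# Hypothetical pre-flight check for the local data layouts described above.
# Adjust DATA if COXRAY_DATA_PATH or CHESTXRAY8_DATA_PATH were changed.
DATA = Path("data")

COXRAY_EXPECTED = [
    DATA / "annotations",
    DATA / "images",
    DATA / "metadata.csv",
]

CHESTXRAY8_EXPECTED = [
    DATA / "COVID-19" / "X-Ray Image DataSet" / "No_findings",
    DATA / "COVID-19" / "X-Ray Image DataSet" / "Pneumonia",
]

def check(label, expected):
    missing = [str(p) for p in expected if not p.exists()]
    if missing:
        print(f"{label}: missing {missing}")
    else:
        print(f"{label}: layout looks OK")

check("COXRAY", COXRAY_EXPECTED)
check("CHESTX-RAY8", CHESTXRAY8_EXPECTED)
```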

### NCBI

There are three NCBI ETL processes:

- `NCBI_MANIFEST`: index virus sequence object data in indexd.
- `NCBI_FILE`: split the clinical metadata into multiple files by accession number, and index the files in indexd.
- `NCBI`: submit NCBI clinical data to the graph by creating metadata records for the files indexed by `NCBI_FILE`.

While either `NCBI_MANIFEST` or `NCBI_FILE` can run first, `NCBI` needs to run last because it needs the indexd information from the other two. It is common for object data to become available before the associated metadata, so the `NCBI_MANIFEST` job might index files for which we don't have metadata yet; in that case, the files are not linked to the graph.

The input data for `NCBI_MANIFEST` is available in the public bucket `sra-pub-sars-cov2`.

The input data for `NCBI` and `NCBI_FILE` is available in the public bucket `sra-pub-sars-cov2-metadata-us-east-1`, with the following structure:

```
covid19-tools
...
├── sra-pub-sars-cov2-metadata-us-east-1
│   ├── contigs
│   │   └── contigs.json
│   └── peptides
│       └── peptides.json
...
```
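
Since the bucket is public, its contents can be read without AWS credentials. Below is a minimal sketch using boto3 with unsigned requests; the object key is assumed from the bucket layout shown above:

```python
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous (unsigned) client: the bucket is public, so no credentials are needed.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

# Object key assumed from the bucket structure above.
obj = s3.get_object(
    Bucket="sra-pub-sars-cov2-metadata-us-east-1",
    Key="contigs/contigs.json",
)

# Peek at the first line instead of reading the whole (potentially large) file.
print(next(obj["Body"].iter_lines())[:200])
```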

Deployment: the NCBI ETL needs a Google Cloud setup to access the BigQuery public table. For Gen3, the credential needs to be put in `Gen3Secrets/g3auto/covid19-etl/default.json`.
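
As a minimal sketch of that setup (assuming `default.json` is a standard service-account key file; the table name below is a placeholder, since the actual public table is configured in the ETL code):

```python
from google.cloud import bigquery

# Assumes default.json is a standard Google service-account key file.
client = bigquery.Client.from_service_account_json(
    "Gen3Secrets/g3auto/covid19-etl/default.json"
)

# Placeholder table name; the ETL queries the actual NCBI public table.
query = "SELECT * FROM `some_project.some_dataset.some_table` LIMIT 5"
for row in client.query(query).result():
    print(dict(row))
```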

Notes:

- An accession number is expected to be in the format `[SDE]RR\d+`: `SRR` for data submitted to NCBI, `ERR` for EMBL-EBI (European Molecular Biology Laboratory), and `DRR` for DDBJ (DNA Data Bank of Japan). A regex check is sketched below.
- The `NCBI_MANIFEST` ETL uses the `last_submission_identifier` field of the `project` node to keep track of the last submission datetime. This prevents the ETL from checking and re-indexing files that were already indexed.
- A virus sequence run taxonomy record without a matching submitter ID in virus sequence links to CMC only; otherwise it links to both CMC and virus sequence.
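
For illustration, the accession format from the first note can be validated with a simple regular expression. This sketch is not from the repo:

```python
import re

# Accession numbers look like SRR/ERR/DRR followed by digits, per the note above.
ACCESSION_RE = re.compile(r"^[SDE]RR\d+$")

for accession in ["SRR12345678", "ERR4082025", "DRR000001", "XRR123", "SRR"]:
    print(accession, bool(ACCESSION_RE.match(accession)))
```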