DeepVariant whole genome sequencing case study

In this case study, we describe applying DeepVariant to a real WGS sample. We provide two examples: one running on a CPU-only machine and one running on a machine with a GPU. Finally, we assess the quality of the DeepVariant variant calls with hap.py.

NOTE: This case study demonstrates how to run DeepVariant end-to-end on one machine. We report runtimes on specific machine types so that the numbers are consistent across reports. This is NOT the fastest or cheapest configuration. For more scalable execution of DeepVariant, see the External Solutions section.

Running DeepVariant with one command

The DeepVariant pipeline consists of three steps: make_examples, call_variants, and postprocess_variants. You can run all three with one command using the run_deepvariant script.

Running on a CPU-only machine

Here is an example command:

```
sudo docker run \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF}" \
  --reads="/input/${BAM}" \
  --output_vcf="/output/${OUTPUT_VCF}" \
  --output_gvcf="/output/${OUTPUT_GVCF}" \
  --num_shards="${N_SHARDS}"
```

By specifying --model_type=WGS, you'll be using a model that is best suited for Illumina Whole Genome Sequencing data.
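
The command above assumes that several shell variables are already set. A minimal sketch of one way to define them follows. The directory layout and the output file names are hypothetical, BIN_VERSION is assumed to be 0.10.0 to match the r0.10 branch referenced in this case study, and the REF and BAM names are taken from the Data description section.

```shell
# Hypothetical working directories; adjust to your own layout.
BASE_DIR="${PWD}/deepvariant-case-study"
DATA_DIR="${BASE_DIR}/input"
OUTPUT_DIR="${BASE_DIR}/output"

# Assumed release tag (chosen to match the r0.10 branch used in this case study).
BIN_VERSION="0.10.0"

# File names from the Data description section; they are expected in DATA_DIR.
REF="hs37d5.fa.gz"
BAM="HG002_NIST_150bp_50x.bam"

# Hypothetical output file names.
OUTPUT_VCF="HG002.output.vcf.gz"
OUTPUT_GVCF="HG002.output.g.vcf.gz"

# One shard per available CPU core (nproc is part of GNU coreutils).
N_SHARDS="$(nproc)"

# Create the directories that will be mounted into the container.
mkdir -p "${DATA_DIR}" "${OUTPUT_DIR}"
```

The reference, its index files, and the BAM with its index would then need to be placed in DATA_DIR before running the docker command.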

The script run_wgs_case_study_docker.sh shows a full example of which data to download and how to run DeepVariant on that data.

Before you run the script, you can read through all sections to understand the details. Here is a quick way to get the script and run it:

```
curl https://raw.githubusercontent.com/google/deepvariant/r0.10/scripts/run_wgs_case_study_docker.sh | bash
```
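
Piping a remote script straight into bash runs it before you have had a chance to read it. As a sketch of a more cautious alternative, the snippet below downloads the script to a local file first so that it can be reviewed. The helper names case_study_url and fetch_case_study_script are hypothetical; the URL pattern is the one used in the command above.

```shell
# Build the raw GitHub URL for a script on a given release branch.
# Both helper names are illustrative and not part of DeepVariant.
case_study_url() {
  branch="$1"
  script="$2"
  echo "https://raw.githubusercontent.com/google/deepvariant/${branch}/scripts/${script}"
}

# Download the script into the current directory without executing it.
fetch_case_study_script() {
  branch="$1"
  script="$2"
  curl --fail --silent --show-error --location \
    --output "${script}" "$(case_study_url "${branch}" "${script}")"
}

# Example usage (requires network access):
#   fetch_case_study_script r0.10 run_wgs_case_study_docker.sh
#   less run_wgs_case_study_docker.sh    # review what it downloads and runs
#   bash run_wgs_case_study_docker.sh
```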

Running on a machine with GPU

Here is an example command:

```
sudo nvidia-docker run \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF}" \
  --reads="/input/${BAM}" \
  --output_vcf="/output/${OUTPUT_VCF}" \
  --output_gvcf="/output/${OUTPUT_GVCF}" \
  --num_shards="${N_SHARDS}"
```

Note that instead of using docker, we're using nvidia-docker to make use of the GPU. call_variants is the only step that uses the GPU. make_examples and postprocess_variants do not run on GPU.
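
Before launching the GPU variant of the command, it can help to confirm that the machine actually has a usable GPU setup. The following sketch picks between docker and nvidia-docker and the matching image tag suffix. The function name and the fallback behaviour are assumptions made for illustration; the check relies only on nvidia-docker and nvidia-smi being on the PATH.

```shell
# Print "nvidia-docker" when both nvidia-docker and a working nvidia-smi are
# found on this machine; otherwise print "docker".
pick_docker_command() {
  if command -v nvidia-docker >/dev/null 2>&1 \
      && command -v nvidia-smi >/dev/null 2>&1 \
      && nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-docker"
  else
    echo "docker"
  fi
}

DOCKER_CMD="$(pick_docker_command)"

# The GPU image uses the "-gpu" tag suffix shown in the command above.
if [ "${DOCKER_CMD}" = "nvidia-docker" ]; then
  IMAGE_SUFFIX="-gpu"
else
  IMAGE_SUFFIX=""
fi

# Only report the choice; the literal ${BIN_VERSION} is printed unexpanded.
echo "Would run: sudo ${DOCKER_CMD} run ... google/deepvariant:\${BIN_VERSION}${IMAGE_SUFFIX}"
```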

The script run_wgs_case_study_docker_gpu.sh shows a full example, including a call to install_nvidia_docker.sh, a script that helps you install nvidia-docker.

Runtime

See this page for the commands used to obtain different machine types on Google Cloud. Currently, the call_variants step cannot use more than 1 GPU on the same machine, and the postprocess_variants step is single-process, single-thread.

With the example in run_wgs_case_study_docker.sh on a CPU machine:

| Step | Hardware | Wall time |
| --- | --- | --- |
| `make_examples` | 64 CPUs | ~ 1h 46m |
| `call_variants` | 64 CPUs | ~ 3h 09m |
| `postprocess_variants` (with gVCF) | 1 CPU | ~ 53m |

With the example in run_wgs_case_study_docker_gpu.sh on a GPU machine:

| Step | Hardware | Wall time |
| --- | --- | --- |
| `make_examples` | 16 CPUs | ~ 5h 50m |
| `call_variants` | 1 P100 GPU, 16 CPUs | ~ 56m |
| `postprocess_variants` (with gVCF) | 1 CPU | ~ 43m |

Since make_examples doesn't utilize GPUs, bringing up one GPU machine for all steps might not be the most cost-effective solution. For more scalable execution of DeepVariant see the External Solutions section.
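
Adding up the per-step wall times from the two tables above makes the trade-off concrete. The sketch below does that arithmetic in the shell. The helper names are hypothetical and the inputs are the approximate figures reported above, so the totals are approximate too.

```shell
# Convert "hours minutes" to a number of minutes.
to_minutes() {
  echo $(( $1 * 60 + $2 ))
}

# Format a minute count as "Xh YYm".
format_minutes() {
  printf '%dh %02dm\n' $(( $1 / 60 )) $(( $1 % 60 ))
}

# CPU machine: make_examples 1h 46m, call_variants 3h 09m, postprocess_variants 53m.
CPU_TOTAL=$(( $(to_minutes 1 46) + $(to_minutes 3 9) + $(to_minutes 0 53) ))

# GPU machine: make_examples 5h 50m, call_variants 56m, postprocess_variants 43m.
GPU_TOTAL=$(( $(to_minutes 5 50) + $(to_minutes 0 56) + $(to_minutes 0 43) ))

echo "CPU machine total: ~$(format_minutes "${CPU_TOTAL}")"
echo "GPU machine total: ~$(format_minutes "${GPU_TOTAL}")"
```

With these figures the single GPU machine finishes later overall (about 7h 29m versus 5h 48m), because make_examples on 16 CPUs dominates the run even though call_variants is much faster on the GPU.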

Data description

The original source of data used in this case study:

BAM file: HG002_NIST_150bp_50x.bam

The original FASTQ file comes from the PrecisionFDA Truth Challenge. We ran it through PrecisionFDA's BWA-MEM app with default settings and obtained HG002_NIST_150bp_50x.bam as output.

As of 2019-02-26, the file has been further processed using SAMtools 1.9 and HTSlib 1.9 to address formatting problems caused by an older version of HTSlib:
```
# original.bam is a placeholder name for the unprocessed input file.
samtools view -bh original.bam -o HG002_NIST_150bp_50x.bam
```
The FASTQ files are originally from the Genome in a Bottle Consortium. For more information, see this Scientific Data paper.

FASTA file: hs37d5.fa.gz

The original file came from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence. Because DeepVariant requires bgzip-compressed files, we decompressed the file, recompressed it with bgzip, and created the corresponding index files.
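
The decompress, bgzip, and index steps described above could look roughly like the following. This is a sketch rather than the exact commands the authors ran: the function name and the example file names are hypothetical, and it assumes that bgzip (from HTSlib) and samtools are installed.

```shell
# Recompress a gzip-compressed FASTA with bgzip and index it.
# Usage: prepare_reference <input.fa.gz> <output.fa.gz>
prepare_reference() {
  src="$1"
  dst="$2"

  if [ ! -f "${src}" ]; then
    echo "Input FASTA not found: ${src}" >&2
    return 1
  fi
  for tool in bgzip samtools; do
    if ! command -v "${tool}" >/dev/null 2>&1; then
      echo "Required tool not found: ${tool}" >&2
      return 1
    fi
  done

  # Decompress the plain gzip stream and recompress it in BGZF format.
  gunzip -c "${src}" | bgzip -c > "${dst}" || return 1

  # Writes ${dst}.fai and ${dst}.gzi next to the bgzip-compressed FASTA.
  samtools faidx "${dst}"
}

# Example usage (hypothetical file names):
#   prepare_reference hs37d5.original.fa.gz hs37d5.fa.gz
```
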

Truth VCF and BED

These come from NIST, as part of the Genome in a Bottle project. They were downloaded from ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/

Variant call quality

We used the hap.py (https://github.com/Illumina/hap.py) program from Illumina to evaluate the resulting VCF file. This serves as a check that the three DeepVariant steps ran correctly and produced high-quality results.

| Type | # FN | # FP | Recall | Precision | F1_Score |
| --- | --- | --- | --- | --- | --- |
| INDEL | 937 | 717 | 0.997984 | 0.998519 | 0.998251 |
| SNP | 1405 | 748 | 0.999539 | 0.999755 | 0.999647 |
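
As a sanity check on the table, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The sketch below recomputes it from the reported values, using awk for the floating-point arithmetic. The function name f1_score is hypothetical.

```shell
# Compute F1 = 2 * precision * recall / (precision + recall), to 6 decimals.
# Usage: f1_score <precision> <recall>
f1_score() {
  awk -v p="$1" -v r="$2" 'BEGIN { printf "%.6f\n", 2 * p * r / (p + r) }'
}

echo "INDEL F1: $(f1_score 0.998519 0.997984)"
echo "SNP F1:   $(f1_score 0.999755 0.999539)"
```

Both results round to the F1_Score values in the table (0.998251 for INDEL and 0.999647 for SNP).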
