In this case study, we describe applying DeepVariant to a real WGS sample. We
provide two examples: one running on a CPU-only machine, and one running on a
machine with a GPU. Finally, we assess the quality of the DeepVariant variant
calls with hap.py.
NOTE: This case study demonstrates an example of how to run DeepVariant end-to-end on one machine. We report the runtime for a specific machine type for the sake of consistency. This is NOT the fastest or cheapest configuration. For more scalable execution of DeepVariant, see the External Solutions section.
The DeepVariant pipeline consists of 3 steps: `make_examples`, `call_variants`, and `postprocess_variants`. You can now run DeepVariant with one command using the `run_deepvariant` script.

Here is an example command:
```bash
sudo docker run \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF}" \
  --reads="/input/${BAM}" \
  --output_vcf="/output/${OUTPUT_VCF}" \
  --output_gvcf="/output/${OUTPUT_GVCF}" \
  --num_shards=${N_SHARDS}
```
By specifying `--model_type=WGS`, you'll be using a model that is best suited
for Illumina Whole Genome Sequencing data.
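The command above assumes a handful of shell variables have already been set. A minimal sketch of such a setup, using the case-study filenames described later in this document (the directory layout, output filenames, `N_SHARDS` choice, and `BIN_VERSION` value are illustrative assumptions, not part of the case study itself):

```bash
# Illustrative setup only; adjust paths and the release tag to your environment.
BASE="${HOME}/case-study"              # assumption: working directory
DATA_DIR="${BASE}/input"               # holds the reference, BAM, and truth files
OUTPUT_DIR="${BASE}/output"
REF="hs37d5.fa.gz"                     # reference FASTA described below
BAM="HG002_NIST_150bp_50x.bam"         # case-study BAM described below
OUTPUT_VCF="HG002.output.vcf.gz"       # assumption: any output name works
OUTPUT_GVCF="HG002.output.g.vcf.gz"
N_SHARDS="$(nproc)"                    # one shard per available CPU core
BIN_VERSION="0.10.0"                   # assumption: match the release you run
mkdir -p "${DATA_DIR}" "${OUTPUT_DIR}"
```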
The script run_wgs_case_study_docker.sh shows a full example of which data to download and how to run DeepVariant on that data.
Before you run the script, you can read through all sections to understand the details. Here is a quick way to get the script and run it:
```bash
curl https://raw.githubusercontent.com/google/deepvariant/r0.10/scripts/run_wgs_case_study_docker.sh | bash
```
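If you'd rather read through the script before executing it (as suggested above), a simple alternative to piping it into bash is to download it first. A sketch:

```bash
# Download the script, review it, then run it explicitly.
curl -O https://raw.githubusercontent.com/google/deepvariant/r0.10/scripts/run_wgs_case_study_docker.sh
less run_wgs_case_study_docker.sh    # inspect the download and run steps
bash run_wgs_case_study_docker.sh
```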
To run on a machine with a GPU, here is an example command:
```bash
sudo nvidia-docker run \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  google/deepvariant:"${BIN_VERSION}-gpu" \
  /opt/deepvariant/bin/run_deepvariant \
  --model_type=WGS \
  --ref="/input/${REF}" \
  --reads="/input/${BAM}" \
  --output_vcf="/output/${OUTPUT_VCF}" \
  --output_gvcf="/output/${OUTPUT_GVCF}" \
  --num_shards=${N_SHARDS}
```
Note that instead of using `docker`, we're using `nvidia-docker` to make use of
the GPU. `call_variants` is the only step that uses the GPU; `make_examples`
and `postprocess_variants` do not run on GPU.
The script run_wgs_case_study_docker_gpu.sh shows a full example, including
calling a script install_nvidia_docker.sh that helps you install nvidia-docker.
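Before launching the full pipeline on a GPU machine, it can be useful to confirm that the container runtime actually sees the GPU. A minimal sketch (the CUDA image tag is an assumption; any CUDA base image that ships `nvidia-smi` will do):

```bash
# If the GPU is visible to the container, nvidia-smi prints the driver
# version and a device table; otherwise this command fails.
sudo nvidia-docker run --rm nvidia/cuda:10.0-base nvidia-smi
```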
See this page for the commands used to obtain different machine types on
Google Cloud.
Currently, the `call_variants` step cannot use more than 1 GPU on the same
machine, and the `postprocess_variants` step is single-process, single-thread.
With the example in run_wgs_case_study_docker.sh on a CPU-only machine, the approximate runtimes are:

Step | Hardware | Wall time |
---|---|---|
`make_examples` | 64 CPUs | ~ 1h 46m |
`call_variants` | 64 CPUs | ~ 3h 09m |
`postprocess_variants` (with gVCF) | 1 CPU | ~ 53m |
With the example in run_wgs_case_study_docker_gpu.sh on a GPU machine, the approximate runtimes are:

Step | Hardware | Wall time |
---|---|---|
`make_examples` | 16 CPUs | ~ 5h 50m |
`call_variants` | 1 P100 GPU, 16 CPUs | ~ 56m |
`postprocess_variants` (with gVCF) | 1 CPU | ~ 43m |
Since `make_examples` doesn't utilize GPUs, bringing up one GPU machine for all
steps might not be the most cost-effective solution. For more scalable execution
of DeepVariant, see the External Solutions section.
The original sources of the data used in this case study:

-   BAM file: HG002_NIST_150bp_50x.bam

    The original FASTQ file comes from the PrecisionFDA Truth Challenge. We ran
    it through PrecisionFDA's BWA-MEM app with default settings, and got the
    HG002_NIST_150bp_50x.bam file as output.

    Since 2019-02-26, the file has been further processed using SAMtools 1.9 and
    HTSlib 1.9 to address formatting problems caused by an older version of
    HTSlib:

    ```bash
    samtools view -bh HG002_NIST_150bp_50x.bam -o HG002_NIST_150bp_50x.bam
    ```

    The FASTQ files are originally from the Genome in a Bottle Consortium. For
    more information, see this Scientific Data paper.

-   FASTA file: hs37d5.fa.gz

    The original file came from
    ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_reference_assembly_sequence.
    Because DeepVariant requires bgzip files, we had to unzip and bgzip it, and
    create the corresponding index files (see the sketch after this list).

-   Truth VCF and BED

    These come from NIST, as part of the Genome in a Bottle project. They are
    downloaded from
    ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/AshkenazimTrio/HG002_NA24385_son/NISTv3.3.2/GRCh37/
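For the reference FASTA, the conversion from the gzip-compressed download to a bgzip-compressed, indexed file might look like the following sketch (assuming samtools and htslib's bgzip are installed; filenames follow the FASTA described above):

```bash
# Re-compress the 1000 Genomes reference with bgzip and build the indexes
# DeepVariant needs (.fai and .gzi). Assumes samtools/bgzip are on PATH.
gunzip hs37d5.fa.gz              # the downloaded file is plain gzip
bgzip hs37d5.fa                  # recreates hs37d5.fa.gz, now block-gzipped
samtools faidx hs37d5.fa.gz      # writes hs37d5.fa.gz.fai and hs37d5.fa.gz.gzi
```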
We used the hap.py (https://github.com/Illumina/hap.py) program from Illumina
to evaluate the resulting VCF file. This serves as a check to ensure the three
DeepVariant steps ran correctly and produced high-quality results.
Type | # FN | # FP | Recall | Precision | F1_Score |
---|---|---|---|---|---|
INDEL | 937 | 717 | 0.997984 | 0.998519 | 0.998251 |
SNP | 1405 | 748 | 0.999539 | 0.999755 | 0.999647 |
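For reference, an evaluation like the one above might be run through Illumina's hap.py Docker image along the lines of the following sketch (the pkrusche/hap.py image and the TRUTH_VCF/TRUTH_BED variable names are assumptions; substitute the truth files downloaded from the NIST FTP directory listed earlier):

```bash
# Compare the DeepVariant call set against the GIAB truth set within the
# confident regions. TRUTH_VCF and TRUTH_BED are placeholders for the truth
# files downloaded from the NIST FTP directory above.
sudo docker run --rm \
  -v "${DATA_DIR}":"/input" \
  -v "${OUTPUT_DIR}":"/output" \
  pkrusche/hap.py /opt/hap.py/bin/hap.py \
  "/input/${TRUTH_VCF}" \
  "/output/${OUTPUT_VCF}" \
  -f "/input/${TRUTH_BED}" \
  -r "/input/${REF}" \
  -o "/output/happy.output"
```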