Update WES case study and Quick Start.

PiperOrigin-RevId: 329596948
wu123tj · Sep 1, 2020 · 5105fb4 · 5105fb4
1 parent 23f2bc2
commit 5105fb4
Show file tree

Hide file tree

Showing 2 changed files with 122 additions and 83 deletions.
diff --git a/docs/deepvariant-exome-case-study.md b/docs/deepvariant-exome-case-study.md
@@ -4,116 +4,141 @@ Similar to the [case study on whole genome sequencing data], in this
 study we describe applying DeepVariant to a real exome sample using a single
 machine.
 
-NOTE: This case study demonstrates an example of how to run DeepVariant
-end-to-end on one machine. We report the runtime with [specific machine type]
-for the sake of consistency in reporting run time. This is NOT the fastest or
-cheapest configuration. For more scalable execution of DeepVariant see the
-[External Solutions] section.
+## Prepare environment
 
-## Running DeepVariant with one command
+### Tools
 
-DeepVariant pipeline consists of 3 steps: `make_examples`, `call_variants`, and
-`postprocess_variants`. You can now run DeepVariant with one command using the
-`run_deepvariant` script.
+[Docker](https://docs.docker.com/get-docker/) will be used to run DeepVariant
+and [hap.py](https://github.com/illumina/hap.py),
 
-### Running on a CPU-only machine
+### Download Reference
 
-Here is an example command:
+We will be using GRCh37 for this case study.
 
 ```bash
-sudo docker run \
-  -v "${DATA_DIR}":"/input" \
-  -v "${OUTPUT_DIR}:/output" \
-  google/deepvariant:"${BIN_VERSION}" \
-  /opt/deepvariant/bin/run_deepvariant \
-  --model_type=WES \
-  --ref="/input/${REF}" \
-  --reads="/input/${BAM}" \
-  --regions="/input/${CAPTURE_BED}" \
-  --output_vcf=/output/HG002.output.vcf.gz \
-  --output_gvcf=/output/HG002.output.g.vcf.gz \
-  --num_shards=${N_SHARDS}
+mkdir -p reference
+
+HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata
+
+curl ${HTTPDIR}/hs37d5.fa.gz | gunzip > reference/hs37d5.fa
+curl ${HTTPDIR}/hs37d5.fa.fai > reference/hs37d5.fa.fai
 ```
 
-By specifying `--model_type=WES`, you'll be using a model that is best suited
-for Illumina Whole Exome Sequencing data.
+### Download Genome in a Bottle Benchmarks
+
+We will benchmark our variant calls against v4.1 of the Genome in a Bottle small
+variant benchmarks for HG002.
 
-The script [run_wes_case_study_docker.sh] shows a full example of which data to
-download, and run DeepVariant with the data.
+```bash
+mkdir -p benchmark
 
-Before you run the script, you can read through all sections to understand the
-details. Here is a quick way to get the script and run it:
+FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37
+
+curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed
+curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz
+curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz.tbi > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz.tbi
+```
+
+### Download HG002 BAM
 
 ```bash
-curl https://raw.githubusercontent.com/google/deepvariant/r1.0/scripts/run_wes_case_study_docker.sh | bash
+mkdir -p input
+
+FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome
+BAM=151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup
+
+curl ${FTPDIR}/${BAM}.bam > input/${BAM}.bam
+curl ${FTPDIR}/${BAM}.bai > input/${BAM}.bai
 ```
 
-### Runtime
+### Download capture target BED file
 
-See
-[this page](deepvariant-details.md#commands-for-requesting-machines-used-in-case-studies)
-for the commands used to obtain different machine types on Google Cloud.
+According to the paper "[Extensive sequencing of seven human genomes to
+characterize benchmark reference
+materials](https://www.nature.com/articles/sdata201625)", the HG002 exome was
+generated with Agilent SureSelect. In this case study we'll use the SureSelect
+v5 BED (`agilent_sureselect_human_all_exon_v5_b37_targets.bed`). For evaluation,
+`hap.py` will intersect this BED with the GIAB confident regions.
 
-Step                               | Hardware | Wall time
----------------------------------- | -------- | ---------
-`make_examples`                    | 64 CPUs  | ~ 19m
-`call_variants`                    | 64 CPUs  | ~ 2m
-`postprocess_variants` (with gVCF) | 1 CPU    | ~ 1m
+```bash
+HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata
 
-In this example, `call_variants` does not take much time on 64 CPUs. Running
-with GPU might be unnecessary. You can read
-[case study on whole genome sequencing data] about the use of GPU. If you want
-to use GPU on the exome data in this case study, you can see
-[run_wes_case_study_docker_gpu.sh] shows a full example.
+curl ${HTTPDIR}/agilent_sureselect_human_all_exon_v5_b37_targets.bed > input/agilent_sureselect_human_all_exon_v5_b37_targets.bed
+```
 
-## Data description
 
-### BAM file:
+## Running on a CPU-only machine
 
-`151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam`
+```bash
+mkdir -p output
+mkdir -p output/intermediate_results_dir
 
-Downloaded from
-[https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015](https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015)
+BIN_VERSION=1.0.0
 
-### FASTA
+sudo docker run \
+  -v "${PWD}/input":"/input" \
+  -v "${PWD}/output":"/output" \
+  -v "${PWD}/reference":"/reference" \
+  google/deepvariant:"${BIN_VERSION}" \
+  /opt/deepvariant/bin/run_deepvariant \
+  --model_type WES \
+  --ref /reference/hs37d5.fa \
+  --reads /input/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam \
+  --regions /input/agilent_sureselect_human_all_exon_v5_b37_targets.bed \
+  --output_vcf /output/HG002.output.vcf.gz \
+  --output_gvcf /output/HG002.output.g.vcf.gz \
+  --num_shards $(nproc) \
+  --intermediate_results_dir /output/intermediate_results_dir
+```
 
-Same as described in the
-[case study for whole genome data](deepvariant-case-study.md#test_data)
+By specifying `--model_type WES`, you'll be using a model that is best suited
+for Illumina Whole Exome Sequencing data.
 
-### Truth VCF and BED
+`--intermediate_results_dir` flag is optional. By specifying it, the
+intermediate outputs of `make_examples` and `call_variants` stages can be found
+in the directory. After the command, you can find these files in the directory:
 
-`HG002_GRCh37_1_22_v4.1_draft_benchmark.*` are from NIST, as part of the
-[Genomes in a Bottle project](http://jimb.stanford.edu/giab/). They are
-downloaded from
-[ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37/](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37/)
+```
+call_variants_output.tfrecord.gz
+gvcf.tfrecord-?????-of-?????.gz
+make_examples.tfrecord-?????-of-?????.gz
+```
 
-### Capture target BED file
+For running on GPU machines, or using Singularity instead of Docker, see
+[Quick Start](deepvariant-quick-start.md).
 
-According to the paper "[Extensive sequencing of seven human genomes to
-characterize benchmark reference
-materials](https://www.nature.com/articles/sdata201625)", the HG002 exome was
-generated with Agilent SureSelect. In this case study we'll use the SureSelect
-v5 BED (`agilent_sureselect_human_all_exon_v5_b37_targets.bed`) and intersect it
-with the GIAB confident regions for evaluation.
+## Benchmark on chr20
 
-## Variant call quality
+```bash
+mkdir -p happy
 
-We used the `hap.py`
-([https://github.com/Illumina/hap.py](https://github.com/Illumina/hap.py))
-program from Illumina to evaluate the resulting vcf file. This serves as a check
-to ensure the three DeepVariant commands ran correctly and produced high-quality
-results.
+sudo docker pull pkrusche/hap.py
 
-We evaluate against the capture region:
+sudo docker run \
+  -v "${PWD}/benchmark":"/benchmark" \
+  -v "${PWD}/input":"/input" \
+  -v "${PWD}/output":"/output" \
+  -v "${PWD}/reference":"/reference" \
+  -v "${PWD}/happy:/happy" \
+  pkrusche/hap.py /opt/hap.py/bin/hap.py \
+  /benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz \
+  /output/HG002.output.vcf.gz \
+  -f /benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed \
+  -T /input/agilent_sureselect_human_all_exon_v5_b37_targets.bed \
+  -r /reference/hs37d5.fa \
+  -o /happy/happy.output \
+  --engine=vcfeval
+```
+
+Output:
 
-Type  | # TP  | # FN | # FP | Recall   | Precision | F1\_Score
------ | ----- | ---- | ---- | -------- | --------- | ---------
-INDEL | 2894  | 126  |  58  | 0.958278 | 0.980602  | 0.969312
-SNP   | 38173 | 403  | 130  | 0.989553 | 0.996610  | 0.993069
+```
+Benchmarking Summary:
+  Type Filter  TRUTH.TOTAL  TRUTH.TP  TRUTH.FN  QUERY.TOTAL  QUERY.FP  QUERY.UNK  FP.gt  METRIC.Recall  METRIC.Precision  METRIC.Frac_NA  METRIC.F1_Score  TRUTH.TOTAL.TiTv_ratio  QUERY.TOTAL.TiTv_ratio  TRUTH.TOTAL.het_hom_ratio  QUERY.TOTAL.het_hom_ratio
+ INDEL    ALL         3020      2896       124         3718        81        703     35       0.958940          0.973134        0.189080         0.965985                     NaN                     NaN                   1.325453                   1.368421
+ INDEL   PASS         3020      2896       124         3718        81        703     35       0.958940          0.973134        0.189080         0.965985                     NaN                     NaN                   1.325453                   1.368421
+   SNP    ALL        38576     38180       396        41157       116       2816     27       0.989735          0.996975        0.068421         0.993341                2.624273                2.568779                   1.548701                   1.563630
+   SNP   PASS        38576     38180       396        41157       116       2816     27       0.989735          0.996975        0.068421         0.993341                2.624273                2.568779                   1.548701                   1.563630
+```
 
-[specific machine type]: deepvariant-details.md#commands-for-requesting-machines-used-in-case-studies
-[install_nvidia_docker.sh]: ../scripts/install_nvidia_docker.sh
-[run_wes_case_study_docker.sh]: ../scripts/run_wes_case_study_docker.sh
-[run_wes_case_study_docker_gpu.sh]: ../scripts/run_wes_case_study_docker_gpu.sh
-[External Solutions]: https://github.com/google/deepvariant#external-solutions
 [case study on whole genome sequencing data]: deepvariant-case-study.md
diff --git a/docs/deepvariant-quick-start.md b/docs/deepvariant-quick-start.md
@@ -36,7 +36,7 @@ the [External Solutions] section.
 ### Get Docker image
 
 ```bash
-BIN_VERSION="rc1.0.0"
+BIN_VERSION="1.0.0"
 
 sudo apt -y update
 sudo apt-get -y install docker.io
@@ -132,10 +132,11 @@ sudo docker run \
   --regions "chr20:10,000,000-10,010,000" \
   --output_vcf=/output/output.vcf.gz \
   --output_gvcf=/output/output.g.vcf.gz \
+  --intermediate_results_dir /output/intermediate_results_dir \ **This flag is optional. Set to keep the intermediate results.
   --num_shards=1 \ **How many cores the `make_examples` step uses. Change it to the number of CPU cores you have.**
 ```
 
-This will generate 5 files in `${OUTPUT_DIR}`:
+This will generate 5 files and 1 directory in `${OUTPUT_DIR}`:
 
 ```bash
 ls -1 ${OUTPUT_DIR}
@@ -144,20 +145,31 @@ ls -1 ${OUTPUT_DIR}
 outputting:
 
 ```
+intermediate_results_dir
 output.g.vcf.gz
 output.g.vcf.gz.tbi
 output.vcf.gz
 output.vcf.gz.tbi
 output.visual_report.html
 ```
 
+The directory "intermediate_results_dir" exists because
+`--intermediate_results_dir /output/intermediate_results_dir` is specified. This
+directory contains the intermediate output of make_examples and call_variants
+steps.
+
 For more information about `output.visual_report.html`, see the
 [VCF stats report documentation](deepvariant-vcf-stats-report.md).
 
 ## Notes on GPU image
 
 If you are using GPUs, you can pull the GPU version, and make sure you run with
-`--gpus 1` (because DeepVariant currently can only make use of 1 GPU):
+`--gpus 1`. `call_variants` is the only step that uses the GPU, and can only use
+one at a time. `make_examples` and `postprocess_variants` do not run on GPU.
+
+For an example to install GPU driver and docker, see [install_nvidia_docker.sh].
+
+
 
 ```
 sudo docker run --gpus 1 \
@@ -186,6 +198,7 @@ singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
   --regions "chr20:10,000,000-10,010,000" \
   --output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
   --output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
+  --intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \ **Optional.
   --num_shards=1 \ **How many cores the `make_examples` step uses. Change it to the number of CPU cores you have.**
 ```
 
@@ -245,3 +258,4 @@ Benchmarking Summary:
 [Quick Start in r0.7]: https://github.com/google/deepvariant/blob/r0.7/docs/deepvariant-quick-start.md
 [VCF]: https://samtools.github.io/hts-specs/VCFv4.3.pdf
 [run_deepvariant.py]: ../scripts/run_deepvariant.py
+[install_nvidia_docker.sh]: ../scripts/install_nvidia_docker.sh