Skip to content

Commit

Permalink
Update WES case study and Quick Start.
Browse files Browse the repository at this point in the history
PiperOrigin-RevId: 329596948
  • Loading branch information
pichuan authored and copybara-github committed Sep 1, 2020
1 parent 23f2bc2 commit 5105fb4
Show file tree
Hide file tree
Showing 2 changed files with 122 additions and 83 deletions.
185 changes: 105 additions & 80 deletions docs/deepvariant-exome-case-study.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,116 +4,141 @@ Similar to the [case study on whole genome sequencing data], in this
study we describe applying DeepVariant to a real exome sample using a single
machine.

NOTE: This case study demonstrates an example of how to run DeepVariant
end-to-end on one machine. We report the runtime with [specific machine type]
for the sake of consistency in reporting run time. This is NOT the fastest or
cheapest configuration. For more scalable execution of DeepVariant see the
[External Solutions] section.
## Prepare environment

## Running DeepVariant with one command
### Tools

DeepVariant pipeline consists of 3 steps: `make_examples`, `call_variants`, and
`postprocess_variants`. You can now run DeepVariant with one command using the
`run_deepvariant` script.
[Docker](https://docs.docker.com/get-docker/) will be used to run DeepVariant
and [hap.py](https://github.com/illumina/hap.py),

### Running on a CPU-only machine
### Download Reference

Here is an example command:
We will be using GRCh37 for this case study.

```bash
sudo docker run \
-v "${DATA_DIR}":"/input" \
-v "${OUTPUT_DIR}:/output" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type=WES \
--ref="/input/${REF}" \
--reads="/input/${BAM}" \
--regions="/input/${CAPTURE_BED}" \
--output_vcf=/output/HG002.output.vcf.gz \
--output_gvcf=/output/HG002.output.g.vcf.gz \
--num_shards=${N_SHARDS}
mkdir -p reference

HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata

curl ${HTTPDIR}/hs37d5.fa.gz | gunzip > reference/hs37d5.fa
curl ${HTTPDIR}/hs37d5.fa.fai > reference/hs37d5.fa.fai
```

By specifying `--model_type=WES`, you'll be using a model that is best suited
for Illumina Whole Exome Sequencing data.
### Download Genome in a Bottle Benchmarks

We will benchmark our variant calls against v4.1 of the Genome in a Bottle small
variant benchmarks for HG002.

The script [run_wes_case_study_docker.sh] shows a full example of which data to
download, and run DeepVariant with the data.
```bash
mkdir -p benchmark

Before you run the script, you can read through all sections to understand the
details. Here is a quick way to get the script and run it:
FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37

curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed
curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz
curl ${FTPDIR}/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz.tbi > benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz.tbi
```

### Download HG002 BAM

```bash
curl https://raw.githubusercontent.com/google/deepvariant/r1.0/scripts/run_wes_case_study_docker.sh | bash
mkdir -p input

FTPDIR=ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/HG002_NA24385_son/OsloUniversityHospital_Exome
BAM=151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup

curl ${FTPDIR}/${BAM}.bam > input/${BAM}.bam
curl ${FTPDIR}/${BAM}.bai > input/${BAM}.bai
```

### Runtime
### Download capture target BED file

See
[this page](deepvariant-details.md#commands-for-requesting-machines-used-in-case-studies)
for the commands used to obtain different machine types on Google Cloud.
According to the paper "[Extensive sequencing of seven human genomes to
characterize benchmark reference
materials](https://www.nature.com/articles/sdata201625)", the HG002 exome was
generated with Agilent SureSelect. In this case study we'll use the SureSelect
v5 BED (`agilent_sureselect_human_all_exon_v5_b37_targets.bed`). For evaluation,
`hap.py` will intersect this BED with the GIAB confident regions.

Step | Hardware | Wall time
---------------------------------- | -------- | ---------
`make_examples` | 64 CPUs | ~ 19m
`call_variants` | 64 CPUs | ~ 2m
`postprocess_variants` (with gVCF) | 1 CPU | ~ 1m
```bash
HTTPDIR=https://storage.googleapis.com/deepvariant/exome-case-study-testdata

In this example, `call_variants` does not take much time on 64 CPUs. Running
with GPU might be unnecessary. You can read
[case study on whole genome sequencing data] about the use of GPU. If you want
to use GPU on the exome data in this case study, you can see
[run_wes_case_study_docker_gpu.sh] shows a full example.
curl ${HTTPDIR}/agilent_sureselect_human_all_exon_v5_b37_targets.bed > input/agilent_sureselect_human_all_exon_v5_b37_targets.bed
```

## Data description

### BAM file:
## Running on a CPU-only machine

`151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam`
```bash
mkdir -p output
mkdir -p output/intermediate_results_dir

Downloaded from
[https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015](https://github.com/genome-in-a-bottle/giab_data_indexes/blob/master/AshkenazimTrio/alignment.index.AJtrio_OsloUniversityHospital_IlluminaExome_bwamem_GRCh37_11252015)
BIN_VERSION=1.0.0

### FASTA
sudo docker run \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
google/deepvariant:"${BIN_VERSION}" \
/opt/deepvariant/bin/run_deepvariant \
--model_type WES \
--ref /reference/hs37d5.fa \
--reads /input/151002_7001448_0359_AC7F6GANXX_Sample_HG002-EEogPU_v02-KIT-Av5_AGATGTAC_L008.posiSrt.markDup.bam \
--regions /input/agilent_sureselect_human_all_exon_v5_b37_targets.bed \
--output_vcf /output/HG002.output.vcf.gz \
--output_gvcf /output/HG002.output.g.vcf.gz \
--num_shards $(nproc) \
--intermediate_results_dir /output/intermediate_results_dir
```

Same as described in the
[case study for whole genome data](deepvariant-case-study.md#test_data)
By specifying `--model_type WES`, you'll be using a model that is best suited
for Illumina Whole Exome Sequencing data.

### Truth VCF and BED
`--intermediate_results_dir` flag is optional. By specifying it, the
intermediate outputs of `make_examples` and `call_variants` stages can be found
in the directory. After the command, you can find these files in the directory:

`HG002_GRCh37_1_22_v4.1_draft_benchmark.*` are from NIST, as part of the
[Genomes in a Bottle project](http://jimb.stanford.edu/giab/). They are
downloaded from
[ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37/](ftp://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_v4.1_SmallVariantDraftBenchmark_12182019/GRCh37/)
```
call_variants_output.tfrecord.gz
gvcf.tfrecord-?????-of-?????.gz
make_examples.tfrecord-?????-of-?????.gz
```

### Capture target BED file
For running on GPU machines, or using Singularity instead of Docker, see
[Quick Start](deepvariant-quick-start.md).

According to the paper "[Extensive sequencing of seven human genomes to
characterize benchmark reference
materials](https://www.nature.com/articles/sdata201625)", the HG002 exome was
generated with Agilent SureSelect. In this case study we'll use the SureSelect
v5 BED (`agilent_sureselect_human_all_exon_v5_b37_targets.bed`) and intersect it
with the GIAB confident regions for evaluation.
## Benchmark on chr20

## Variant call quality
```bash
mkdir -p happy

We used the `hap.py`
([https://github.com/Illumina/hap.py](https://github.com/Illumina/hap.py))
program from Illumina to evaluate the resulting vcf file. This serves as a check
to ensure the three DeepVariant commands ran correctly and produced high-quality
results.
sudo docker pull pkrusche/hap.py

We evaluate against the capture region:
sudo docker run \
-v "${PWD}/benchmark":"/benchmark" \
-v "${PWD}/input":"/input" \
-v "${PWD}/output":"/output" \
-v "${PWD}/reference":"/reference" \
-v "${PWD}/happy:/happy" \
pkrusche/hap.py /opt/hap.py/bin/hap.py \
/benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.vcf.gz \
/output/HG002.output.vcf.gz \
-f /benchmark/HG002_GRCh37_1_22_v4.1_draft_benchmark.bed \
-T /input/agilent_sureselect_human_all_exon_v5_b37_targets.bed \
-r /reference/hs37d5.fa \
-o /happy/happy.output \
--engine=vcfeval
```

Output:

Type | # TP | # FN | # FP | Recall | Precision | F1\_Score
----- | ----- | ---- | ---- | -------- | --------- | ---------
INDEL | 2894 | 126 | 58 | 0.958278 | 0.980602 | 0.969312
SNP | 38173 | 403 | 130 | 0.989553 | 0.996610 | 0.993069
```
Benchmarking Summary:
Type Filter TRUTH.TOTAL TRUTH.TP TRUTH.FN QUERY.TOTAL QUERY.FP QUERY.UNK FP.gt METRIC.Recall METRIC.Precision METRIC.Frac_NA METRIC.F1_Score TRUTH.TOTAL.TiTv_ratio QUERY.TOTAL.TiTv_ratio TRUTH.TOTAL.het_hom_ratio QUERY.TOTAL.het_hom_ratio
INDEL ALL 3020 2896 124 3718 81 703 35 0.958940 0.973134 0.189080 0.965985 NaN NaN 1.325453 1.368421
INDEL PASS 3020 2896 124 3718 81 703 35 0.958940 0.973134 0.189080 0.965985 NaN NaN 1.325453 1.368421
SNP ALL 38576 38180 396 41157 116 2816 27 0.989735 0.996975 0.068421 0.993341 2.624273 2.568779 1.548701 1.563630
SNP PASS 38576 38180 396 41157 116 2816 27 0.989735 0.996975 0.068421 0.993341 2.624273 2.568779 1.548701 1.563630
```

[specific machine type]: deepvariant-details.md#commands-for-requesting-machines-used-in-case-studies
[install_nvidia_docker.sh]: ../scripts/install_nvidia_docker.sh
[run_wes_case_study_docker.sh]: ../scripts/run_wes_case_study_docker.sh
[run_wes_case_study_docker_gpu.sh]: ../scripts/run_wes_case_study_docker_gpu.sh
[External Solutions]: https://github.com/google/deepvariant#external-solutions
[case study on whole genome sequencing data]: deepvariant-case-study.md
20 changes: 17 additions & 3 deletions docs/deepvariant-quick-start.md
Original file line number Diff line number Diff line change
Expand Up @@ -36,7 +36,7 @@ the [External Solutions] section.
### Get Docker image

```bash
BIN_VERSION="rc1.0.0"
BIN_VERSION="1.0.0"

sudo apt -y update
sudo apt-get -y install docker.io
Expand Down Expand Up @@ -132,10 +132,11 @@ sudo docker run \
--regions "chr20:10,000,000-10,010,000" \
--output_vcf=/output/output.vcf.gz \
--output_gvcf=/output/output.g.vcf.gz \
--intermediate_results_dir /output/intermediate_results_dir \ **This flag is optional. Set to keep the intermediate results.
--num_shards=1 \ **How many cores the `make_examples` step uses. Change it to the number of CPU cores you have.**
```

This will generate 5 files in `${OUTPUT_DIR}`:
This will generate 5 files and 1 directory in `${OUTPUT_DIR}`:

```bash
ls -1 ${OUTPUT_DIR}
Expand All @@ -144,20 +145,31 @@ ls -1 ${OUTPUT_DIR}
outputting:

```
intermediate_results_dir
output.g.vcf.gz
output.g.vcf.gz.tbi
output.vcf.gz
output.vcf.gz.tbi
output.visual_report.html
```

The directory "intermediate_results_dir" exists because
`--intermediate_results_dir /output/intermediate_results_dir` is specified. This
directory contains the intermediate output of make_examples and call_variants
steps.

For more information about `output.visual_report.html`, see the
[VCF stats report documentation](deepvariant-vcf-stats-report.md).

## Notes on GPU image

If you are using GPUs, you can pull the GPU version, and make sure you run with
`--gpus 1` (because DeepVariant currently can only make use of 1 GPU):
`--gpus 1`. `call_variants` is the only step that uses the GPU, and can only use
one at a time. `make_examples` and `postprocess_variants` do not run on GPU.

For an example to install GPU driver and docker, see [install_nvidia_docker.sh].



```
sudo docker run --gpus 1 \
Expand Down Expand Up @@ -186,6 +198,7 @@ singularity run -B /usr/lib/locale/:/usr/lib/locale/ \
--regions "chr20:10,000,000-10,010,000" \
--output_vcf="${OUTPUT_DIR}"/output.vcf.gz \
--output_gvcf="${OUTPUT_DIR}"/output.g.vcf.gz \
--intermediate_results_dir "${OUTPUT_DIR}/intermediate_results_dir" \ **Optional.
--num_shards=1 \ **How many cores the `make_examples` step uses. Change it to the number of CPU cores you have.**
```

Expand Down Expand Up @@ -245,3 +258,4 @@ Benchmarking Summary:
[Quick Start in r0.7]: https://github.com/google/deepvariant/blob/r0.7/docs/deepvariant-quick-start.md
[VCF]: https://samtools.github.io/hts-specs/VCFv4.3.pdf
[run_deepvariant.py]: ../scripts/run_deepvariant.py
[install_nvidia_docker.sh]: ../scripts/install_nvidia_docker.sh

0 comments on commit 5105fb4

Please sign in to comment.