Lk wgs methods doc (broadinstitute#247)

Added Methods doc to WGS pipeline
sickler-alex · Mar 4, 2021 · 1aefcc4
1 parent 8c96457
Showing 3 changed files with 33 additions and 1 deletion.
diff --git a/website/docs/.vuepress/config.js b/website/docs/.vuepress/config.js
@@ -163,7 +163,8 @@ module.exports = {
               title: "Whole Genome Germline Single Sample Pipeline",
               collapsable: true,
               children: [
-                "Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/"
+                "Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/",
+                "Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/wgs.methods"                
               ],
             },
           ],

diff --git a/.../documentation/Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/README.md b/.../documentation/Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/README.md
@@ -47,6 +47,11 @@ The Whole Genome Germline Single Sample [workflow](https://github.com/broadinsti
 
 Learn more about the software tools implemented in these tasks by reading the GATK [data pre-processing](https://gatk.broadinstitute.org/hc/en-us/articles/360035535912) and [germline short variant discovery](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932) documentation.
 
+:::tip Want to use the Whole Genome Germline Single Sample workflow in your publication?
+Check out the workflow [Methods](./wgs.methods.md) to get started!
+:::  
+
+
 ## Workflow Outputs
 
 * CRAM, CRAM index, and CRAM MD5

diff --git a/...mentation/Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/wgs.methods.md b/...mentation/Pipelines/Whole_Genome_Germline_Single_Sample_Pipeline/wgs.methods.md
@@ -0,0 +1,26 @@
+# Whole Genome Germline Single Sample v2.2.0 Methods
+The following is a detailed methods description of the pipeline’s processes, software, and tools. It can be adapted for a publication's methods section.
+
+## Detailed Methods
+
+Preprocessing and variant calling were performed with the WholeGenomeGermlineSingleSample 2.2.0 pipeline, which uses Picard 2.23.8, GATK 4.1.8, and Samtools 1.11 with default tool parameters unless otherwise specified. All reference files are available in the public [Broad References Google Bucket](https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0). The pipeline follows GATK Best Practices as previously described ([Van der Auwera et al., 2013](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4243306/)) as well as the Functional Equivalence Specification ([Regier et al., 2018](https://www.nature.com/articles/s41467-018-06159-4)).
+
+### Pre-processing and QC
+Whole genome paired-end reads in unmapped BAM (uBAM) format were first scattered so that QC and alignment could run in parallel. Quality metrics were calculated using Picard CollectQualityYieldMetrics. uBAMs were converted to FASTQ using Picard SamToFastq and aligned to the hg38 reference genome using BWA mem 0.7.15, with the batch size set using -K 100000000. Metadata from the uBAMs was then merged with the aligned BAMs using Picard MergeBamAlignment with the parameters --SORT_ORDER="unsorted", which keeps the data grouped by read name for efficient downstream marking of duplicates, and --UNMAP_CONTAMINANT_READS=true, to remove cross-species contamination.
+
+QC metrics (base distribution by cycle, insert size metrics, mean quality by cycle, and quality score distribution) were collected for the aligned, unsorted read groups using Picard CollectMultipleMetrics. The read-group-specific aligned BAMs were then aggregated, and duplicate reads were flagged using Picard MarkDuplicates, assuming queryname-sorted order, with the parameter --OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500, which is appropriate for patterned flowcells.
+
+The aggregated BAM file was then sorted using Picard SortSam with coordinate sort order. The fingerprints of separate read groups were verified using Picard CrosscheckFingerprints with a LOD threshold of -20. Cross-sample contamination was checked using verifyBamID2. 
+
+The aligned BAM was then scattered for parallelization during base recalibration. A Base Quality Score Recalibration (BQSR) table was created with GATK BaseRecalibrator using the original base qualities (stored under the OQ SAM tag). The model was applied using ApplyBQSR with the static-quantized-quals option, in accordance with the [Functional Equivalence Specification](https://github.com/CCDG/Pipeline-Standardization/blob/master/PipelineStandard.md) ([Regier et al., 2018](https://www.nature.com/articles/s41467-018-06159-4)). Recalibrated BAM files were then merged using Picard GatherSortedBamFiles.
+
+QC metrics were calculated for the base-recalibrated BAM using Picard CollectMultipleMetrics. Fingerprints were verified using Picard CheckFingerprint, and the calculated summary metrics were checked for high duplication levels and chimerism.
+
+To evaluate the coverage and performance of the whole genome sequencing experiment, the BAM was assessed using the Picard tools CollectWgsMetrics and CollectRawWgsMetrics.
+
+The final base-recalibrated BAM was converted to CRAM using Samtools view and validated using Picard ValidateSamFile.
+
+### Variant Calling
+Prior to variant calling, the variant calling interval list was split into subsets to enable parallelization. Using the GATK PrintReads tool, the WellformedReadFilter was applied to reads. Variant calling was then performed on reads that passed the filter using GATK (v3.5) HaplotypeCaller with the parameters --max_alternate_alleles 3 (sufficient for human data), --ERC GVCF, and --read_filter OverclippedRead (to reduce false positives resulting from contamination). The resulting GVCFs were merged using Picard MergeVcfs, and the final VCF file was validated using GATK ValidateVariants. Variant metrics were calculated using Picard CollectVariantCallingMetrics.
+
+The pipeline’s final outputs included metrics, validation reports, an aligned CRAM with index, and a GVCF containing variant calls with an accompanying index.
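
When adapting the methods text for a publication, it may help to keep the non-default tool parameters in one place so they can be checked against the prose. The following is a minimal sketch in Python, not part of the commit or the pipeline: the dictionary layout and the `summarize_parameters` helper are hypothetical, and only flags spelled out verbatim in the methods text are included (options mentioned without an explicit flag, such as the CrosscheckFingerprints LOD threshold, are left out).

```python
# Non-default parameters copied verbatim from the methods text.
# The structure and names below are illustrative assumptions only.
NON_DEFAULT_PARAMETERS = {
    "BWA mem": ["-K 100000000"],
    "Picard MergeBamAlignment": [
        '--SORT_ORDER="unsorted"',
        "--UNMAP_CONTAMINANT_READS=true",
    ],
    "Picard MarkDuplicates": ["--OPTICAL_DUPLICATE_PIXEL_DISTANCE=2500"],
    "GATK HaplotypeCaller": [
        "--max_alternate_alleles 3",
        "--ERC GVCF",
        "--read_filter OverclippedRead",
    ],
}


def summarize_parameters(parameters):
    """Return one 'tool: flag, flag' line per tool, in insertion order."""
    return [f"{tool}: {', '.join(flags)}" for tool, flags in parameters.items()]


if __name__ == "__main__":
    # Print a compact list suitable for a supplementary table or checklist.
    for line in summarize_parameters(NON_DEFAULT_PARAMETERS):
        print(line)
```

Each line of the output pairs a tool with the flags named for it in the methods description, which makes it easier to spot a parameter that was dropped or altered while editing the text.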