improved doc and don't type HLA by default
lh3 committed Dec 21, 2014
1 parent 5aba188 commit c05a721
Showing 4 changed files with 62 additions and 28 deletions.
11 changes: 6 additions & 5 deletions NEWS.md
@@ -1,17 +1,18 @@
Release 0.7.11 (XX November, 2014)
-----------------------------------
Release 0.7.11 (XX December, 2014)
----------------------------------

A major change to BWA-MEM is the support of mapping to ALT contigs in addition
to the primary assembly. Part of the ALT mapping strategy is implemented in
BWA-MEM and the rest in a postprocessing script for now. Due to the extra
layer of complexity on generating the reference genome and on the two-step
mapping, we provide a wrapper script and precompiled binaries starting with
this release. Please check README-alt.md for details.
this release. The package may be more convenient for some specific use cases.
For general use, the single BWA binary still works as before.

Another major addition to BWA-MEM is HLA typing, which is made possible with the
new ALT mapping strategy. Necessary data and programs are included in the
binary release. The wrapper script also performs HLA typing when HLA genes are
also included in the reference genome as additional ALT contigs.
included in the reference genome as additional ALT contigs.

Other notable changes to BWA-MEM:

@@ -44,7 +45,7 @@ Other notable changes to BWA-MEM:
writing SAM. This saves significant wall-clock time when reading from
or writing to a slow Unix pipe.

(0.7.11: XX November 2014, rXXX)
(0.7.11: XX December 2014, r10XX)



20 changes: 10 additions & 10 deletions README-alt.md
@@ -5,10 +5,10 @@
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| bzip2 -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
bwa.kit/bwa index hs38d6.fa # create BWA index
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
bwa.kit/bwa index hs38D1.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
```
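As the comment on the mapping step notes, dropping `| sh` makes `run-bwamem` print the generated command lines instead of executing them. A minimal sketch of that dry-run workflow follows. It assumes `bwa.kit` has been unpacked into the current directory as in the block above; the file name `out.cmds.sh` is our own choice, not something bwakit produces.

```sh
kit=bwa.kit   # directory unpacked by the download step (assumed to be in the current directory)

if [ -x "$kit/run-bwamem" ]; then
    # Without "| sh", run-bwamem only writes the command lines to stdout.
    "$kit/run-bwamem" -o out -H hs38D1.fa read1.fq read2.fq > out.cmds.sh
    cat out.cmds.sh          # review the pipeline before running it
    # sh out.cmds.sh         # uncomment to execute the reviewed commands
else
    echo "$kit/run-bwamem not found; run the download step first" >&2
fi
```

Saving the commands to a file keeps a record of exactly what was run for a given sample.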

This generates `out.aln.bam` as the final alignment, `out.hla.top` for best HLA
@@ -94,11 +94,11 @@ CHM1 short reads and present also in NA12878. You can try [BLAT][blat] or

For a more complete reference genome, we compiled a new set of decoy sequences
from GenBank clones and the de novo assembly of 254 public [SGDP][sgdp] samples.
The sequences are included in `hs38d6-extra.fa` from the [BWA binary
The sequences are included in `hs38D1-extra.fa` from the [BWA binary
package][res].

In addition to decoy, we also put multiple alleles of HLA genes in
`hs38d6-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
`hs38D1-extra.fa`. These genomic sequences were acquired from [IMGT/HLA][hladb],
version 3.18.0 and are used to collect reads sequenced from these genes.

### HLA typing
@@ -125,26 +125,26 @@ most of them are distributed under restrictive licenses.

To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878
unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy
(hs38d6), and called small variants from the alignment. CHM1 is haploid.
(hs38D1), and called small variants from the alignment. CHM1 is haploid.
Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The
true positive (TP) heterozygous calls from NA12878 are approximately equal
to the difference between NA12878 and CHM1 heterozygous calls. A better assembly
should yield higher TP and lower FP. The following table shows the numbers for
these assemblies:

|Assembly|hs37 |hs38 |hs38d6|CHM1_1.1| huref|
|Assembly|hs37 |hs38 |hs38D1|CHM1_1.1| huref|
|:------:|------:|------:|------:|------:|------:|
|FP | 255706| 168068| 142516|307172 | 575634|
|TP |2142260|2163113|2150844|2167235|2137053|

With this measurement, hs38 is clearly better than hs37. Genome hs38d6 reduces
With this measurement, hs38 is clearly better than hs37. Genome hs38D1 reduces
FP by ~25k but also reduces TP by ~12k. We manually inspected variants called
from hs38 only and found the majority of them are associated with excessive read
depth, clustered variants or weak alignment. We believe most hs38-only calls are
problematic. In addition, if we compare two NA12878 replicates from HiSeq X10
with nearly identical library construction, the difference is ~140k, an order
of magnitude higher than the difference between hs38 and hs38d6. ALT contigs,
decoy and HLA genes in hs38d6 improve variant calling and enable the analyses of
of magnitude higher than the difference between hs38 and hs38D1. ALT contigs,
decoy and HLA genes in hs38D1 improve variant calling and enable the analyses of
ALT contigs and HLA typing at little cost.
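
The ~25k and ~12k figures quoted above follow directly from the hs38 and hs38D1 columns of the table. A quick shell check, using only those four numbers (the variable names are ours):

```sh
# FP and TP counts copied from the hs38 and hs38D1 columns of the table.
fp_hs38=168068;  fp_hs38D1=142516
tp_hs38=2163113; tp_hs38D1=2150844

fp_drop=$((fp_hs38 - fp_hs38D1))   # false positives removed by hs38D1
tp_drop=$((tp_hs38 - tp_hs38D1))   # true positives lost with hs38D1

echo "FP reduced by $fp_drop, TP reduced by $tp_drop"
```

This prints `FP reduced by 25552, TP reduced by 12269`, matching the rounded values in the text.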

## Problems and Future Development
32 changes: 28 additions & 4 deletions bwakit/README.md
@@ -21,10 +21,10 @@ how to use bwakit:
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.11_x64-linux.tar.bz2/download \
| bzip2 -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38d6 # download GRCh38 and write hs38d6.fa
bwa.kit/bwa index hs38d6.fa # create BWA index
bwa.kit/run-gen-ref hs38D1 # download GRCh38 and write hs38D1.fa
bwa.kit/bwa index hs38D1.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out hs38d6.fa read1.fq read2.fq | sh
bwa.kit/run-bwamem -o out -H hs38D1.fa read1.fq read2.fq | sh
```

The last mapping command line will generate the following files:
@@ -44,7 +44,31 @@ Bwakit can be [downloaded here][res]. It is only available to x86_64-linux. The
scripts in the package are available in the [bwa/bwakit][kit] directory.
Packaging is done manually for now.

## Contents
## Limitations

* HLA typing only works for high-coverage human data. The typing accuracy can
still be improved. We encourage researchers to develop better HLA typing tools
based on the intermediate output of bwakit (for each HLA gene included in the
index, bwakit writes all reads matching it in a separate file).

* Duplicate marking only works when all reads from a single paired-end library
are provided as the input. This limitation is a necessary tradeoff for the fast
duplicate marking provided by samblaster.

* The adapter trimmer is chosen as it is fast, pipe friendly and does not
discard reads. However, it is conservative and suboptimal. If this is a
concern, it is recommended to preprocess input reads with a more sophisticated
adapter trimmer. We also hope existing trimmers can be modified to operate on
an interleaved FASTQ stream. We will replace trimadap once a better trimmer
meets our needs.

* Bwakit can be memory demanding, depending on the functionality invoked. For 30X
human data, bwa-mem takes about 6GB RAM, samblaster uses close to 10GB and BAM
shuffling (if the input is sorted BAM) uses several GB. In the current
setting, sorting uses about 10GB.


## Package Contents
```
bwa.kit
|-- README.md This README file.
27 changes: 18 additions & 9 deletions bwakit/run-bwamem
@@ -18,29 +18,38 @@ Options: -o STR prefix for output files [inferred from
ont2d: Oxford Nanopore reads (~10kb query, higher error rate)
-t INT number of threads [1]
-H apply HLA typing
-a trim HiSeq2000/2500 PE resequencing adapters (via trimadap)
-d mark duplicate (via samblaster)
-S for SAM/BAM input, don\'t shuffle
-s sort the output alignment (requiring more RAM)
-H skip HLA typing
-S for BAM input, don\'t shuffle
-s sort the output alignment (via samtools; requiring more RAM)
-k keep temporary files generated by typeHLA
Examples:
* Map paired-end reads to GRCh38+ALT+decoy+HLA and perform HLA typing:
run-bwamem -o prefix -t8 -R"@RG\tID:foo\tSM:bar" hs38d6.fa read1.fq.gz read2.fq.gz
run-bwamem -o prefix -t8 -HR"@RG\tID:foo\tSM:bar" hs38D1.fa read1.fq.gz read2.fq.gz
Note: HLA typing is only effective for high-coverage data. The typing accuracy varies
with the quality of input. It is only intended for research purposes, not for diagnostics.
* Remap coordinate-sorted BAM, transfer read groups tags, trim Illumina PE adapters and
sort the output. The BAM may contain single-end or paired-end reads, or a mixture of
the two types. Specifying -R stops read group transfer.
run-bwamem -sao prefix hs38d6.fa old-srt.bam
run-bwamem -sao prefix hs38D1.fa old-srt.bam
Note: the adapter trimmer included in bwa.kit is chosen because it fits the current
mapping pipeline better. It is conservative and suboptimal. A more sophisticated
trimmer is recommended if this becomes a concern.
* Remap name-grouped BAM and mark duplicates. Note that in this case, all reads from
a single library should be aligned at the same time. Paired-end only.
* Remap name-grouped BAM and mark duplicates:
run-bwamem -Sdo prefix hs38D1.fa old-unsrt.bam
run-bwamem -Sdo prefix hs38d6.fa old-unsrt.bam
Note: streamed duplicate marking requires all reads from a single paired-end library
to be aligned at the same time.
Output files:
@@ -156,7 +165,7 @@ if (-f "$ARGV[0].alt") {
my $t_sort = $opts{t} < 4? $opts{t} : 4;
$cmd .= defined($opts{s})? " | $root/samtools sort -@ $t_sort -m1G - $prefix.aln;\n" : " | $root/samtools view -1 - > $prefix.aln.bam;\n";

if ($has_hla && !defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
if ($has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')) {
$cmd .= "$root/run-HLA ". (defined($opts{x}) && $opts{x} eq 'intractg'? "-A " : "") . "$prefix.hla > $prefix.hla.top 2> $prefix.log.hla;\n";
$cmd .= "cat $prefix.hla.HLA*.gt | grep ^GT | cut -f2- > $prefix.hla.all;\n";
$cmd .= "rm -f $prefix.hla.HLA*;\n" unless defined($opts{k});
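
The condition changed in this hunk can be restated as a small shell function. This is only a sketch of the new logic and is not part of bwakit: `should_type_hla` and its three positional arguments are hypothetical names standing in for `$has_hla`, `$opts{H}` and `$opts{x}`. One simplification is assumed: an empty third argument means `-x` was not given.

```sh
# Hypothetical helper mirroring the new Perl condition:
#   $has_hla && defined($opts{H}) && (!defined($opts{x}) || $opts{x} eq 'intractg')
# $1: 1 if $has_hla is true, else 0
# $2: 1 if -H was given, else 0
# $3: value of -x, or empty if -x was not given (simplifying assumption)
should_type_hla() {
    [ "$1" = 1 ] && [ "$2" = 1 ] && { [ -z "$3" ] || [ "$3" = intractg ]; }
}

should_type_hla 1 1 ""       && echo "typing:  -H given, no -x"
should_type_hla 1 1 intractg && echo "typing:  -H given, -x intractg"
should_type_hla 1 0 ""       || echo "skipped: -H not given (the new default)"
should_type_hla 1 1 ont2d    || echo "skipped: -x ont2d"
```

Before this commit the test was `!defined($opts{H})`, so HLA typing ran unless `-H` was passed; now it runs only when `-H` is passed.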