Skip to content

Commit

Permalink
finished results
Browse files Browse the repository at this point in the history
  • Loading branch information
lh3 committed Aug 7, 2021
1 parent 8a1d52b commit e37f5ff
Show file tree
Hide file tree
Showing 2 changed files with 83 additions and 19 deletions.
18 changes: 18 additions & 0 deletions tex/minimap2.bib
Original file line number Diff line number Diff line change
Expand Up @@ -431,3 +431,21 @@ @article{Zook:2020aa
Title = {A robust benchmark for detection of germline large deletions and insertions},
Volume = {38},
Year = {2020}}

@article{Harpak:2017aa,
Author = {Harpak, Arbel and others},
Journal = {Proc Natl Acad Sci U S A},
Pages = {12779-12784},
Title = {Frequent nonallelic gene conversion on the human lineage and its effect on the divergence of gene duplicates},
Volume = {114},
Year = {2017}}

@article{Li:2018aa,
Author = {Li, Heng and others},
Journal = {Nat Methods},
Month = {Aug},
Number = {8},
Pages = {595-597},
Title = {A synthetic-diploid benchmark for accurate variant-calling evaluation},
Volume = {15},
Year = {2018}}
84 changes: 65 additions & 19 deletions tex/mm2-update.tex
Original file line number Diff line number Diff line change
Expand Up @@ -41,18 +41,18 @@ \section{Contact:} [email protected]
\end{abstract}

\section{Introduction}
Minimap2~\citep{Li:2018ab} is a widely used aligner for maping long sequence
reads and assembly contigs. \citet{Jain:2020aa} found minimap2 occasionally
Minimap2~\citep{Li:2018ab} is widely used for maping long sequence
reads and assembly contigs. \citet{Jain:2020aa} found minimap2 v2.18 or earlier occasionally
misaligned reads from highly repetitive regions as minimap2 ignored seeds of
high occurrence. They also noticed minimap2 may misplace reads with structural
variations (SVs) in such regions~\citep{Jain2020.11.01.363887}. These
misalignments have become a pressing issue in the advent of
temolere-to-telomore human assembly~\citep{Miga:2020aa}. Meanwhile, minimap2
temolere-to-telomore human assembly~\citep{Miga:2020aa}. Meanwhile, old minimap2
was unable to efficiently align long insertions/deletions (INDELs) and often
breaks an alignment around variable-number tandem repeats (VNTRs). This has
inspired new chaining algorithms~\citep{Li:2020aa,Ren:2021aa} which are not
integrated into minimap2. Here we will describe recent improvements implemented
in v2.19 through v2.22.
integrated into minimap2. Here we will describe recent efforts implemented
in v2.19 through v2.22 to improve mapping results.

\begin{methods}
\section{Methods}
Expand All @@ -66,21 +66,22 @@ \subsection{Rescuing high-occurrence $k$-mers}

To resolve this issue, we implemented a new heuristic to add additional
minimizers. Suppose we are looking at two adjacent low-occurence $k$-mers
located at position $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge500$, the
new minimap2 adds $\lfloor|x_1-x_2|/500\rfloor$ minimizers among
high-occurrence minimizers between $x_1$ and $x_2$. We use a binary heap data
located at position $x_1$ and $x_2$, respectively. If $|x_1-x_2|\ge500$,
minimap2 v2.22 additionally selects $\lfloor|x_1-x_2|/500\rfloor$ minimizers
of the lowest occurrence among minimizers between $x_1$ and $x_2$.
We use a binary heap data
structure to select minimizers of the lowest occurrence in this interval.
This strategy adds necessary anchors at the cost of increasing total alignment
time by a few percent on real data.

\subsection{Aligning through longer INDELs}
The original minimap2 may fail to align long INDELs due to its chaining
heuristics. Briefly, minimap2 applies dynamic programming (DP) to chain
minimizer anchors. It is a quadratic algorithm, which is slow for chaining
minimizer anchors. This is a quadratic algorithm, which is slow for chaining
contigs. For acceptable performance, the original minimap2 uses a 500bp band by
default. If there is an INDEL longer than 500bp and the two chains around it
default. If there is an INDEL longer than 500bp and the two chains around the INDEL
have no overlaps on either the query or the reference sequence, minimap2 may
join the two short chains later as a postprocessing step. We call it the
join the two short chains later at a later step. We call it the
long-join heuristic. This heuristic may fail around VNTRs because short chains
often have overlaps in VNTRs. More subtly, minimap2 may escape the inner DP
loop early, again for performance, if the chaining result is not improved for
Expand All @@ -91,10 +92,10 @@ \subsection{Aligning through longer INDELs}
In minigraph~\citep{Li:2020aa}, we developed a new chaining algorithm that
finds short INDELs with DP-based chaining and goes through long INDELs with a
subquadratic algorithm~\citep{DBLP:conf/wabi/AbouelhodaO03}. We ported the same
algorithm to minimap2 for contig mapping. For read mapping, the minigraph
algorithm is slower. The updated minimap2 still uses the DP-based algorithm to
find short chains and then uses the minigraph algorithm to rechain anchors in
these short chains. The rechaining steps achieves the same goal as long-join
algorithm to minimap2 for contig mapping. For long-read mapping, the minigraph
algorithm is slower. Minimap2 v2.22 now still uses the DP-based algorithm to
find short chains and then invokes the minigraph algorithm to rechain anchors in
these short chains. The rechaining step achieves the same goal as long-join
but is more reliable as it can resolve overlaps between short chains. The old
long-join heuristic has since been removed.

Expand All @@ -107,7 +108,7 @@ \subsection{Properly mapping long reads with SVs}

In our view, this problem is rooted in impropriate scoring: affine-gap penalty
over-penalizes a long INDEL that was often evolutionarily created in one event.
We should not penalize a SV linearly in its length. The new minimap2 rescores
We should not penalize a SV linearly in its length. Minimap2 v2.22 rescores
an alignment with the following scoring function. Suppose an alignment consists
of $M$ matching bases, $N$ substitutions and $G$ gap opens, we empirically
score the alignment with
Expand Down Expand Up @@ -139,7 +140,7 @@ \section{Results}

\begin{table}
\processtable{Evaluation of minimap2 v2.22}
{\footnotesize\begin{tabular}{p{4.2cm}rrrr}
{\footnotesize\label{tab:1}\begin{tabular}{p{4.2cm}rrrr}
\toprule
$[$Benchmark$]$ Metric & v2.22 & v2.18 & Winno & lra \\
\midrule
Expand All @@ -165,14 +166,59 @@ \section{Results}
reads at 30 folds with the same pbsim2 command line. SVs were called with
``sniffles -q 10''~\citep{Sedlazeck:2018ab} and compared to the simulated truth with ``SURVIVOR eval
call.vcf truth.bed 50''. In $[$real-sv-1k$]$, small and long variants were
called by dipcall-0.3 for HG002 assemblies (AC: GCA\_018852605.1 and
called by dipcall-0.3~\citep{Li:2018aa} for HG002 assemblies (AC: GCA\_018852605.1 and
GCA\_018852615.1) and compared to the GIAB truth~\citep{Zook:2020aa} using ``truvari -r 2000 -s
1000 -S 400 -{}-multimatch -{}-passonly'' which sets the minimum INDEL size to 1kb in evaluation. }
\end{table}

We evaluated minimap2 v2.22 along with v2.18, Winnowmap2 v2.03 and lra v1.3.2
(Table~\ref{tab:1}). Both versions of minimap2 achieved high mapping accuracy on
simulated Nanopore reads (sim-map). Winnowmap2 aligned more reads at mapping
quality 10 or higher (mapQ10). However, it may occasionally assign a high mapping
quality to a read with multiple identical best alignments. This reduced its
mapping accuracy.

In lack of groud truth for real data, so we took Winnowmap2 mapping as ground
truth to evaluate other mappers (winno-cmp). Out of 1,378,092 reads with mapQ10
alignments by Winnowmap2, minimap2 v2.22 could map all of them. 118 reads, less
than 0.01\% of all reads, were mapped differently by v2.22. 51 of them have
multiple identical best alignments. We believe these are more likely to be
Winnowmap2 errors.

The two benchmarks above only evaluate read mappings without variations.
To measure the mapping accuracy in the presence of SVs (sim-sv), we reproduced
the results by~\citep{Jain2020.11.01.363887}. Minimap2 v2.22 is as good as
Winnowmap2 now. Note that we were setting the Sniffles mapping quality
threshold to 10 in consistent with the benchmarks above. If we used the
default threshold 20, v2.22 would miss additional 0.5\% SVs, suggesting
minimap2 v2.22 could map variant reads correctly but with conservative mapping
quality. This observation is more about the interaction between mappers and
callers. Furthermore, the simulation here only considers a simple scenario in
evolution. Non-allelic gene conversions, which happen often in segmental
duplications~\citep{Harpak:2017aa}, would obscure the optimal mapping
strategies. How much such simple SV simulation informs real-world SV calling
remains a question.

To see if minimap2 v2.22 could improve long INDEL alignment, we ran dipcall on
contig-to-reference alignments and focused on INDELs longer than 1kb
(real-sv-1k). v2.22 is more sensitive at comparable specificity, confirming its
advantage in more contiguous alignment. lra is supposed to handle long INDELs
better, too. However, we could not get lra to work well with dipcall, so did
not report the numbers.

Minimap2 spends most computing time on base alignment. As recent improvements
in v2.22 do not change the base alignment algorithm, the new version has similar
performance to older verions. Minimap2 is consistently faster than Winnowmap2
by several times.


\section*{Acknowledgements}
We thank Arang Rhie and Chirag Jain for providing motivating examples where
older minimap2 underperforms.

\paragraph{Funding\textcolon} This work is funded by NHGRI grant R01HG010040.

\paragraph{Funding\textcolon} NHGRI R01HG010040
~\\*

\bibliography{minimap2}

Expand Down

0 comments on commit e37f5ff

Please sign in to comment.