\documentclass{bioinfo}
\copyrightyear{2012}
\pubyear{2012}
\usepackage[ruled,vlined]{algorithm2e}
\usepackage{natbib}
\bibliographystyle{apalike}
\DeclareMathOperator*{\argmax}{argmax}
\begin{document}
\firstpage{1}
\title[Variant calling from de novo assembly]{Exploring single-sample SNP and INDEL calling with whole-genome de novo assembly}
\author[Li]{Heng Li$^{1,}$\footnote{to whom correspondence should be addressed}}
\address{$^1$Broad Institute, 7 Cambridge Center, Cambridge, MA 02142, USA}
\history{Received on XXXXX; revised on XXXXX; accepted on XXXXX}
\editor{Associate Editor: XXXXXXX}
\maketitle
\begin{abstract}
\section{Motivation:}
Eugene Myers in his string graph paper~\citep{Myers:2005bh} suggested that in a
string graph or equivalently a unitig graph, any path spells a valid assembly.
As a string/unitig graph also encodes every valid assembly of reads, such a
graph, provided that it can be constructed correctly, is in fact a lossless
representation of reads. In principle, every analysis based on whole-genome
shotgun sequencing (WGS) data, such as SNP and insertion/deletion (INDEL)
calling, can also be achieved with unitigs.
\section{Results:}
To explore the feasibility of using de novo assembly in the context of
resequencing, we developed a de novo assembler, \emph{fermi}, that assembles
Illumina short reads into unitigs while preserving most of the information in the
input reads. SNPs and INDELs can be called by mapping the unitigs against a
reference genome. By applying the method on 35-fold human resequencing data,
we showed that in comparison to the standard pipeline, our approach yields
similar accuracy for SNP calling and better results for INDEL calling. It has
higher sensitivity than other de novo assembly based methods for variant
calling. Our work suggests that variant calling with de novo assembly can be a
beneficial complement to the standard variant calling pipeline for whole-genome
resequencing. On the methodological side, we propose the FMD-index for
forward-backward extension of DNA sequences, a fast algorithm for finding
all super-maximal exact matches, and one-pass construction of unitigs
from an FMD-index.
\section{Availability:} \href{http://github.com/lh3/fermi}{http://github.com/lh3/fermi}
\section{Contact:} [email protected]
\end{abstract}
\vspace*{-1em}
\section{INTRODUCTION}
The rapidly decreasing sequencing cost has enabled whole-genome shotgun (WGS)
resequencing at an affordable price. Many software packages have been developed
to call variants, including SNPs, short insertions and deletions (INDELs) and
structural variations (SVs), from WGS data. At present, the standard approach to
variant calling is to map raw sequence reads against a reference genome and then
to detect differences from the reference. It is well established and has been
proved to work from a single sample to thousands of
samples~\citep{1000-Genomes-Project-Consortium:2010qc}. Nonetheless, a
fundamental flaw in this mapping based approach is that mapping algorithms
ignore the correlation between sequence reads. They are unable to take full
advantage of the data and may produce inconsistent outputs that complicate variant
calling. This flaw has gradually attracted the attention of various research
groups who subsequently proposed several methods to alleviate the effect,
including post alignment
filtering~\citep{Ossowski:2008if}, iterative mapping~\citep{Manske:2009ve},
read realignment~\citep{Albers:2010ud,Homer:2010kx,Li:2011kx,Depristo:2011vn}
and local assembly~\citep{Carnevali:2011fk}. However, because these methods
still rely on the initial mapping, it is difficult for them to identify and
recover mismapped or unmapped reads due to high sequence divergence, long
insertions, SVs, copy number changes or misassemblies of the reference genome.
They do not solve the problem at its root.
Another distinct approach to variant calling that fundamentally avoids the flaw of the
mapping based approach is to assemble sequence reads into contigs and to
discover variants via assembly-to-assembly alignment. It was probably more
widely used in the era of capillary sequencing. The assembly based method
fell out of favor around 2008 due to the great difficulty of assembling 25bp
reads, but with longer paired-end reads and improved methodology,
de novo assembly has re-emerged as the preferred choice for variant discovery
between small genomes.
For variant discovery between human genomes, however, the assembly based
approach has not attracted much attention. Assembling a human genome is far
more challenging than assembling a bacterial genome, firstly due to the sheer
size of the genome, secondly due to its rich repeat content and thirdly due to the
diploidy of the human genome. Many heuristics effective for assembling small
genomes are not directly applicable to the human genome assembly. As a result,
only a few de novo assemblers have been applied on human short-read data. Among
them, ABySS~\citep{Simpson:2009ys}, SOAPdenovo~\citep{Li:2010vn} and
SGA~\citep{Simpson:2011ly}, as of now, do not explicitly output heterozygotes.
Although in theory it is possible to recover heterozygotes from their
intermediate output, it may be difficult in practice as the assemblers may not
distinguish heterozygotes from sequencing errors. Cortex~\citep{Iqbal:2012ys}
is specifically designed for retaining heterozygous variants in an assembly,
but it may still miss heterozygotes. ALLPATHS-LG~\citep{Gnerre:2011ys}
also pays particular attention to keeping heterozygotes, but it still has
a relatively low sensitivity. In addition, ALLPATHS-LG only works with
reads from libraries with distinct insert size distributions and prefers read
pairs with mean insert size below three times the read length, while many
resequencing projects do not meet these requirements; thus ALLPATHS-LG may
not be applicable or may not perform at its best. Even if we also include de novo
assemblers developed for capillary sequence reads, the version of the Celera
assembler used for assembling the HuRef genome~\citep{Levy:2007uq} is the only
one that retains heterozygotes while being capable of assembling a mammalian genome.
Finally, one might map sequence reads back to the assembled contigs to
recover heterozygous events, but this
procedure suffers from the same flaw as read mapping. To the best of our
knowledge, no existing de novo assemblers are able to achieve the sensitivity
of the standard mapping based approach for a diploid mammalian genome.
In this article, we will show that assembly based variant
calling can achieve SNP accuracy close to that of the standard mapping approach
and has particular strength in INDEL calling, confirming previous
studies~\citep{Iqbal:2012ys}. In addition, the de novo assembly
algorithm, \emph{fermi}, developed for this purpose is also a capable
assembler for the human genome.
\vspace*{-1em}
\begin{methods}
\section{METHODS}
The methods section is organized as follows. We first review the theoretical
history of de novo assembly, which leads to the rationale
behind fermi: to use unitigs as a lossless representation of reads. We then
summarize the notations used in the article and introduce a bidirectional
FM-index for DNA sequences. We present several algorithms for assembly
using the bidirectional FM-index. The key algorithm is based on previous
work~\citep{Simpson:2010uq}, but we need to adapt it to our new index. We
also remove the recursion in the original algorithm. Finally, we discuss
practical concerns in implementation.
\subsection{Theoretical background}
\subsubsection{A history of the overlap-layout-consensus paradigm}
Computer assisted sequence assembly can be dated back to the late
1970s~\citep{Staden:1979dq,Gingeras:1979cr}.
In 1984, \citeauthor{Peltola:1984qf} first formulated the DNA assembling
problem as finding the shortest string (the assembly) such that each sequence
read can be mapped to the assembly within a required error rate.
To solve the problem, they proposed a three-step procedure, which is
essentially the overlap-layout-consensus (OLC) approach.
\citet{Myers:1995nx} pointed out that reducing DNA assembly to a shortest
string problem is flawed in the presence of repeats. He further proposed the
concept of \emph{overlap graph}, where a vertex corresponds to a read and a
bidirectional edge to an overlap. Naively, the DNA assembling problem can be
cast as finding a path in the overlap graph such that each vertex/read is
visited exactly once (though edges/overlaps caused by repeats are not required to
be traversed), equivalent to a Hamilton path problem which is known to be
NP-complete. This has led many to believe that the OLC approach is
theoretically crippled.
However, it is worth pointing out that although the assembly problem can be reduced to a
Hamilton path problem, it can be reduced to other problems as well and in
practice almost no assemblers try to solve a Hamilton path problem. We note that a
fundamental difference between a generic graph and an overlap graph is that the
latter can be transitively reduced while retaining the read
relationship. More formally, if $v_1\to v_2$, $v_2\to v_3$ and $v_1\to v_3$ are
all present, edge $v_1\to v_3$ is said to be \emph{reducible}. When we remove
all the contained reads and reducible edges, a procedure called
\emph{transitive reduction}, the resulting graph is still a faithful
representation of the overlap graph~\citep{Myers:1995nx}, but the path
corresponding to the assembly is not a Hamilton path any more because reads
from repetitive regions need to be traversed multiple times.
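To make the definition concrete, reducible edges can be identified by a brute-force scan over all two-edge paths. The Python fragment below only illustrates the definition under simplifying assumptions (a directed graph stored as an adjacency dictionary, ignoring bidirectional edges, overlap lengths and contained reads); it is not how fermi or SGA performs the reduction.
{\footnotesize
\begin{verbatim}
def transitive_reduction(adj):
    """Drop edge u->w whenever a two-edge path u->v->w exists
    (the definition of a reducible edge). adj maps each vertex
    to the set of its successors."""
    reduced = {u: set(vs) for u, vs in adj.items()}
    for u, vs in adj.items():
        for v in vs:
            for w in adj.get(v, ()):
                if w != v:
                    reduced[u].discard(w)  # u->w is reducible
    return reduced

# Example: v1->v2, v2->v3 and v1->v3; edge v1->v3 is removed.
g = {"v1": {"v2", "v3"}, "v2": {"v3"}, "v3": set()}
assert transitive_reduction(g)["v1"] == {"v2"}
\end{verbatim}}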
In a transitively reduced graph, if there exists $v_1\to v_2$ with the
out-degree of $v_1$ and in-degree of $v_2$ both equal to 1, we are able to
merge $v_1$ and $v_2$ into one vertex without altering the topology of the
graph. After we perform all possible merges, we obtain a \emph{unitig graph} in
which each vertex corresponds to a \emph{unitig}, representing a maximal linear
sequence that can be resolved by reads. Multiple copies of a repeat may be
collapsed to a single unitig. The concept of unitig helps to greatly simplify
an assembly graph. It has played a central role in the Celera
assembler~\citep{Myers:2000kl}.
Finding the optimal tour in a unitig graph is still
NP-hard~\citep{DBLP:conf/wabi/MedvedevGMB07}, but such a formulation may
not be useful in practice as we can rarely assemble the entire genome into
one string. A more practical solution is to compute a traversal count
for each edge by solving a minimum cost network flow
problem~\citep{Myers:2005bh} and to drop edges with zero count as false
overlaps. In the resulting graph, each unambiguous path can be considered to
spell a contig.
Computing traversal counts in a transitively reduced graph can be performed
on small subgraphs separated by unambiguous edges. The overall time
complexity is not much worse than linear -- the worst case almost never happens
globally. However, deriving an overlap graph takes $O(N^2)$ time, where $N$ is
the number of reads, and transitive reduction takes at least $O(E)$ time, where
$E$ is the number of edges which is usually much larger than $N$. This still
makes an OLC based approach less favorable in short-read assembly where $N$ can
be of the order of $10^9$.
A breakthrough achieved by~\citet{Simpson:2010uq} finally solved
this last remaining problem, at least when we only consider exact overlaps.
These authors developed an $O(N)$ algorithm to find all the irreducible edges,
effectively replacing the overlapping and transitive reduction phases.
In summary, in the OLC paradigm, contig sequences can be constructed in a time
roughly linear in the total length of reads, though deriving a single assembled
sequence is NP-hard in theory.
\subsubsection{De Bruijn graph and read coherence}
The de Bruijn graph is an alternative graph representation of sequence
reads~\citep{Idury:1995oq}. It can be trivially constructed with a simple
linear-time algorithm and finding the optimal tour has polynomial-time
solutions. These make the de Bruijn graph approach very attractive for
assembling many short reads.
However, the de Bruijn graph is `lossy'. From a theoretical point of view, a de Bruijn graph
is equivalent to an overlap graph built by splitting a long read into overlapping
$k$-mers and requiring ($k$-1)-mer exact overlaps between non-redundant
$k$-mers. Such a graph does not have transitive edges. Because long reads all
effectively work as $k$-bp reads in a de Bruijn graph, long-range information
is lost. As a result, a path in the graph may be invalidated by reads. In
contrast, in a unitig graph or equivalently a string graph each path models a
valid assembly from input reads. \citet{Myers:2005bh} called this property of
path consistency \emph{read coherence}.
Losing long-range information in reads, a de Bruijn graph by itself has
reduced power to resolve short repeats. This flaw is usually amended by solving
an \emph{Eulerian superpath problem}~\citep{Pevzner:2001vn} whereby we
map reads back to the graph and bisect repeats shorter than the reads, a
procedure some also call \emph{read threading}. Many de Bruijn graph based
assemblers essentially take this
strategy~\citep{Chaisson:2009fk,Zerbino:2009ly,Li:2010vn}, though they may use
different terminologies. With read threading, it is possible to transform a de
Bruijn graph to a coherent graph, but finding the optimal solution is known to be
NP-hard~\citep{DBLP:conf/wabi/MedvedevGMB07} and may be complex to implement
given rich repeat structures.
\subsubsection{Concluding remark}
We note that we have only focused on the theoretical aspects of de novo assembly.
In practice, many assemblers derive the final assembly by applying heuristics
to the simplified graph instead of solving a network flow problem or an Eulerian
problem. Furthermore, correcting errors, utilizing read pairs and controlling memory
usage all pose challenges to large-scale de novo assembly. Many practical
problems are not solved perfectly. De novo assembly is still a field under
active development.
\subsection{Rationale}
Being coherent, a perfectly constructed unitig graph annotated with per-unitig
read counts in fact encapsulates all the information of reads and encodes no
information invalidated by reads. In this sense, any unitig based analysis
has an equivalent read based analysis, and vice versa. This article just
uses this property to explore the applications for which we usually rely on
reads.
\subsection{Strings and FM-index}
\begin{table}[tb]\label{tab:notation}
\processtable{Notations}
{\begin{tabular}{lp{7cm}}
\toprule
Symbol & Description \\
\midrule
$T$ & String: $T=a_0a_1\ldots a_{n-1}$ with $a_{n-1}=\$$\\
$|T|$ & Length of $T$ including sentinels: $|T|=n$\\
$T[i]$ & The $i$-th symbol in string $T$: $T[i]=a_i$\\
$T[i,j]$ & Substring: $T[i,j]=a_i\ldots a_j$\\
$T_i$ & Suffix: $T_i=T[i,n-1]$\\
$S$ & Suffix array; $S(i)$ is the position of the $i$-th smallest suffix\\
$B$ & BWT: $B[i]=T[S(i)-1]$ if $S(i)>0$ or $B[i]=\$$ otherwise\\
$C(a)$ & Accumul. count array: $C(a)=|\{0\le i\le n-1:T[i]<a\}|$ \\
$O(a,i)$ & Occurrence array: $O(a,i)=|\{0\le j\le i:B[j]=a\}|$\\
$P\circ W$ & String concatenation of string $P$ and $W$\\
$Pa$ & String concatenation of string $P$ and symbol $a$: $Pa=P\circ a$\\
$\overline{P}$ & Watson-Crick reverse complement of DNA string $P$\\
\botrule
\end{tabular}}{}
\end{table}
\subsubsection{Strings with multiple sentinels}
Let $\Sigma=\{\mbox{\tt \$},\mbox{\tt A},\mbox{\tt C},\mbox{\tt G},\mbox{\tt
T},\mbox{\tt N}\}$ be the \emph{alphabet} of DNA sequences with a predefined
lexicographical order $\mbox{\tt \$}<\mbox{\tt A}<\mbox{\tt C}<\mbox{\tt
G}<\mbox{\tt T}<\mbox{\tt N}$, where `{\tt N}' represents an ambiguous base and
`{\tt \$}' is a sentinel that marks the end of a string. An element in $\Sigma$
is called a \emph{symbol} and a sequence of symbols is called a \emph{string}.
Given a string $T$, let $|T|$ be the length of the string, $T[i]$,
$i=0,\ldots,|T|-1$, be the $i$-th symbol in the string, $T[i,j]$, $0\le i\le
j<|T|$, be a substring and $T_i=T[i,|T|-1]$ be a suffix of $T$. Following
the definition by~\citet{en:2009fk}, we define a string terminated with `{\tt
\$}' as a \emph{text}. A text may have multiple sentinels. In a text $T$, if
$T[i]=\$$ and $T[j]=\$$, we mandate $T[i]<T[j]$ if and only if $i<j$. Thus when
we compare two suffixes of $T$, we do not need to compare beyond a sentinel
because each sentinel has a different lexicographical rank.
For two strings $P$ and $W$, let $P\circ W$ be their string concatenation.
We may sometimes write $P\circ W$ as $PW$ if it is unambiguous in the context.
Given an ordered set of texts, we call their ordered string concatenation
as a \emph{collection}, which is also a text. For example, suppose we have
two reads. The first is {\tt ACG} and the second is {\tt GTG}. The collection
of the two reads is $T=\mbox{\tt ACG\$GTG\$}$. Suffix $T_2<T_6$ because the first
sentinel is lexicographically smaller than the second.
For convenience, we assign an integer from 0 to 5 to `{\tt \$}', `{\tt A}',
`{\tt C}', `{\tt G}', `{\tt T}' and `{\tt N}', respectively. We may use both
the integer and the letter representations throughout the article. In addition,
given a symbol $a$, we define $\overline{a}$ as the Watson-Crick complement of $a$.
We regard the complement of `{\tt \$}' and `{\tt N}' as identical to itself.
\subsubsection{FM-index}
The \emph{suffix array} $S$ of text $T$ is a permutation of integers between 0 and
$|T|-1$, where $S(i)$, $0\le i<|T|$, is the position of the $i$-th smallest
suffix of $T$. Given a string $P$, the \emph{suffix array interval}
$[I^l(P),I^u(P)]$ of $P$ in $T$ is defined as
\begin{eqnarray*}
I^l(P)&=&\min\{k:\mbox{$P$ is the prefix of $T_{S(k)}$}\}\\
I^u(P) &=&\max\{k:\mbox{$P$ is the prefix of $T_{S(k)}$}\}
\end{eqnarray*}
For convenience, we also define $I^s(P)=I^u(P)-I^l(P)+1$ as the size of the interval.
The \emph{Burrows-Wheeler Transform}~\citep{Burrows:1994gd}, or \emph{BWT}, of $T$ is a permutation of
symbols in $T$. The BWT string $B$ is computed as $B[i]=T[S(i)-1]$ for
$S(i)>0$ and $B[i]=\mbox{\tt \$}$ otherwise. Given a text $T$, also define
the accumulative count array $C(a)$ as the number of symbols in $T$ that are
lexicographically smaller than $a$, and the occurrence array $O(a,i)$ as
the number of occurrences of symbol $a$ in $B[0,i]$.
The \emph{FM-index}~\citep{DBLP:conf/focs/FerraginaM00} is a compressed representation of the BWT $B$, the occurrence
array $O(a,i)$ and the suffix array $S(i)$. The key property of the FM-index is
\begin{eqnarray}\label{eq:back}
I^l(aP)&=&C(a)+O(a,I^l(P)-1)\\
I^u(aP)&=&C(a)+O(a,I^u(P))-1
\end{eqnarray}
and $I^l(aP)\le I^u(aP)$ if and only if $aP$ is a substring of $T$.
We note that these two equations are different from the ones in our previous
paper~\citep{Li:2009uq} in that $C(a)$ and $O(a,i)$ defined here include
the sentinels, but the two arrays in the previous paper exclude them.
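As a minimal illustration of Equation~\eqref{eq:back}, the following Python fragment maps the suffix array interval of $P$ to that of $aP$. It assumes $C$ is available as a list indexed by symbol and $O$ as a function with $O(a,-1)=0$; it is an illustrative sketch rather than part of the fermi implementation.
{\footnotesize
\begin{verbatim}
def backward_search_step(C, O, lo, hi, a):
    """Interval [lo,hi] of P  ->  interval of aP.
    C[a]: number of symbols in T smaller than a.
    O(a,i): occurrences of a in B[0..i], with O(a,-1) == 0."""
    lo2 = C[a] + O(a, lo - 1)
    hi2 = C[a] + O(a, hi) - 1
    return lo2, hi2   # aP occurs in T iff lo2 <= hi2
\end{verbatim}}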
Given a collection $T=Q_0 Q_1\ldots Q_{n-1}$, we can retrieve
sequence $Q_i$ in linear time with Algorithm~1~\citep{DBLP:conf/recomb/MakinenNSV09}.
The second return value is the rank of $Q_i$ which equals $|\{Q_j:Q_j<Q_i\}|$.
\begin{algorithm}[h]
\DontPrintSemicolon
\footnotesize
\KwIn{Sequence index $i\ge0$; $B$, $O$ and $C$ defined in the text}
\KwOut{Sequence $P$ and $k$, the rank of $P$}
\BlankLine
\textbf{Function} {\sc GetSeq}$(i)$
\Begin {
$k\gets i$\;
$P\gets$ \emph{empty string}\;
\While{\bf true} {
$a\gets B[k]$\;
$k\gets C(a)+O(a,k)-1$\;
\If{$a=0$}{
\Return $(P,k)$
}
$P\gets aP$
}
}
\caption{Sequence retrieval}
\end{algorithm}
\subsection{FMD-index}
Given DNA texts $R_0,\ldots,R_{n-1}$, define $T=R_0\overline{R}_0
R_1\overline{R}_1\ldots R_{n-1}\overline{R}_{n-1}$ as the
\emph{bidirectional collection} of the texts. We call the FM-index of $T$
the \emph{FMD-index} of $R_0,\ldots,R_{n-1}$ and define the \emph{bi-interval} of a
string $P$ as $[I^l(P),I^l(\overline{P}),I^s(P)]$. We will show how to compute
the bi-interval of $aP$ and $Pa$ when we know the bi-interval of $P$.
We note that when we know the bi-interval of $P$, $I^l(aP)$ and $I^s(aP)$ can
be readily computed with Equation~\eqref{eq:back}.
$[I^l(\overline{aP}),I^u(\overline{aP})]$ is a sub-interval of
$[I^l(\overline{P}),I^u(\overline{P})]$ because $\overline{P}$ is a prefix
of $\overline{aP}=\overline{P}\circ\overline{a}$. Due to the innate symmetry
of $T$, $I^s(\overline{cP})=I^s(cP)$ for all $c\in\Sigma$ with
$\sum_{c}I^s(cP)=I^s(P)=I^s(\overline{P})$. We can compute $I^s(cP)$ for all $c\in\Sigma$
with Equation~\eqref{eq:back}, use these interval sizes to divide
$[I^l(\overline{P}),I^u(\overline{P})]$ and finally derive
$[I^l(\overline{aP}),I^u(\overline{aP})]$. This completes the computation of
the bi-interval of $aP$ (Algorithm~2). Furthermore, when we backward extend
$P$, we actually forward extend $\overline{P}$. Conversely, backward extension
of $\overline{P}$ yields forward extension of $P$ (Algorithm~3).
An FMD-index is bidirectional.
In comparison to the bidirectional BWT~\citep{Lam:2009fk} which uses two FM-indices,
the FMD-index builds both forward and reverse strand DNA sequences in one index.
Although the FMD-index is not applicable to generic texts, it is conceptually
more consistent with double-stranded DNA and improves the speed of exact matching
as we only need to search against one index. For example,
BWA-SW~\citep{Li:2010fk} gets an 80\% speedup when we adopt the FMD-index as the
data structure.
\begin{algorithm}[h]
\DontPrintSemicolon
\footnotesize
\KwIn{Bi-interval $[k,l,s]$ of string $W$ and a symbol $a$}
\KwOut{Bi-interval of string $aW$}
\BlankLine
\textbf{Function} {\sc BackwardExt}$([k,l,s],a)$
\Begin {
\For{$b\gets0$ \KwTo $5$} {
$k_b\gets C(b) + O(b,k-1)$\;
$s_b\gets O(b,k+s-1) - O(b,k-1)$
}
$l_0\gets l$\;
$l_4\gets l_0+s_0$\;
\For{$b\gets3$ \KwTo $1$} {
$l_b\gets l_{b+1}+s_{b+1}$
}
$l_5\gets l_1+s_1$\;
\Return{$[k_a,l_a,s_a]$}
}
\caption{Backward extension}
\end{algorithm}
\begin{algorithm}[h]
\DontPrintSemicolon
\footnotesize
\KwIn{Bi-interval $[k,l,s]$ of string $W$ and a symbol $a$}
\KwOut{Bi-interval of string $Wa$}
\BlankLine
\textbf{Function} {\sc ForwardExt}$([k,l,s],a)$
\Begin {
$[l',k',s']\gets${\sc BackwardExt}$([l,k,s],\overline{a})$\;
\Return{$[k',l',s']$}
}
\caption{Forward extension}
\end{algorithm}
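For concreteness, Algorithms~2 and~3 can be transcribed into Python as follows. This is an illustrative sketch under the same assumptions as before ($C$ as a list indexed by symbol, $O$ as a function with $O(b,-1)=0$), not the fermi implementation.
{\footnotesize
\begin{verbatim}
def backward_ext(C, O, biv, a):
    """Bi-interval of aW from the bi-interval (k, l, s) of W
    (a transcription of Algorithm 2)."""
    k, l, s = biv
    kk, ss, ll = [0] * 6, [0] * 6, [0] * 6
    for b in range(6):
        kk[b] = C[b] + O(b, k - 1)
        ss[b] = O(b, k + s - 1) - O(b, k - 1)
    ll[0] = l              # on the reverse strand the sub-intervals
    ll[4] = ll[0] + ss[0]  # are ordered by complement: $, T, G, C, A, N
    for b in (3, 2, 1):
        ll[b] = ll[b + 1] + ss[b + 1]
    ll[5] = ll[1] + ss[1]
    return kk[a], ll[a], ss[a]

COMP = {0: 0, 1: 4, 2: 3, 3: 2, 4: 1, 5: 5}  # $, A, C, G, T, N

def forward_ext(C, O, biv, a):
    """Bi-interval of Wa: backward-extend the swapped bi-interval
    with the complement symbol (Algorithm 3)."""
    k, l, s = biv
    l2, k2, s2 = backward_ext(C, O, (l, k, s), COMP[a])
    return k2, l2, s2
\end{verbatim}}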
\subsection{Unitig construction}
\subsubsection{Labeling reads and overlaps}
Given a bidirectional collection $T=R_0\overline{R}_0\ldots
R_{n-1}\overline{R}_{n-1}$, fermi labels the $i$-th input read $R_i$ with an
\emph{ordered} integer pair $[k,l]$, where $k$ is the rank of $R_i$
and $l$ the rank of $\overline{R}_i$. The pair $[k,l]$ can be computed by
{\sc GetSeq}$(2i)$ and {\sc GetSeq}$(2i+1)$, respectively. Obviously, if read
$R_i$ is labeled by $[k,l]$, $\overline{R}_i$ should be labeled by
$[l,k]$, with the two integers swapped.
For two reads labeled by $[k,l]$ and $[k',l']$, if the tail (3'-end) of read
$[k,l]$ overlaps the head (5'-end) of $[k',l']$, we use an \emph{unordered}
integer pair $\langle l,k'\rangle$ to label the overlap. This is a tail-to-head
overlap. Similarly, we use $\langle l',k\rangle$ for a head-to-tail overlap,
$\langle l,l'\rangle$ for tail-to-tail and $\langle k,k'\rangle$ for a
head-to-head overlap. The four types of overlaps correspond to the four
types of bidirectional edges in the bidirectional overlap
graph~\citep{Myers:1995nx}.
\subsubsection{Finding irreducible overlaps}
Finding irreducible overlaps plays a central role in fermi as well as in SGA.
Given its importance, we present a restructured version of this algorithm
(SD10;~\citealt{Simpson:2010uq}) using our notations (Algorithm~4).
In Algorithm~4, line 1 computes the bi-interval of a single symbol. The loop at
line 2 uses backward extensions to find all the reads overlapping with the
input string $P$. The loop at line 3 uses forward extensions base by base to exclude
reducible overlaps found at the previous step. $W$ in this loop keeps
the common substring of reads overlapping $P$ extended from the 3'-end of $P$. If
in an iteration we find the sentinel of a read $R$ (line 5), then all the reads sharing
the same $W$ with $R$ must overlap with both $R$ and $P$ and therefore
their overlaps with $P$ are reducible. In this case, no further forward
extensions are necessary (lines 4 and 6).
Similar to the original algorithm, Algorithm 4 requires that there are no contained reads.
Fermi actually implements a modified version that detects read containment on the fly,
but we think the algorithm is a little overcomplicated. It is probably easier to
filter contained reads first and then run Algorithm 4, as SGA does.
\begin{algorithm}[h]
\DontPrintSemicolon
\footnotesize
\KwIn{Read $P$ and the minimum overlap length $x$}
\KwOut{Set of bi-intervals of reads having irreducible overlaps with the 3'-end of $P$}
\BlankLine
\textbf{Function} {\sc IrrOverlap}$(P,x)$
\Begin {
\emph{Initialize ${\sf Curr}$ and ${\sf Prev}$ as empty arrays}\;
$a\gets P[|P|-1]$\;
\nl$[k,l,s]\gets [C(a), C(\overline{a}), C(a+1) - C(a)]$\;
\nl\For{$i\gets |P|-2$ \KwTo $0$} {
\If{$|P|-i-1\ge x$} {
$[k',l',s']\gets${\sc BackwardExt$([k,l,s],0)$}\;
\If{$s'\not=0$} {
\emph{Append $([k',l',s'],\epsilon)$ to ${\sf Curr}$}\;
}
}
$[k,l,s]\gets${\sc BackwardExt$([k,l,s],P[i])$}\;
}
\emph{Reverse array ${\sf Curr}$, and swap ${\sf Curr}$ and ${\sf Prev}$}\;
${\sf Finished}=\emptyset$\;
$\mathcal{I}=\emptyset$\;
\nl\While{${\sf Prev}$ is not empty} {
\emph{Reset ${\sf Curr}$ to empty}\;
\For{$([k,l,s],W)$ {\bf in} ${\sf Prev}$} {
\If{$W\in{\sf Finished}$} {
\nl{\bf continue}\;
}
$[k',l',s']\gets${\sc ForwardExt}$([k,l,s],0)$\;
\nl\If{$s'\not=0$} {
${\sf Finished}\gets{\sf Finished}\cup \{W\}$\;
$\mathcal{I}\gets\mathcal{I}\cup\{[k',l',s']\}$\;
\nl{\bf continue}\;
}
\For{$a\gets1$ \KwTo $5$} {
$[k',l',s']\gets${\sc ForwardExt}$([k,l,s],a)$\;
\If{$s'\not=0$ {\bf and} $[k',l',s']$ is not in ${\sf Curr}$} {
\emph{Append $([k',l',s'],Wa)$ to ${\sf Curr}$}\;
}
}
}
\emph{Swap ${\sf Curr}$ and ${\sf Prev}$}
}
\Return{$\mathcal{I}$}
}
\caption{Finding irreducible overlaps}
\end{algorithm}
\subsubsection{Unitig construction}
Unitig construction is a process of unambiguously merging overlapping reads.
If $[k,l]$ and $[k',l']$ have an irreducible overlap $\langle l,k'\rangle$ and
can be unambiguously merged, we label the merged sequence with $[k,l']$;
similar labeling applies to the other three types of overlaps. With this simple
labeling procedure, we are able to fully keep track of the graph topology
during unitig construction without staging the graph in RAM. This
procedure can also be easily multi-threaded.
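The label bookkeeping can be illustrated with the following toy Python sketch of the rule above; it is not fermi's implementation.
{\footnotesize
\begin{verbatim}
def flip(label):
    """Reverse complementing a sequence swaps its two ranks."""
    k, l = label
    return (l, k)

def merge_tail_to_head(x, y):
    """Merge x=[k,l] with y=[k',l'] across the irreducible
    tail-to-head overlap <l,k'>; the result is labelled [k,l'].
    The other three overlap types reduce to this case by
    flipping x and/or y first."""
    (k, _), (_, l2) = x, y
    return (k, l2)
\end{verbatim}}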
\subsection{Finding the supermaximal exact matches}
An FMD-index can be used to find \emph{supermaximal exact matches} (SMEMs)
between a reference and a query sequence. Formally, a \emph{maximal exact
match} (MEM) is an exact match that cannot be extended in either direction of
the match. An SMEM is a MEM that is not contained in other MEMs on the query
sequence. Fermi uses SMEMs to map reads back to the unitigs.
Algorithm~5 describes the details. Basically, we use forward-backward extension
to extend an exact match and detect the boundary of a maximal match by tracking
the change of interval sizes. Fermi implements a variant of Algorithm~5. It
finds full-length read matches and can optionally exclude matches identical to
the query sequence.
\begin{algorithm}[h]
\DontPrintSemicolon
\footnotesize
\KwIn{String $P$ and start position $i_0$; $P[-1]=0$}
\KwOut{Set of bi-intervals of SMEMs overlapping $i_0$}
\BlankLine
\textbf{Function} {\sc SuperMEM1}$(P,i_0)$
\Begin {
\emph{Initialize ${\sf Curr}$, ${\sf Prev}$ and ${\sf Match}$ as empty arrays}\;
$[k,l,s]\gets [C(P[i_0]), C(\overline{P[i_0]}), C(P[i_0]+1) - C(P[i_0])]$\;
\For{$i\gets i_0+1$ \KwTo $|P|$} {
\If{$i=|P|$} {
\emph{Append $[k,l,s]$ to ${\sf Curr}$}
} \Else {
$[k',l',s']\gets${\sc ForwardExt$([k,l,s],P[i])$}\;
\If{$s'\not=s$} {
\emph{Append $[k,l,s]$ to ${\sf Curr}$}
}
\If{$s'=0$} {
{\bf break}\;
}
$[k,l,s]\gets[k',l',s']$
}
}
\emph{Swap array ${\sf Curr}$ and ${\sf Prev}$}\;
$i'\gets |P|$\;
\For{$i\gets i_0-1$ \KwTo $-1$} {
\emph{Reset ${\sf Curr}$ to empty}\;
$s''\gets -1$\;
\For{$[k,l,s]$ {\bf in} ${\sf Prev}$}{
$[k',l',s']\gets${\sc BackwardExt$([k,l,s],P[i])$}\;
\If{$s' = 0$ {\bf or} $i = -1$} {
\If{${\sf Curr}$ is empty {\bf and} $i+1<i'+1$} {
$i'\gets i$\;
\emph{Append $[k,l,s]$ to ${\sf Match}$}
}
}
\If{$s'\not=0$ {\bf and} $s'\not=s''$} {
$s''\gets s'$\;
\emph{Append $[k,l,s]$ to ${\sf Curr}$}
}
}
\If{${\sf Curr}$ is empty} {
{\bf break}
}
\emph{Swap ${\sf Curr}$ and ${\sf Prev}$}\;
}
\Return ${\sf Match}$
}
\caption{Finding super-maximal exact matches}
\end{algorithm}
\subsection{Other implementation details}
\subsubsection{Constructing FM-index}
To compute suffix arrays for strings with multiple sentinels, we modified an
optimized implementation of the SA-IS algorithm~\citep{DBLP:journals/tc/NongZC11} by Yuta
Mori. We used the established algorithm to merge BWTs of subsets of
reads~\citep{DBLP:journals/algorithmica/HonLSSY07,en:2009fk,DBLP:conf/latin/FerraginaGM10}.
The BWT string is run-length encoded with the length in the delta
encoding~\citep{Elias:1975ly}.
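As an illustration of this storage scheme, the fragment below run-length encodes a BWT string and codes each run length in Elias delta. It sketches the general idea only and does not reproduce fermi's actual on-disk layout.
{\footnotesize
\begin{verbatim}
from itertools import groupby

def elias_delta(x):
    """Elias delta code of a positive integer x, as a bit string."""
    n = x.bit_length() - 1                 # floor(log2 x)
    m = n + 1
    gamma = '0' * (m.bit_length() - 1) + bin(m)[2:]  # gamma code of n+1
    return gamma + bin(x)[2:][1:]          # then the n low bits of x

def rle_delta(bwt):
    """Run-length encode a BWT string: one (symbol, coded length) per run."""
    return [(sym, elias_delta(len(list(run))))
            for sym, run in groupby(bwt)]

# e.g. rle_delta("AAAACC$") == [('A','01100'), ('C','0100'), ('$','1')]
\end{verbatim}}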
\subsubsection{Error correction}
Fermi corrects potential sequencing errors using an algorithm similar to
solving the spectrum alignment problem~\citep{Pevzner:2001vn}, correcting
bases in underrepresented $k$-mers. It also shares similarity to
HiTEC~\citep{Ilie:2011fk}. Nonetheless, fermi's algorithm differs in that
it is quality aware and does not rely on a user defined threshold on the
$k$-mer occurrences.
Fermi corrects errors in two phases. In the first phase, it collects all 23-mers
occurring three or more times using a top-down traversal over the trie represented
by the FMD-index. For each such 23-mer, fermi counts the occurrences of
the next (i.e. the 24-th) base and stores the information in a hash table
with the 23-mer being the key. In the second phase, fermi processes each read
by using the 23-mer hash table to correct errors by minimizing a heuristic cost
function of base quality and the occurrences of the 24-th base. Roughly
speaking, fermi tries to correct a low-quality base if by looking up its 23-mer
prefix we know the base is different from an overwhelmingly frequent 24-th base.
This algorithm can be adapted to correct INDEL sequencing errors in principle,
but this has not been done. More work is needed to perform the minimization efficiently.
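The Python fragment below sketches a highly simplified version of the two phases. The $k$-mer table is built by naive counting rather than the FMD-index trie traversal fermi uses, and the fixed thresholds \texttt{min\_occ}, \texttt{min\_qual} and \texttt{min\_ratio} are illustrative stand-ins for fermi's quality-aware cost minimization.
{\footnotesize
\begin{verbatim}
from collections import defaultdict

K = 23  # k-mer length used by fermi's corrector

def build_table(reads, min_occ=3):
    """Phase 1 (simplified): for each K-mer seen min_occ or more
    times, count the occurrences of the following base."""
    kcnt = defaultdict(int)
    for r in reads:
        for i in range(len(r) - K):
            kcnt[r[i:i+K]] += 1
    nxt = defaultdict(lambda: defaultdict(int))
    for r in reads:
        for i in range(len(r) - K):
            if kcnt[r[i:i+K]] >= min_occ:
                nxt[r[i:i+K]][r[i+K]] += 1
    return nxt

def correct(read, qual, nxt, min_qual=20, min_ratio=0.9):
    """Phase 2 (simplified): change a low-quality base when its
    K-mer prefix overwhelmingly supports a different next base."""
    out = list(read)
    for i in range(K, len(read)):
        counts = nxt.get(''.join(out[i-K:i]))
        if not counts:
            continue
        best, n = max(counts.items(), key=lambda kv: kv[1])
        if (out[i] != best and qual[i] < min_qual
                and n >= min_ratio * sum(counts.values())):
            out[i] = best
    return ''.join(out)
\end{verbatim}}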
\subsubsection{Simplifying complex bubbles}
A \emph{bubble} is a directed acyclic subgraph with a single source and a
single sink having at least two paths between the source and the sink. A
\emph{closed bubble} is a bubble with no incoming edges from or outgoing edges
to other parts of the entire graph, except at the source and the sink vertices.
A closed bubble is \emph{simple} if there are exactly two paths between the
source and the sink; otherwise it is \emph{complex}. In de novo assembly, a
bubble is frequently caused by sequencing errors or heterozygotes. Most
short-read assemblers use a modified Dijkstra's algorithm to pop bubbles
progressively. Such an algorithm works fine for haploid genomes, but it is not
straightforward to distinguish heterozygotes from errors when the bubble is
complex.
Fermi uses a different algorithm. It effectively performs topological sorting
from the end of a vertex while keeping track of the two paths containing
the most reads. A bubble is detected when every path ends at a single vertex. Fermi
then drops vertices not on the top two paths and thus turns a complex bubble into
a simple one.
\subsubsection{Using the paired-end information}
Given paired-end reads with short-insert sizes, fermi maps reads back to the
unitigs with Algorithm~5. If two unitigs are linked by at least five read
pairs, fermi will locally assemble the ends of unitigs together with unpaired
reads pointing to the gap under a relaxed setting. Fermi tries to align the
ends of unitigs using the Smith-Waterman algorithm, which may reveal imperfect
overlaps caused by sequencing errors or heterozygotes. Fermi also uses
paired-end reads to break contigs at regions without bridging read pairs. This
helps to reduce misassemblies during the unitig construction.
\end{methods}
\section{RESULTS}
We evaluated fermi on 101bp paired-end reads from NA12878~\citep{Depristo:2011vn}.
The total coverage of the original data is about 70-fold, but we only used half
of the data. We assembled the 35-fold reads with fermi on a machine with 12 CPUs
and 96GB memory in about 5 days. The peak memory usage was 92GB.
We obtained unitigs of N50 1,022bp, totaling 3.83Gb. After collapsing
most heterozygotes and closing gaps with paired-end reads, we got longer contigs
(Table~2). Unitigs are short and redundant mainly because they break at
heterozygotes.
For SNP and INDEL calling, we aligned unitigs to the reference genome using
BWA-SW~\citep{Li:2010fk} with command line options `-b9 -q16 -r1 -w500'. We
called SNPs with the SAMtools caller and called INDELs by directly counting INDELs
from the pileup output. We did not run a standard INDEL caller as short-read
INDEL callers do not work well with long contig sequences.
\subsection{Performance on de novo assembly}
We obtained the HuRef capillary read assembly~\citep{Levy:2007uq} and the
ALLPATHS-LG NA12878 contigs (AC:AEKP01000000) from NCBI, the SGA scaffolds from
\mbox{\href{http://bit.ly/jts12878}{http://bit.ly/jts12878}}~\citep{Simpson:2011ly}
and the ABySS assembly provided by Shaun Jackman (personal communication).
For both SGA and ABySS scaffolds, we split at any ambiguous bases to get
contigs; for the HuRef assembly, we split at runs of `N' longer than 20bp.
The ABySS, fermi and SGA assemblies are derived from essentially the same input
reads. ALLPATHS-LG uses a superset of reads at 100-fold coverage, including
reads from multiple long-insert libraries.
\begin{table}[bt]
\processtable{Statistics on human whole-genome assemblies}
{\begin{tabular}{lrrrrr}
\toprule
%& AllPaths-LG & fermi & SGA & SOAP & HuRef \\
%\midrule
%Aligned contig bases & 2.62G & 2.82G & 2.74G & 2.71G & 2.88G \\
%Aligned N50 & 22.6k & 15.6k & 9.8k & 7.0k & 81.4k \\
%Covered ref. bases & 2.59G & 2.74G & 2.70G & 2.67G & 2.78G \\
%\# Type-1 breaks & 13,738 & 5,704 & 6,049 & 4,962 & 16,318 \\
%\# Type-2 breaks & 3,823 & 1,120 & 1,735 & 727 & 6,626 \\
& ABySS & AllPaths-LG & fermi & SGA & HuRef \\
\midrule
%Aligned contig bp & 2.73G & 2.62G & 2.82G & 2.74G & 2.88G \\
%Aligned N50 & 9.0k & 22.6k & 15.6k & 9.8k & 81.4k \\
%Covered ref. bp & 2.69G & 2.59G & 2.74G & 2.70G & 2.78G \\
%\# Type-1 breaks & 5,856 &13,738 & 5,704 & 6,049 & 16,318 \\
%\# Type-2 breaks & 1,617 & 3,823 & 1,120 & 1,735 & 6,626 \\
Aligned contig bp & 2.73G & 2.60G & 2.81G & 2.74G & 2.81G \\
Aligned N50 & 9.4k & 23.8k & 15.8k & 9.9k &103.0k \\
Covered ref. bp & 2.69G & 2.59G & 2.74G & 2.70G & 2.78G \\
\# Type-1 breaks & 1,524 & 3,382 & 920 & 1,562 & 4,728 \\
\# Type-2 breaks & 1,495 & 3,314 & 897 & 1,534 & 4,503 \\
\botrule
\end{tabular}}{Contigs over 150bp in length are aligned to the human reference
genome GRCh37 with BWA-SW using option `-b9 -q16 -r1 -w500'. A type-1 break point
is detected if a contig is split during alignment and mapped to two distinct
locations, and at each location the alignment is longer than 500bp and the
mapping quality is no less than 10. Type-2 break points exclude type-1 break
points which can be patched with gaps no longer than 500bp.} \end{table}
From Table~2, we can see that the HuRef assembly has much better contiguity
than short-read assemblies. It appears to yield more alignment break points, some
of which may be caused by true structural variations not easily detectable with
short reads. The quality of short-read assemblies varies in terms of contiguity,
misassembly rate and redundancy between contigs, but overall the assemblies are largely
comparable to each other.
\subsection{Performance on SNP and INDEL calling}
One of the key motivations of fermi is to explore the power of de novo assembly
in calling short variants. We collected several SNP and INDEL call sets (Table~3)
and compared the performance of fermi (Table~4 and 5).
\begin{table}[t]
\processtable{Evaluated SNP and INDEL call sets}
{\begin{tabular}{lllll}
\toprule
Label & Data & Assembler & Mapper & Caller \\
\midrule
AC & 96X Illumina PE$^1$ & AllPaths-LG & BWA-SW$^2$ & SAMtools$^3$ \\
BS & 70X Illumina PE & & BWA$^4$ & SAMtools \\
CG & Complete Genom.& & cgatools2$^5$ & cgatools2 \\
CV & 26X Illumina SE$^6$ & Cortex & & Cortex-var \\
FC & 35X Illumina SE$^6$ & fermi & BWA-SW$^2$ & SAMtools$^3$ \\
MD & 60X multiple & & MAQ & 1000g pilot$^7$ \\
MI & Capillary reads$^8$ & & & \\
SS & 35X Illumina SE$^6$ & & BWA-SW & SAMtools \\
\botrule
\end{tabular}}{
$^1$ AC uses reads from Illumina jumping and fosmid libraries\\
$^2$ BWA-SW is invoked with `bwa bwasw -b9 -q16 -r1 -w500'\\
$^3$ INDELs are called from pileup without using the SAMtools caller\\
$^4$ Realigned by GATK~\citep{Depristo:2011vn} also around known INDELs\\
$^5$ By Complete Genomics~\citep{Drmanac:2010ly}; only `VQHIGH' calls retained\\
$^6$ CV, FC and SS do not use the pairing information in calling\\
$^7$ 1000 Genomes Project pilot calls; generated from Dindel and multiple SNP callers\\
$^8$ INDEL calls by \citet{Mills:2011fk}\\
}
\end{table}
\begin{table}[b]
\processtable{Statistics of SNP call sets}
{\begin{tabular}{p{2.3cm}rrrrrrr}
\toprule
& FC & CV & SS & BS & CG & MD \\
\midrule
\#SNPs (M) & 3.37 & 2.20 & 3.24 & 3.50 & 3.34 & 2.69\\
\#hets (M) & 1.97 & 1.04 & 1.94 & 2.11 & 2.04 & 1.65\\
ts/tv & 2.04 & 2.03 & 2.08 & 2.11 & 2.12 & 2.06\\
DN50 (bp) & 3,593& 6,662& 3,523& 3,392& 3,447&3,992\\
DN2/DN50 & 22.3 & 20.8 & 23.4 & 22.7 & 22.3 & 22.9\\
\botrule
\end{tabular}}{Ts/tv is the transition-to-transversion ratio of SNPs. DN50 is
calculated as follows. The reference genome is masked according to the
align-ability mask~(\href{http://bit.ly/snpable}{http://bit.ly/snpable}) and segmented into intervals at
heterozygous SNPs. DN50 is computed such that 50\% of unique positions in the
genome are in intervals longer than DN50. DN2 is calculated similarly and
DN2/DN50 is the ratio of DN2 to DN50. DN50 measures the sensitivity; the
smaller the better. DN2/DN50 measures the precision of heterozygous SNPs; the
higher the better.}
\end{table}
For SNP calling (Table~4), fermi misses 3\% of SNPs called in SS,
but finds more additional ones. Manual examination reveals that
the additional calls are mainly caused by two factors. Firstly, in the
single-end mode, BWA-SW is very conservative. It may consistently give
correct alignments low mapping quality, which SAMtools downweights.
Fermi is able to assemble such reads into
longer sequences, which increases the power of BWA-SW. Secondly, in the fermi
alignment, some regions may be mapped with a high mismatching rate. These may
be due to small-scale misassemblies in fermi unitigs or in the reference
assembly, or copy-number variations. It is possible that these clustered
SNPs contain more errors. Such errors may lead to reduced ts/tv, but tend
not to break long homozygous blocks due to very recent coalescences. That
is why FC has a good DN2/DN50 ratio, which measures how often false heterozygotes
arise from a long homozygous block.
Table~5 shows the comparison between different INDEL call sets. We excluded
INDELs around long homopolymer runs in all call sets because INDEL sequencing
errors tend to occur around long homopolymer runs and their error profile is still unclear
(the 1000 Genomes Project Analysis group, personal communication). In addition,
we have excluded the SS INDEL call set which is nearly contained in BS due to
the use of the same INDEL caller.
For the call sets in Table~5, MD and CG are relatively small due to the use of
very short reads. CV uses 26X 100bp reads. It is a small call set due to the
high false negative rate of the calling method~\citep{Iqbal:2012ys}. The fermi
call set FC is slightly smaller than BS, but it has larger overlap with other
call sets than BS, and more FC calls are confirmed by others. One explanation
for the lower overlap ratio between BS and ALL is that BS is the only call
set that uses 101bp paired-end information, which gives it higher power for
INDELs not detectable with single-end or very short reads. Nonetheless, purely
based on Table~5, fermi appears to have higher overall accuracy.
% 1:16907626 - deletion visible from fermi/bwa alignment; low mapping quality; uncalled
% 1:16954475 - DP=139 (expected=35); no indel in alignment
% 1:17054971 - DP=199; no indel
% 1:17223282 - DP=31; 3 1bp deletion reads; not low-complexity
% 1:21753831 - 3bp deletion visible, but in low-mapQ region
% 1:66921509 - left-right long distance
% 1:73102391 - fermi/bwa predicts 12bp INS; Mills: 2bp INS at slightly different pos
% 1:94244466 - fermi/bwa: clear no indel, DP=39
% 1:99729488 - no indel
% 1:120598347- no indel
% 1:143940178- fermi gives the indel, but only two supporting reads
% 1:167442597- no indel; Mills: a long deletion
% 1:175140045- excessive left-right distance
% 1:184320994- low-complexity; do not know what's happening
% 1:245778332- no indel, but alignment is not clean
% 2:30551766 - fermi calls it, but large left-right distance
% 2:43168955 - no indel, alignment okay
% 2:52735433 - no indel, clean alignment, DP=35
% 2:82262306 - clean alignment, slightly elevated DP
% 2:95526297 - no indel, but excessive mismatches, centromere
% 2:133151670- no indel, but bad alignment, excessive DP
% 2:187143096- no indel, clean alignment, Mills: long insertion; long del in huref
% 2:198119486- 2bp deletion visible, but different position
% 3:50567499 - no indel, very good alignment
% 3:112086369- different position
% 3:164267467- no indel, slightly elevated read depth
% 4:67284247 - no indel, clean alignment
% 4:84822017 - fermi/bwa: 17bp del at diff pos; Mills: 10bp del
% 4:137529249- no indel, clean alignment
% 4:181071302- no indel, clean alignment
Even with all short-read call sets combined, as many as 14\% of double-hit
INDELs called by~\citet{Mills:2011fk} are missed. We manually checked 30
missing INDELs in an alignment viewer. For half of the cases, the short-read
alignment and fermi alignment strongly suggest no variations, and for all these
cases, the HuRef sequences are identical to GRCh37. In addition, there are a few
cases called from regions under clear copy-number changes. In all, we believe
INDELs called only by~\citet{Mills:2011fk} may have a high error rate. With
short reads, we can recover most short INDELs found by capillary sequencing.
% FS CV SS BS CG MD AS sum1 sum2 sum3
% ------------------------------------------------------------------
% FS 383078 198513 295926 332481 237810 223302 235105 352900 364167
% CV 219795 251769 207981 222029 182752 178131 164424 229231 231985 232448
% SS 292270 190528 322095 309800 222997 222804 199882 313746 318967 319350
% BS 330405 201341 312624 404646 241722 228431 229152 336172 363859 372496
% CG 237294 166611 224035 242198 264696 199071 161052 247969 253760 255437
% MD 219497 163070 220968 225339 196936 240424 149753 231786 234230 235027
% AS 231508 143791 200100 227719 160311 151455 323423 242816 253042 253358
% MD CG BS CV FC AC MI all
% -----------------------------------------------------------
% MD 240424 196936 225339 163070 227796 149559 12981 234930
% CG 199071 264696 242198 166611 244573 160786 13861 255440
% BS 228431 241722 404646 201341 341626 228833 17682 366599
% CV 178131 182752 222029 251769 227206 164320 13045 232406
% FC 231787 245165 344031 205983 393841 239822 17726 374760
% AC 151147 159973 227291 143648 236222 321682 13784 252817
% MI 13772 14348 18350 12226 18672 14740 23216 20069
\begin{table}[tb]
\processtable{Fraction of INDELs found in other call sets\label{tab:indel}}
{%\begin{tabular*}{\textwidth}{@{\extracolsep{\fill}}lrrrrrrrr}
\begin{tabular}{lrrrrrrr}
\toprule
& MD & CG & BS & CV & FC & MI & ALL \\
\midrule
MD & 240424 & 0.819 & 0.937 & 0.678 & 0.947 & 0.054 & 0.977 \\
CG & 0.752 & 264696 & 0.915 & 0.629 & 0.924 & 0.052 & 0.965 \\
BS & 0.564 & 0.597 & 404646 & 0.498 & 0.844 & 0.044 & 0.906 \\
CV & 0.708 & 0.726 & 0.882 & 251769 & 0.902 & 0.052 & 0.923 \\
FC & 0.588 & 0.624 & 0.873 & 0.522 & 393841 & 0.045 & 0.952 \\
MI & 0.593 & 0.618 & 0.790 & 0.527 & 0.804 & 23216 & 0.864 \\
\botrule
\end{tabular}}
{INDELs that start within a homopolymer run longer than 6bp are excluded in all
call sets. An INDEL in call set $R$ (indexed by row) is said to be \emph{found}
in call set $C$ (indexed by column) if there exists an INDEL in $C$ such that
the left-aligned starting positions of the two INDELs are within 20bp from each
other. An INDEL in $R$ is considered to be found in `ALL' if it is found in
one of the other INDEL sets in the table, plus the AC call set. In the table,
a number on the diagonal equals $|R|$, the number of INDEL calls in the call
set. The fraction equals $|\{g\in R:g \mbox{ is found in } C\}|/|R|$.}
\end{table}
\vspace*{-1em}
\section{DISCUSSION}
In this article, we derived the FMD-index by storing both forward and reverse
complement DNA sequences in one FM-index. This simple modification enables faster
forward-backward search than the bidirectional BWT~\citep{Lam:2009fk} and makes
the FMD-index a more natural representation of DNA sequences. Based on the FMD-index,
we developed a new de novo assembler, fermi, which achieves similar quality
to other mainstream assemblers.
We demonstrated that it is possible to call SNPs and short INDELs by aligning
assembled unitigs to the reference genome. This approach has similar SNP
accuracy to the standard mapping based SNP calling and arguably outperforms
the existing methods on INDEL calling in terms of both sensitivity and
precision. Assembly based variant calling is a practical and beneficial
complement to mapping based calling.
In the course of evaluating INDEL accuracy, we found that outside long
homopolymer regions, INDEL call sets do not often contain false positives, but
they may have a high false negative rate, which leads to the apparently small
overlap between call sets~\citep{Lam:2012fk}.
As a theoretical remark, we note that with read counts kept, unitigs are a
\emph{lossless} but \emph{reduced} representation of sequence reads. They are
`reduced' in that individual reads are lost; they are `lossless' in that all
the information in reads, such as small variants, copy numbers and structural
changes are fully preserved in unitigs, as long as they are constructed
correctly. For single-end reads, it is theoretically possible to `compress'
reads to unitigs, which are largely non-redundant and much smaller in size.
Accurately and efficiently constructing unitigs might provide an interesting
alternative for data storage and downstream analyses in the future, though practical
challenges, such as the high computational cost and the limited accuracy of
unitigs, remain at present.
\vspace*{-1em}
\section*{ACKNOWLEDGEMENTS}
We are grateful to David Reich, Evan Eichler and Peter Sudmant for providing
additional data and computing resources for evaluating fermi, to David
Altshuler and the GSA group at Broad for the helpful discussions and to Richard
Durbin and Jared Simpson for their comments on the initial draft of the
manuscript. We also thank Shaun Jackman for providing the ABySS
assembly and Michael Brudno for his insight into the algorithm time complexity.
\paragraph{Funding\textcolon} NIH 1U01HG005208-01.
\bibliography{fermi}
\end{document}