Ultrafast de novo assembly for long noisy reads (though having no consensus step)

lh3/miniasm

Getting Started

# Install minimap and miniasm
git clone https://github.com/lh3/minimap && (cd minimap && make)
git clone https://github.com/lh3/miniasm && (cd miniasm && make)
# Overlapping
minimap/minimap -Sw5 -L100 -m0 -t8 reads.fa reads.fa | gzip -1 > reads.paf.gz
# Assembly
miniasm/miniasm -f reads.fa reads.paf.gz > reads.gfa

Introduction

Miniasm is a very fast OLC-based de novo assembler for noisy long reads. It takes all-vs-all read self-mappings (typically produced by minimap) as input and outputs an assembly graph in the GFA format. Unlike mainstream assemblers, miniasm does not have a consensus step. It simply concatenates pieces of read sequences to generate the final unitig sequences. Thus the per-base error rate of the unitigs is similar to that of the raw input reads.
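Because the unitig sequences are stored in the GFA output, a common follow-up is to pull them out as FASTA. The following is a minimal sketch, not part of miniasm itself. It assumes the unitig sequences appear on tab-separated segment lines of the form `S<TAB>name<TAB>sequence`, as in GFA 1; the function names `gfa_segments` and `gfa_to_fasta` are hypothetical.

```python
import io


def gfa_segments(handle):
    """Yield (name, sequence) for every segment (S) line in a GFA stream."""
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Segment lines start with "S"; other record types (H, L, a, ...) are skipped.
        if fields[0] == "S" and len(fields) >= 3:
            yield fields[1], fields[2]


def gfa_to_fasta(handle):
    """Return the segments of a GFA stream as one FASTA-formatted string."""
    return "".join(f">{name}\n{seq}\n" for name, seq in gfa_segments(handle))


if __name__ == "__main__":
    # Toy GFA text for illustration only; real input would be reads.gfa.
    toy = io.StringIO("S\tutg1\tACGTACGT\tLN:i:8\nL\tutg1\t+\tutg2\t-\t0M\n")
    print(gfa_to_fasta(toy), end="")
```

On a real assembly, the same function could be applied to the file written in Getting Started, for example `gfa_to_fasta(open("reads.gfa"))`.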

Miniasm is still at a very early stage of development. It has only been tested on twelve bacterial genomes sequenced with PacBio. Including the mapping step, it takes about 3 minutes to assemble a bacterial genome. Under the default settings, miniasm assembles 5 of the 12 data sets into a single contig. The 12 data sets are the PacBio E. coli sample, ERS473430, ERS544009, ERS554120, ERS605484, ERS617393, ERS646601, ERS659581, ERS670327, ERS685285, ERS743109 and a deprecated PacBio E. coli data set.

Miniasm shows that, at least for high-coverage bacterial genomes, it is possible to generate long contigs from raw PacBio reads without error correction. It also shows that minimap can be used as a read overlapper, even though it is probably not as sensitive as more sophisticated overlappers such as MHAP and DALIGNER. Coupled with long-read error correctors and consensus tools, miniasm may also be useful for producing high-quality assemblies.
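To get a rough feel for what the overlapper reports, the PAF file can be inspected before assembly. The sketch below is illustrative only and is not how miniasm processes overlaps. It assumes the standard 12 mandatory tab-separated PAF columns (query name, length, start, end, strand, target name, length, start, end, matching bases, alignment block length, mapping quality); `parse_paf` and `overlaps_per_read` are hypothetical helper names.

```python
import io
from collections import Counter, namedtuple

# The 12 mandatory PAF columns, in order.
Overlap = namedtuple(
    "Overlap",
    "qname qlen qstart qend strand tname tlen tstart tend nmatch alen mapq",
)


def parse_paf(handle):
    """Yield one Overlap per PAF line, ignoring optional tags after column 12."""
    for line in handle:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip blank or truncated lines
        yield Overlap(
            f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
            f[5], int(f[6]), int(f[7]), int(f[8]),
            int(f[9]), int(f[10]), int(f[11]),
        )


def overlaps_per_read(handle):
    """Count, for each read, the overlaps it takes part in (self-hits excluded)."""
    counts = Counter()
    for o in parse_paf(handle):
        if o.qname == o.tname:
            continue  # a read mapped to itself is not an overlap
        counts[o.qname] += 1
        counts[o.tname] += 1
    return counts


if __name__ == "__main__":
    # Toy record for illustration only; real input would be reads.paf.gz.
    toy = io.StringIO("r1\t1000\t500\t1000\t+\tr2\t1200\t0\t500\t400\t500\t255\n")
    print(dict(overlaps_per_read(toy)))
```

For the gzip-compressed file from Getting Started, the handle could be opened in text mode with `gzip.open("reads.paf.gz", "rt")` from the Python standard library.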