Unicycler is an assembly pipeline for bacterial genomes. It can assemble Illumina-only read sets where it functions as a SPAdes-optimiser. It can also assembly long-read-only sets (PacBio or Nanopore) where it runs a miniasm+Racon pipeline. For the best possible assemblies, give it both Illumina reads and long reads, and it will conduct a hybrid assembly.
Read more about Unicycler here:
And read about how we use it to complete bacterial genomes here:
- Introduction
- Requirements
- Installation
- Quick usage
- Background
- Method: Illumina-only assembly
- Method: long-read-only assembly
- Method: hybrid assembly
- Conservative, normal and bold
- Options and usage
- Output files
- Tips
- Other included tools
- Paper
- Acknowledgements
- License
As input, Unicycler takes one of the following:
- Illumina reads from a bacterial isolate (ideally paired-end, but unpaired works too)
- A set of long reads (either PacBio or Nanopore) from a bacterial isolate (uncorrected long reads are fine, though corrected long reads should work too)
- Illumina reads and long reads from the same isolate (best case)
Reasons to use Unicycler:
- It circularises replicons without the need for a separate tool like Circlator.
- It handles plasmid-rich genomes.
- It can use long reads of any depth and quality in hybrid assembly. 10x or more may be required to complete a genome, but Unicycler can make nearly-complete genomes with far fewer long reads.
- It produces an assembly graph in addition to a contigs FASTA file, viewable in Bandage.
- It has very low misassembly rates.
- It can cope with very repetitive genomes, such as Shigella.
- It's easy to use: runs with just one command and usually doesn't require tinkering with parameters.
Reasons to not use Unicycler:
- You're assembling a eukaryotic genome or a metagenome (Unicycler is designed exclusively for bacterial isolates).
- Your Illumina reads and long reads are from different isolates (Unicycler struggles with sample heterogeneity).
- You're impatient (Unicycler is thorough but not especially fast).
- Linux or macOS
- Python 3.4 or later
- C++ compiler with C++14 support:
- setuptools (only required for installation of Unicycler)
- For short-read or hybrid assembly:
- SPAdes v3.6.2 or later (
spades.py
)
- SPAdes v3.6.2 or later (
- For long-read or hybrid assembly:
- Racon (
racon
)
- Racon (
- For polishing
- For rotating circular contigs:
- BLAST+ (
makeblastdb
andtblastn
)
- BLAST+ (
Some of these dependencies can be removed by turning off parts of the Unicycler pipeline (e.g. --no_polish
). Unicycler expects these tools to be available in $PATH
. If they aren't, you can specify their location using Unicycler options (e.g. --samtools_path
).
Bandage isn't required to run Unicycler, but it is very helpful for manually investigating assemblies (the graph images in this README were made with Bandage). Make sure to get the latest release for full compatibility with Unicycler.
These instructions install the most up-to-date version of Unicycler:
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install
Notes:
- If the last command complains about permissions, you may need to run it with
sudo
. - If you want a particular version of Unicycler, download the source from the releases page instead of cloning from GitHub.
- Install just for your user:
python3 setup.py install --user
- If you get a strange 'can't combine user with prefix' error, read this.
- Install to a specific location:
python3 setup.py install --prefix=$HOME/.local
- Install with pip (local copy):
pip3 install path/to/Unicycler
- Install with pip (from GitHub):
pip3 install git+https://github.com/rrwick/Unicycler.git
- Install with specific Makefile options:
python3 setup.py install --makeargs "CXX=icpc"
This approach compiles Unicycler code, but doesn't copy executables anywhere:
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
make
Now instead of running unicycler
, you instead use path/to/unicycler-runner.py
.
Illumina-only assembly:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -o output_dir
Long-read-only assembly:
unicycler -l long_reads_high_depth.fastq.gz -o output_dir
Hybrid assembly:
unicycler -1 short_reads_1.fastq.gz -2 short_reads_2.fastq.gz -l long_reads.fastq.gz -o output_dir
If you don't have any reads of your own, take a look in the sample_data
directory for links to some small read sets.
To understand what Unicycler is doing, you need to know about assembly graphs. For a thorough introduction, I'd suggest this tutorial or the Velvet paper. But in short, an assembly graph is a data structure where contigs aren't disconnected sequences but can have connections to each other:
Just contigs: Assembly graph:
TCGAAACTTGACGCGAGTCGC CTTGTTTA
TGCTACTGCTTGATGATGCGG / \
TGTCCATT TCGAAACTTGACGCGAGTCGC TGCTACTGCTTGATGATGCGG
CTTGTTTA \ /
TGTCCATT
Most assemblers use graphs internally to produce their assemblies, but users often ignore the graph in favour of the conceptually simpler FASTA file of contigs. When a genome assembly is 100% complete, we have one contig per chromosome/plasmid and there's no real need for the graph. But most assemblies are not complete (especially short-read assemblies), and a graph can describe an incomplete assembly much better than contigs alone.
The main reason we can't get a complete assembly from short reads is that DNA usually contains repeats – the same sequence occurring two or more times in the genome. When a repeat is longer than the reads (or for paired-end sequencing, longer than the insert size), it forms a single contig in the assembly graph with multiple connections in and multiple connections out.
Here is what happens to a simple bacterial assembly graph as you add repeats to the genome:
As repeats are added, the graph becomes increasingly tangled (and real assembly graphs get a lot more tangled than that).
To complete a bacterial genome assembly (i.e. find the one correct sequence for each chromosome/plasmid), we need to resolve the repeats. This means finding which way into a repeat matches up with which way out. Short reads don't have enough information for this but long reads do.
Assembly graphs come in many different varieties, but we are particularly interested in the kind produced by SPAdes, because that is what Unicycler uses.
SPAdes graphs are made by performing a De Bruijn graph assembly with a range of different k-mer sizes, from small to large (see the SPAdes paper). Each assembly builds on the previous one, which allows SPAdes to get the advantages of both small k-mer assemblies (a more connected graph) and large k-mer assemblies (ability to resolve repeats). Two contigs in a SPAdes graph that connect will overlap by their k-mer size (more info on the Bandage wiki page).
After producing the graph, SPAdes can perform further repeat resolution by using paired-end information. Since two reads in a pair are close to each other in the original DNA, SPAdes can use this to trace paths in the graph to form larger contigs (see their paper on ExSPAnder). However, the SPAdes contigs with repeat resolution do not come in graph form – they are only available in a FASTA file.
When assembling just Illumina reads, Unicycler functions mainly as a SPAdes optimiser. It offers a few benefits over using SPAdes alone:
- Tries a wide range of k-mer sizes and automatically selects the best.
- Filters out low-depth parts of the assembly to remove contamination.
- Applies SPAdes repeat resolution to the graph (as opposed to disconnected contigs in a FASTA file).
- Rejects low-confidence repeat resolution to reduce the rate of misassembly.
- Trims off graph overlaps so sequences aren't repeated where contigs join.
More information on the Illumina-only assembly process is described in the steps below.
Unicycler uses SPAdes' built-in read correction step before assembling the Illumina reads. This can be disabled with --no_correct
if your Illumina reads are very high quality or you've already performed read QC.
Unicycler uses SPAdes to assemble the Illumina reads into an assembly graph. It tries assemblies at a wide range of k-mer sizes, evaluating the graph at each one. It chooses the graph which best minimises both contig count and dead end count. If the Illumina reads are good, it produces an assembly graph with long contigs but few to no dead ends (more info here). Since a typical bacterial genome has no dead ends (the sequences are circular) an ideal assembly graph won't either.
A raw SPAdes graph can also contain some 'junk' sequences due to sequencer artefacts or contamination, so Unicycler performs some graph cleaning to remove these. Therefore, small amounts of contamination in the Illumina reads should not be a problem.
To scaffold the graph, Unicycler must distinguish between single copy contigs and repeats. It does this with a greedy algorithm that uses both read depth and graph connectivity:
This process does not assume that all single copy contigs have the same read depth, which allows it to identify single copy contigs from plasmids as well as the chromosome. After it has determined multiplicity, Unicycler chooses a set of 'anchor' contigs. These are sufficiently-long single-copy contigs suitable for bridging in later steps.
To reduce redundancy and allow for neatly circularised contigs, Unicycler removes all overlap in the graphs:
Before: After: GACGCGTTGACAAGGAAAT TGACAAGGAAAT / / TTGACTACCCAGACGCGT TTGACTACCCAGACGCGT \ \ GACGCGTCCTCTCATTCTA CCTCTCATTCTA
At this point, the assembly graph does not contain the SPAdes repeat resolution. To apply this to the graph, Unicycler builds bridges between single copy contigs using the information in the SPAdes contigs.paths
file.
Bridges are given a quality score, most importantly based on the length of the bridge compared to the length of the paired end insert size, so bridges which span a long repeat are given a low score. Since paired-end sequencing cannot resolve repeats longer than the insert size, bridges which attempt to span long repeats cannot be trusted. This selectivity helps to reduce the number of misassemblies.
When assembling just long reads, Unicycler uses a miniasm+Racon pipeline. It offers a couple advantages over using other long-read-only assemblers:
- Multiple rounds of Racon polishing give a good final sequence accuracy.
- Circular replicons (like most bacterial chromosomes and plasmids) assemble into circular replicons with no start-end overlap.
More information on the long-read-only assembly process is described in the steps below.
Unicycler uses minimap and miniasm to assemble the long reads in essentially the same manner as described in the miniasm README. This produces an uncorrected assembly which is made directly of pieces of reads – the assembly error rate will be similar to the read error rate.
The version of miniasm that comes with Unicycler is slightly modified in a couple of ways. The first modification is to help circular replicons assemble into circular string graphs. The other modification only applies to hybrid assembly, so I'll come back to that!
After miniasm assembly, Unicycler carries out multiple rounds of polishing with Racon to improve the sequence accuracy. It will polish until the assembly stops improving, as measured by the agreement between the reads and the assembly. Circular replicons are 'rotated' (have their starting position shifted) between rounds of polishing to ensure that no part of the sequence is left unpolished.
Hybrid assembly (using both Illumina read and long reads) is where Unicycler really shines. Like with the Illumina-only pipeline described above, Unicycler will produce an Illumina assembly graph. It then uses long reads to build bridges, which often allows it to resolve all repeats in the genome, resulting in a complete genome assembly.
In hybrid assembly, Unicycler carries out all the steps in the Illumina-only pipeline, plus the additional steps below:
This step uses miniasm and Racon, and is very much like the long-read-only assembly method described above. Here however, the assembly is not just on long reads but a mixture of long reads and anchor contigs from the Illumina-only assembly. Since these anchor contigs can often be much longer than long reads (sometimes hundreds of kbp), they can significantly help the assembly. This takes advantage of the other modification to miniasm which was teased above. In Unicycler's miniasm, contigs and long reads are treated slightly differently in the string graph manipulations to better perform this step.
After the assembly is finished, Unicycler finds anchor contigs in the assembled sequence and uses the intervening sequences to create bridges:
assembled sequence: TATGGTCTCGCATGTTAATTCTACTCCCGAACTTGGCCCATCCCCGGCTAGGCTGGGCACTAGACGGTGGAT
anchor contigs: GTCTCGCATGTTAA ACTCCCGAACTTGGCCCATCCCCGGC GGCACTAGACGGTGG
intervening sequences for bridges: TTCT TAGGCTG
Unicycler also attempts to make long-read bridges directly by semi-globally aligning the long reads to the assembly graph. For each pair of single copy contigs which are linked by read alignments, Unicycler uses the read consensus sequence to find a connecting path and creates a bridge.
This step and the previous step are somewhat redundant, as both use long reads to build bridges between short-read contigs. They are both included because they have different strengths. The previous approach can tolerate low long-read depth but requires a good short-read assembly graph (i.e. few dead ends). This step requires decent long-read depth but can tolerate poor short-read assembly graphs. By using the two strategies together, Unicycler can successfully handle many types of input.
At this point of the pipeline there can be many bridges, some of which may conflict. Unicycler therefore assigns a quality score to each based on all available evidence (e.g. read alignment quality, graph path match, read depth consistency). Bridges are then applied in order of decreasing quality so whenever there is a conflict, only the most supported bridge is used. A minimum quality threshold prevents the application of low evidence bridges (see Conservative, normal and bold for more information).
If the above steps have resulted in any simple, circular sequences, then Unicycler will attempt to rotate/flip them to begin at a consistent starting gene. By default this is dnaA or repA, but users can specify their own with the --start_genes
option.
Finally, Unicycler does multiple rounds of short-read polishing using Pilon.
Unicycler can be run in three modes: conservative, normal (the default) and bold, set with the --mode
option. Conservative mode is least likely to produce a complete assembly but has a very low risk of misassembly. Bold mode is most likely to produce a complete assembly but carries greater risk of misassembly. Normal mode is intermediate regarding both completeness and misassembly risk.
If the structural accuracy of your assembly is paramount to your research, conservative mode is recommended. If you want a completed genome, even if it contains a mistake or two, then use bold mode.
The specific differences between the three modes are as follows:
Mode | Invocation | Short read bridges | Bridge quality threshold | Contig merging |
---|---|---|---|---|
conservative | ‑‑mode conservative |
not used | high (25) | contigs are only merged with bridges |
normal | ‑‑mode normal (or nothing) |
used | medium (10) | contigs are merged with bridges and when their multiplicity is 1 |
bold | ‑‑mode bold |
used | low (1) | contigs are merged wherever possible |
In the above example, the conservative assembly is incomplete because some bridges fell below the quality threshold and were not applied. Its contigs, however, are extremely reliable. Normal mode nearly gave a complete assembly, but a couple of unmerged contigs remain. Bold mode completed the assembly, but since lower confidence regions were bridged and merged, there is a larger risk of error.
Run unicycler --help
to view the program's most commonly used options:
usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAIRED] [-l LONG] -o OUT
[--verbosity VERBOSITY] [--min_fasta_length MIN_FASTA_LENGTH] [--keep KEEP] [-t THREADS]
[--mode {conservative,normal,bold}] [--linear_seqs LINEAR_SEQS] [--vcf]
__
\ \___
\ ___\
//
____// _ _ _ _
//_ //\\ | | | | |_| | |
// \// \\ | | | | _ __ _ ___ _ _ ___ | | ___ _ __
|| (O) || | | | || '_ \ | | / __|| | | | / __|| | / _ \| '__|
\\ \_ // | |__| || | | || || (__ | |_| || (__ | || __/| |
\\_____// \____/ |_| |_||_| \___| \__, | \___||_| \___||_|
__/ |
|___/
Unicycler: an assembly pipeline for bacterial genomes
Help:
-h, --help Show this help message and exit
--help_all Show a help message with all program options
--version Show Unicycler's version number
Input:
-1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)
-2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)
-s UNPAIRED, --unpaired UNPAIRED
FASTQ file of unpaired short reads (optional)
-l LONG, --long LONG FASTQ or FASTA file of long reads (optional)
Output:
-o OUT, --out OUT Output directory (required)
--verbosity VERBOSITY Level of stdout and log file information (default: 1)
0 = no stdout, 1 = basic progress indicators, 2 = extra info,
3 = debugging info
--min_fasta_length MIN_FASTA_LENGTH
Exclude contigs from the FASTA file which are shorter than this
length (default: 100)
--keep KEEP Level of file retention (default: 1)
0 = only keep final files: assembly (FASTA, GFA and log),
1 = also save graphs at main checkpoints,
2 = also keep SAM (enables fast rerun in different mode),
3 = keep all temp files and save all graphs (for debugging)
--vcf Produce a VCF by mapping the short reads to the final assembly
(experimental, default: do not produce a vcf file)
Other:
-t THREADS, --threads THREADS Number of threads used (default: 8)
--mode {conservative,normal,bold}
Bridging mode (default: normal)
conservative = smaller contigs, lowest misassembly rate
normal = moderate contig size and misassembly rate
bold = longest contigs, higher misassembly rate
--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in
the underlying sequence (default: 0)
Run unicycler --help_all
to see a complete list of the program's options. These allow you to turn off parts of the pipeline, specify the location of tools (only necessary if they are not in PATH) and adjust various settings:
SPAdes assembly:
These options control the short-read SPAdes assembly at the beginning of the Unicycler
pipeline.
--spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)
--no_correct Skip SPAdes error correction step (default: conduct SPAdes error
correction)
--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of
the read length (default: 0.2)
--max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction
of the read length (default: 0.95)
--kmers KMERS Exact k-mers to use for SPAdes assembly, comma-separated
(example: 22,33,44, default: automatic)
--kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)
--depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal
depth, if doing so does not result in graph dead ends (default:
0.25)
--spades_tmp_dir SPADES_TMP_DIR
Specify SPAdes temporary directory using the SPAdes --tmp-dir
option (default: make a temporary directory in the output
directory)
miniasm+Racon assembly:
These options control the use of miniasm and Racon to produce long-read bridges.
--no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to
produce long-read bridges)
--racon_path RACON_PATH Path to the Racon executable (default: racon)
--existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY
A pre-prepared long read assembly for the sample in GFA format.
If this option is used, Unicycler will skip the miniasm/Racon
steps and instead use the given assembly (default: perform long
read assembly using miniasm/Racon)
Assembly rotation:
These options control the rotation of completed circular sequence near the end of the
Unicycler pipeline.
--no_rotate Do not rotate completed replicons to start at a standard gene
(default: completed replicons are rotated)
--start_genes START_GENES FASTA file of genes for start point of rotated replicons
(default: start_genes.fasta)
--start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene
search (default: 90.0)
--start_gene_cov START_GENE_COV
The minimum required BLAST percent coverage for a start gene
search (default: 95.0)
--makeblastdb_path MAKEBLASTDB_PATH
Path to the makeblastdb executable (default: makeblastdb)
--tblastn_path TBLASTN_PATH Path to the tblastn executable (default: tblastn)
Pilon polishing:
These options control the final assembly polish using Pilon at the end of the Unicycler
pipeline.
--no_pilon Do not use Pilon to polish the final assembly (default: Pilon is
used)
--bowtie2_path BOWTIE2_PATH Path to the bowtie2 executable (default: bowtie2)
--bowtie2_build_path BOWTIE2_BUILD_PATH
Path to the bowtie2_build executable (default: bowtie2-build)
--samtools_path SAMTOOLS_PATH Path to the samtools executable (default: samtools)
--pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file
(default: pilon)
--java_path JAVA_PATH Path to the java executable (default: java)
--min_polish_size MIN_POLISH_SIZE
Contigs shorter than this value (bp) will not be polished using
Pilon (default: 10000)
VCF:
These options control the production of the VCF of the final assembly.
--bcftools_path BCFTOOLS_PATH Path to the bcftools executable (default: bcftools)
Graph cleaning:
These options control the removal of small leftover sequences after bridging is complete.
--min_component_size MIN_COMPONENT_SIZE
Graph components smaller than this size (bp) will be removed from
the final graph (default: 1000)
--min_dead_end_size MIN_DEAD_END_SIZE
Graph dead ends smaller than this size (bp) will be removed from
the final graph (default: 1000)
Long read alignment:
These options control the alignment of long reads to the assembly graph.
--contamination CONTAMINATION FASTA file of known contamination in long reads
--scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap
open, gap extend (default: 3,-6,-5,-2)
--low_score LOW_SCORE Score threshold - alignments below this are considered poor
(default: set threshold automatically)
Unicycler's most important output files are assembly.gfa
, assembly.fasta
and unicycler.log
. These are produced by every Unicycler run. Which other files are saved to its output directory depends on the value of --keep
:
--keep 0
retains only the important files. Use this setting to save drive space.--keep 1
(the default) also saves some intermediate graphs.--keep 2
also retains the SAM file of long-read alignments to the graph. This ensures that if you rerun Unicycler with the same output directory (for example changing the mode to conservative or bold) it will run faster because it does not have to repeat the alignment step.--keep 3
retains all files and saves many intermediate graphs. This is for debugging purposes and uses a lot of space. Most users should probably avoid this setting.
All files and directories are described in the table below. Intermediate output files (everything except for assembly.gfa
, assembly.fasta
and unicycler.log
) will be prefixed with a number so they are in chronological order.
File | Description | --keep level |
---|---|---|
spades_assembly/ | directory containing all SPAdes files and each k-mer graph | 3 |
best_spades_graph.gfa | the best SPAdes short-read assembly graph, with a bit of graph clean-up | 1 |
overlaps_removed.gfa | overlap-free version of the SPAdes graph, with some more graph clean-up | 3 |
miniasm_assembly/ | directory containing miniasm string graphs and unitig graphs | 3 |
read_alignment/ | directory containing long_read_alignments.sam |
3 |
bridges_applied.gfa | bridges applied, before any cleaning or merging | 1 |
cleaned.gfa | redundant contigs removed from the graph | 3 |
merged.gfa | contigs merged together where possible | 3 |
final_clean.gfa | more redundant contigs removed | 1 |
rotated.gfa | circular replicons rotated and/or flipped to a start position | 1 |
polished.gfa | after a round of Pilon polishing | 1 |
assembly.gfa | final assembly in GFA v1 graph format | 0 |
assembly.fasta | final assembly in FASTA format (same contigs as in assembly.gfa) | 0 |
unicycler.log | Unicycler log file (same info as stdout) | 0 |
Unicycler is thorough and accurate, but not particularly fast. The direct long read bridging step of the pipeline can take a while to complete. Two main factors influence the running time: the number of long reads (more reads take longer to align) and the genome size/complexity (finding bridge paths is more difficult in complex graphs).
Unicycler may only take an hour or so to assemble a small, simple genome with low depth long reads. On the other hand, a complex genome with many long reads may take 12 hours to finish or more. If you have a very high depth of long reads, you can make Unicycler run faster by subsampling for only the longest reads.
Using a lot of threads (with the --threads
option) can make Unicycler run faster too. It will only use up to 8 threads by default, but if you're running it on a big machine with lots of CPU and RAM, feel free to use more!
Unicycler also works with PyPy which can speed up parts of its pipeline. However, some of Unicycler's slowest steps are when it calls other tools (like SPAdes) or uses C++ code, so PyPy may not help much. I haven't tested this thoroughly – if you try it, let me know how you go!
The length of a long read is very important, typically more than its accuracy, because longer reads are more likely to align to multiple single copy contigs, allowing Unicycler to build bridges.
Consider a sequence with a 2 kb repeat:
In order to resolve the repeat, a read must span it by aligning to some sequence on either side. In this example, the 1 kb reads are shorter than the repeat and are useless. The 2.5 kb reads can resolve the repeat, but they have to be in just the right place to do so. Only one out of the six in this example is useful. The 5 kb reads, however, have a much easier time spanning the repeat and all three are useful.
So how long must your reads be for Unicycler to complete an assembly? Longer than the longest repeat in the genome. Depending on the genome, that might be a 1 kb insertion sequence, a 6 kb rRNA operon or a 50 kb prophage. If your reads are just a bit longer than the longest repeat, you'll probably need a lot of them. If they are much longer, then fewer reads should suffice. But in any scenario, longer is better!
Poretools can turn your Nanopore FAST5 reads into a FASTQ file appropriate for Unicycler. Here's an example command:
poretools fastq --type best --min-length 1000 path/to/fast5/dir/ > nanopore_reads.fastq
If you have 2D reads, --type best
makes Poretools give only one FASTQ read per FAST5 file (if you have 1D reads, you can exclude that option). Adjust the --min-length 1000
parameter to suit your dataset – a larger value would be appropriate if you have lots of long reads.
Since Unicycler can tolerate low accuracy reads, Oxford Nanopore 1D sequencing is preferable to 2D, as it can provide twice as many reads. Unicycler will of course work with 2D reads, but you can get more out of your flow cell with 1D.
Update: 2D reads are now a thing of the past, replaced by 1D2. I haven't tried 1D2 yet, thought I suspect the above advice still holds.
If you are using Oxford Nanopore reads, then I'd recommend using Porechop before assembly. This will trim adapters off read ends and split some chimeric reads, both of which make Unicycler's job a little bit easier.
Unicycler prefers decent Illumina reads as input – ideally with uniform read depth and 100% genome coverage. Bad Illumina read sets can still work in Unicycler, but greater long-read depth will be required to compensate.
You can look at the best_spades_graph.gfa
file (the first graph Unicycler saves to file) in Bandage to get a quick impression of the Illumina read quality:
A is an very good Illumina read graph – the contigs are long and there are no dead ends. This read set is ideally suited for use in Unicycler and shouldn't require too many long reads to complete (10–15x would probably be enough).
B is also a good graph. The genome is more complex, resulting in a more tangled structure, but there are still very few dead ends (you can see one in the lower left). This read set would also work well in Unicycler, though more long reads may be required to get a complete genome (maybe 20x or so).
C is a disaster! It is broken into many pieces, probably because parts of the genome got no read depth at all. This genome may take lots of long reads to complete in Unicycler, possibly 30x or more. The final assembly will probably have more small errors (SNPs and indels), as parts of the genome cannot be polished well with Illumina reads.
Confused by very small (e.g. 2 bp) contigs in Unicycler assemblies? Unlike a SPAdes graph where neighbouring sequences overlap by their k-mer size, Unicycler's final graph has no overlaps and the sequences adjoin directly. This means that contigs in complex regions can be quite short. They may be useless as stand-alone contigs but are still important in the graph structure.
If short contigs are a problem for your downstream analysis, you can use the --min_fasta_length
to exclude them from Unicycler's FASTA file (they will still be included in the GFA file).
Unicycler normalises the depth of contigs in the graph to the median value. This typically means that the chromosome has a depth near 1x and plasmids may have different (typically higher) depths.
In the above graph, the chromosome is at the top (you can only see part of it) and there are two plasmids. The plasmid on the left occurs in approximately 4 or 5 copies per cell. For the larger plasmid on the right, most cells probably had one copy but some had more. Since sequencing biases can affect read depth, these per cell counts should be interpreted loosely.
If your long reads have known contamination, you can use the --contamination
option to give Unicycler a FASTA file of the contaminant sequences. Unicycler will then discard any reads for which the best alignment is to the contaminant.
For example, if you've sequenced two isolates in succession on the same Nanopore flow cell, there may be residual reads from the first sample in the second run. In this case, you can supply a reference/assembly of the first sample to Unicycler when assembling the second sample.
Some Oxford Nanopore protocols include a lambda phage spike-in as a control. Since this is a common contaminant, you can simply use --contamination lambda
to filter these out (no need to supply a FASTA file).
If Unicycler makes a serious mistake during its multiplicity determination, this can have detrimental effects on the rest of the assembly. I've seen this happen mainly in two scenarios:
- when the Illumina graph is badly fragmented (multiplcity determination has few graph connections to work with)
- there are multiple very similar plasmids in the genome (shared sequences between plasmids can be huge, 10s of kbp)
If you believe this has happened in your assembly, you can manually assign multiplicities and try the assembly again. Here's the process:
- View the short read assembly (
001_best_spades_graph.gfa
) in Bandage and view the region in question. Note that Unicycler's graph colour scheme uses green for single-copy segments and yellow/orange/red for multi-copy segments. - For any segments where you disagree with Unicycler's multiplicity, add a
ML
tag to the GFA segment line in001_best_spades_graph.gfa
. Examples:- If Unicycler called segment 50 single-copy but you think it's actually a 2-copy repeat, add
ML:i:2
to the end of the GFA line starting withS 50
. - If Unicycler called segment 107 multi-copy but you think it's actually single-copy, add
ML:i:1
to the end of the GFA line starting withS 107
.
- If Unicycler called segment 50 single-copy but you think it's actually a 2-copy repeat, add
- Run Unicycler again, pointing to the same output directory (with your modified
001_best_spades_graph.gfa
file). It will take your manually assigned multiplicities into account and hopefully do better!
If Unicycler doesn't complete your bacterial genome assembly on its own, you may be able to complete it manually with a bit of bioinformatics detective work. There's no single, straight-forward procedure for doing so, but I've put together a few examples on the Unicycler wiki which may be helpful.
If you have a long-read assembly that you've prepared outside Unicycler and trust (e.g. with Canu), you can give it to Unicycler with --existing_long_read_assembly
. Unicycler will then skip its miniasm/Racon step and use this assembly instead.
Unicycler also comes with a few other tools which may be of interest:
- Unicycler-align: semi-global alignment of long reads
- Unicycler-polish: hybrid assembly polishing
- Unicycler-scrub: read scrubber for trimming ends and splitting chimeras
- Unicycler-check: misassembly detection and alignment visualisation
These tools may be experimental, incomplete or no longer under development, so use with caution!
An open access article describing Unicycler is available from PLOS Computational Biology:
Unicycler: Resolving bacterial genome assemblies from short and long sequencing reads
If you use Unicycler in your research, a citation would be appreciated!
Unicycler would not have been possible without Kat Holt, my fellow researchers in her lab and the many other people I work with at the University of Melbourne's Bio21 Molecular Science & Biotechnology Institute. In particular, Margaret Lam, Kelly Wyres, David Edwards and Claire Gorrie worked with me on many challenging genomes during Unicycler's development. Louise Judd is a whizz with the MinION and produced many of the long reads I have used when developing Unicycler.
Unicycler uses SeqAn to perform alignments and other sequence manipulations. The authors of this library have been very helpful during Unicycler's development and I owe them a great deal of thanks! It also uses minimap for alignment and miniasm for long read assembly, and so I'd like to thank Heng Li for these tools. Finally, Unicycler uses nanoflann, a delightfully fast and lightweight nearest neighbour library, to perform its line-finding in semi-global alignment.