Another snakemake module for metagenomic assembly and binning. Why, you might ask? Here are some of the reasons that motivated me to build this tool:
- Available pipelines are mostly all-in-one solutions with no modularity. You can't replace a binner with another binner. You can't compare binners.
- Available pipelines incorporate unnecessary tools and packages. Who needs to download 1Gb of codebase for binning?
- Most of the (old) binning refiners are not coded properly: poor CPU usage, shady options, ...
- Most of the binning tools do not incorporate quality controls of your bins. I wonder why.
- Most of the pipelines assume you have perfect data. What happens if your sample does not yield a single bin?
This pipeline was at first a translation and cleaner version of the metaLAS pipeline, itself very inspired by the metaWRAP pipeline. The current pipeline is now much more complete, including an optional assembly step, a complete quality module, reassembly options, additional binners, and quality assessment tools.
Most importantly, this pipeline is highly modularized, incorporates multiple binners (while it is impossible to keep up with every new binner, adding a new one should be easy), and only downloads the packages you need for your analysis.
- **Flexibility and reliability**
  - Includes multiple binners and assembly tools
  - Tuned and tested with multiple datasets
  - Deals with both short- and long-read data
- **A bin quality module**
- **Fine-tuning and versatility of the snakemake pipeline**
  - Each pipeline part is a module and can be changed and/or removed
  - Time-consuming parts were tuned in the pipeline to make the whole process faster
  - Allows easy implementation of new tools
- **Rewriting of the metaWRAP binning module**
  - Direct and faster implementation of the original python scripts for snakemake
  - Installing the whole metaWRAP (1Gb) can be avoided
  - Bins data in different ways:
    - concoct
    - maxbin2
    - metabat2
    - vamb (see specific note)
    - metabinner (not finished)
    - refined - similar to metaWRAP (see specific note)
  - Reassembly of the binning results
    - Currently, only unicycler can be used to reassemble bins
  - Bins quality
    - Additional information
    - MIMAG status
    - Bins coverage
Bins quality is based on these data:
- Completeness and contamination provided by checkm(2)
- Number of tRNAs provided by tRNAscan-SE
- Number of rRNAs provided by barrnap
- Chimerism information provided by GUNC
- Bins coverage based on mapping results against all assembly contigs
| MIMAG group | Completeness (%) | Contamination (%) | # distinct tRNAs | # 5S, 16S and 23S rRNAs |
|---|---|---|---|---|
| LOW | < 50 | > 10 | - | - |
| MEDIUM | > 50 | < 10 | - | - |
| NEAR COMPLETE | > 90 | < 5 | - | - |
| HIGH | > 90 | < 5 | 18 | At least one of each |
A couple of notes:
- GUNC results are not used for the MIMAG rank. I highly recommend either removing or at least carefully examining chimeric MAGs.
- Some studies do not use the rank but instead base their analysis on a score: `completeness - 5 x contamination`. MAGs can be selected using a threshold of score > 50. For example, a bin with 90% completeness and 4% contamination scores 90 - 5 x 4 = 70 and would be kept.
While it's in theory possible to run the pipeline without conda, using conda is highly recommended. This pipeline uses an extensive amount of software with potential conflicts between them.
- [MANDATORY] You need to install Snakemake and conda. Mamba is nice too.
- [MANDATORY] You need to download the GUNC database using gunc and provide the database path in the main snakemake file (see `megahit_metabat2.snk`).
- [MANDATORY] If you plan to use `checkm2`, you need to install it yourself and update its path in the main snakemake file (see `megahit_metabat2.snk`).
- [OPTIONAL] If you want to run the opera-ms assembly, you will need to install the software yourself using a conda environment and provide the OPERA-MS executable in the `assembly/Snakefile` file.
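For reference, downloading the GUNC database typically looks like the command below; the destination path is a placeholder and is up to you.

```bash
# Download the GUNC reference database; the chosen path must then be set
# in the main snakemake file (e.g. megahit_metabat2.snk).
gunc download_db /path/to/gunc_db
```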
By default, I set up 7 conda environments (which can be found in `workflow/envs`). The conda environment to use for each software is provided within the file `condaenvs.json`. By default, snakemake will automatically download and set up these environments, but you can replace any environment with one of your own by editing the json file.
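As a purely hypothetical illustration of the idea (check the actual `condaenvs.json` shipped with the repository for the real keys and values), the file maps each tool to the environment it should use:

```json
{
    "metabat2": "workflow/envs/binning.yaml",
    "checkm": "workflow/envs/binqual.yaml"
}
```

Pointing an entry to your own yaml file lets you pin specific tool versions without touching the rules themselves.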
I highly recommend having a look at the pipeline configuration process and the resource requirements first. Sample information (short reads) and assembly must be provided in a configuration file (named `config_binning.json`).
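For illustration, a minimal `config_binning.json` for one short-read sample with a precomputed assembly could look like the sketch below; the sample name and paths are placeholders, so adapt them from the provided template.

```json
{
    "MYSAMPLE": {
        "assembly": "/path/to/assembly.fa",
        "r1": "/path/to/reads_R1.fastq.gz",
        "r2": "/path/to/reads_R2.fastq.gz"
    }
}
```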
Once you have generated a configuration file (use the template), test the pipeline with a dry run:
```bash
snakemake -s binning.snk --configfile config_binning.json -d res_binning --use-conda --cores 18 -np
```
You can then run the pipeline this way:
```bash
snakemake -s binning.snk --configfile res_assembly/config_binning.json -d res_binning --use-conda --cores 18 --resources mem_mb=60000 --rerun-incomplete --keep-going
```
This will run the pipeline with 18 cores and 60Gb of memory.
Most of the assemblers are not completely set up in the workflow; this part needs to be finished. While the default `binning.snk` workflow expects to find an assembly in the configuration file, you can easily edit this file to include an assembler, as shown in the `megahit_metabat2.snk` example. Here is the important part at the beginning of that script:
```python
for sample, sdata in config.items():
    sdata['assembly'] = f'assembly/completed/megahit/{sample}.fa'
```
The usage of each component is automatically determined based on the required output of the pipeline, defined in the `all` rule of the main Snakefile. It is encoded this way: `binning/{binner}/quality_data/{postbinning}/tables/bins_info.tsv`, where `binner` can be one or several elements of the list `('metabat2', 'maxbin2', 'concoct', 'refined')` and `postbinning` can be one of `('raw', 'unicycler')`. For example, if you want to use `metabat2` with `unicycler` reassembly, you should request `binning/metabat2/quality_data/unicycler/tables/bins_info.tsv` (see the sketch below). Memory and core amounts are also currently hardcoded based on my experience of what fits best, but you might want to change that for your needs (see next section).
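As a minimal sketch (the rule in the shipped Snakefiles may be organized differently), an `all` rule requesting `metabat2` bins reassembled with `unicycler` could look like this:

```python
# Hypothetical sketch: request metabat2 bins reassembled with unicycler.
# The target follows binning/{binner}/quality_data/{postbinning}/tables/bins_info.tsv
rule all:
    input:
        expand(
            'binning/{binner}/quality_data/{postbinning}/tables/bins_info.tsv',
            binner=['metabat2'],
            postbinning=['unicycler'],
        )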
The binning algorithms can deal with both short-read and nanopore long-read data; just use `nanopore` as the key instead of `r1` and `r2` in your configuration file. Traditional binners are not expected to work well with current nanopore data (or anything with lower quality than Illumina reads). LRBinner might be a solution, but I still need to implement and compare this tool. A temporary solution is described below.
You will need to redefine the identity threshold used for the coverage extraction by `jgi_summary` (see here). This threshold is set by default to 85 for nanopore data, while I keep the script's default identity of 97 for short-read data. You can update this value per sample using the key `jgi_identity` in your configuration file, as such:
```json
{
    "MYSAMPLE": {
        "assembly": "/path/to/assembly.fa",
        "nanopore": "/path/to/longread.fq.gz",
        "jgi_identity": 90
    }
}
```
If you have both short and long reads, short reads are prioritized for mapping and long reads are ignored.
A spring file (compressed fastq) can be provided rather than fastq files. The current system assumes that your spring file encodes Illumina paired-end reads. To use it, just replace `r1` and `r2` with `spring` in your sample configuration.
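For example, a sample entry using a spring archive might look like this (paths are placeholders):

```json
{
    "MYSAMPLE": {
        "assembly": "/path/to/assembly.fa",
        "spring": "/path/to/reads.spring"
    }
}
```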
`Checkm2` is available and can be used flawlessly as a replacement for `checkm`. Note that, based on my tests, the two tools do not produce exactly the same results. One major technical advantage of `checkm2` over `checkm` is its independence from `pplacer`, which makes the former much faster and more memory-efficient. This might be especially useful when using the `refined` binner (see below). To use `checkm2`, you can use the alternative `binqual_checkm2.yaml` environment, link the binary in the main snakefile, and set up the flag in the internal configuration dictionary (see `megahit_metabat2.snk` for an example).
The snakemake pipeline implements a `refined` binner that mimics the metaWRAP approach:
- Generate all potential combinations of the metabat2, maxbin2 and concoct results
- Identify the bins showing the best contamination and completeness values
- Dereplicate potential duplicated contigs from these bins
Important note: This approach favors low contamination over high completeness. The first step merges the intersection of contigs from two bins sharing similar contigs and checks whether the intersection is better than the two original bins. The second step keeps the bin that shows the better contamination/completeness ratio and removes similar bins (based on 80% similarity). This means contigs can overlap between bins with this method.
On the resource side, the comparison of all possible combinations is based on CheckM (or CheckM2), which can make the process slow (see note below). Finally, while this script should make the whole process much faster, some thresholds were removed from the original design and you might get more low-quality bins with this pipeline (which you can filter later if you want).
An integration of VAMB is available in this pipeline. Note that VAMB is supposed to be run on a GPU, but the current implementation does not include this aspect, which makes the process slower but still doable. For a large dataset, you might want to run it with a GPU. In this case, I recommend first launching snakemake as usual with the option `--until vamb_paste_abundances`; this will create all the input files required by VAMB (which needs a mapping for each sample). Once done, modify the VAMB rule in `workflow/vamb.snk` by adding the `--cuda` option to the `vamb_params` variable. Relaunch the pipeline with the option `--until vamb_vamb` and some GPUs. Once finished, you can run the snakemake pipeline one last time as usual.
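For instance, the two staged runs could be launched like this, reusing the arguments from the standard run shown above:

```bash
# Stage 1: build all VAMB inputs (per-sample mappings and abundances) without running VAMB itself
snakemake -s binning.snk --configfile config_binning.json -d res_binning \
    --use-conda --cores 18 --until vamb_paste_abundances

# Stage 2 (on a GPU node, after adding --cuda to vamb_params in workflow/vamb.snk)
snakemake -s binning.snk --configfile config_binning.json -d res_binning \
    --use-conda --cores 18 --until vamb_vamb
```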
VAMB does not offer a defined pre- and post-processing of the data, which differs from the usual binners. In this pipeline, we use arbitrary filters: the VAMB options encoded in the `vamb_params` variable (`-o C -m 1000 --minfasta 500000`), and a filter on the output keeping only bins above 100,000 bp via the `minsize` parameter. You might want to modify these values.
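As a sketch of the knobs described above (the actual variable layout in `workflow/vamb.snk` may differ), the values to adjust are roughly:

```python
# Hypothetical sketch of the VAMB-related settings described above.
# Add '--cuda' to vamb_params to run VAMB on GPU (see the note above).
vamb_params = '-o C -m 1000 --minfasta 500000'

# Bins smaller than this size (in bp) are filtered out of the VAMB output.
minsize = 100000
```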
The pipeline requires several tools, each with variable needs. Note that it's nearly impossible to define the required computer resources since they highly depend on your input data (biological sample, assembly quality and sequencing depth). I list here the tools that might be a bottleneck. Overall, I recommend using a system with a minimum of 24 CPUs and at least ~100Gb of RAM (minimum 40Gb). I recommend declaring a few more CPUs than you actually have (5/10%) since some tool steps under-use the resources given to them.
The pipeline uses checkpoints several times to finely adapt its process to the generated bins. Therefore, the number of remaining rules will most likely increase during pipeline execution.
The CheckM lineage workflow uses pplacer to place each bin in a taxonomic tree. This process is memory intensive and scales linearly. In this pipeline, we assume that you might have a lot of bins, coming at different times; for this reason, we run CheckM individually for each sample. However, this might be suboptimal for a low number of bins and a low-memory setup; I hope this will not be a big issue for you. I noticed that CheckM crashes on some cluster configurations.
The binning refinement tool uses `checkM` completeness and contamination values on all binning results and binning combinations. This means that for a single sample with refinement from 3 binners, you have 3 + 4 `checkM` analyses to run. `checkM` internally uses `pplacer`, which is known to be slow and requires at least 35Gb of memory. This can be sped up by using additional `pplacer` threads, but the memory requirement is linear in the number of threads involved, so this is not an ideal solution. If memory is absolutely not an issue for you, you might want to change the `checkM` calls directly in the snakemake pipeline at `workflow/binqual.snk`. This issue is mostly resolved by using `CheckM2`, which does not use `pplacer`.
This step runs unicycler on each bin and can be particularly intensive, especially if you have high depth. The process uses 32 CPUs, and `unicycler` calls several tools, including spades, which is known to take a lot of memory for some well-covered samples, especially when it is heavily multi-threaded. You might need to reduce the rule to 16 CPUs. I do not recommend using the reassembly mode unless you have a low number of samples (< 20) or unlimited resources.
You might hit this message, linked to this issue, usually near the end of the pipeline. This is not a big issue: you can just run the pipeline again with the same arguments, and snakemake will continue where it stopped and finish the last rules.
- Allow long-read mapping when short-reads are not available
- Unicycler sometimes crashes
- Add more assemblers