A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).
- Snakemake
- mmseqs2
- Bakta
- Biopython
- CheckM
- Pandas
- Rust toolchain
- Panaroo
NOTE: Conda is used to call different environments and dependencies (see Snakemake file).
Install the required packages using conda/mamba:
git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
cd MAG_pangenome_pipeline
mamba env create -f environment.yml
mamba activate celebrimbor
Download the required bakta database file:
bakta_db download --output /path/to/database
You can also use the light bakta database if using a suitable version of bakta:
bakta_db download --output /path/to/database --type light
Install cgt (will install the cgt_bacpop executable in the ./bin directory):
cargo install cgt_bacpop --root .
Or to build from source:
git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."
Alternatively, if you are having trouble with the steps above, you can use the CELEBRIMBOR Docker container. If you are comfortable running commands inside Docker containers and mounting your external files, the whole pipeline is available in the container, which can be pulled by running:
docker pull samhorsfield96/celebrimbor:main
Update config.yaml to specify workflow and directory paths.
- core: gene frequency cutoff for core gene, anything above this frequency is annotated as a core gene.
- output_dir: path to output directory. Does not need to exist prior to running.
- genome_fasta: path to directory containing fasta files (must have .fasta extension).
- bakta_db: path to bakta db downloaded above.
- cgt_exe: path to cgt executable.
- cgt_breaks: frequency for rare/core gene cutoff, e.g. 0.1,0.9, meaning genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
- cgt_error: sets false assignment rate of gene to particular frequency compartment.
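To make the cgt_breaks boundaries concrete, here is a minimal Python sketch of the interval logic only. It is not how cgt works internally: cgt assigns compartments probabilistically, taking genome completeness and cgt_error into account. The function names parse_breaks and naive_compartment are hypothetical and exist only for this illustration.

```python
def parse_breaks(cgt_breaks: str) -> tuple[float, float]:
    """Parse a 'rare,core' breakpoint string such as '0.1,0.9'."""
    rare_cut, core_cut = (float(x) for x in cgt_breaks.split(","))
    if not 0.0 <= rare_cut <= core_cut <= 1.0:
        raise ValueError(f"breaks must satisfy 0 <= rare <= core <= 1: {cgt_breaks}")
    return rare_cut, core_cut


def naive_compartment(freq: float, cgt_breaks: str = "0.1,0.9") -> str:
    """Label a frequency using plain thresholds (illustration only, not cgt's model)."""
    rare_cut, core_cut = parse_breaks(cgt_breaks)
    if freq < rare_cut:       # freq < 0.1 with the default breaks
        return "rare"
    if freq < core_cut:       # 0.1 <= freq < 0.9 with the default breaks
        return "middle"
    return "core"             # freq >= 0.9 with the default breaks


if __name__ == "__main__":
    for f in (0.05, 0.1, 0.5, 0.9):
        print(f, naive_compartment(f))
```

Note that the lower breakpoint is inclusive for middle and the upper breakpoint is inclusive for core, matching the intervals described above.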
Run snakemake (must be in the same directory as the Snakemake file):
snakemake --cores <cores>
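Because input files must have the .fasta extension, it can help to check the genome_fasta directory before a run. The sketch below is not part of the pipeline; split_by_extension is a hypothetical helper, and it assumes the extension match is exact and lowercase.

```python
from pathlib import Path


def split_by_extension(genome_dir):
    """Return (accepted, skipped) file names in genome_dir, each sorted by name.

    Assumption: only files whose final suffix is exactly '.fasta' are accepted.
    """
    accepted, skipped = [], []
    for path in sorted(Path(genome_dir).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        (accepted if path.suffix == ".fasta" else skipped).append(path.name)
    return accepted, skipped
```

Calling split_by_extension("/path/to/genomes") returns two lists; anything in the second list (for example files ending in .fa or .fasta.gz) would need renaming or decompressing first under the stated assumption.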
This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and genome completeness.
- Predict genes in all FASTA files in given directory using bakta
- Cluster genes using mmseqs2 and generate a gene presence/absence matrix
- Generate a pangenome summary of observed gene frequencies
- Calculate genome completeness using CheckM
- Probabilistically assign each gene family as core|middle|rare using cgt
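As an illustration of the observed gene frequencies that the summary step reports, the following Python sketch computes them from a toy presence/absence matrix. The data layout (a dict of gene family to per-genome 0/1 calls) and the names gene_frequencies and toy are assumptions made for this example; they do not reflect the pipeline's actual file formats.

```python
def gene_frequencies(presence):
    """Map each gene family to the fraction of genomes in which it was observed.

    presence: {gene_family: {genome_name: 0 or 1}} (hypothetical layout).
    """
    freqs = {}
    for family, calls in presence.items():
        freqs[family] = sum(calls.values()) / len(calls) if calls else 0.0
    return freqs


# Toy matrix: three gene families across four MAGs.
toy = {
    "famA": {"mag1": 1, "mag2": 1, "mag3": 1, "mag4": 1},
    "famB": {"mag1": 1, "mag2": 0, "mag3": 1, "mag4": 0},
    "famC": {"mag1": 0, "mag2": 0, "mag3": 0, "mag4": 1},
}

if __name__ == "__main__":
    for family, freq in gene_frequencies(toy).items():
        print(family, freq)
```

Since MAGs are often incomplete, a gene can be missing from a genome simply because that region was not assembled, so observed frequencies like these can understate the true ones. This is why the final step combines them with the CheckM completeness estimates rather than thresholding them directly.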