A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).
- Snakemake
- mmseqs2
- Bakta
- Biopython
- CheckM
- Pandas
- Rust
- Panaroo
NOTE: Conda is used to call different environments and dependencies (see Snakemake file).
Install the required packages using conda/mamba:
git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
cd MAG_pangenome_pipeline
mamba env create -f environment.yml
mamba activate celebrimbor
Download the required bakta database file:
bakta_db download --output /path/to/database
You can also use the light bakta database if using a suitable version of bakta:
bakta_db download --output /path/to/database --type light
Install cgt:
git clone https://github.com/bacpop/cgt.git
cd cgt
cargo build --release
Update config.yaml to specify workflow and directory paths.
- core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
- output_dir: path to the output directory. Does not need to exist prior to running.
- genome_fasta: path to the directory containing FASTA files (must have the .fasta extension).
- bakta_db: path to the bakta db downloaded above.
- cgt_exe: path to the cgt executable. The relative path will be cgt/target/release/cgt_bacpop.
- cgt_breaks: frequency cutoffs for rare/core genes, e.g. 0.1,0.9, meaning genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
- cgt_error: sets the false assignment rate of a gene to a particular frequency compartment.
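The cgt_breaks cutoffs can be read as two boundaries that split the frequency range into three compartments. The following is a minimal Python sketch of that reading only; the function names parse_breaks and assign_compartment are hypothetical, and the sketch applies the cutoffs to a plain frequency value. It is not how cgt itself works, since cgt uses a probabilistic model that also accounts for genome completeness.

```python
def parse_breaks(text):
    """Parse a 'low,high' string such as '0.1,0.9' into two floats.

    Hypothetical helper for illustration; not part of the pipeline.
    """
    low, high = (float(part) for part in text.split(","))
    if not 0.0 <= low < high <= 1.0:
        raise ValueError("breaks must satisfy 0 <= low < high <= 1")
    return low, high


def assign_compartment(frequency, breaks=(0.1, 0.9)):
    """Label a gene frequency as 'rare', 'middle' or 'core'.

    Follows the boundaries described for cgt_breaks:
    x < low is rare, low <= x < high is middle, x >= high is core.
    """
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must lie between 0 and 1")
    low, high = breaks
    if frequency < low:
        return "rare"
    if frequency < high:
        return "middle"
    return "core"


if __name__ == "__main__":
    breaks = parse_breaks("0.1,0.9")
    for freq in (0.05, 0.1, 0.5, 0.9):
        print(freq, assign_compartment(freq, breaks))
```

Note that both boundaries are inclusive on the upper compartment, so a frequency of exactly 0.1 is middle and exactly 0.9 is core.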
Run snakemake (must be in the same directory as the Snakemake file):
snakemake --cores <cores>
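Because input files must carry the .fasta extension, a quick check of the genome_fasta directory before launching the run can catch misnamed inputs early. The sketch below is an optional, hypothetical helper (list_fasta_inputs is not part of the pipeline) and assumes only what the configuration description states: that inputs sit directly in one directory and end in .fasta.

```python
from pathlib import Path


def list_fasta_inputs(genome_dir):
    """Return the sorted .fasta files found directly inside genome_dir.

    Hypothetical pre-flight helper; files with other extensions
    (for example .fa or .fna) are deliberately not returned.
    """
    directory = Path(genome_dir)
    if not directory.is_dir():
        raise NotADirectoryError(f"{genome_dir} is not a directory")
    return sorted(
        path
        for path in directory.iterdir()
        if path.is_file() and path.suffix == ".fasta"
    )


if __name__ == "__main__":
    import sys

    # Usage: python check_inputs.py /path/to/genome_fasta
    fasta_files = list_fasta_inputs(sys.argv[1])
    print(f"Found {len(fasta_files)} .fasta file(s)")
    for path in fasta_files:
        print(path.name)
```

If the count is lower than expected, renaming the remaining files to use the .fasta extension is the likely fix.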
This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and the completeness of the genomes.
- Predict genes in all FASTA files in given directory using bakta
- Cluster genes using mmseqs2 and generate a gene presence/absence matrix
- Generate a pangenome summary of observed gene frequencies
- Calculate genome completeness using CheckM
- Probabilistically assign each gene family as core|middle|rare using cgt
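To make the "observed gene frequencies" step concrete, the sketch below computes the fraction of genomes in which each gene family was found, starting from a toy presence/absence table. The function name observed_gene_frequencies, the family labels and the MAG names are all hypothetical, and the data structure is an assumption chosen for illustration; the pipeline's own matrix format may differ.

```python
def observed_gene_frequencies(presence, n_genomes):
    """Return the observed frequency of each gene family.

    presence maps a gene family name to the set of genomes in which
    at least one member of that family was predicted. n_genomes is
    the total number of genomes analysed.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return {
        family: len(genomes) / n_genomes
        for family, genomes in presence.items()
    }


if __name__ == "__main__":
    # Toy example with four hypothetical MAGs and three gene families.
    presence = {
        "family_A": {"MAG_1", "MAG_2", "MAG_3", "MAG_4"},
        "family_B": {"MAG_1", "MAG_2"},
        "family_C": {"MAG_3"},
    }
    for family, freq in observed_gene_frequencies(presence, 4).items():
        print(family, freq)
```

Since MAGs are often incomplete, a gene can be missing from an assembly even when the organism carries it, so observed frequencies like these may understate the true ones. That is presumably why the final step combines them with the CheckM completeness estimates rather than thresholding them directly.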