Core ELEment Bias Removal In Metagenome Binned ORthologs: A pipeline to make pangenomes from MAGs

bacpop/CELEBRIMBOR

CELEBRIMBOR

A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).

Dependencies:

  • Snakemake
  • mmseqs2
  • Bakta
  • Biopython
  • CheckM
  • Pandas
  • Rust toolchain
  • Panaroo

NOTE: Conda is used to call different environments and dependencies (see Snakemake file).

To install:

Install the required packages using conda/mamba:

git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
cd MAG_pangenome_pipeline
mamba env create -f environment.yml
mamba activate celebrimbor

Download the required bakta database file:

bakta_db download --output /path/to/database

You can also use the light bakta database if using a suitable version of bakta:

bakta_db download --output /path/to/database --type light

Install cgt (this installs the cgt_bacpop executable in the ./bin directory):

cargo install cgt_bacpop --root .

Or to build from source:

git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."

Running inside a container

If you have trouble with the steps above, you can use the CELEBRIMBOR Docker container instead. The whole pipeline is available inside the container, provided you are comfortable running commands in Docker containers and mounting your external files. Pull it with:

docker pull samhorsfield96/celebrimbor:main

Quick start:

Update config.yaml to specify workflow and directory paths.

  • core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
  • output_dir: path to output directory. Does not need to exist prior to running.
  • genome_fasta: path to directory containing fasta files (must have .fasta extension).
  • bakta_db: path to bakta db downloaded above.
  • cgt_exe: path to cgt executable.
  • cgt_breaks: frequency for rare/core gene cutoff, e.g. 0.1,0.9, meaning genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
  • cgt_error: false assignment rate of a gene to a particular frequency compartment.
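
To make the cgt_breaks intervals concrete, the sketch below parses a breaks string such as 0.1,0.9 and applies the cutoffs directly to a single frequency. It only illustrates the interval boundaries described above; it is not how cgt works, since cgt assigns compartments probabilistically and accounts for genome completeness. The function names parse_breaks and naive_compartment are hypothetical and are not part of CELEBRIMBOR or cgt.

```python
def parse_breaks(breaks: str) -> tuple[float, float]:
    """Parse a 'rare,core' string such as '0.1,0.9' into two floats."""
    rare_str, core_str = breaks.split(",")
    rare, core = float(rare_str), float(core_str)
    if not 0.0 <= rare < core <= 1.0:
        raise ValueError(f"expected 0 <= rare < core <= 1, got {breaks!r}")
    return rare, core


def naive_compartment(freq: float, breaks: str = "0.1,0.9") -> str:
    """Label a frequency using hard cutoffs only (illustration, not the cgt model)."""
    rare, core = parse_breaks(breaks)
    if freq < rare:
        return "rare"      # freq < 0.1 with the default breaks
    if freq < core:
        return "middle"    # 0.1 <= freq < 0.9
    return "core"          # freq >= 0.9
```

With the default breaks, naive_compartment(0.05) falls in the rare compartment, and the boundary values 0.1 and 0.9 fall in middle and core respectively.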

Run Snakemake (you must be in the same directory as the Snakemake file):

snakemake --cores <cores>

Overview of workflow

This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and genome completeness.

  1. Predict genes in all FASTA files in given directory using bakta
  2. Cluster genes using mmseqs2 and generate a gene presence/absence matrix
  3. Generate a pangenome summary of observed gene frequencies
  4. Calculate genome completeness using CheckM
  5. Probabilistically assign each gene family as core|middle|rare using cgt
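
The sketch below shows why steps 3 to 5 combine observed frequencies with completeness. It computes observed gene frequencies from a small presence/absence matrix, then shows how incomplete genomes pull the observed frequency of a truly universal gene below a core cutoff. Everything here is an assumption for illustration: the in-memory matrix layout, the function names, completeness expressed as a fraction between 0 and 1, and the simple model in which a genome captures a gene it carries with probability equal to its completeness. It is not the model implemented in cgt.

```python
from statistics import mean


def observed_frequencies(matrix: dict[str, list[int]]) -> dict[str, float]:
    """Fraction of genomes in which each gene family was observed.

    matrix maps a gene family name to one 0/1 entry per genome
    (hypothetical layout, not the pipeline's file format).
    """
    return {gene: sum(row) / len(row) for gene, row in matrix.items()}


def expected_observed_frequency(true_freq: float, completeness: list[float]) -> float:
    """Expected observed frequency under the simple assumed model.

    Assumes each genome that carries the gene captures it independently
    with probability equal to that genome's completeness (0 to 1).
    """
    return true_freq * mean(completeness)


# Toy data: four genomes, two gene families.
toy_matrix = {
    "family_A": [1, 1, 1, 0],
    "family_B": [1, 0, 0, 0],
}
toy_completeness = [0.8, 0.7, 0.9, 0.8]

freqs = observed_frequencies(toy_matrix)
# A gene present in every true genome (true_freq = 1.0) is expected to be
# observed at roughly the mean completeness, here about 0.8.
expected_core = expected_observed_frequency(1.0, toy_completeness)
```

Under these assumptions, a gene that is truly present in every genome would be observed at a frequency of about 0.8, below a 0.9 core cutoff, which is the kind of bias that the completeness-aware assignment in step 5 is meant to correct.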
