CELEBRIMBOR

A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).

Dependencies:

Snakemake
mmseqs2
Bakta
Biopython
CheckM
Pandas
Rust toolchain
Panaroo

NOTE: Conda is used to manage the different environments and dependencies (see the Snakefile).

To install:

Install the required packages using conda/mamba:

git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
cd MAG_pangenome_pipeline
mamba env create -f environment.yml
mamba activate celebrimbor

Download the required bakta database file:

bakta_db download --output /path/to/database

You can also use the light bakta database if using a suitable version of bakta:

bakta_db download --output /path/to/database --type light

Install cgt (this installs the cgt_bacpop executable in the ./bin directory):

cargo install cgt_bacpop --root .

Or to build from source:

git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."

Quick start:

Update config.yaml to specify workflow and directory paths.

core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
output_dir: path to output directory. Does not need to exist prior to running.
genome_fasta: path to directory containing fasta files (must have .fasta extension).
bakta_db: path to bakta db downloaded above.
cgt_exe: path to cgt executable.
cgt_breaks: frequency cutoffs separating rare, middle and core genes, e.g. 0.1,0.9 means genes predicted at <0.1 frequency will be rare, 0.1<=x<0.9 will be middle and >=0.9 will be core.
cgt_error: sets the false assignment rate of a gene to a particular frequency compartment.
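
To make the cgt_breaks boundaries concrete, here is a minimal Python sketch of the cutoff rule described above. It is not how cgt assigns compartments: cgt uses a probabilistic model that also accounts for genome completeness and cgt_error. The function names parse_breaks and compartment are hypothetical and exist only for illustration.

```python
def parse_breaks(breaks: str) -> tuple[float, float]:
    """Split a 'rare,core' string such as '0.1,0.9' into two floats."""
    rare_max, core_min = (float(x) for x in breaks.split(","))
    if not 0.0 <= rare_max <= core_min <= 1.0:
        raise ValueError(f"breaks must satisfy 0 <= rare <= core <= 1, got {breaks!r}")
    return rare_max, core_min


def compartment(freq: float, breaks: str = "0.1,0.9") -> str:
    """Apply the plain cutoff rule: <rare is rare, >=core is core, else middle."""
    rare_max, core_min = parse_breaks(breaks)
    if freq < rare_max:
        return "rare"
    if freq < core_min:
        return "middle"
    return "core"


if __name__ == "__main__":
    for f in (0.05, 0.1, 0.5, 0.9):
        print(f, compartment(f))
```

Note that both boundaries are inclusive on the upper compartment: a frequency of exactly 0.1 is middle and exactly 0.9 is core.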

Run Snakemake (must be in the same directory as the Snakefile):

snakemake --cores <cores>

Overview of workflow

This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and genome completeness.

Predict genes in all FASTA files in given directory using bakta
Cluster genes using mmseqs2 and generate a gene presence/absence matrix
Generate a pangenome summary of observed gene frequencies
Calculate genome completeness using CheckM
Probabilistically assign each gene family as core|middle|rare using cgt
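
As a rough illustration of the "observed gene frequencies" step, the sketch below computes, for each gene family, the fraction of genomes in which it was found. It is written in plain Python for clarity and is not the pipeline's own code; the function name observed_frequencies and the toy gene families and MAG names are hypothetical.

```python
def observed_frequencies(presence: dict[str, set[str]], n_genomes: int) -> dict[str, float]:
    """Return the fraction of genomes carrying each gene family.

    presence maps a gene family name to the set of genomes it was found in.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")
    return {family: len(genomes) / n_genomes for family, genomes in presence.items()}


if __name__ == "__main__":
    # Toy presence/absence data for four hypothetical MAGs.
    presence = {
        "geneA": {"mag1", "mag2", "mag3", "mag4"},
        "geneB": {"mag1", "mag2", "mag3"},
        "geneC": {"mag4"},
    }
    for family, freq in observed_frequencies(presence, n_genomes=4).items():
        print(family, freq)
```

Because MAGs are often incomplete, a gene missing from a genome may simply not have been assembled, so these observed frequencies tend to underestimate the true ones. This is why the workflow combines them with CheckM completeness estimates in the final cgt step.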
