CELEBRIMBOR

A pipeline written in Snakemake to automatically generate pangenomes from metagenome-assembled genomes (MAGs).

Dependencies:

Snakemake
mmseqs2
Bakta
Biopython
CheckM
Pandas
Rust toolchain
Panaroo

NOTE: Conda is used to manage the different environments and dependencies (see the Snakefile).

To install:

Install the required packages using conda/mamba:

git clone https://github.com/bacpop/MAG_pangenome_pipeline.git
cd MAG_pangenome_pipeline
mamba env create -f environment.yml
mamba activate celebrimbor

Download the required bakta database file:

bakta_db download --output /path/to/database

Alternatively, download the light bakta database, if your version of bakta supports it:

bakta_db download --output /path/to/database --type light

Install cgt (this installs the cgt_bacpop executable in the ./bin directory):

cargo install cgt_bacpop --root .

Or to build from source:

git clone https://github.com/bacpop/cgt.git
cd cgt
cargo install --path "."

Running inside a container

If you have trouble with the steps above, you can use the CELEBRIMBOR docker container instead. The whole pipeline is available in the container, provided you are comfortable running commands inside docker containers and mounting your external files. Pull it with:

docker pull samhorsfield96/celebrimbor:main

Quick start:

Update config.yaml to specify workflow and directory paths.

core: gene frequency cutoff for core genes; any gene above this frequency is annotated as a core gene.
output_dir: path to output directory. Does not need to exist prior to running.
genome_fasta: path to directory containing fasta files (must have .fasta extension).
bakta_db: path to bakta db downloaded above.
cgt_exe: path to cgt executable.
cgt_breaks: frequency cutoffs for rare and core genes, e.g. 0.1,0.9 means genes predicted at <0.1 frequency are rare, those at 0.1<=x<0.9 are middle and those at >=0.9 are core.
cgt_error: false assignment rate of a gene to a particular frequency compartment.
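
The cgt_breaks cutoffs can be read as half-open intervals. The sketch below only illustrates how the two breakpoints partition a frequency into rare, middle and core; it is not cgt's probabilistic model, which also takes genome completeness into account. The function names `parse_breaks` and `compartment` are hypothetical and not part of CELEBRIMBOR or cgt.

```python
def parse_breaks(breaks):
    """Parse a 'rare,core' string such as '0.1,0.9' into two floats."""
    parts = breaks.split(",")
    if len(parts) != 2:
        raise ValueError(f"expected two comma-separated values, got {breaks!r}")
    rare_cutoff, core_cutoff = (float(p) for p in parts)
    if not 0.0 <= rare_cutoff < core_cutoff <= 1.0:
        raise ValueError("cutoffs must satisfy 0 <= rare < core <= 1")
    return rare_cutoff, core_cutoff


def compartment(freq, breaks="0.1,0.9"):
    """Return the compartment a frequency falls into (illustration only)."""
    rare_cutoff, core_cutoff = parse_breaks(breaks)
    if freq < rare_cutoff:
        return "rare"      # freq < 0.1 with the default breaks
    if freq < core_cutoff:
        return "middle"    # 0.1 <= freq < 0.9
    return "core"          # freq >= 0.9


for f in (0.05, 0.1, 0.5, 0.9):
    print(f, compartment(f))
```

Note that a frequency exactly equal to the lower cutoff counts as middle, and one exactly equal to the upper cutoff counts as core.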

Run snakemake (from the same directory as the Snakefile):

snakemake --cores <cores>
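
Since genome_fasta expects files with the .fasta extension, it can help to check the input directory before launching the workflow. A minimal sketch follows; `split_by_extension` is a hypothetical helper, not part of CELEBRIMBOR, and it inspects file names only, not file contents.

```python
from pathlib import Path


def split_by_extension(directory):
    """Return (fasta_files, other_files) for the regular files in `directory`."""
    fasta, other = [], []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # ignore subdirectories
        (fasta if path.suffix == ".fasta" else other).append(path.name)
    return fasta, other
```

Calling `split_by_extension` on the directory given as genome_fasta returns the names that meet the requirement and those that do not (for example files ending in .fa or .fna), so the latter can be renamed before the run.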

Overview of workflow

This workflow annotates genes in metagenome-assembled genomes (MAGs) and uses a probabilistic model to assign each gene to a gene frequency compartment based on its observed frequency and genome completeness.

Predict genes in all FASTA files in given directory using bakta
Cluster genes using mmseqs2 and generate a gene presence/absence matrix
Generate a pangenome summary of observed gene frequencies
Calculate genome completeness using CheckM
Probabilistically assign each gene family as core|middle|rare using cgt
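
The observed gene frequency summarised in the workflow is, for each gene family, the fraction of genomes in which that family was found. A minimal sketch, assuming the presence/absence matrix is held as a dict mapping each gene family to a list of 0/1 values (one per genome); this in-memory layout and the gene names are assumptions for illustration, not the pipeline's actual file format.

```python
def observed_frequencies(matrix):
    """Map each gene family to the fraction of genomes in which it was observed."""
    freqs = {}
    for gene, row in matrix.items():
        if not row:
            raise ValueError(f"no genomes recorded for {gene!r}")
        freqs[gene] = sum(1 for present in row if present) / len(row)
    return freqs


# Hypothetical toy matrix: 3 gene families across 4 genomes.
toy = {
    "geneA": [1, 1, 1, 1],
    "geneB": [1, 0, 1, 0],
    "geneC": [0, 0, 0, 1],
}
print(observed_frequencies(toy))
# {'geneA': 1.0, 'geneB': 0.5, 'geneC': 0.25}
```

Because MAGs are often incomplete, a gene missing from a genome may simply not have been assembled, so these raw frequencies can underestimate the true ones. This is why the final step combines them with the CheckM completeness estimates rather than thresholding them directly.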
