Is a user-friendly, automated metagenomics pipeline the key to making your life as a bioinformatician easier? Do you still want some choice in which tools you use for each analysis step rather a rigid selection of pre-determined tools that other pipelines use? Well look no further!
Welcome to Muti-Level Metagenomics (not to be confused with Multi-Level Marketing), a flexible snakemake workflow designed to handle various metagenomics data types and analyses steps. Use MLM's flexible, high-level configuration settings to choose from multiple tools at each major step in the analysis. Yes, you heard that right: you can use the MLM pipeline to make your own pipeline that answers the questions closest to your heart. Like, is this a pipeline, or a pyramid scheme?
UNDER ACTIVE DEVELOPMENT :) Use at you own risk...
,---. ,---. .---. ,---. ,---.
| \ / | | ,_| | \ / |
| , \/ , ,-./ ) | , \/ , |
| |\_ /| \ '_ '`) | |\_ /| |
| _( )_/ | |> (_) ) | _( )_/ | |
| (_ o _) | ( . .-' | (_ o _) | |
| (_,_) | |`-'`-'|___| (_,_) | |
| | | | | | | | |
'--' '--' `--------'--' '--'
Table of Contents
This is a snakemake pipeline is designed to automate common components of shotgun metagenomic data analysis. Briefly, reads are trimmed, deconvoluted (- human), taxnomically defined, assembled, binned, and annotated. Further optimization is neccessary and functional components of metagenomic analysis have yet to be integrated.
TODO
- Update prep_sample_sheet.py so it sorts samples into alphanumeric order
- check metabat
- integrate visualizations + comparisons
- move logs to results folder
- clean up bowtie2 code to reduce used disk space + gzip files
- add bioconda biopython=1.78 to snakemamba base install
- probably more stuff
MaybeTODO
- full setup/install script
- singularity
- test cases?
This pipeline has been tested using snakemake and mamba installed in a single conda environment. For more information regarding snakemake, generally, please see the documentation here
Execute to create a DAG visualization of the pipeline
snakemake --forceall --dag | dot -Tpdf > dag.pdf
Execute to create a rule graph visualization
snakemake --forceall --rulegraph | dot -Tpdf > dag.pdf
Tested versions for base conda software environment:
- snakemake version 7.3.8
- mamba 0.15.3
- anaconda 4.10.3
-
Clone the MLM repository and move into the freshly cloned directory
git clone https://github.com/jtsumner/mlm.git cd mlm
-
(Optional) if you are on an HPC (i.e., quest), load the anaconda or miniconda module
module load mamba
-
Create a new conda environment based on the mamba yaml file located in
workflow/envs/mamba.yml
conda env create -f workflow/envs/mamba.yml
-
The base environment with snakemake and mamba should now be available using
source activate mamba
-
Create a new conda environment to install mamba and snakemake. If you have not already installed conda, install it using the documentatio found here.
conda create --name mamba -c conda-forge -c bioconda
-
Activate the new conda environment and install snakemake using mamba
mamba create -c conda-forge -c bioconda -n snakemake snakemake
-
Confirm that snakeake has installed.
snakemake --version
-
Move data into the
data/
subdirectory -
From the mlm base directory (mlm or metagenomics-snsakemake), use the
prep_sample_sheet.py
script in theworkflow/scripts
subdirectory to prepare a sample sheet. It will automatically create a file calledsample_sheet.tsv
in theconfig/
subdirect E.g.,python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S"
-
Configure the cluster config file. Adjust the
account
andpartition
setting underdefault-resources
to fit your cluster. Note that the current cluster configurations is a basic setup for SLURM on Quest that's based on this blog -
Open the Snakefile and adjust the rule all output to fit your desired output.
-
Configure the general snakemake config
config/config.yaml
so that rules you want to execute are set toTrue
and rules you do not want to execute are set toFalse
. E.g., The following lines will assembl and perform metaphlan read-based analysis, but won't execute metabat2 binning.FASTQC: False NONPAREIL: False TRIM_READS: True DECONVOLUTE: True BOWTIE2: True METAPHLAN: False ASSEMBLE: False MEGAHIT: True SPADES: True METABAT2: False
-
Download Bowtie2 index for human reference genome
mkdir resources/bowtie_human
cd resources/bowtie_human
wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip
or use T2T reference
mkdir resources/bowtie_human
cd resources/bowtie_human
wget https://genome-idx.s3.amazonaws.com/bt/chm13.draft_v1.0_plusY.zip
unzip chm13.draft_v1.0_plusY.zip
See https://benlangmead.github.io/aws-indexes/bowtie for more options and including other organisms
-
(Optional) Check that snakemake is correctly interpretting your sample spreadsheet by executing a dryrun or one of the commands in the notes above
snakemake --dry-run
-
In the base snakemake/project directory, use the
run_snakemake.sh
file to submit the snakemake scheduler as a job to slurm. (Alternatively, you can run an interactive job snakeake via command line for troubleshooting)sbatch run_snakeake.sh
-
wait for data :) (hopefully)
Start interactive job on slurm
srun -A b1042 --partition=genomics -N 1 -n 24 --mem=64G --time=02:00:00 --pty bash -i
Start interactive job on slurm with salloc ssh onto the qnode it outputs
salloc -A p31648 --partition=long -N 1 -n 24 --mem=64G --time=12:00:00 bash -i
Generalized commands + software used by this pipeline
FastP
fastp -i {input.r1} \
-I {input.r2} \
--out1 {output.r1_filtered} \
--out2 {output.r2_filtered} \
--detect_adapter_for_pe \
--dedup \
--thread {threads} \
--length_required 50 \
-j {output.json} \
-h {output.html} \
-V
FastQC
fastqc -t {threads} \
{input} \
--outdir {params.outDir}"
BWA - Remove human-derived reads
bwa mem -t {threads} {input.genome} {input.r1Filtered} {input.r2Filtered} > {params.sam}
samtools view -Subh -o {params.bam} {params.sam}
samtools sort -o {params.sortedBam} {params.bam}
samtools view -b -f 12 -F 256 -o {params.unmappedBam} {params.sortedBam}
bedtools bamtofastq -i {params.unmappedBam} -fq {output.cleanFastQ1} -fq2 {output.cleanFastQ2}
Bowtie2 - Remove human-derived reads
bowtie2 -p {threads} -x {params.filter_db} --very-sensitive -1 {input.r1} -2 {input.r2}| \
samtools view -bS -@ {threads}| \
samtools sort -@ {threads} -n -o {output.sorted_bam}
samtools fastq -1 {output.r1_clean} -2 {output.r2_clean} -@ {threads} -f 12 -F 256 {output.sorted_bam}
samtools flagstat -@ {threads} -O tsv {output.sorted_bam} > {output.flagstat}
Metaphlan
Setup
metaphlan --install \
--index {params.metaphlan_idx} \
--bowtie2db {output.metaphlan_db} \
--nproc {threads}
Execute
metaphlan {input.cleanFastQ1},{input.cleanFastQ2} \
--bowtie2out {output.bowtie_out} \
--index {params.metaphlan_idx} \
--bowtie2db {input.metaphlan_db} \
--nproc {threads} \
--input_type fastq \
--unknown_estimation \
-o {output.profile}
MegaHit
megahit -t {threads} \
-m 0.9 \
-1 {input.cleanR1} \
-2 {input.cleanR1} \
-o {params.outdir_tmp}
Quast
quast.py \
-o {output.direc} \
--threads {threads} \
--min-contig 0 \
-L {input}
Metabat2
Depth
jgi_summarize_bam_contig_depths \
--outputDepth {output.depth_fi} \
--percentIdentity 97 \
--minContigLength 1000 \
--minContigDepth 1.0 \
--referenceFasta {input.contigs} {input.sortedBam}
Bin
metabat2 -t {threads} \
--inFile {input.contigs} \
--outFile {output.bin_dir}/bin \
--abdFile {input.depth_fi}
- bedtools v2.29.2
- biopython v1.78
- bowtie2 v2.4.4
- bwa v0.7.17
- checkm v1.0.7
- fastp v0.20.1
- fastqc v0.11.9
- fastqc v0.11.5
- hclust2 v1.0.0
- kneaddata v0.10.00
- megahit v1.0.6.1
- metabat2 v2.15
- metaphlan v3.0.13
- python v2.7
- python v3.7
- quast v5.0.2
- samtools v1.10.1
- spades v3.14.1
to add and/or deprecated:
- kaiju v1.8.0
- vcontact2
- blast
- diamond
- mcl
- hmmer
- cython v0.29.21
- scikit-learn v0.21.3* prodigal
So far there are six major steps in the analysis. Each of these can be turned on or off at your desire.
FASTQC: True
TRIM_READS: True
DECONVOLUTE: True
METAPHLAN: True
ASSEMBLE: True
METABAT2: False
FASTQC
employs fastqc on raw reads and, if selected to perform the analyses which generate them, trimmed and deconvoluted reads.TRIM_READS
employs fastp to trim and QC raw reads.DECONVOLUTE
employs bwa alignment to the human reference genome to remove probable human-derived readsMETAPHLAN
employs metaphlan3 to determine a relative abundance profile at the genus and species levelASSEMBLE
employs megahit or metaspades to assemble metagenomesMETABAT2
uses the metabat algorithm to bin genomes into metagenome assembled genomes
,---. ,---. ___ _ .---.,---------..-./`)
| \ / .' | | | | ,_|\ \ .-.')
| , \/ , | .' | ,-./ ) `--. ,---/ `-' \
| |\_ /| .' '_ | \ '_ '`) | \ `-'`"`_ _ _ _
| _( )_/ | ' ( \.-.|> (_) ) :_ _: .---.( ' )--( ' )
| (_ o _) | ' (`. _` /( . .-' (_I_) | (_{;}_)(_{;}_)
| (_,_) | | (_ (_) _)`-'`-'|__(_(=)_) | |(_,_)--(_,_)
| | | |\ / . \ / | (_I_) | |
'--' '--' ``-'`-'' `--------'---' '---'
.---. .-''-. ,---. ,---. .-''-. .---.
| ,_| .'_ _ \| / | |.'_ _ \ | ,_|
,-./ ) / ( ` ) | | | ./ ( ` ) ,-./ )
\ '_ '`). (_ o _) | | _ | . (_ o _) \ '_ '`)
> (_) )| (_,_)___| _( )_ | (_,_)___|> (_) )
( . .-'' \ .---\ (_ o._) ' \ .---( . .-'
`-'`-'|__\ `-' /\ (_,_) / \ `-' /`-'`-'|___
| \ / \ / \ / | \
`--------``'-..-' `---` `'-..-' `--------`
,---. ,---. .-''-.,---------. ____
| \ / | .'_ _ \ \.' __ `.
| , \/ , |/ ( ` ) `--. ,---/ ' \ \
| |\_ /| . (_ o _) | | \ |___| / | _ _ _ _
| _( )_/ | | (_,_)___| :_ _: _.-` |( ' )--( ' )
| (_ o _) | ' \ .---. (_I_) .' _ (_{;}_)(_{;}_)
| (_,_) | |\ `-' / (_(=)_) | _( )_ |(_,_)--(_,_)
| | | | \ / (_I_) \ (_ o _) /
'--' '--' `'-..-' '---' '.(_,_).'
.-_'''-. .-''-. ,---. .--. ,-----. ,---. ,---.-./`) _______ .-'''-.
'_( )_ \ .'_ _ \| \ | | .' .-, '. | \ / \ .-.') / __ \ / _ \
|(_ o _)| ' / ( ` ) | , \ | |/ ,-.| \ _ \| , \/ , / `-' \ | ,_/ \__) (`' )/`--'
. (_,_)/___|. (_ o _) | |\_ \| ; \ '_ / | | |\_ /| |`-'`",-./ ) (_ o _).
| | .-----| (_,_)___| _( )_\ | _`,/ \ _/ | _( )_/ | |.---.\ '_ '`) (_,_). '.
' \ '- .' \ .---| (_ o _) : ( '\_/ \ | (_ o _) | || | > (_) ) __.---. \ :
\ `-'` | \ `-' | (_,_)\ |\ `"/ \ ) /| (_,_) | || |( . .-'_/ \ `-' |
\ / \ /| | | | '. \_/``".' | | | || | `-'`-' / \ /
`'-...-' `'-..-' '--' '--' '-----' '--' '--''---' `._____.' `-...-'