Named after the beautiful Cecret lake
Location: 40.570°N 111.622°W , 9,875 feet (3,010 m) elevation
Cecret is a workflow developed by @erinyoung at the Utah Public Health Laborotory for SARS-COV-2 sequencing with the artic/Illumina hybrid library prep workflow for MiSeq data with protocols here and here. Built to work on linux-based operating systems. Additional config options are needed for cloud batch usage.
Cecret is also part of the staphb-toolkit.
git clone https://github.com/UPHL-BioNGS/Cecret.git
To make your life easier, follow with
cd Cecret
git init
so that you can use git pull
for updates.
- Nextflow
- Nextflow version 20+ is required (
nextflow -v
to check your installation)
- Nextflow version 20+ is required (
- Singularity
or
- Docker (with the caveat that the creator and maintainer uses singularity and may not be able to troubleshoot all docker issues)
directory
|-reads
|-*fastq.gz
Arrange single-end fastq.gz reads as follows or designate directory with params.single_reads
or --single_reads
directory
|-single_reads
|-*fastq.gz
WARNING : single and paired-end reads cannot be in the same directory
nextflow run Cecret.nf -c configs/singularity.config
nextflow run Cecret.nf -c configs/singularity.config --cleaner fastp
Or set params.cleaner = 'fastp'
in a config file
nextflow run Cecret.nf -c configs/singularity.config --trimmer samtools
Or set params.trimmer = 'samtools'
in a config file
nextflow run Cecret.nf -c configs/singularity.config --aligner minimap2
Or set params.aligner = 'minimap2'
in a config file
To create a multiple sequence alignment and corresponding phylogenetic tree and SNP matrix, set params.relatedness = true
or
nextflow run Cecret.nf -c configs/singularity.config --relatedness true
To classify reads with kraken2 to identify reads from human or the organism of your choice
Step 1. Get a kraken2 database (note : this link is no longer active. I'm actively working on creating my own)
mkdir -p kraken2_db
cd kraken2_db
wget https://storage.googleapis.com/sars-cov-2/kraken2_h%2Bv_20200319.tar.gz
tar -zxf kraken2_h+v_20200319.tar.gz
rm -rf kraken2_h+v_20200319.tar.gz
params.kraken2 = true
params.kraken2_db = < path to kraken2 database >
params.kraken2_organism = "Severe acute respiratory syndrome coronavirus 2"
- seqyclean - for cleaning reads
- fastp - for cleaning reads ; optional, faster alternative to seqyclean
- bwa - for aligning reads to the reference
- minimap2 - an alternative to bwa
- ivar - calling variants and creating a consensus fasta; optional primer trimmer
- samtools - for QC metrics and sorting; optional primer trimmer; optional converting bam to fastq files
- fastqc - for QC metrics
- bedtools - for depth estimation over amplicons
- kraken2 - for read classification
- pangolin - for lineage classification
- nextclade - for clade classification
- vadr - for annotating fastas like NCBI
- mafft - for multiple sequence alignment (optional, relatedness must be set to "true")
- snp-dists - for relatedness determination (optional, relatedness must be set to "true")
- iqtree - for phylogenetic tree generation (optional, relatedness must be set to "true")
- bamsnap - to create images of SNPs
It came to my attention that some processes (like bcftools) do not work consistently. Also, they might take longer than wanted and might not even be needed for the end user. Here's the processes that can be turned off with their default values:
params.bcftools_variants = false # the container gets a lot of traffic which can error when attempting to download
params.fastqc = true # qc on the sequencing reads
params.ivar_variants = true # itemize the variants identified by ivar
params.samtools_stats = true # stats about the bam files
params.samtools_coverage = true # stats about the bam files
params.samtools_depth = true # stats about the bam files
params.samtools_flagstat = true # stats about the bam files
params.samtools_ampliconstats = true # stats about the amplicons
params.samtools_plot_ampliconstats = true # images related to amplicon performance
params.kraken2 = false # used to classify reads and needs a corresponding params.kraken2_db and organism if not SARS-CoV-2
params.bedtools_multicov = true # bedtools multicov for coverage approximation of amplicons
params.nextclade = true # SARS-CoV-2 clade determination
params.pangolin = true # SARS-CoV-2 clade determination
params.vadr = false # NCBI fasta QC
params.relatedness = false # actually the toggle for mafft to create a multiple sequence alignement
params.snpdists = true # creates snp matrix from mafft multiple sequence alignment
params.iqtree = true # creates phylogenetic tree from mafft multiple sequence alignement
params.bamsnap = false # can be really slow. Works best with bcftools variants. An example bamsnap image is below.
params.rename = false # needs a corresponding sample file and will rename files for GISAID and NCBI submission
params.filter = false # takes the aligned reads and turns them back into fastq.gz files
This requires a comma-delimted file set with params.sample_file
file with a row for each sample and a comma-delimited column for each item to add to the GenBank submission header. Additionally, adjust params.rename = true
.
The following headers are required
- Sample_id (required, must match sample_id*.fa*)
- Submission_id (if file needs renaming)
- Collection_Date
Example covid_samples.csv file contents:
Sample_id,Submission_ID,Collection_Date,SRR
12345,UT-UPHL-12345,2020-08-22,SRR1
67890,UT-UPHL-67890,2020-08-22,SRR2
23456,UT-UPHL-23456,2020-08-22,SRR3
78901,UT-UPHL-78901,2020-08-18,SRR4
Where the files named 12345-UT-M03999-200822_S9_L001_R1_001.fastq.gz
, 12345-UT-M03999-200822_S9_L001_R2_001.fastq.gz
will be renamed UT-UPHL-12345.R1.fastq.gz
and UT-UPHL-12345.R2.fastq.gz
. A GISAID and GenBank friendly multifasta files ready for submission are also generated. The GenBank multifasta uses the input file to create fasta headers like
>12345 [Collection_Date=2020-08-22][organism=Severe acute respiratory syndrome coronavirus 2][host=human][country=USA][isolate=SARS-CoV-2/human/USA/12345/2020][SRR=SRR1]
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
Sometimes sequencing fails, so there are parameters for how many non-ambiguous bases a fasta needs in order to get incorporated into the final file. This can be set with params.gisaid_threshold
(Default is 'params.gisaid_threshold = '25000'
') and params.genbank_threshold (Default is 'params.genbank_threshold = '15000'
').
cecret_run_results.txt # information about the sequencing run that's compatible with legacy workflows
covid_samples.csv # only if supplied initially
cecret
|-aligned
| |-pretrimmed.sorted.bam
|-bamsnap
| |-sample
| |-ivar
| |-variant.png # png of variants identified via ivar
| |-bcftools
| |-variant.png # png of variants identified via bcftools
|-bedtools_multicov
| |-sample.multicov.txt # depth per amplicon
|-consensus
| |-sample.consensus.fa # the likely reason you are running this workflow
|-fastp
| |-sample_clean_PE1.fastq # clean file: only if params.cleaner=fastp
| |-sample_clean_PE2.fastq # clean file: only if params.cleaner=fastp
|-fastqc
| |-sample.fastqc.html
| |-sample.fastqc.zip
|-filter # optional: turns aligned bams into fastq files
| |-sample_filtered_R1.fastq
| |-sample_filtered_R2.fastq
| |-sample_filtered_unpaired.fastq
|-iqtree # optional: relatedness parameter must be set to true
| |-iqtree.treefile
|-ivar_trim
| |-sample.primertrim.bam # aligned reads after primer trimming. trimmer parameter must be set to 'ivar'
|-ivar_variants
| |-sample.variants.tsv # list of variants identified via ivar and corresponding scores
|-kraken2
| |-sample_kraken2_report.txt # kraken2 report of the percentage of reads matching virus and human sequences
|-logs
| |-process_logs # for troubleshooting puroses
|-mafft # optional: relatedness parameter must be set to true
| |-mafft_aligned.fasta # multiple sequence alignement generated via mafft
|-nextclade # identfication of nextclade clades and variants identified
| |-sample_nextclade_report.csv # actually a ";" deliminated file
|-pangolin
| |-lineage_report.csv # identification of pangolin lineages
|-samtools_ampliconstats
| |-sample_ampliconstats.txt # stats for the amplicons used
|-samtools_coverage
| |-aligned
| | |-sample.cov.hist # histogram of coverage for aligned reads
| | |-sample.cov.txt # tabular information of coverage for aligned reads
| |-trimmed
| |-sample.cov.trim.hist # histogram of coverage for aligned reads after primer trimming
| |-sample.cov.trim.txt # tabular information of coverage for aligned reads after primer trimming
|-samtools_depth
| |-aligned
| | |-sample.depth.txt # read depth for each read position
| |-trimmed
| |-sample.depth.txt # read depth for each position
|-samtools_flagstat
| |-aligned
| | |-sample.flagstat.txt # samtools stats for aligned reads
| |-trimmed
| |-sample.flagstat.trim.txt # samtools stats for trimmed reads
|-samtools_plot_ampliconstats
| |-sample.*.png # images corresponding to amiplicon performance
|-samtools_stats
| |-aligned
| | |-sample.stats.txt # samtools stats for aligned reads
| |-trimmed
| |-sample.stats.trim.txt # samtools stats for trimmed reads
|-seqyclean
| |-sample_clean_PE1.fastq # clean file
| |-sample_clean_PE2.fastq # clean file
|-snp-dists # optional: relatedness parameter must be set to true
| |-snp-dists # file containing a table of the number of snps that differ between any two samples
|-submission_files # optional: is only created if covid_samples.txt exists
| |-sample.genbank.fa # fasta file with formatting and header including metadata for genbank
| |-sample.gisaid.fa # fasta file with header for gisaid
| |-sample.R1.fastq.gz # renamed raw fastq.gz file
| |-sample.R2.fastq.gz # renamed raw fastq.gz file
|-summary
| |-sample.summary.txt # individual results
|-summary.csv # tab-delimited summary of results from the workflow
reads
| |-sample_S1_L001_R1_001.fastq.gz # initial file
| |-sample_S1_L001_R2_001.fastq.gz # inital file
work # nextflows work directory. Likely fairly large.
vadr
|-vadr* # vadr output
A FILE THAT THE END USER/YOU CAN COPY AND EDIT IS FOUND AT configs/example.config
This file contains all of the configurable parameters with their default values. If you are using some sort of cloud or HPC setup, I highly recommend you copy this and edit it appropriately. A limited list of parameters is listed below:
- params.reads = workflow.launchDir + '/reads'
- params.single_reads = workflow.launchDir + '/single_reads'
- params.outdir = workflow.launchDir + "/cecret"
- params.reference_genome = workflow.projectDir + "/configs/MN908947.3.fasta"
- params.gff_file = workflow.projectDir + "/configs/MN908947.3.gff"
- params.primer_bed = workflow.projectDir + "/configs/artic_V3_nCoV-2019.bed"
- params.amplicon_bed = workflow.projectDir + "/configs/nCoV-2019.insert.bed"
- To "resume" a workflow that use
-resume
with your nextflow command - To create a report, use
-with-report
with your nextflow command - To use nextflow tower, use
-with-tower
with your nextflow command
TELL ME ABOUT IT!!!
- Github issue
- Email me
- Send me a message on slack
Be sure to include the command that you used, what config file you used, and what the nextflow error was.
nextflow run Cecret_annotation.nf -c configs/singularity.config --fastas <directory with fastas>
You can run mafft, snpdists, and iqtree on a collection of fastas as well with
nextflow run Cecret_annotation.nf -c configs/singularity.config --relatedness true --fastas <directory with fastas>
Cecret_annotation.nf Uses the same basic structure as the main workflow, so the same config file can probably be used for both.
You're more than welcome to look at what we use here at UPHL here.
There's nothing wrong with the bcftools process, and the vcf created by bcftools is rather handy for additional analyses. The 'staphb/bcftools:latest'
container is really popular, and has issues downloading during high traffic times. I don't have the time to handle issues of users not understanding why the container did not download. /Sorry
If you want to get the output from bcftools, set params.bcftools = true
Yes. Set params.bamsnap = true
. This is false by default because of how long it takes. It will work with variants called by ivar
and bcftools
, although it is MUCH faster with the vcf created by bcftools.
Warning : will not work on all variants. This is due to how bamsnap runs. It is even less likely to work on indels.
The primer bedfile is the file with the start and stop of each primer sequence.
$ head -n 3 artic_V3_nCoV-2019.bed
MN908947.3 30 54 nCoV-2019_1_LEFT nCoV-2019_1 +
MN908947.3 385 410 nCoV-2019_1_RIGHT nCoV-2019_1 -
MN908947.3 320 342 nCoV-2019_2_LEFT nCoV-2019_2 +
The amplicon bedfile is the file with the start and stop of each intended amplicon.
$ head -n 3 nCoV-2019.insert.bed
MN908947.3 54 385 1 1 +
MN908947.3 342 704 2 2 +
MN908947.3 664 1004 3 1 +
Due to the many varieties of primer bedfiles, I determined it was best if the user supplied this file for custom primer sequences.
In your config file, change the following relevant parameters:
params.reference_genome
params.primer_bed
params.amplicon_bed or set params.bedtools_multicov = false
params.gff_file or set params.ivar_variants = false
And set
params.pangolin = false
params.nextclade = false
params.vadr = false or configure your vadr container appropriately and params.vadr_reference
Although not perfect, if params.filter = true
, then only the reads that were mapped to the reference are returned. This should eliminate all human contamination (as long as human is not part of your reference).
This workflow has too many bells and whistles. I really only care about generating a consensus fasta. How do I get rid of all the extras?
Change the parameters in your config file and set most of them to false.
params.fastqc = false
params.ivar_variants = false
params.samtools_stats = false
params.samtools_coverage = false
params.samtools_depth = false
params.samtools_flagstat = false
params.bedtools_multicov = false
params.samtools_ampliconstats = false
params.samtools_plot_ampliconstats = false
params.bedtools_multicov = false
params.pangolin = false
params.nextclade = false
params.vadr = false
And, yes, this means I added some bells and whistles so you could turn off the bells and whistles. /irony