- DNA extraction
- Primary PCR. One of:
2A. trnL (plant)
2B. 12SV5 (vertebrate animals) - Dilution
- Indexing PCR
These scripts produce, from Illumina MiniSeq raw sequencing data, a directory named DATE_results with the following structure:
DATE_results
|
+-- 0_raw
|
+-- 1_trimadapter
|
+-- 2_filterprimer
|
+-- 3_trimprimer
|
+-- 4_dada2
- A compute cluster with SLURM job submission
- Have downloaded the metabarcoding Singularity container: https://gitlab.oit.duke.edu/lad-lab/metabarcoding/-/tree/main
- Demultiplexing
- Trim adapters
- Filter primers
- Trim primers
- Data analysis object generation and QC 5A. Submission script 5B. Rscript 5C. Write Rout file
Clone this repo to the computing cluster:
#navigate to where you want to store the scripts
git clone https://github.com/bpetrone/mb-pipeline.git
After getting your data off of the sequencer, upload it to the computing cluster. For example, here is how you would upload to the DCC:
#"IlluminaRunFolder will look something like "211019_MN00462_0194_A000H3L2M7"
scp -r /path/to/your/<IlluminaRunFolder> <your-netid>@dcc-login.oit.duke.edu:/path/to/DCC/folder
Next, upload your samplesheet.csv file and make sure that you have the following file structure:
/seqdata #name this whatever you want
-########_samplesheet.csv
-211019_MN00462_0194_A000H3L2M7 #note that this and sample sheet should be in the SAME folder
See "Troubleshooting" for more information regarding sample sheet structure if you aren't sure what an Illumina sample sheet should look like.
You will also need to download the metabarcoding container file to the computing cluster you're using with the following command:
#navigate to whatever directory you want to store the container in
curl -O https://research-singularity-registry.oit.duke.edu/lad-lab/metabarcoding.sif
This is a Singularity (aka computing cluster compatible) container that has all of the packages you need for analysis pre-installed, so no other package installations are needed!
See the below instructions for how to run each step of the pipeline. There is also a template file that will write each submission script for you if you just want to copy/paste the correct commands.
The first step of this pipeline is to convert the raw .bcl files from the sequencer into individual .fastq files for each sample. You will need:
- The path to your Illumina run folder
- Sample sheet name
- Path to where you stored the metabarcoding.sif container file
- The diet metabarcoding run type (trnL or 12SV5)
#navigate to the mb-pipeline folder on your computing cluster
sbatch [email protected] 1_demux-barcode.sh /path/to/metabarcoding.sif path/to/miniseq-dir XXXXXXXX_sample-sheet.csv <trnL>OR<12SV5>
This can take about an hour to run with 192 samples.
After this script has finished, you should see the following file structure:
/seqdata
-########_samplesheet.csv
-miniseq-dir
-XXXXXXXX_results
-0_reference
-primers.txt
-1_raw_demux
#all of your demultiplexed .fastq files will be here
-1_raw_all
Undetermined_S0_L001_R1_001.fastq
Undetermined_S0_L001_R2_001.fastq # these files contain everything, included PhiX and all reads that didn't match to the barcodes you input
Check the .out and .err files to make sure that everything went smoothly
If demultiplexing failed, here are the most common issues:
- Incorrect file structure: is your sample sheet located in the same folder as your Illumina run folder?
- Incorrect sample sheet input: Does the input sample sheet name match what you have in the folder?
- Incorrect sample sheet format: bcl2fastq requires a very specific sample sheet format. If there is an empty row or hidden character where there shouldn't be, then it will give an error. Sometimes, it's necessary to copy and paste the sample sheet from a run that has worked in the past (or, use this sample) and then re-paste in the barcodes you used.
- Not using exactly "trnL" or "12SV5" as the run type argument. This input is case sensitive, make sure you use exactly "trnL" or "12SV5" as the last argument.
- Other file path issues: double check that all of the file paths are correct and that there aren't any missing back slashes. Did you correctly type the miniseq-dir path?
- Trying to submit command from outside /mb-pipline/code folder. If you aren't submitting the sbatch command from inside the mb-pipeline folder, you will need to encode the path to the script in your sbatch command. I.e.;
sbatch [email protected] /PATH/TO/1_demux-barcode.sh /path/to/metabarcoding.sif path/to/miniseq-dir XXXXXXXX_sample-sheet.csv <trnL>OR<12SV5>
- Input files: Demultiplexed and trimmed .fastq files
- Output files -- Filtered .fastq files -- .out file for each sample
Some final quality control filtering and QC plot generation is done at this step.
3_trimprimer directory containing appropriately trimmed/filtered read files
- quality_F.pdf
- quality_R.pdf
- quality_F_summary.png
- quality_R_summary.png
- dada_errors_F.png
- dada_errors_R.png
- dadaFs.rds
- dadaRs.rds
- mergers.rds: merged forward and reverse reads
- concats.rds: includes concatenated reads
- seqtab.rds
- seqtab_concats.rds
- seqtab_nochim.rds: chimeras removed
- track.rds: counts up how many reads were lost at each dada2 step due to filtering
- seqtab_nochim_concats.rds
- track_pipeline.csv: table of how many reads were present at each of the preceding pipeline steps
- track_long.csv: R data table formatted table with the number of reads for each sample at each pipeline step, including dada2. Allows QC plotting