To See answers to the below questions, go to Jessen's the answers branch of his repo.
NOTE: Unless otherwise specified, example command lines are available in this workshop's lecture notes
-
First, install the
gnuplot
command line plotting software using HomeBrew:brew install gnuplot
-
Download the sofware required for this tutorial using
wget
fromhttps://www.dropbox.com/s/zn4m0oqbasg4gal/software.tar.gz
. Unarchive the files by typingtar -xzvf software.tar.gz
. Finally, add thesoftware/bin
directory to your$PATH
environment variable:export PATH=$PWD/software/bin:$PATH;
NOTE: You need to replaced
$PWD
above with the full, expanded path the thesoftware/bin
dir if you put theexport
statement in your.bash_profile
or.bashrc
. -
Download the genome and annotation files using
wget
and decompress them both withgunzip
. -
Index the genome as described in the lecture notes.
-
Download the reads with SRR number
SRR10178655
from the SRA and convert to FASTQ format:# Fetch the .sra container file from the SRA FTP: wget ftp://ftp-trace.ncbi.nih.gov/sra/sra-instant/reads/ByRun/sra/SRR/SRR101/SRR10178655/SRR10178655.sra # Use SRA Toolkit to extract the reads in FASTQ format: fastq-dump --gzip --split-files --defline-seq '@$sn/$ri' --defline-qual '+' SRR10178655.sra
-
Run FastQC on the FASTQ files and examine the report (see
fastqc --help
for a complete list of options).- How many read pairs are in included in the FASTQ file?
- How long are the reads?
- What quality encoding are the reads in? What is the quality offset?
- For which metrics are there warnings?
- Are there any over-represented sequences in the file? If so, what is it?
-
Next, run the Trimmomatic adapter trimmer on the FASTQ files in "PE" mode, using this adapter file. What fraction of the data were discarded? NOTE: Trimmomatic is in it's own subdirectory
sofware/Trimmomatic-0.39/trimmomatic-0.39.jar
. -
Align the reads to the genome sequence using BWA-MEM. Convert the file to BAM format, sort the BAM file, and index it (see lecture notes for how).
-
Now, Run GATK's
MarkDuplicates
tool on the BAM file to identify optical and PCR duplicate reads. -
Run
samtools stats
andplot-bamstats
on the BAM file and examine the results.- What is the mode insert size of the sequencing library?
- What is the estimated read base-call error rate?
- What fraction of your reads are duplicates?
- Within which base quality score range do the majority of mismatches occur? (HINT: see the "Mismatches per cycle" plot). Record the upper value of the range to use as the minimum base quality threshold for variant calling a later step.
-
Use
samtools view
to keep alignments withPAIRED
andPROPER_PAIR
flags AND DO NOT containUNMAP
,MUNMAP
,SECONDARY
,QCFAIL
,DUP
, orSUPPLEMENTARY
flags; write the output as a BAM file. Index this new file with SAMtools (HINT: seesamtools flags
for help with flags). Be sure to index your filtered BAM file. -
Using the base quality score determined in Step 10 as the minimum base quality threshold, call SNPs using the GATK HaplotypeCaller.
-
Use the
samtools depth
command to calculate the per-site depth of reads in the genome (seesamtools depth --help
for more info). The output file contains three columns: the chromosome name, position (1-based), and depth. For example:chrI 1 15 chrI 2 15 chrI 3 14 chrI 4 16 chrI 5 16
-
Write a script that computes a text histogram of depth with output similar to the following:
1 | 2 |] 3 |]] 4 |]]]]] 5 |]]]]]]]] 6 |]]]]]]]]]]]]] 7 |]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] 8 |]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]]] 9 |]]]]]]]]]]]]]]]]]]]]]]]]]]] 10 |]]]]]]]]]]]]]]] 11 |]]]]]]]]] 12 |]]]]] 13 |] 14 |] 15 |
-
Filter SNPs and Indels for variant loci within the center 95% of the depth distribution (use your distribution from above; estimating by eye is fine). To filter loci, use VCFtools:
vcftools --recode --recode-INFO-all --stdout --max-missing 1 --minDP <lower-threshold> --maxDP <upper-threshold> --vcf <your.vcf> >your.filtered.vcf
-
Finally, how many of these SNPs and Indels intersect CDS features? (HINT: extract CDS features into a new GFF3 file and use
bedtools intersect
to do this to extract unique SNP loci).
BONUS: Try loading the genome FASTA, annotation GFF3, and filtered SNPs into IGV
- Open IGV
- Navigate through
Genomes
=>Load Genomes from File...
and select the genome FASTA file. - Navigate
File
=>Load from File...
and load a VCF/GFF3/BED file.