merging kraken profile functionality started
jtsumner committed Feb 5, 2023
1 parent bd7e9f8 commit 9def34b
Showing 12 changed files with 864 additions and 17 deletions.
4 changes: 2 additions & 2 deletions Snakefile
@@ -22,8 +22,8 @@ include: "workflow/rules/bin_metabat2.smk"

rule all:
input:
get_rules

get_rules,
"results/kraken/merged_kraken_report_profile.tsv"

# Make report for snakemake.

2 changes: 1 addition & 1 deletion config/config.yaml
@@ -21,7 +21,7 @@ BWA: False
BOWTIE2: True
METAPHLAN: True
KRAKEN2: True
METAXA2: True
METAXA2: False
ASSEMBLE: False
MEGAHIT: True
SPADES: True
36 changes: 36 additions & 0 deletions notes/20221103_notes.md
@@ -0,0 +1,36 @@
command I used to make the sample sheet:

python3 workflow/scripts/prep_sample_sheet.py --delimiter="_" --r1_common="_1" --r2_common="_2" --path="../cat_reads"
reads in ../cat_reads and ended up working for everything hehe

noticed that NTM00399 did assemble, even though I noted in my rotation report that it didn't... maybe there were just too few assemblies
on the same note, two samples are taking quite a long time: NTM00262, NTM00067

had to delete the snakemamba conda env and reinstall with the code in the documentation. I think it became too bloated with all my mlms, but not sure

with that new install, I also had to update snakemake because I kept getting a snakemake error (7.3.8 > 7.18)
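the update itself was just something like this (the channel list is my assumption):

```bash
# bump snakemake inside the snakemamba env
mamba update -c conda-forge -c bioconda snakemake
```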

20221116

noticed snakemake stalls at the dropping-short-contigs script + python gives a "Biopython not recognized" error -> biopython not installed?

found which conda env it is w/ "snakemake -n --list-conda-envs"
activated it with minipython and tested parse_contigs function --> it worked, so prob not the install
means prob a problem with snakemake?? maybe path?
I think it may be that there are issues calling the correct python environment for the "script" directive in snakemake...
...maybe try installing in snakemamba
did conda install -c bioconda biopython=1.78 in snakemamba env
WOO this seemed to have worked... but why didn't snakemake work originally?
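for next time, the env check distilled down (the env hash under .snakemake/conda/ is a placeholder, use the real one from the list):

```bash
# find the conda envs snakemake manages, then test biopython inside the one it uses
snakemake -n --list-conda-envs
conda activate .snakemake/conda/<env-hash>   # placeholder path
python -c "from Bio import SeqIO; print('biopython ok')"
```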

noticed map2contigs wasn't working after dropping short contigs started working
in the rule description, "mem=<>" was blank since god knows when...
set it to be 25G

also, deleting NTM00263, NTM00067 from sample sheet

20221201
turns out I accidentally deleted NTM01067 instead of NTM00067; it's been corrected and rerun

also committed the changes from last time re: debugging and made some new changes to the report, which now seems kinda fixed

NTM01464_k99_1692 only has 1 contig > 1kb in megahit and none in spades, therefore removing
48 changes: 48 additions & 0 deletions notes/20221205_notes.md
@@ -0,0 +1,48 @@
20221204

starting analysis on shotgun BAL data

there's one mistake in the sample log -- B22_NaNa was misnamed B22_NaDNase, so both
samples' reads got put into one. THANKFULLY not a frameshift mistake -- need better
naming conventions to avoid this in the future

will omit the NaDNase samples -- will rerun bcl-convert with an updated sample sheet
and add them in tomorrow.

B22_NaDNase MGX_DATA /projects/p31648/bmo_shotgun_fastq/reads/B22_NaDNase_S12_L001_R1_001.fastq.gz /projects/p31648/bmo_shotgun_fastq/reads/B22_NaDNase_S12_L001_R2_001.fastq.gz


20221205

restarting analysis with updated data

python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/p31648/bmo_shotgun_fastq" --subdirectories=True
ended up working for everything hehe

20221213

last time I added a bunch of functionality, including bowtie2 alignment etc.
this led to improved efficiency and showed greater host depletion and taxonomic depth from metaphlan
it also led to the conclusion that duplicate reads from adapter dimers are contributing to artificially low alignment in some samples
thus, a read deduplication step is necessary to estimate alignment more robustly

today, I'm updating fastp to v0.23.2 in order to use its deduplication functionality
activated conda environment with fastp
conda update fastp

fastp settings need to be optimized together with the downstream steps
fastp -i /projects/p31648/bmo_shotgun_fastq/ds.d2b2785572ed457391fae708169f4be5/B21_LyNa_S77_L001_R1_001.fastq.gz -I /projects/p31648/bmo_shotgun_fastq/ds.d2b2785572ed457391fae708169f4be5/B21_LyNa_S77_L001_R2_001.fastq.gz --out1 B21_LyNA.r1.a.fastq.gz --out2 B21_LyNA.r2.a.fastq.gz --detect_adapter_for_pe --dedup --thread 23 --length_required 50 -j ./B21_LyNa_fastp.json -h ./B21_LyNa_fastp.html -V --overrepresentation_analysis -g --merge --merged_out B21_LyNA.merged.a.fastq.gz --low_complexity_filter --include_unmerged --correction

and then include the merged + unpaired reads via -U in the bowtie2 mapping
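rough sketch of what that bowtie2 call would look like (index basename and filenames are placeholders, not the pipeline's actual paths):

```bash
# paired reads via -1/-2, merged + unpaired reads via -U (single-end input)
bowtie2 -p 23 -x GRCh38_index \
    -1 B21_LyNA.r1.a.fastq.gz -2 B21_LyNA.r2.a.fastq.gz \
    -U B21_LyNA.merged.a.fastq.gz \
    -S B21_LyNA.sam
```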



20221214

starting to test the other components

fastp > bowtie2 etc. finished up yesterday with dedup, except B21_LyDNase needs more memory

one-liner for flagstat (cd into the bowtie_out directory first)
for i in $(ls -d *) ; do sed -e "s/^/$i\t/" ${i}/*.flagstat.tsv >> tmp.txt ; done
31 changes: 31 additions & 0 deletions notes/20230125_notes.md
@@ -0,0 +1,31 @@

# 20230125

Starting analysis of the Hauser 2023 mouse microbiome competition project

Using mlm pipeline

Transferred data via Globus last week

Generate sample sheet
python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/b1042/HartmannLab/jack/Hauser2023_InVivoCompetition/Hauser11_10.21.2022"

temporarily moved sample sheet for bmo project to have suffix "_bmo"

Downloading mouse bowtie2 index from SourceForge
https://bowtie-bio.sourceforge.net/bowtie2/index.shtml # info
https://benlangmead.github.io/aws-indexes/bowtie # download/wget links
M. musculus GRCm39
wget https://genome-idx.s3.amazonaws.com/bt/GRCm39.zip
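then unpack and sanity-check the index (paths here are a sketch):

```bash
unzip GRCm39.zip -d GRCm39_index
bowtie2-inspect -s GRCm39_index/GRCm39 | head   # prints an index summary if it loads
```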

There are two CZ1D-5 samples:
BC-19_CZ1D-5
**BC-20_CZ1D-5**
BC-21_CZ3D-5
assume that BC-20_CZ1D-5 is actually CZ2D-5
changed it in the mapping file and in the metaphlan species output, etc.
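the rename itself was something like this (the mapping-file name is a placeholder):

```bash
# fix the duplicated sample ID in place; -i.bak keeps a backup copy
sed -i.bak 's/BC-20_CZ1D-5/BC-20_CZ2D-5/g' config/sample_sheet.tsv
```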

Need to add in nonpareil and other features

# 20230127

46 changes: 46 additions & 0 deletions notes/20230131_notes.md
@@ -0,0 +1,46 @@

# 20230131
Processing LyPMA fragmentation optimization

Changed alignment genome back to human from mouse

Updated prep_sample_sheet.py script so it sorts sample sheet alphabetically (easier navigation)

Command line code for making sample sheet
python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/p31648/bmoe008_LibraryFragmentLength/" --subdirectories=True

Would like to clean up bowtie2 mapping and implement a compression step on fastq files

Worked! Got a much cleaner version of the bowtie2 rule that cuts out all intermediate bam/sam files except the final, mapped, sorted bam
also, unmapped reads are now written as fastq.gz
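the shape of the cleaned-up call, as a sketch (index and sample names are placeholders):

```bash
# one pipe: align, drop unmapped reads, sort; unmapped pairs go straight to fastq.gz
bowtie2 -p 8 -x GRCh38_index -1 sample.r1.fastq.gz -2 sample.r2.fastq.gz \
    --un-conc-gz sample.unmapped.r%.fastq.gz \
    | samtools view -bh -F 4 - \
    | samtools sort -o sample.mapped.sorted.bam -
```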

# 20230201

AHHHHHHHHHH WHYYYYYYYY
cluster deleted all conda envs and older mlm files that hadn't been modified in a while
forced to start a new mlm repo, fastest solution

AHHHHH -- snakemamba directory now isn't working, have to reinstall too...
deleted and tried reinstalling from the yaml, but snakemake throws an error like the one reported in 20221103_notes.md

solution:
make sure to check ~/.condarc
apparently conda-forge doesn't play nice with the defaults channel anymore(?) / with mamba
deleted biocore, defaults, and biobakery from the channels list
added back defaults
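the channel cleanup, roughly (worth double-checking the current state with conda config --show channels first):

```bash
conda config --show channels              # see what's currently in ~/.condarc
conda config --remove channels biocore
conda config --remove channels biobakery
conda config --remove channels defaults
conda config --add channels defaults      # re-add defaults cleanly
```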

ok so this solution took hours to get around and, not gonna lie, was really rough
- deleted all conda environments (i.e., rm -rf .conda)
- from now on, we're fully switching to mamba module
- did mamba init bash and then restarted shell
- upon relogin, "(base)" was active since mamba was initialized
- mamba create -c conda-forge -c bioconda -n snakemake snakemake
- started working again

**Need** to containerize conda envs to prevent this from happening again...

also had to manually "conda update fastp" to change fastp from 0.20.1 > 0.22.0
changed it in the conda yaml just in case
the older version, which somehow was being used in the other mlm, doesn't support dedup
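probably worth pinning it so the env can't drift back (the env name here is a placeholder):

```bash
# pin a dedup-capable fastp; mirror the same pin in the conda yaml
mamba install -n mlm_env -c bioconda -c conda-forge "fastp>=0.23.2"
```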


Binary file added notes/dag_all.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion workflow/rules/common.smk
@@ -85,7 +85,7 @@ def get_rules(wildcards):
all_rules = all_rules + expand("results/kraken/{sample}/{sample}_kraken2out.txt", sample=samples["sample"])

if config["METAXA2"]:
all_rules = all_rules + expand("results/metaxa2/{sample}/{sample}_metaxa2.summary.txt", sample=samples["sample"])
all_rules = all_rules + expand("results/metaxa2/{sample}/{sample}_metaxa2.taxonomy.txt", sample=samples["sample"])

if config["ASSEMBLE"]:
if config["MEGAHIT"]:
86 changes: 73 additions & 13 deletions workflow/rules/metaphlan.smk
@@ -167,18 +167,22 @@ use rule metaphlan_species_abundance as metaphlan_bowtie_species_abundance with:


############################
### PART 1A: KRAKEN2 ###
### PART 1B: KRAKEN2 ###
############################

# Because the DB needs such high memory, consider finding a way to process all input files in one rule

rule kraken2:
"""
Performs taxonomic classification with Kraken2
Outputs per-read Kraken2 assignments and a Kraken2-style report;
the metaphlan-style report is generated by the kraken_mpa rule below
"""
input:
r1_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r1.fastq.gz",
r2_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r2.fastq.gz"
output:
table = "results/kraken/{sample}/{sample}_kraken2table.tsv",
reads_class = "results/kraken/{sample}/{sample}_kraken2out.txt"
kraken_out = "results/kraken/{sample}/{sample}_kraken2out.txt",
kraken_report = "results/kraken/{sample}/{sample}_kraken2report.txt"
threads: 20
resources:
mem="180G",
@@ -187,23 +191,72 @@
"""
module load kraken/2
kraken2 --threads {threads} \
--use-names \
--output {output.reads_class} \
--report {output.table} \
--use-mpa-style \
--report-zero-counts \
--output {output.kraken_out} \
--report {output.kraken_report} \
--confidence 0.5 \
--gzip-compressed \
--minimum-hit-groups 3 \
--paired {input.r1_clean} {input.r2_clean}
"""
# --use-names \
# --use-mpa-style \


rule kraken_mpa:
"""
Outputs a metaphlan-style report with a script from KrakenTools
"""
input:
kraken_report = "results/kraken/{sample}/{sample}_kraken2report.txt"
output:
kraken_mpa = "results/kraken/{sample}/{sample}_mpa.tsv"
threads: 1
resources:
mem="3G",
time="0:10:00"
shell:
"""
workflow/scripts/kreport2mpa.py -r {input.kraken_report} \
-o {output.kraken_mpa} \
--display-header \
--no-intermediate-ranks
"""

rule merge_kraken:
"""
Merges the per-sample Kraken2 reports and metaphlan-style profiles with scripts from KrakenTools
"""
input:
kraken_reports = expand("results/kraken/{sample}/{sample}_kraken2report.txt", sample=samples["sample"]),
kraken_mpas = expand("results/kraken/{sample}/{sample}_mpa.tsv", sample=samples["sample"])
output:
merged_reports = "results/kraken/merged_kraken_report_profile.tsv",
merged_mpa = "results/kraken/merged_kraken_mpa_profile.tsv"
threads: 1
resources:
mem="5G",
time="0:10:00"
shell:
"""
workflow/scripts/combine_kreports.py -r {input.kraken_reports} \
-o {output.merged_reports} \
--display-headers
workflow/scripts/combine_mpa.py -i {input.kraken_mpas} \
-o {output.merged_mpa}
"""


############################
### PART 1C: METAXA2 ###
############################

rule metaxa2:
input:
r1_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r1.fastq.gz",
r2_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r2.fastq.gz"
output:
out = "results/metaxa2/{sample}/{sample}_metaxa2.summary.txt"
out = "results/metaxa2/{sample}/{sample}_metaxa2.taxonomy.txt"
params:
base_out = "results/metaxa2/{sample}/{sample}_metaxa2"
threads: 25
@@ -217,10 +270,17 @@ rule metaxa2:
-2 {input.r2_clean} \
--mode metagenome \
-f fastq \
-z gzip \
-g ssu \
-p /software/metaxa2/2.2/metaxa2_db/SSU/HMMs/ \
-o {params.base_out} \
--cpu 24 \
--multi_thread T \
--plus T
"""
--plus T \
--graphical F \
--fasta F
"""

# metaxa2_ttt -i 20221221-Comunal-Zymo_metaxa2.taxonomy.txt -o test -t A,b
# --cpu 24 --multi_thread T --unknown T -r 1 --distance=0
