merging kraken profile functionality started
jtsumner committed Feb 5, 2023
1 parent bd7e9f8 commit 9def34b
Showing 12 changed files with 864 additions and 17 deletions.
4 changes: 2 additions & 2 deletions Snakefile
@@ -22,8 +22,8 @@ include: "workflow/rules/bin_metabat2.smk"

rule all:
input:
get_rules

get_rules,
"results/kraken/merged_kraken_report_profile.tsv"

# Make report for snakemake.

2 changes: 1 addition & 1 deletion config/config.yaml
@@ -21,7 +21,7 @@ BWA: False
BOWTIE2: True
METAPHLAN: True
KRAKEN2: True
METAXA2: True
METAXA2: False
ASSEMBLE: False
MEGAHIT: True
SPADES: True
36 changes: 36 additions & 0 deletions notes/20221103_notes.md
@@ -0,0 +1,36 @@
command I used to make the sample sheet:

python3 workflow/scripts/prep_sample_sheet.py --delimiter="_" --r1_common="_1" --r2_common="_2" --path="../cat_reads"
reads in ../cat_reads and ended up working for everything hehe

noticed that NTM00399 did assemble, even though I noted in my rotation report that it didn't... maybe there were just too few assemblies
on the same note, two samples are taking quite a long time: NTM00262, NTM00067

had to delete the snakemamba conda env and reinstall with the code in the documentation. I think it became too bloated with all my mlms, but not sure

with that new install, I also had to update snakemake because I kept getting a snakemake error (7.3.8 > 7.18)
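the update itself was just something like this (the channel list is my assumption):

```bash
# bump snakemake inside the snakemamba env
mamba update -c conda-forge -c bioconda snakemake
```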

20221116

noticed snakemake stalls at the dropping-short-contigs script + python gives a "Biopython not recognized" error -> biopython not installed?

found which conda env it is w/ "snakemake -n --list-conda-envs"
activated it with minipython and tested parse_contigs function --> it worked, so prob not the install
means prob a problem with snakemake?? maybe path?
I think it may be that there are issues calling the correct python environment for the "script" directive in snakemake...
...maybe try installing in snakemamba
did conda install -c bioconda biopython=1.78 in snakemamba env
WOO this seemed to have worked... but why didn't snakemake work originally?
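for next time, the env check distilled down (the env hash under .snakemake/conda/ is a placeholder, use the real one from the list):

```bash
# find the conda envs snakemake manages, then test biopython inside the one it uses
snakemake -n --list-conda-envs
conda activate .snakemake/conda/<env-hash>   # placeholder path
python -c "from Bio import SeqIO; print('biopython ok')"
```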

noticed map2contigs wasn't working after dropping short contigs started working
in the rule description, "mem=<>" was blank since god knows when...
set it to be 25G

also, deleting NTM00263, NTM00067 from sample sheet

20221201
turns out I accidentally deleted NTM01067 instead of NTM00067; it's been corrected and rerun

also committed the changes from last time re: debugging and made some new changes to the report, which now seems kinda fixed

NTM01464_k99_1692 only has 1 contig > 1kb in megahit and none in spades, therefore removing
48 changes: 48 additions & 0 deletions notes/20221205_notes.md
@@ -0,0 +1,48 @@
20221204

starting analysis on shotgun BAL data

there's one mistake in the sample log -- B22_NaNa was misnamed B22_NaDNase, so both
samples' reads got put into one. THANKFULLY not a frameshift mistake -- need better
naming conventions to avoid this in the future

will omit the NaDNase samples -- will rerun bcl-convert with an updated sample sheet
and add them in tomorrow.

B22_NaDNase MGX_DATA /projects/p31648/bmo_shotgun_fastq/reads/B22_NaDNase_S12_L001_R1_001.fastq.gz /projects/p31648/bmo_shotgun_fastq/reads/B22_NaDNase_S12_L001_R2_001.fastq.gz


20221205

restarting analysis with updated data

python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/p31648/bmo_shotgun_fastq" --subdirectories=True
ended up working for everything hehe

20221213

last time I added a bunch of functionality, including bowtie2 alignment etc.
this led to improved efficiency and showed greater host depletion and taxonomic depth from metaphlan
it also led to the conclusion that duplicate reads from adapter dimers are contributing to artificially low alignment in some samples
thus, a read deduplication step is necessary to estimate alignment more robustly

today, I'm updating fastp to v0.23.2 in order to use its deduplication functionality
activated conda environment with fastp
conda update fastp

fastp settings need to be optimized together with the downstream steps
fastp -i /projects/p31648/bmo_shotgun_fastq/ds.d2b2785572ed457391fae708169f4be5/B21_LyNa_S77_L001_R1_001.fastq.gz -I /projects/p31648/bmo_shotgun_fastq/ds.d2b2785572ed457391fae708169f4be5/B21_LyNa_S77_L001_R2_001.fastq.gz --out1 B21_LyNA.r1.a.fastq.gz --out2 B21_LyNA.r2.a.fastq.gz --detect_adapter_for_pe --dedup --thread 23 --length_required 50 -j ./B21_LyNa_fastp.json -h ./B21_LyNa_fastp.html -V --overrepresentation_analysis -g --merge --merged_out B21_LyNA.merged.a.fastq.gz --low_complexity_filter --include_unmerged --correction

and then include the merged + unpaired reads via -U in the bowtie2 mapping
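rough sketch of what that bowtie2 call would look like (index basename and filenames are placeholders, not the pipeline's actual paths):

```bash
# paired reads via -1/-2, merged + unpaired reads via -U (single-end input)
bowtie2 -p 23 -x GRCh38_index \
    -1 B21_LyNA.r1.a.fastq.gz -2 B21_LyNA.r2.a.fastq.gz \
    -U B21_LyNA.merged.a.fastq.gz \
    -S B21_LyNA.sam
```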



20221214

starting to test the other components

fastp > bowtie2 etc. finished up yesterday with dedup, except B21_LyDNase needs more memory

one-liner for flagstat (cd into the bowtie_out directory first)
for i in $(ls -d *) ; do sed -e "s/^/$i\t/" ${i}/*.flagstat.tsv >> tmp.txt ; done
31 changes: 31 additions & 0 deletions notes/20230125_notes.md
@@ -0,0 +1,31 @@

# 20230125

Starting analysis of the Hauser 2023 mouse microbiome competition project

Using mlm pipeline

Transferred data via Globus last week

Generate sample sheet
python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/b1042/HartmannLab/jack/Hauser2023_InVivoCompetition/Hauser11_10.21.2022"

temporarily moved sample sheet for bmo project to have suffix "_bmo"

Downloading mouse bowtie2 index from SourceForge
https://bowtie-bio.sourceforge.net/bowtie2/index.shtml # info
https://benlangmead.github.io/aws-indexes/bowtie # download/wget links
M. musculus GRCm39
wget https://genome-idx.s3.amazonaws.com/bt/GRCm39.zip
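then unpack and sanity-check the index (paths here are a sketch):

```bash
unzip GRCm39.zip -d GRCm39_index
bowtie2-inspect -s GRCm39_index/GRCm39 | head   # prints an index summary if it loads
```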

There are two CZ1D-5 samples:
BC-19_CZ1D-5
**BC-20_CZ1D-5**
BC-21_CZ3D-5
assume that BC-20_CZ1D-5 is actually CZ2D-5
changed it in the mapping file and in the metaphlan species output, etc.
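the rename itself was something like this (the mapping-file name is a placeholder):

```bash
# fix the duplicated sample ID in place; -i.bak keeps a backup copy
sed -i.bak 's/BC-20_CZ1D-5/BC-20_CZ2D-5/g' config/sample_sheet.tsv
```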

Need to add in nonpareil and other features

# 20230127

46 changes: 46 additions & 0 deletions notes/20230131_notes.md
@@ -0,0 +1,46 @@

# 20230131
Processing LyPMA fragmentation optimization

Changed alignment genome back to human from mouse

Updated prep_sample_sheet.py script so it sorts sample sheet alphabetically (easier navigation)

Command line code for making sample sheet
python3 workflow/scripts/prep_sample_sheet.py --delimiter="_S" --path="/projects/p31648/bmoe008_LibraryFragmentLength/" --subdirectories=True

Would like to clean up bowtie2 mapping and implement a compression step on fastq files

Worked! Got a much cleaner version of the bowtie2 rule that cuts out all intermediate bam/sam files except the final, mapped, sorted bam
also, unmapped reads are now written as fastq.gz
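the shape of the cleaned-up call, as a sketch (index and sample names are placeholders):

```bash
# one pipe: align, drop unmapped reads, sort; unmapped pairs go straight to fastq.gz
bowtie2 -p 8 -x GRCh38_index -1 sample.r1.fastq.gz -2 sample.r2.fastq.gz \
    --un-conc-gz sample.unmapped.r%.fastq.gz \
    | samtools view -bh -F 4 - \
    | samtools sort -o sample.mapped.sorted.bam -
```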

# 20230201

AHHHHHHHHHH WHYYYYYYYY
cluster deleted all conda envs and older mlm files that hadn't been modified in a while
forced to start a new mlm repo, fastest solution

AHHHHH -- snakemamba directory now isn't working, have to reinstall too...
deleted and tried reinstalling from the yaml, but snakemake throws an error like the one reported in 20221103_notes.md

solution:
make sure to check ~/.condarc
apparently conda-forge doesn't play nice with the defaults channel anymore(?) / with mamba
deleted biocore, defaults, and biobakery from the channels list
added back defaults
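the channel cleanup, roughly (worth double-checking the current state with conda config --show channels first):

```bash
conda config --show channels              # see what's currently in ~/.condarc
conda config --remove channels biocore
conda config --remove channels biobakery
conda config --remove channels defaults
conda config --add channels defaults      # re-add defaults cleanly
```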

ok so this solution took hours to get around and, not gonna lie, was really rough
- deleted all conda environments (i.e., rm -rf .conda)
- from now on, we're fully switching to mamba module
- did mamba init bash and then restarted shell
- upon relogin, "(base)" was active since mamba was initialized
- mamba create -c conda-forge -c bioconda -n snakemake snakemake
- started working again

**Need** to containerize conda envs to prevent this from happening again...

also had to manually "conda update fastp" to change fastp from 0.20.1 > 0.22.0
changed it in the conda yaml just in case
the older version, which somehow was being used in the other mlm, doesn't support dedup
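probably worth pinning it so the env can't drift back (the env name here is a placeholder):

```bash
# pin a dedup-capable fastp; mirror the same pin in the conda yaml
mamba install -n mlm_env -c bioconda -c conda-forge "fastp>=0.23.2"
```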


Binary file added notes/dag_all.pdf
Binary file not shown.
2 changes: 1 addition & 1 deletion workflow/rules/common.smk
@@ -85,7 +85,7 @@ def get_rules(wildcards):
all_rules = all_rules + expand("results/kraken/{sample}/{sample}_kraken2out.txt", sample=samples["sample"])

if config["METAXA2"]:
all_rules = all_rules + expand("results/metaxa2/{sample}/{sample}_metaxa2.summary.txt", sample=samples["sample"])
all_rules = all_rules + expand("results/metaxa2/{sample}/{sample}_metaxa2.taxonomy.txt", sample=samples["sample"])

if config["ASSEMBLE"]:
if config["MEGAHIT"]:
86 changes: 73 additions & 13 deletions workflow/rules/metaphlan.smk
@@ -167,18 +167,22 @@ use rule metaphlan_species_abundance as metaphlan_bowtie_species_abundance with:


############################
### PART 1A: KRAKEN2 ###
### PART 1B: KRAKEN2 ###
############################

# Because the DB needs such high memory, consider finding a way to process all input files in one rule

rule kraken2:
"""
Performs taxonomic classification with Kraken2
Outputs per-read Kraken2 assignments and a Kraken2-style report;
the metaphlan-style report is generated by the kraken_mpa rule below
"""
input:
r1_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r1.fastq.gz",
r2_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r2.fastq.gz"
output:
table = "results/kraken/{sample}/{sample}_kraken2table.tsv",
reads_class = "results/kraken/{sample}/{sample}_kraken2out.txt"
kraken_out = "results/kraken/{sample}/{sample}_kraken2out.txt",
kraken_report = "results/kraken/{sample}/{sample}_kraken2report.txt"
threads: 20
resources:
mem="180G",
@@ -187,23 +191,72 @@
"""
module load kraken/2
kraken2 --threads {threads} \
--use-names \
--output {output.reads_class} \
--report {output.table} \
--use-mpa-style \
--report-zero-counts \
--output {output.kraken_out} \
--report {output.kraken_report} \
--confidence 0.5 \
--gzip-compressed \
--minimum-hit-groups 3 \
--paired {input.r1_clean} {input.r2_clean}
"""
# --use-names \
# --use-mpa-style \


rule kraken_mpa:
"""
Outputs a metaphlan-style report with a script from KrakenTools
"""
input:
kraken_report = "results/kraken/{sample}/{sample}_kraken2report.txt"
output:
kraken_mpa = "results/kraken/{sample}/{sample}_mpa.tsv"
threads: 1
resources:
mem="3G",
time="0:10:00"
shell:
"""
workflow/scripts/kreport2mpa.py -r {input.kraken_report} \
-o {output.kraken_mpa} \
--display-header \
--no-intermediate-ranks
"""

rule merge_kraken:
"""
Merges the per-sample Kraken2 reports and metaphlan-style profiles with scripts from KrakenTools
"""
input:
kraken_reports = expand("results/kraken/{sample}/{sample}_kraken2report.txt", sample=samples["sample"]),
kraken_mpas = expand("results/kraken/{sample}/{sample}_mpa.tsv", sample=samples["sample"])
output:
merged_reports = "results/kraken/merged_kraken_report_profile.tsv",
merged_mpa = "results/kraken/merged_kraken_mpa_profile.tsv"
threads: 1
resources:
mem="5G",
time="0:10:00"
shell:
"""
workflow/scripts/combine_kreports.py -r {input.kraken_reports} \
-o {output.merged_reports} \
--display-headers
workflow/scripts/combine_mpa.py -i {input.kraken_mpas} \
-o {output.merged_mpa}
"""


############################
### PART 1C: METAXA2 ###
############################

rule metaxa2:
input:
r1_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r1.fastq.gz",
r2_clean = "results/bowtie_out/{sample}/{sample}.fastp_bowtie.r2.fastq.gz"
output:
out = "results/metaxa2/{sample}/{sample}_metaxa2.summary.txt"
out = "results/metaxa2/{sample}/{sample}_metaxa2.taxonomy.txt"
params:
base_out = "results/metaxa2/{sample}/{sample}_metaxa2"
threads: 25
@@ -217,10 +270,17 @@ rule metaxa2:
-2 {input.r2_clean} \
--mode metagenome \
-f fastq \
-z gzip \
-g ssu \
-p /software/metaxa2/2.2/metaxa2_db/SSU/HMMs/ \
-o {params.base_out} \
--cpu 24 \
--multi_thread T \
--plus T
"""
--plus T \
--graphical F \
--fasta F
"""

# metaxa2_ttt -i 20221221-Comunal-Zymo_metaxa2.taxonomy.txt -o test -t A,b
# --cpu 24 --multi_thread T --unknown T -r 1 --distance=0
