First commit

VCCRI · Oct 30, 2018 · 42de077 · 42de077
commit 42de077
Show file tree

Hide file tree

Showing 95 changed files with 345,155 additions and 0 deletions.
diff --git a/GeneSplicerCOPYRIGHT b/GeneSplicerCOPYRIGHT
@@ -0,0 +1,3 @@
+Copyright (c) 2003, The Institute for Genomic Research (TIGR), Rockville,
+Maryland, U.S.A.  All rights reserved.
+
diff --git a/GeneSplicerLICENSE b/GeneSplicerLICENSE
@@ -0,0 +1,114 @@
+The Artistic License
+
+Preamble
+
+The intent of this document is to state the conditions under which a
+Package may be copied, such that the Copyright Holder maintains some
+semblance of artistic control over the development of the package,
+while giving the users of the package the right to use and distribute
+the Package in a more-or-less customary fashion, plus the right to
+make reasonable modifications.
+
+Definitions:
+    * "Package" refers to the collection of files distributed by the
+      Copyright Holder, and derivatives of that collection of files
+      created through textual modification.
+    * "Standard Version" refers to such a Package if it has not been
+      modified, or has been modified in accordance with the wishes of
+      the Copyright Holder.
+    * "Copyright Holder" is whoever is named in the copyright or
+      copyrights for the package.
+    * "You" is you, if you're thinking about copying or distributing
+      this Package.
+    * "Reasonable copying fee" is whatever you can justify on the
+      basis of media cost, duplication charges, time of people
+      involved, and so on. (You will not be required to justify it to
+      the Copyright Holder, but only to the computing community at
+      large as a market that must bear the fee.)
+    * "Freely Available" means that no fee is charged for the item
+      itself, though there may be fees involved in handling the
+      item. It also means that recipients of the item may redistribute
+      it under the same conditions they received it.
+
+1. You may make and give away verbatim copies of the source form of
+   the Standard Version of this Package without restriction, provided
+   that you duplicate all of the original copyright notices and
+   associated disclaimers.
+
+2. You may apply bug fixes, portability fixes and other modifications
+   derived from the Public Domain or from the Copyright Holder. A
+   Package modified in such a way shall still be considered the
+   Standard Version.
+
+3. You may otherwise modify your copy of this Package in any way,
+   provided that you insert a prominent notice in each changed file
+   stating how and when you changed that file, and provided that you
+   do at least ONE of the following:
+
+    a) place your modifications in the Public Domain or otherwise make
+    them Freely Available, such as by posting said modifications to
+    Usenet or an equivalent medium, or placing the modifications on a
+    major archive site such as ftp.uu.net, or by allowing the
+    Copyright Holder to include your modifications in the Standard
+    Version of the Package.
+
+    b) use the modified Package only within your corporation or
+    organization.
+
+    c) rename any non-standard executables so the names do not
+    conflict with standard executables, which must also be provided,
+    and provide a separate manual page for each non-standard
+    executable that clearly documents how it differs from the Standard
+    Version.
+
+    d) make other distribution arrangements with the Copyright Holder.
+
+4. You may distribute the programs of this Package in object code or
+   executable form, provided that you do at least ONE of the
+   following:
+
+    a) distribute a Standard Version of the executables and library
+    files, together with instructions (in the manual page or
+    equivalent) on where to get the Standard Version.
+
+    b) accompany the distribution with the machine-readable source of
+    the Package with your modifications.
+
+    c) accompany any non-standard executables with their corresponding
+    Standard Version executables, giving the non-standard executables
+    non-standard names, and clearly documenting the differences in
+    manual pages (or equivalent), together with instructions on where
+    to get the Standard Version.
+
+    d) make other distribution arrangements with the Copyright Holder.
+
+5. You may charge a reasonable copying fee for any distribution of
+   this Package. You may charge any fee you choose for support of this
+   Package. You may not charge a fee for this Package itself. However,
+   you may distribute this Package in aggregate with other (possibly
+   commercial) programs as part of a larger (possibly commercial)
+   software distribution provided that you do not advertise this
+   Package as a product of your own.
+
+6. The scripts and library files supplied as input to or produced as
+   output from the programs of this Package do not automatically fall
+   under the copyright of this Package, but belong to whomever
+   generated them, and may be sold commercially, and may be aggregated
+   with this Package.
+
+7. C or perl subroutines supplied by you and linked into this Package
+   shall not be considered part of this Package.
+
+8. The name of the Copyright Holder may not be used to endorse or
+   promote products derived from this software without specific prior
+   written permission.
+
+9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
+   WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
+   OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.
+
+The End
+This license is approved by the Open Source Initiative
+(www.opensource.org) for certifying software as OSI Certified Open
+Source.
+
diff --git a/RUN.sh b/RUN.sh
@@ -0,0 +1,139 @@
+#!/bin/bash
+#set default parameters
+POSITIONAL=()
+INPUTVCF="FALSE"
+INPUTBED="FALSE"
+INPUTFILES=""
+USEBP="FALSE"
+USEBPINDELS="FALSE"
+ANNOTATION=""
+FASTAPATH=""
+#parse command line args
+while [[ $# -gt 0 ]]
+do
+key="$1"
+case $key in
+    -inputVCF)
+    INPUTVCF="TRUE"
+    INPUTFILES="$2"
+    shift
+    shift
+    while [ "$1" ] && [[ ! $1 == *-* ]]; do
+        INPUTFILES="$INPUTFILES $1"
+        shift
+    done
+    ;;
+    -inputBED)
+    INPUTBED="TRUE"
+    INPUTFILES="$2"
+    shift
+    shift
+    while [ "$1" ] && [[ ! $1 == *-* ]]; do
+        INPUTFILES="$INPUTFILES $1"
+        shift
+    done
+    ;;
+    -fasta)
+    FASTAPATH="$2"
+    shift
+    shift ;;
+    -gtf)
+    ANNOTATION="$2"
+    shift
+    shift ;;
+    -branchpointer)
+    USEBP="TRUE"
+    USEBPINDELS="FALSE"
+    shift ;;
+    -branchpointerIndels)
+    USEBPINDELS="TRUE"
+    USEBP="FALSE"
+    shift ;;
+    *)
+    POSITIONAL+=("$1")
+    shift ;;
+esac
+done
+set - "${POSITIONAL[@]}"
+#input error handling
+if [ ! -f $FASTAPATH ]; then
+    echo -e "Fasta file not found: use -fasta ./path/to/hgXX.fa\nExiting..."
+    exit 1
+elif [ ! -f "$ANNOTATION" ]; then
+    echo "GTF annotation file not found: use -annotation path/to/gencodeXX.gtf\nExiting..."
+    exit 1
+fi
+#prepare splice site intervals from annotation.gtf
+if [ ! -f data/annotationIntervals.txt ] || [[ "$ANNOTATION" -nt data/annotationIntervals.txt ]] ; then
+    echo "Preparing splice site annotation..."
+    grep '[[:blank:]]gene[[:blank:]]\|[[:blank:]]exon[[:blank:]]' data/"$ANNOTATION" | java -cp bin getSpliceSiteIntervalsFromGTF > data/annotationIntervals.txt
+fi
+#for each file
+for FILE in $INPUTFILES; do
+    fileID=$(echo "$FILE" | xargs -n 1 basename)
+    echo "Input file: $fileID"
+    #remove temp files from any previous run
+    rm temp/"$fileID"* 2> /dev/null
+    #sort body of input file
+    grep "^#" "$FILE" > temp/"$fileID"_sorted
+    grep -v "^#" "$FILE" | sort -k1,1 -k2,2n >> temp/"$fileID"_sorted
+    #bedtools intersect to get strand info
+    echo "Retrieving strand info..."
+    grep '[[:blank:]]gene[[:blank:]]' data/"$ANNOTATION" | sort -k1,1 -k4,4n | bedtools intersect -a temp/"$fileID"_sorted -b stdin -wa -wb -sorted  > temp/"$fileID"unstrandedInput.txt
+    #generate flanking intervals.bed for bedtools getfasta
+    if [ ! "$INPUTVCF" = "FALSE" ]; then
+        grep '[[:blank:]]+[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "+", $4, $5}' | sort -u | java -cp bin getFastaIntervals > temp/"$fileID"fastaIntervals.bed
+        grep '[[:blank:]]-[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "-", $4, $5}' | sort -u | java -cp bin getFastaIntervals >> temp/"$fileID"fastaIntervals.bed
+    elif [ ! "$INPUTBED" = "FALSE" ]; then
+        grep '[[:blank:]]+[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "+", $5, $6}' | sort -u | java -cp bin getFastaIntervals > temp/"$fileID"fastaIntervals.bed
+        grep '[[:blank:]]-[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "-", $5, $6}' | sort -u | java -cp bin getFastaIntervals >> temp/"$fileID"fastaIntervals.bed
+    fi
+    echo "Retrieving flanking FASTA sequence..."
+    bedtools getfasta -fi $FASTAPATH -bed temp/"$fileID"fastaIntervals.bed -name -s > temp/"$fileID"seqToScan.FASTA
+    #seqScan: generates input strings for maxentscan and genesplicer as well as ESRseq scores
+    echo "Scanning for motifs..."
+    java -cp bin seqScan temp/"$fileID"seqToScan.FASTA -useESR $fileID 1>&2
+    #run maxentscan
+    echo "Running MaxEntScan..."
+    perl score5.pl temp/"$fileID"mesDonorInput.txt | java -cp bin processScoresMES > temp/"$fileID"mesDonorScores.txt
+    retVal=( ${PIPESTATUS[0]} )
+    if [ $retVal -ne 0 ]; then
+        echo "MaxEntScan returned non-zero exit status. It is likely not all variants were processed. Exiting..."
+    exit $retVal
+    fi
+    perl score3.pl temp/"$fileID"mesAcceptorInput.txt | java -cp bin processScoresMES > temp/"$fileID"mesAcceptorScores.txt
+    retVal=( ${PIPESTATUS[0]} )
+    if [ $retVal -ne 0 ]; then
+        echo "MaxEntScan returned non-zero exit status. It is likely not all variants were processed. Exiting..."
+    exit $retVal
+    fi
+    #run genesplicer
+    echo "Running GeneSplicer..."
+    bin/linux/genesplicerAdapted temp/"$fileID"gsInput.FASTA human > temp/"$fileID"gsScores.txt
+    #run branchpointer SNPs
+    if [ "$USEBP" = "TRUE" ]; then
+        echo "Running Branchpointer..."
+        Rscript --slave --vanilla bin/bpProcessing.R temp/"$fileID"input.txt $(pwd) "$BPGTF" &> output/"$fileID"bpLog.txt
+        #awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $15, $16, $21, $22, $23, $24}' output/bpOutputSNPs.txt > output/bpSNPsSummarised.txt
+    fi
+    #run branchpointer indels
+    if [ "$USEBPINDELS" = "TRUE" ] ; then
+        echo "Running Branchpointer..."
+        while read -r na chr start strand ref alt; do
+            refLength=${#ref}
+            altLength=${#ref}
+            end=$((refLength+start))
+            if [ $altLength -gt 1 ] || [ $refLength -gt 1 ]; then
+                echo -e ".\t$chr\t$start\t$end\t$strand\t$ref\t$alt" >> temp/"$fileID"bpInputIndels.txt
+            fi
+        done < temp/"$fileID"input.txt
+        Rscript --slave --vanilla bin/bpProcessingINDELS.R temp/"$fileID"input.txt temp/"$fileID"bpInputIndels.txt $(pwd) "$BPGTF" &> output/"$fileID"bpLog.txt
+        #awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $16, $17, $23, $24, $25, $26}' output/bpOutputIndels.txt > output/bpIndelsSummarised.txt
+        #awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $15, $16, $22, $23, $24, $25}' $SNPpath"bpOutput_SNPs.txt" > output/bpSNPsSummarised.txt
+    fi
+    #merge scores into one line
+    echo "Processing scores..."
+    cat temp/"$fileID"mesDonorScores.txt temp/"$fileID"mesAcceptorScores.txt temp/"$fileID"gsScores.txt temp/"$fileID"ESRoutput.txt data/annotationIntervals.txt | sort -k1,1 -V -k 2,2n -k 3 -k 4 -s | java -cp bin mergeOutput > output/"$fileID"_out.txt
+    #clean up temp files
+    rm temp/"$fileID"* 2> /dev/null
+done
diff --git a/bin/bpProcessing.R b/bin/bpProcessing.R
@@ -0,0 +1,52 @@
+# Assess the effects of SNPs on branchpoint annotations (adapted from Branchpointer vignette)
+# read in fileName as Rscript argument
+args=(commandArgs(TRUE))
+if(length(args)<3){
+  print("Error: Not all arguments supplied.")
+}else{
+  querySNPFile <- args[1]
+  outputPath <- args[2]
+  annotationPath <- args[3]
+  }
+
+#load libraries and exon annotation
+suppressMessages(library(knitr))
+suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
+suppressMessages(library(branchpointer))
+g <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
+exons <- gtfToExons(annotationPath)
+#format and filter for branchpoint window SNPs
+querySNP <- readQueryFile(querySNPFile, 
+                          queryType = "SNP", 
+                          exons = exons, 
+                          filter = TRUE,
+                          maxDist = 50)
+
+#predict branchpoints and specify number of cores
+branchpointPredictionsSNP <- predictBranchpoints(querySNP,
+                                                 queryType = "SNP",
+                                                 BSgenome = g,
+                                                 useParallel = TRUE,
+                                                 cores = 8)
+
+#summarise and define prediction thresholds
+querySNPSummary <- predictionsToSummary(querySNP,branchpointPredictionsSNP, probabilityCutoff = 0.52, probabilityChange = 0.15)
+
+#identify branchpoints created/removed by SNP
+if (FALSE) {
+l <- length(querySNPSummary)
+BPcreated_or_removed <- vector()
+for (i in 1:l) {
+  if (querySNPSummary$created_n[i]==1 | querySNPSummary$deleted_n[i]==1) {
+    BPcreated_or_removed <- c(BPcreated_or_removed, i)
+  }
+}
+
+#write to file
+fileDestination <- "bpOutputSNPs.txt"
+write.table(querySNPSummary[BPcreated_or_removed], fileDestination , quote = FALSE, sep="\t")
+}
+
+setwd(outputPath)
+fileDestination <- "./output/bpOutputSNPs.txt"
+write.table(querySNPSummary, fileDestination , quote = FALSE, sep="\t")
diff --git a/bin/bpProcessingINDELS.R b/bin/bpProcessingINDELS.R
@@ -0,0 +1,68 @@
+# Assess the effects of SNPs and INDELS on branchpoint annotations (adapted from Branchpointer vignette)
+# read in fileName as Rscript argument
+args=(commandArgs(TRUE))
+if(length(args)<3){
+  print("Error: Not all arguments supplied.")
+}else{
+  querySNPFile <- args[1]
+  queryIndelFile <- args[2]
+  outputPath <- args[3]
+  annotationPath <- args[4]
+}
+
+#load libraries and exon annotation
+suppressMessages(library(knitr))
+suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
+suppressMessages(library(branchpointer))
+g <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
+exons <- gtfToExons(annotationPath)
+
+#format and filter for branchpoint window SNPs
+querySNP <- readQueryFile(querySNPFile, 
+                          queryType = "SNP", 
+                          exons = exons, 
+                          filter = TRUE,
+                          maxDist = 50)
+
+#predict branchpoints and specify number of cores
+branchpointPredictionsSNP <- predictBranchpoints(querySNP,
+                                                 queryType = "SNP",
+                                                 BSgenome = g,
+                                                 useParallel = TRUE,
+                                                 cores = 8)
+
+#summarise and define prediction thresholds
+querySNPSummary <- predictionsToSummary(querySNP,branchpointPredictionsSNP, probabilityCutoff = 0.52, probabilityChange = 0.15)
+
+#format and filter for branchpoint window indels
+queryIndel <- readQueryFile(queryIndelFile, 
+                          queryType = "indel", 
+                          exons = exons)
+
+#predict branchpoints and specify number of cores
+branchpointPredictionsIndel <- predictBranchpoints(queryIndel,
+                                                 queryType = "indel",
+                                                 BSgenome = g,
+                                                 useParallel = TRUE,
+                                                 cores = 8)
+
+#summarise and define prediction thresholds
+queryIndelSummary <- predictionsToSummary(queryIndel,branchpointPredictionsIndel, probabilityCutoff = 0.52, probabilityChange = 0.15)
+
+#identify branchpoints created/removed by Indel
+if (FALSE) {
+  l <- length(queryIndelSummary)
+BPcreated_or_removed <- vector()
+for (i in 1:l) {
+  if (queryIndelSummary$created_n[i]>=1 | queryIndelSummary$deleted_n[i]>=1) {
+    BPcreated_or_removed <- c(BPcreated_or_removed, i)
+  }
+}
+}
+
+#write to file
+setwd(outputPath)
+outputFileSNPs <- "./output/bpOutputSNPs.txt"
+outputFileIndels <- "./output/bpOutputIndels.txt"
+write.table(querySNPSummary, outputFileSNPs , quote = FALSE, sep="\t")
+write.table(queryIndelSummary, outputFileIndels , quote = FALSE, sep="\t")
diff --git a/bin/getFastaIntervals.class b/bin/getFastaIntervals.class
Original file line number	Diff line number	Diff line change
		@@ -0,0 +1,3 @@
		Copyright (c) 2003, The Institute for Genomic Research (TIGR), Rockville,
		Maryland, U.S.A. All rights reserved.