Skip to content

Commit

Permalink
First commit
Browse files Browse the repository at this point in the history
  • Loading branch information
Steven Monger committed Oct 30, 2018
0 parents commit 42de077
Show file tree
Hide file tree
Showing 95 changed files with 345,155 additions and 0 deletions.
3 changes: 3 additions & 0 deletions GeneSplicerCOPYRIGHT
Original file line number Diff line number Diff line change
@@ -0,0 +1,3 @@
Copyright (c) 2003, The Institute for Genomic Research (TIGR), Rockville,
Maryland, U.S.A. All rights reserved.

114 changes: 114 additions & 0 deletions GeneSplicerLICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,114 @@
The Artistic License

Preamble

The intent of this document is to state the conditions under which a
Package may be copied, such that the Copyright Holder maintains some
semblance of artistic control over the development of the package,
while giving the users of the package the right to use and distribute
the Package in a more-or-less customary fashion, plus the right to
make reasonable modifications.

Definitions:
* "Package" refers to the collection of files distributed by the
Copyright Holder, and derivatives of that collection of files
created through textual modification.
* "Standard Version" refers to such a Package if it has not been
modified, or has been modified in accordance with the wishes of
the Copyright Holder.
* "Copyright Holder" is whoever is named in the copyright or
copyrights for the package.
* "You" is you, if you're thinking about copying or distributing
this Package.
* "Reasonable copying fee" is whatever you can justify on the
basis of media cost, duplication charges, time of people
involved, and so on. (You will not be required to justify it to
the Copyright Holder, but only to the computing community at
large as a market that must bear the fee.)
* "Freely Available" means that no fee is charged for the item
itself, though there may be fees involved in handling the
item. It also means that recipients of the item may redistribute
it under the same conditions they received it.

1. You may make and give away verbatim copies of the source form of
the Standard Version of this Package without restriction, provided
that you duplicate all of the original copyright notices and
associated disclaimers.

2. You may apply bug fixes, portability fixes and other modifications
derived from the Public Domain or from the Copyright Holder. A
Package modified in such a way shall still be considered the
Standard Version.

3. You may otherwise modify your copy of this Package in any way,
provided that you insert a prominent notice in each changed file
stating how and when you changed that file, and provided that you
do at least ONE of the following:

a) place your modifications in the Public Domain or otherwise make
them Freely Available, such as by posting said modifications to
Usenet or an equivalent medium, or placing the modifications on a
major archive site such as ftp.uu.net, or by allowing the
Copyright Holder to include your modifications in the Standard
Version of the Package.

b) use the modified Package only within your corporation or
organization.

c) rename any non-standard executables so the names do not
conflict with standard executables, which must also be provided,
and provide a separate manual page for each non-standard
executable that clearly documents how it differs from the Standard
Version.

d) make other distribution arrangements with the Copyright Holder.

4. You may distribute the programs of this Package in object code or
executable form, provided that you do at least ONE of the
following:

a) distribute a Standard Version of the executables and library
files, together with instructions (in the manual page or
equivalent) on where to get the Standard Version.

b) accompany the distribution with the machine-readable source of
the Package with your modifications.

c) accompany any non-standard executables with their corresponding
Standard Version executables, giving the non-standard executables
non-standard names, and clearly documenting the differences in
manual pages (or equivalent), together with instructions on where
to get the Standard Version.

d) make other distribution arrangements with the Copyright Holder.

5. You may charge a reasonable copying fee for any distribution of
this Package. You may charge any fee you choose for support of this
Package. You may not charge a fee for this Package itself. However,
you may distribute this Package in aggregate with other (possibly
commercial) programs as part of a larger (possibly commercial)
software distribution provided that you do not advertise this
Package as a product of your own.

6. The scripts and library files supplied as input to or produced as
output from the programs of this Package do not automatically fall
under the copyright of this Package, but belong to whomever
generated them, and may be sold commercially, and may be aggregated
with this Package.

7. C or perl subroutines supplied by you and linked into this Package
shall not be considered part of this Package.

8. The name of the Copyright Holder may not be used to endorse or
promote products derived from this software without specific prior
written permission.

9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED
WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES
OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.

The End
This license is approved by the Open Source Initiative
(www.opensource.org) for certifying software as OSI Certified Open
Source.

139 changes: 139 additions & 0 deletions RUN.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,139 @@
#!/bin/bash
#set default parameters
POSITIONAL=()
INPUTVCF="FALSE"
INPUTBED="FALSE"
INPUTFILES=""
USEBP="FALSE"
USEBPINDELS="FALSE"
ANNOTATION=""
FASTAPATH=""
#parse command line args
while [[ $# -gt 0 ]]
do
key="$1"
case $key in
-inputVCF)
INPUTVCF="TRUE"
INPUTFILES="$2"
shift
shift
while [ "$1" ] && [[ ! $1 == *-* ]]; do
INPUTFILES="$INPUTFILES $1"
shift
done
;;
-inputBED)
INPUTBED="TRUE"
INPUTFILES="$2"
shift
shift
while [ "$1" ] && [[ ! $1 == *-* ]]; do
INPUTFILES="$INPUTFILES $1"
shift
done
;;
-fasta)
FASTAPATH="$2"
shift
shift ;;
-gtf)
ANNOTATION="$2"
shift
shift ;;
-branchpointer)
USEBP="TRUE"
USEBPINDELS="FALSE"
shift ;;
-branchpointerIndels)
USEBPINDELS="TRUE"
USEBP="FALSE"
shift ;;
*)
POSITIONAL+=("$1")
shift ;;
esac
done
set - "${POSITIONAL[@]}"
#input error handling
if [ ! -f $FASTAPATH ]; then
echo -e "Fasta file not found: use -fasta ./path/to/hgXX.fa\nExiting..."
exit 1
elif [ ! -f "$ANNOTATION" ]; then
echo "GTF annotation file not found: use -annotation path/to/gencodeXX.gtf\nExiting..."
exit 1
fi
#prepare splice site intervals from annotation.gtf
if [ ! -f data/annotationIntervals.txt ] || [[ "$ANNOTATION" -nt data/annotationIntervals.txt ]] ; then
echo "Preparing splice site annotation..."
grep '[[:blank:]]gene[[:blank:]]\|[[:blank:]]exon[[:blank:]]' data/"$ANNOTATION" | java -cp bin getSpliceSiteIntervalsFromGTF > data/annotationIntervals.txt
fi
#for each file
for FILE in $INPUTFILES; do
fileID=$(echo "$FILE" | xargs -n 1 basename)
echo "Input file: $fileID"
#remove temp files from any previous run
rm temp/"$fileID"* 2> /dev/null
#sort body of input file
grep "^#" "$FILE" > temp/"$fileID"_sorted
grep -v "^#" "$FILE" | sort -k1,1 -k2,2n >> temp/"$fileID"_sorted
#bedtools intersect to get strand info
echo "Retrieving strand info..."
grep '[[:blank:]]gene[[:blank:]]' data/"$ANNOTATION" | sort -k1,1 -k4,4n | bedtools intersect -a temp/"$fileID"_sorted -b stdin -wa -wb -sorted > temp/"$fileID"unstrandedInput.txt
#generate flanking intervals.bed for bedtools getfasta
if [ ! "$INPUTVCF" = "FALSE" ]; then
grep '[[:blank:]]+[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "+", $4, $5}' | sort -u | java -cp bin getFastaIntervals > temp/"$fileID"fastaIntervals.bed
grep '[[:blank:]]-[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "-", $4, $5}' | sort -u | java -cp bin getFastaIntervals >> temp/"$fileID"fastaIntervals.bed
elif [ ! "$INPUTBED" = "FALSE" ]; then
grep '[[:blank:]]+[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "+", $5, $6}' | sort -u | java -cp bin getFastaIntervals > temp/"$fileID"fastaIntervals.bed
grep '[[:blank:]]-[[:blank:]]' temp/"$fileID"unstrandedInput.txt | awk -v OFS="\\t" '{print ".", $1, $2, "-", $5, $6}' | sort -u | java -cp bin getFastaIntervals >> temp/"$fileID"fastaIntervals.bed
fi
echo "Retrieving flanking FASTA sequence..."
bedtools getfasta -fi $FASTAPATH -bed temp/"$fileID"fastaIntervals.bed -name -s > temp/"$fileID"seqToScan.FASTA
#seqScan: generates input strings for maxentscan and genesplicer as well as ESRseq scores
echo "Scanning for motifs..."
java -cp bin seqScan temp/"$fileID"seqToScan.FASTA -useESR $fileID 1>&2
#run maxentscan
echo "Running MaxEntScan..."
perl score5.pl temp/"$fileID"mesDonorInput.txt | java -cp bin processScoresMES > temp/"$fileID"mesDonorScores.txt
retVal=( ${PIPESTATUS[0]} )
if [ $retVal -ne 0 ]; then
echo "MaxEntScan returned non-zero exit status. It is likely not all variants were processed. Exiting..."
exit $retVal
fi
perl score3.pl temp/"$fileID"mesAcceptorInput.txt | java -cp bin processScoresMES > temp/"$fileID"mesAcceptorScores.txt
retVal=( ${PIPESTATUS[0]} )
if [ $retVal -ne 0 ]; then
echo "MaxEntScan returned non-zero exit status. It is likely not all variants were processed. Exiting..."
exit $retVal
fi
#run genesplicer
echo "Running GeneSplicer..."
bin/linux/genesplicerAdapted temp/"$fileID"gsInput.FASTA human > temp/"$fileID"gsScores.txt
#run branchpointer SNPs
if [ "$USEBP" = "TRUE" ]; then
echo "Running Branchpointer..."
Rscript --slave --vanilla bin/bpProcessing.R temp/"$fileID"input.txt $(pwd) "$BPGTF" &> output/"$fileID"bpLog.txt
#awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $15, $16, $21, $22, $23, $24}' output/bpOutputSNPs.txt > output/bpSNPsSummarised.txt
fi
#run branchpointer indels
if [ "$USEBPINDELS" = "TRUE" ] ; then
echo "Running Branchpointer..."
while read -r na chr start strand ref alt; do
refLength=${#ref}
altLength=${#ref}
end=$((refLength+start))
if [ $altLength -gt 1 ] || [ $refLength -gt 1 ]; then
echo -e ".\t$chr\t$start\t$end\t$strand\t$ref\t$alt" >> temp/"$fileID"bpInputIndels.txt
fi
done < temp/"$fileID"input.txt
Rscript --slave --vanilla bin/bpProcessingINDELS.R temp/"$fileID"input.txt temp/"$fileID"bpInputIndels.txt $(pwd) "$BPGTF" &> output/"$fileID"bpLog.txt
#awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $16, $17, $23, $24, $25, $26}' output/bpOutputIndels.txt > output/bpIndelsSummarised.txt
#awk -v OFS=\\t '{print $2, $3, $4, $8, $9, $15, $16, $22, $23, $24, $25}' $SNPpath"bpOutput_SNPs.txt" > output/bpSNPsSummarised.txt
fi
#merge scores into one line
echo "Processing scores..."
cat temp/"$fileID"mesDonorScores.txt temp/"$fileID"mesAcceptorScores.txt temp/"$fileID"gsScores.txt temp/"$fileID"ESRoutput.txt data/annotationIntervals.txt | sort -k1,1 -V -k 2,2n -k 3 -k 4 -s | java -cp bin mergeOutput > output/"$fileID"_out.txt
#clean up temp files
rm temp/"$fileID"* 2> /dev/null
done
52 changes: 52 additions & 0 deletions bin/bpProcessing.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,52 @@
# Assess the effects of SNPs on branchpoint annotations (adapted from Branchpointer vignette)
# read in fileName as Rscript argument
args=(commandArgs(TRUE))
if(length(args)<3){
print("Error: Not all arguments supplied.")
}else{
querySNPFile <- args[1]
outputPath <- args[2]
annotationPath <- args[3]
}

#load libraries and exon annotation
suppressMessages(library(knitr))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressMessages(library(branchpointer))
g <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
exons <- gtfToExons(annotationPath)
#format and filter for branchpoint window SNPs
querySNP <- readQueryFile(querySNPFile,
queryType = "SNP",
exons = exons,
filter = TRUE,
maxDist = 50)

#predict branchpoints and specify number of cores
branchpointPredictionsSNP <- predictBranchpoints(querySNP,
queryType = "SNP",
BSgenome = g,
useParallel = TRUE,
cores = 8)

#summarise and define prediction thresholds
querySNPSummary <- predictionsToSummary(querySNP,branchpointPredictionsSNP, probabilityCutoff = 0.52, probabilityChange = 0.15)

#identify branchpoints created/removed by SNP
if (FALSE) {
l <- length(querySNPSummary)
BPcreated_or_removed <- vector()
for (i in 1:l) {
if (querySNPSummary$created_n[i]==1 | querySNPSummary$deleted_n[i]==1) {
BPcreated_or_removed <- c(BPcreated_or_removed, i)
}
}

#write to file
fileDestination <- "bpOutputSNPs.txt"
write.table(querySNPSummary[BPcreated_or_removed], fileDestination , quote = FALSE, sep="\t")
}

setwd(outputPath)
fileDestination <- "./output/bpOutputSNPs.txt"
write.table(querySNPSummary, fileDestination , quote = FALSE, sep="\t")
68 changes: 68 additions & 0 deletions bin/bpProcessingINDELS.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,68 @@
# Assess the effects of SNPs and INDELS on branchpoint annotations (adapted from Branchpointer vignette)
# read in fileName as Rscript argument
args=(commandArgs(TRUE))
if(length(args)<3){
print("Error: Not all arguments supplied.")
}else{
querySNPFile <- args[1]
queryIndelFile <- args[2]
outputPath <- args[3]
annotationPath <- args[4]
}

#load libraries and exon annotation
suppressMessages(library(knitr))
suppressMessages(library(BSgenome.Hsapiens.UCSC.hg38))
suppressMessages(library(branchpointer))
g <- BSgenome.Hsapiens.UCSC.hg38::BSgenome.Hsapiens.UCSC.hg38
exons <- gtfToExons(annotationPath)

#format and filter for branchpoint window SNPs
querySNP <- readQueryFile(querySNPFile,
queryType = "SNP",
exons = exons,
filter = TRUE,
maxDist = 50)

#predict branchpoints and specify number of cores
branchpointPredictionsSNP <- predictBranchpoints(querySNP,
queryType = "SNP",
BSgenome = g,
useParallel = TRUE,
cores = 8)

#summarise and define prediction thresholds
querySNPSummary <- predictionsToSummary(querySNP,branchpointPredictionsSNP, probabilityCutoff = 0.52, probabilityChange = 0.15)

#format and filter for branchpoint window indels
queryIndel <- readQueryFile(queryIndelFile,
queryType = "indel",
exons = exons)

#predict branchpoints and specify number of cores
branchpointPredictionsIndel <- predictBranchpoints(queryIndel,
queryType = "indel",
BSgenome = g,
useParallel = TRUE,
cores = 8)

#summarise and define prediction thresholds
queryIndelSummary <- predictionsToSummary(queryIndel,branchpointPredictionsIndel, probabilityCutoff = 0.52, probabilityChange = 0.15)

#identify branchpoints created/removed by Indel
if (FALSE) {
l <- length(queryIndelSummary)
BPcreated_or_removed <- vector()
for (i in 1:l) {
if (queryIndelSummary$created_n[i]>=1 | queryIndelSummary$deleted_n[i]>=1) {
BPcreated_or_removed <- c(BPcreated_or_removed, i)
}
}
}

#write to file
setwd(outputPath)
outputFileSNPs <- "./output/bpOutputSNPs.txt"
outputFileIndels <- "./output/bpOutputIndels.txt"
write.table(querySNPSummary, outputFileSNPs , quote = FALSE, sep="\t")
write.table(queryIndelSummary, outputFileIndels , quote = FALSE, sep="\t")
Binary file added bin/getFastaIntervals.class
Binary file not shown.
Loading

0 comments on commit 42de077

Please sign in to comment.