Spliceogen is an integrative, scalable tool for the discovery of splice-altering variants. Variants are assessed for their potential to create or disrupt any of the cis motifs which guide splice site definition: donors, acceptors, branchpoints, enhancers and silencers. Spliceogen integrates predictions from MaxEntScan [1], GeneSplicer [2], ESRseq [3] and Branchpointer [4]. Spliceogen accepts standard VCF/BED inputs and handles both SNVs and indels. Databases of genome-wide predictions are also available.
Paper: https://doi.org/10.1093/bioinformatics/btz263
Maintainer: Steve Monger - [email protected]
See here for installation instructions and obtaining required files. Ensure dependencies are met, or alternatively, run Spliceogen from the provided docker image.
Navigate to your desired installation directory and clone this repository:
git clone https://github.com/VCCRI/Spliceogen.git Spliceogen
-Any whole genome FASTA (.fa)
-Any GTF genome annotation (.gtf)
Browse and download FASTA/GTF versions from Gencode
Alternatively, some recent (as of 2019) hg38 releases can be retrieved using:
> wget ftp://ftp.ensembl.org/pub/release-95/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.alt.fa.gz
> gunzip Homo_sapiens.GRCh38.dna.alt.fa.gz
> wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.basic.annotation.gtf.gz
> gunzip gencode.v29.basic.annotation.gtf.gz
-Bedtools (tested on v2.26.0)
-Java
To include (optional) Branchpointer predictions, users require:
-R (tested on v3.4.3)
-Branchpointer
-BSgenome
The current Bioconductor release of Branchpointer supports SNV predictions. To install it from an R prompt:
> source("https://bioconductor.org/biocLite.R")
> biocLite("branchpointer")
The development version of Branchpointer also supports indels. To install this version instead:
> library(devtools)
> install_github("betsig/branchpointer_dev")
From an R prompt, install the hg38 BSgenome package using the commands below. For hg19, replace "hg38" with "hg19" in the second line.
> source("https://bioconductor.org/biocLite.R")
> biocLite("BSgenome.Hsapiens.UCSC.hg38")
A docker image is provided with all Spliceogen and Branchpointer dependencies installed. With docker installed, the basic command is:
> docker run -it mictro/spliceogen:latest /bin/bash
Or to run it with access to a local directory (containing your VCF/BED/GTF/FASTA files), use the command below. Replace $(pwd) with the path of your directory. The name of the destination directory (/my_dir) can be changed to anything.
> docker run -it -v $(pwd):/my_dir mictro/spliceogen:latest /bin/bash
Then move to the Spliceogen directory:
> cd app/Spliceogen
> cd path/to/Spliceogen
> ./RUN.sh -inputVCF path/to/singleOrMultipleFiles.vcf -fasta path/to/hgXX.fa -gtf path/to/annotation.gtf
Small VCF, BED, GTF and FASTA files are provided to demonstrate input and output formats. Run this small example using the following command:
> ./RUN.sh -inputVCF toy/toy.vcf -gtf toy/toy.gtf -fasta toy/toy.fa
For BED inputs, replace the -inputVCF flag with -inputBED.
To include Branchpointer predictions, include the branchpointer flag and specify the genome build:
*basic usage command* -branchpointer hgXX
Or, for branchpointer_dev, which handles both SNVs and indels, use the flag -branchpointerIndels hgXX
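The flags described above can be combined mechanically. The following shell sketch only prints a command line and does not run anything. The function name spliceogen_cmd and the idea of choosing -inputVCF or -inputBED from the file extension are illustrative assumptions, not part of Spliceogen.

```shell
# Hypothetical helper: print (not run) a RUN.sh command line for one input.
# Usage: spliceogen_cmd INPUT FASTA GTF [BUILD] [indels]
spliceogen_cmd() {
    local input=$1 fasta=$2 gtf=$3 build=${4:-} mode=${5:-}
    local input_flag
    # Assumption: the input type can be inferred from the file extension.
    case "$input" in
        *.vcf) input_flag=-inputVCF ;;
        *.bed) input_flag=-inputBED ;;
        *) echo "unrecognised input extension: $input" >&2; return 1 ;;
    esac
    local cmd="./RUN.sh $input_flag $input -fasta $fasta -gtf $gtf"
    # Append a Branchpointer flag only when a genome build is given.
    if [ -n "$build" ]; then
        if [ "$mode" = "indels" ]; then
            cmd="$cmd -branchpointerIndels $build"
        else
            cmd="$cmd -branchpointer $build"
        fi
    fi
    echo "$cmd"
}
```

For the toy data, spliceogen_cmd toy/toy.vcf toy/toy.fa toy/toy.gtf hg38 prints the basic usage command followed by -branchpointer hg38. Printing the command first makes it easy to check the paths before committing to a long run.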
All scores and predictions can be found in the Spliceogen/output directory in a tab delimited format suitable for ANNOVAR annotation. Multiple output files are provided for each input VCF/BED. This includes one master file containing all scores for all variants, as well as several additional files containing only variants identified as most likely to be disruptive, ranked in descending order. The specific files generated are as follows:
- "$file"_out.txt: Contains all scores generated for every variant, sorted in ascending chromosomal/start position order.
- "$file"_withinSS.txt: Contains all variants that overlap annotated splice sites. The overlapping splice sites are denoted by their exonID and "_donor" or "_acceptor". Variants are sorted by donor/acceptor score decrease, such that the variants most likely to disrupt existing donor/acceptor splice sites appear at the top of this file.
- "$file"_donorCreating.txt: Contains variants outside of existing splice sites that are predicted to create donor motifs. Variants are ranked by P value, based on a logistic regression model trained on the MaxEntScan and GeneSplicer scores of a set of known donor-creating variants derived from Shiraishi et al., 2018 [5].
- "$file"_acceptorCreating.txt: Same as above, but for acceptor-creating variants.
- "$file"_bpOutput.txt: Contains Branchpointer prediction scores, including whether the variant is predicted to create or remove a branchpoint, based on the recommended Branchpointer thresholds.
The following abbreviations are used in the output headers:
mes = MaxEntScan
gs = GeneSplicer
don = Donor
acc = Acceptor
ref = Reference allele
alt = Alternative allele
ESS = Silencer (ESRseq)
ESE = Enhancer (ESRseq)
withinSS = within splice site
donCreateP = donor creation logistic regression P value
accCreateP = acceptor creation logistic regression P value
So for example, the column "gsDonRef" contains GeneSplicer scores representing donor motif strength for the reference sequence, whereas "mesDonAlt" contains MaxEntScan scores representing donor motif strength for the alternative sequence.
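Because the outputs are tab delimited with named headers, individual score columns can be pulled out by name. The sketch below assumes that the first line of the file is the header row. select_columns is a hypothetical helper and not part of Spliceogen.

```shell
# Hypothetical helper: print selected columns of a tab-delimited file,
# chosen by header name and emitted in the order requested.
# Assumption: the first line of the file is the header row.
# Usage: select_columns FILE NAME [NAME...]
select_columns() {
    local file=$1
    shift
    local wanted
    # Join the requested names with commas so they fit in one awk variable.
    wanted=$(IFS=,; echo "$*")
    awk -F'\t' -v OFS='\t' -v wanted="$wanted" '
        BEGIN { n = split(wanted, names, ",") }
        NR == 1 {
            # Map each header name to its column index.
            for (i = 1; i <= NF; i++) idx[$i] = i
            for (j = 1; j <= n; j++) {
                if (!(names[j] in idx)) {
                    print "missing column: " names[j] > "/dev/stderr"
                    exit 1
                }
            }
        }
        {
            line = $(idx[names[1]])
            for (j = 2; j <= n; j++) line = line OFS $(idx[names[j]])
            print line
        }
    ' "$file"
}
```

A call such as select_columns output/"$file"_out.txt gsDonRef mesDonAlt would then print just those two columns, header included, provided both names appear in the header.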
Spliceogen is highly scalable. Predictions are generated at a rate of 2.3 million variants per compute hour, with peak memory usage below 500 MB. Benchmarking was performed on a single compute node with 1 CPU allocated, without Branchpointer predictions.
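The benchmarked rate gives a rough way to budget run time. The sketch below simply divides a variant count by 2.3 million variants per hour. It assumes conditions comparable to the benchmark (1 CPU, no Branchpointer predictions), so real run times may differ. estimate_hours is a hypothetical helper.

```shell
# Hypothetical helper: estimate compute hours for a given number of variants.
# The default rate is the benchmarked 2.3 million variants per compute hour;
# a different rate can be passed as the second argument.
# Usage: estimate_hours N_VARIANTS [VARIANTS_PER_HOUR]
estimate_hours() {
    local n_variants=$1
    local rate=${2:-2300000}
    awk -v n="$n_variants" -v r="$rate" 'BEGIN { printf "%.1f\n", n / r }'
}
```

For instance, estimate_hours 5000000 prints 2.2. The number of variants in an uncompressed VCF can be counted with grep -vc '^#' followed by the file path.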
We provide two versions of the Spliceogen database. Both databases have genome-wide coverage, assessing every SNV at every position within every annotated multi-exon protein-coding transcript (1.29 billion base pairs in total, or 4.9 billion SNVs). They are available for both hg19 and hg38. The databases are formatted for ANNOVAR, and we also provide index files to speed up annotation.
The “focussed” version contains all donor and acceptor predictions:
hg19- https://s3-us-west-2.amazonaws.com/spliceogen/databases/hg19_focussed.zip
hg38- https://s3-us-west-2.amazonaws.com/spliceogen/databases/hg38_focussed.zip
The comprehensive version contains all donor, acceptor, silencer and enhancer predictions:
hg19- https://s3-us-west-2.amazonaws.com/spliceogen/databases/hg19.zip
hg38- https://s3-us-west-2.amazonaws.com/spliceogen/databases/hg38.zip
The focussed database contains predictions for all SNVs within annotated splice sites and all SNVs that are likely to create a de novo donor or acceptor motif. By excluding the vast majority of SNVs which fall outside of splice sites and are unlikely to create a donor/acceptor motif (logistic regression prediction score <0.7), this database is massively reduced in size without reducing the sensitivity of its donor/acceptor predictions.
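A comparable cutoff can be applied to the tool's own tab-delimited output. The sketch below keeps the header plus any row whose named column is numeric and at or above a threshold. It assumes a header row on the first line and that the creation scores appear in a column such as donCreateP or accCreateP. filter_min_score is a hypothetical helper, and whether the database was built with exactly this comparison is not confirmed here.

```shell
# Hypothetical helper: keep the header and rows where COLUMN >= THRESHOLD.
# Assumption: the first line of the file is the header row.
# Usage: filter_min_score FILE COLUMN THRESHOLD
filter_min_score() {
    local file=$1 column=$2 threshold=$3
    awk -F'\t' -v col="$column" -v min="$threshold" '
        NR == 1 {
            # Locate the requested column by its header name.
            for (i = 1; i <= NF; i++) if ($i == col) c = i
            if (!c) { print "missing column: " col > "/dev/stderr"; exit 1 }
            print
            next
        }
        # Rows with non-numeric placeholders (for example "." or empty) are dropped.
        $c ~ /^[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?$/ && ($c + 0) >= (min + 0)
    ' "$file"
}
```

For example, filter_min_score output/"$file"_out.txt donCreateP 0.7 would retain only variants scoring 0.7 or higher in that column.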
Due to the sheer number of scores and predictions provided, we expect that the comprehensive database may be unwieldy for many use cases. To obtain comprehensive predictions, we generally recommend running the tool instead. Running the tool also provides predictions for indels and (optionally) branchpoints, and gives you the flexibility to select or customise your GTF annotation.
1. Yeo, G., Burge, C., "Maximum Entropy Modeling of Short Sequence Motifs with Applications to RNA Splicing Signals", J Comput Biol. 2004; 11(2-3):377-394
2. Pertea, M., Lin, X., Salzberg, S., "GeneSplicer: a new computational method for splice site prediction", Nucleic Acids Res. 2001; 29(5):1185-1190
3. Ke, S., et al., "Quantitative evaluation of all hexamers as exonic splicing elements", Genome Res. 2011; 21(8):1360-1374
4. Signal, B., et al., "Machine learning annotation of human branchpoints", Bioinformatics. 2018; 34(6):920-927
5. Shiraishi, Y., et al., "A comprehensive characterization of cis-acting splicing-associated variants in human cancer", Genome Res. 2018; 28(8):1111-1125