Project Links

This repository is part of the CI-SpliceAI software package published in PLOS One.

This is the project comparing different splice prediction tools on variant data. You may also be interested in the code to train CI-SpliceAI, code to use trained models to annotate variants offline, and the website providing online annotation of variants.

Abstract

This project evaluates six splice prediction tools, one of which is our own CI-SpliceAI, on a corpus of:

  • 1,317 variants for a binary affecting/non-affecting task; and
  • 388 variants (subset of the first corpus) with annotations of their exact variant effect

This repository contains all variants and all code needed to reproduce the results.

Variant Data

Visualisations of the variants:

Pie diagrams of the data

Distance from a variant to its closest splice site

Results

Optimal Thresholds, AUC-PR, AUC-ROC, and optimal Accuracy

| Algorithm | Coverage | AUC-PR | AUC-ROC | Optimal Threshold | Accuracy |
|---|---|---|---|---|---|
| MES (Sliding) | 100% | 55.68% | 52.97% | 12.5 | 53.42% |
| SQUIRLS | 100% | 91.32% | 91.17% | 0.074 | 85.64% |
| MES (VEP) | 58% | 92.52% | 89.15% | 2.109 | 86.40% |
| MMSplice (Splicing Efficiency) | 99% | 93.03% | 92.56% | 1.119 | 87.23% |
| MMSplice (Pathogenicity) | 99% | 94.13% | 92.84% | 0.961 | 88.53% |
| SpliceAI | 99% | 96.21% | 95.65% | 0.3 | 90.88% |
| CI-SpliceAI | 100% | 97.25% | 96.75% | 0.19 | 92.17% |
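
To make the Coverage, Optimal Threshold and Accuracy columns concrete, here is a minimal sketch of how such values could be derived from per-variant scores. It is not the code in analysis.sh: the function names `coverage` and `best_threshold` are hypothetical, and it assumes that a higher score means "splice-affecting", that unscored variants are represented as `None`, and that accuracy is measured on scored variants only.

```python
from typing import Optional, Sequence, Tuple


def coverage(scores: Sequence[Optional[float]]) -> float:
    """Fraction of variants for which a tool produced a score."""
    return sum(s is not None for s in scores) / len(scores)


def best_threshold(scores: Sequence[Optional[float]],
                   labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, accuracy) with the highest accuracy.

    A variant is called splice-affecting when score >= threshold.
    Unscored variants (None) are skipped; this is an assumption,
    not necessarily how the published numbers were computed.
    """
    pairs = [(s, y) for s, y in zip(scores, labels) if s is not None]
    best = (float("inf"), 0.0)
    # Every distinct observed score is a candidate cut-off.
    for t in sorted({s for s, _ in pairs}):
        correct = sum((s >= t) == y for s, y in pairs)
        acc = correct / len(pairs)
        if acc > best[1]:
            best = (t, acc)
    return best


# Toy data, not taken from the repository.
toy_scores = [0.05, 0.10, None, 0.40, 0.90]
toy_labels = [False, False, True, True, True]
print(coverage(toy_scores))                    # 0.8
print(best_threshold(toy_scores, toy_labels))  # (0.4, 1.0)
```

Sweeping only the observed scores is sufficient because accuracy can change only when the threshold crosses one of them.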

PR-Curves of all algorithms; CI-SpliceAI is superior to the rest

Predictive error between CI-SpliceAI and SpliceAI

Predictive error bettered in the majority of data points
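
One way to read "predictive error bettered in the majority of data points" is as a per-variant comparison of absolute errors. The sketch below assumes labels of 1 (affecting) and 0 (non-affecting), scores in the range 0 to 1, and error defined as |label - score|; the repository's own definition may differ, and `fraction_improved` is a hypothetical helper name.

```python
from typing import Sequence


def fraction_improved(labels: Sequence[int],
                      baseline: Sequence[float],
                      candidate: Sequence[float]) -> float:
    """Share of variants where the candidate's absolute error is strictly lower."""
    improved = 0
    for y, b, c in zip(labels, baseline, candidate):
        # Ties count as "not improved".
        if abs(y - c) < abs(y - b):
            improved += 1
    return improved / len(labels)


# Toy data, not taken from the repository.
toy_labels = [1, 0, 1, 0]
toy_baseline = [0.6, 0.3, 0.9, 0.1]   # e.g. one tool's scores
toy_candidate = [0.8, 0.1, 0.7, 0.1]  # e.g. another tool's scores
print(fraction_improved(toy_labels, toy_baseline, toy_candidate))  # 0.5
```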

Exact variant effect prediction accuracy

| Algorithm | Acceptor Gain | Acceptor Loss | Donor Gain | Donor Loss |
|---|---|---|---|---|
| MES (Sliding) | 0.00% | 1.16% | 2.33% | 2.25% |
| SpliceAI | 87.50% | 77.10% | 79.07% | 78.93% |
| CI-SpliceAI | 93.75% | 78.55% | 79.07% | 82.02% |

CI-SpliceAI Mispredictions

Predictive error bettered in the majority of data points

Methods

These steps were taken:

CSV to VCF

The variant CSV file was converted to VCF format and normalised (indexed, rows normalised, left-aligned).

The resulting VCF file is checked into this repository, so you don't need to run the code that produces it.
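
For orientation, the conversion step can be sketched as follows. This is an illustration only, not the repository's script: the CSV column names `chrom`, `pos`, `ref` and `alt` are assumptions, and left-alignment is not performed here because it needs the reference genome (a tool such as `bcftools norm` is commonly used for that).

```python
import csv
import io

VCF_HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
)


def csv_to_vcf(csv_text: str) -> str:
    """Convert CSV rows with (hypothetical) columns chrom,pos,ref,alt to VCF text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # Group by chromosome and sort by numeric position so the file can be indexed.
    rows.sort(key=lambda r: (r["chrom"], int(r["pos"])))
    lines = [
        "\t".join([r["chrom"], r["pos"], ".",
                   r["ref"].upper(), r["alt"].upper(), ".", ".", "."])
        for r in rows
    ]
    return VCF_HEADER + "".join(line + "\n" for line in lines)


# Toy input, not taken from the repository.
toy_csv = "chrom,pos,ref,alt\n2,200,a,G\n1,100,C,T\n"
print(csv_to_vcf(toy_csv), end="")
```

The missing ID, QUAL, FILTER and INFO fields are written as `.`, which is the VCF convention for "no value".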

Running tools

We ran all tools on the VCF file using predict.sh.

Results are checked into predictions/.

Analysis

Variant data and predictions were analysed and plotted using analysis.sh into analysis/.

Setup

This project is built on bash scripts. We suggest running it on a UNIX system. It might be possible to run it on Windows using a bash environment such as Git Bash, but this is untested and unsupported.

Before running the setup code, make sure you agree to all licences of third-party components.

Please make sure to install these manual dependencies first:

Then run setup.sh, which will automatically:

  • Create conda environments with SpliceAI, CI-SpliceAI and MMSplice (through kipoi)
  • Download all third-party components, such as:
    • SQUIRLS command line, jannovar annotations, database
    • the human reference genome
    • GENCODE annotations for MMSplice
  • Pre-process GENCODE annotations for MMSplice

Licensing

This work is licensed under a Creative Commons Attribution 4.0 International License.

By running this code, you are installing third-party software. It is your responsibility to ensure that you comply with all third-party licences.