This repository contains code to perform the analyses in the paper "The SpliZ generalizes “Percent Spliced In” to reveal regulated splicing at single-cell resolution" (Olivieri, Dehghannasiri, and Salzman 2021).
This pipeline takes the output from SICILIAN and returns the SpliZ for each gene and cell, as well as analyses of differential alternative splicing.
Clone this repository:
$ git clone https://github.com/juliaolivieri/SpliZ_pipeline.git
$ cd SpliZ_pipeline/
Ensure that conda is working on your system. Then set up the conda environment from the environment.yml file:
$ conda env create --name spliz_env --file=environment.yml
and activate it:
$ source activate spliz_env
If this activation step doesn't work, try running conda env list
and looking for the path that ends with spliz_env
. Then run source activate <full path>
.
This whole process should take less than 5 minutes on a normal computer.
Use the following command to run the pipeline on the small test dataset (labeled test
in the data
folder):
snakemake -p --config datasets="test" --restart-times 0
This should take less than 5 minutes to run on a local computer with at least 3 Gb free space.
After the pipeline has completed, you can check your results by comparing the file scripts/output/final_summary/summary_test_compartment-tissue_100_S_0.1_z_0.0_b_5.tsv
with the sample output file test_pvals_compartment-tissue_100_S_0.1_z_0.0_b_5.tsv
in the main directory.
You will need to place the following files in the "data" directory, accessible on figshare:
HLCA4_P2_10x_with_postprocessing_lung.pq
HLCA4_P3_10x_with_postprocessing_lung.pq
Names of datasets to run on are specified in the config.yaml
file. To run, use snakemake -p
. To run on different datasets, either change the values in the config.yaml
file, or override them at the command line: snakemake -p --config datasets="my_data_name"
. You can run snakemake -np
first to see what jobs will be run. Each job automatically re-submits itself two times if it fails, so if you want to run without these resubmissions you can run snakemake -p --restart-times 0
.
The terminal window you submit from will not be available again until after the full pipeline runs. You can use tmux to subset your termianl pane so that snakemake is only running in one box (this also allows you to detatch the session so it continues running even when terminal isn't open). For the tmux approach you will have to always log in to the same node so you can reconnect to the same session.
The pipeline should take around one hour to run on the full dataset.
To set up snakemake to run on slurm, you can follow the directions here: https://github.com/Snakemake-Profiles/slurm. All of the time and memory requirements for the SpliZ pipeline are specified in the script itself, so you don't need to change these variables if you're only running this pipeline.
This pipeline works with the "class input file" output of the SICILIAN pipeline. To run the pipeline without running SICILIAN first, your data must be a tsv or parquet file with one row per cell+junction with the following columns:
cell
: Cell identifierchrR1A
: The chromosome the gene is ongeneR1A_uniq
: The gene namestrand
: The strand of the gene (+ or -)juncPosR1A
: The 5' end of the splice junctionjuncPosR1B
: The 3' end of the splice junctionnumReads
: The number of reads mapping to this junction in the given cellcalled
: Column that is 1 for all junctions that should be included in the analysis, 0 otherwisefree_annotation
: The most specific annotation of the cells (e.g. cell type)compartment
: The next most specific annotation of the cells (every value in the column can be the same)tissue
: The most general annotation of the cells (every value in the column can be the same)
An example input file is given in data/test.tsv
.
The SpliZ values for the dataset can be found in this output file:
scripts/output/rijk_zscore/<dataname>_sym_SVD_normdonor_S_0.1_z_0.0_b_5_subcol.tsv
(for test,scripts/output/rijk_zscore/test_sym_SVD_normdonor_S_0.1_z_0.0_b_5_subcol.tsv
)
This output file has one line per cell per gene with enough spliced reads to calculate a SpliZ value. The column values are:
cell
: The cell identifiergeneR1A_uniq
: The genetissue
,compartment
,free_annotation
,ontology
: Metadata columns associating each cel with a cell typescZ
: SpliZ valuen.g_A
: Number of reads mapping to the 5' splice sites used for the SpliZ calculation in this gene and this celln.g_B
: Number of reads mapping to the 3' splice sites used for the SpliZ calculation in this gene and this cellsvd_z0
: The SpliZVD score based on the first eigenvectorsvd_z1
: The SpliZVD score based on the second eigenvectorsvd_z2
: The SpliZVD score based on the third eigenvector
Results of the differential SpliZ analysis can be found in this file:
scripts/output/final_summary/summary_<dataname>_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
(for test,scripts/output/final_summary/summary_test_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
)
There is one row per gene per group and cell type. The columns of the file are defined as follows:
gene
: the gene namesub_col
: The metadata column the data was subset on for analysis (in this example,tiss_comp
)group_col
: The metadata column differential analysis was performed on withsub_col
(in this example,ontology
)SpliZsites
: The SpliZsites isolated for this gene by the first three eigenvectors (separated by commas)<z_col>_median
: The median of the SpliZ(VD) values for the givengene
,sub_col
, andgroup_col
.z_col
can bescZ
,svd_z0
,svd_z1
, orsvd_z2
.<z_col>_pval
: The BH adjusted p value of the SpliZ(VD) values for the givengene
andsub_col
.z_col
can bescZ
,svd_z0
,svd_z1
, orsvd_z2
.
A separate file is created based on each of the first three eigenvectors:
scripts/output/SpliZsites/<dataname>_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
(for test,scripts/output/SpliZsites/test_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
)scripts/output/SpliZsites/second_evec_<dataname>_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
(for test,scripts/output/SpliZsites/second_evec_test_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
)scripts/output/SpliZsites/third_evec_<dataname>_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
(for test,scripts/output/SpliZsites/third_evec_test_ontology-tiss_comp_100_S_0.1_z_0.0_b_5.tsv
)
Each of these files contains the following columns:
gene
: gene namelet
: Either A or B depending on whether the splice site is 5' or 3'end
: the SpliZsite coordinate that is one of the most variable for this splice site
These are also found in the environment.yml
file.
- python=3.6.7
- pandas=1.0.4
- tqdm=4.46.0
- numpy=1.18.4
- snakemake-minimal=5.4.5=py_0
- pyarrow=0.15.1
- r-base=4.0.2
- r-data.table=1.14.0
- r-rfast=2.0.3
- scipy=1.4.1
- statsmodels=0.11.1
Olivieri, Dehghannasiri, and Salzman. "The SpliZ generalizes “Percent Spliced In” to reveal regulated splicing at single-cell resolution." bioRxiv. (2020) https://www.biorxiv.org/content/10.1101/2020.11.10.377572v2.
Dehghannasiri, Olivieri, and Salzman. "Specific splice junction detection in single cells with SICILIAN," bioRxiv. (2020) https://www.biorxiv.org/content/10.1101/2020.04.14.041905v1.