scJoint is a transfer learning method to integrate atlas-scale, heterogeneous collections of scRNA-seq and scATAC-seq data. scJoint leverages information from annotated scRNA-seq data in a semi-supervised framework and uses a neural network to simultaneously train on labeled and unlabeled data, enabling label transfer and joint visualization in an integrative framework. For more information, please see the scJoint manuscript: https://doi.org/10.1101/2020.12.31.424916.
scJoint is developed using PyTorch 1.0.0 and has been tested under both PyTorch 1.0.0 and 1.4.0. scJoint requires 1 GPU to run.
- A step-by-step tutorial using CITE-seq and ASAP-seq PBMC data from the control condition generated by Mimitou et al. 2020 (GSE156478) is available here: link
- Tutorial for 10x Genomics data:
- Tutorial for mouse primary motor cortex data that integrates transcriptomics, chromatin accessibility and methylation: link. The data can be downloaded from link
scJoint can be obtained by cloning the GitHub repository:
git clone https://github.com/SydneyBioX/scJoint.git
The following Python packages must be installed before running scJoint: h5py, torch, itertools, scipy, numpy, os, random, sys, time, and datetime.
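As a quick sanity check before running scJoint, the requirement list above can be verified from Python. The sketch below is not part of the scJoint repository; the function name missing_packages is hypothetical. It only relies on the standard importlib machinery, and it separates the third-party packages from the names in the list that ship with Python itself.

```python
import importlib.util

# Third-party packages named in the requirements list above.
THIRD_PARTY = ["h5py", "torch", "scipy", "numpy"]
# The remaining names in the list are part of the Python standard library.
STANDARD_LIBRARY = ["itertools", "os", "random", "sys", "time", "datetime"]


def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(THIRD_PARTY + STANDARD_LIBRARY)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All required packages were found.")
        # torch is only imported once we know it is installed.
        import torch
        print("CUDA available:", torch.cuda.is_available())
```

Only the third-party packages should ever need installing; if the script reports that CUDA is unavailable, the GPU requirement mentioned above is not met on that machine.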
scJoint's main function takes expression data in .npz format and cell type labels in .txt format. To prepare the input for scJoint, modify the dataset paths in process_db.py, which will:
- take .h5 files of the expression matrices stored in matrix/data as input and generate .npz files for each expression matrix;
- transform .csv files of cell type labels into numeric labels stored in .txt files, and output a label_to_idx.txt file that indicates the correspondence between the numeric labels and the cell type labels.
Note:
- The expression matrix for scRNA-seq data is the gene expression matrix (either normalised or raw), and for scATAC-seq data it is the gene activity matrix.
- Cell type labels are required for scRNA-seq data, while labels for scATAC-seq data are optional and are only used to calculate accuracy.
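To illustrate the label conversion step described above, here is a minimal sketch of turning cell type names into numeric labels. It is not the code in process_db.py: the function name encode_labels is hypothetical, and it assumes that each .csv file has a header row with the cell type in the last column and that label_to_idx.txt holds one "name index" pair per line. Check the actual files produced by process_db.py before relying on these formats.

```python
import csv
import os


def encode_labels(label_csv_paths, output_dir):
    """Map cell type names in CSV files to integers and write .txt files.

    Assumes each CSV has a header row and the cell type in the last column.
    """
    per_file = []
    for path in label_csv_paths:
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle))
        # Skip the header row and any empty rows.
        per_file.append([row[-1] for row in rows[1:] if row])

    # Index the sorted union of names so every file shares one mapping.
    cell_types = sorted({label for labels in per_file for label in labels})
    label_to_idx = {name: idx for idx, name in enumerate(cell_types)}

    # One numeric label per line, in the same cell order as the CSV.
    for path, labels in zip(label_csv_paths, per_file):
        stem = os.path.splitext(os.path.basename(path))[0]
        with open(os.path.join(output_dir, stem + ".txt"), "w") as handle:
            for label in labels:
                handle.write(f"{label_to_idx[label]}\n")

    # Assumed layout of the mapping file: "<cell type> <index>" per line.
    with open(os.path.join(output_dir, "label_to_idx.txt"), "w") as handle:
        for name, idx in label_to_idx.items():
            handle.write(f"{name} {idx}\n")
    return label_to_idx
```

Building the mapping over all files at once keeps the numeric labels consistent when scRNA-seq and scATAC-seq labels are both supplied.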
Edit config.py according to the data input (see the Arguments section for more details).
In the terminal, run
python main.py
The output will be saved in the ./output folder.
The script config.py specifies the arguments for scJoint, which need to be modified according to the data.
- DB: Name of the study
- number_of_class: Number of cell types in the training data (scRNA-seq data)
- input_size: Number of genes in both training and test data
- rna_paths: A list of file paths of the .npz files of scRNA-seq gene expression datasets
- rna_labels: A list of file paths of the .txt files of scRNA-seq cell type information
- atac_paths: A list of file paths of the .npz files of scATAC-seq gene activity datasets
- atac_labels: A list of file paths of the .txt files of scATAC-seq cell type information (optional; if atac_labels are provided, the accuracy after KNN is reported)
- rna_protein_paths: A list of paths of the .npz files of protein expression data for CITE-seq data (optional)
- atac_protein_paths: A list of paths of the .npz files of protein expression data for ASAP-seq data (optional)
- use_cuda: Whether a GPU is used
- threads: Number of threads used (1 by default)
- batch_size: Batch size (256 by default)
- lr_stage1: Learning rate for stage 1
- lr_stage3: Learning rate for stage 3
- lr_decay_epoch: Number of epochs for learning rate decay
- epochs_stage1: Number of epochs for stage 1
- epochs_stage3: Number of epochs for stage 3
- p: The fraction of data pairs expected to have high cosine similarity scores (0.8 by default)
- embedding_size: Number of nodes in the embedding (hidden) layer (64 by default)
- momentum: Momentum for SGD (0.9 by default)
- center_weight: The weight for the center loss (1 by default)
- with_crossentorpy: True runs the well-differentiated cell type mode; False runs the trajectory mode of scJoint
- seed: Random seed to be used (none by default)
The configuration we used in our paper can be found in link.
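To make the argument list above concrete, the sketch below collects the documented settings in one Python object and adds a small consistency check. It is an illustration, not a copy of the shipped config.py: the names ExampleConfig and check_config are hypothetical, all paths and counts are placeholders, and the learning rates and epoch numbers are arbitrary values because no defaults are stated for them. The check assumes one label file per scRNA-seq dataset.

```python
class ExampleConfig:
    """Illustrative settings object; not a copy of the shipped config.py."""

    def __init__(self):
        self.DB = "my_study"  # hypothetical study name
        # Data description (all counts and paths below are placeholders).
        self.number_of_class = 7
        self.input_size = 15000
        self.rna_paths = ["data/rna.npz"]
        self.rna_labels = ["data/rna_cellTypes.txt"]
        self.atac_paths = ["data/atac.npz"]
        self.atac_labels = []  # optional, only used for accuracy
        self.rna_protein_paths = []  # optional (CITE-seq)
        self.atac_protein_paths = []  # optional (ASAP-seq)
        # Training settings with the defaults stated in the list above.
        self.use_cuda = True
        self.threads = 1
        self.batch_size = 256
        self.p = 0.8
        self.embedding_size = 64
        self.momentum = 0.9
        self.center_weight = 1
        self.with_crossentorpy = True  # identifier spelled as in the list
        self.seed = None
        # No defaults are documented for these; values are placeholders.
        self.lr_stage1 = 0.01
        self.lr_stage3 = 0.01
        self.lr_decay_epoch = 20
        self.epochs_stage1 = 20
        self.epochs_stage3 = 20


def check_config(config):
    """Return a list of human-readable problems found in `config`."""
    problems = []
    # Assumption: each scRNA-seq dataset comes with exactly one label file.
    if len(config.rna_paths) != len(config.rna_labels):
        problems.append("rna_paths and rna_labels must have the same length")
    # scATAC-seq labels are optional, but if given they should pair up.
    if config.atac_labels and len(config.atac_labels) != len(config.atac_paths):
        problems.append("atac_labels must match atac_paths when provided")
    if not 0 < config.p <= 1:
        problems.append("p must be a fraction in (0, 1]")
    for name in ("number_of_class", "input_size", "batch_size",
                 "embedding_size", "threads"):
        if getattr(config, name) <= 0:
            problems.append(f"{name} must be positive")
    return problems
```

Calling check_config(ExampleConfig()) returns an empty list; a mismatch between the data and label lists shows up as a message before any training starts.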
scJoint will output 4 types of .txt files:
- _embeddings.txt: Output of the embedding layer for each dataset
- _knn_predictions.txt: KNN predictions for each scATAC-seq dataset (final predictions), where the numeric values correspond to the labels in the label_to_idx.txt file
- _knn_probs.txt: Probabilities of the KNN predictions for each scATAC-seq dataset
- _predictions.txt: Output of the prediction layer for each dataset
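Because the KNN predictions are numeric, they need to be mapped back to cell type names through label_to_idx.txt. The following sketch shows one way to do this in Python. The function names are hypothetical, and the file layouts are assumptions: one prediction per line in the _knn_predictions.txt file, and one "name index" pair per line in label_to_idx.txt, with the index as the last whitespace-separated token.

```python
def load_label_map(path):
    """Read label_to_idx.txt into a {index: cell type name} dictionary.

    Assumes one 'name index' pair per line, with the index as last token.
    """
    idx_to_label = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            # Split from the right so names containing spaces stay intact.
            name, idx = line.rsplit(maxsplit=1)
            idx_to_label[int(idx)] = name
    return idx_to_label


def decode_predictions(prediction_path, idx_to_label):
    """Translate one numeric prediction per line into cell type names."""
    with open(prediction_path) as handle:
        # float() first, so values written as "1" or "1.0" both parse.
        return [idx_to_label[int(float(line))]
                for line in handle if line.strip()]
```

A typical call would be decode_predictions(predictions_file, load_label_map(label_map_file)), where both arguments are paths to files produced by the steps above.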
To generate tSNE and UMAP plots for the output data using R, run the following command in the terminal:
Rscript embedding_visualisation_R.R --output_dir output/ --input_dir data/ --TSNE TRUE --UMAP TRUE --proportion 1
where
- output_dir: Directory of the output folder
- input_dir: Directory of the input folder (where the label_to_idx.txt file is saved)
- TSNE: TRUE/FALSE, indicates whether to run tSNE
- UMAP: TRUE/FALSE, indicates whether to run UMAP
- proportion: Proportion of cells used in visualisation
Note:
- The script assumes the output folder only has results from one study.
- Please install the following packages before running the embedding_visualisation_R.R script by running the following code in R:
install.packages(c("ggplot2", "ggthemes", "scattermore", "ggpubr", "Rtsne", "uwot", "pals", "grDevices", "optparse"))
Output of embedding_visualisation_R.R:
- tSNE and/or UMAP embeddings will be generated in the output_dir folder: tsne_embedding.txt, umap_embedding.txt
- Visualisation of tSNE and UMAP: TSNE_plot.pdf, UMAP_plot.pdf
scJoint is also available via superbio: https://app.superbio.ai/apps/114/.
Lin, Y., Wu, T.Y., Wan, S., Yang, J.Y., Wong, W.H. and Wang, Y.X., 2022. scJoint integrates atlas-scale single-cell RNA-seq and ATAC-seq data with transfer learning. Nature Biotechnology, 40(5), pp.703-710.