Skip to content

Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data

License

Notifications You must be signed in to change notification settings

yanglq-bio/scCATCH

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

62 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

scCATCH v2.1

R >3.5 installed with devtools

Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data

Recent advance in single-cell RNA sequencing (scRNA-seq) has enabled large-scale transcriptional characterization of thousands of cells in multiple complex tissues, in which accurate cell type identification becomes the prerequisite and vital step for scRNA-seq studies. Currently, the common practice in cell type annotation is to map the highly expressed marker genes with known cell markers manually based on the identified clusters, which requires the priori knowledge and tends to be subjective on the choice of which marker genes to use. Besides, such manual annotation is usually time-consuming.

To address these problems, we introduce a single cell Cluster-based Annotation Toolkit for Cellular Heterogeneity (scCATCH) from cluster marker genes identification to cluster annotation based on evidence-based score by matching the identified potential marker genes with known cell markers in tissue-specific cell taxonomy reference database (CellMatch).

download CellMatch

CellMatch includes a panel of 353 cell types and related 686 subtypes associated with 184 tissue types, 20,792 cell-specific marker genes and 2,097 references of human and mouse.

The scCATCH mainly includes two function findmarkergenes() and scCATCH() to realize the automatic annotation for each identified cluster. Usage and Examples are detailed below.

Cite

DOI PMID:32062421

Shao et al., scCATCH:Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data, iScience, Volume 23, Issue 3, 27 March 2020. doi: 10.1016/j.isci.2020.100882. PMID:32062421

v2.0

  • Add cluster and match_CellMatch parameters to handle large scRNA-seq datasets.
  • Add cancer parameters to annotate scRNA-seq data from tissue with cancer.

v2.1

  • Update Gene symbols in CellMatch according to NCBI Gene symbols (updated in June 19, 2020, https://www.ncbi.nlm.nih.gov/gene). Unmatched marker genes FLJ42102, LOC101928100, LOC200772 and BC017158 are removed.
  • Fix the Error in intI(j, n = x@Dim[2], dn[[2]], give.dn = FALSE) : invalid character indexing in findmarkergenes() by adding a check of cluster number. Refer to issue 14
  • Fix the Error in object[object$cluster == clu.num[i], ] : wrong number of dimensions in scCATCH() by adding a check of type of input. Refer to issue 13
  • Add a progress bar for findmarkergenes() and scCATCH().
  • scCATCH for R > 4.0.0 can be downloaded in Release page.

Install

scCATCH-2.1.tar.gz R>3.6 scCATCH-2.1.tar.gz R>4.0

# download the source package of scCATCH-2.1.tar.gz and install it
# ensure the right directory for scCATCH-2.1.tar.gz
install.packages(pkgs = 'scCATCH-2.1.tar.gz',repos = NULL, type = "source")

or

# install devtools and install scCATCH
install.packages(pkgs = 'devtools')
devtools::install_github('ZJUFanLab/scCATCH')

Usage

library(scCATCH)

Cluster marker genes identification

clu_markers <- findmarkergenes(object,
                               species = NULL,
                               cluster = 'All',
                               match_CellMatch = FALSE,
                               cancer = NULL,
                               tissue = NULL,
                               cell_min_pct = 0.25,
                               logfc = 0.25,
                               pvalue = 0.05)

Identify potential marker genes for each cluster from a Seurat object (>= 3.0.0) after the default log1p normalization and cluster analysis. The potential marker genes in each cluster are identified according to its expression level compared to it in every other clusters. Only significantly highly expressed one in all pair-wise comparison of the cluster will be selected as a potential marker gene for the cluster. Genes will be revised according to NCBI Gene symbols (updated in June 19, 2020, https://www.ncbi.nlm.nih.gov/gene) and no matched genes and duplicated genes will be removed.

object Seurat object (>= 3.0.0) after the default log1p normalization and cluster analysis. Please ensure data is log1p normalized data and data has been clustered before running scCATCH pipeline.

speciesSpecies of cells. Species must be defined. 'Human' or 'Mouse'.

clusterSelect which clusters for potential marker genes identification. e.g. '1', '2', etc. Default is 'All' to find potential marker genes for each cluster.

match_CellMatchFor large datasets containg > 10,000 cells or > 15 clusters, it is strongly recommended to set match_CellMatch TRUE to match CellMatch database first to include potential marker genes in terms of large system memory it may take.

cancerIf match_CellMatch is set TRUE and the sample is from cancer tissue, then the cancer type may be defined. Select one or more related cancer types detailed in wiki page. Default is NULL.

tissueIf match_CellMatch is set TRUE, then the tissue origin of cells must be defined. Select one or more related tissue types detailed in wiki page

cell_min_pctInclude the gene detected in at least this many cells in each cluster. Default is 0.25.

logfcInclude the gene with at least this fold change of average gene expression compared to every other clusters. Default is 0.25.

pvalueInclude the significantly highly expressed gene with this cutoff of p value from wilcox test compared to every other clusters. Default is 0.05.

Output

clu_markersA list include a new data matrix wherein genes are revised by official gene symbols according to NCBI Gene symbols (updated in June 19, 2020, https://www.ncbi.nlm.nih.gov/gene) and no matched genes and duplicated genes are removed as well as a data.frame containing potential marker genes of each selected cluster and the corresponding expressed cells percentage and average fold change for each cluster.

Cluster annotation

clu_ann <- scCATCH(object,
                   species = NULL,
                   cancer = NULL,
                   tissue = NULL)

Evidence-based score and annotation for each cluster by matching the potential marker genes generated from findmarkergenes() with known cell marker genes in tissue-specific cell taxonomy reference database (CellMatch).

objectData.frame containing marker genes and the corresponding expressed cells percentage and average fold change for each cluster from the output of findmarkergenes().

speciesSpecies of cells. Select 'Human' or 'Mouse'

cancerIf the sample is from cancer tissue and you want to match cell marker genes of cancer tissues in CellMatch, then the cancer type may be defined. Select one or more related cancer types detailed in wiki page

tissueThe tissue origin of cells. Select one or more related tissue types in detailed in wiki page

Output

clu_annA data.frame containing matched cell type for each cluster, related marker genes, evidence-based score and PMID.

Examples

# Step 1: prepare a Seurat object containing log1p normalized single-cell transcriptomic data matrix and the information of cell clusters.
# Note: please define the species for revising gene symbols. Human or Mouse. The default is to find potential marker genes for all clusters with the percentage of expressed cells (≥25%), using WRS test (P<0.05) and a log1p fold change of ≥0.25. These parameters are adjustable for users.

clu_markers <- findmarkergenes(object = mouse_kidney_203_Seurat,
                               species = 'Mouse'
                               cluster = 'All',
                               match_CellMatch = FALSE,
                               cancer = NULL,
                               tissue = NULL,
                               cell_min_pct = 0.25,
                               logfc = 0.25,
                               pvalue = 0.05)
                               
# Note: for large datasets, please set match_CellMatch as TRUE and provided tissue types. For tissue with cancer, users may provided the cancer types and corresponding tissue types. See Details. 
# Step 2: evidence-based scoring and annotaion for identified potential marker genes of each cluster generated from findmarkergenes function.

clu_ann <- scCATCH(object = clu_markers$clu_markers,
                   species = 'Mouse',
                   cancer = NULL,
                   tissue = 'Kidney')

# Users can also use scCATCH by selecting multiple cluster, cancer types, tissue types as follows:
clu_markers <- findmarkergenes(object = mouse_kidney_203_Seurat,
                               species = 'Mouse'
                               cluster = '1',
                               match_CellMatch = TRUE,
                               cancer = NULL,
                               tissue = 'Kidney',
                               cell_min_pct = 0.1,
                               logfc = 0.1,
                               pvalue = 0.01)
                               
clu_markers <- findmarkergenes(object = mouse_kidney_203_Seurat,
                               species = 'Mouse'
                               cluster = c('1','2'),
                               match_CellMatch = TRUE,
                               cancer = NULL,
                               tissue = c('Kidney','Mesonephros'))
Note: please select the right cancer type and the corresponding tissue type (See wiki page.

Issues

bug error

Solutions for possilble bugs and errors. Please refer to closed Issues1 and Issues2

About

scCATCH was developed by Xin Shao. Should you have any questions, please contact Xin Shao at [email protected]

About

Automatic Annotation on Cell Types of Clusters from Single-Cell RNA Sequencing Data

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • R 99.8%
  • Rebol 0.2%