
fangliu117/GPTCelltype

GPTCelltype: Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis

Overview

Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require high-quality reference datasets and the development of additional pipelines. In the accompanying manuscript (see Citation), we demonstrated that GPT-4, a large language model, can automatically and accurately annotate cell types using marker gene information generated by standard single-cell RNA-seq analysis pipelines. We developed this software, GPTCelltype, to provide reference-free, cost-effective automated cell type annotation using GPT-4 for single-cell RNA-seq analysis.

Installation

GPTCelltype can be installed from GitHub in seconds. Users need R version 3.5.x or later. R can be downloaded here: http://www.r-project.org/.

Windows users also need to install Rtools, which can be downloaded here: https://cloud.r-project.org/bin/windows/Rtools/. For R version 3.5.x, Rtools35.exe is recommended. Use the default settings during installation.

Mac users who run into installation problems can try downloading and installing clang-8.0.0.pkg from the following URL: https://cloud.r-project.org/bin/macosx/tools/clang-8.0.0.pkg

To install the latest version of the GPTCelltype package from GitHub, run the following command in R (this requires the remotes package):

remotes::install_github("Winnie09/GPTCelltype")

GPTCelltype depends on the R package openai. Please install openai as well.

install.packages("openai")
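Before installing, it can help to see which of the required packages are already present. The helper below is a minimal sketch in base R; `missing_packages` is a hypothetical name introduced here for illustration and is not part of GPTCelltype.

```r
# Hypothetical helper (not part of GPTCelltype): return the packages from
# a character vector that are not installed in the current R library.
missing_packages <- function(pkgs) {
  # system.file() returns "" when a package cannot be found,
  # and does not load the package as a side effect.
  installed <- vapply(
    pkgs,
    function(p) nzchar(system.file(package = p)),
    logical(1)
  )
  pkgs[!installed]
}

# remotes provides install_github(); openai is the dependency named above.
to_install <- missing_packages(c("remotes", "openai"))
print(to_install)
```

Anything listed in `to_install` can then be passed to `install.packages()`.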

OpenAI Key

GPTCelltype integrates the OpenAI API into the software. Connecting to the OpenAI API requires a secret API key. You can generate your API key from your OpenAI account page: log in to OpenAI, click "Personal" in the upper right corner, click "View API keys" in the drop-down list, and then click "Create new secret key", which takes you to the API key page. Copy the key and store it somewhere safe for later use. Users need to pass their secret API key to GPTCelltype functions as one of the inputs.
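One way to avoid pasting the key into scripts is to keep it in an environment variable and read it at run time. The following is a minimal sketch, assuming the variable is named OPENAI_API_KEY (for example, set through a line in your ~/.Renviron file); the helper name `read_openai_key` is hypothetical and not part of GPTCelltype.

```r
# Hypothetical helper (not part of GPTCelltype): read the secret key from an
# environment variable instead of hard-coding it in a script.
# Assumption: the variable is called OPENAI_API_KEY; any name works as long
# as the same name is used when setting and reading it.
read_openai_key <- function(var = "OPENAI_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  # Return NA when the variable is unset or empty.
  if (nzchar(key)) key else NA
}

openaikey <- read_openai_key()
```

The resulting `openaikey` can be passed as the `openai_key` argument. Returning NA when no key is found matches the prompt-only mode described later in this README.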

Run GPTCelltype

First of all, please load the packages.

library(GPTCelltype)
library(openai)

The following demo shows how to run GPTCelltype and can be completed within a minute. The main function is gptcelltype(), which annotates cell types using OpenAI GPT models, either within a Seurat pipeline or from a custom gene list. In a Seurat pipeline, run Seurat's FindAllMarkers() function first; the differential gene table it returns serves as the input. If the input is a custom list of genes, one cell type is identified for each element of the list.

The input arguments are as follows. input is either the differential gene table returned by Seurat's FindAllMarkers() function or a list of genes. tissuename (optional) is the name of the tissue. openai_key is your OpenAI key obtained from the API key page (see the section above). model is a valid GPT-4 or GPT-3.5 model name listed on the Models page; the default is 'gpt-4'. topgenenumber is the number of top differential genes to use when the input is a Seurat differential gene table. The output is a vector of cell types.
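To make the role of topgenenumber concrete, the sketch below picks the top genes per cluster from a small, made-up marker table. It is an illustration only, not the package's internal code: it assumes columns named cluster, gene and avg_log2FC, keeps up-regulated genes, and ranks by fold change, which may differ from how gptcelltype() selects genes. The helper name `top_genes_per_cluster` is hypothetical.

```r
# Hypothetical helper (not part of GPTCelltype): keep the top n up-regulated
# genes per cluster from a FindAllMarkers-style table.
# Assumption: the table has columns cluster, gene and avg_log2FC.
top_genes_per_cluster <- function(markers, topgenenumber = 10) {
  pos <- markers[markers$avg_log2FC > 0, , drop = FALSE]
  # Sort by cluster, then by decreasing fold change within each cluster.
  pos <- pos[order(pos$cluster, -pos$avg_log2FC), , drop = FALSE]
  lapply(split(pos$gene, pos$cluster), head, n = topgenenumber)
}

# Toy marker table, made up for illustration.
markers <- data.frame(
  cluster = c("0", "0", "0", "1", "1"),
  gene = c("CD3D", "CD4", "MS4A1", "CD14", "LYZ"),
  avg_log2FC = c(2.0, 1.5, -0.5, 3.0, 1.0),
  stringsAsFactors = FALSE
)

top <- top_genes_per_cluster(markers, topgenenumber = 2)
print(top)
```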

Example 1: Seurat object as input

GPTCelltype integrates seamlessly with the Seurat pipeline. It uses the marker gene information of a Seurat object, which is obtained by running the Seurat function FindAllMarkers(). An example follows.

In the example below, we use a Seurat object called 'pbmc_small' provided by the Seurat package. In real applications, prepare a Seurat object by running the standard Seurat pipeline. The Seurat object should have cell clustering available. Use the FindAllMarkers() function to generate the differential gene table if you haven't done so:

library(Seurat)  # provides pbmc_small and FindAllMarkers()
data("pbmc_small")
all.markers <- FindAllMarkers(object = pbmc_small)

Perform cell type annotation by GPT-4 using the gptcelltype() function. Here you can optionally provide the actual name of the tissue for your dataset.

res <- gptcelltype(all.markers, 
            tissuename = 'human PBMC', 
            openai_key = openaikey, 
            ## Note: Please provide your OpenAI key to get cell type annotations;
            ## or otherwise the output is the prompt itself.
            model = 'gpt-4'
)

Check the results returned by GPT-4 for possible AI hallucination before proceeding to downstream analysis.
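A quick structural check can complement manual review. The sketch below assumes that the returned vector is named by cluster, as the indexing in this example implies; `check_annotations` is a hypothetical helper, not part of GPTCelltype. It flags clusters with no annotation or an empty one, and it does not judge whether a label is biologically correct.

```r
# Hypothetical helper (not part of GPTCelltype): flag clusters whose
# annotation is missing or empty.
# Assumption: res is a character vector named by cluster.
check_annotations <- function(res, clusters) {
  clusters <- unique(as.character(clusters))
  missing <- setdiff(clusters, names(res))
  labelled <- intersect(clusters, names(res))
  values <- res[labelled]
  empty <- labelled[is.na(values) | !nzchar(trimws(values))]
  list(missing = missing, empty = empty)
}

# Toy annotations, made up for illustration.
toy_res <- c("0" = "T cells", "1" = "", "2" = NA)
report <- check_annotations(toy_res, clusters = c("0", "1", "2", "3"))
print(report)
```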

If the results make sense, we can assign the cell type annotations back to the Seurat object and visualize the cell type annotations on the UMAP:

pbmc_small$celltype <- res[as.character(Idents(pbmc_small))]
DimPlot(pbmc_small,group.by='celltype')
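The as.character() call in the assignment above matters because indexing with a factor uses its integer codes rather than its labels. The base-R sketch below uses made-up annotations and cluster identities (no Seurat object needed) to show the difference.

```r
# Toy annotations named by cluster, deliberately not in cluster order.
toy_res <- c("1" = "Monocytes", "0" = "T cells", "2" = "B cells")

# Toy per-cell cluster identities, stored as a factor as Idents() returns.
idents <- factor(c("2", "0", "1", "0"), levels = c("0", "1", "2"))

# Lookup by name: each cell gets the label of its own cluster.
by_name <- unname(toy_res[as.character(idents)])

# Lookup by factor: R uses the integer codes (3, 1, 2, 1) as positions,
# which mislabels cells whenever names and positions disagree.
by_code <- unname(toy_res[idents])

print(by_name)
print(by_code)
```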

If the results need to be fine-tuned, it is easy to reassign cell type annotations for some clusters. For example, to change the cell type annotation for cluster 0:

res[1] <- 'Classical monocytes'
pbmc_small$celltype <- res[as.character(Idents(pbmc_small))]
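Reassigning by cluster name rather than by position avoids depending on the order of the vector. The helper below is a minimal sketch, assuming the annotation vector is named by cluster; `relabel_clusters` is a hypothetical name and not part of GPTCelltype.

```r
# Hypothetical helper (not part of GPTCelltype): replace annotations for
# selected clusters, addressed by cluster name.
# Assumption: res is a character vector named by cluster.
relabel_clusters <- function(res, new_labels) {
  unknown <- setdiff(names(new_labels), names(res))
  if (length(unknown) > 0) {
    stop("Unknown cluster(s): ", paste(unknown, collapse = ", "))
  }
  res[names(new_labels)] <- new_labels
  res
}

# Toy annotations, made up for illustration.
toy_res <- c("0" = "Monocytes", "1" = "T cells")
toy_res <- relabel_clusters(toy_res, c("0" = "Classical monocytes"))
print(toy_res)
```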

If you prefer not to connect to the GPT-4 API or do not have an OpenAI key, you can set openai_key = NA. In this case, the gptcelltype() function returns the prompt itself, which can be copied and pasted into the GPT-4 or ChatGPT online user interface to obtain cell type annotations.

data("pbmc_small")
all.markers <- FindAllMarkers(object = pbmc_small)
res <- gptcelltype(all.markers, 
            tissuename = 'human PBMC', 
            openai_key = NA, 
            ## Note: When NA, the output is the prompt itself.
            model = 'gpt-4'
)
cat(res)
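After pasting the prompt into an online interface, the reply has to be brought back into R by hand. The sketch below is one possible way to do that, assuming the reply contains one cell type per line in the same order as the clusters; `parse_pasted_reply` is a hypothetical helper, and the reply text shown is made up.

```r
# Hypothetical helper (not part of GPTCelltype): turn a pasted reply with one
# cell type per line into a vector named by cluster.
# Assumption: lines are in the same order as cluster_names.
parse_pasted_reply <- function(reply, cluster_names) {
  lines <- trimws(strsplit(reply, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  # Drop leading numbering such as "1." or "2)" if the model added it.
  lines <- sub("^[0-9]+[.:)][[:space:]]*", "", lines)
  if (length(lines) != length(cluster_names)) {
    stop("Expected ", length(cluster_names), " lines but found ", length(lines))
  }
  stats::setNames(lines, cluster_names)
}

# Made-up reply text for illustration.
reply <- "1. T cells\n2. Monocytes\n"
parsed <- parse_pasted_reply(reply, cluster_names = c("0", "1"))
print(parsed)
```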

Example 2: a list of genes as input

If we provide a list of two gene vectors: the first vector contains CD4 and CD3D, and the second vector contains CD14, then we can call the function in this way:

res <- gptcelltype(
  input = list(cluster1 = c('CD4', 'CD3D'), cluster2 = 'CD14'),
  tissuename = 'human PBMC',
  openai_key = NA, ## Note: Please provide your OpenAI key to get cell type annotations; or otherwise the output is the prompt itself.
  model = 'gpt-4'
)
cat(res)

If you provide your openai_key, the output is "T helper cells" "Monocytes". In this illustration, the default setting openai_key = NA is used, so the function returns the prompt itself, which can be pasted into an online GPT interface to obtain cell type annotations.
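To make the list input concrete, the sketch below shows one way such a list could be collapsed into one line of markers per cluster. It illustrates the idea only: the line format and the helper name `format_marker_lines` are assumptions, and the actual prompt that gptcelltype() builds may differ.

```r
# Hypothetical helper (not part of GPTCelltype): collapse each gene vector
# in a named list into a single "name: gene, gene" line.
format_marker_lines <- function(input) {
  genes <- vapply(input, paste, character(1), collapse = ", ")
  paste0(names(genes), ": ", genes)
}

marker_lines <- format_marker_lines(
  list(cluster1 = c("CD4", "CD3D"), cluster2 = "CD14")
)
cat(marker_lines, sep = "\n")
```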

Vignette

You can view the complete vignette here.

Citation

Hou, W. and Ji, Z., 2023. Reference-free and cost-effective automated cell type annotation with GPT-4 in single-cell RNA-seq analysis. bioRxiv, pp.2023-04.

Contact

Authors: Wenpin Hou ([email protected]), Zhicheng Ji ([email protected]).

To report bugs or make suggestions, email the maintainer, Dr. Wenpin Hou ([email protected]), or open a new issue on this GitHub page.
