---
title: 'GPTCelltype: Reference-free and cost-effective automated cell type annotation
  with GPT-4 in single-cell RNA-seq analysis'
author:
- Wenpin Hou$^{1,*}$, Zhicheng Ji$^{2,*}$
- $^1$Department of Biostatistics, Columbia University Mailman School of Public Health
- $^2$Department of Biostatistics and Bioinformatics, Duke University School of Medicine
- $^*$ corresponding authors
output:
  html_document:
    df_print: paged
  pdf_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
### Introduction
Cell type annotation is an essential step in single-cell RNA-seq analysis. However, it is a time-consuming process that often requires expertise in collecting canonical marker genes and manually annotating cell types. Automated cell type annotation methods typically require the acquisition of high-quality reference datasets and the development of additional pipelines. We demonstrated that GPT-4, a highly potent large language model, can automatically and accurately annotate cell types by utilizing marker gene information generated from standard single-cell RNA-seq analysis pipelines in [this manuscript](https://www.biorxiv.org/content/10.1101/2023.04.16.537094v1).
We developed this software, **GPTCelltype**, to provide an automated cell type annotation approach using GPT-4 for single-cell RNA-seq analysis.
### Installation
GPTCelltype can be installed by following [these instructions](https://github.com/Winnie09/GPTCelltype) on GitHub.
```{r eval = FALSE}
remotes::install_github("Winnie09/GPTCelltype")
```
GPTCelltype depends on the R package [openai](https://cran.r-project.org/web/packages/openai/index.html). Please make sure it is installed as well.
```{r eval = FALSE}
install.packages("openai")
```
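Before going further, it can help to confirm that the required packages are actually available. The following is a minimal sketch using only base R; the helper name `missing_pkgs()` is hypothetical and not part of GPTCelltype.

```{r}
# Hypothetical helper: return the subset of `pkgs` that cannot be loaded.
missing_pkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  unname(pkgs[!ok])
}

# Packages used in this vignette; an empty result means everything is installed.
missing_pkgs(c("GPTCelltype", "openai", "Seurat"))
```

Any package names returned by the call still need to be installed with the commands shown above.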
### Set up OpenAI API key as an environment variable
GPTCelltype integrates the [OpenAI API](https://platform.openai.com/account/api-keys) into the software. To connect to the OpenAI API, a secret [API key](https://platform.openai.com/account/api-keys) is required.
To reduce the risk of exposing the API key in scripts or browsers, users need to set up the API key as a system environment variable before running GPTCelltype. If the API key is provided, cell type annotations are returned. If the API key is not provided, the output from GPTCelltype is the prompt itself, which users can then use to communicate with the GPT chatbot.
You can generate your API key on your OpenAI account webpage. Log in to [OpenAI](https://openai.com/). In the pop-up window, click on "->" next to "API"; next, click on the "API key" icon on the left-hand side; then click on "Create new secret key" to create your key, which directs you to the [API key page](https://platform.openai.com/account/api-keys). Copy the key and save it in a note for later use. Avoid sharing your API key with others or uploading it to public spaces. Make sure it is not visible in browsers or any client-side scripts. Finally, on the left bar, click "Settings"; in the drop-down list, click on "Billing" and make sure you have a non-zero credit balance.
![](/Users/wenpinhou/Dropbox/HouLab/gptcelltype/github/GPTCelltype/vignettes/1.png){width=32%}
![](/Users/wenpinhou/Dropbox/HouLab/gptcelltype/github/GPTCelltype/vignettes/2.png){width=32%}
![](/Users/wenpinhou/Dropbox/HouLab/gptcelltype/github/GPTCelltype/vignettes/3.png){width=32%}
![](/Users/wenpinhou/Dropbox/HouLab/gptcelltype/github/GPTCelltype/vignettes/4.png){width=32%}
![](/Users/wenpinhou/Dropbox/HouLab/gptcelltype/github/GPTCelltype/vignettes/5.png){width=32%}
Set up the API key as a system environment variable before running GPTCelltype.
```{r,eval=FALSE}
Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')
```
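Because the behavior of GPTCelltype depends on whether the key is set, it can be useful to check the environment variable before calling the package. The sketch below uses only base R; `has_openai_key()` is a hypothetical helper introduced for illustration, not a GPTCelltype function.

```{r}
# Hypothetical helper: TRUE when a non-empty OPENAI_API_KEY is visible
# to the current R session, FALSE otherwise.
has_openai_key <- function() {
  nzchar(Sys.getenv("OPENAI_API_KEY"))
}

if (has_openai_key()) {
  message("OPENAI_API_KEY is set: annotations will be requested from the API.")
} else {
  message("OPENAI_API_KEY is empty: only the prompt will be returned.")
}
```

The check never prints the key itself, so it is safe to leave in a shared script.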
### Run GPTCelltype
First of all, please load the packages.
```{r}
library(GPTCelltype)
library(openai)
```
We demonstrate how to run GPTCelltype as follows. The main function is **`gptcelltype()`**. It annotates cell types with OpenAI GPT models, either in a Seurat pipeline or from a custom gene list. If **`gptcelltype()`** is used in a Seurat pipeline, Seurat's **`FindAllMarkers()`** function needs to be run first, and the differential gene table it generates serves as the input. If the input is a custom list of genes, one cell type is identified for each element in the list.
Among the input arguments, **`input`** can be either the differential gene table returned by Seurat's **`FindAllMarkers()`** function or a list of genes.
**`tissuename`** (optional) is a tissue name.
**`model`** is a valid GPT-4 or GPT-3.5 model name listed on the [Models page](https://platform.openai.com/docs/models). The default is 'gpt-4'.
**`topgenenumber`** is the number of top differential genes to be used when the input is a Seurat differential gene table.
The output is a vector of cell types.
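To make the role of `topgenenumber` concrete, the sketch below shows one way the top genes per cluster could be selected from a marker table. It is an illustration under stated assumptions, not the package's confirmed implementation: the toy data frame mimics only the `cluster`, `gene`, and `avg_log2FC` columns of a `FindAllMarkers()` result, and `top_genes_per_cluster()` is a hypothetical helper.

```{r}
# Toy marker table with the same column names FindAllMarkers() produces.
markers <- data.frame(
  cluster = c("0", "0", "0", "1", "1", "1"),
  gene = c("CD3D", "CD4", "IL7R", "CD14", "LYZ", "S100A9"),
  avg_log2FC = c(2.1, 1.8, -0.5, 3.0, 2.5, 1.2),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep up-regulated genes, rank them within each
# cluster by fold change, and join the top `n` into one string per cluster.
top_genes_per_cluster <- function(markers, n = 10) {
  up <- markers[markers$avg_log2FC > 0, , drop = FALSE]
  up <- up[order(up$cluster, -up$avg_log2FC), , drop = FALSE]
  tapply(up$gene, up$cluster, function(g) paste(head(g, n), collapse = ","))
}

top <- top_genes_per_cluster(markers, n = 2)
top
```

Each element of `top` is a comma-separated gene string named by cluster, which is the kind of compact summary that fits naturally into a prompt.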
#### Example 1: Seurat object as input
GPTCelltype integrates seamlessly with the Seurat pipeline. It takes as input the marker gene table computed from a Seurat object. Specifically, this table is obtained by running the Seurat function `FindAllMarkers()`. An example follows.
Load the Seurat package.
```{r}
library(Seurat, quietly = TRUE)
```
In the example below, we use a Seurat object called 'pbmc_small' provided by the Seurat package. In real applications, a Seurat object obtained after running the standard Seurat pipeline should be prepared, and it should have cell clustering available. Use the `FindAllMarkers()` function to generate the differential gene table if you haven't done so:
```{r}
data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
```
Perform cell type annotation with GPT-4 using the `gptcelltype()` function. Here you can optionally provide the name of the tissue your dataset comes from.
```{r}
res <- gptcelltype(all.markers,
                   tissuename = 'human PBMC',
                   model = 'gpt-4'
)
```
Because GPT-4 can hallucinate, always check the returned annotations before proceeding to downstream analysis.
```{r}
res
```
If the results make sense, we can assign the cell type annotations back to the Seurat object and visualize the cell type annotations on the UMAP:
```{r, fig.height=4}
pbmc_small@meta.data$celltype <- as.factor(res[as.character(Idents(pbmc_small))])
DimPlot(pbmc_small, group.by = 'celltype')
```
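The assignment above indexes the annotation vector by cluster name, and the `as.character()` call matters. The toy sketch below, which needs neither Seurat nor an API key, illustrates why; the cluster IDs and labels are made up for illustration.

```{r}
# Made-up annotations named by cluster ID, deliberately not in sorted order.
toy_res <- c("1" = "Monocytes", "0" = "T cells")

# Per-cell cluster identities stored as a factor, as Idents() returns.
cell_clusters <- factor(c("1", "0", "0", "1"))

# Indexing by name gives each cell the label of its own cluster.
by_name <- unname(toy_res[as.character(cell_clusters)])

# Indexing directly by the factor uses its integer codes (positions),
# which silently swaps the labels in this example.
by_code <- unname(toy_res[cell_clusters])

by_name
by_code
```

Converting the identities to character before indexing keeps the lookup tied to cluster names rather than to the order of the vector.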
If the results need to be fine-tuned, it is easy to reassign cell type annotations for some clusters. For example, to change the cell type annotation for cluster 0:
```{r}
res[1] <- 'Classical monocytes'
pbmc_small@meta.data$celltype <- res[as.character(Idents(pbmc_small))]
```
If you prefer not to connect to the GPT-4 API or do not have an OpenAI API key, you can set `Sys.setenv(OPENAI_API_KEY = '')`. In this case, the `gptcelltype()` function returns the prompt itself, which can be copied and pasted into the GPT-4 or ChatGPT online user interface to obtain cell type annotations.
```{r}
Sys.setenv(OPENAI_API_KEY = '')
data("pbmc_small")
suppressWarnings({
  all.markers <- FindAllMarkers(object = pbmc_small)
})
res <- gptcelltype(all.markers,
                   tissuename = 'human PBMC',
                   model = 'gpt-4'
)
cat(res)
```
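When the prompt is pasted into the chat interface by hand, the reply has to be brought back into R. The sketch below shows one possible way to turn a pasted reply into a named annotation vector. It assumes the reply contains one cell type per line in cluster order, optionally prefixed by a number; both the reply text and the helper `parse_reply()` are hypothetical.

```{r}
# Hypothetical helper: split a pasted reply into one label per cluster.
# Assumes one non-empty line per cluster, in the same order as `cluster_ids`.
parse_reply <- function(reply, cluster_ids) {
  lines <- trimws(strsplit(reply, "\n", fixed = TRUE)[[1]])
  lines <- lines[nzchar(lines)]
  stopifnot(length(lines) == length(cluster_ids))
  # Drop a leading "1.", "2)" or "3:" style prefix if present.
  labels <- sub("^[0-9]+[.:)]?[[:space:]]*", "", lines)
  names(labels) <- cluster_ids
  labels
}

# Made-up reply text for two clusters.
example_reply <- "1. T cells\n2. Monocytes\n"
annotations <- parse_reply(example_reply, c("0", "1"))
annotations
```

The resulting vector is named by cluster ID, so it can be indexed by cluster name in the same way as `res` in the earlier steps.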
#### Example 2: use a list of genes as input
Set up your OpenAI API key as a system environment variable before running GPTCelltype.
```{r,eval=FALSE}
Sys.setenv(OPENAI_API_KEY = 'your_openai_API_key')
```
If we provide a list of two gene vectors: the first vector contains *CD4* and *CD3D*, and the second vector contains *CD14*, then we can call the function in this way:
```{r}
res <- gptcelltype(
  # One element per cluster; each element is a character vector of genes.
  input = list(cluster1 = c('CD4', 'CD3D'), cluster2 = 'CD14'),
  tissuename = 'human PBMC',
  model = 'gpt-4'
)
res
```
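Marker genes are often stored in a table rather than typed by hand. As a minimal sketch, the base R function `split()` can build the named list from a two-column table; the table and its column names here are made up for illustration.

```{r}
# Made-up marker table: one row per (cluster, gene) pair.
marker_table <- data.frame(
  cluster = c("cluster1", "cluster1", "cluster2"),
  gene = c("CD4", "CD3D", "CD14"),
  stringsAsFactors = FALSE
)

# split() groups the genes by cluster and names each element by its cluster.
gene_list <- split(marker_table$gene, marker_table$cluster)
gene_list
```

The resulting `gene_list` has the same shape as the list passed to `input` in the call above.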
#### Session Info
```{r}
sessionInfo()
```