-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
254 lines (162 loc) · 10.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
# dsb <a href='https://mattpm.github.io/dsb'><img src='man/figures/logo.png' align="right" height="150" /></a>
# dsb <img src="man/figures/logo.png" align="right" />
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# dsb
## An R package for normalizing and denoising CITEseq data
```{r, echo=FALSE, out.width = '100%'}
knitr::include_graphics("man/figures/logo.png")
```
<!-- badges: start -->
[![Travis build status](https://travis-ci.org/MattPM/dsb.svg?branch=master)](https://travis-ci.org/MattPM/dsb)
<!-- badges: end -->
This package was developed at [John Tsang's Lab](https://www.niaid.nih.gov/research/john-tsang-phd) by Matt Mulè, Andrew Martins and John Tsang. The package implements our normalization and denoising method for CITEseq data. Technical discussion of how the method works can be found in [the biorxiv preprint](https://biorxiv.org) We utilized the dsb package to normalize CITEseq data reported in [this paper](https://doi.org/10.1038/s41591-020-0769-8)
In [the biorxiv preprint](https://biorxiv.org), by comparing unstained control cells and empty droplets we found the major contribution to background noise in CITEseq data is unbound antibody captured and sequenced in droplets. DSB corrects for this background by leveraging empty droplets which serve as a "built in" noise measurement in any droplet capture single cell platform (e.g. 10X, dropseq, indrop).
We also found that the counts of isotype controls and those of ‘negative’ markers for each cell are significantly correlated, suggesting that their covariation likely reflects cell-to-cell differences due to technical factors such as non-specific antibody binding and droplet-to-droplet differences in capture efficiency of the DNA tags.
We thus developed the DSB package specifically for normalizing CITE-seq protein expression data. DSB stands for Denoised and Scaled by Background (DSB) and it corrects for 1) protein-specific background noise as reflected by empty droplets, 2) the technical cell-to-cell variation as reflected by the latent noise component described above. DSB normalization improves separation between positive and negative populations for each protein, centers the negative-staining population at zero, and can improve interpretation of protein expression-based clustering.
## installation
You can install the released version of dsb in your R session with the command below
```{r}
# this is analagous to install.packages("package), you need the package devtools to install a package from a github repository like this one.
require(devtools)
devtools::install_github(repo = 'MattPM/dsb')
```
## quickstart
```{r example}
# load package and normalize the example raw data
library(dsb)
# normalize
normalized_matrix = DSBNormalizeProtein(cell_protein_matrix = cells_citeseq_mtx,
empty_drop_matrix = empty_drop_citeseq_mtx)
```
## The full version (recommended)
**This is a quick summary Please see the detailed workflow vignette**
By default dsb defines the per cell denoising covariate by fitting a gaussian mixture model to the log + 10 counts of each cell and defining the noise ocvariates the mean. We reccomend including the counts from isotype controls in each cell in the denoising covariates.
```{r}
# get the empty cells from demultiplexing with
# define a vector of the isotype controls in the data
isotypes = c("Mouse IgG2bkIsotype_PROT", "MouseIgG1kappaisotype_PROT","MouseIgG2akappaisotype_PROT", "RatIgG2bkIsotype_PROT")
normalized_matrix = DSBNormalizeProtein(cell_protein_matrix = cells_citeseq_mtx,
empty_drop_matrix = empty_drop_citeseq_mtx,
use.isotype.control = TRUE,
isotype.control.name.vec = isotypes)
```
## visualize protein distributions after normalization
plot the DSB normalized CITEseq data.
**Note, there is NO jitter added to these points for visualization these are the unmodified normalized counts**
```{r, fig.height=4.5, fig.width=8.2}
# add a density gradient on the points () this is helpful when there are many thousands of cells )
# this density function is from this blog post: https://slowkow.com/notes/ggplot2-color-by-density/
get_density = function(x, y, ...) {
dens <- MASS::kde2d(x, y, ...)
ix <- findInterval(x, dens$x)
iy <- findInterval(y, dens$y)
ii <- cbind(ix, iy)
return(dens$z[ii])
}
library(ggplot2)
data.plot = normalized_matrix %>% t %>%
as.data.frame() %>%
dplyr::select(CD4_PROT, CD8_PROT, CD27_PROT, CD19_PROT)
data.plot = data.plot %>% dplyr::mutate(density = get_density(data.plot$CD4_PROT, data.plot$CD8_PROT, n = 100))
p1 = ggplot(data.plot, aes(x = CD8_PROT, y = CD4_PROT, color = density)) +
geom_point(size = 0.4) + theme_bw() + ggtitle("small example dataset") +
geom_vline(xintercept = 0, color = "red", linetype = 2) +
geom_hline(yintercept = 0, color = "red", linetype = 2) +
theme(axis.text = element_text(face = "bold",size = 12)) +
viridis::scale_color_viridis(option = "B") +
scale_shape_identity()
data.plot = data.plot %>% dplyr::mutate(density = get_density(data.plot$CD19_PROT, data.plot$CD27_PROT, n = 100))
p2 = ggplot(data.plot, aes(x = CD19_PROT, y = CD27_PROT, color = density)) +
geom_point(size = 0.4) + theme_bw() +
geom_vline(xintercept = 0, color = "red", linetype = 2) +
geom_hline(yintercept = 0, color = "red", linetype = 2) +
theme(axis.text = element_text(face = "bold",size = 12)) +
viridis::scale_color_viridis(option = "B") +
scale_shape_identity()
cowplot::plot_grid(p1,p2)
```
## How do I get the empty droplets?
**This is covered more extensively in the detailed workflow vignette**
There are a number of ways to get the empty drops. If you are using cell hashing, when you demultiplex the cells, you get a vector of empty or Negative droplets.
HTODemux function in Seurat:
https://satijalab.org/seurat/v3.1/hashing_vignette.html
deMULTIplex function from Multiseq (this is now also implemented in Seurat).
https://github.com/chris-mcginnis-ucsf/MULTI-seq
If you're not multiplexing
you can simply get a vector of negative droplets from the cells you would remove.
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
## Simple example workflow (Seurat Version 3)
```{r, eval=FALSE}
# get the ADT counts using Seurat version 3
seurat_object = HTODemux(seurat_object, assay = "HTO", positive.quantile = 0.99)
Idents(seurat_object) = "HTO_classification.global"
neg_object = subset(seurat_object, idents = "Negative")
singlet_object = subset(seurat_object, idents = "Singlet")
# non sparse CITEseq data actually store better in a regular materix so the as.matrix() call is not memory intensive.
neg_adt_matrix = GetAssayData(neg_object, assay = "CITE", slot = 'counts') %>% as.matrix()
positive_adt_matrix = GetAssayData(singlet_object, assay = "CITE", slot = 'counts') %>% as.matrix()
# normalize the data with dsb
# make sure you've run devtools::install_github(repo = 'MattPM/dsb')
normalized_matrix = DSBNormalizeProtein(cell_protein_matrix = positive_adt_matrix,
empty_drop_matrix = neg_adt_matrix)
# now add the normalized dat back to the object (the singlets defined above as "object")
singlet_object = SetAssayData(object = singlet_object, slot = "CITE", new.data = normalized_matrix)
```
As an additional QC step one can confirm the cells called as "Negative" have low RNA / gene content to be certain there are no contaminating cells.
We reccomend hash demultiplexing with the raw output from cellranger rather than the processed output (i.e. outs/raw_feature_bc_matrix). this file will have more empty droplets from which the HTODemux function will be able to estimate the negative distribution. This will also have the benefit of creating more droplets to use as built protein background controls in the DSB function.
## example workflow Seurat version 2
```{r, eval=FALSE}
# get the ADT counts using Seurat version 3
seurat_object = HTODemux(seurat_object, assay = "HTO", positive.quantile = 0.99)
neg = seurat_object %>%
SetAllIdent(id = "hto_classification_global") %>%
SubsetData(ident.use = "Negative")
singlet = seurat_object %>%
SetAllIdent(id = "hto_classification_global") %>%
SubsetData(ident.use = "Singlet")
# get negative and positive ADT data
neg_adt_matrix = neg@[email protected] %>% as.matrix()
pos_adt_matrix = singlet@[email protected] %>% as.matrix()
# normalize the data with dsb
# make sure you've run devtools::install_github(repo = 'MattPM/dsb')
normalized_matrix = DSBNormalizeProtein(cell_protein_matrix = pos_adt_matrix,
empty_drop_matrix = neg_adt_matrix)
# add the assay to the Seurat object
singlet = SetAssayData(object = singlet, slot = "CITE", new.data = normalized_matrix)
```
## How to get empty drops without cell hashing or sample demultiplexing
you can simply get a vector of negative droplets from the cells you would remove. There robust ways to estimate which cells are empty droplets:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1662-y
Below is a quick method to get outlier empty droplets assuming seurat_object is a object with most cells (i.e. any cell expressing at least a gene).
Get the nUMI from a seurat version 3 object
```{r, eval=FALSE}
# get the nUMI from a seurat version 3 object
umi = seurat_object$nUMI
```
Get the nUMI from a Seurat version 2 object
```{r, eval=FALSE}
# Get the nUMI from a Seurat version 2 object
umi = [email protected] %>% select("nUMI")
```
```{r, eval=FALSE}
mu_umi = mean(umi)
sd_umi = sd(umi)
# calculate a threshold for calling a cell negative
sub_threshold = mu_umi - (5*sd_umi)
# define the negative cell object
Idents(seurat_object) = "nUMI"
neg = subset(seurat_object, accept.high = sub_threshold)
```
This negative cell object can be used to define the negative background following the examples above.
**Please see the detailed workflow vignette for a full workflow and more details**