-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path07-exprs-qc.Rmd
390 lines (302 loc) · 14.3 KB
/
07-exprs-qc.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
---
output: html_document
---
# Expression QC (UMI) {#exprs-qc}
## Introduction
Once gene expression has been quantified it is summarized as an __expression matrix__ where each row corresponds to a gene (or transcript) and each column corresponds to a single cell. This matrix should be examined to remove poor quality cells which were not detected in either read QC or mapping QC steps. Failure to remove low quality cells at this
stage may add technical noise which has the potential to obscure
the biological signals of interest in the downstream analysis.
Since there is currently no standard method for performing scRNASeq the expected values for the various QC measures that will be presented here can vary substantially from experiment to experiment. Thus, to perform QC we will be looking for cells which are outliers with respect to the rest of the dataset rather than comparing to independent quality standards. Consequently, care should be taken when comparing quality metrics across datasets collected using different protocols.
## Tung dataset
To illustrate cell QC, we consider a
[dataset](http://jdblischak.github.io/singleCellSeq/analysis/) of
induced pluripotent stem cells generated from three different individuals [@Tung2017-ba] in [Yoav Gilad](http://giladlab.uchicago.edu/)'s lab at the
University of Chicago. The experiments were carried out on the
Fluidigm C1 platform and to facilitate the quantification both unique
molecular identifiers (UMIs) and ERCC _spike-ins_ were used. The data files are located in the `tung` folder in your working directory. These files are the copies of the original files made on the 15/03/16. We will use these copies for reproducibility purposes.
```{r, echo=FALSE}
library(knitr)
opts_chunk$set(out.width='90%', fig.align = 'center')
```
```{r, message=FALSE, warning=FALSE}
library(scater, quietly = TRUE)
library(knitr)
options(stringsAsFactors = FALSE)
```
Load the data and annotations:
```{r}
molecules <- read.table("tung/molecules.txt", sep = "\t")
anno <- read.table("tung/annotation.txt", sep = "\t", header = TRUE)
```
Inspect a small portion of the expression matrix
```{r}
knitr::kable(
head(molecules[ , 1:3]), booktabs = TRUE,
caption = 'A table of the first 6 rows and 3 columns of the molecules table.'
)
knitr::kable(
head(anno), booktabs = TRUE,
caption = 'A table of the first 6 rows of the anno table.'
)
```
The data consists of `r length(unique(anno$individual))` individuals and `r length(unique(anno$replicate))` replicates and therefore has `r length(unique(anno$batch))` batches in total.
We standardize the analysis by using the scater package. First, create the scater SCESet classes:
```{r}
pheno_data <- new("AnnotatedDataFrame", anno)
rownames(pheno_data) <- pheno_data$sample_id
umi <- scater::newSCESet(
countData = molecules,
phenoData = pheno_data
)
```
Remove genes that are not expressed in any cell:
```{r}
keep_feature <- rowSums(counts(umi) > 0) > 0
umi <- umi[keep_feature, ]
```
Define control features (genes) - ERCC spike-ins and mitochondrial genes ([provided](http://jdblischak.github.io/singleCellSeq/analysis/qc-filter-ipsc.html) by the authors):
```{r}
ercc <- featureNames(umi)[grepl("ERCC-", featureNames(umi))]
mt <- c("ENSG00000198899", "ENSG00000198727", "ENSG00000198888",
"ENSG00000198886", "ENSG00000212907", "ENSG00000198786",
"ENSG00000198695", "ENSG00000198712", "ENSG00000198804",
"ENSG00000198763", "ENSG00000228253", "ENSG00000198938",
"ENSG00000198840")
```
Calculate the quality metrics:
```{r}
umi <- scater::calculateQCMetrics(
umi,
feature_controls = list(ERCC = ercc, MT = mt)
)
```
## Cell QC
### Library size
Next we consider the total number of RNA molecules detected per
sample (if we were using read counts rather than UMI counts this would
be the total number of reads). Wells with few reads/molecules are likely to have
been broken or failed to capture a cell, and should thus be removed.
```{r total-counts-hist, fig.cap = "Histogram of library sizes for all cells"}
hist(
umi$total_counts,
breaks = 100
)
abline(v = 25000, col = "red")
```
__Exercise 1__
1. How many cells does our filter remove?
2. What distribution do you expect that the
total number of molecules for each cell should follow?
__Our answer__
```{r, echo=FALSE}
filter_by_total_counts <- (umi$total_counts > 25000)
knitr::kable(
as.data.frame(table(filter_by_total_counts)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by total counts filter (FALSE)'
)
```
### Detected genes (1)
In addition to ensuring sufficient sequencing depth for each sample, we also want to make sure that the reads are distributed across the transcriptome. Thus, we count the total number of unique genes detected in each sample.
```{r total-features-hist, fig.cap = "Histogram of the number of detected genes in all cells"}
hist(
umi$total_features,
breaks = 100
)
abline(v = 7000, col = "red")
```
From the plot we conclude that most cells have between 7,000-10,000 detected genes,
which is normal for high-depth scRNA-seq. However, this varies by
experimental protocol and sequencing depth. For example, droplet-based methods
or samples with lower sequencing-depth typically detect fewer genes per cell. The most notable feature in the above plot is the __"heavy tail"__ on the left hand side of the
distribution. If detection rates were equal across the cells then the
distribution should be approximately normal. Thus we remove those
cells in the tail of the distribution (fewer than 7,000 detected genes).
__Exercise 2__
How many cells does our filter remove?
__Our answer__
```{r, echo=FALSE}
filter_by_expr_features <- (umi$total_features > 7000)
knitr::kable(
as.data.frame(table(filter_by_expr_features)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by total features filter (FALSE)'
)
```
### ERCCs and MTs
Another measure of cell quality is the ratio between ERCC _spike-in_
RNAs and endogenous RNAs. This ratio can be used to estimate the total amount
of RNA in the captured cells. Cells with a high level of _spike-in_ RNAs
had low starting amounts of RNA, likely due to the cell being
dead or stressed which may result in the RNA being degraded.
```{r mt-vs-counts, fig.cap = "Percentage of counts in MT genes"}
scater::plotPhenoData(
umi,
aes_string(x = "total_features",
y = "pct_counts_feature_controls_MT",
colour = "batch")
)
```
```{r ercc-vs-counts, fig.cap = "Percentage of counts in ERCCs"}
scater::plotPhenoData(
umi,
aes_string(x = "total_features",
y = "pct_counts_feature_controls_ERCC",
colour = "batch")
)
```
The above analysis shows that majority of the cells from NA19098.r2 batch have a very high ERCC/Endo ratio. Indeed, it has been shown by the authors that this batch contains cells of smaller size.
__Exercise 3__
Create filters for removing batch NA19098.r2 and cells with high expression of mitochondrial genes (>10% of total counts in a cell).
__Our answer__
```{r, echo=FALSE}
filter_by_ERCC <- umi$batch != "NA19098.r2"
knitr::kable(
as.data.frame(table(filter_by_ERCC)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by ERCC filter (FALSE)'
)
filter_by_MT <- umi$pct_counts_feature_controls_MT < 10
knitr::kable(
as.data.frame(table(filter_by_MT)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by MT filter (FALSE)'
)
```
__Exercise 4__
What would you expect to see in the ERCC vs counts plot if you were examining a dataset containing cells of different sizes (eg. normal & senescent cells)?
__Answer__
You would expect to see a group corresponding to the smaller cells (normal) with a higher fraction of ERCC reads than a separate group corresponding to the larger cells (senescent).
## Cell filtering
### Manual
Now we can define a cell filter based on our previous analysis:
```{r}
umi$use <- (
# sufficient features (genes)
filter_by_expr_features &
# sufficient molecules counted
filter_by_total_counts &
# sufficient endogenous RNA
filter_by_ERCC &
# remove cells with unusual number of reads in MT genes
filter_by_MT
)
```
```{r}
knitr::kable(
as.data.frame(table(umi$use)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by manual filter (FALSE)'
)
```
### Default Thresholds
Results from the biological analysis of single-cell RNA-seq data are often strongly influenced by outliers. Thus, it is important to include multiple filters. A robust way of detecting outliers is through the median absolute difference which is defined as $d_i = |r_i - m|$, where $r_i$ is the number of reads in cell $i$ and $m$ is the median number of reads across the all cells in the sample. By default, scater removes all cells where $d_i>5*$median($d_i$). A similar filter is used for the number of detected genes. Furthermore, scater removes all cells where >80% of counts were assigned to control genes and any cells that have been marked as controls.
```{r}
umi$use_default <- (
# remove cells with unusual numbers of genes
!umi$filter_on_total_features &
# remove cells with unusual numbers molecules counted
!umi$filter_on_total_counts &
# < 80% ERCC spike-in
!umi$filter_on_pct_counts_feature_controls_ERCC &
# < 80% mitochondrial
!umi$filter_on_pct_counts_feature_controls_MT &
# controls shouldn't be used in downstream analysis
!umi$is_cell_control
)
```
```{r}
knitr::kable(
as.data.frame(table(umi$use_default)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by default filter (FALSE)'
)
```
### Automatic
Another option available in __scater__ is to conduct PCA on a set of QC metrics and then use automatic outlier detection to identify potentially problematic cells.
By default, the following metrics are used for PCA-based outlier detection:
* __pct_counts_top_100_features__
* __total_features__
* __pct_counts_feature_controls__
* __n_detected_feature_controls__
* __log10_counts_endogenous_features__
* __log10_counts_feature_controls__
scater first creates a matrix where the rows represent cells and the columns represent the different QC metrics. Here, the PCA plot provides a 2D representation of cells ordered by their quality metrics. The outliers are then detected using methods from the mvoutlier package.
```{r auto-cell-filt, fig.align='center', fig.cap="PCA plot used for automatic detection of cell outliers", message=FALSE, warning=FALSE, out.width='90%'}
umi <-
scater::plotPCA(umi,
size_by = "total_features",
shape_by = "use",
pca_data_input = "pdata",
detect_outliers = TRUE,
return_SCESet = TRUE)
```
```{r}
knitr::kable(
as.data.frame(table(umi$outlier)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of cells removed by automatic filter (FALSE)'
)
```
## Compare filterings
__Exercise 5__
Compare the default, automatic and manual cell filters. Plot a Venn diagram of the outlier cells from these filterings.
__Hint__: Use `limma::vennCounts` and `limma::vennDiagram` functions from the [limma](https://bioconductor.org/packages/release/bioc/html/limma.html) package to make a Venn diagram.
__Answer__
```{r cell-filt-comp, fig.cap = "Comparison of the default, automatic and manual cell filters", echo=FALSE}
def <- colnames(umi)[!umi$use_default]
auto <- colnames(umi)[umi$outlier]
man <- colnames(umi)[!umi$use]
venn.diag <- limma::vennCounts(cbind(colnames(umi) %in% def,
colnames(umi) %in% auto,
colnames(umi) %in% man))
limma::vennDiagram(venn.diag,
names = c("Default", "Automatic", "Manual"),
circle.col = c("magenta", "blue", "green"))
```
## Gene analysis
### Gene expression
In addition to removing cells with poor quality, it is usually a good idea to exclude genes where we suspect that technical artefacts may have skewed the results. Moreover, inspection of the gene expression profiles may provide insights about how the experimental procedures could be improved.
It is often instructive to consider the number of reads consumed by the top 50 expressed genes.
```{r top50-gene-expr, fig.cap = "Number of total counts consumed by the top 50 expressed genes", fig.asp = 1}
scater::plotQC(umi, type = "highest-expression")
```
The distributions are relatively flat indicating (but not guaranteeing!) good coverage of the full transcriptome of these cells. However, there are several spike-ins in the top 15 genes which suggests a greater dilution of the spike-ins may be preferrable if the experiment is to be repeated.
### Gene filtering
It is typically a good idea to remove genes whose expression level is considered __"undetectable"__. We define a gene as detectable if at least two cells contain more than 1 transcript from the gene. If we were considering read counts rather than UMI counts a reasonable threshold is to require at least five reads in at least two cells. However, in both cases the threshold strongly depends on the sequencing depth. It is important to keep in mind that genes must be filtered after cell filtering since some genes may only be detected in poor quality cells (__note__ `pData(umi)$use` filter applied to the `umi` dataset).
```{r}
filter_genes <- apply(counts(umi[ , pData(umi)$use]), 1,
function(x) length(x[x > 1]) >= 2)
fData(umi)$use <- filter_genes
```
```{r}
knitr::kable(
as.data.frame(table(filter_genes)),
booktabs = TRUE,
row.names = FALSE,
caption = 'The number of genes removed by gene filter (FALSE)'
)
```
Depending on the cell-type, protocol and sequencing depth, other cut-offs may be appropriate.
## Save the data
Dimensions of the QCed dataset (do not forget about the gene filter we defined above):
```{r}
dim(umi[fData(umi)$use, pData(umi)$use])
```
Let's create an additional slot with log-transformed counts (we will need it in the next chapters):
```{r}
set_exprs(umi, "log2_counts") <- log2(counts(umi) + 1)
```
Save the data:
```{r}
saveRDS(umi, file = "tung/umi.rds")
```
## Big Exercise
Perform exactly the same QC analysis with read counts of the same Blischak data. Use `tung/reads.txt` file to load the reads. Once you have finished please compare your results to ours (next chapter).