forked from thelovelab/DESeq2
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathDESeq2.Rmd
2836 lines (2339 loc) · 115 KB
/
DESeq2.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Analyzing RNA-seq data with DESeq2"
author: "Michael I. Love, Simon Anders, and Wolfgang Huber"
date: "`r format(Sys.Date(), '%m/%d/%Y')`"
abstract: >
A basic task in the analysis of count data from RNA-seq is the
detection of differentially expressed genes. The count data are
presented as a table which reports, for each sample, the number of
sequence fragments that have been assigned to each gene. Analogous
data also arise for other assay types, including comparative ChIP-Seq,
HiC, shRNA screening, and mass spectrometry. An important analysis
question is the quantification and statistical inference of systematic
changes between conditions, as compared to within-condition
variability. The package DESeq2 provides methods to test for
differential expression by use of negative binomial generalized linear
models; the estimates of dispersion and logarithmic fold changes
incorporate data-driven prior distributions. This vignette explains the
use of the package and demonstrates typical workflows.
[An RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/)
on the Bioconductor website covers similar material to this vignette
but at a slower pace, including the generation of count matrices from
FASTQ files.
DESeq2 package version: `r packageVersion("DESeq2")`
output:
rmarkdown::html_document:
highlight: pygments
toc: true
fig_width: 5
bibliography: library.bib
vignette: >
%\VignetteIndexEntry{Analyzing RNA-seq data with DESeq2}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\usepackage[utf8]{inputenc}
---
```{r setup, echo=FALSE, results="hide"}
knitr::opts_chunk$set(tidy = FALSE,
cache = FALSE,
dev = "png",
message = FALSE, error = FALSE, warning = TRUE)
```
# Standard workflow
**Note:** if you use DESeq2 in published research, please cite:
> Love, M.I., Huber, W., Anders, S. (2014)
> Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.
> *Genome Biology*, **15**:550.
> [10.1186/s13059-014-0550-8](http://dx.doi.org/10.1186/s13059-014-0550-8)
Other Bioconductor packages with similar aims are
[edgeR](http://bioconductor.org/packages/edgeR),
[limma](http://bioconductor.org/packages/limma),
[DSS](http://bioconductor.org/packages/DSS),
[EBSeq](http://bioconductor.org/packages/EBSeq), and
[baySeq](http://bioconductor.org/packages/baySeq).
## Quick start
Here we show the most basic steps for a differential expression
analysis. There are a variety of steps upstream of DESeq2 that result
in the generation of counts or estimated counts for each sample, which
we will discuss in the sections below. This code chunk assumes that
you have a count matrix called `cts` and a table of sample
information called `coldata`. The `design` indicates how to model the
samples, here, that we want to measure the effect of the condition,
controlling for batch differences. The two factor variables `batch`
and `condition` should be columns of `coldata`.
```{r quickStart, eval=FALSE}
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design= ~ batch + condition)
dds <- DESeq(dds)
resultsNames(dds) # lists the coefficients
res <- results(dds, name="condition_trt_vs_untrt")
# or to shrink log fold changes association with condition:
res <- lfcShrink(dds, coef="condition_trt_vs_untrt", type="apeglm")
```
The following starting functions will be explained below:
* If you have performed transcript quantification
(with *Salmon*, *kallisto*, *RSEM*, etc.)
you could import the data with *tximport*, which produces a list,
and then you can use `DESeqDataSetFromTximport()`.
* If you imported quantification data with *tximeta*, which produces a
*SummarizedExperiment* with additional metadata, you can then use
`DESeqDataSet()`.
* If you have *htseq-count* files, you can use
`DESeqDataSetFromHTSeq()`.
## How to get help for DESeq2
Any and all DESeq2 questions should be posted to the
**Bioconductor support site**, which serves as a searchable knowledge
base of questions and answers:
<https://support.bioconductor.org>
Posting a question and tagging with "DESeq2" will automatically send
an alert to the package authors to respond on the support site. See
the first question in the list of [Frequently Asked Questions](#FAQ)
(FAQ) for information about how to construct an informative post.
You should **not** email your question to the package authors, as we will
just reply that the question should be posted to the
**Bioconductor support site**.
## Acknowledgments
Constantin Ahlmann-Eltze has contributed core code for increasing the
computational performance of *DESeq2*.
We have benefited in the development of *DESeq2* from the help and
feedback of many individuals, including but not limited to:
The Bionconductor Core Team,
Alejandro Reyes, Andrzej Oles, Aleksandra Pekowska, Felix Klein,
Nikolaos Ignatiadis (IHW),
Anqi Zhu (apeglm),
Joseph Ibrahim (apeglm),
Vince Carey,
Owen Solberg,
Ruping Sun,
Devon Ryan,
Steve Lianoglou, Jessica Larson, Christina Chaivorapol, Pan Du, Richard Bourgon,
Willem Talloen,
Elin Videvall, Hanneke van Deutekom,
Todd Burwell,
Jesse Rowley,
Igor Dolgalev,
Stephen Turner,
Ryan C Thompson,
Tyr Wiesner-Hanks,
Konrad Rudolph,
David Robinson,
Mingxiang Teng,
Mathias Lesche,
Sonali Arora,
Jordan Ramilowski,
Ian Dworkin,
Bjorn Gruning,
Ryan McMinds,
Paul Gordon,
Leonardo Collado Torres,
Enrico Ferrero,
Peter Langfelder,
Gavin Kelly,
Rob Patro,
Charlotte Soneson,
Koen Van den Berge,
Fanny Perraudeau,
Davide Risso,
Stephan Engelbrecht,
Nicolas Alcala,
Jeremy Simon,
Travis Ptacek,
Rory Kirchner,
R. Butler,
Ben Keith,
Dan Liang,
Nil Aygün.
## Input data
### Why un-normalized counts?
As input, the DESeq2 package expects count data as obtained, e.g.,
from RNA-seq or another high-throughput sequencing experiment, in the form of a
matrix of integer values. The value in the *i*-th row and the *j*-th column of
the matrix tells how many reads can be assigned to gene *i* in sample *j*.
Analogously, for other types of assays, the rows of the matrix might correspond
e.g. to binding regions (with ChIP-Seq) or peptide sequences (with
quantitative mass spectrometry). We will list method for obtaining count matrices
in sections below.
The values in the matrix should be un-normalized counts or estimated
counts of sequencing reads (for
single-end RNA-seq) or fragments (for paired-end RNA-seq).
The [RNA-seq workflow](http://www.bioconductor.org/help/workflows/rnaseqGene/)
describes multiple techniques for preparing such count matrices. It
is important to provide count matrices as input for DESeq2's
statistical model [@Love2014] to hold, as only the count values allow
assessing the measurement precision correctly. The DESeq2 model
internally corrects for library size, so transformed or normalized
values such as counts scaled by library size should not be used as
input.
### The DESeqDataSet
The object class used by the DESeq2 package to store the read counts
and the intermediate estimated quantities during statistical analysis
is the *DESeqDataSet*, which will usually be represented in the code
here as an object `dds`.
A technical detail is that the *DESeqDataSet* class extends the
*RangedSummarizedExperiment* class of the
[SummarizedExperiment](http://bioconductor.org/packages/SummarizedExperiment) package.
The "Ranged" part refers to the fact that the rows of the assay data
(here, the counts) can be associated with genomic ranges (the exons of genes).
This association facilitates downstream exploration of results, making use of
other Bioconductor packages' range-based functionality
(e.g. find the closest ChIP-seq peaks to the differentially expressed genes).
A *DESeqDataSet* object must have an associated *design formula*.
The design formula expresses the variables which will be
used in modeling. The formula should be a tilde (~) followed by the
variables with plus signs between them (it will be coerced into an
*formula* if it is not already). The design can be changed later,
however then all differential analysis steps should be repeated,
as the design formula is used to estimate the dispersions and
to estimate the log2 fold changes of the model.
*Note*: In order to benefit from the default settings of the
package, you should put the variable of interest at the end of the
formula and make sure the control level is the first level.
We will now show 4 ways of constructing a *DESeqDataSet*, depending
on what pipeline was used upstream of DESeq2 to generated counts or
estimated counts:
1) From [transcript abundance files and tximport](#tximport)
2) From a [count matrix](#countmat)
3) From [htseq-count files](#htseq)
4) From a [SummarizedExperiment](#se) object
<a name="tximport"/>
### Transcript abundance files and *tximport* / *tximeta*
Our recommended pipeline for *DESeq2* is to use fast transcript
abundance quantifiers upstream of DESeq2, and then to create
gene-level count matrices for use with DESeq2
by importing the quantification data using
[tximport](http://bioconductor.org/packages/tximport) [@Soneson2015].
This workflow allows users to import transcript abundance estimates
from a variety of external software, including the following methods:
* [Salmon](http://combine-lab.github.io/salmon/)
[@Patro2017Salmon]
* [Sailfish](http://www.cs.cmu.edu/~ckingsf/software/sailfish/)
[@Patro2014Sailfish]
* [kallisto](https://pachterlab.github.io/kallisto/about.html)
[@Bray2016Near]
* [RSEM](http://deweylab.github.io/RSEM/)
[@Li2011RSEM]
Some advantages of using the above methods for transcript abundance
estimation are:
(i) this approach corrects for potential changes in gene length across samples
(e.g. from differential isoform usage) [@Trapnell2013Differential],
(ii) some of these methods (*Salmon*, *Sailfish*, *kallisto*)
are substantially faster and require less memory
and disk usage compared to alignment-based methods that require
creation and storage of BAM files, and
(iii) it is possible to avoid discarding those fragments that can
align to multiple genes with homologous sequence, thus increasing
sensitivity [@Robert2015Errors].
Full details on the motivation and methods for importing transcript
level abundance and count estimates, summarizing to gene-level count matrices
and producing an offset which corrects for potential changes in average
transcript length across samples are described in [@Soneson2015].
Note that the tximport-to-DESeq2 approach uses *estimated* gene
counts from the transcript abundance quantifiers, but not *normalized*
counts.
A tutorial on how to use the *Salmon* software for quantifying
transcript abundance can be
found [here](https://combine-lab.github.io/salmon/getting_started/).
We recommend using the `--gcBias`
[flag](http://salmon.readthedocs.io/en/latest/salmon.html#gcbias)
which estimates a correction factor for systematic biases
commonly present in RNA-seq data [@Love2016Modeling; @Patro2017Salmon],
unless you are certain that your data do not contain such bias.
Here, we demonstrate how to import transcript abundances
and construct of a gene-level *DESeqDataSet* object
from *Salmon* `quant.sf` files, which are
stored in the [tximportData](http://bioconductor.org/packages/tximportData) package.
You do not need the `tximportData` package for your analysis, it is
only used here for demonstration.
Note that, instead of locating `dir` using *system.file*,
a user would typically just provide a path, e.g. `/path/to/quant/files`.
For a typical use, the `condition` information should already be
present as a column of the sample table `samples`, while here we
construct artificial condition labels for demonstration.
```{r txiSetup}
library("tximport")
library("readr")
library("tximportData")
dir <- system.file("extdata", package="tximportData")
samples <- read.table(file.path(dir,"samples.txt"), header=TRUE)
samples$condition <- factor(rep(c("A","B"),each=3))
rownames(samples) <- samples$run
samples[,c("pop","center","run","condition")]
```
Next we specify the path to the files using the appropriate columns of
`samples`, and we read in a table that links transcripts to genes for
this dataset.
```{r txiFiles}
files <- file.path(dir,"salmon", samples$run, "quant.sf.gz")
names(files) <- samples$run
tx2gene <- read_csv(file.path(dir, "tx2gene.gencode.v27.csv"))
```
We import the necessary quantification data for DESeq2 using the
*tximport* function. For further details on use of *tximport*,
including the construction of the `tx2gene` table for linking
transcripts to genes in your dataset, please refer to the
[tximport](http://bioconductor.org/packages/tximport) package vignette.
```{r tximport, results="hide"}
txi <- tximport(files, type="salmon", tx2gene=tx2gene)
```
Finally, we can construct a *DESeqDataSet* from the `txi` object and
sample information in `samples`.
```{r txi2dds, results="hide"}
library("DESeq2")
ddsTxi <- DESeqDataSetFromTximport(txi,
colData = samples,
design = ~ condition)
```
The `ddsTxi` object here can then be used as `dds` in the
following analysis steps.
### Tximeta for import with automatic metadata
Another Bioconductor package,
[tximeta](https://bioconductor.org/packages/tximeta) [@Love2020],
extends *tximport*, offering the same functionality, plus the
additional benefit of automatic addition of annotation metadata for
commonly used transcriptomes (GENCODE, Ensembl, RefSeq for human and
mouse). See the [tximeta](https://bioconductor.org/packages/tximeta)
package vignette for more details. *tximeta* produces a
*SummarizedExperiment* that can be loaded easily into *DESeq2* using
the `DESeqDataSet` function, with an example in the *tximeta* package
vignette, and below:
```{r}
coldata <- samples
coldata$files <- files
coldata$names <- coldata$run
```
```{r echo=FALSE}
library("tximeta")
se <- tximeta(coldata, skipMeta=TRUE)
ddsTxi2 <- DESeqDataSet(se, design = ~condition)
```
```{r eval=FALSE}
library("tximeta")
se <- tximeta(coldata)
ddsTxi <- DESeqDataSet(se, design = ~ condition)
```
The `ddsTxi` object here can then be used as `dds` in the
following analysis steps. If *tximeta* recognized the reference
transcriptome as one of those with a pre-computed hashed checksum, the
`rowRanges` of the `dds` object will be pre-populated. Again, see the
*tximeta* vignette for full details.
<a name="countmat"/>
### Count matrix input
Alternatively, the function *DESeqDataSetFromMatrix* can be
used if you already have a matrix of read counts prepared from another
source. Another method for quickly producing count matrices
from alignment files is the *featureCounts* function [@Liao2013feature]
in the [Rsubread](http://bioconductor.org/packages/Rsubread) package.
To use *DESeqDataSetFromMatrix*, the user should provide
the counts matrix, the information about the samples (the columns of the
count matrix) as a *DataFrame* or *data.frame*, and the design formula.
To demonstate the use of *DESeqDataSetFromMatrix*,
we will read in count data from the
[pasilla](http://bioconductor.org/packages/pasilla) package.
We read in a count matrix, which we will name `cts`,
and the sample information table, which we will name `coldata`.
Further below we describe how to extract these objects from,
e.g. *featureCounts* output.
```{r loadPasilla}
library("pasilla")
pasCts <- system.file("extdata",
"pasilla_gene_counts.tsv",
package="pasilla", mustWork=TRUE)
pasAnno <- system.file("extdata",
"pasilla_sample_annotation.csv",
package="pasilla", mustWork=TRUE)
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)
```
We examine the count matrix and column data to see if they are
consistent in terms of sample order.
```{r showPasilla}
head(cts,2)
coldata
```
Note that these are not in the same order with respect to samples!
It is absolutely critical that the columns of the count matrix and the
rows of the column data (information about samples) are in the same
order. DESeq2 will not make guesses as to which column of the count
matrix belongs to which row of the column data, these must be provided
to DESeq2 already in consistent order.
As they are not in the correct order as given, we need to re-arrange
one or the other so that they are consistent in terms of sample order
(if we do not, later functions would produce an error). We
additionally need to chop off the `"fb"` of the row names of
`coldata`, so the naming is consistent.
```{r reorderPasila}
rownames(coldata) <- sub("fb", "", rownames(coldata))
all(rownames(coldata) %in% colnames(cts))
all(rownames(coldata) == colnames(cts))
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))
```
If you have used the *featureCounts* function [@Liao2013feature] in the
[Rsubread](http://bioconductor.org/packages/Rsubread) package,
the matrix of read counts can be directly
provided from the `"counts"` element in the list output.
The count matrix and column data can typically be read into R
from flat files using base R functions such as *read.csv*
or *read.delim*. For *htseq-count* files, see the dedicated input
function below.
With the count matrix, `cts`, and the sample
information, `coldata`, we can construct a *DESeqDataSet*:
```{r matrixInput}
library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
colData = coldata,
design = ~ condition)
dds
```
If you have additional feature data, it can be added to the
*DESeqDataSet* by adding to the metadata columns of a newly
constructed object. (Here we add redundant data just for demonstration, as
the gene names are already the rownames of the `dds`.)
```{r addFeatureData}
featureData <- data.frame(gene=rownames(cts))
mcols(dds) <- DataFrame(mcols(dds), featureData)
mcols(dds)
```
<a name="htseq"/>
### *htseq-count* input
You can use the function *DESeqDataSetFromHTSeqCount* if you
have used *htseq-count* from the
[HTSeq](http://www-huber.embl.de/users/anders/HTSeq)
python package [@Anders:2014:htseq].
For an example of using the python scripts, see the
[pasilla](http://bioconductor.org/packages/pasilla) data package. First you will want to specify a
variable which points to the directory in which the *htseq-count*
output files are located.
```{r htseqDirI, eval=FALSE}
directory <- "/path/to/your/files/"
```
However, for demonstration purposes only, the following line of
code points to the directory for the demo *htseq-count* output
files packages for the [pasilla](http://bioconductor.org/packages/pasilla) package.
```{r htseqDirII}
directory <- system.file("extdata", package="pasilla",
mustWork=TRUE)
```
We specify which files to read in using *list.files*,
and select those files which contain the string `"treated"`
using *grep*. The *sub* function is used to
chop up the sample filename to obtain the condition status, or
you might alternatively read in a phenotypic table
using *read.table*.
```{r htseqInput}
sampleFiles <- grep("treated",list.files(directory),value=TRUE)
sampleCondition <- sub("(.*treated).*","\\1",sampleFiles)
sampleTable <- data.frame(sampleName = sampleFiles,
fileName = sampleFiles,
condition = sampleCondition)
sampleTable$condition <- factor(sampleTable$condition)
```
Then we build the *DESeqDataSet* using the following function:
```{r hsteqDds}
library("DESeq2")
ddsHTSeq <- DESeqDataSetFromHTSeqCount(sampleTable = sampleTable,
directory = directory,
design= ~ condition)
ddsHTSeq
```
<a name="se"/>
### *SummarizedExperiment* input
If one has already created or obtained a *SummarizedExperiment*, it
can be easily input into DESeq2 as follows. First we load the package
containing the `airway` dataset.
```{r loadSumExp}
library("airway")
data("airway")
se <- airway
```
The constructor function below shows the generation of a
*DESeqDataSet* from a *RangedSummarizedExperiment* `se`.
```{r sumExpInput}
library("DESeq2")
ddsSE <- DESeqDataSet(se, design = ~ cell + dex)
ddsSE
```
### Pre-filtering
While it is not necessary to pre-filter low count genes before running
the DESeq2 functions, there are two reasons which make pre-filtering
useful: by removing rows in which there are very few reads, we reduce
the memory size of the `dds` data object, and we increase the speed of
the transformation and testing functions within DESeq2. Here we
perform a minimal pre-filtering to keep only rows that have at least
10 reads total. Note that more strict filtering to increase power is
*automatically* applied via [independent filtering](#indfilt) on the
mean of normalized counts within the *results* function.
```{r prefilter}
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]
```
<a name="factorlevels"/>
### Note on factor levels
By default, R will choose a *reference level* for factors based on
alphabetical order. Then, if you never tell the DESeq2 functions which
level you want to compare against (e.g. which level represents the
control group), the comparisons will be based on the alphabetical
order of the levels. There are two solutions: you can either
explicitly tell *results* which comparison to make using the
`contrast` argument (this will be shown later), or you can explicitly
set the factors levels. In order to see the change of reference levels
reflected in the results names, you need to either run `DESeq` or
`nbinomWaldTest`/`nbinomLRT` after the re-leveling operation.
Setting the factor levels can be done in two ways, either using
factor:
```{r factorlvl}
dds$condition <- factor(dds$condition, levels = c("untreated","treated"))
```
...or using *relevel*, just specifying the reference level:
```{r relevel}
dds$condition <- relevel(dds$condition, ref = "untreated")
```
If you need to subset the columns of a *DESeqDataSet*,
i.e., when removing certain samples from the analysis, it is possible
that all the samples for one or more levels of a variable in the design
formula would be removed. In this case, the *droplevels* function can be used
to remove those levels which do not have samples in the current *DESeqDataSet*:
```{r droplevels}
dds$condition <- droplevels(dds$condition)
```
### Collapsing technical replicates
DESeq2 provides a function *collapseReplicates* which can
assist in combining the counts from technical replicates into single
columns of the count matrix. The term *technical replicate*
implies multiple sequencing runs of the same library.
You should not collapse biological replicates using this function.
See the manual page for an example of the use of
*collapseReplicates*.
### About the pasilla dataset
We continue with the [pasilla](http://bioconductor.org/packages/pasilla) data constructed from the
count matrix method above. This data set is from an experiment on
*Drosophila melanogaster* cell cultures and investigated the
effect of RNAi knock-down of the splicing factor *pasilla*
[@Brooks2010]. The detailed transcript of the production of
the [pasilla](http://bioconductor.org/packages/pasilla) data is provided in the vignette of the
data package [pasilla](http://bioconductor.org/packages/pasilla).
<a name="de"/>
## Differential expression analysis
The standard differential expression analysis steps are wrapped
into a single function, *DESeq*. The estimation steps performed
by this function are described [below](#theory), in the manual page for
`?DESeq` and in the Methods section of the DESeq2 publication [@Love2014].
Results tables are generated using the function *results*, which
extracts a results table with log2 fold changes, *p* values and adjusted
*p* values. With no additional arguments to *results*, the log2 fold change and
Wald test *p* value will be for the **last variable** in the design
formula, and if this is a factor, the comparison will be the **last
level** of this variable over the **reference level**
(see previous [note on factor levels](#factorlevels)).
However, the order of the variables of the design do not matter
so long as the user specifies the comparison to build a results table
for, using the `name` or `contrast` arguments of *results*.
Details about the comparison are printed to the console, directly above the
results table. The text, `condition treated vs untreated`, tells you that the
estimates are of the logarithmic fold change log2(treated/untreated).
```{r deseq}
dds <- DESeq(dds)
res <- results(dds)
res
```
Note that we could have specified the coefficient or contrast we want
to build a results table for, using either of the following equivalent
commands:
```{r eval=FALSE}
res <- results(dds, name="condition_treated_vs_untreated")
res <- results(dds, contrast=c("condition","treated","untreated"))
```
One exception to the equivalence of these two commands, is that, using
`contrast` will additionally set to 0 the estimated LFC in a
comparison of two groups, where all of the counts in the two groups
are equal to 0 (while other groups have positive counts). As this may
be a desired feature to have the LFC in these cases set to 0, one can
use `contrast` to build these results tables.
More information about extracting specific coefficients from a fitted
*DESeqDataSet* object can be found in the help page `?results`.
The use of the `contrast` argument is also further discussed [below](#contrasts).
<a name="lfcShrink"/>
### Log fold change shrinkage for visualization and ranking
Shrinkage of effect size (LFC estimates) is useful for visualization
and ranking of genes. To shrink the LFC, we pass the `dds`
object to the function `lfcShrink`. Below we specify to use the
*apeglm* method for effect size shrinkage [@Zhu2018], which improves
on the previous estimator.
We provide the `dds` object and the name or number of the
coefficient we want to shrink, where the number refers to the order
of the coefficient as it appears in `resultsNames(dds)`.
```{r lfcShrink}
resultsNames(dds)
resLFC <- lfcShrink(dds, coef="condition_treated_vs_untreated", type="apeglm")
resLFC
```
Shrinkage estimation is discussed more in a [later section](#altshrink).
<a name="parallel"/>
### Using parallelization
The above steps should take less than 30 seconds for most
analyses. For experiments with complex designs and many samples
(e.g. dozens of coefficients, ~100s of samples), one
can take advantage of parallelized computation. Parallelizing `DESeq`,
`results`, and `lfcShrink` can be easily accomplished by loading the
BiocParallel package, and then setting the following arguments:
`parallel=TRUE` and `BPPARAM=MulticoreParam(4)`, for example,
splitting the job over 4 cores. Note that `results` for coefficients
or contrasts listed in `resultsNames(dds)` is fast and will not need
parallelization. As an alternative to `BPPARAM`, one can `register`
cores at the beginning of an analysis, and then just specify
`parallel=TRUE` to the functions when called.
```{r parallel, eval=FALSE}
library("BiocParallel")
register(MulticoreParam(4))
```
### p-values and adjusted p-values
We can order our results table by the smallest *p* value:
```{r resOrder}
resOrdered <- res[order(res$pvalue),]
```
We can summarize some basic tallies using the
*summary* function.
```{r sumRes}
summary(res)
```
How many adjusted p-values were less than 0.1?
```{r sumRes01}
sum(res$padj < 0.1, na.rm=TRUE)
```
The *results* function contains a number of arguments to
customize the results table which is generated. You can read about
these arguments by looking up `?results`.
Note that the *results* function automatically performs independent
filtering based on the mean of normalized counts for each gene,
optimizing the number of genes which will have an adjusted *p* value
below a given FDR cutoff, `alpha`.
Independent filtering is further discussed [below](#indfilt).
By default the argument `alpha` is set to $0.1$. If the adjusted *p*
value cutoff will be a value other than $0.1$, `alpha` should be set to
that value:
```{r resAlpha05}
res05 <- results(dds, alpha=0.05)
summary(res05)
sum(res05$padj < 0.05, na.rm=TRUE)
```
<a name="IHW"/>
### Independent hypothesis weighting
A generalization of the idea of *p* value filtering is to *weight*
hypotheses to optimize power. A Bioconductor
package, [IHW](http://bioconductor.org/packages/IHW), is available
that implements the method of *Independent Hypothesis Weighting*
[@Ignatiadis2016]. Here we show the use of *IHW* for *p* value
adjustment of DESeq2 results. For more details, please see the
vignette of the [IHW](http://bioconductor.org/packages/IHW)
package. The *IHW* result object is stored in the metadata.
**Note:** If the results of independent hypothesis weighting are used
in published research, please cite:
> Ignatiadis, N., Klaus, B., Zaugg, J.B., Huber, W. (2016)
> Data-driven hypothesis weighting increases detection power in genome-scale multiple testing.
> *Nature Methods*, **13**:7.
> [10.1038/nmeth.3885](http://dx.doi.org/10.1038/nmeth.3885)
```{r IHW, eval=FALSE}
# (unevaluated code chunk)
library("IHW")
resIHW <- results(dds, filterFun=ihw)
summary(resIHW)
sum(resIHW$padj < 0.1, na.rm=TRUE)
metadata(resIHW)$ihwResult
```
For advanced users, note that all the values calculated by the DESeq2
package are stored in the *DESeqDataSet* object or the *DESeqResults*
object, and access to these values is discussed [below](#access).
## Exploring and exporting results
### MA-plot
In DESeq2, the function *plotMA* shows the log2
fold changes attributable to a given variable over the mean of
normalized counts for all the samples in the *DESeqDataSet*.
Points will be colored red if the adjusted *p* value is less than 0.1.
Points which fall out of the window are plotted as open triangles pointing
either up or down.
```{r MA}
plotMA(res, ylim=c(-2,2))
```
It is more useful visualize the MA-plot for the shrunken log2 fold
changes, which remove the noise associated with log2 fold changes from
low count genes without requiring arbitrary filtering thresholds.
```{r shrunkMA}
plotMA(resLFC, ylim=c(-2,2))
```
After calling *plotMA*, one can use the function
*identify* to interactively detect the row number of
individual genes by clicking on the plot. One can then recover
the gene identifiers by saving the resulting indices:
```{r MAidentify, eval=FALSE}
idx <- identify(res$baseMean, res$log2FoldChange)
rownames(res)[idx]
```
<a name="shrink"/>
### Alternative shrinkage estimators
The moderated log fold changes proposed by @Love2014 use a normal
prior distribution, centered on zero and with a scale that is fit to
the data. The shrunken log fold changes are useful for ranking and
visualization, without the need for arbitrary filters on low count
genes. The normal prior can sometimes produce too strong of
shrinkage for certain datasets. In DESeq2 version 1.18, we include two
additional adaptive shrinkage estimators, available via the `type`
argument of `lfcShrink`. For more details, see `?lfcShrink`
The options for `type` are:
* `apeglm` is the adaptive t prior shrinkage estimator from the
[apeglm](http://bioconductor.org/packages/apeglm) package
[@Zhu2018]. As of version 1.28.0, it is the default estimator.
* `ashr` is the adaptive shrinkage estimator from the
[ashr](https://github.com/stephens999/ashr) package [@Stephens2016].
Here DESeq2 uses the ashr option to fit a mixture of Normal distributions to
form the prior, with `method="shrinkage"`.
* `normal` is the the original DESeq2 shrinkage estimator, an adaptive
Normal distribution as prior.
If the shrinkage estimator `apeglm` is used in published research, please cite:
> Zhu, A., Ibrahim, J.G., Love, M.I. (2018)
> Heavy-tailed prior distributions for sequence count data:
> removing the noise and preserving large differences.
> *Bioinformatics*. [10.1093/bioinformatics/bty895](https://doi.org/10.1093/bioinformatics/bty895)
If the shrinkage estimator `ashr` is used in published research, please cite:
> Stephens, M. (2016)
> False discovery rates: a new deal. *Biostatistics*, **18**:2.
> [10.1093/biostatistics/kxw041](https://doi.org/10.1093/biostatistics/kxw041)
In the LFC shrinkage code above, we specified
`coef="condition_treated_vs_untreated"`. We can also just specify the
coefficient by the order that it appears in `resultsNames(dds)`, in
this case `coef=2`. For more details explaining how the shrinkage
estimators differ, and what kinds of designs, contrasts and output is
provided by each, see the [extended section on shrinkage estimators](#moreshrink).
```{r warning=FALSE}
resultsNames(dds)
# because we are interested in treated vs untreated, we set 'coef=2'
resNorm <- lfcShrink(dds, coef=2, type="normal")
resAsh <- lfcShrink(dds, coef=2, type="ashr")
```
```{r fig.width=8, fig.height=3}
par(mfrow=c(1,3), mar=c(4,4,2,1))
xlim <- c(1,1e5); ylim <- c(-3,3)
plotMA(resLFC, xlim=xlim, ylim=ylim, main="apeglm")
plotMA(resNorm, xlim=xlim, ylim=ylim, main="normal")
plotMA(resAsh, xlim=xlim, ylim=ylim, main="ashr")
```
**Note:** We have sped up the `apeglm` method so it takes roughly
about the same amount of time as `normal`, e.g. ~5 seconds for the
`pasilla` dataset of ~10,000 genes and 7 samples.
If fast shrinkage estimation of LFC is needed,
*but the posterior standard deviation is not needed*,
setting `apeMethod="nbinomC"` will produce a ~10x speedup,
but the `lfcSE` column will be returned with `NA`.
A variant of this fast method, `apeMethod="nbinomC*"` includes random starts.
**Note:** If there is unwanted variation present in the data (e.g. batch
effects) it is always recommend to correct for this, which can be
accommodated in DESeq2 by including in the design any known batch
variables or by using functions/packages such as
`svaseq` in [sva](http://bioconductor.org/packages/sva) [@Leek2014] or
the `RUV` functions in [RUVSeq](http://bioconductor.org/packages/RUVSeq) [@Risso2014]
to estimate variables that capture the unwanted variation.
In addition, the ashr developers have a
[specific method](https://github.com/dcgerard/vicar)
for accounting for unwanted variation in combination with ashr [@Gerard2017].
### Plot counts
It can also be useful to examine the counts of reads for a single gene
across the groups. A simple function for making this
plot is *plotCounts*, which normalizes counts by sequencing depth
and adds a pseudocount of 1/2 to allow for log scale plotting.
The counts are grouped by the variables in `intgroup`, where
more than one variable can be specified. Here we specify the gene
which had the smallest *p* value from the results table created
above. You can select the gene to plot by rowname or by numeric index.
```{r plotCounts}
plotCounts(dds, gene=which.min(res$padj), intgroup="condition")
```
For customized plotting, an argument `returnData` specifies
that the function should only return a *data.frame* for
plotting with *ggplot*.
```{r plotCountsAdv}
d <- plotCounts(dds, gene=which.min(res$padj), intgroup="condition",
returnData=TRUE)
library("ggplot2")
ggplot(d, aes(x=condition, y=count)) +
geom_point(position=position_jitter(w=0.1,h=0)) +
scale_y_log10(breaks=c(25,100,400))
```
### More information on results columns
Information about which variables and tests were used can be found by calling
the function *mcols* on the results object.
```{r metadata}
mcols(res)$description
```
For a particular gene, a log2 fold change of -1 for
`condition treated vs untreated` means that the treatment
induces a multiplicative change in observed gene expression level of
$2^{-1} = 0.5$ compared to the untreated condition. If the variable of
interest is continuous-valued, then the reported log2 fold change is
per unit of change of that variable.
<a name="pvaluesNA"/>
**Note on p-values set to NA**: some values in the results table
can be set to `NA` for one of the following reasons:
* If within a row, all samples have zero counts,
the `baseMean` column will be zero, and the
log2 fold change estimates, *p* value and adjusted *p* value
will all be set to `NA`.
* If a row contains a sample with an extreme count outlier
then the *p* value and adjusted *p* value will be set to `NA`.
These outlier counts are detected by Cook's distance. Customization
of this outlier filtering and description of functionality for
replacement of outlier counts and refitting is described
[below](#outlier)
* If a row is filtered by automatic independent filtering,
for having a low mean normalized count, then only the adjusted *p*
value will be set to `NA`.
Description and customization of independent filtering is
described [below](#indfilt)
### Rich visualization and reporting of results
**ReportingTools** An HTML report of the results with plots and
sortable/filterable columns can be generated using
the [ReportingTools](http://bioconductor.org/packages/ReportingTools)
package on a *DESeqDataSet* that has been processed by the *DESeq* function.
For a code example, see the *RNA-seq differential expression* vignette at
the [ReportingTools](http://bioconductor.org/packages/ReportingTools)
page, or the manual page for the *publish* method for the
*DESeqDataSet* class.
**regionReport** An HTML and PDF summary of the results with plots
can also be generated using
the [regionReport](http://bioconductor.org/packages/regionReport)
package. The *DESeq2Report* function should be run on a
*DESeqDataSet* that has been processed by the *DESeq* function.
For more details see the manual page for *DESeq2Report*
and an example vignette in
the [regionReport](http://bioconductor.org/packages/regionReport)
package.
**Glimma** Interactive visualization of DESeq2 output,
including MA-plots (also called MD-plot) can be generated using the
[Glimma](http://bioconductor.org/packages/Glimma) package. See the
manual page for *glMDPlot.DESeqResults*.
**pcaExplorer** Interactive visualization of DESeq2 output,
including PCA plots, boxplots of counts and other useful summaries can be
generated using
the [pcaExplorer](http://bioconductor.org/packages/pcaExplorer)
package. See the *Launching the application* section of the package vignette.
**iSEE** Provides functions for creating an interactive Shiny-based
graphical user interface for exploring data stored in
SummarizedExperiment objects, including row- and column-level
metadata. Particular attention is given to single-cell data in a
SingleCellExperiment object with visualization of dimensionality
reduction results.