forked from microbiome/mia
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathmia.Rmd
347 lines (273 loc) · 9.47 KB
/
mia.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
---
title: "mia: Microbiome analysis tools"
date: "`r Sys.Date()`"
package: mia
output:
BiocStyle::html_document:
fig_height: 7
fig_width: 10
toc: true
toc_float: true
toc_depth: 2
number_sections: true
vignette: >
%\VignetteIndexEntry{mia}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
bibliography: references.bib
---
```{r, echo=FALSE}
knitr::opts_chunk$set(
cache = FALSE,
fig.width = 9,
message = FALSE,
warning = FALSE)
```
`mia` implements tools for microbiome analysis based on the
`SummarizedExperiment` [@SE], `SingleCellExperiment` [@SCE] and
`TreeSummarizedExperiment` [@TSE] infrastructure. Data wrangling and analysis
are the main scope of this package.
# Installation
To install `mia`, install `BiocManager` first, if it is not installed.
Afterwards use the `install` function from `BiocManager`.
```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("mia")
```
# Load *mia*
```{r load-packages, message=FALSE, warning=FALSE}
library("mia")
```
# Loading a `TreeSummarizedExperiment` object
A few example datasets are available via `mia`. For this vignette the
`GlobalPatterns` dataset is loaded first.
```{r}
data(GlobalPatterns, package = "mia")
tse <- GlobalPatterns
tse
```
# Functions for working with microbiome data
One of the main topics for analysing microbiome data is the application of
taxonomic data to describe features measured. The interest lies in the
connection between individual bacterial species and their relation to each
other.
`mia` does not rely on a specific object type to hold taxonomic data, but
uses specific columns in the `rowData` of a `TreeSummarizedExperiment` object.
`taxonomyRanks` can be used to construct a `character` vector of available
taxonomic levels. This can be used, for example, for subsetting.
```{r}
# print the available taxonomic ranks
colnames(rowData(tse))
taxonomyRanks(tse)
# subset to taxonomic data only
rowData(tse)[,taxonomyRanks(tse)]
```
The columns are recognized case insensitive. Additional functions are
available to check for validity of taxonomic information or generate labels
based on the taxonomic information.
```{r}
table(taxonomyRankEmpty(tse, "Species"))
head(getTaxonomyLabels(tse))
```
For more details see the man page `?taxonomyRanks`.
## Merging and agglomeration based on taxonomic information.
Agglomeration of data based on these taxonomic descriptors can be performed
using functions implemented in `mia`. In addition to the `aggValue` functions
provide by `TreeSummarizedExperiment` `agglomerateByRank` is available.
`agglomerateByRank` does not require tree data to be present.
`agglomerateByRank` constructs a `factor` to guide merging from the available
taxonomic information. For more information on merging have a look at the man
page via `?mergeFeatures`.
```{r}
# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(tse, rank = "Family")
## How many taxa before/after agglomeration?
nrow(tse)
nrow(x1)
```
Tree data can also be shrunk alongside agglomeration, but this is turned of
by default.
```{r}
# with agglomeration of the tree
x2 <- agglomerateByRank(tse, rank = "Family",
agglomerateTree = TRUE)
nrow(x2) # same number of rows, but
rowTree(x1) # ... different
rowTree(x2) # ... tree
```
For `agglomerateByRank` to work, taxonomic data must be present. Even though
only one rank is available for the `enterotype` dataset, agglomeration can be
performed effectively de-duplicating entries for the genus level.
```{r}
data(enterotype, package = "mia")
taxonomyRanks(enterotype)
agglomerateByRank(enterotype)
```
To keep data tidy, the agglomerated data can be stored as an alternative
experiment in the object of origin. With this synchronized sample subsetting
becomes very easy.
```{r}
altExp(tse, "family") <- x2
```
Keep in mind, that if you set `empty.rm = TRUE`, rows with `NA` or similar value
(defined via the `empty.fields` argument) will be removed. Depending on these
settings different number of rows will be returned.
```{r}
x1 <- agglomerateByRank(tse, rank = "Species", empty.rm = TRUE)
altExp(tse,"species") <- agglomerateByRank(
tse, rank = "Species", empty.rm = FALSE)
dim(x1)
dim(altExp(tse,"species"))
```
For convenience the function `agglomerateByRanks` is available, which
agglomerates data on all `ranks` selected. By default all available ranks will
be used. The output is compatible to be stored as alternative experiments.
```{r}
tse <- agglomerateByRanks(tse)
tse
altExpNames(tse)
```
## Constructing a tree from taxonomic data
Constructing a taxonomic tree from taxonomic data stored in `rowData` is quite
straightforward and uses mostly functions implemented in
`TreeSummarizedExperiment`.
```{r}
taxa <- rowData(altExp(tse,"Species"))[,taxonomyRanks(tse)]
taxa_res <- resolveLoop(as.data.frame(taxa))
taxa_tree <- toTree(data = taxa_res)
taxa_tree$tip.label <- getTaxonomyLabels(altExp(tse,"Species"))
rowNodeLab <- getTaxonomyLabels(altExp(tse,"Species"), make.unique = FALSE)
altExp(tse,"Species") <- changeTree(
altExp(tse,"Species"),
rowTree = taxa_tree,
rowNodeLab = rowNodeLab)
```
## Transformation of assay data
Transformation of count data stored in `assays` is also a main task when work
with microbiome data. `transformAssay` can be used for this and offers a few
choices of available transformations. A modified object is returned and the
transformed counts are stored in a new `assay`.
```{r}
assayNames(enterotype)
anterotype <- transformAssay(enterotype, method = "log10", pseudocount = 1)
assayNames(enterotype)
```
For more details have a look at the man page `?transformAssay`.
Sub-sampling to equal number of counts per sample. Also known as rarefying.
```{r}
data(GlobalPatterns, package = "mia")
tse.subsampled <- rarefyAssay(
GlobalPatterns,
sample = 60000,
name = "subsampled",
replace = TRUE,
seed = 1938)
tse.subsampled
```
Alternatively, one can save both original TreeSE and subsampled TreeSE within a
MultiAssayExperiment object.
```{r}
library(MultiAssayExperiment)
mae <- MultiAssayExperiment(
c("originalTreeSE" = GlobalPatterns,
"subsampledTreeSE" = tse.subsampled))
mae
```
```{r}
# To extract specifically the subsampled TreeSE
experiments(mae)$subsampledTreeSE
```
## Community indices
In the field of microbiome ecology several indices to describe samples and
community of samples are available. In this vignette we just want to give a
very brief introduction.
Functions for calculating alpha and beta diversity indices are available.
Using `addAlpha` multiple diversity indices are calculated by default
and results are stored automatically in `colData`. Selected indices can be
calculated individually by setting `index = "shannon"` for example.
```{r}
tse <- addAlpha(tse, index = "shannon")
colnames(colData(tse))[8:ncol(colData(tse))]
```
Beta diversity indices are used to describe inter-sample connections.
Technically they are calculated as `dist` object and reduced dimensions can
be extracted using `cmdscale`. This is wrapped up in the `runMDS` function
of the `scater` package, but can be easily used to calculated beta diversity
indices using the established functions from the `vegan` package or any other
package using comparable inputs.
```{r}
library(scater)
altExp(tse,"Genus") <- runMDS(
altExp(tse,"Genus"),
FUN = getDissimilarity,
method = "bray",
name = "BrayCurtis",
ncomponents = 5,
assay.type = "counts",
keep_dist = TRUE)
```
JSD and UniFrac are implemented in `mia` as well. `getJSD` can be used
as a drop-in replacement in the example above (omit the `method` argument as
well) to calculate the JSD. For calculating the UniFrac distance via
`getUniFrac` either a `TreeSummarizedExperiment` must be used or a tree
supplied via the `tree` argument. For more details see `?getJSD`,
`?getUnifrac` or `?getDPCoA`.
`runMDS` performs the decomposition. Alternatively `addNMDS` can also be used.
## Other indices
`estimateDominance` and `estimateEvenness` implement other sample-wise indices.
The function behave equivalently to `estimateDiversity`. For more information
see the corresponding man pages.
# Utility functions
To make migration and adoption as easy as possible several utility functions
are available.
## Data loading functions
Functions to load data from `biom` files, `QIIME2` output, `DADA2` objects
[@dada2] or `phyloseq` objects are available.
```{r, message=FALSE, warning=FALSE}
library(phyloseq)
data(esophagus, package = "phyloseq")
```
```{r}
esophagus
esophagus <- convertFromPhyloseq(esophagus)
esophagus
```
For more details have a look at the man page, for examples
`?convert`.
## General wrapper functions
Row-wise or column-wise assay data subsetting.
```{r}
# Specific taxa
assay(tse['522457',], "counts") |> head()
# Specific sample
assay(tse[,'CC1'], "counts") |> head()
```
## Selecting most interesting features
`getTop` returns a vector of the most `top` abundant feature IDs.
```{r}
data(esophagus, package = "mia")
top_taxa <- getTop(
esophagus,
method = "mean",
top = 5,
assay.type = "counts")
top_taxa
```
## Generating tidy data
To generate tidy data as used and required in most of the tidyverse,
`meltAssay` can be used. A `data.frame` in the long format will be returned.
```{r}
molten_data <- meltAssay(
tse,
assay.type = "counts",
add.row = TRUE,
add.col = TRUE
)
molten_data
```
# Session info
```{r}
sessionInfo()
```
# References