vignettes/mia.Rmd

---
title: "mia: Microbiome analysis tools"
date: "`r Sys.Date()`"
package: mia
output:
    BiocStyle::html_document:
        fig_height: 7
        fig_width: 10
        toc: true
        toc_float: true
        toc_depth: 2
        number_sections: true
vignette: >
    %\VignetteIndexEntry{mia}
    %\VignetteEngine{knitr::rmarkdown}
    %\VignetteEncoding{UTF-8}
bibliography: references.bib
---

```{r, echo=FALSE}
knitr::opts_chunk$set(
    cache = FALSE,
    fig.width = 9,
    message = FALSE,
    warning = FALSE)
```

`mia` implements tools for microbiome analysis based on the
`SummarizedExperiment` [@SE], `SingleCellExperiment` [@SCE] and
`TreeSummarizedExperiment` [@TSE] infrastructure. Data wrangling and analysis
are the main scope of this package.

# Installation

To install `mia`, install `BiocManager` first, if it is not installed.
Afterwards use the `install` function from `BiocManager`.

```{r, eval=FALSE}
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("mia")
```

# Load *mia*

```{r load-packages, message=FALSE, warning=FALSE}
library("mia")
```

# Loading a `TreeSummarizedExperiment` object

A few example datasets are available via `mia`. For this vignette the 
`GlobalPatterns` dataset is loaded first.

```{r}
data(GlobalPatterns, package = "mia")
tse <- GlobalPatterns
tse
```

# Functions for working with microbiome data

One of the main topics for analysing microbiome data is the application of
taxonomic data to describe features measured. The interest lies in the 
connection between individual bacterial species and their relation to each 
other. 

`mia` does not rely on a specific object type to hold taxonomic data, but
uses specific columns in the `rowData` of a `TreeSummarizedExperiment` object.
`taxonomyRanks` can be used to construct a `character` vector of available 
taxonomic levels. This can be used, for example, for subsetting.

```{r}
# print the available taxonomic ranks
colnames(rowData(tse))
taxonomyRanks(tse)
# subset to taxonomic data only
rowData(tse)[,taxonomyRanks(tse)]
```

The columns are recognized case insensitive. Additional functions are 
available to check for validity of taxonomic information or generate labels
based on the taxonomic information.

```{r}
table(taxonomyRankEmpty(tse, "Species"))
head(getTaxonomyLabels(tse))
```

For more details see the man page `?taxonomyRanks`.

## Merging and agglomeration based on taxonomic information.

Agglomeration of data based on these taxonomic descriptors can be performed
using functions implemented in `mia`. In addition to the `aggValue` functions
provide by `TreeSummarizedExperiment` `agglomerateByRank` is available.
`agglomerateByRank` does not require tree data to be present.

`agglomerateByRank` constructs a `factor` to guide merging from the available
taxonomic information. For more information on merging have a look at the man
page via `?mergeFeatures`.

```{r}
# agglomerate at the Family taxonomic rank
x1 <- agglomerateByRank(tse, rank = "Family")
## How many taxa before/after agglomeration?
nrow(tse)
nrow(x1)
```

Tree data can also be shrunk alongside agglomeration, but this is turned of 
by default.

```{r}
# with agglomeration of the tree
x2 <- agglomerateByRank(tse, rank = "Family",
                        agglomerateTree = TRUE)
nrow(x2) # same number of rows, but
rowTree(x1) # ... different
rowTree(x2) # ... tree
```

For `agglomerateByRank` to work, taxonomic data must be present. Even though
only one rank is available for the `enterotype` dataset, agglomeration can be
performed effectively de-duplicating entries for the genus level.

```{r}
data(enterotype, package = "mia")
taxonomyRanks(enterotype)
agglomerateByRank(enterotype)
```

To keep data tidy, the agglomerated data can be stored as an alternative 
experiment in the object of origin. With this synchronized sample subsetting
becomes very easy.

```{r}
altExp(tse, "family") <- x2
```

Keep in mind, that if you set `empty.rm = TRUE`, rows with `NA` or similar value
(defined via the `empty.fields` argument) will be removed. Depending on these
settings different number of rows will be returned.

```{r}
x1 <- agglomerateByRank(tse, rank = "Species", empty.rm = TRUE)
altExp(tse,"species") <- agglomerateByRank(
    tse, rank = "Species", empty.rm = FALSE)
dim(x1)
dim(altExp(tse,"species"))
```

For convenience the function `agglomerateByRanks` is available, which
agglomerates data on all `ranks` selected. By default all available ranks will
be used. The output is compatible to be stored as alternative experiments.

```{r}
tse <- agglomerateByRanks(tse)
tse
altExpNames(tse)
```

## Constructing a tree from taxonomic data

Constructing a taxonomic tree from taxonomic data stored in `rowData` is quite
straightforward and uses mostly functions implemented in 
`TreeSummarizedExperiment`.

```{r}
taxa <- rowData(altExp(tse,"Species"))[,taxonomyRanks(tse)]
taxa_res <- resolveLoop(as.data.frame(taxa))
taxa_tree <- toTree(data = taxa_res)
taxa_tree$tip.label <- getTaxonomyLabels(altExp(tse,"Species"))
rowNodeLab <- getTaxonomyLabels(altExp(tse,"Species"), make.unique = FALSE)
altExp(tse,"Species") <- changeTree(
    altExp(tse,"Species"),
    rowTree = taxa_tree,
    rowNodeLab = rowNodeLab)
```

## Transformation of assay data

Transformation of count data stored in `assays` is also a main task when work
with microbiome data. `transformAssay` can be used for this and offers a few
choices of available transformations. A modified object is returned and the
transformed counts are stored in a new `assay`.

```{r}
assayNames(enterotype)
anterotype <- transformAssay(enterotype, method = "log10", pseudocount = 1)
assayNames(enterotype)
```

For more details have a look at the man page `?transformAssay`.  

Sub-sampling to equal number of counts per sample. Also known as rarefying.  
```{r}
data(GlobalPatterns, package = "mia")

tse.subsampled <- rarefyAssay(
    GlobalPatterns, 
    sample = 60000, 
    name = "subsampled",
    replace = TRUE,
    seed = 1938)
tse.subsampled

```

Alternatively, one can save both original TreeSE and subsampled TreeSE within a
MultiAssayExperiment object.  

```{r}
library(MultiAssayExperiment)
mae <- MultiAssayExperiment(
    c("originalTreeSE" = GlobalPatterns,
    "subsampledTreeSE" = tse.subsampled))
mae
```

```{r}
# To extract specifically the subsampled TreeSE
experiments(mae)$subsampledTreeSE
```

## Community indices

In the field of microbiome ecology several indices to describe samples and
community of samples are available. In this vignette we just want to give a
very brief introduction.

Functions for calculating alpha and beta diversity indices are available.
Using `addAlpha` multiple diversity indices are calculated by default
and results are stored automatically in `colData`. Selected indices can be 
calculated individually by setting `index = "shannon"` for example. 

```{r}
tse <- addAlpha(tse, index = "shannon")
colnames(colData(tse))[8:ncol(colData(tse))]
```

Beta diversity indices are used to describe inter-sample connections. 
Technically they are calculated as `dist` object and reduced dimensions can
be extracted using `cmdscale`. This is wrapped up in the `runMDS` function
of the `scater` package, but can be easily used to calculated beta diversity
indices using the established functions from the `vegan` package or any other
package using comparable inputs.

```{r}
library(scater)
altExp(tse,"Genus") <- runMDS(
    altExp(tse,"Genus"), 
    FUN = getDissimilarity,
    method = "bray",
    name = "BrayCurtis", 
    ncomponents = 5, 
    assay.type = "counts", 
    keep_dist = TRUE)
```

JSD and UniFrac are implemented in `mia` as well. `getJSD` can be used
as a drop-in replacement in the example above (omit the `method` argument as 
well) to calculate the JSD. For calculating the UniFrac distance via
`getUniFrac` either a `TreeSummarizedExperiment` must be used or a tree
supplied via the `tree` argument. For more details see `?getJSD`,
`?getUnifrac` or `?getDPCoA`.

`runMDS` performs the decomposition. Alternatively `addNMDS` can also be used.

## Other indices

`estimateDominance` and `estimateEvenness` implement other sample-wise indices.
The function behave equivalently to `estimateDiversity`. For more information
see the corresponding man pages.

# Utility functions

To make migration and adoption as easy as possible several utility functions
are available.

## Data loading functions

Functions to load data from `biom` files, `QIIME2` output, `DADA2` objects 
[@dada2] or `phyloseq` objects are available. 

```{r, message=FALSE, warning=FALSE}
library(phyloseq)
data(esophagus, package = "phyloseq")
```

```{r}
esophagus
esophagus <- convertFromPhyloseq(esophagus)
esophagus
```

For more details have a look at the man page, for examples 
`?convert`.

## General wrapper functions 

Row-wise or column-wise assay data subsetting.

```{r}
# Specific taxa
assay(tse['522457',], "counts") |> head()
# Specific sample
assay(tse[,'CC1'], "counts") |> head()

```

## Selecting most interesting features 

`getTop` returns a vector of the most `top` abundant feature IDs.

```{r}
data(esophagus, package = "mia")
top_taxa <- getTop(
    esophagus,
    method = "mean",
    top = 5,
    assay.type = "counts")
top_taxa
```

## Generating tidy data

To generate tidy data as used and required in most of the tidyverse, 
`meltAssay` can be used. A `data.frame` in the long format will be returned.

```{r}
molten_data <- meltAssay(
    tse,
    assay.type = "counts",
    add.row = TRUE,
    add.col = TRUE
)
molten_data
```

# Session info

```{r}
sessionInfo()
```

# References