vignettes/intro.Rmd

---
title: "Introduction to sleuth"
author: "Harold Pimentel, Nicolas Bray, Pall Melsted and Lior Pachter"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Introduction to sleuth}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

## Overview

__sleuth__ is a tool for the analysis and comparison of multiple related RNA-Seq experiments. Key features include:


1. The use of boostraps to ascertain and correct for technical variation in experiments.
2. Implemention of a flexible response error measurement model for inference that allows for a multitude of experimental designs.
3. Interactive plots that enable real-time exploratory data analysis.

To use __sleuth__, RNA-Seq data must first be quantified with [__kallisto__](http://pachterlab.github.io/kallisto/), which is a program for _very_ fast RNA-Seq quantification based on pseudo-alignment. An important feature of __kallisto__ is that it outputs bootstraps along with the estimates of transcript abundances. These can serve as proxies for technical replicates, allowing for an ascertainment of the variability in estimates due to the random processes underlying RNA-Seq as well as the statistical procedure of read assignment. __kallisto__ can quantify 30 million human reads in less than 3 minutes on a Mac desktop computer using only the read sequences and a transcriptome index that itself takes less than 10 minutes to build. __sleuth__ has also been designed to be lightweight and fast, and therefore RNA-Seq analysis with __kallisto__ and  __sleuth__ is tractable on a laptop computer in a matter of minutes.

The model __sleuth__ uses for performing differential analysis is a general linear model where there is error in the response. Formally, in the case of two conditions being assayed, for a transcript $t$ in a sample $i$, the (log) "true" unobserved abundance $y_i$ measured in read counts is modeled by

$$ y_{t,i} = \beta_{t,0} + \beta_{t,1} x_{t,i} + \epsilon_{t,i} $$

where $x_{t,i}$ is an indicator variable describing the condition, $\beta_{t,0}$ and $\beta_{t,1}$ are parameters of the model and $\epsilon_{t,i}$ is biological "noise". However, conditioned on $x_i$ and $y_i$ the estimated number of counts from the observations in the experiment is given by

$$ d_{t,i} = y_{t,i} + \zeta_{t,i} $$

where $\zeta_{t,i}$ represents technical "noise", i.e. uncertainty in the measurement due to effects other than biological variability. The __sleuth__ model incorporates the assumptions that the expectation $E(\zeta_{t,i}|y_{t,i}) = 0$, that $E(d_{t,i}) = \beta_{t,0} + \beta_{t,1} x_{t,i}$  and that the response error is _additive_, i.e. if the variance $V(\epsilon_{t,i}) = \sigma_t^2$ and the variance $V(\zeta_{t,i}|y_{t,i}) = \tau_t^2$ then the variance $V(d_{t,i}) = \sigma_{t}^2 + \tau_{t}^2$. __sleuth__ makes use of the boostraps from __kallisto__ to estimate the $\tau_t^2$, and after subtracting the estimated technical variance the $\sigma_t^2$ are estimated via a shrinkage procedure similar to that used in [Limma Voom](http://www.genomebiology.com/2014/15/2/R29).

More generally, __sleuth__ can be used to analyze multiple conditions organized in complex experimental designs. In the general case the unobserved model is given by

$$ Y_t = X_t \beta_t + \epsilon_t $$

where $t$ is a transcript, $Y_t$ is a vector of length $n$ where $n$ is the number of samples, $X_t$ is an $n \times p$ matrix where $p$ is the number of covariates, $\beta_t$ represents the effect sizes and is a vector of length $p$, and $\epsilon_t$ is a vector of length $n$. The observed response is described by

$$ D_t = Y_t + \zeta_t $$

where $\zeta_t$ and $D_t$ are vectors of length $n$.

__sleuth__ has been designed to facilitate the exploration of RNA-Seq data by utilizing the [Shiny](http://shiny.rstudio.com) web application framework by RStudio. The worked example below illustrates how to load data into __sleuth__ and how to open Shiny plots for exploratory data analysis. The code underlying all plots is available via the Shiny interface so that analyses can be fully "open source".

## Installation

To install __sleuth__ start [R](https://www.r-project.org) and first install `rhdf5` by typing:

```{r eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("rhdf5")
```

Then install devtools by typing

```{r eval=FALSE}
install.packages("devtools")
```

and install __sleuth__ by typing
```{r eval=FALSE}
devtools::install_github("pachterlab/sleuth")
```

Next load __sleuth__ with
```{r}
library("sleuth")
```

## Example

To explain how to use __sleuth__ we provide an example based on the data in the "Cuffdiff2 paper":

* [Differential analysis of gene regulation at transcript resolution with RNA-seq](http://www.nature.com/nbt/journal/v31/n1/full/nbt.2450.html)	by Cole Trapnell, David G Henderickson, Martin Savageau, Loyal Goff, John L Rinn and Lior Pachter, Nature Biotechnology __31__, 46--53 (2013).

The human fibroblast RNA-Seq data for the paper is available on GEO at accession [GSE37704](http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE37704). The samples to be analyzed are the six samples LFB_scramble_hiseq_repA, LFB_scramble_hiseq_repB, LFB_scramble_hiseq_repC, LFB_HOXA1KD_hiseq_repA, LFB_HOXA1KD_hiseq_repA, and LFB_HOXA1KD_hiseq_repC. These are three biological replicates in each of two conditions (scramble and HoxA1 knockdown) that will be compared with __sleuth__.

To analyze the data, first download the raw reads, install __kallisto__ and then quantify the data with boostraps as described [here](http://pachterlab.github.io/kallisto/starting.html). This step can be skipped for the purposes of the vignette, by downloading the __kallisto__ processed data directly by clicking [here](http://bio.math.berkeley.edu/sleuth/cuffdiff2/cuffdiff2_data_kallisto_results.zip).

The first step in a __sleuth__ analysis is to specify where the __kallisto__ results are stored. Begin by storing the base directory of the results in a variable,
```{r}
base_dir <- "~/Downloads/cuffdiff2_data_kallisto_results"
```
Next get the list of sample IDs with
```{r}
sample_id <- dir(file.path(base_dir,"results"))
```
The result can be displayed by typing
```{r}
sample_id
```
In the box above, lines beginning with ## show the output of the command (in what follows we include the output that should appear with each command).

A list of paths to the __kallisto__ results indexed by the sample IDs is collated with
```{r}
kal_dirs <- sapply(sample_id, function(id) file.path(base_dir, "results", id, "kallisto"))
kal_dirs
```

The next step is to load an auxillary table that describes the experimental design and the relationship between the kallisto directories and the samples:
```{r}
s2c <- read.table(file.path(base_dir, "hiseq_info.txt"), header = TRUE, stringsAsFactors=FALSE)
s2c <- dplyr::select(s2c, sample = run_accession, condition)
s2c
```

Now, we must enter the directories into a column in the table describing the experiment.
This column must be labeled `path`, otherwise sleuth will throw an error.
This is to ensure that the user can check which samples correspond to which kallisto runs.

```{r}
s2c <- dplyr::mutate(s2c, path = kal_dirs)
```

The user should check whether or not the order is correct.
In this case, the kallisto output is correctly matched with the sample identifiers.

```{r}
print(s2c)
```

Now the "sleuth object" can be constructed. This requires four commands that (1) load the kallisto processed data into the object (2) estimate parameters for the __sleuth__ response error measurement (full) model (3) estimate parameters for the __sleuth__ reduced model, and (4) perform differential analysis (testing). On a laptop the four steps should take about a few minutes altogether.

First type

```{r eval=TRUE}
so <- sleuth_prep(s2c, ~ condition)
```

then fit the full model

```{r eval=TRUE}
so <- sleuth_fit(so)
```

Next, we fit the reduced model.
In this case, the reduced model is the intercept-only model:

```{r}
so <- sleuth_fit(so, ~1, 'reduced')
```

and finally perform the test:

```{r eval=TRUE}
so <- sleuth_lrt(so, 'reduced', 'full')
```

In general, we can test models that are nested using the likelihood ratio test.
Viewing models which have been fit can be done using the `models()` function.

```{r eval=TRUE}
models(so)
```

### Including gene names into transcript-level analysis

At this point the sleuth object constructed from the kallisto runs has information about the data, the experimental design, the __kallisto__ estimates, the model fit, and the testing. In other words it contains the entire analysis of the data. There is, however, one piece of information that can be useful to add in, but that is optional. In reading the kallisto output __sleuth__ has no information about _genes_, but this can be added allowing for searching and analysis by gene instead of transcript.

Since the example was constructed with the ENSEMBL human transcriptome, we will add gene names from ENSEMBL using biomaRt (there are other ways to do this as well):

First, install biomaRt with

```{r eval=FALSE}
source("http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
```

Then collect gene names with

```{r eval=TRUE}
mart <- biomaRt::useMart(biomart = "ENSEMBL_MART_ENSEMBL",
  dataset = "hsapiens_gene_ensembl",
  host = 'ensembl.org')
```

and add them into the __sleuth__ table with

```{r, eval=TRUE}
t2g <- biomaRt::getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id",
    "external_gene_name"), mart = mart)
t2g <- dplyr::rename(t2g, target_id = ensembl_transcript_id,
  ens_gene = ensembl_gene_id, ext_gene = external_gene_name)
so <- sleuth_prep(s2c, ~ condition, target_mapping = t2g)
so <- sleuth_fit(so)
so <- sleuth_fit(so, ~1, 'reduced')
so <- sleuth_lrt(so, 'reduced', 'full')
```

This addition of metadata to transcript IDs is very general, and can be used to add in other information.

The best way to view the results is to generate the Shiny webpage that allows for exploratory data analysis:

```{r, eval=FALSE}
sleuth_live(so)
```

To generate a table of results for analysis within R type

```{r}
results_table <- sleuth_results(so, 'reduced:full', test_type = 'lrt')
```

### Gene level analysis

Assuming `biomaRt` has been installed as in the previous step, sleuth can also be run in the 'gene aggregation' mode.
In addition to requiring a `target_mapping`, a string which references a column to aggregate by, (`aggregation_column`).
In our example, we could use `ens_gene` or `ext_gene`.
It is preferable to use `ens_gene` as `ext_gene` tends to be a bit redundant.

The modified sleuth prep command looks as follows:

```{r}
so <- sleuth_prep(s2c, ~condition, target_mapping = t2g,
  aggregation_column = 'ens_gene')
```

The remainder of the pipeline is unchanged.
When running `sleuth_live` or `sleuth_results`, the gene column you selected will be listed as `target_id`, rather than the transcript name.

`sleuth_prep` might take a bit longer here because aggregation has to be done amongst all of the bootstraps as well.
We are currently working to speed this up and expect to release it along with several memory improvements in the next version (0.28.2).
One way to speed this up is to use more processes (assuming you have more cores).
Internally, `sleuth` uses the function `parallel::mclapply`.
You can set the number of cores as follows:

```{r,eval=FALSE}
# set the number of available cores to 4
options(mc.cores = 4L)
```