forked from kasperdanielhansen/genbioconductor
-
Notifications
You must be signed in to change notification settings - Fork 0
/
GenomicRanges_Rle.Rmd
165 lines (110 loc) · 5.06 KB
/
GenomicRanges_Rle.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
---
author: Kasper D. Hansen
title: GenomicRanges - Rle
---
```{r front, child="front.Rmd", echo=FALSE}
```
## Dependencies
This document has the following dependencies:
```{r dependencies, warning=FALSE, message=FALSE}
library(GenomicRanges)
```
Use the following commands to install these packages in R.
```{r biocLite, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges"))
```
## Corrections
Improvements and corrections to this document can be submitted on its [GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/GenomicRanges_Rle.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor).
### Overview
In this session we will discuss a data representation class called `Rle` (run length encoding). This class is great for representation genome-wide sequence coverage.
## Other Resources
### Coverage
In high-throughput sequencing, coverage is the number of reads overlapping each base. In other words, it associates a number (the number of reads) to every base in the genome.
This is a fundamental quantity for many high-throughout sequencing analyses. For variant calling (DNA sequencing) it tells you how much power (information) you have to call a variant at a given location. For ChIP sequencing it is the primary signal; areas with high coverage are thought to be enriched for a given protein.
A file format which is often used to represent coverage data is `Wig` or the modern version `BigWig`.
### Rle
An `Rle` (run-length-encoded) vector is a specific representation of a vector. The `r Biocpkg("IRanges")` package implements support for this class. Watch out: there is also a base R class called `rle` which has much less functionality.
The run-length-encoded representation of a vector, represents the vector as a set of distinct runs with their own value. Let us take an example
```{r RleEx1}
rl <- Rle(c(1,1,1,1,2,2,3,3,2,2))
rl
runLength(rl)
runValue(rl)
as.numeric(rl)
```
Note the accessor functions `runLength()` and `runValue()`.
This is a very efficient representation if
- the vector is very long
- there are a lot of consecutive elements with the same value
This is especially useful for genomic data which is either piecewise constant, or where most of the genome is not covered (eg. RNA sequencing in mammals).
In many ways `Rle`s function as normal vectors, you can do arithmetic with them, transform them etc. using standard R functions like `+` and `log2`.
There are also `RleList` which is a list of `Rle`s. This class is used to represent a genome wide coverage track where each element of the list is a different chromosome.
### Useful functions for Rle
A standard usecase is that you have a number of regions (say `IRanges`) and you want to do something to your `Rle` over each of these regions. Enter `aggregate()`.
```{r aggregate}
ir <- IRanges(start = c(2,6), width = 2)
aggregate(rl, ir, FUN = mean)
```
It is also possible to covert an `IRanges` to a `Rle` by the `coverage()` function. This counts, for each integer, how many ranges overlap the integer.
```{r coverage}
ir <- IRanges(start = 1:10, width = 3)
rl <- coverage(ir)
rl
```
You can select high coverage regions by the `slice()` function:
```{r slice}
slice(rl, 2)
```
This outputs a `Views` object, see next section.
### Views and Rles
In the sessions on the `r Biocpkg("Biostrings")` package we learned about `Views` on genomes. `Views` can also be instantiated on `Rle`s.
```{r views}
vi <- Views(rl, start = c(3,7), width = 3)
vi
```
with `Views` you can now (again) apply functions:
```{r viewsMean}
mean(vi)
```
This is very similar to using `aggregate()` described above.
### RleList
An `RleList` is simply a list of `Rle`. It is similar to a `GRangesList` in concept.
### Rles and GRanges
`Rle`'s can also be constructed from `GRanges`.
This often involves `RleList` where each element of the list is a chromosome. Surprisingly, we do not yet have an `RleList` type structure which also contains information about say the length of the different chromosomes.
Let us see some examples
```{r GRanges}
gr <- GRanges(seqnames = "chr1", ranges = IRanges(start = 1:10, width = 3))
rl <- coverage(gr)
rl
```
Using `Views` on such an object exposes some missing functionality
```{r GRangesViews, error=TRUE}
grView <- GRanges("chr1", ranges = IRanges(start = 2, end = 7))
vi <- Views(rl, grView)
```
We get an error, mentioning some object called a `RangesList`. This type of object is similar to a `GRanges` and could be considered succeeded by the later class. We sometimes see instances of this popping around.
```{r GRangesViews2}
vi <- Views(rl, as(grView, "RangesList"))
vi
vi[[1]]
```
## Biology Usecase
Suppose we want to compute the average coverage of bases belonging to (known) exons.
Input objects are
**reads**: a `GRanges`.
**exons**: a `GRanges`.
pseudocode:
```{r usecase, eval=FALSE}
bases <- reduce(exons)
nBases <- sum(width(bases))
nCoverage <- sum(Views(coverage(reads), bases))
nCoverage / nBases
```
(watch out for strand)
## SessionInfo
\scriptsize
```{r sessionInfo, echo=FALSE}
sessionInfo()
```