forked from kasperdanielhansen/genbioconductor
-
Notifications
You must be signed in to change notification settings - Fork 0
/
GenomicRanges_GRanges.Rmd
116 lines (78 loc) · 3.26 KB
/
GenomicRanges_GRanges.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
---
author: Kasper D. Hansen
title: GenomicRanges - GRanges
bibliography: genbioconductor.bib
---
```{r front, child="front.Rmd", echo=FALSE}
```
## Dependencies
This document has the following dependencies:
```{r dependencies, warning=FALSE, message=FALSE}
library(GenomicRanges)
```
Use the following commands to install these packages in R.
```{r biocLite, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges"))
```
## Corrections
Improvements and corrections to this document can be submitted on its [GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/GenomicRanges_GRanges.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor).
## Other Resources
- The vignettes from the [GenomicRanges webpage](http://bioconductor.org/packages/GenomicRanges).
- The package is described in a paper in PLOS Computational Biology [@GenomicRanges].
## GRanges
`GRanges` are like `IRanges` with strand and chromosome. Strand can be `+`, `-` and `*`. The value `*` indicates 'unknown strand' or 'unstranded'. This value usually gets treated as a third strand, which is sometimes confusing to users (examples below).
They get created with the `GRanges` constructor:
```{r GRanges}
gr <- GRanges(seqnames = "chr1", strand = c("+", "-", "+"),
ranges = IRanges(start = c(1,3,5), width = 3))
```
Natural accessor functions: `strand()`, `seqnames()`, `ranges()`, `start()`, `end()`, `width()`.
Because the have strand, we now have operations which are relative to the direction of transcription (`upstream()`, `downstream()`):
```{r flank}
flank(gr, 2, start = FALSE)
```
## GRanges, seqinfo
`GRanges` operate within a universe of sequences (chromosomes/contigs) and their lengths.
This is described through `seqinfo`:
```{r seqinfo}
seqinfo(gr)
seqlengths(gr) <- c("chr1" = 10)
seqinfo(gr)
seqlevels(gr)
seqlengths(gr)
```
Especially the length of the chromosomes are used by some functions. For example `gaps()` return the stretches of the genome not covered by the `GRanges`.
```{r gaps}
gaps(gr)
```
In this example, we know that the last gap stops at 10, because that is the length of the chromosome. Note how a range on the `*` strand appears in the result.
Let us expand the `GRanges` with another chromosome
```{r gr2}
seqlevels(gr) <- c("chr1", "chr2")
seqnames(gr) <- c("chr1", "chr2", "chr1")
```
When you `sort()` a `GRanges`, the sorting order of the chromosomes is determined by their order in `seqlevel`. This is nice if you want the sorting "chr1", "chr2", ..., "chr10", ...
```{r sort}
sort(gr)
seqlevels(gr) <- c("chr2", "chr1")
sort(gr)
```
You can associate a genome with a `GRanges`.
```{r genome}
genome(gr) <- "hg19"
gr
```
This becomes valuable when you deal with data from different genome versions (as we all do), because it allows R to throw an error when you compare two `GRanges` from different genomes, like
```{r gr-error, error=TRUE}
gr2 <- gr
genome(gr2) <- "hg18"
findOverlaps(gr, gr2)
```
The fact that each sequence may have its own genome is more esoteric. One usecase is for experiments where the experimenter have spiked in sequences exogenous to the original organism.
## SessionInfo
\scriptsize
```{r sessionInfo, echo=FALSE}
sessionInfo()
```
## References