Rmd/GenomicRanges_GRanges.Rmd

---
author: Kasper D. Hansen
title: GenomicRanges - GRanges
bibliography: genbioconductor.bib
---

```{r front, child="front.Rmd", echo=FALSE}
```

## Dependencies

This document has the following dependencies:

```{r dependencies, warning=FALSE, message=FALSE}
library(GenomicRanges)
```

Use the following commands to install these packages in R.

```{r biocLite, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("GenomicRanges"))
```

## Corrections

Improvements and corrections to this document can be submitted on its [GitHub](https://github.com/kasperdanielhansen/genbioconductor/blob/master/Rmd/GenomicRanges_GRanges.Rmd) in its [repository](https://github.com/kasperdanielhansen/genbioconductor).

## Other Resources

- The vignettes from the [GenomicRanges webpage](http://bioconductor.org/packages/GenomicRanges).
- The package is described in a paper in PLOS Computational Biology [@GenomicRanges].

## GRanges

`GRanges` are like `IRanges` with strand and chromosome.  Strand can be `+`, `-` and `*`. The value `*` indicates 'unknown strand' or 'unstranded'.  This value usually gets treated as a third strand, which is sometimes confusing to users (examples below).

They get created with the `GRanges` constructor:

```{r GRanges}
gr <- GRanges(seqnames = "chr1", strand = c("+", "-", "+"),
              ranges = IRanges(start = c(1,3,5), width = 3))
              
``` 

Natural accessor functions: `strand()`, `seqnames()`, `ranges()`, `start()`, `end()`, `width()`.

Because the have strand, we now have operations which are relative to the direction of transcription (`upstream()`, `downstream()`):

```{r flank}
flank(gr, 2, start = FALSE)
``` 


## GRanges, seqinfo

`GRanges` operate within a universe of sequences (chromosomes/contigs) and their lengths.

This is described through `seqinfo`:

```{r seqinfo}
seqinfo(gr)
seqlengths(gr) <- c("chr1" = 10)
seqinfo(gr)
seqlevels(gr)
seqlengths(gr)
``` 

Especially the length of the chromosomes are used by some functions.  For example `gaps()` return the stretches of the genome not covered by the `GRanges`.

```{r gaps}
gaps(gr)
```

In this example, we know that the last gap stops at 10, because that is the length of the chromosome.  Note how a range on the `*` strand appears in the result.

Let us expand the `GRanges` with another chromosome

```{r gr2}
seqlevels(gr) <- c("chr1", "chr2")
seqnames(gr) <- c("chr1", "chr2", "chr1")
```

When you `sort()` a `GRanges`, the sorting order of the chromosomes is determined by their order in `seqlevel`.  This is nice if you want the sorting "chr1", "chr2", ..., "chr10", ...

```{r sort}
sort(gr)
seqlevels(gr) <- c("chr2", "chr1")
sort(gr)
```
You can associate a genome with a `GRanges`. 
```{r genome}
genome(gr) <- "hg19"
gr
```

This becomes valuable when you deal with data from different genome versions (as we all do), because it allows R to throw an error when you compare two `GRanges` from different genomes, like

```{r gr-error, error=TRUE}
gr2 <- gr
genome(gr2) <- "hg18"
findOverlaps(gr, gr2)
```

The fact that each sequence may have its own genome is more esoteric. One usecase is for experiments where the experimenter have spiked in sequences exogenous to the original organism.

## SessionInfo

\scriptsize

```{r sessionInfo, echo=FALSE}
sessionInfo()
```

## References