forked from kasperdanielhansen/genbioconductor
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Rsamtools.Rmd
186 lines (130 loc) · 7.39 KB
/
Rsamtools.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
---
title: "Rsamtools"
author: "Kasper D. Hansen"
---
```{r front, child="front.Rmd", echo=FALSE}
```
## Dependencies
This document has the following dependencies:
```{r dependencies, warning=FALSE, message=FALSE}
library(Rsamtools)
```
Use the following commands to install these packages in R.
```{r biocLite, eval=FALSE}
source("http://www.bioconductor.org/biocLite.R")
biocLite(c("Rsamtools")
```
## Overview
The `r Biocpkg("Rsamtools")` packages contains functionality for reading and examining aligned reads in the BAM format.
## Other Resources
- The vignettes from the [Rsamtools webpage](http://bioconductor.org/packages/Rsamtools).
- For representing more complicated alignments (specifically spliced alignments from RNA-seq), see the `r Biocpkg("GenomicAlignments")` package.
## Rsamtools
The Rsamtools package is an interface to the widely used `samtools`/`htslib` library. The main functionality of the package is support for reading BAM files.
### The BAM / SAM file format
The SAM format is a text based representation of alignments. The BAM format is a binary version of SAM which is smaller and much faster. In general, always work with BAM.
The format is quite complicated, which means the R representation is also a bit complicated. This complication happens because of the following features of the file format
- It may contain unaligned sequences.
- Each sequence may be aligned to multiple locations.
- It supports spliced (split) alignments.
- It may contain reads from multiple samples.
We will not attempt to fully understanding all the intricacies of this format.
A BAM file can be sorted in multiple ways. If it is sorted according to genomic location and if it is also "indexed" it is possible to retrieve all reads mapping to a genomic location, something which can be very useful. In `r Biocpkg("Rsamtools")` this is done by the functions `sortBam()` and `indexBam()`.
## scanBam
How to read a BAM file goes conceptually like this
1. A pointer to the file is created by the `BamFile()` constructor.
2. (Optional) Parameters for which reads to report is constructed by `ScanBamParams()`.
3. The file is being read according to these parameters by `scanBam()`.
First we setup a `BamFile` object:
```{r bamPath}
bamPath <- system.file("extdata", "ex1.bam", package="Rsamtools")
bamFile <- BamFile(bamPath)
bamFile
```
Some high-level information can be accessed here, like
```{r bamFileInfo}
seqinfo(bamFile)
```
(obviously, `seqinfo()` and `seqlevels()` etc. are supported as well).
Now we read all the reads in the file using `scanBam()`, ignoring the possibility of selecting reads using `ScanBamParams()` (we will return to this below).
```{r scanBam}
aln <- scanBam(bamFile)
length(aln)
class(aln)
```
We get back a list of length 1; this is because `scanBam()` can return output from multiple genomic regions, and here we have only one (everything). We therefore subset the output; this again gives us a list and we show the information from the first alignment
```{r lookAtBam}
aln <- aln[[1]]
names(aln)
lapply(aln, function(xx) xx[1])
```
Notice how the `scanBam()` function returns a basic R object, instead of an S4 class. Representing the alignments as S4 object is done by the `r Biocpkg("GenomicAlignments")` package; this is especially useful for access to spliced alignments from RNA sequencing data.
The names of the `aln` list are basically the names used in the BAM specification. Here is a quick list of some important ones
- `qname`: The name of the read.
- `rname`: The name of the chromosome / sequence / contig it was aligned to.
- `strand`: The strand of the alignment.
- `pos`: The coordinate of the left-most part of the alignment.
- `qwidth`: The length of the read.
- `mapq`: The mapping quality of the alignment.
- `seq`: The actual sequence of the alignment.
- `qual`: The quality string of the alignment.
- `cigar`: The CIGAR string (below).
- `flag`: The flag (below).
### Reading in parts of the file
BAM files can be extremely big and it is there often necessary to read in parts of the file. You can do this in different ways
1. Read a set number of records (alignments).
2. Only read alignments satisfying certain criteria.
3. Only read alignments in certain genomic regions.
Let us start with the first of this. By specifying `yieldSize` when you use `BamFile()`, every invocation of `scanBam()` will only read `yieldSize` number of alignments. You can then invoke `scanBam()` again to get the next set of alignments; this requires you to `open()` the file first (otherwise you will keep read the same alignments).
```{r yieldSize}
yieldSize(bamFile) <- 1
open(bamFile)
scanBam(bamFile)[[1]]$seq
scanBam(bamFile)[[1]]$seq
## Cleanup
close(bamFile)
yieldSize(bamFile) <- NA
```
The other two ways of reading in parts of a BAM file is to use `ScanBamParams()`, specifically the `what` and `which` arguments.
```{r ScanBamParams}
gr <- GRanges(seqnames = "seq2",
ranges = IRanges(start = c(100, 1000), end = c(1500,2000)))
params <- ScanBamParam(which = gr, what = scanBamWhat())
aln <- scanBam(bamFile, param = params)
names(aln)
head(aln[[1]]$pos)
```
Notice how the `pos` is less than what is specified in the `which` argument; this is because the alignments overlap the `which` argument. The `what=scanBamWhat()` tells the function to read everything. Often, you may not be interested in the actual sequence of the read or its quality scores. These takes up a lot of space so you may consider disabling reading this information.
#### The CIGAR string
The "CIGAR" is how the BAM format represents spliced alignments. For example, the format stored the left most coordinate of the alignment. To get to the right coordinate, you have to parse the CIGAR string. In this example "36M" means that it has been aligned with no insertions or deletions. If you need to work with spliced alignments or alignments containing insertions or deletions, you should use the `r Biocpkg("GenomicAlignments")` package.
#### Flag
An alignment may have a number of "flags" set or unset. These flags provide information about the alignment. The flag integer is a representation of multiple flags simultanously. An example of a flag is indicating (for a paired end alignment) whether both pairs have been properly aligned. For more information, see the BAM specification.
In `r Biocpkg("Rsamtools")` there is a number of helper functions dealing with only reading certain flags; use these.
### BAM summary
Sometimes you want a quick summary of the alignments in a BAM file:
```{r summary}
quickBamFlagSummary(bamFile)
```
## Other functionality from Rsamtools
### BamViews
Instead of reading a single file, it is possible to construct something called a `BamViews`, a link to multiple files. In many ways, it has the same `Views` functionality as other views. A quick example should suffice, first we read everything;
```{r BamViews}
bamView <- BamViews(bamPath)
aln <- scanBam(bamView)
names(aln)
```
This gives us an extra list level on the return object; first level is files, second level is ranges.
We can also set `bamRanges()` on the `BamViews` to specify that only certain ranges are read; this is similar to setting a `which` argument to `ScanBamParams()`.
```{r BamViews2}
bamRanges(bamView) <- gr
aln <- scanBam(bamView)
names(aln)
names(aln[[1]])
```
### countBam
Sometimes, all you want to do is count... use `countBam()` instead of `scanBam()`.
## SessionInfo
\scriptsize
```{r sessionInfo, echo=FALSE}
sessionInfo()
```