-
Notifications
You must be signed in to change notification settings - Fork 8
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
66 changed files
with
700 additions
and
6,867 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -13,3 +13,6 @@ tmp/ | |
dist | ||
dist/ | ||
sourceme.sh | ||
*.swp | ||
docs/book | ||
docs/book/* |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
:root { | ||
--content-max-width: 960px; | ||
} | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,14 @@ | ||
# Summary | ||
|
||
- [Basic Usage](./quick_start.md) | ||
- [Constructing bedMethyl tables](./intro_bedmethyl.md) | ||
- [Updating and Adjusting MM tags](./intro_adjust.md) | ||
- [Summarizing a modBAM](./intro_summary.md) | ||
- [Making a motif BED file](./intro_motif_bed.md) | ||
- [Extended subcommand help](./advanced_usage.md) | ||
- [Troubleshooting](./troubleshooting.md) | ||
- [Current limitations](./limitations.md) | ||
- [Performance considerations](./perf_considerations.md) | ||
- [Algorithm details](./algo_details.md) | ||
- [Filtering base modification calls](./filtering.md) | ||
- [Ignoring and combining calls](./collapse.md) |
Large diffs are not rendered by default.
Oops, something went wrong.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,4 @@ | ||
# Algorithm details | ||
|
||
- [Filtering low confidence base modification calls](./filtering.md) | ||
- [Removing and combining base modification calls](./filtering.md) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,15 +1,15 @@ | ||
# Filtering base modification calls | ||
# Filtering base modification calls. | ||
|
||
Modified base calls are qualitied with a probability that is contained in the ML tag (see the | ||
[specification](https://samtools.github.io/hts-specs/SAMtags.pdf)). We calculate the confidence that the model | ||
has in the base modification prediction as $`\mathcal{q} = argmax(\textbf{P})`$ where $`\textbf{P}`$ is the | ||
has in the base modification prediction as \\(\mathcal{q} = argmax(\textbf{P})\\) where \\(\textbf{P}\\) is the | ||
vector of probabilities for each modification. For example, given a model that can classify canonical | ||
cytosine, 5mC, and 5hmC, $`\textbf{P}`$ is $`[P_{C}, P_m, P_h]`$, and $`\mathcal{q}`$ will be $`\mathcal{q} = | ||
argmax(P_{C}, P_m, P_h)`$, the maximum of the three probabilities. Filtering in `modkit` is performed by | ||
first determining the value of $`\mathcal{q}`$ for the lowest n-th percentile of calls (10th percentile by | ||
cytosine, 5mC, and 5hmC, \\(\textbf{P}\\) is \\([P_{C}, P_m, P_h]\\), and \\(\mathcal{q}\\) will be \\(\mathcal{q} = | ||
argmax(P_{C}, P_m, P_h)\\), the maximum of the three probabilities. Filtering in `modkit` is performed by | ||
first determining the value of \\(\mathcal{q}\\) for the lowest n-th percentile of calls (10th percentile by | ||
default). The threshold value is typically an estimate because the base modification probabilities are | ||
sampled from a subset of the reads. All calls can be used to calculate the exact value, but in practice the | ||
approximation gives the same value. Base modification calls with a confidence value less than this number will | ||
not be counted. Determination of the threshold value can be performed on the fly (by sampling) or the | ||
threshold value can be specified on the command line with the `--filter_threshold` flag. The `sample-probs` | ||
command can be used to quickly estimate the value of $`\mathcal{q}`$ at various percentiles. | ||
command can be used to quickly estimate the value of \\(\mathcal{q}\\) at various percentiles. |
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,66 @@ | ||
# Updating and Adjusting MM tags. | ||
|
||
The `adjust-mods` subcommand can be used to manipulate MM (and corresponding ML) tags in a | ||
modBam. In general, these simple commands are run prior to `pileup`, visualization, or | ||
other analysis. | ||
|
||
|
||
## Ignoring a modification class. | ||
|
||
To remove a base modification class from a modBAM and produce a new modBAM, use the | ||
`--ignore` option for `adjust-mods`. | ||
|
||
``` | ||
modkit adjust-mods input.bam output.adjust.bam --ignore <mod_code_to_ignore> | ||
``` | ||
For example the command below will remove 5hmC calls, leaving just 5mC calls. | ||
|
||
``` | ||
modkit adjust-mods input.bam output.adjust.bam --ignore h | ||
``` | ||
For technical details on the transformation see [Removing modification calls from | ||
BAMs](./collapse.md#removing-dna-base-modification-probabilities). | ||
|
||
## Combining base modification probabilities. | ||
|
||
Combining base modification probabilities may be desirable for downstream analysis or | ||
visualization. Unlike `--ignore` which removes the probability of a class, `--convert` | ||
will sum the probability of one class with another if the second class already exists. For | ||
example, the command below will convert probabilities associated with `h` probability into | ||
`m` probability. If `m` already exists, the probabilities will be summed. As described in | ||
[changing the modification code](./intro_adjust.md#changing-the-base-modification-code), | ||
if the second base modification code doesn't exist, the probabilities are left unchanged. | ||
|
||
``` | ||
modkit adjust-mods input.bam output.convert.bam --convert h m | ||
``` | ||
|
||
This would be equivalent to setting `--combine-mods` flag, for example. | ||
``` | ||
modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods | ||
``` | ||
|
||
## Updating the flag (`?` and `.`). | ||
The [specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (Section 1.7) allows | ||
for omission of the MM flag, however this may not be the intent of missing base | ||
modification probabilities for some models. The command below will add or change the `?` flag to a modBAM. | ||
|
||
``` | ||
modkit adjust-mods input.bam output.bam --mode ambiguous | ||
``` | ||
|
||
Another option is to set the flag to `.`, the "implicitly canonical" mode: | ||
|
||
``` | ||
modkit adjust-mods input.bam output.bam --mode implicit | ||
``` | ||
|
||
## Changing the base modification code. | ||
Some functions in `modkit` or other tools may require the mod-codes in the MM tag be in | ||
the [specification](https://samtools.github.io/hts-specs/SAMtags.pdf). | ||
|
||
For example, the following command will change `C+Z,` tags to `C+m,` tags. | ||
|
||
``` | ||
modkit adjust-mods input.bam output.bam --convert Z m | ||
``` |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,100 @@ | ||
# Constructing bedMethyl tables. | ||
|
||
A primary use of `modkit` is to create summary counts of modified and unmodified bases in | ||
an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format. | ||
bedMethyl files tabulate the counts of base modifications from every sequencing read over | ||
each aligned reference genomic position. | ||
|
||
In its simplest form `modkit` creates a bedMethyl file using the following: | ||
|
||
```bash | ||
modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log | ||
``` | ||
|
||
No reference sequence is required. A single file (described | ||
[below](#description-of-bedmethyl-output)) with base count summaries will be created. The | ||
final argument here specifies an optional log file output. | ||
|
||
The program performs best-practices filtering and manipulation of the raw data stored in | ||
the input file. For further details see [filtering modified-base calls](./filtering.md). | ||
|
||
For user convenience, the counting process can be modulated using several additional | ||
transforms and filters. The most basic of these is to report only counts from reference | ||
CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs | ||
in the reference: | ||
|
||
```bash | ||
modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta | ||
``` | ||
|
||
The program also contains preset which combine several options for ease of use. The | ||
`traditional` preset, | ||
|
||
```bash | ||
modkit pileup path/to/reads.bam output/path/pileup.bed \ | ||
--ref path/to/reference.fasta \ | ||
--preset traditional | ||
``` | ||
|
||
performs three transforms: | ||
* restricts output to locations where there is a CG dinucleotide in the reference, | ||
* reports only a C and 5mC counts, using procedures to take into account counts of other | ||
forms of cytosine modification (notably 5hmC), and | ||
* aggregates data across strands. The strand field od the output will be marked as '.' | ||
indicating that the strand information has been lost. | ||
|
||
Using this option is equivalent to running with the options: | ||
|
||
```bash | ||
modkit pileup --cpg --ref <reference.fasta> --collapse h --combine-strands | ||
``` | ||
|
||
For more information on the individual options see the [Advanced Usage](./advanced_usage.md) help document. | ||
|
||
## Description of bedMethyl output. | ||
|
||
Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the | ||
bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/). | ||
|
||
### Definitions: | ||
|
||
* N<sub>mod</sub> - Number of calls passing filters that were classified as a residue with a specified base modification. | ||
* N<sub>canonical</sub> - Number of calls passing filters were classified as the canonical base rather than modified. The | ||
exact base must be inferred by the modification code. For example, if the modification code is `m` (5mC) then | ||
the canonical base is cytosine. If the modification code is `a`, the canonical base is adenosine. | ||
* N<sub>other mod</sub> - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with | ||
`h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h` N<sub>other_mod</sub> would be 2. In the | ||
`m` row N<sub>other_mod</sub> would be 3. | ||
* N<sub>valid_cov</sub> - the valid coverage. N<sub>valid_cov</sub> = N<sub>mod</sub> + N<sub>other_mod</sub> + N<sub>canonical</sub>, also used as the `score` in the bedMethyl | ||
* N<sub>diff</sub> - Number of reads with a base other than the canonical base for this modification. For example, in a row | ||
for `h` the canonical base is cytosine, if there are 2 reads with C->A substitutions, N<sub>diff</sub> will be 2. | ||
* N<sub>delete</sub> - Number of reads with a deletion at this reference position | ||
* N<sub>filtered</sub> - Number of calls where the probability of the call was below the threshold. The threshold can be | ||
set on the command line or computed from the data (usually filtering out the lowest 10th percentile of calls). | ||
* N<sub>nocall</sub> - Number of reads aligned to this reference position, with the correct canonical base, but without a base | ||
modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a | ||
CG->CH substitution such that no modification call was produced by the basecaller. | ||
|
||
### bedMethyl column descriptions. | ||
|
||
| column | name | description | type | | ||
|--------|-----------------------|--------------------------------------------------------------------------------|-------| | ||
| 1 | chrom | name of reference sequence from BAM header | str | | ||
| 2 | start position | 0-based start position | int | | ||
| 3 | end position | 0-based exclusive end position | int | | ||
| 4 | modified base code | single letter code for modified base | str | | ||
| 5 | score | Equal to N<sub>valid_cov</sub>. | int | | ||
| 6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str | | ||
| 7 | start position | included for compatibility | int | | ||
| 8 | end position | included for compatibility | int | | ||
| 9 | color | included for compatibility, always 255,0,0 | str | | ||
| 10 | N<sub>valid_cov</sub> | See definitions above. | int | | ||
| 11 | fraction modified | N<sub>mod</sub> / N<sub>valid_cov</sub> | float | | ||
| 12 | N<sub>mod</sub> | See definitions above. | int | | ||
| 13 | N<sub>canonical</sub> | See definitions above. | int | | ||
| 14 | N<sub>other_mod</sub> | See definitions above. | int | | ||
| 15 | N<sub>delete</sub> | See definitions above. | int | | ||
| 16 | N<sub>filtered</sub> | See definitions above. | int | | ||
| 17 | N<sub>diff</sub> | See definitions above. | int | | ||
| 18 | N<sub>nocall</sub> | See definitions above. | int | | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,17 @@ | ||
# Making a motif BED file. | ||
|
||
Downstream analysis may require a [BED](https://en.wikipedia.org/wiki/BED_(file_format)) | ||
file to select motifs of interest. For example, selecting GATC motifs in _E. coli_. This | ||
command requires a reference sequence in | ||
[FASTA](https://en.wikipedia.org/wiki/FASTA_format) a motif to find, which can include | ||
[IUPAC ambiguous bases](https://en.wikipedia.org/wiki/Nucleic_acid_notation) and a | ||
position within the motif. | ||
|
||
The following command would make a BED file for CG motifs. | ||
|
||
``` | ||
modkit motif-bed reference.fasta CG 0 1> cg_modifs.bed | ||
``` | ||
|
||
The output is directed to standard out. | ||
|
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,32 @@ | ||
# Summarizing a modBAM. | ||
|
||
The `modkit summary` sub-command is intended for collecting read-level statistics on | ||
either a sample of reads, a region, or an entire modBam. | ||
|
||
## Summarize the base modification calls in a modBAM. | ||
|
||
``` | ||
modkit summary input.bam | ||
``` | ||
|
||
will output a table similar to this | ||
|
||
```bash | ||
> parsing region chr20 # only present if --region option is provided | ||
> sampling 10042 reads from BAM # modulated with --num-reads | ||
> calculating threshold at 10% percentile # modulated with --filter-percentile | ||
> calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly | ||
# bases C | ||
# total_reads_used 9989 | ||
# count_reads_C 9989 | ||
# pass_threshold_C 0.7167969 | ||
# region chr20:0-64444167 | ||
base code all_count all_frac pass_count pass_frac | ||
C h 195335 0.086608544 119937 0.0590528 | ||
C m 1305956 0.5790408 1192533 0.58716166 | ||
C - 754087 0.33435062 718543 0.3537855 | ||
``` | ||
|
||
The `pass_count` and `pass_frac` columns are the statistics for calls with confidence | ||
greater than or equal to the `pass_threshold` for that canonical base's calls. For more | ||
details on thresholds see [filtering base modification calls](./filtering.md). |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,11 @@ | ||
# Current limitations | ||
|
||
Known limitations and forecasts for when they will be removed. | ||
|
||
1. ChEBI codes are not supported at all, only mod-codes in the | ||
[specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (top of page 9) are | ||
supported in `pileup`. | ||
- This limitation will be removed by version 0.2.0 | ||
|
||
1. Only one MM-flag (`.`, `?`) per-canonical base is supported within a read. | ||
- This limitation may be removed in the future. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,19 @@ | ||
# Performance considerations | ||
|
||
## Sharding a large modBAM by region. | ||
|
||
The `--region` option in `pileup`, `summary`, and `sample-probs` can be used to operate on | ||
a subset of records in a large BAM. If you're working in a distributed environment, the | ||
genome could be sharded into large sections which are specified to `modkit` in concurrent | ||
processes and merged afterward in a "map-reduce" pattern. | ||
|
||
## Setting the `--interval-size` and `--sampling-interval-size`. | ||
|
||
Whenever operating on an sorted, indexed BAM, `modkit` will operate in parallel on | ||
disjoint spans of the genome. The length of these spans (i.e. intervals) can be determined | ||
by the `--interval-size` (for general `pileup` and other use cases) or the | ||
`--sampleing-interval-size` (for the sampling algorithm only). The defaults for these | ||
parameters works well for genomes such as the human genome. For smaller genomes with high | ||
coverage, you may decide to _decrease_ the interval size in order to take advantage of | ||
parallelism. If you're operating in a memory-constrained environment, _decreasing_ the | ||
interval size will reduce the memory usage per-thread. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,26 @@ | ||
# Basic Usage | ||
|
||
### `modkit` is a bioinformatics tool for working with modified bases from Oxford Nanopore. | ||
|
||
![ONT_logo](./images/ONT_logo_590x106.png) | ||
|
||
## Installation | ||
|
||
Pre-compiled binaries are provided for Linux from the [release | ||
page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in | ||
most circumstances. As a rust-based project, `modkit` can also be installed with | ||
[cargo](https://www.rust-lang.org/learn/get-started). | ||
```bash | ||
git clone https://github.com/nanoporetech/modkit.git | ||
cd modkit | ||
cargo install --path . | ||
# or | ||
cargo install --git https://github.com/nanoporetech/modkit.git | ||
``` | ||
|
||
## Common Use Cases | ||
1. [Creating a bedMethyl table with `pileup`](./intro_bedmethyl.md) | ||
1. [Updating and Adjusting MM tags with `adjust-mods` and `update-tags`](./intro_adjust.md) | ||
1. [Summarizing a modBAM with `summary`](./intro_summary.md) | ||
1. [Making a motif BED file with `motif-bed`](./intro_motif_bed.md) | ||
|
Oops, something went wrong.