Skip to content

Commit

Permalink
Book-docs ready for review.
Browse files Browse the repository at this point in the history
  • Loading branch information
ArtRand committed May 2, 2023
1 parent 0740934 commit 16c6ed7
Show file tree
Hide file tree
Showing 66 changed files with 700 additions and 6,867 deletions.
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,6 @@ tmp/
dist
dist/
sourceme.sh
*.swp
docs/book
docs/book/*
2 changes: 1 addition & 1 deletion README.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,7 @@ modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log

No reference sequence is required. A single file (described [below](#description-of-bedmethyl-output)) with base count summaries will be created. The final argument here specifies an optional log file output.

The program perform best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./filtering.md).
The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./filtering.md).

For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference:

Expand Down
4 changes: 4 additions & 0 deletions modkit-book/book.toml → docs/book.toml
Original file line number Diff line number Diff line change
Expand Up @@ -4,3 +4,7 @@ language = "en"
multilingual = false
src = "src"
title = "Modkit"

[output.html]
mathjax-support = true
additional-css = ["custom.css"]
4 changes: 4 additions & 0 deletions docs/custom.css
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
:root {
--content-max-width: 960px;
}

14 changes: 14 additions & 0 deletions docs/src/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
# Summary

- [Basic Usage](./quick_start.md)
- [Constructing bedMethyl tables](./intro_bedmethyl.md)
- [Updating and Adjusting MM tags](./intro_adjust.md)
- [Summarizing a modBAM](./intro_summary.md)
- [Making a motif BED file](./intro_motif_bed.md)
- [Extended subcommand help](./advanced_usage.md)
- [Troubleshooting](./troubleshooting.md)
- [Current limitations](./limitations.md)
- [Performance considerations](./perf_considerations.md)
- [Algorithm details](./algo_details.md)
- [Filtering base modification calls](./filtering.md)
- [Ignoring and combining calls](./collapse.md)
326 changes: 326 additions & 0 deletions docs/src/advanced_usage.md

Large diffs are not rendered by default.

4 changes: 4 additions & 0 deletions docs/src/algo_details.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
# Algorithm details

- [Filtering low confidence base modification calls](./filtering.md)
- [Removing and combining base modification calls](./filtering.md)
12 changes: 6 additions & 6 deletions modkit-book/src/collapse.md → docs/src/collapse.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,14 +12,14 @@ modification calls when performing a pileup to generate a bedMethyl file.

BAM records containing probabilities for multiple base modifications at each residue can be transformed to
ignore one (or more) of the predicted base modifications. For example, if a BAM file contains 5hmC and 5mC
probabilities, the total probability is necessarily distributed over 5mC $`p_m`$, 5hmC $`p_h`$, and canonical
C, $`p_{C}`$. Described below are the two implemented methods to reduce the dimensionality of the probability
probabilities, the total probability is necessarily distributed over 5mC, \\(p_m\\), 5hmC, \\(p_h\\), and canonical
C, \\(p_{C}\\). Described below are the two implemented methods to reduce the dimensionality of the probability
distribution by one or more predicted base modification class.

Take for example a 3-way modification call for C, 5mC, and 5hmC, the probabilities of each being $`p_{C}`$,
$`p_{m}`$, $`p_{h}`$, respectively. To remove the 5hmC probability, $`p_{h}`$ is distributed evenly across
the other two options. In this example, the updates are $`p_C \leftarrow p_C + (\frac{p_h}{2})`$ and $`p_m
\leftarrow p_m + (\frac{p_h}{2})`$.
Take for example a 3-way modification call for C, 5mC, and 5hmC, the probabilities of each being \\(p_{C}\\),
\\(p_{m}\\), \\(p_{h}\\), respectively. To remove the 5hmC probability, \\(p_{h}\\) is distributed evenly
across the other two options. In this example, the updates are \\( p_C \leftarrow p_C + (\frac{p_h}{2}) \\)
and \\( p_m \leftarrow p_m + (\frac{p_h}{2}) \\).


## Combining multiple base modifications into a single count.
Expand Down
12 changes: 6 additions & 6 deletions modkit-book/src/filtering.md → docs/src/filtering.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,15 @@
# Filtering base modification calls
# Filtering base modification calls.

Modified base calls are qualitied with a probability that is contained in the ML tag (see the
[specification](https://samtools.github.io/hts-specs/SAMtags.pdf)). We calculate the confidence that the model
has in the base modification prediction as $`\mathcal{q} = argmax(\textbf{P})`$ where $`\textbf{P}`$ is the
has in the base modification prediction as \\(\mathcal{q} = argmax(\textbf{P})\\) where \\(\textbf{P}\\) is the
vector of probabilities for each modification. For example, given a model that can classify canonical
cytosine, 5mC, and 5hmC, $`\textbf{P}`$ is $`[P_{C}, P_m, P_h]`$, and $`\mathcal{q}`$ will be $`\mathcal{q} =
argmax(P_{C}, P_m, P_h)`$, the maximum of the three probabilities. Filtering in `modkit` is performed by
first determining the value of $`\mathcal{q}`$ for the lowest n-th percentile of calls (10th percentile by
cytosine, 5mC, and 5hmC, \\(\textbf{P}\\) is \\([P_{C}, P_m, P_h]\\), and \\(\mathcal{q}\\) will be \\(\mathcal{q} =
argmax(P_{C}, P_m, P_h)\\), the maximum of the three probabilities. Filtering in `modkit` is performed by
first determining the value of \\(\mathcal{q}\\) for the lowest n-th percentile of calls (10th percentile by
default). The threshold value is typically an estimate because the base modification probabilities are
sampled from a subset of the reads. All calls can be used to calculate the exact value, but in practice the
approximation gives the same value. Base modification calls with a confidence value less than this number will
not be counted. Determination of the threshold value can be performed on the fly (by sampling) or the
threshold value can be specified on the command line with the `--filter_threshold` flag. The `sample-probs`
command can be used to quickly estimate the value of $`\mathcal{q}`$ at various percentiles.
command can be used to quickly estimate the value of \\(\mathcal{q}\\) at various percentiles.
Binary file added docs/src/images/ONT_logo_590x106.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
66 changes: 66 additions & 0 deletions docs/src/intro_adjust.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,66 @@
# Updating and Adjusting MM tags.

The `adjust-mods` subcommand can be used to manipulate MM (and corresponding ML) tags in a
modBam. In general, these simple commands are run prior to `pileup`, visualization, or
other analysis.


## Ignoring a modification class.

To remove a base modification class from a modBAM and produce a new modBAM, use the
`--ignore` option for `adjust-mods`.

```
modkit adjust-mods input.bam output.adjust.bam --ignore <mod_code_to_ignore>
```
For example the command below will remove 5hmC calls, leaving just 5mC calls.

```
modkit adjust-mods input.bam output.adjust.bam --ignore h
```
For technical details on the transformation see [Removing modification calls from
BAMs](./collapse.md#removing-dna-base-modification-probabilities).

## Combining base modification probabilities.

Combining base modification probabilities may be desirable for downstream analysis or
visualization. Unlike `--ignore` which removes the probability of a class, `--convert`
will sum the probability of one class with another if the second class already exists. For
example, the command below will convert probabilities associated with `h` probability into
`m` probability. If `m` already exists, the probabilities will be summed. As described in
[changing the modification code](./intro_adjust.md#changing-the-base-modification-code),
if the second base modification code doesn't exist, the probabilities are left unchanged.

```
modkit adjust-mods input.bam output.convert.bam --convert h m
```

This would be equivalent to setting `--combine-mods` flag, for example.
```
modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods
```

## Updating the flag (`?` and `.`).
The [specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (Section 1.7) allows
for omission of the MM flag, however this may not be the intent of missing base
modification probabilities for some models. The command below will add or change the `?` flag to a modBAM.

```
modkit adjust-mods input.bam output.bam --mode ambiguous
```

Another option is to set the flag to `.`, the "implicitly canonical" mode:

```
modkit adjust-mods input.bam output.bam --mode implicit
```

## Changing the base modification code.
Some functions in `modkit` or other tools may require the mod-codes in the MM tag be in
the [specification](https://samtools.github.io/hts-specs/SAMtags.pdf).

For example, the following command will change `C+Z,` tags to `C+m,` tags.

```
modkit adjust-mods input.bam output.bam --convert Z m
```
100 changes: 100 additions & 0 deletions docs/src/intro_bedmethyl.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,100 @@
# Constructing bedMethyl tables.

A primary use of `modkit` is to create summary counts of modified and unmodified bases in
an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format.
bedMethyl files tabulate the counts of base modifications from every sequencing read over
each aligned reference genomic position.

In its simplest form `modkit` creates a bedMethyl file using the following:

```bash
modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
```

No reference sequence is required. A single file (described
[below](#description-of-bedmethyl-output)) with base count summaries will be created. The
final argument here specifies an optional log file output.

The program performs best-practices filtering and manipulation of the raw data stored in
the input file. For further details see [filtering modified-base calls](./filtering.md).

For user convenience, the counting process can be modulated using several additional
transforms and filters. The most basic of these is to report only counts from reference
CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs
in the reference:

```bash
modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
```

The program also contains preset which combine several options for ease of use. The
`traditional` preset,

```bash
modkit pileup path/to/reads.bam output/path/pileup.bed \
--ref path/to/reference.fasta \
--preset traditional
```

performs three transforms:
* restricts output to locations where there is a CG dinucleotide in the reference,
* reports only a C and 5mC counts, using procedures to take into account counts of other
forms of cytosine modification (notably 5hmC), and
* aggregates data across strands. The strand field od the output will be marked as '.'
indicating that the strand information has been lost.

Using this option is equivalent to running with the options:

```bash
modkit pileup --cpg --ref <reference.fasta> --collapse h --combine-strands
```

For more information on the individual options see the [Advanced Usage](./advanced_usage.md) help document.

## Description of bedMethyl output.

Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the
bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/).

### Definitions:

* N<sub>mod</sub> - Number of calls passing filters that were classified as a residue with a specified base modification.
* N<sub>canonical</sub> - Number of calls passing filters were classified as the canonical base rather than modified. The
exact base must be inferred by the modification code. For example, if the modification code is `m` (5mC) then
the canonical base is cytosine. If the modification code is `a`, the canonical base is adenosine.
* N<sub>other mod</sub> - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with
`h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h` N<sub>other_mod</sub> would be 2. In the
`m` row N<sub>other_mod</sub> would be 3.
* N<sub>valid_cov</sub> - the valid coverage. N<sub>valid_cov</sub> = N<sub>mod</sub> + N<sub>other_mod</sub> + N<sub>canonical</sub>, also used as the `score` in the bedMethyl
* N<sub>diff</sub> - Number of reads with a base other than the canonical base for this modification. For example, in a row
for `h` the canonical base is cytosine, if there are 2 reads with C->A substitutions, N<sub>diff</sub> will be 2.
* N<sub>delete</sub> - Number of reads with a deletion at this reference position
* N<sub>filtered</sub> - Number of calls where the probability of the call was below the threshold. The threshold can be
set on the command line or computed from the data (usually filtering out the lowest 10th percentile of calls).
* N<sub>nocall</sub> - Number of reads aligned to this reference position, with the correct canonical base, but without a base
modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a
CG->CH substitution such that no modification call was produced by the basecaller.

### bedMethyl column descriptions.

| column | name | description | type |
|--------|-----------------------|--------------------------------------------------------------------------------|-------|
| 1 | chrom | name of reference sequence from BAM header | str |
| 2 | start position | 0-based start position | int |
| 3 | end position | 0-based exclusive end position | int |
| 4 | modified base code | single letter code for modified base | str |
| 5 | score | Equal to N<sub>valid_cov</sub>. | int |
| 6 | strand | '+' for positive strand '-' for negative strand, '.' when strands are combined | str |
| 7 | start position | included for compatibility | int |
| 8 | end position | included for compatibility | int |
| 9 | color | included for compatibility, always 255,0,0 | str |
| 10 | N<sub>valid_cov</sub> | See definitions above. | int |
| 11 | fraction modified | N<sub>mod</sub> / N<sub>valid_cov</sub> | float |
| 12 | N<sub>mod</sub> | See definitions above. | int |
| 13 | N<sub>canonical</sub> | See definitions above. | int |
| 14 | N<sub>other_mod</sub> | See definitions above. | int |
| 15 | N<sub>delete</sub> | See definitions above. | int |
| 16 | N<sub>filtered</sub> | See definitions above. | int |
| 17 | N<sub>diff</sub> | See definitions above. | int |
| 18 | N<sub>nocall</sub> | See definitions above. | int |

17 changes: 17 additions & 0 deletions docs/src/intro_motif_bed.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,17 @@
# Making a motif BED file.

Downstream analysis may require a [BED](https://en.wikipedia.org/wiki/BED_(file_format))
file to select motifs of interest. For example, selecting GATC motifs in _E. coli_. This
command requires a reference sequence in
[FASTA](https://en.wikipedia.org/wiki/FASTA_format) a motif to find, which can include
[IUPAC ambiguous bases](https://en.wikipedia.org/wiki/Nucleic_acid_notation) and a
position within the motif.

The following command would make a BED file for CG motifs.

```
modkit motif-bed reference.fasta CG 0 1> cg_modifs.bed
```

The output is directed to standard out.

32 changes: 32 additions & 0 deletions docs/src/intro_summary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Summarizing a modBAM.

The `modkit summary` sub-command is intended for collecting read-level statistics on
either a sample of reads, a region, or an entire modBam.

## Summarize the base modification calls in a modBAM.

```
modkit summary input.bam
```

will output a table similar to this

```bash
> parsing region chr20 # only present if --region option is provided
> sampling 10042 reads from BAM # modulated with --num-reads
> calculating threshold at 10% percentile # modulated with --filter-percentile
> calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly
# bases C
# total_reads_used 9989
# count_reads_C 9989
# pass_threshold_C 0.7167969
# region chr20:0-64444167
base code all_count all_frac pass_count pass_frac
C h 195335 0.086608544 119937 0.0590528
C m 1305956 0.5790408 1192533 0.58716166
C - 754087 0.33435062 718543 0.3537855
```

The `pass_count` and `pass_frac` columns are the statistics for calls with confidence
greater than or equal to the `pass_threshold` for that canonical base's calls. For more
details on thresholds see [filtering base modification calls](./filtering.md).
11 changes: 11 additions & 0 deletions docs/src/limitations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
# Current limitations

Known limitations and forecasts for when they will be removed.

1. ChEBI codes are not supported at all, only mod-codes in the
[specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (top of page 9) are
supported in `pileup`.
- This limitation will be removed by version 0.2.0

1. Only one MM-flag (`.`, `?`) per-canonical base is supported within a read.
- This limitation may be removed in the future.
19 changes: 19 additions & 0 deletions docs/src/perf_considerations.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,19 @@
# Performance considerations

## Sharding a large modBAM by region.

The `--region` option in `pileup`, `summary`, and `sample-probs` can be used to operate on
a subset of records in a large BAM. If you're working in a distributed environment, the
genome could be sharded into large sections which are specified to `modkit` in concurrent
processes and merged afterward in a "map-reduce" pattern.

## Setting the `--interval-size` and `--sampling-interval-size`.

Whenever operating on an sorted, indexed BAM, `modkit` will operate in parallel on
disjoint spans of the genome. The length of these spans (i.e. intervals) can be determined
by the `--interval-size` (for general `pileup` and other use cases) or the
`--sampleing-interval-size` (for the sampling algorithm only). The defaults for these
parameters works well for genomes such as the human genome. For smaller genomes with high
coverage, you may decide to _decrease_ the interval size in order to take advantage of
parallelism. If you're operating in a memory-constrained environment, _decreasing_ the
interval size will reduce the memory usage per-thread.
26 changes: 26 additions & 0 deletions docs/src/quick_start.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
# Basic Usage

### `modkit` is a bioinformatics tool for working with modified bases from Oxford Nanopore.

![ONT_logo](./images/ONT_logo_590x106.png)

## Installation

Pre-compiled binaries are provided for Linux from the [release
page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in
most circumstances. As a rust-based project, `modkit` can also be installed with
[cargo](https://www.rust-lang.org/learn/get-started).
```bash
git clone https://github.com/nanoporetech/modkit.git
cd modkit
cargo install --path .
# or
cargo install --git https://github.com/nanoporetech/modkit.git
```

## Common Use Cases
1. [Creating a bedMethyl table with `pileup`](./intro_bedmethyl.md)
1. [Updating and Adjusting MM tags with `adjust-mods` and `update-tags`](./intro_adjust.md)
1. [Summarizing a modBAM with `summary`](./intro_summary.md)
1. [Making a motif BED file with `motif-bed`](./intro_motif_bed.md)

Loading

0 comments on commit 16c6ed7

Please sign in to comment.