Book-docs ready for review.

nanoporetech · May 2, 2023 · 16c6ed7 · 16c6ed7
1 parent 0740934
commit 16c6ed7
Show file tree

Hide file tree

Showing 66 changed files with 700 additions and 6,867 deletions.
diff --git a/.gitignore b/.gitignore
@@ -13,3 +13,6 @@ tmp/
 dist
 dist/
 sourceme.sh
+*.swp
+docs/book
+docs/book/*
diff --git a/README.md b/README.md
@@ -37,7 +37,7 @@ modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
 
 No reference sequence is required. A single file (described [below](#description-of-bedmethyl-output)) with base count summaries will be created. The final argument here specifies an optional log file output.
 
-The program perform best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./filtering.md).
+The program performs best-practices filtering and manipulation of the raw data stored in the input file. For further details see [filtering modified-base calls](./filtering.md).
 
 For user convenience the counting process can be modulated using several additional transforms and filters. The most basic of these is to report only counts from reference CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs in the reference:
 

diff --git a/modkit-book/book.toml → docs/book.toml b/modkit-book/book.toml → docs/book.toml
@@ -4,3 +4,7 @@ language = "en"
 multilingual = false
 src = "src"
 title = "Modkit"
+
+[output.html]
+mathjax-support = true
+additional-css = ["custom.css"]
diff --git a/docs/custom.css b/docs/custom.css
@@ -0,0 +1,4 @@
+:root {
+  --content-max-width: 960px;
+}
+
diff --git a/docs/src/SUMMARY.md b/docs/src/SUMMARY.md
@@ -0,0 +1,14 @@
+# Summary
+
+- [Basic Usage](./quick_start.md)
+    - [Constructing bedMethyl tables](./intro_bedmethyl.md)
+    - [Updating and Adjusting MM tags](./intro_adjust.md)
+    - [Summarizing a modBAM](./intro_summary.md)
+    - [Making a motif BED file](./intro_motif_bed.md)
+- [Extended subcommand help](./advanced_usage.md)
+- [Troubleshooting](./troubleshooting.md)
+- [Current limitations](./limitations.md)
+- [Performance considerations](./perf_considerations.md)
+- [Algorithm details](./algo_details.md)
+    - [Filtering base modification calls](./filtering.md)
+    - [Ignoring and combining calls](./collapse.md)
diff --git a/docs/src/advanced_usage.md b/docs/src/advanced_usage.md
diff --git a/docs/src/algo_details.md b/docs/src/algo_details.md
@@ -0,0 +1,4 @@
+# Algorithm details
+
+- [Filtering low confidence base modification calls](./filtering.md)
+- [Removing and combining base modification calls](./filtering.md)
diff --git a/modkit-book/src/collapse.md → docs/src/collapse.md b/modkit-book/src/collapse.md → docs/src/collapse.md
@@ -12,14 +12,14 @@ modification calls when performing a pileup to generate a bedMethyl file.
 
 BAM records containing probabilities for multiple base modifications at each residue can be transformed to
 ignore one (or more) of the predicted base modifications. For example, if a BAM file contains 5hmC and 5mC
-probabilities, the total probability is necessarily distributed over 5mC $`p_m`$, 5hmC $`p_h`$, and canonical
-C, $`p_{C}`$. Described below are the two implemented methods to reduce the dimensionality of the probability
+probabilities, the total probability is necessarily distributed over 5mC, \\(p_m\\), 5hmC, \\(p_h\\), and canonical
+C, \\(p_{C}\\). Described below are the two implemented methods to reduce the dimensionality of the probability
 distribution by one or more predicted base modification class.
 
-Take for example a 3-way modification call for C, 5mC, and 5hmC, the probabilities of each being $`p_{C}`$,
-$`p_{m}`$, $`p_{h}`$, respectively.  To remove the 5hmC probability, $`p_{h}`$ is distributed evenly across
-the other two options. In this example, the updates are $`p_C \leftarrow p_C + (\frac{p_h}{2})`$ and $`p_m
-\leftarrow p_m + (\frac{p_h}{2})`$.
+Take for example a 3-way modification call for C, 5mC, and 5hmC, the probabilities of each being \\(p_{C}\\),
+\\(p_{m}\\), \\(p_{h}\\), respectively.  To remove the 5hmC probability, \\(p_{h}\\) is distributed evenly
+across the other two options. In this example, the updates are \\( p_C \leftarrow p_C + (\frac{p_h}{2}) \\)
+and \\( p_m \leftarrow p_m + (\frac{p_h}{2}) \\).
 
 
 ## Combining multiple base modifications into a single count.

diff --git a/modkit-book/src/filtering.md → docs/src/filtering.md b/modkit-book/src/filtering.md → docs/src/filtering.md
@@ -1,15 +1,15 @@
-# Filtering base modification calls
+# Filtering base modification calls.
 
 Modified base calls are qualitied with a probability that is contained in the ML tag (see the
 [specification](https://samtools.github.io/hts-specs/SAMtags.pdf)). We calculate the confidence that the model
-has in the base modification prediction as $`\mathcal{q} = argmax(\textbf{P})`$ where $`\textbf{P}`$ is the
+has in the base modification prediction as \\(\mathcal{q} = argmax(\textbf{P})\\) where \\(\textbf{P}\\) is the
 vector of probabilities for each modification. For example, given a model that can classify canonical
-cytosine, 5mC, and 5hmC, $`\textbf{P}`$ is $`[P_{C}, P_m, P_h]`$, and $`\mathcal{q}`$ will be $`\mathcal{q} =
-argmax(P_{C}, P_m, P_h)`$, the maximum of the three probabilities.  Filtering in `modkit` is performed by
-first determining the value of $`\mathcal{q}`$ for the lowest n-th percentile of calls (10th percentile by
+cytosine, 5mC, and 5hmC, \\(\textbf{P}\\) is \\([P_{C}, P_m, P_h]\\), and \\(\mathcal{q}\\) will be \\(\mathcal{q} =
+argmax(P_{C}, P_m, P_h)\\), the maximum of the three probabilities.  Filtering in `modkit` is performed by
+first determining the value of \\(\mathcal{q}\\) for the lowest n-th percentile of calls (10th percentile by
 default).  The threshold value is typically an estimate because the base modification probabilities are
 sampled from a subset of the reads. All calls can be used to calculate the exact value, but in practice the
 approximation gives the same value. Base modification calls with a confidence value less than this number will
 not be counted.  Determination of the threshold value can be performed on the fly (by sampling) or the
 threshold value can be specified on the command line with the `--filter_threshold` flag. The `sample-probs`
-command can be used to quickly estimate the value of $`\mathcal{q}`$ at various percentiles.
+command can be used to quickly estimate the value of \\(\mathcal{q}\\) at various percentiles.
diff --git a/docs/src/images/ONT_logo_590x106.png b/docs/src/images/ONT_logo_590x106.png
diff --git a/docs/src/intro_adjust.md b/docs/src/intro_adjust.md
@@ -0,0 +1,66 @@
+# Updating and Adjusting MM tags.
+
+The `adjust-mods` subcommand can be used to manipulate MM (and corresponding ML) tags in a
+modBam. In general, these simple commands are run prior to `pileup`, visualization, or
+other analysis.
+
+
+## Ignoring a modification class.
+
+To remove a base modification class from a modBAM and produce a new modBAM, use the
+`--ignore` option for `adjust-mods`.
+
+```
+modkit adjust-mods input.bam output.adjust.bam --ignore <mod_code_to_ignore>
+```
+For example the command below will remove 5hmC calls, leaving just 5mC calls.
+
+```
+modkit adjust-mods input.bam output.adjust.bam --ignore h
+```
+For technical details on the transformation see [Removing modification calls from
+BAMs](./collapse.md#removing-dna-base-modification-probabilities).
+
+## Combining base modification probabilities.
+
+Combining base modification probabilities may be desirable for downstream analysis or
+visualization. Unlike `--ignore` which removes the probability of a class, `--convert`
+will sum the probability of one class with another if the second class already exists. For
+example, the command below will convert probabilities associated with `h` probability into
+`m` probability. If `m` already exists, the probabilities will be summed.  As described in
+[changing the modification code](./intro_adjust.md#changing-the-base-modification-code),
+if the second base modification code doesn't exist, the probabilities are left unchanged.
+
+```
+modkit adjust-mods input.bam output.convert.bam --convert h m
+```
+
+This would be equivalent to setting `--combine-mods` flag, for example.
+```
+modkit pileup path/to/reads.bam output/path/pileup.bed --combine-mods
+```
+
+## Updating the flag (`?` and `.`).
+The [specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (Section 1.7) allows
+for omission of the MM flag, however this may not be the intent of missing base
+modification probabilities for some models. The command below will add or change the `?` flag to a modBAM.
+
+```
+modkit adjust-mods input.bam output.bam --mode ambiguous
+```
+
+Another option is to set the flag to `.`, the "implicitly canonical" mode:
+
+```
+modkit adjust-mods input.bam output.bam --mode implicit
+```
+
+## Changing the base modification code.
+Some functions in `modkit` or other tools may require the mod-codes in the MM tag be in
+the [specification](https://samtools.github.io/hts-specs/SAMtags.pdf). 
+
+For example, the following command will change `C+Z,` tags to `C+m,` tags.
+
+```
+modkit adjust-mods input.bam output.bam --convert Z m
+```
diff --git a/docs/src/intro_bedmethyl.md b/docs/src/intro_bedmethyl.md
@@ -0,0 +1,100 @@
+# Constructing bedMethyl tables.
+
+A primary use of `modkit` is to create summary counts of modified and unmodified bases in
+an extended [bedMethyl](https://www.encodeproject.org/data-standards/wgbs/) format.
+bedMethyl files tabulate the counts of base modifications from every sequencing read over
+each aligned reference genomic position.
+
+In its simplest form `modkit` creates a bedMethyl file using the following:
+
+```bash
+modkit pileup path/to/reads.bam output/path/pileup.bed --log-filepath pileup.log
+```
+
+No reference sequence is required. A single file (described
+[below](#description-of-bedmethyl-output)) with base count summaries will be created. The
+final argument here specifies an optional log file output.
+
+The program performs best-practices filtering and manipulation of the raw data stored in
+the input file. For further details see [filtering modified-base calls](./filtering.md).
+
+For user convenience, the counting process can be modulated using several additional
+transforms and filters. The most basic of these is to report only counts from reference
+CpG dinucleotides. This option requires a reference sequence in order to locate the CpGs
+in the reference:
+
+```bash
+modkit pileup path/to/reads.bam output/path/pileup.bed --cpg --ref path/to/reference.fasta
+```
+
+The program also contains preset which combine several options for ease of use. The
+`traditional` preset,
+
+```bash
+modkit pileup path/to/reads.bam output/path/pileup.bed \
+  --ref path/to/reference.fasta \
+  --preset traditional
+```
+
+performs three transforms:
+* restricts output to locations where there is a CG dinucleotide in the reference,
+* reports only a C and 5mC counts, using procedures to take into account counts of other
+  forms of cytosine modification (notably 5hmC), and
+* aggregates data across strands. The strand field od the output will be marked as '.'
+  indicating that the strand information has been lost.
+
+Using this option is equivalent to running with the options:
+
+```bash
+modkit pileup --cpg --ref <reference.fasta> --collapse h --combine-strands
+```
+
+For more information on the individual options see the [Advanced Usage](./advanced_usage.md) help document.
+
+## Description of bedMethyl output.
+
+Below is a description of the bedMethyl columns generated by `modkit pileup`. A brief description of the
+bedMethyl specification can be found on [Encode](https://www.encodeproject.org/data-standards/wgbs/).
+
+### Definitions:
+
+* N<sub>mod</sub> - Number of calls passing filters that were classified as a residue with a specified base modification.
+* N<sub>canonical</sub> - Number of calls passing filters were classified as the canonical base rather than modified. The
+exact base must be inferred by the modification code. For example, if the modification code is `m` (5mC) then
+the canonical base is cytosine. If the modification code is `a`, the canonical base is adenosine.
+* N<sub>other mod</sub> - Number of calls passing filters that were classified as modified, but where the modification is different from the listed base (and the corresponding canonical base is equal). For example, for a given cytosine there may be 3 reads with
+`h` calls, 1 with a canonical call, and 2 with `m` calls. In the bedMethyl row for `h` N<sub>other_mod</sub> would be 2. In the
+`m` row N<sub>other_mod</sub> would be 3.
+* N<sub>valid_cov</sub> - the valid coverage. N<sub>valid_cov</sub> = N<sub>mod</sub> + N<sub>other_mod</sub> + N<sub>canonical</sub>, also used as the `score` in the bedMethyl
+* N<sub>diff</sub> - Number of reads with a base other than the canonical base for this modification. For example, in a row
+for `h` the canonical base is cytosine, if there are 2 reads with C->A substitutions, N<sub>diff</sub> will be 2.
+* N<sub>delete</sub> - Number of reads with a deletion at this reference position
+* N<sub>filtered</sub> - Number of calls where the probability of the call was below the threshold. The threshold can be
+set on the command line or computed from the data (usually filtering out the lowest 10th percentile of calls).
+* N<sub>nocall</sub> - Number of reads aligned to this reference position, with the correct canonical base, but without a base
+modification call. This can happen, for example, if the model requires a CpG dinucleotide and the read has a
+CG->CH substitution such that no modification call was produced by the basecaller.
+
+### bedMethyl column descriptions.
+
+| column | name                  | description                                                                    | type  |
+|--------|-----------------------|--------------------------------------------------------------------------------|-------|
+| 1      | chrom                 | name of reference sequence from BAM header                                     | str   |
+| 2      | start position        | 0-based start position                                                         | int   |
+| 3      | end position          | 0-based exclusive end position                                                 | int   |
+| 4      | modified base code    | single letter code for modified base                                           | str   |
+| 5      | score                 | Equal to N<sub>valid_cov</sub>.                                                | int   |
+| 6      | strand                | '+' for positive strand '-' for negative strand, '.' when strands are combined | str   |
+| 7      | start position        | included for compatibility                                                     | int   |
+| 8      | end position          | included for compatibility                                                     | int   |
+| 9      | color                 | included for compatibility, always 255,0,0                                     | str   |
+| 10     | N<sub>valid_cov</sub> | See definitions above.                                                         | int   |
+| 11     | fraction modified     | N<sub>mod</sub> / N<sub>valid_cov</sub>                                        | float |
+| 12     | N<sub>mod</sub>       | See definitions above.                                                         | int   |
+| 13     | N<sub>canonical</sub> | See definitions above.                                                         | int   |
+| 14     | N<sub>other_mod</sub> | See definitions above.                                                         | int   |
+| 15     | N<sub>delete</sub>    | See definitions above.                                                         | int   |
+| 16     | N<sub>filtered</sub>  | See definitions above.                                                         | int   |
+| 17     | N<sub>diff</sub>      | See definitions above.                                                         | int   |
+| 18     | N<sub>nocall</sub>    | See definitions above.                                                         | int   |
+
diff --git a/docs/src/intro_motif_bed.md b/docs/src/intro_motif_bed.md
@@ -0,0 +1,17 @@
+# Making a motif BED file.
+
+Downstream analysis may require a [BED](https://en.wikipedia.org/wiki/BED_(file_format))
+file to select motifs of interest. For example, selecting GATC motifs in _E. coli_. This
+command requires a reference sequence in
+[FASTA](https://en.wikipedia.org/wiki/FASTA_format) a motif to find, which can include
+[IUPAC ambiguous bases](https://en.wikipedia.org/wiki/Nucleic_acid_notation) and a
+position within the motif.
+
+The following command would make a BED file for CG motifs.
+
+```
+modkit motif-bed reference.fasta CG 0 1> cg_modifs.bed
+```
+
+The output is directed to standard out.
+
diff --git a/docs/src/intro_summary.md b/docs/src/intro_summary.md
@@ -0,0 +1,32 @@
+# Summarizing a modBAM.
+
+The `modkit summary` sub-command is intended for collecting read-level statistics on
+either a sample of reads, a region, or an entire modBam.
+
+## Summarize the base modification calls in a modBAM.
+
+```
+modkit summary input.bam 
+```
+
+will output a table similar to this
+
+```bash
+> parsing region chr20  # only present if --region option is provided
+> sampling 10042 reads from BAM # modulated with --num-reads
+> calculating threshold at 10% percentile # modulated with --filter-percentile
+> calculated thresholds: C: 0.7167969 # calculated per-canonical base, on the fly
+# bases             C
+# total_reads_used  9989
+# count_reads_C     9989
+# pass_threshold_C  0.7167969
+# region            chr20:0-64444167
+ base  code  all_count  all_frac     pass_count  pass_frac
+ C     h     195335     0.086608544  119937      0.0590528
+ C     m     1305956    0.5790408    1192533     0.58716166
+ C     -     754087     0.33435062   718543      0.3537855
+```
+
+The `pass_count` and `pass_frac` columns are the statistics for calls with confidence
+greater than or equal to the `pass_threshold` for that canonical base's calls. For more
+details on thresholds see [filtering base modification calls](./filtering.md).
diff --git a/docs/src/limitations.md b/docs/src/limitations.md
@@ -0,0 +1,11 @@
+# Current limitations
+
+Known limitations and forecasts for when they will be removed.
+
+1. ChEBI codes are not supported at all, only mod-codes in the
+[specification](https://samtools.github.io/hts-specs/SAMtags.pdf) (top of page 9) are
+supported in `pileup`.
+    - This limitation will be removed by version 0.2.0
+
+1. Only one MM-flag (`.`, `?`) per-canonical base is supported within a read.
+    - This limitation may be removed in the future.
diff --git a/docs/src/perf_considerations.md b/docs/src/perf_considerations.md
@@ -0,0 +1,19 @@
+# Performance considerations
+
+## Sharding a large modBAM by region.
+
+The `--region` option in `pileup`, `summary`, and `sample-probs` can be used to operate on
+a subset of records in a large BAM. If you're working in a distributed environment, the
+genome could be sharded into large sections which are specified to `modkit` in concurrent
+processes and merged afterward in a "map-reduce" pattern.
+
+## Setting the `--interval-size` and `--sampling-interval-size`.
+
+Whenever operating on an sorted, indexed BAM, `modkit` will operate in parallel on
+disjoint spans of the genome. The length of these spans (i.e. intervals) can be determined
+by the `--interval-size` (for general `pileup` and other use cases) or the
+`--sampleing-interval-size` (for the sampling algorithm only). The defaults for these
+parameters works well for genomes such as the human genome. For smaller genomes with high
+coverage, you may decide to _decrease_ the interval size in order to take advantage of
+parallelism. If you're operating in a memory-constrained environment, _decreasing_ the
+interval size will reduce the memory usage per-thread.
diff --git a/docs/src/quick_start.md b/docs/src/quick_start.md
@@ -0,0 +1,26 @@
+# Basic Usage
+
+### `modkit` is a bioinformatics tool for working with modified bases from Oxford Nanopore.
+
+![ONT_logo](./images/ONT_logo_590x106.png)
+
+## Installation
+
+Pre-compiled binaries are provided for Linux from the [release
+page](https://github.com/nanoporetech/modkit/releases). We recommend the use of these in
+most circumstances. As a rust-based project, `modkit` can also be installed with 
+[cargo](https://www.rust-lang.org/learn/get-started).
+```bash
+git clone https://github.com/nanoporetech/modkit.git
+cd modkit
+cargo install --path .
+# or
+cargo install --git https://github.com/nanoporetech/modkit.git
+```
+
+## Common Use Cases
+1. [Creating a bedMethyl table with `pileup`](./intro_bedmethyl.md)
+1. [Updating and Adjusting MM tags with `adjust-mods` and `update-tags`](./intro_adjust.md)
+1. [Summarizing a modBAM with `summary`](./intro_summary.md)
+1. [Making a motif BED file with `motif-bed`](./intro_motif_bed.md)
+