Skip to content

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

License

Notifications You must be signed in to change notification settings

althonos/lightmotif

Repository files navigation

🎼🧬 lightmotif Star me

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

Actions Coverage License Crate Docs Source Mirror GitHub issues Changelog

🗺️ Overview

Motif scanning with position weight matrices (also known as position-specific scoring matrices) is a robust method for identifying motifs of fixed length inside a biological sequence. They can be used to identify transcription factor binding sites in DNA, or protease cleavage site in polypeptides. Position weight matrices are often viewed as sequence logos:

MX000274.svg

The lightmotif library provides a Rust crate to run very efficient searches for a motif encoded in a position weight matrix. The position scanning combines several techniques to allow high-throughput processing of sequences:

  • Compile-time definition of alphabets and matrix dimensions.
  • Sequence symbol encoding for fast table look-ups, as implemented in HMMER[1] or MEME[2]
  • Striped sequence matrices to process several positions in parallel, inspired by Michael Farrar[3].
  • Vectorized matrix row look-up using permute instructions of AVX2.
  • High-throughput Gibbs sampler[4] implementation in oops and zoops modes, featuring deterministic results using randomness from the rand crate.

Other crates from the ecosystem provide additional features if needed:

  • lightmotif-io is a crate with parser implementations for various count matrix, frequency matrix and position-specific scoring matrix formats such as TRANSFAC or JASPAR.
  • lightmotif-tfmpvalue is an exact reimplementation of the TFM-PVALUE[5] algorithm for converting between a score and a p-value for a given scoring matrix.

This is the Rust version, there is a Python package available as well.

💡 Example

use lightmotif::*;
use lightmotif::abc::Nucleotide;

// Create a count matrix from an iterable of motif sequences
let counts = CountMatrix::<Dna>::from_sequences(
    ["GTTGACCTTATCAAC", "GTTGATCCAGTCAAC"]
        .into_iter()
        .map(|s| EncodedSequence::encode(s).unwrap()),
)
.unwrap();

// Create a PSSM with 0.1 pseudocounts and uniform background frequencies.
let pssm = counts.to_freq(0.1).to_scoring(None);

// Use the pipeline to encode the target sequence into a striped matrix
let seq = "ATGTCCCAACAACGATACCCCGAGCCCATCGCCGTCATCGGCTCGGCATGCAGATTCCCAGGCG";
let encoded = EncodedSequence::encode(seq).unwrap();
let mut striped = encoded.to_striped();

// Organize layout of striped matrix to allow scoring with PSSM.
striped.configure(&pssm);

// Compute scores for every position of the matrix.
let scores = pssm.score(&striped);

// Scores can be extracted into a Vec<f32>, or indexed directly.
let v = scores.unstripe();
assert_eq!(scores[0], -23.07094);
assert_eq!(v[0], -23.07094);

// Find the highest scoring position.
let best = scores.argmax().unwrap();
assert_eq!(best, 18);

// Find the positions above an absolute score threshold.
let indices = scores.threshold(10.0);
assert_eq!(indices, []);

This example uses a dynamic dispatch pipeline, which selects the best available backend (AVX2, SSE2, NEON, or a generic implementation) depending on the local platform.

⏱️ Benchmarks

Both benchmarks use the MX000001 motif from PRODORIC[5], and the complete genome of an Escherichia coli K12 strain. Benchmarks were run on a i7-10710U CPU running @1.10GHz, compiled with --target-cpu=native.

  • Score every position of the genome with the motif weight matrix:

    test bench_avx2    ... bench:   4,510,794 ns/iter (+/-     9,570) = 1029 MB/s
    test bench_sse2    ... bench:  26,773,537 ns/iter (+/-    57,891) =  173 MB/s
    test bench_generic ... bench: 317,731,004 ns/iter (+/- 2,567,370) =   14 MB/s
  • Find the highest-scoring position for a motif in a 10kb sequence (compared to the PSSM algorithm implemented in bio::pattern_matching::pssm):

    test bench_avx2    ... bench:      12,797 ns/iter (+/-   380) = 781 MB/s
    test bench_sse2    ... bench:      62,597 ns/iter (+/-    43) = 159 MB/s
    test bench_generic ... bench:     671,900 ns/iter (+/- 1,150) =  14 MB/s
    test bench_bio     ... bench:   1,193,911 ns/iter (+/- 2,519) =   8 MB/s

💭 Feedback

⚠️ Issue Tracker

Found a bug ? Have an enhancement request ? Head over to the GitHub issue tracker if you need to report or ask something. If you are filing in on a bug, please include as much information as you can about the issue, and try to recreate the same bug in a simple, easily reproducible situation.

📋 Changelog

This project adheres to Semantic Versioning and provides a changelog in the Keep a Changelog format.

⚖️ License

This library is provided under the open-source MIT license.

This project was developed by Martin Larralde during his PhD project at the European Molecular Biology Laboratory in the Zeller team.

📚 References

  • [1] Eddy, Sean R. ‘Accelerated Profile HMM Searches’. PLOS Computational Biology 7, no. 10 (20 October 2011): e1002195. doi:10.1371/journal.pcbi.1002195.
  • [2] Grant, Charles E., Timothy L. Bailey, and William Stafford Noble. ‘FIMO: Scanning for Occurrences of a given Motif’. Bioinformatics 27, no. 7 (1 April 2011): 1017–18. doi:10.1093/bioinformatics/btr064.
  • [3] Farrar, Michael. ‘Striped Smith–Waterman Speeds Database Searches Six Times over Other SIMD Implementations’. Bioinformatics 23, no. 2 (15 January 2007): 156–61. doi:10.1093/bioinformatics/btl582.
  • [4] Lawrence, Charles E., Stephen F. Altschul, Mark S. Boguski, Jun S. Liu, Andrew F. Neuwald, and John C. Wootton. ’Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignment’. Science. 1993 Oct 8;262(5131):208-14. doi:10.1126/science.8211139.
  • [5] Touzet, Hélène, and Jean-Stéphane Varré. ‘Efficient and Accurate P-Value Computation for Position Weight Matrices’. Algorithms for Molecular Biology 2, no. 1 (2007): 1–12. doi:10.1186/1748-7188-2-15.
  • [6] Dudek, Christian-Alexander, and Dieter Jahn. ‘PRODORIC: State-of-the-Art Database of Prokaryotic Gene Regulation’. Nucleic Acids Research 50, no. D1 (7 January 2022): D295–302. doi:10.1093/nar/gkab1110.

About

A lightweight platform-accelerated library for biological motif scanning using position weight matrices.

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Contributors 3

  •  
  •  
  •