
oma219/digest

 
 


✂️ Digest: fast, multi-use $k$-mer sub-sampling library

Figure: Visualization of the different minimizer schemes supported in Digest, with a code example using the library.

What is the Digest library?

  • Digest is a C++ library that supports various sub-sampling schemes for $k$-mers in DNA sequences.
    • Digest uses the rolling hash function from ntHash to order the $k$-mers in a window.
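To illustrate what "rolling" means here, the following is a minimal sketch that updates a $k$-mer value in constant time as the window slides by one base. It uses a plain 2-bit encoding rather than ntHash, and the helper names (`base_code`, `direct_value`, `rolling_values`) are hypothetical and not part of Digest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Map a nucleotide to a 2-bit code (hypothetical helper, not part of Digest).
inline std::uint64_t base_code(char c) {
    switch (c) {
        case 'A': return 0;
        case 'C': return 1;
        case 'G': return 2;
        case 'T': return 3;
        default:  assert(false && "non-ACGT character"); return 0;
    }
}

// Compute the value of one k-mer from scratch: O(k) work.
inline std::uint64_t direct_value(const std::string& seq, std::size_t pos, unsigned k) {
    std::uint64_t v = 0;
    for (unsigned i = 0; i < k; ++i) v = (v << 2) | base_code(seq[pos + i]);
    return v;
}

// Compute the values of all k-mers by rolling: O(1) work per k-mer after the first.
inline std::vector<std::uint64_t> rolling_values(const std::string& seq, unsigned k) {
    assert(k >= 1 && k <= 31);
    std::vector<std::uint64_t> out;
    if (seq.size() < k) return out;
    const std::uint64_t mask = (std::uint64_t{1} << (2 * k)) - 1;
    std::uint64_t v = direct_value(seq, 0, k);
    out.push_back(v);
    for (std::size_t i = k; i < seq.size(); ++i) {
        // Shift out the oldest base, shift in the newest one.
        v = ((v << 2) | base_code(seq[i])) & mask;
        out.push_back(v);
    }
    return out;
}
```

A real rolling hash such as ntHash mixes the bases far more thoroughly, but the update pattern (remove the outgoing base, add the incoming one) is the same idea.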

How to install and build into your project?


Step 1: Install library

After cloning from GitHub, we use the Meson build-system to install the library.

  • PREFIX is the absolute path where the library files (*.h and *.a files) will be installed
    • IMPORTANT: PREFIX should not be the root directory of the Digest/ repo to avoid any issues with installation.
  • These commands generate include and lib folders inside the PREFIX folder
git clone https://github.com/VeryAmazed/digest.git
cd digest

meson setup --prefix=<PREFIX> --buildtype=release build
meson install -C build

Step 2: Include Digest in your project

(a) Using Meson:

If your coding project uses Meson to build the executable(s), you can include a file called subprojects/digest.wrap in your repository and let Meson install it for you.

(b) Using g++:

To use Digest in your C++ project, you just need to include the header files (*.h) and library file (*.a) that were installed in the first step. Assuming that install/ is the directory you installed them in, here is how you can compile.

g++ -std=c++17 -o main main.cpp -I install/include/ -L install/lib -lnthash

Detailed Look at Example Usage (2 ways):

There are three types of minimizer schemes that can be used:

  1. Windowed Minimizer
  2. Modimizer
  3. Syncmer
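As a rough illustration of how the three schemes differ, here is a minimal sketch that selects $k$-mer positions from a vector of hash values. The definitions used are common textbook formulations and are assumptions here; Digest's own tie-breaking and its choice of which position to report may differ. The FNV-1a `toy_hash` is a stand-in for ntHash, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy FNV-1a hash of the k-mer starting at pos (stand-in for ntHash).
inline std::uint64_t toy_hash(const std::string& seq, std::size_t pos, unsigned k) {
    std::uint64_t h = 1469598103934665603ULL;
    for (unsigned i = 0; i < k; ++i) {
        h ^= static_cast<unsigned char>(seq[pos + i]);
        h *= 1099511628211ULL;
    }
    return h;
}

// Hash every k-mer in seq; entry i is the hash of the k-mer starting at i.
inline std::vector<std::uint64_t> all_hashes(const std::string& seq, unsigned k) {
    assert(k >= 1);
    std::vector<std::uint64_t> h;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) h.push_back(toy_hash(seq, i, k));
    return h;
}

// Modimizer: keep k-mers whose hash is divisible by mod.
inline std::vector<std::size_t> modimizers(const std::vector<std::uint64_t>& h,
                                           std::uint64_t mod) {
    assert(mod >= 1);
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < h.size(); ++i)
        if (h[i] % mod == 0) out.push_back(i);
    return out;
}

// Windowed minimizer: keep the smallest-hash k-mer of every window of w k-mers
// (leftmost on ties), reporting each selected position once.
inline std::vector<std::size_t> window_minimizers(const std::vector<std::uint64_t>& h,
                                                  std::size_t w) {
    assert(w >= 1);
    std::vector<std::size_t> out;
    for (std::size_t s = 0; s + w <= h.size(); ++s) {
        auto first = h.begin() + static_cast<std::ptrdiff_t>(s);
        auto last = first + static_cast<std::ptrdiff_t>(w);
        auto best = static_cast<std::size_t>(std::min_element(first, last) - h.begin());
        if (out.empty() || out.back() != best) out.push_back(best);
    }
    return out;
}

// Syncmer (one formulation): keep a window of w k-mers when its smallest hash
// sits at the first or last k-mer of the window; report the window start.
inline std::vector<std::size_t> syncmers(const std::vector<std::uint64_t>& h,
                                         std::size_t w) {
    assert(w >= 1);
    std::vector<std::size_t> out;
    for (std::size_t s = 0; s + w <= h.size(); ++s) {
        auto first = h.begin() + static_cast<std::ptrdiff_t>(s);
        std::uint64_t m = *std::min_element(first, first + static_cast<std::ptrdiff_t>(w));
        if (h[s] == m || h[s + w - 1] == m) out.push_back(s);
    }
    return out;
}
```

The selection functions take precomputed hashes so that the same values can be fed to all three schemes and the chosen positions compared directly.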

The general steps to use Digest are as follows: (1) include the relevant header files, (2) declare the Digest object, and (3) find the positions where the minimizers are present in the sequence.

1. Find positions of minimizers:

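#include <vector>

#include "digest/digester.hpp"
#include "digest/window_minimizer.hpp"

// dna holds the input DNA sequence and is defined elsewhere
digest::WindowMin<digest::BadCharPolicy::WRITEOVER, digest::ds::Adaptive> digester (dna, 15, 7);

// collect the positions of up to 100 minimizers
std::vector<size_t> output;
digester.roll_minimizer(100, output);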
  • This code snippet will find up to 100 Windowed Minimizers and store their positions in the vector called output.
  • digest::BadCharPolicy::WRITEOVER means that any time the code encounters a non-ACTG character, it will replace it with an A.
    • digest::BadCharPolicy::SKIPOVER will skip any $k$-mers with non-ACTG characters.
  • digest::ds::Adaptive is our recommended data-structure for finding the minimum value in a window (see wiki for other options)
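The difference between the two bad-character policies can be sketched as follows. This is an illustration of the behavior described above, not Digest's implementation: the function names are hypothetical, and treating only uppercase A, C, T, G as valid is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption of this sketch: only uppercase A, C, T, G count as valid.
inline bool is_actg(char c) { return c == 'A' || c == 'C' || c == 'T' || c == 'G'; }

// WRITEOVER-style: return a copy with every non-ACTG character replaced by 'A'.
inline std::string write_over(std::string seq) {
    for (char& c : seq)
        if (!is_actg(c)) c = 'A';
    return seq;
}

// SKIPOVER-style: return the start positions of the k-mers that contain only ACTG.
inline std::vector<std::size_t> skip_over_starts(const std::string& seq, std::size_t k) {
    assert(k >= 1);
    std::vector<std::size_t> starts;
    std::size_t run = 0;  // length of the current run of valid characters
    for (std::size_t i = 0; i < seq.size(); ++i) {
        run = is_actg(seq[i]) ? run + 1 : 0;
        if (run >= k) starts.push_back(i + 1 - k);  // the k-mer ending at i is clean
    }
    return starts;
}
```

With WRITEOVER every position still yields a $k$-mer, while with SKIPOVER a single bad character removes every $k$-mer that overlaps it.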

2. Find both positions and hash values of minimizers:

If you would like to obtain both the position and the hash value of each minimizer, pass a vector of integer pairs instead of a vector of positions.

std::vector<std::pair<size_t, size_t>> output;
digester.roll_minimizer(100, output);
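To show what a vector of paired values can look like, here is a minimal sketch that finds window minima with a monotonic deque and returns pairs. The (position, hash) ordering of each pair is an assumption of this sketch, `window_min_pairs` is a hypothetical name, and the deque is just one way to track a window minimum; it is not claimed to be how digest::ds::Adaptive works.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <utility>
#include <vector>

// Sliding-window minimum over k-mer hashes using a monotonic deque.
// Returns (position, hash) pairs, one per distinct window minimum.
inline std::vector<std::pair<std::size_t, std::size_t>>
window_min_pairs(const std::vector<std::size_t>& hashes, std::size_t w) {
    assert(w >= 1);
    std::vector<std::pair<std::size_t, std::size_t>> out;
    std::deque<std::size_t> candidates;  // indices; hashes are non-decreasing front to back
    for (std::size_t i = 0; i < hashes.size(); ++i) {
        // Drop candidates that can never be a window minimum again.
        while (!candidates.empty() && hashes[candidates.back()] > hashes[i])
            candidates.pop_back();
        candidates.push_back(i);
        // Drop the front once it has slid out of the window ending at i.
        while (candidates.front() + w <= i) candidates.pop_front();
        // Once a full window is available, the front is its leftmost minimum.
        if (i + 1 >= w) {
            std::size_t best = candidates.front();
            if (out.empty() || out.back().first != best)
                out.emplace_back(best, hashes[best]);
        }
    }
    return out;
}
```

Each hash enters and leaves the deque at most once, so the whole pass is linear in the number of $k$-mers regardless of the window size.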
