doc
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
<?xml version="1.0" encoding="utf-8" ?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> <meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" /> <title></title> <style type="text/css"> /* :Author: David Goodger ([email protected]) :Id: $Id: html4css1.css 5196 2007-06-03 20:25:28Z wiemann $ :Copyright: This stylesheet has been placed in the public domain. Default cascading style sheet for the HTML output of Docutils. See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to customize this style sheet. */ /* used to remove borders from tables and images */ .borderless, table.borderless td, table.borderless th { border: 0 } table.borderless td, table.borderless th { /* Override padding for "table.docutils td" with "! important". The right padding separates the table cells. */ padding: 0 0.5em 0 0 ! important } .first { /* Override more specific margin styles with "! important". */ margin-top: 0 ! important } .last, .with-subtitle { margin-bottom: 0 ! important } .hidden { display: none } a.toc-backref { text-decoration: none ; color: black } blockquote.epigraph { margin: 2em 5em ; } dl.docutils dd { margin-bottom: 0.5em } /* Uncomment (and remove this text!) to get bold-faced definition list terms dl.docutils dt { font-weight: bold } */ div.abstract { margin: 2em 5em } div.abstract p.topic-title { font-weight: bold ; text-align: center } div.admonition, div.attention, div.caution, div.danger, div.error, div.hint, div.important, div.note, div.tip, div.warning { margin: 2em ; border: medium outset ; padding: 1em } div.admonition p.admonition-title, div.hint p.admonition-title, div.important p.admonition-title, div.note p.admonition-title, div.tip p.admonition-title { font-weight: bold ; font-family: sans-serif } div.attention p.admonition-title, div.caution p.admonition-title, div.danger p.admonition-title, div.error p.admonition-title, div.warning p.admonition-title { color: red ; font-weight: bold ; font-family: sans-serif } /* Uncomment (and remove this text!) to get reduced vertical space in compound paragraphs. div.compound .compound-first, div.compound .compound-middle { margin-bottom: 0.5em } div.compound .compound-last, div.compound .compound-middle { margin-top: 0.5em } */ div.dedication { margin: 2em 5em ; text-align: center ; font-style: italic } div.dedication p.topic-title { font-weight: bold ; font-style: normal } div.figure { margin-left: 2em ; margin-right: 2em } div.footer, div.header { clear: both; font-size: smaller } div.line-block { display: block ; margin-top: 1em ; margin-bottom: 1em } div.line-block div.line-block { margin-top: 0 ; margin-bottom: 0 ; margin-left: 1.5em } div.sidebar { margin: 0 0 0.5em 1em ; border: medium outset ; padding: 1em ; background-color: #ffffee ; width: 40% ; float: right ; clear: right } div.sidebar p.rubric { font-family: sans-serif ; font-size: medium } div.system-messages { margin: 5em } div.system-messages h1 { color: red } div.system-message { border: medium outset ; padding: 1em } div.system-message p.system-message-title { color: red ; font-weight: bold } div.topic { margin: 2em } h1.section-subtitle, h2.section-subtitle, h3.section-subtitle, h4.section-subtitle, h5.section-subtitle, h6.section-subtitle { margin-top: 0.4em } h1.title { text-align: center } h2.subtitle { text-align: center } hr.docutils { width: 75% } img.align-left { clear: left } img.align-right { clear: right } ol.simple, ul.simple { margin-bottom: 1em } ol.arabic { list-style: decimal } ol.loweralpha { list-style: lower-alpha } ol.upperalpha { list-style: upper-alpha } ol.lowerroman { list-style: lower-roman } ol.upperroman { list-style: upper-roman } p.attribution { text-align: right ; margin-left: 50% } p.caption { font-style: italic } p.credits { font-style: italic ; font-size: smaller } p.label { white-space: nowrap } p.rubric { font-weight: bold ; font-size: larger ; color: maroon ; text-align: center } p.sidebar-title { font-family: sans-serif ; font-weight: bold ; font-size: larger } p.sidebar-subtitle { font-family: sans-serif ; font-weight: bold } p.topic-title { font-weight: bold } pre.address { margin-bottom: 0 ; margin-top: 0 ; font-family: serif ; font-size: 100% } pre.literal-block, pre.doctest-block { margin-left: 2em ; margin-right: 2em } span.classifier { font-family: sans-serif ; font-style: oblique } span.classifier-delimiter { font-family: sans-serif ; font-weight: bold } span.interpreted { font-family: sans-serif } span.option { white-space: nowrap } span.pre { white-space: pre } span.problematic { color: red } span.section-subtitle { /* font-size relative to parent (h1..h6 element) */ font-size: 80% } table.citation { border-left: solid 1px gray; margin-left: 1px } table.docinfo { margin: 2em 4em } table.docutils { margin-top: 0.5em ; margin-bottom: 0.5em } table.footnote { border-left: solid 1px black; margin-left: 1px } table.docutils td, table.docutils th, table.docinfo td, table.docinfo th { padding-left: 0.5em ; padding-right: 0.5em ; vertical-align: top } table.docutils th.field-name, table.docinfo th.docinfo-name { font-weight: bold ; text-align: left ; white-space: nowrap ; padding-left: 0 } h1 tt.docutils, h2 tt.docutils, h3 tt.docutils, h4 tt.docutils, h5 tt.docutils, h6 tt.docutils { font-size: 100% } ul.auto-toc { list-style-type: none } </style> </head> <body> <div class="document"> <div class="section" id="khmer-a-simple-k-mer-counting-library"> <h1>khmer, a simple k-mer counting library</h1> <p>khmer is a simple C++ library for counting k-mers in DNA sequences. It has a complete Python wrapping and should be pretty darned fast; it's intended for genome-scale k-mer counting.</p> <p>The current version is <strong>0.2</strong>. I haven't used it for much myself, but the test code functions & it should work as advertised.</p> <p>khmer operates by building a 'ktable', a table of 4**k counters. It then maps each k-mer into this table with a simple (and reversible) hash function.</p> <p>Right now, only the Python interface is documented here. The C++ interface is essentially identical; if you need to use it and want it documented, drop me a line.</p> </div> <div class="section" id="counting-speed-and-memory-usage"> <h1>Counting Speed and Memory Usage</h1> <p>On the 5 mb <em>Shewanella oneidensis</em> genome, khmer takes less than a second to count all k-mers, for any k between 6 and 12. At 13 it craps out because the table goes over my default stack size limit.</p> <p>Approximate memory usage can be calculated by finding the size of a <tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> on your machine and then multiplying that by 4**k. For a 12bp wordsize, this works out to 16384*1024; on an Intel-based processor running Linux, <tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> is 8 bytes, so memory usage is approximately 128 mb.</p> </div> <div class="section" id="python-interface"> <h1>Python interface</h1> <p>Essentially everything requires a <tt class="docutils literal"><span class="pre">ktable</span></tt>.</p> <pre class="literal-block"> import khmer ktable = khmer.new_ktable(L) </pre> <p>These commands will create a new <tt class="docutils literal"><span class="pre">ktable</span></tt> of size 4**L, suitable for counting L-mers.</p> <p>Each <tt class="docutils literal"><span class="pre">ktable</span></tt> object has a few accessor functions:</p> <blockquote> <ul class="simple"> <li><tt class="docutils literal"><span class="pre">ktable.ksize()</span></tt> will return L.</li> <li><tt class="docutils literal"><span class="pre">ktable.max_hash()</span></tt> will return the max hash value in the table, 4**L - 1.</li> <li><tt class="docutils literal"><span class="pre">ktable.n_entries()</span></tt> will return the number of table entries, 4**L.</li> </ul> </blockquote> <p>The forward and reverse hashing functions are directly accessible:</p> <blockquote> <ul> <li><dl class="first docutils"> <dt><tt class="docutils literal"><span class="pre">hashval</span> <span class="pre">=</span> <span class="pre">ktable.forward_hash(kmer)</span></tt> will return the hash value</dt> <dd><p class="first last">of the given kmer.</p> </dd> </dl> </li> <li><dl class="first docutils"> <dt><tt class="docutils literal"><span class="pre">kmer</span> <span class="pre">=</span> <span class="pre">ktable.reverse_hash(hashval)</span></tt> will return the kmer that hashes</dt> <dd><p class="first last">to the given hashval.</p> </dd> </dl> </li> </ul> </blockquote> <p>There are also some counting functions:</p> <blockquote> <ul class="simple"> <li><tt class="docutils literal"><span class="pre">ktable.count(kmer)</span></tt> will increment the count associated with the given kmer by one.</li> <li><tt class="docutils literal"><span class="pre">ktable.consume(sequence)</span></tt> will run through the sequence and count each kmer present.</li> <li><tt class="docutils literal"><span class="pre">n</span> <span class="pre">=</span> <span class="pre">ktable.get(kmer|hashval)</span></tt> will return the count associated with the given kmer string or the given hashval, whichever is passed in.</li> <li><tt class="docutils literal"><span class="pre">ktable.set(kmer|hashval,</span> <span class="pre">count)</span></tt> set the count for the given kmer string or hashval.</li> </ul> </blockquote> <p>In all of the cases above, 'kmer' is an L-length string, 'hashval' is a non-negative integer, and 'sequence' is a DNA sequence containg ONLY A/C/G/T.</p> <p><strong>Note:</strong> 'N' is not a legal DNA character as far as khmer is concerned!</p> <p>And, finally, there are some set operations:</p> <blockquote> <ul class="simple"> <li><tt class="docutils literal"><span class="pre">ktable.clear()</span></tt> empties the ktable.</li> <li><tt class="docutils literal"><span class="pre">ktable.update(other)</span></tt> adds all of the entries in <tt class="docutils literal"><span class="pre">other</span></tt> into <tt class="docutils literal"><span class="pre">ktable</span></tt>. The wordsize must be the same for both ktables.</li> <li><tt class="docutils literal"><span class="pre">intersection</span> <span class="pre">=</span> <span class="pre">ktable.intersect(other)</span></tt> returns a ktable where only nonzero entries in both ktables are kept. The count for ach entry is the sum of the counts in <tt class="docutils literal"><span class="pre">ktable</span></tt> and <tt class="docutils literal"><span class="pre">other</span></tt>.</li> </ul> </blockquote> </div> <div class="section" id="an-example"> <h1>An Example</h1> <p>This short code example will count all 6-mers present in the given DNA sequence, and then print them all out along with their prevalence.</p> <pre class="literal-block"> # make a new ktable, L=6 ktable = khmer.new_ktable(6) # count all k-mers in the given string ktable.consume("ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC") # run through all entries. if they have nonzero presence, print. for i in range(0, ktable.n_entries()): n = ktable.get(i) if n: print ktable.reverse_hash(i), "is present", n, "times." </pre> <p>And that's all, folks... Let me know if there's other functionality that you think is important.</p> <pre class="literal-block"> CTB, 3/2005 </pre> </div> </div> </body> </html>