<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" />
<title>khmer, a simple k-mer counting library</title>
<style type="text/css">

/*
:Author: David Goodger ([email protected])
:Id: $Id: html4css1.css 5196 2007-06-03 20:25:28Z wiemann $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left {
  clear: left }

img.align-right {
  clear: right }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font-family: serif ;
  font-size: 100% }

pre.literal-block, pre.doctest-block {
  margin-left: 2em ;
  margin-right: 2em }

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document">


<div class="section" id="khmer-a-simple-k-mer-counting-library">
<h1>khmer, a simple k-mer counting library</h1>
<p>khmer is a simple C++ library for counting k-mers in DNA sequences.
It has a complete Python wrapping and should be pretty darned fast;
it's intended for genome-scale k-mer counting.</p>
<p>The current version is <strong>0.2</strong>.  I haven't used it for much myself,
but the test code runs and it should work as advertised.</p>
<p>khmer operates by building a 'ktable', a table of 4**k counters.
It then maps each k-mer into this table with a simple
(and reversible) hash function.</p>
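<p>To make the idea concrete, here is a minimal pure-Python sketch of one
such reversible mapping: treat each base as a base-4 digit.  The base
order (A, C, G, T) and the explicit <tt class="docutils literal"><span class="pre">k</span></tt> argument to the reverse function
are assumptions made for illustration; khmer's own hash may order the
bases differently.</p>

```python
# Hypothetical 2-bit encoding; khmer's actual base ordering may differ.
BASES = "ACGT"


def forward_hash(kmer):
    """Map a k-mer over A/C/G/T to an integer in range(4**k)."""
    value = 0
    for base in kmer:
        # BASES.index raises ValueError for anything outside A/C/G/T, e.g. 'N'.
        value = value * 4 + BASES.index(base)
    return value


def reverse_hash(value, k):
    """Recover the k-mer of length k that forward_hash maps to value."""
    letters = []
    for _ in range(k):
        value, digit = divmod(value, 4)
        letters.append(BASES[digit])
    # Digits come out least-significant first, so reverse them.
    return "".join(reversed(letters))
```

<p>Because every k-mer gets its own integer between 0 and 4**k - 1, the
hash value can be used directly as an index into the table of counters.</p>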
<p>Right now, only the Python interface is documented here.  The C++
interface is essentially identical; if you need to use it and want
it documented, drop me a line.</p>
</div>
<div class="section" id="counting-speed-and-memory-usage">
<h1>Counting Speed and Memory Usage</h1>
<p>On the 5 Mb <em>Shewanella oneidensis</em> genome, khmer takes less than a second
to count all k-mers, for any k between 6 and 12.  At k = 13 it fails
because the table exceeds my default stack size limit.</p>
<p>Approximate memory usage can be calculated by finding the size of a
<tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> on your machine and then multiplying that by 4**k.
For a 12bp wordsize, this works out to 16384*1024 entries; on an Intel-based
processor running Linux, <tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> is 8 bytes, so memory usage
is approximately 128 MB.</p>
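<p>The same arithmetic can be sketched in a few lines of Python.  The
function name <tt class="docutils literal"><span class="pre">ktable_bytes</span></tt> is made up for this example, and it only
counts the counters themselves, ignoring any other overhead.</p>

```python
import ctypes


def ktable_bytes(k, counter_bytes=ctypes.sizeof(ctypes.c_longlong)):
    """Estimate table size in bytes: one counter per possible k-mer."""
    return 4 ** k * counter_bytes


for k in (6, 12, 13):
    print("k=%d: %.3f MiB" % (k, ktable_bytes(k) / 2.0 ** 20))
```

<p>With 8-byte counters, k = 12 gives 128 MiB, and each increment of k
multiplies the table size by four.</p>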
</div>
<div class="section" id="python-interface">
<h1>Python interface</h1>
<p>Essentially everything requires a <tt class="docutils literal"><span class="pre">ktable</span></tt>.</p>
<pre class="literal-block">
import khmer
ktable = khmer.new_ktable(L)
</pre>
<p>These commands will create a new <tt class="docutils literal"><span class="pre">ktable</span></tt> of size 4**L, suitable
for counting L-mers.</p>
<p>Each <tt class="docutils literal"><span class="pre">ktable</span></tt> object has a few accessor functions:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.ksize()</span></tt> will return L.</li>
<li><tt class="docutils literal"><span class="pre">ktable.max_hash()</span></tt> will return the max hash value in the table, 4**L - 1.</li>
<li><tt class="docutils literal"><span class="pre">ktable.n_entries()</span></tt> will return the number of table entries, 4**L.</li>
</ul>
</blockquote>
<p>The forward and reverse hashing functions are directly accessible:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">hashval</span> <span class="pre">=</span> <span class="pre">ktable.forward_hash(kmer)</span></tt> will return the hash value
of the given kmer.</li>
<li><tt class="docutils literal"><span class="pre">kmer</span> <span class="pre">=</span> <span class="pre">ktable.reverse_hash(hashval)</span></tt> will return the kmer that hashes
to the given hashval.</li>
</ul>
</blockquote>
<p>There are also some counting functions:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.count(kmer)</span></tt> will increment the count associated with the given kmer
by one.</li>
<li><tt class="docutils literal"><span class="pre">ktable.consume(sequence)</span></tt> will run through the sequence and count
each kmer present.</li>
<li><tt class="docutils literal"><span class="pre">n</span> <span class="pre">=</span> <span class="pre">ktable.get(kmer|hashval)</span></tt> will return the count associated with the
given kmer string or the given hashval, whichever is passed in.</li>
<li><tt class="docutils literal"><span class="pre">ktable.set(kmer|hashval,</span> <span class="pre">count)</span></tt> will set the count for the given kmer
string or hashval.</li>
</ul>
</blockquote>
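<p>As a rough model of what counting each kmer in a sequence means, the
sketch below slides a window of length k along the sequence and counts
every overlapping k-mer.  It uses a plain <tt class="docutils literal"><span class="pre">collections.Counter</span></tt> as a stand-in
for the table, and the return value is an assumption of this example, not
something the khmer method is documented to do.</p>

```python
from collections import Counter


def consume(counts, sequence, k):
    """Count every overlapping k-mer in sequence; return how many were counted."""
    n_kmers = max(len(sequence) - k + 1, 0)
    for start in range(n_kmers):
        counts[sequence[start:start + k]] += 1
    return n_kmers


counts = Counter()
consume(counts, "ACGTACGT", 4)
print(counts["ACGT"])  # the window ACGT occurs at both ends of the sequence
```

<p>A sequence of length n contributes n - k + 1 overlapping k-mers, or none
at all if it is shorter than k.</p>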
<p>In all of the cases above, 'kmer' is an L-length string, 'hashval' is
a non-negative integer, and 'sequence' is a DNA sequence containing ONLY
A/C/G/T.</p>
<p><strong>Note:</strong> 'N' is not a legal DNA character as far as khmer is concerned!</p>
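<p>Real sequence data often contains 'N' or other ambiguity codes, so one
possible workaround is to split the input into A/C/G/T-only stretches
before counting.  The helper below is a hypothetical sketch, not part of
khmer; it also uppercases the input, on the assumption that only
uppercase bases are accepted.</p>

```python
import re


def acgt_runs(sequence, k):
    """Return the maximal A/C/G/T-only stretches long enough to hold a k-mer."""
    runs = re.findall(r"[ACGT]+", sequence.upper())
    return [run for run in runs if len(run) >= k]


print(acgt_runs("ACGTNNACGTAC", 5))
```

<p>Each returned stretch could then be passed to <tt class="docutils literal"><span class="pre">ktable.consume()</span></tt> on its
own, with <tt class="docutils literal"><span class="pre">ktable.ksize()</span></tt> supplying k, so that no k-mer spanning an 'N'
is ever counted.</p>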
<p>And, finally, there are some set operations:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.clear()</span></tt> empties the ktable.</li>
<li><tt class="docutils literal"><span class="pre">ktable.update(other)</span></tt> adds all of the entries in <tt class="docutils literal"><span class="pre">other</span></tt> into
<tt class="docutils literal"><span class="pre">ktable</span></tt>.  The wordsize must be the same for both ktables.</li>
<li><tt class="docutils literal"><span class="pre">intersection</span> <span class="pre">=</span> <span class="pre">ktable.intersect(other)</span></tt> returns a ktable where
only entries that are nonzero in both ktables are kept.  The count for each
entry is the sum of the counts in <tt class="docutils literal"><span class="pre">ktable</span></tt> and <tt class="docutils literal"><span class="pre">other</span></tt>.</li>
</ul>
</blockquote>
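<p>The intended semantics of these two operations can be modelled with
ordinary Python counters.  This is only an illustration: the functions
below are stand-ins written for this example, and reading "adds all of
the entries" as summing the counts is an assumption.</p>

```python
from collections import Counter


def update(table, other):
    """Add every count in other into table, in place."""
    for kmer, n in other.items():
        table[kmer] += n


def intersect(table, other):
    """Keep k-mers with nonzero counts in both tables; counts are summed."""
    return Counter({
        kmer: n + other[kmer]
        for kmer, n in table.items()
        if n and other[kmer]
    })


a = Counter({"AAA": 2, "CCC": 1})
b = Counter({"AAA": 3, "GGG": 4})
print(intersect(a, b))  # only AAA is nonzero in both tables
```

<p>Note that the intersection reports summed counts rather than the
minimum of the two, following the description above.</p>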
</div>
<div class="section" id="an-example">
<h1>An Example</h1>
<p>This short code example will count all 6-mers present in the given
DNA sequence, and then print them all out along with their prevalence.</p>
<pre class="literal-block">
import khmer

# make a new ktable, L=6
ktable = khmer.new_ktable(6)

# count all k-mers in the given string
ktable.consume(&quot;ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC&quot;)

# run through all entries. if they have nonzero presence, print.
for i in range(ktable.n_entries()):
    n = ktable.get(i)
    if n:
        print(&quot;%s is present %d times.&quot; % (ktable.reverse_hash(i), n))
</pre>
<p>And that's all, folks... Let me know if there's other functionality that
you think is important.</p>
<pre class="literal-block">
CTB, 3/2005
</pre>
</div>
</div>
</body>
</html>