khmer/doc at master · kmt/khmer

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" />
<title>khmer, a simple k-mer counting library</title>
<style type="text/css">

/*
:Author: David Goodger ([email protected])
:Id: $Id: html4css1.css 5196 2007-06-03 20:25:28Z wiemann $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left {
  clear: left }

img.align-right {
  clear: right }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font-family: serif ;
  font-size: 100% }

pre.literal-block, pre.doctest-block {
  margin-left: 2em ;
  margin-right: 2em }

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document">


<div class="section" id="khmer-a-simple-k-mer-counting-library">
<h1>khmer, a simple k-mer counting library</h1>
<p>khmer is a simple C++ library for counting k-mers in DNA sequences.
It has a complete Python wrapping and should be pretty darned fast;
it's intended for genome-scale k-mer counting.</p>
<p>The current version is <strong>0.2</strong>.  I haven't used it for much myself,
but the test code functions &amp; it should work as advertised.</p>
<p>khmer operates by building a 'ktable', a table of 4**k counters.
It then maps each k-mer into this table with a simple
(and reversible) hash function.</p>
<p>Right now, only the Python interface is documented here.  The C++
interface is essentially identical; if you need to use it and want
it documented, drop me a line.</p>
</div>
<div class="section" id="counting-speed-and-memory-usage">
<h1>Counting Speed and Memory Usage</h1>
<p>On the 5 Mb <em>Shewanella oneidensis</em> genome, khmer takes less than a second
to count all k-mers, for any k between 6 and 12.  At k = 13 it fails
because the table goes over my default stack size limit.</p>
<p>Approximate memory usage can be calculated by finding the size of a
<tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> on your machine and then multiplying that by 4**k.
For a 12bp wordsize, the table has 4**12 = 16384*1024 entries; on an Intel-based
processor running Linux, <tt class="docutils literal"><span class="pre">long</span> <span class="pre">long</span></tt> is 8 bytes, so memory usage
is approximately 128 MB.</p>
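<p>The arithmetic above can be checked with a few lines of Python.  This is
only a sketch: <tt class="docutils literal">ktable_bytes</tt> is a hypothetical helper, not part of khmer,
and it assumes one <tt class="docutils literal">long long</tt> counter per table entry with no other overhead.</p>

```python
import ctypes


def ktable_bytes(k, counter_size=None):
    """Estimate the memory needed for 4**k counters (hypothetical helper)."""
    if counter_size is None:
        # Size of a C `long long` on the machine running this script.
        counter_size = ctypes.sizeof(ctypes.c_longlong)
    return (4 ** k) * counter_size


for k in (6, 12, 13):
    mib = ktable_bytes(k, counter_size=8) / float(2 ** 20)
    print("k=%d: %.3f MiB" % (k, mib))
```

<p>Each increment of k multiplies the estimate by four, which is why k = 13
needs four times the memory of k = 12.</p>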
</div>
<div class="section" id="python-interface">
<h1>Python interface</h1>
<p>Essentially everything requires a <tt class="docutils literal"><span class="pre">ktable</span></tt>.</p>
<pre class="literal-block">
import khmer
ktable = khmer.new_ktable(L)
</pre>
<p>These commands will create a new <tt class="docutils literal"><span class="pre">ktable</span></tt> of size 4**L, suitable
for counting L-mers.</p>
<p>Each <tt class="docutils literal"><span class="pre">ktable</span></tt> object has a few accessor functions:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.ksize()</span></tt> will return L.</li>
<li><tt class="docutils literal"><span class="pre">ktable.max_hash()</span></tt> will return the max hash value in the table, 4**L - 1.</li>
<li><tt class="docutils literal"><span class="pre">ktable.n_entries()</span></tt> will return the number of table entries, 4**L.</li>
</ul>
</blockquote>
<p>The forward and reverse hashing functions are directly accessible:</p>
<blockquote>
<ul>
<li><dl class="first docutils">
<dt><tt class="docutils literal"><span class="pre">hashval</span> <span class="pre">=</span> <span class="pre">ktable.forward_hash(kmer)</span></tt> will return the hash value</dt>
<dd><p class="first last">of the given kmer.</p>
</dd>
</dl>
</li>
<li><dl class="first docutils">
<dt><tt class="docutils literal"><span class="pre">kmer</span> <span class="pre">=</span> <span class="pre">ktable.reverse_hash(hashval)</span></tt> will return the kmer that hashes</dt>
<dd><p class="first last">to the given hashval.</p>
</dd>
</dl>
</li>
</ul>
</blockquote>
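<p>To make the idea of a reversible hash concrete, here is a minimal pure-Python
sketch that does not call khmer.  It assumes the encoding A=0, C=1, G=2, T=3
with the first base most significant; khmer's actual bit assignment may differ,
and the two functions below are illustrative stand-ins, not the library's code.</p>

```python
# Assumed encoding for illustration only; khmer's real mapping may differ.
ALPHABET = "ACGT"
CODE = {base: i for i, base in enumerate(ALPHABET)}


def forward_hash(kmer):
    """Read the k-mer as a base-4 number, first base most significant."""
    h = 0
    for base in kmer:
        h = h * 4 + CODE[base]  # raises KeyError on anything but A/C/G/T
    return h


def reverse_hash(hashval, k):
    """Recover the k-mer of length k that maps to hashval."""
    bases = []
    for _ in range(k):
        bases.append(ALPHABET[hashval % 4])
        hashval //= 4
    return "".join(reversed(bases))


print(forward_hash("ACGT"))
print(reverse_hash(forward_hash("ACGT"), 4))
```

<p>Under this encoding every L-mer maps to a distinct integer between 0 and
4**L - 1, which matches the table size described above.</p>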
<p>There are also some counting functions:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.count(kmer)</span></tt> will increment the count associated with the given kmer
by one.</li>
<li><tt class="docutils literal"><span class="pre">ktable.consume(sequence)</span></tt> will run through the sequence and count
each kmer present.</li>
<li><tt class="docutils literal"><span class="pre">n</span> <span class="pre">=</span> <span class="pre">ktable.get(kmer|hashval)</span></tt> will return the count associated with the
given kmer string or the given hashval, whichever is passed in.</li>
<li><tt class="docutils literal"><span class="pre">ktable.set(kmer|hashval,</span> <span class="pre">count)</span></tt> will set the count for the given kmer
string or hashval.</li>
</ul>
</blockquote>
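<p>The behaviour of the counting functions can be mimicked in pure Python.
The class below is a hypothetical stand-in written for illustration, not the
khmer implementation; it assumes the encoding A=0, C=1, G=2, T=3 and that
<tt class="docutils literal">consume</tt> counts every overlapping k-mer window.</p>

```python
class SketchKTable(object):
    """Pure-Python stand-in for illustration; not the khmer implementation."""

    CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed encoding

    def __init__(self, k):
        self.k = k
        self.counts = [0] * (4 ** k)  # one counter per possible k-mer

    def _index(self, key):
        # Accept either a hash value (int) or a k-mer string.
        if isinstance(key, int):
            return key
        h = 0
        for base in key:
            h = h * 4 + self.CODE[base]
        return h

    def count(self, kmer):
        self.counts[self._index(kmer)] += 1

    def consume(self, sequence):
        # Slide a window of width k across the sequence, one base at a time.
        for i in range(len(sequence) - self.k + 1):
            self.count(sequence[i:i + self.k])

    def get(self, key):
        return self.counts[self._index(key)]

    def set(self, key, value):
        self.counts[self._index(key)] = value


table = SketchKTable(2)
table.consume("ATATA")
print(table.get("AT"), table.get("TA"))
```

<p>A sequence of length n contributes n - k + 1 windows, so the five-base
string above yields four 2-mers.</p>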
<p>In all of the cases above, 'kmer' is an L-length string, 'hashval' is
a non-negative integer, and 'sequence' is a DNA sequence containing ONLY
A/C/G/T.</p>
<p><strong>Note:</strong> 'N' is not a legal DNA character as far as khmer is concerned!</p>
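<p>One way to cope with this restriction is to split the input on anything
that is not A/C/G/T before counting.  The helper below is a hypothetical
preprocessing sketch, not part of khmer; upper-casing the input is an
assumption, and k-mers that would span an 'N' are simply dropped.</p>

```python
import re


def acgt_runs(sequence, k):
    """Yield maximal A/C/G/T-only runs that are at least k bases long."""
    for run in re.split("[^ACGT]+", sequence.upper()):
        if len(run) >= k:
            yield run


for run in acgt_runs("ACGTNNACNGGTTA", 3):
    print(run)
```

<p>Each run yielded this way could then be passed on its own to
<tt class="docutils literal">ktable.consume()</tt>.</p>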
<p>And, finally, there are some set operations:</p>
<blockquote>
<ul class="simple">
<li><tt class="docutils literal"><span class="pre">ktable.clear()</span></tt> empties the ktable.</li>
<li><tt class="docutils literal"><span class="pre">ktable.update(other)</span></tt> adds all of the entries in <tt class="docutils literal"><span class="pre">other</span></tt> into
<tt class="docutils literal"><span class="pre">ktable</span></tt>.  The wordsize must be the same for both ktables.</li>
<li><tt class="docutils literal"><span class="pre">intersection</span> <span class="pre">=</span> <span class="pre">ktable.intersect(other)</span></tt> returns a ktable where
only the entries that are nonzero in both ktables are kept.  The count for each
entry is the sum of the counts in <tt class="docutils literal"><span class="pre">ktable</span></tt> and <tt class="docutils literal"><span class="pre">other</span></tt>.</li>
</ul>
</blockquote>
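<p>The semantics of <tt class="docutils literal">update</tt> and <tt class="docutils literal">intersect</tt> described above can be
sketched on plain Python lists of counts.  These two functions are
illustrative stand-ins, not khmer's code; the length check stands in for the
requirement that both tables use the same wordsize.</p>

```python
def update(counts, other):
    """Add every count in `other` into `counts`, in place."""
    if len(counts) != len(other):
        raise ValueError("tables must have the same wordsize")
    for i, n in enumerate(other):
        counts[i] += n


def intersect(counts, other):
    """Keep entries that are nonzero in both tables; sum their counts."""
    if len(counts) != len(other):
        raise ValueError("tables must have the same wordsize")
    return [a + b if (a and b) else 0 for a, b in zip(counts, other)]


a = [2, 0, 1, 0]
b = [1, 3, 0, 0]
print(intersect(a, b))
update(a, b)
print(a)
```

<p>Note that <tt class="docutils literal">intersect</tt> builds a new table, while <tt class="docutils literal">update</tt> modifies the
table it is called on.</p>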
</div>
<div class="section" id="an-example">
<h1>An Example</h1>
<p>This short code example will count all 6-mers present in the given
DNA sequence, and then print them all out along with their prevalence.</p>
<pre class="literal-block">
import khmer

# make a new ktable, L=6
ktable = khmer.new_ktable(6)

# count all k-mers in the given string
ktable.consume(&quot;ATGAGAGACACAGGGAGAGACCCAATTAGAGAATTGGACC&quot;)

# run through all entries. if they have nonzero presence, print.
for i in range(ktable.n_entries()):
    n = ktable.get(i)
    if n:
        # %-formatting prints the same line under Python 2 and Python 3
        print(&quot;%s is present %d times.&quot; % (ktable.reverse_hash(i), n))
</pre>
<p>And that's all, folks... Let me know if there's other functionality that
you think is important.</p>
<pre class="literal-block">
CTB, 3/2005
</pre>
</div>
</div>
</body>
</html>