GitHub - paulsmith/templatemaker: templatemaker is a Python library that can extract data from files with a similar format, like HTML pages.

code.google.com/p/templatemaker/

=============
templatemaker
=============

Given a list of text files in a similar format, templatemaker creates a
template that can extract data from files in that same format.

The underlying longest-common-substring algorithm is implemented in C
for performance.

Installation
============

Because part of templatemaker is implemented in C, you'll need to compile the
C portion of it. Fortunately, Python makes this easy.

To install systemwide:

    sudo python setup.py install

To play around with it without having to install systemwide:

    python setup.py build
    cp build/lib*/_templatemaker.so .
    rm -rf build

Overview
========

The templatemaker.py module provides a class, Template, that is capable of two
things:

    * Given an arbitrary number of strings (called "Sample Strings"), it
      determines the "least-common denominator" template -- that is, a string
      representing all the common substrings, with placeholders ("holes") for
      areas in the Sample Strings that differ.

      For example, these three Sample Strings...
      
        '<b>this and that</b>'
        '<b>alex and sue</b>'
        '<b>fine and dandy</b>'

      ...would produce the following template, where "!" represents a "hole."

        '<b>! and !</b>'

    * Once you have a template, you can give it data that is formatted
      according to that template, and it will "extract" a tuple of the "hole"
      values for that particular string.
      
      Following the above example, giving the template the following data:
      
        '<b>larry and curly</b>'

      ...would produce the following tuple:
      
        ('larry', 'curly')

      You can interpret this as "Hole 0's value is 'larry', and hole 1's value
      is 'curly'."
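The extraction step can be approximated in pure Python, which may help if you want to see the idea before building the C extension. This is only a sketch written for Python 3, and it is not how templatemaker works internally: it converts a template string (with a visible hole string) into a regular expression. The names ``extract_with_regex``, ``template_text`` and ``hole`` are hypothetical and are not part of the library.

```python
import re


def extract_with_regex(template_text, hole, data):
    """Return the hole values of `data` as a tuple of strings.

    Assumes `hole` does not also occur as literal text in the template.
    """
    # Split the template into the literal pieces that surround each hole.
    literals = template_text.split(hole)
    # Escape the literal pieces and put a non-greedy capture group
    # where each hole was.
    pattern = '(.*?)'.join(re.escape(part) for part in literals)
    match = re.fullmatch(pattern, data, flags=re.DOTALL)
    if match is None:
        raise ValueError('data does not match the template')
    return match.groups()


print(extract_with_regex('<b>! and !</b>', '!', '<b>larry and curly</b>'))
# prints ('larry', 'curly')
```

Like the real extract() method described below, this sketch is literal about whitespace and markup; it simply captures whatever sits between the fixed pieces of text.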

Basic Python API
================

Here's how to express the above example in Python code:

    # Import the Template class.
    >>> from templatemaker import Template

    # Create a Template instance.
    >>> t = Template()

    # Learn a Sample String.
    >>> t.learn('<b>this and that</b>')

    # Output the template so far, using the "!" character to mark holes.
    # We've only learned a single string, so the template has no holes.
    >>> t.as_text('!')
    '<b>this and that</b>'

    # Learn another string. The True return value means the template gained
    # at least one hole.
    >>> t.learn('<b>alex and sue</b>')
    True

    # Sure enough, the template now has some holes.
    >>> t.as_text('!')
    '<b>! and !</b>'

    # Learn another string. This time, the return value is False, which means
    # the template didn't gain any holes.
    >>> t.learn('<b>fine and dandy</b>')
    False

    # The template is the same as before.
    >>> t.as_text('!')
    '<b>! and !</b>'

    # Now that we have a template, let's extract some data.
    >>> t.extract('<b>red and green</b>')
    ('red', 'green')
    >>> t.extract('<b>django and stephane</b>')
    ('django', 'stephane')

    # The extract() method is very literal. It doesn't magically trim
    # whitespace, nor does it have any knowledge of markup languages such as
    # HTML.
    >>> t.extract('<b>  spacy  and <u>underlined</u></b>')
    ('  spacy ', '<u>underlined</u>')

    # The extract() method will raise the NoMatch exception if the data
    # doesn't match the template. In this example, the data doesn't have the
    # leading "<b>" and trailing "</b>" tags.
    >>> t.extract('this and that')
    Traceback (most recent call last):
    ...
    NoMatch

    # Use the extract_dict() method to get a dictionary instead of a tuple.
    # extract_dict() requires that you specify a tuple of field names.
    >>> t.extract_dict('<b>red and green</b>', ('value1', 'value2'))
    {'value1': 'red', 'value2': 'green'}

    # You can specify None as one or more of the field-name values in
    # extract_dict(). Any field whose value is None will *not* be included
    # in the resulting dictionary.
    >>> t.extract_dict('<b>red and green</b>', ('value1', None))
    {'value1': 'red'}
    >>> t.extract_dict('<b>red and green</b>', (None, 'value2'))
    {'value2': 'green'}
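The field-name behavior of extract_dict() can be pictured as pairing each hole value with its name and skipping the pairs whose name is None. The helper below is a hypothetical illustration in Python 3, not the library's code; in particular, requiring the two tuples to have the same length is an assumption made for this sketch.

```python
def values_to_dict(values, field_names):
    """Pair hole values with field names, dropping fields named None."""
    # Assumption for this sketch: one field name per hole value.
    if len(values) != len(field_names):
        raise ValueError('need exactly one field name per hole')
    return {
        name: value
        for name, value in zip(field_names, values)
        if name is not None
    }


print(values_to_dict(('red', 'green'), ('value1', None)))
# prints {'value1': 'red'}
```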

The as_text() method
====================

The as_text() method displays the template as a string, with holes represented
by a string of your choosing.

    >>> t = Template()
    >>> t.learn('123 and 456')
    >>> t.learn('456 and 789')
    True

    # Get the template with "!" as the hole-representing string.
    >>> t.as_text('!')
    '! and !'

    # Get the template with "{{ var }}" as the hole-representing string.
    >>> t.as_text('{{ var }}')
    '{{ var }} and {{ var }}'

Note that as_text() does *not* perform any escaping if your template contains
the hole-representing string:

    >>> t = Template()
    >>> t.learn('Yes!')
    >>> t.learn('No!')
    True

    # Here, we use "!" as the hole-representing string, and the template
    # contains a literal "!" character. The literal character is not escaped,
    # so there is no way to tell apart the literal template character from
    # the hole-representing string.
    >>> t.as_text('!')
    '!!'

    # Here, we use an underscore to demonstrate that the template contains a
    # literal "!" character. This wasn't detectable in the previous statement.
    >>> t.as_text('_')
    '_!'

With this in mind, you should choose a string in as_text() that is highly
unlikely to appear in your template. But, in any case, you shouldn't rely on
the output of as_text() for use in programs; it's solely intended to be a
visual aid for humans to see their templates. (The template-maker code has its
own internal way of representing holes, and it's guaranteed to be unambiguous.
See "The marker character" below.)
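To make the ambiguity concrete, here is a small Python 3 sketch that models a template as the list of literal text pieces around its holes. This representation and the names ``render`` and ``is_ambiguous`` are assumptions made for illustration; the library stores templates differently.

```python
def render(literals, hole):
    """Join the literal pieces, writing `hole` wherever a hole sits."""
    return hole.join(literals)


def is_ambiguous(literals, hole):
    """True if `hole` also occurs inside the literal text itself."""
    return any(hole in piece for piece in literals)


# The template learned from 'Yes!' and 'No!' is one hole followed by "!".
yes_no = ['', '!']

print(render(yes_no, '!'), is_ambiguous(yes_no, '!'))
# prints !! True
print(render(yes_no, '_'), is_ambiguous(yes_no, '_'))
# prints _! False
```

A check like this is one way to pick a hole string for display, but the advice above still holds: treat the rendered text as a visual aid only.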

Tolerance
=========

This template-making algorithm can often be "too literal" for one's liking.
For example, given these three Sample Strings...

    'my favorite color is blue'
    'my favorite color is violet'
    'my favorite color is purple'

The color is the only text that changes, so one would assume the template would
be the following:

    'my favorite color is !'

Let's see what actually happens:

    >>> t = Template()
    >>> t.learn('my favorite color is blue')
    >>> t.learn('my favorite color is violet')
    True
    >>> t.learn('my favorite color is purple')
    False
    >>> t.as_text('!')
    'my favorite color is !l!e!'

Aha, the template-maker was a bit too literal -- it noticed that the strings
"blue", "violet" and "purple" all contain *something*, then the letter "l",
then *something*, then the letter "e", then *something*. Technically, that's
correct, but for most applications, this approach misses the forest for the
trees.

There are two ways to solve this problem:

    * Teach a template many, many Sample Strings. The more diverse your input,
      the less likely this will happen.

    * Use a feature called *tolerance*.

The template-maker algorithm lets you specify a tolerance -- the minimum
allowed length of text between holes. This gives you control over avoiding the
problem.

To specify tolerance, just pass the ``tolerance`` argument to the Template
constructor. Here's the above example with tolerance=1:

    >>> t = Template(tolerance=1)
    >>> t.learn('my favorite color is blue')
    >>> t.learn('my favorite color is violet')
    True
    >>> t.learn('my favorite color is purple')
    False
    >>> t.as_text('!')
    'my favorite color is !'

Aha! Now, that's more like it.
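The effect of tolerance can be reproduced with a short pure-Python sketch. It is written for Python 3, compares only two strings, and uses the standard library's difflib instead of the C code, so treat it as an illustration of the idea and not as the library's implementation. The names ``make_template``, ``as_text`` and ``HOLE`` are hypothetical, and treating common text of length ``tolerance`` or shorter as part of a hole is an assumption chosen to match the examples above.

```python
from difflib import SequenceMatcher

HOLE = object()  # sentinel standing in for a hole


def make_template(a, b, tolerance=0):
    """Return a list of literal strings and HOLE sentinels for a and b."""
    if not a and not b:
        return []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    # Assumption: common text no longer than `tolerance` is not worth
    # keeping, so the whole differing region collapses into one hole.
    if match.size <= tolerance:
        return [HOLE]
    common = a[match.a:match.a + match.size]
    # Recurse on the text to the left and right of the common substring.
    left = make_template(a[:match.a], b[:match.b], tolerance)
    right = make_template(a[match.a + match.size:],
                          b[match.b + match.size:], tolerance)
    return left + [common] + right


def as_text(parts, hole):
    """Render the parts as a string, writing `hole` for each HOLE."""
    return ''.join(hole if part is HOLE else part for part in parts)


one = 'my favorite color is blue'
two = 'my favorite color is violet'
print(as_text(make_template(one, two), '!'))
# prints my favorite color is !l!e!
print(as_text(make_template(one, two, tolerance=1), '!'))
# prints my favorite color is !
```

With tolerance=0 the single letters "l" and "e" survive as literal text; with tolerance=1 they are too short to keep, so the color becomes a single hole.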

Setting tolerance is useful for small cases like this, but note that there is a
risk of throwing the baby out with the bathwater, depending on how high your
tolerance is set. If the tolerance is set too high, your output template might
be "watered down" -- less accurate than it possibly could be.

For example, say we have a list of HTML strings representing significant
events:

    <p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>
    <p>My wife's birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>
    <p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>

Say we'd like to extract five pieces of data from each string: the event, the
HTML color, the month, the day and the year. Here's what we might do in Python:

    >>> t = Template(tolerance=5)
    >>> t.learn('<p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>')
    >>> t.learn('<p>My wife\'s birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>')
    True
    >>> t.learn('<p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>')
    True
    >>> t.as_text('!')
    '! <span style="color: !</span>.</p>'

This resulting template is correct, but it's watered down. You can tell by
extracting some data:

    >>> t.extract('<p>My sister\'s birthday is <span style="color: pink;">Jul. 12, 1952</span>.</p>')
    ("<p>My sister's birthday is", 'pink;">Jul. 12, 1952')

This data is messy. Namely, the color of the <span> is in the same data field
as the event date. Assuming we're interested in getting as granular as
possible in our parsing, it would be better to set a lower tolerance.

    >>> t = Template(tolerance=1)
    >>> t.learn('<p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>')
    >>> t.learn('<p>My wife\'s birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>')
    True
    >>> t.learn('<p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>')
    False
    >>> t.as_text('!')
    '<p>! <span style="color: !;">!. !, 19!</span>.</p>'
    >>> t.extract('<p>My sister\'s birthday is <span style="color: pink;">Jul. 12, 1952</span>.</p>')
    ("My sister's birthday is", 'pink', 'Jul', '12', '52')

Much better.

Using tolerance is all about tradeoffs. To use this feature most effectively,
you'll need to experiment and consider the nature of the data you're parsing.

Versions
========

A Template instance keeps track of how many Sample Strings it has learned.
You can access this via the ``version`` attribute.

    >>> t = Template()
    >>> t.version
    0
    >>> t.learn('My name is Paul.')
    >>> t.version
    1
    >>> t.learn('My name is Jonas.')
    True
    >>> t.version
    2

The marker character
====================

The template-maker algorithm, implemented in C, works by comparing two strings
one byte at a time. Each time you call learn(some_string) on a template object,
the underlying C algorithm compares some_string to the *current template*.

A template internally represents each hole with a *marker character* -- a
character that is guaranteed not to appear in either string. This is set to the
character "\x1f".

In order to guarantee that the marker character doesn't appear in a string, the
Template's learn() method removes any occurrence of the marker character from
the input string before running the comparison algorithm.
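The marker scheme described above can be pictured with two tiny helpers. This is a Python 3 sketch of the idea only; ``strip_marker`` and ``render`` are hypothetical names, and the real work happens in the library's learn() and as_text() methods and in the C code.

```python
MARKER = '\x1f'  # the marker character named in this README


def strip_marker(sample):
    """Remove any marker characters from an input string."""
    return sample.replace(MARKER, '')


def render(template, hole):
    """Show a marker-based template with a visible hole string."""
    return template.replace(MARKER, hole)


internal = '<b>' + MARKER + ' and ' + MARKER + '</b>'
print(render(internal, '!'))
# prints <b>! and !</b>
print(strip_marker('a' + MARKER + 'b'))
# prints ab
```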

The advantage of this "dumb" approach is its simplicity: the C implementation
doesn't have to treat markers as a special case. As a result, the C code is
cleaner and faster. Also, it was easier to write. :)

However, there are two disadvantages:

    * First, a template effectively cannot contain the literal marker
      character.

      In practice, this is *highly* unlikely to be a problem, because the
      marker character is obscure.

    * Second, in the slight chance that a template contains a multi-byte character
      (e.g., Unicode) that contains the marker character as one of its bytes,
      the template-maker algorithm will split that Unicode character at that
      byte. This happens because the underlying C implementation compares
      single bytes -- it is *not* Unicode-aware.

      In practice, this is *highly* unlikely to be a problem. In order to get
      bitten by this, you'd need an aforementioned multi-byte character to
      appear in your template (not just in a Sample String, but in your
      template -- i.e., in *each* Sample String).

      If it turns out there are significant multi-byte characters that contain
      the marker character "\x1f", you can change the MARKER value at the top
      of templatemaker.c and recompile it. Or, suggest a better character to
      the authors.
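Whether a multi-byte character can contain the marker byte depends on the encoding of your input. The Python 3 sketch below checks one character against a few encodings; the function name and the default list of encodings are assumptions made for illustration. In UTF-8, every byte of a multi-byte sequence is 0x80 or higher, so the byte 0x1f can only ever be the marker character itself. In UTF-16 or UTF-32, a character such as U+011F does contain that byte.

```python
MARKER_BYTE = 0x1F


def encodings_with_marker_byte(char,
                               encodings=('utf-8', 'utf-16-le', 'utf-32-le')):
    """List the encodings in which `char` is multi-byte and contains 0x1f."""
    hits = []
    for name in encodings:
        encoded = char.encode(name)
        if len(encoded) > 1 and MARKER_BYTE in encoded:
            hits.append(name)
    return hits


print(encodings_with_marker_byte('\u011f'))
# prints ['utf-16-le', 'utf-32-le']
```

If your Sample Strings are UTF-8 encoded byte strings, this second disadvantage should not arise in practice.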

Change log
==========

2007-09-20    0.1.1    Created HTMLTemplate
2007-07-06    0.1      Initial release