templatemaker is a Python library that can extract data from files with a similar format, like HTML pages.
paulsmith/templatemaker
=============
templatemaker
=============

Given a list of text files in a similar format, templatemaker creates a
template that can extract data from files in that same format.

The underlying longest-common-substring algorithm is implemented in C
for performance.

Installation
============

Because part of templatemaker is implemented in C, you'll need to compile the
C portion of it. Fortunately, Python makes this easy.

To install systemwide:

    sudo python setup.py install

To play around with it without having to install systemwide:

    python setup.py build
    cp build/lib*/_templatemaker.so .
    rm -rf build

Overview
========

The templatemaker.py module provides a class, Template, that is capable of two
things:

    * Given an arbitrary number of strings (called "Sample Strings"), it
      determines the "least-common denominator" template -- that is, a string
      representing all the common substrings, with placeholders ("holes") for
      areas in the Sample Strings that differ.

      For example, these three Sample Strings...

          '<b>this and that</b>'
          '<b>alex and sue</b>'
          '<b>fine and dandy</b>'

      ...would produce the following template, where "!" represents a "hole."

          '<b>! and !</b>'

    * Once you have a template, you can give it data that is formatted
      according to that template, and it will "extract" a tuple of the "hole"
      values for that particular string.

      Following the above example, giving the template the following data:

          '<b>larry and curly</b>'

      ...would produce the following tuple:

          ('larry', 'curly')

      You can interpret this as "Hole 0's value is 'larry', and hole 1's value
      is 'curly'."

Basic Python API
================

Here's how to express the above example in Python code:

    # Import the Template class.
    >>> from templatemaker import Template

    # Create a Template instance.
    >>> t = Template()

    # Learn a Sample String.
    >>> t.learn('<b>this and that</b>')

    # Output the template so far, using the "!" character to mark holes.
    # We've only learned a single string, so the template has no holes.
    >>> t.as_text('!')
    '<b>this and that</b>'

    # Learn another string. The True return value means the template gained
    # at least one hole.
    >>> t.learn('<b>alex and sue</b>')
    True

    # Sure enough, the template now has some holes.
    >>> t.as_text('!')
    '<b>! and !</b>'

    # Learn another string. This time, the return value is False, which means
    # the template didn't gain any holes.
    >>> t.learn('<b>fine and dandy</b>')
    False

    # The template is the same as before.
    >>> t.as_text('!')
    '<b>! and !</b>'

    # Now that we have a template, let's extract some data.
    >>> t.extract('<b>red and green</b>')
    ('red', 'green')
    >>> t.extract('<b>django and stephane</b>')
    ('django', 'stephane')

    # The extract() method is very literal. It doesn't magically trim
    # whitespace, nor does it have any knowledge of markup languages such as
    # HTML.
    >>> t.extract('<b> spacy and <u>underlined</u></b>')
    (' spacy ', '<u>underlined</u>')

    # The extract() method will raise the NoMatch exception if the data
    # doesn't match the template. In this example, the data doesn't have the
    # leading and trailing "<b>" tags.
    >>> t.extract('this and that')
    Traceback (most recent call last):
    ...
    NoMatch

    # Use the extract_dict() method to get a dictionary instead of a tuple.
    # extract_dict() requires that you specify a tuple of field names.
    >>> t.extract_dict('<b>red and green</b>', ('value1', 'value2'))
    {'value1': 'red', 'value2': 'green'}

    # You can specify None as one or more of the field-name values in
    # extract_dict(). Any field whose value is None will *not* be included
    # in the resulting dictionary.
    >>> t.extract_dict('<b>red and green</b>', ('value1', None))
    {'value1': 'red'}
    >>> t.extract_dict('<b>red and green</b>', (None, 'value2'))
    {'value2': 'green'}

The as_text() method
====================

The as_text() method displays the template as a string, with holes represented
by a string of your choosing.

    >>> t = Template()
    >>> t.learn('123 and 456')
    >>> t.learn('456 and 789')
    True

Get the template with "!" as the hole-representing string.

    >>> t.as_text('!')
    '! and !'

Get the template with "{{ var }}" as the hole-representing string.

    >>> t.as_text('{{ var }}')
    '{{ var }} and {{ var }}'

Note that as_text() does *not* perform any escaping if your template contains
the hole-representing string:

    >>> t = Template()
    >>> t.learn('Yes!')
    >>> t.learn('No!')
    True

Here, we use "!" as the hole-representing string, and the template contains a
literal "!" character. The literal character is not escaped, so there is no
way to tell apart the literal template character from the hole-representing
string.

    >>> t.as_text('!')
    '!!'

Here, we use an underscore to demonstrate that the template contains a literal
"!" character. This wasn't detectable in the previous statement.

    >>> t.as_text('_')
    '_!'

With this in mind, you should choose a string in as_text() that is highly
unlikely to appear in your template. But, in any case, you shouldn't rely on
the output of as_text() for use in programs; it's solely intended to be a
visual aid for humans to see their templates. (The template-maker code has its
own internal way of representing holes, and it's guaranteed to be unambiguous.
See "The marker character" below.)

Tolerance
=========

This template-making algorithm can often be "too literal" for one's liking.
For example, given these three Sample Strings...

    'my favorite color is blue'
    'my favorite color is violet'
    'my favorite color is purple'

The color is the only text that changes, so one would assume the template
would be the following:

    'my favorite color is !'

Let's see what actually happens:

    >>> t = Template()
    >>> t.learn('my favorite color is blue')
    >>> t.learn('my favorite color is violet')
    True
    >>> t.learn('my favorite color is purple')
    False
    >>> t.as_text('!')
    'my favorite color is !l!e!'

Aha, the template-maker was a bit too literal -- it noticed that the strings
"blue", "violet" and "purple" all contain *something*, then the letter "l",
then *something*, then the letter "e", then *something*. Technically, that's
correct, but for most applications, this approach misses the forest for the
trees.

There are two ways to solve this problem:

    * Teach a template many, many Sample Strings. The more diverse your input,
      the less likely this will happen.

    * Use a feature called *tolerance*. The template-maker algorithm lets you
      specify a tolerance -- the minimum allowed length of text between holes.
      This gives you control over avoiding the problem.

To specify tolerance, just pass the ``tolerance`` argument to the Template
constructor. Here's the above example with tolerance=1.

    >>> t = Template(tolerance=1)
    >>> t.learn('my favorite color is blue')
    >>> t.learn('my favorite color is violet')
    True
    >>> t.learn('my favorite color is purple')
    False
    >>> t.as_text('!')
    'my favorite color is !'

Aha! Now, that's more like it.

Setting tolerance is useful for small cases like this, but note that there is
a risk of throwing the baby out with the bathwater, depending on how high your
tolerance is set. If the tolerance is set too high, your output template might
be "watered down" -- less accurate than it possibly could be. For example, say
we have a list of HTML strings representing significant events:

    <p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>
    <p>My wife's birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>
    <p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>

Say we'd like to extract five pieces of data from each string: the event, the
HTML color, the month, the day and the year. Here's what we might do in
Python:

    >>> t = Template(tolerance=5)
    >>> t.learn('<p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>')
    >>> t.learn('<p>My wife\'s birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>')
    True
    >>> t.learn('<p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>')
    True
    >>> t.as_text('!')
    '! <span style="color: !</span>.</p>'

The resulting template is correct, but it's watered down. You can tell by
extracting some data:

    >>> t.extract('<p>My sister\'s birthday is <span style="color: pink;">Jul. 12, 1952</span>.</p>')
    ("<p>My sister's birthday is", 'pink;">Jul. 12, 1952')

This data is messy. Namely, the color of the <span> is in the same data field
as the event date. Assuming we're interested in getting as granular as possible
in our parsing, it would be better to set a lower tolerance.

    >>> t = Template(tolerance=1)
    >>> t.learn('<p>My birthday is <span style="color: blue;">Dec. 11, 1954</span>.</p>')
    >>> t.learn('<p>My wife\'s birthday is <span style="color: red;">Jan. 3, 1957</span>.</p>')
    True
    >>> t.learn('<p>Our wedding fell on <span style="color: green;">Sep. 24, 1982</span>.</p>')
    False
    >>> t.as_text('!')
    '<p>! <span style="color: !;">!. !, 19!</span>.</p>'
    >>> t.extract('<p>My sister\'s birthday is <span style="color: pink;">Jul. 12, 1952</span>.</p>')
    ("My sister's birthday is", 'pink', 'Jul', '12', '52')

Much better.

Using tolerance is all about tradeoffs. To use this feature most effectively,
you'll need to experiment and consider the nature of the data you're parsing.

Versions
========

A Template instance keeps track of how many Sample Strings it has learned. You
can access this via the ``version`` attribute.

    >>> t = Template()
    >>> t.version
    0
    >>> t.learn('My name is Paul.')
    >>> t.version
    1
    >>> t.learn('My name is Jonas.')
    True
    >>> t.version
    2

The marker character
====================

The template-maker algorithm, implemented in C, works by comparing two strings
one byte at a time. Each time you call learn(some_string) on a template object,
the underlying C algorithm compares some_string to the *current template*.
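
The comparison step described above can be approximated in pure Python. What
follows is a minimal sketch, not the library's C code: the names HOLE,
longest_common_substring, merge and learn_all are hypothetical, and the rule
that a shared run is kept only when it is longer than the tolerance is an
assumption inferred from the tolerance=1 example in this document. It works on
Python 3 str values rather than on bytes.

```python
# Hypothetical pure-Python sketch of the learn() comparison; not the C code.
HOLE = "\x1f"  # stand-in for the marker character named in this README


def longest_common_substring(a, b):
    """Return (start_a, start_b, length) of the longest run shared by a and b."""
    best = (0, 0, 0)
    prev = [0] * (len(b) + 1)
    for i, ch_a in enumerate(a, 1):
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                # Extend the run that ended at the previous pair of characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best[2]:
                    best = (i - curr[j], j - curr[j], curr[j])
        prev = curr
    return best


def merge(template, sample, tolerance=0):
    """Merge one sample into the template, inserting HOLE where they differ."""
    start_t, start_s, length = longest_common_substring(template, sample)
    # Assumption: a shared run survives only if it is longer than tolerance.
    if length <= tolerance:
        return HOLE if (template or sample) else ""
    left = merge(template[:start_t], sample[:start_s], tolerance)
    right = merge(template[start_t + length:], sample[start_s + length:], tolerance)
    return left + template[start_t:start_t + length] + right


def learn_all(samples, tolerance=0):
    """Fold a list of Sample Strings into a single template string."""
    template = None
    for sample in samples:
        # Strip the marker from the input so it can only ever mean "hole".
        sample = sample.replace(HOLE, "")
        template = sample if template is None else merge(template, sample, tolerance)
    return template


colors = [
    "my favorite color is blue",
    "my favorite color is violet",
    "my favorite color is purple",
]
print(learn_all(colors).replace(HOLE, "!"))               # my favorite color is !l!e!
print(learn_all(colors, tolerance=1).replace(HOLE, "!"))  # my favorite color is !
```

Under these assumptions the sketch reproduces both color templates shown in the
Tolerance section. Because a hole is just an ordinary character that never
occurs in a sample, merge() needs no special case for it.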

A template internally represents each hole with a *marker character* -- a
character that is guaranteed not to appear in either string. This is set to the
character "\x1f".

In order to guarantee that the marker character doesn't appear in a string, the
Template's learn() method removes any occurrence of the marker character from
the input string before running the comparison algorithm.

The advantage of this "dumb" approach is its simplicity: the C implementation
doesn't have to treat markers as a special case. As a result, the C code is
cleaner and faster. Also, it was easier to write. :)

However, there are two disadvantages:

    * First, a template effectively cannot contain the literal marker
      character. In practice, this is *highly* unlikely to be a problem,
      because the marker character is obscure.

    * Second, in the slight chance that a template contains a multi-byte
      character (e.g., Unicode) that contains the marker character as one of
      its bytes, the template-maker algorithm will split that Unicode character
      at that byte. This happens because the underlying C implementation
      compares single bytes -- it is *not* Unicode-aware.

      In practice, this is *highly* unlikely to be a problem. In order to get
      bitten by this, you'd need an aforementioned multi-byte character to
      appear in your template (not just in a Sample String, but in your
      template -- i.e., in *each* Sample String).

If it turns out there are significant multi-byte characters that contain the
marker character "\x1f", you can change the MARKER value at the top of
templatemaker.c and recompile it. Or, suggest a better character to the
authors.

Change log
==========

2007-09-20  0.1.1  Created HTMLTemplate
2007-07-06  0.1    Initial release