Flexible inline (Python-Markdown#629)

Add a new InlineProcessor class that handles inline processing more flexibly. The new inline processors no longer rely on unnecessary pretext and posttext captures. A processor can accept the buffer being worked on, process the text manually without regular expressions, and return new replacement bounds. This improves link handling, including nested brackets and other logic that is too much for a regular expression. The refactor also allows image links to contain links/paths with spaces, as regular links do. Ref Python-Markdown#551, Python-Markdown#613, Python-Markdown#590, Python-Markdown#161.
cbeauchesne · Jan 18, 2018 · d18c3d0
1 parent de9cc42
Showing 16 changed files with 785 additions and 184 deletions.
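
The commit message says the new processors drop the pretext and posttext captures. The sketch below illustrates what that means for group numbering. It uses only the standard `re` module, not the library itself, and the legacy-style wrapper string `^(.*?)...(.*)$` is an assumption made for illustration.

```python
import re

MYPATTERN = r'\*([^*]+)\*'
text = 'some *emphasis* here'

# Assumed legacy-style wrapping: a pretext capture before the user's
# pattern and a posttext capture after it, matched against the whole block.
legacy_re = re.compile(r'^(.*?)%s(.*)$' % MYPATTERN, re.DOTALL | re.UNICODE)
legacy_m = legacy_re.match(text)
# group(1) is the pretext, so the user's first group shifts to group(2).
legacy_text = legacy_m.group(2)

# New style: the pattern is searched as written, so the user's first
# group is group(1) and the match bounds locate it inside the block.
new_re = re.compile(MYPATTERN, re.DOTALL | re.UNICODE)
new_m = new_re.search(text)
new_text = new_m.group(1)
bounds = (new_m.start(0), new_m.end(0))
```

Both approaches recover the same emphasized text. Only the group index differs, which is why the diffs below renumber `m.group(2)` to `m.group(1)` and backreferences such as `\2` to `\1`.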
diff --git a/docs/extensions/api.md b/docs/extensions/api.md
@@ -48,6 +48,8 @@ class MyPreprocessor(Preprocessor):
 
 ## Inline Patterns {: #inlinepatterns }
 
+### Legacy
+
 Inline Patterns implement the inline HTML element syntax for Markdown such as
 `*emphasis*` or `[links](http://example.com)`. Pattern objects should be
 instances of classes that inherit from `markdown.inlinepatterns.Pattern` or
@@ -85,7 +87,7 @@ from markdown.util import etree
 class EmphasisPattern(Pattern):
     def handleMatch(self, m):
         el = etree.Element('em')
-        el.text = m.group(3)
+        el.text = m.group(2)
         return el
 ```
 
@@ -110,8 +112,113 @@ implemented with separate instances of the `SimpleTagPattern` listed below.
 Feel free to use or extend any of the Pattern classes found at
 `markdown.inlinepatterns`.
 
+### Future
+
+While users can still create plugins with the existing
+`markdown.inlinepatterns.Pattern`, a new, more flexible inline processor has
+been added which users are encouraged to migrate to. The new inline processor
+is found at `markdown.inlinepatterns.InlineProcessor`.
+
+The new processor is very similar to the legacy one, with two major distinctions.
+
+1. Patterns no longer need to match the entire block, so patterns no longer
+    start with `r'^(.*?)'` and end with `r'(.*)$'`. This was a huge
+    performance sink and this requirement has been removed. The returned match
+    object will only contain what is explicitly matched in the pattern, and
+    extension pattern groups now start with `m.group(1)`.
+
+2. The `handleMatch` method now takes an additional input called `data`,
+    which is the entire block under analysis, not just what is matched with
+    the specified pattern. The method also returns the element *and* the index
+    boundaries relative to `data` that the return element is replacing
+    (usually `m.start(0)` and `m.end(0)`).  If the boundaries are returned as
+    `None`, it is assumed that the match did not take place, and nothing will
+    be altered in `data`.
+
+If all you need is the same functionality as the legacy processor, you can do
+as shown below. Most of the time, simple regular expression processing is all
+you'll need.
+
+```python
+from markdown.inlinepatterns import InlineProcessor
+from markdown.util import etree
+
+# an oversimplified regex
+MYPATTERN = r'\*([^*]+)\*'
+
+class EmphasisPattern(InlineProcessor):
+    def handleMatch(self, m, data):
+        el = etree.Element('em')
+        el.text = m.group(1)
+        return el, m.start(0), m.end(0)
+
+# pass in pattern and create instance
+emphasis = EmphasisPattern(MYPATTERN)
+```
+
+However, the new processor also lets you handle much more complex patterns that
+are too much for Python's `re` module.  For instance, to handle nested brackets in
+link patterns, the built-in link inline processor uses the following pattern to
+find where a link *might* start:
+
+```python
+LINK_RE = NOIMG + r'\['
+link = LinkInlineProcessor(LINK_RE, md_instance)
+```
+
+It then uses programmed logic to actually walk the string (`data`), starting at
+where the match started (`m.start(0)`). If for whatever reason, the text
+does not appear to be a link, it returns `None` for the start and end boundary
+in order to communicate to the parser that no match was found.
+
+```python
+    # Just a snippet of the link's handleMatch
+    # method to illustrate new logic
+    def handleMatch(self, m, data):
+        text, index, handled = self.getText(data, m.end(0))
+
+        if not handled:
+            return None, None, None
+
+        href, title, index, handled = self.getLink(data, index)
+        if not handled:
+            return None, None, None
+
+        el = util.etree.Element("a")
+        el.text = text
+
+        el.set("href", href)
+
+        if title is not None:
+            el.set("title", title)
+
+        return el, m.start(0), index
+```
+
 ### Generic Pattern Classes
 
+Some of the available processors are listed below.
+
+* **`SimpleTextInlineProcessor(pattern)`**:
+
+    Returns simple text of `group(1)` of a `pattern` and the start and end
+    position of the match.
+
+* **`SimpleTagInlineProcessor(pattern, tag)`**:
+
+    Returns an element of type "`tag`" with a text attribute of `group(2)`
+    of a `pattern`. `tag` should be the name of an HTML element (e.g. 'em').
+    It also returns the start and end position of the match.
+
+* **`SubstituteTagInlineProcessor(pattern, tag)`**:
+
+    Returns an element of type "`tag`" with no children or text (e.g. `br`)
+    and the start and end position of the match.
+
+A very small number of the basic legacy processors are still available to
+prevent breakage of 3rd party extensions during the transition period to the
+new processors. Three of the available processors are listed below.
+
 * **`SimpleTextPattern(pattern)`**:
 
     Returns simple text of `group(2)` of a `pattern`.

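The documentation added above describes walking `data` by hand to cope with nested brackets. A minimal sketch of that idea follows. It is not the library's `getText` implementation: `find_bracket_text` is a hypothetical helper that only mirrors the `(text, index, handled)` return shape shown in the snippet, and it ignores backslash escapes for brevity.

```python
def find_bracket_text(data, index):
    """Scan data from index (just past an opening '[') to its matching ']'.

    Returns (text, next_index, handled). When no matching bracket is
    found, handled is False and the caller should report "no match".
    """
    depth = 1  # the opening bracket before `index` is already counted
    for pos in range(index, len(data)):
        ch = data[pos]
        if ch == '[':
            depth += 1
        elif ch == ']':
            depth -= 1
            if depth == 0:
                # Text between the brackets, and the index just past ']'.
                return data[index:pos], pos + 1, True
    return None, index, False


data = 'see [a [nested] label](url)'
start = data.index('[') + 1
label, next_index, handled = find_bracket_text(data, start)
```

A depth counter is something a single regular expression cannot express for arbitrary nesting, which is the motivation the documentation gives for handing the whole buffer to `handleMatch`.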
diff --git a/markdown/extensions/abbr.py b/markdown/extensions/abbr.py
@@ -20,7 +20,7 @@
 from __future__ import unicode_literals
 from . import Extension
 from ..preprocessors import Preprocessor
-from ..inlinepatterns import Pattern
+from ..inlinepatterns import InlineProcessor
 from ..util import etree, AtomicString
 import re
 
@@ -52,7 +52,7 @@ def run(self, lines):
                 abbr = m.group('abbr').strip()
                 title = m.group('title').strip()
                 self.markdown.inlinePatterns['abbr-%s' % abbr] = \
-                    AbbrPattern(self._generate_pattern(abbr), title)
+                    AbbrInlineProcessor(self._generate_pattern(abbr), title)
                 # Preserve the line to prevent raw HTML indexing issue.
                 # https://github.com/Python-Markdown/markdown/issues/584
                 new_text.append('')
@@ -76,18 +76,18 @@ def _generate_pattern(self, text):
         return r'(?P<abbr>\b%s\b)' % (r''.join(chars))
 
 
-class AbbrPattern(Pattern):
+class AbbrInlineProcessor(InlineProcessor):
     """ Abbreviation inline pattern. """
 
     def __init__(self, pattern, title):
-        super(AbbrPattern, self).__init__(pattern)
+        super(AbbrInlineProcessor, self).__init__(pattern)
         self.title = title
 
-    def handleMatch(self, m):
+    def handleMatch(self, m, data):
         abbr = etree.Element('abbr')
         abbr.text = AtomicString(m.group('abbr'))
         abbr.set('title', self.title)
-        return abbr
+        return abbr, m.start(0), m.end(0)
 
 
 def makeExtension(**kwargs):  # pragma: no cover

diff --git a/markdown/extensions/footnotes.py b/markdown/extensions/footnotes.py
@@ -17,7 +17,7 @@
 from __future__ import unicode_literals
 from . import Extension
 from ..preprocessors import Preprocessor
-from ..inlinepatterns import Pattern
+from ..inlinepatterns import InlineProcessor
 from ..treeprocessors import Treeprocessor
 from ..postprocessors import Postprocessor
 from .. import util
@@ -77,7 +77,7 @@ def extendMarkdown(self, md, md_globals):
         # Insert an inline pattern before ImageReferencePattern
         FOOTNOTE_RE = r'\[\^([^\]]*)\]'  # blah blah [^1] blah
         md.inlinePatterns.add(
-            "footnote", FootnotePattern(FOOTNOTE_RE, self), "<reference"
+            "footnote", FootnoteInlineProcessor(FOOTNOTE_RE, self), "<reference"
         )
         # Insert a tree-processor that would actually add the footnote div
         # This must be before all other treeprocessors (i.e., inline and
@@ -315,15 +315,15 @@ def detab(line):
         return items, i
 
 
-class FootnotePattern(Pattern):
+class FootnoteInlineProcessor(InlineProcessor):
     """ InlinePattern for footnote markers in a document's body text. """
 
     def __init__(self, pattern, footnotes):
-        super(FootnotePattern, self).__init__(pattern)
+        super(FootnoteInlineProcessor, self).__init__(pattern)
         self.footnotes = footnotes
 
-    def handleMatch(self, m):
-        id = m.group(2)
+    def handleMatch(self, m, data):
+        id = m.group(1)
         if id in self.footnotes.footnotes.keys():
             sup = util.etree.Element("sup")
             a = util.etree.SubElement(sup, "a")
@@ -333,9 +333,9 @@ def handleMatch(self, m):
                 a.set('rel', 'footnote')  # invalid in HTML5
             a.set('class', 'footnote-ref')
             a.text = util.text_type(self.footnotes.footnotes.index(id) + 1)
-            return sup
+            return sup, m.start(0), m.end(0)
         else:
-            return None
+            return None, None, None
 
 
 class FootnotePostTreeprocessor(Treeprocessor):

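The footnote processor above returns `None, None, None` when the id is unknown. The rough sketch below shows how a caller might honor that contract. It is a simplified stand-in written for illustration, not the library's tree processor, and `KnownIdProcessor` and `apply_processor` are hypothetical names.

```python
import re
import xml.etree.ElementTree as etree


class KnownIdProcessor(object):
    """Hypothetical processor: accepts [^id] markers only for known ids."""

    def __init__(self, pattern, known_ids):
        self.compiled_re = re.compile(pattern)
        self.known_ids = known_ids

    def handleMatch(self, m, data):
        if m.group(1) in self.known_ids:
            sup = etree.Element('sup')
            sup.text = m.group(1)
            return sup, m.start(0), m.end(0)
        # Signal "no match": the caller must leave data untouched here.
        return None, None, None


def apply_processor(processor, data):
    """Return (before, element, after) for the first accepted match, else None."""
    for m in processor.compiled_re.finditer(data):
        el, start, end = processor.handleMatch(m, data)
        if start is None or end is None:
            continue  # rejected candidate, try the next one
        return data[:start], el, data[end:]
    return None


proc = KnownIdProcessor(r'\[\^([^\]]*)\]', {'1'})
result = apply_processor(proc, 'x [^9] y [^1] z')
```

Here the first candidate, `[^9]`, is rejected and stays in the text as-is, while `[^1]` is replaced by the returned element using the reported bounds.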
diff --git a/markdown/extensions/nl2br.py b/markdown/extensions/nl2br.py
@@ -19,15 +19,15 @@
 from __future__ import absolute_import
 from __future__ import unicode_literals
 from . import Extension
-from ..inlinepatterns import SubstituteTagPattern
+from ..inlinepatterns import SubstituteTagInlineProcessor
 
 BR_RE = r'\n'
 
 
 class Nl2BrExtension(Extension):
 
     def extendMarkdown(self, md, md_globals):
-        br_tag = SubstituteTagPattern(BR_RE, 'br')
+        br_tag = SubstituteTagInlineProcessor(BR_RE, 'br')
         md.inlinePatterns.add('nl', br_tag, '_end')
 
 

diff --git a/markdown/extensions/smart_strong.py b/markdown/extensions/smart_strong.py
@@ -18,21 +18,21 @@
 from __future__ import absolute_import
 from __future__ import unicode_literals
 from . import Extension
-from ..inlinepatterns import SimpleTagPattern
+from ..inlinepatterns import SimpleTagInlineProcessor
 
-SMART_STRONG_RE = r'(?<!\w)(_{2})(?!_)(.+?)(?<!_)\2(?!\w)'
-STRONG_RE = r'(\*{2})(.+?)\2'
+SMART_STRONG_RE = r'(?<!\w)(_{2})(?!_)(.+?)(?<!_)\1(?!\w)'
+STRONG_RE = r'(\*{2})(.+?)\1'
 
 
 class SmartEmphasisExtension(Extension):
     """ Add smart_emphasis extension to Markdown class."""
 
     def extendMarkdown(self, md, md_globals):
         """ Modify inline patterns. """
-        md.inlinePatterns['strong'] = SimpleTagPattern(STRONG_RE, 'strong')
+        md.inlinePatterns['strong'] = SimpleTagInlineProcessor(STRONG_RE, 'strong')
         md.inlinePatterns.add(
             'strong2',
-            SimpleTagPattern(SMART_STRONG_RE, 'strong'),
+            SimpleTagInlineProcessor(SMART_STRONG_RE, 'strong'),
             '>emphasis2'
         )
 

diff --git a/markdown/extensions/smarty.py b/markdown/extensions/smarty.py
@@ -83,7 +83,7 @@
 
 from __future__ import unicode_literals
 from . import Extension
-from ..inlinepatterns import HtmlPattern, HTML_RE
+from ..inlinepatterns import HtmlInlineProcessor, HTML_RE
 from ..odict import OrderedDict
 from ..treeprocessors import InlineProcessor
 
@@ -150,21 +150,21 @@
 HTML_STRICT_RE = HTML_RE + r'(?!\>)'
 
 
-class SubstituteTextPattern(HtmlPattern):
+class SubstituteTextPattern(HtmlInlineProcessor):
     def __init__(self, pattern, replace, markdown_instance):
         """ Replaces matches with some text. """
-        HtmlPattern.__init__(self, pattern)
+        HtmlInlineProcessor.__init__(self, pattern)
         self.replace = replace
         self.markdown = markdown_instance
 
-    def handleMatch(self, m):
+    def handleMatch(self, m, data):
         result = ''
         for part in self.replace:
             if isinstance(part, int):
                 result += m.group(part)
             else:
                 result += self.markdown.htmlStash.store(part)
-        return result
+        return result, m.start(0), m.end(0)
 
 
 class SmartyExtension(Extension):
@@ -233,11 +233,11 @@ def educateQuotes(self, md):
             (doubleQuoteSetsRe, (ldquo + lsquo,)),
             (singleQuoteSetsRe, (lsquo + ldquo,)),
             (decadeAbbrRe, (rsquo,)),
-            (openingSingleQuotesRegex, (2, lsquo)),
+            (openingSingleQuotesRegex, (1, lsquo)),
             (closingSingleQuotesRegex, (rsquo,)),
-            (closingSingleQuotesRegex2, (rsquo, 2)),
+            (closingSingleQuotesRegex2, (rsquo, 1)),
             (remainingSingleQuotesRegex, (lsquo,)),
-            (openingDoubleQuotesRegex, (2, ldquo)),
+            (openingDoubleQuotesRegex, (1, ldquo)),
             (closingDoubleQuotesRegex, (rdquo,)),
             (closingDoubleQuotesRegex2, (rdquo,)),
             (remainingDoubleQuotesRegex, (ldquo,))
@@ -255,7 +255,7 @@ def extendMarkdown(self, md, md_globals):
             self.educateAngledQuotes(md)
             # Override HTML_RE from inlinepatterns.py so that it does not
             # process tags with duplicate closing quotes.
-            md.inlinePatterns["html"] = HtmlPattern(HTML_STRICT_RE, md)
+            md.inlinePatterns["html"] = HtmlInlineProcessor(HTML_STRICT_RE, md)
         if configs['smart_dashes']:
             self.educateDashes(md)
         inlineProcessor = InlineProcessor(md)

diff --git a/markdown/extensions/wikilinks.py b/markdown/extensions/wikilinks.py
@@ -18,7 +18,7 @@
 from __future__ import absolute_import
 from __future__ import unicode_literals
 from . import Extension
-from ..inlinepatterns import Pattern
+from ..inlinepatterns import InlineProcessor
 from ..util import etree
 import re
 
@@ -46,20 +46,20 @@ def extendMarkdown(self, md, md_globals):
 
         # append to end of inline patterns
         WIKILINK_RE = r'\[\[([\w0-9_ -]+)\]\]'
-        wikilinkPattern = WikiLinks(WIKILINK_RE, self.getConfigs())
+        wikilinkPattern = WikiLinksInlineProcessor(WIKILINK_RE, self.getConfigs())
         wikilinkPattern.md = md
         md.inlinePatterns.add('wikilink', wikilinkPattern, "<not_strong")
 
 
-class WikiLinks(Pattern):
+class WikiLinksInlineProcessor(InlineProcessor):
     def __init__(self, pattern, config):
-        super(WikiLinks, self).__init__(pattern)
+        super(WikiLinksInlineProcessor, self).__init__(pattern)
         self.config = config
 
-    def handleMatch(self, m):
-        if m.group(2).strip():
+    def handleMatch(self, m, data):
+        if m.group(1).strip():
             base_url, end_url, html_class = self._getMeta()
-            label = m.group(2).strip()
+            label = m.group(1).strip()
             url = self.config['build_url'](label, base_url, end_url)
             a = etree.Element('a')
             a.text = label
@@ -68,7 +68,7 @@ def handleMatch(self, m):
                 a.set('class', html_class)
         else:
             a = ''
-        return a
+        return a, m.start(0), m.end(0)
 
     def _getMeta(self):
         """ Return meta data or config data. """