Skip to content

Latest commit

 

History

History
 
 

blib2to3

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Building Concrete Parse Trees using the Python grammar

This grammar is mostly reusing existing code:

  • lib2to3 is a part of the 2to3 utility (included in the CPython distribution) aimed at automatically converting Python 2 code to equivalent Python 3 code. Because it needs to be idempotent when applied to Python 3 code, this grammar must be capable of parsing both Python 2 and 3 (with certain restrictions).
  • blib2to3 is part of the black formatter for Python. It adds a few extensions on top of lib2to3.
  • Finally, we extend this grammar even further, in order to support things like f-strings even when the extractor is run using Python 2. (In this respect, blib2to3 "cheats" by requiring Python 3 if you want to parse Python 3 code. We do not have this luxury.)

The grammar of Python is described in Grammar.txt in the style of an EBNF:

  • Rules have the form nonterminal_name: production (where traditionally, one would use ::= instead of :)
  • Productions can contain
    • Literal strings, enclosed in single quotes.
    • Alternation, indicated by an infix |.
    • Repetition, indicated by a postfixed * for "zero or more" and + for "one or more".
    • Optional parts, indicated by these being surrounded by square brackets.
    • Parentheseses to indicate grouping, and to allow productions to span several lines.

Note: You may wonder: How is Grammar.txt parsed? The answer to this is that it is used to parse itself. In particular, it uses the same tokenizer as that for Python, and hence every symbol appearing in the grammar must be a valid Python token. This is why rules use : instead of ::=. This also explains why parentheses must be used when a production spans multiple lines, as the presence of parentheses affects the tokenization.

The concrete parse tree built based on these rules has a simple form: Each node has a name attribute, equal to that of the corresponding nonterminal, and a children attribute, which contains a list of all of the children of the node. These come directly from the production on the right hand side of the rule for the given nonterminal. Thus, something like

testlist: test (',' test)* [',']

will result in a node with name testlist, and its attribute children will be a list where the first element is a test node, the second (if any) is a node for ',', etc. Note in particular that every part of the production is included in the children, even parts that are just static tokens.

The leaves of the concrete parse tree (corresponding to the terminals of the grammar) will have an associated value attribute. This contains the underlying string for this token (in particular, for a NAME token, its value will be the underlying identifier).

From Concrete to Abstract

To turn the concrete parse tree into an asbstract parse tree, we walk the tree using the visitor pattern. Thus, for every nonterminal (e.g. testlist) we have a method (in this case visit_testlist) that takes care of visiting nodes of this type in the concrete parse tree. In doing so, we build up the abstract parse tree, eliding any nodes that are not relevant in terms of the abstract syntax.

TO DO:

  • Why we parse everything four times (async et al.)