diff --git a/CHANGES b/CHANGES index b82acb0..784784c 100644 --- a/CHANGES +++ b/CHANGES @@ -1,5 +1,45 @@ -Version 2.6 +Version 3.0 ----------------------------- +01/13/09: beazley + Minor change to the procedure for signalling a syntax error in a + production rule. A normal SyntaxError exception should be raised + instead of yacc.SyntaxError. + +01/13/09: beazley + Added a new method p.set_lineno(n,lineno) that can be used to set the + line number of symbol n in grammar rules. This simplifies manual + tracking of line numbers. + +01/11/09: beazley + Vastly improved debugging support for yacc.parse(). Instead of passing + debug as an integer, you can supply a Logging object (see the logging + module). Messages will be generated at the ERROR, INFO, and DEBUG + logging levels, each level providing progressively more information. + The debugging trace also shows states, grammar rule, values passed + into grammar rules, and the result of each reduction. + +01/09/09: beazley + The yacc() command now does all error-reporting and diagnostics using + the interface of the logging module. Use the errorlog parameter to + specify a logging object for error messages. Use the debuglog parameter + to specify a logging object for the 'parser.out' output. + +01/09/09: beazley + *HUGE* refactoring of the ply.yacc() implementation. The high-level + user interface is backwards compatible, but the internals are completely + reorganized into classes. No more global variables. The internals + are also more extensible. For example, you can use the classes to + construct a LALR(1) parser in an entirely different manner than + what is currently the case. Documentation is forthcoming. + +01/07/09: beazley + Various cleanup and refactoring of yacc internals. + +01/06/09: beazley + Fixed a bug with precedence assignment. yacc was assigning the precedence + of each rule based on the left-most token, when in fact, it should have been + using the right-most token. Reported by Bruce Frederiksen. 
+ 11/27/08: beazley Numerous changes to support Python 3.0 including removal of deprecated statements (e.g., has_key) and the addition of compatibility code diff --git a/doc/internal.html b/doc/internal.html new file mode 100644 index 0000000..9192bcb --- /dev/null +++ b/doc/internal.html @@ -0,0 +1,851 @@ + + +PLY Internals + + + +

PLY Internals

+ + +David M. Beazley
+dave@dabeaz.com
+
+ +

+PLY Version: 3.0 +

+ + + + +

1. Introduction

+ +This document describes classes and functions that make up the internal +operation of PLY. Using this programming interface, it is possible to +manually build a parser using a different interface specification +than what PLY normally uses. For example, you could build a grammar +from information parsed in a completely different input format. Some of +these objects may be useful for building more advanced parsing engines +such as GLR. + +

+It should be stressed that using PLY at this level is not for the +faint of heart. Generally, it's assumed that you know a bit of +the underlying compiler theory and how an LR parser is put together. + +

2. Grammar Class

+ +The file ply.yacc defines a class Grammar that +is used to hold and manipulate information about a grammar +specification. It encapsulates the same basic information +about a grammar that is put into a YACC file including +the list of tokens, precedence rules, and grammar rules. +Various operations are provided to perform different validations +on the grammar. In addition, there are operations to compute +the first and follow sets that are needed by the various table +generation algorithms. + +

+Grammar(terminals) + +

+Creates a new grammar object. terminals is a list of strings +specifying the terminals for the grammar. An instance g of +Grammar has the following methods: +
+ +

+g.set_precedence(term,assoc,level) +

+Sets the precedence level and associativity for a given terminal term. +assoc is one of 'right', +'left', or 'nonassoc' and level is a positive integer. The higher +the value of level, the higher the precedence. Here is an example of typical +precedence settings: + +
+g.set_precedence('PLUS',  'left',1)
+g.set_precedence('MINUS', 'left',1)
+g.set_precedence('TIMES', 'left',2)
+g.set_precedence('DIVIDE','left',2)
+g.set_precedence('UMINUS','left',3)
+
+ +This method must be called prior to adding any productions to the +grammar with g.add_production(). The precedence of individual grammar +rules is determined by the precedence of the right-most terminal. + +
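The right-most-terminal rule described above can be sketched in plain Python. This is a standalone illustration, not PLY's code: the helper name `rule_precedence`, the module-level `precedence` table, and the `('right', 0)` fallback for rules without a precedence-bearing terminal are all assumptions made for the example.

```python
# Standalone sketch; not PLY's implementation.
precedence = {}   # terminal name -> (assoc, level)

def set_precedence(term, assoc, level):
    """Record the associativity and level of one terminal."""
    assert assoc in ('left', 'right', 'nonassoc')
    precedence[term] = (assoc, level)

def rule_precedence(syms, terminals):
    """Return (assoc, level) for the right-hand side `syms`."""
    # An explicit %prec specifier overrides the right-most terminal.
    if '%prec' in syms:
        return precedence[syms[syms.index('%prec') + 1]]
    # Otherwise scan from the right for the first terminal symbol.
    for sym in reversed(syms):
        if sym in terminals:
            return precedence.get(sym, ('right', 0))  # assumed fallback
    return ('right', 0)                               # assumed fallback

terminals = {'PLUS', 'MINUS', 'TIMES', 'DIVIDE', 'NUMBER'}
set_precedence('PLUS',   'left', 1)
set_precedence('MINUS',  'left', 1)
set_precedence('TIMES',  'left', 2)
set_precedence('DIVIDE', 'left', 2)
set_precedence('UMINUS', 'left', 3)
```

With these settings, `['expr','PLUS','expr']` takes the precedence of `PLUS`, while `['MINUS','expr','%prec','UMINUS']` takes the precedence of `UMINUS` rather than `MINUS`.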
+

+g.add_production(name,syms,func=None,file='',line=0) +

+Adds a new grammar rule. name is the name of the rule, +syms is a list of symbols making up the right hand +side of the rule, func is the function to call when +reducing the rule. file and line specify +the filename and line number of the rule and are used for +generating error messages. + +

+The list of symbols in syms may include character +literals and %prec specifiers. Here are some +examples: + +

+g.add_production('expr',['expr','PLUS','term'],func,file,line)
+g.add_production('expr',['expr','"+"','term'],func,file,line)
+g.add_production('expr',['MINUS','expr','%prec','UMINUS'],func,file,line)
+
+ +

+If any kind of error is detected, a GrammarError exception +is raised with a message indicating the reason for the failure. +

+ +

+g.set_start(start=None) +

+Sets the starting rule for the grammar. start is a string +specifying the name of the start rule. If start is omitted, +the first grammar rule added with add_production() is taken to be +the starting rule. This method must always be called after all +productions have been added. +
+ +

+g.find_unreachable() +

+Diagnostic function. Returns a list of all unreachable non-terminals +defined in the grammar. This is used to identify inactive parts of +the grammar specification. +
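As a rough idea of what an unreachability check involves, here is a minimal standalone sketch. It assumes the grammar is given as a plain dictionary mapping each non-terminal to a list of right-hand sides; that representation and the function body are illustrative and are not taken from PLY.

```python
def find_unreachable(prodnames, start):
    """Return non-terminals that cannot be reached from `start`.

    `prodnames` maps each non-terminal name to a list of right-hand
    sides (tuples of symbol names). Hypothetical representation.
    """
    reachable = set()
    pending = [start]
    while pending:
        name = pending.pop()
        if name in reachable:
            continue
        reachable.add(name)
        for rhs in prodnames.get(name, []):
            for sym in rhs:
                if sym in prodnames:      # follow non-terminals only
                    pending.append(sym)
    return [name for name in prodnames if name not in reachable]

grammar = {
    'statement': [('ID', '=', 'expr')],
    'expr':      [('NUMBER',), ('expr', 'PLUS', 'expr')],
    'orphan':    [('expr', 'TIMES', 'expr')],   # never referenced
}
```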
+ +

+g.infinite_cycle() +

+Diagnostic function. Returns a list of all non-terminals in the +grammar that result in an infinite cycle. This condition occurs if +there is no way for a grammar rule to expand to a string containing +only terminal symbols. +
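The condition can be checked with a fixed-point iteration: a symbol "terminates" if it is a terminal, or if at least one of its rules consists only of terminating symbols. The following standalone sketch assumes a dictionary-based grammar representation and is not PLY's code.

```python
def infinite_cycle(prodnames, terminals):
    """Return non-terminals that can never expand to terminals only.

    `prodnames` maps non-terminal -> list of right-hand sides (tuples).
    Hypothetical representation chosen for this sketch.
    """
    terminates = set(terminals)
    changed = True
    while changed:
        changed = False
        for name, rules in prodnames.items():
            if name in terminates:
                continue
            # One fully terminating rule is enough (an empty rule counts).
            if any(all(sym in terminates for sym in rhs) for rhs in rules):
                terminates.add(name)
                changed = True
    return [name for name in prodnames if name not in terminates]

grammar = {
    'loop': [('loop', 'X')],               # no base case
    'list': [('X',), ('list', 'X')],       # has a base case
}
```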
+ +

+g.undefined_symbols() +

+Diagnostic function. Returns a list of tuples (name, prod) +corresponding to undefined symbols in the grammar. name is the +name of the undefined symbol and prod is an instance of +Production which has information about the production rule +where the undefined symbol was used. +
+ +

+g.unused_terminals() +

+Diagnostic function. Returns a list of terminals that were defined, +but never used in the grammar. +
+ +

+g.unused_rules() +

+Diagnostic function. Returns a list of Production instances +corresponding to production rules that were defined in the grammar, +but never used anywhere. This is slightly different +than find_unreachable(). +
+ +

+g.unused_precedence() +

+Diagnostic function. Returns a list of tuples (term, assoc) +corresponding to precedence rules that were set, but never used in the +grammar. term is the terminal name and assoc is the +precedence associativity (e.g., 'left', 'right', +or 'nonassoc'). +
+ +

+g.compute_first() +

+Compute all of the first sets for all symbols in the grammar. Returns a dictionary +mapping symbol names to a list of all first symbols. +
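For readers who want to see the shape of the computation, here is a minimal standalone sketch of the textbook first-set algorithm. The dictionary-based grammar, the helper `first_of_sequence`, and the `'<empty>'` marker for an empty expansion are assumptions of this example, not a description of PLY's internals.

```python
EMPTY = '<empty>'   # assumed marker for "can expand to nothing"

def first_of_sequence(syms, first):
    """First symbols of a sequence, given the sets computed so far."""
    result = []
    for sym in syms:
        nullable = False
        for f in first[sym]:
            if f == EMPTY:
                nullable = True
            elif f not in result:
                result.append(f)
        if not nullable:
            return result
    result.append(EMPTY)    # every symbol in the sequence was nullable
    return result

def compute_first(prodnames, terminals):
    """Return a dict mapping every symbol to a list of first symbols."""
    first = {t: [t] for t in terminals}
    for name in prodnames:
        first[name] = []
    changed = True
    while changed:           # iterate until nothing new is added
        changed = False
        for name, rules in prodnames.items():
            for rhs in rules:
                for f in first_of_sequence(rhs, first):
                    if f not in first[name]:
                        first[name].append(f)
                        changed = True
    return first

grammar = {
    'expr': [('term', 'PLUS', 'expr'), ('term',)],
    'term': [('NUMBER',), ('LPAREN', 'expr', 'RPAREN')],
    'sign': [(), ('MINUS',)],
}
first = compute_first(grammar, ['PLUS', 'MINUS', 'NUMBER', 'LPAREN', 'RPAREN'])
```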
+ +

+g.compute_follow() +

+Compute all of the follow sets for all non-terminals in the grammar. +The follow set is the set of all possible symbols that might follow a +given non-terminal. Returns a dictionary mapping non-terminal names +to a list of symbols. +
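A follow-set computation can be sketched as another fixed-point iteration over the rules. The example below is standalone and takes precomputed first sets as input; the dictionary representation, the `'$end'` end-of-input marker, and the `'<empty>'` marker are assumptions of the sketch.

```python
def compute_follow(prodnames, first, start):
    """Return a dict mapping each non-terminal to its follow symbols."""
    follow = {name: [] for name in prodnames}
    follow[start].append('$end')          # assumed end-of-input marker
    changed = True
    while changed:
        changed = False
        for name, rules in prodnames.items():
            for rhs in rules:
                for i, sym in enumerate(rhs):
                    if sym not in prodnames:
                        continue          # only non-terminals have follow sets
                    candidates = []
                    tail_nullable = True
                    for nxt in rhs[i + 1:]:
                        candidates.extend(f for f in first[nxt] if f != '<empty>')
                        if '<empty>' not in first[nxt]:
                            tail_nullable = False
                            break
                    if tail_nullable:
                        # Nothing (or only nullable symbols) follows sym here,
                        # so whatever follows the rule's name also follows sym.
                        candidates.extend(follow[name])
                    for c in candidates:
                        if c not in follow[sym]:
                            follow[sym].append(c)
                            changed = True
    return follow

grammar = {
    'statement': [('ID', 'EQUALS', 'expr')],
    'expr':      [('expr', 'PLUS', 'term'), ('term',)],
    'term':      [('NUMBER',)],
}
first = {
    'ID': ['ID'], 'EQUALS': ['EQUALS'], 'PLUS': ['PLUS'], 'NUMBER': ['NUMBER'],
    'statement': ['ID'], 'expr': ['NUMBER'], 'term': ['NUMBER'],
}
follow = compute_follow(grammar, first, 'statement')
```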
+ +

+g.build_lritems() +

+Calculates all of the LR items for all productions in the grammar. This +step is required before using the grammar for any kind of table generation. +See the section on LR items below. +
+ +

+The following attributes are set by the above methods and may be useful +in code that works with the grammar. All of these attributes should be +assumed to be read-only. Changing their values directly will likely +break the grammar. + +

+g.Productions +

+A list of all productions added. The first entry is reserved for +a production representing the starting rule. The objects in this list +are instances of the Production class, described shortly. +
+ +

+g.Prodnames +

+A dictionary mapping the names of nonterminals to a list of all +productions of that nonterminal. +
+ +

+g.Terminals +

+A dictionary mapping the names of terminals to a list of the +production numbers where they are used. +
+ +

+g.Nonterminals +

+A dictionary mapping the names of nonterminals to a list of the +production numbers where they are used. +
+ +

+g.First +

+A dictionary representing the first sets for all grammar symbols. This is +computed and returned by the compute_first() method. +
+ +

+g.Follow +

+A dictionary representing the follow sets for all grammar rules. This is +computed and returned by the compute_follow() method. +
+ +

+g.Start +

+Starting symbol for the grammar. Set by the set_start() method. +
+ +For the purposes of debugging, a Grammar object supports the __len__() and +__getitem__() special methods. Accessing g[n] returns the nth production +from the grammar. + + +

3. Productions

+ +Grammar objects store grammar rules as instances of a Production class. This +class has no public constructor--you should only create productions by calling Grammar.add_production(). +The following attributes are available on a Production instance p. + +

+p.name +

+The name of the production. For a grammar rule such as A : B C D, this is 'A'. +
+ +

+p.prod +

+A tuple of symbols making up the right-hand side of the production. For a grammar rule such as A : B C D, this is ('B','C','D'). +
+ +

+p.number +

+Production number. An integer containing the index of the production in the grammar's Productions list. +
+ +

+p.func +

+The name of the reduction function associated with the production. +This is the function that will execute when reducing the entire +grammar rule during parsing. +
+ +

+p.callable +

+The callable object associated with the name in p.func. This is None +unless the production has been bound using bind(). +
+ +

+p.file +

+Filename associated with the production. Typically this is the file where the production was defined. Used for error messages. +
+ +

+p.lineno +

+Line number associated with the production. Typically this is the line number in p.file where the production was defined. Used for error messages. +
+ +

+p.prec +

+Precedence and associativity associated with the production. This is a tuple (assoc,level) where +assoc is one of 'left','right', or 'nonassoc' and level is +an integer. This value is determined by the precedence of the right-most terminal symbol in the production +or by use of the %prec specifier when adding the production. +
+ +

+p.usyms +

+A list of all unique symbols found in the production. +
+ +

+p.lr_items +

+A list of all LR items for this production. This attribute only has a meaningful value if the +Grammar.build_lritems() method has been called. The items in this list are +instances of LRItem described below. +
+ +

+p.lr_next +

+The head of a linked-list representation of the LR items in p.lr_items. +This attribute only has a meaningful value if the Grammar.build_lritems() +method has been called. Each LRItem instance has a lr_next attribute +to move to the next item. The list is terminated by None. +
+ +

+p.bind(dict) +

+Binds the production function name in p.func to a callable object in +dict. This operation is typically carried out in the last step +prior to running the parsing engine and is needed since parsing tables are typically +read from files which only include the function names, not the functions themselves. +
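The idea behind binding can be shown with a small stand-in class. `Rule` below is hypothetical and only mimics the `func`, `callable`, and `bind()` members described above; it is not PLY's Production class.

```python
class Rule:
    """Hypothetical stand-in that stores a function *name* until bound."""
    def __init__(self, name, func):
        self.name = name
        self.func = func          # name of the reduction function (a string)
        self.callable = None      # filled in by bind()

    def bind(self, pdict):
        # Look the stored name up in the supplied dictionary.
        if self.func:
            self.callable = pdict[self.func]

def p_expr(p):
    return 'reduced expr'

rule = Rule('expr', 'p_expr')
unbound = rule.callable           # still None before binding
rule.bind({'p_expr': p_expr})
```

Keeping only the name until the last moment is what allows table data to be written to and read from a file without referencing live function objects.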
+ +

+Production objects support +the __len__(), __getitem__(), and __str__() +special methods. +len(p) returns the number of symbols in p.prod +and p[n] is the same as p.prod[n]. + +

4. LRItems

+ +The construction of parsing tables in an LR-based parser generator is primarily +done over a set of "LR Items". An LR item represents a stage of parsing one +of the grammar rules. To compute the LR items, it is first necessary to +call Grammar.build_lritems(). Once this step is complete, all of the productions +in the grammar will have their LR items attached to them. + +

+Here is an interactive example that shows what LR items look like. +In this example, g is a Grammar +object. + +

+
+>>> g.build_lritems()
+>>> p = g[1]
+>>> p
+Production(statement -> ID = expr)
+>>>
+
+
+ +In the above code, p represents the first grammar rule. In +this case, a rule 'statement -> ID = expr'. + +

+Now, let's look at the LR items for p. + +

+
+>>> p.lr_items
+[LRItem(statement -> . ID = expr), 
+ LRItem(statement -> ID . = expr), 
+ LRItem(statement -> ID = . expr), 
+ LRItem(statement -> ID = expr .)]
+>>>
+
+
+ +In each LR item, the dot (.) represents a specific stage of parsing. In each successive LR item, the dot +is advanced by one symbol. It is only when the dot reaches the very end that a production +is successfully parsed. + +
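The way the dot advances can be reproduced with a few lines of plain Python. This sketch works on bare tuples and uses a hypothetical helper name; it only illustrates the dotted right-hand sides, not the LRItem objects themselves.

```python
def dotted_items(prod):
    """Return every dotted form of a right-hand side, dot moving left to right."""
    return [prod[:i] + ('.',) + prod[i:] for i in range(len(prod) + 1)]

items = dotted_items(('ID', '=', 'expr'))
```

A right-hand side of n symbols yields n+1 items, the last of which has the dot at the very end.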

+An instance lr of LRItem has the following +attributes that hold information related to that specific stage of +parsing. + +

+lr.name +

+The name of the grammar rule. For example, 'statement' in the above example. +
+ +

+lr.prod +

+A tuple of symbols representing the right-hand side of the production, including the +special '.' character. For example, ('ID','.','=','expr'). +
+ +

+lr.number +

+An integer representing the production number in the grammar. +
+ +

+lr.usyms +

+A set of unique symbols in the production. Inherited from the original Production instance. +
+ +

+lr.lr_index +

+An integer representing the position of the dot (.). You should never use lr.prod.index() +to search for it--the result will be wrong if the grammar happens to also use (.) as a character +literal. +
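The pitfall is easy to reproduce with plain tuples. In this standalone sketch the rule is hypothetical: its right-hand side uses '.' as a literal token, so searching for the first '.' finds the literal rather than the dot marker.

```python
def dotted(prod, lr_index):
    """Insert the dot marker at position lr_index of a right-hand side."""
    return prod[:lr_index] + ('.',) + prod[lr_index:]

# Hypothetical rule whose right-hand side contains '.' as a literal
prod = ('expr', '.', 'ID')
lr_index = 3                      # dot marker at the very end
item = dotted(prod, lr_index)     # ('expr', '.', 'ID', '.')

found = item.index('.')           # finds the literal at position 1
```

Here `found` is 1 while the real dot position is 3, which is why the stored index should be used instead of a search.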
+ +

+lr.lr_after +

+A list of all productions that can legally appear immediately to the right of the +dot (.). This list contains Production instances. This attribute +represents all of the possible branches a parse can take from the current position. +For example, suppose that lr represents a stage immediately before +an expression like this: + +
+>>> lr
+LRItem(statement -> ID = . expr)
+>>>
+
+ +Then, the value of lr.lr_after might look like this, showing all productions that +can legally appear next: + +
+>>> lr.lr_after
+[Production(expr -> expr PLUS expr), 
+ Production(expr -> expr MINUS expr), 
+ Production(expr -> expr TIMES expr), 
+ Production(expr -> expr DIVIDE expr), 
+ Production(expr -> MINUS expr), 
+ Production(expr -> LPAREN expr RPAREN), 
+ Production(expr -> NUMBER), 
+ Production(expr -> ID)]
+>>>
+
+ +
+ +

+lr.lr_before +

+The grammar symbol that appears immediately before the dot (.) or None if +at the beginning of the parse. +
+ +

+lr.lr_next +

+A link to the next LR item, representing the next stage of the parse. None if lr +is the last LR item. +
+ +LRItem instances also support the __len__() and __getitem__() special methods. +len(lr) returns the number of items in lr.prod including the dot (.). lr[n] +returns lr.prod[n]. + +

+It goes without saying that all of the attributes associated with LR +items should be assumed to be read-only. Modifications will very +likely create a small black-hole that will consume you and your code. + +

5. LRTable

+ +The LRTable class is used to represent LR parsing table data. This +minimally includes the production list, action table, and goto table. + +

+LRTable() +

+Create an empty LRTable object. This object contains only the information needed to +run an LR parser. +
+ +An instance lrtab of LRTable has the following methods: + +

+lrtab.read_table(module) +

+Populates the LR table with information from the module specified in module. +module is either a module object already loaded with import or +the name of a Python module. If it's a string containing a module name, it is +loaded and parsing data is extracted. Returns the signature value that was used +when initially writing the tables. Raises a VersionError exception if +the module was created using an incompatible version of PLY. +
+ +

+lrtab.bind_callables(dict) +

+This binds all of the function names used in productions to callable objects +found in the dictionary dict. During table generation and when reading +LR tables from files, PLY only uses the names of action functions such as 'p_expr', +'p_statement', etc. In order to actually run the parser, these names +have to be bound to callable objects. This method is always called prior to +running a parser. +
+ +After lrtab has been populated, the following attributes are defined. + +

+lrtab.lr_method +

+The LR parsing method used (e.g., 'LALR') +
+ + +

+lrtab.lr_productions +

+The production list. If the parsing tables have been newly +constructed, this will be a list of Production instances. If +the parsing tables have been read from a file, it's a list +of MiniProduction instances. This, together +with lr_action and lr_goto, contains all of the +information needed by the LR parsing engine. +
+ +

+lrtab.lr_action +

+The LR action dictionary that implements the underlying state machine. +The keys of this dictionary are the LR states. +
+ +

+lrtab.lr_goto +

+The LR goto table that contains information about grammar rule reductions. +
+ + +

6. LRGeneratedTable

+ +The LRGeneratedTable class represents constructed LR parsing tables on a +grammar. It is a subclass of LRTable. + +

+LRGeneratedTable(grammar, method='LALR',log=None) +

+Create the LR parsing tables on a grammar. grammar is an instance of Grammar, +method is a string with the parsing method ('SLR' or 'LALR'), and +log is a logger object used to write debugging information. The debugging information +written to log is the same as what appears in the parser.out file created +by yacc. By supplying a custom logger with a different message format, it is possible to get +more information (e.g., the line number in yacc.py used for issuing each line of +output in the log). The result is an instance of LRGeneratedTable. +
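A logger with a custom message format can be built entirely with the standard logging module. The sketch below is a minimal example; the logger name and the in-memory stream are arbitrary choices for illustration, and the message logged at the end merely stands in for output that the table generator would produce.

```python
import io
import logging

# Collect output in memory; a FileHandler could be used instead.
stream = io.StringIO()
handler = logging.StreamHandler(stream)

# Prefix every message with the source file and line that issued it.
handler.setFormatter(logging.Formatter('%(filename)s:%(lineno)d: %(message)s'))

log = logging.getLogger('lrtable_sketch')   # arbitrary name
log.setLevel(logging.DEBUG)
log.addHandler(handler)
log.propagate = False                       # keep output out of the root logger

log.info('state %d', 0)                     # stand-in for table-generator output
```

Going by the signature described above, such an object would be supplied as the log argument, for example `LRGeneratedTable(grammar, log=log)`.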
+ +

+An instance lr of LRGeneratedTable has the following attributes. + +

+lr.grammar +

+A link to the Grammar object used to construct the parsing tables. +
+ +

+lr.lr_method +

+The LR parsing method used (e.g., 'LALR') +
+ + +

+lr.lr_productions +

+A reference to grammar.Productions. This, together with lr_action and lr_goto, +contains all of the information needed by the LR parsing engine. +
+ +

+lr.lr_action +

+The LR action dictionary that implements the underlying state machine. The keys of this dictionary are +the LR states. +
+ +

+lr.lr_goto +

+The LR goto table that contains information about grammar rule reductions. +
+ +

+lr.sr_conflicts +

+A list of tuples (state,token,resolution) identifying all shift/reduce conflicts. state is the LR state +number where the conflict occurred, token is the token causing the conflict, and resolution is +a string describing the resolution taken. resolution is either 'shift' or 'reduce'. +
+ +

+lr.rr_conflicts +

+A list of tuples (state,rule,rejected) identifying all reduce/reduce conflicts. state is the +LR state number where the conflict occurred, rule is the production rule that was selected +and rejected is the production rule that was rejected. Both rule and rejected are +instances of Production. They can be inspected to provide the user with more information. +
+ +

+LRGeneratedTable has two public methods: write_table(), described here, and bind_callables(), which it inherits from LRTable. + +

+lr.write_table(modulename,outputdir="",signature="") +

+Writes the LR parsing table information to a Python module. modulename is a string +specifying the name of a module such as "parsetab". outputdir is the name of a +directory where the module should be created. signature is a string representing a +grammar signature that's written into the output file. This can be used to detect when +the data stored in a module file is out-of-sync with the grammar specification (and that +the tables need to be regenerated). If modulename is a string "parsetab", +this function creates a file called parsetab.py. If the module name represents a +package such as "foo.bar.parsetab", then only the last component, "parsetab" is +used. +
+ + +

7. LRParser

+ +The LRParser class implements the low-level LR parsing engine. + + +

+LRParser(lrtab, error_func) +

+Create an LRParser. lrtab is an instance of LRTable +containing the LR production and state tables. error_func is the +error function to invoke in the event of a parsing error. +
+ +An instance p of LRParser has the following methods: + +

+p.parse(input=None,lexer=None,debug=0,tracking=0,tokenfunc=None) +

+Run the parser. input is a string, which if supplied is fed into the +lexer using its input() method. lexer is an instance of the +Lexer class to use for tokenizing. If not supplied, the last lexer +created with the lex module is used. debug is a boolean flag +that enables debugging. tracking is a boolean flag that tells the +parser to perform additional line number tracking. tokenfunc is a callable +function that returns the next token. If supplied, the parser will use it to get +all tokens. +
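A token function is just a zero-argument callable that hands back one token per call. The sketch below builds one from a prepared list. It assumes that returning None signals the end of input and that tokens carry type, value, lineno, and lexpos attributes; the helper name and the SimpleNamespace stand-ins are made up for the example.

```python
from types import SimpleNamespace

def make_tokenfunc(tokens):
    """Wrap an iterable of tokens in a callable that returns the next one."""
    pending = iter(tokens)
    def tokenfunc():
        # None is assumed to mean "no more input".
        return next(pending, None)
    return tokenfunc

# Stand-in token objects with the attributes a lexer token would have
prepared = [
    SimpleNamespace(type='NUMBER', value=3, lineno=1, lexpos=0),
    SimpleNamespace(type='PLUS',   value='+', lineno=1, lexpos=2),
    SimpleNamespace(type='NUMBER', value=4, lineno=1, lexpos=4),
]
tokenfunc = make_tokenfunc(prepared)
```

Going by the signature above, the callable would then be passed as `p.parse(tokenfunc=tokenfunc)`.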
+ +

+p.restart() +

+Resets the parser state for a parse already in progress. +
+ +

8. ParserReflect

+ +

+The ParserReflect class is used to collect parser specification data +from a Python module or object. This class is what collects all of the +p_rule() functions in a PLY file, performs basic error checking, +and collects all of the needed information to build a grammar. Most of the +high-level PLY interface as used by the yacc() function is actually +implemented by this class. + +

+ParserReflect(pdict, log=None) +

+Creates a ParserReflect instance. pdict is a dictionary +containing parser specification data. This dictionary typically corresponds +to the module or class dictionary of code that implements a PLY parser. +log is a logger instance that will be used to report error +messages. +
+ +An instance p of ParserReflect has the following methods: + +

+p.get_all() +

+Collect and store all required parsing information. +
+ +

+p.validate_all() +

+Validate all of the collected parsing information. This is a separate step +from p.get_all() as a performance optimization. In order to +reduce parser start-up time, a parser can elect to only validate the +parsing data when regenerating the parsing tables. The validation +step tries to collect as much information as possible rather than +raising an exception at the first sign of trouble. The attribute +p.error is set if there are any validation errors. The +value of this attribute is also returned. +
+ +

+p.signature() +

+Compute a signature representing the contents of the collected parsing +data. The signature value should change if anything in the parser +specification has changed in a way that would justify parser table +regeneration. This method can be called after p.get_all(), +but before p.validate_all(). +
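One way to picture such a signature is as a digest over everything that defines the grammar. The following standalone sketch uses SHA-256 from hashlib purely for illustration; the choice of hash, the function name, and the exact set of inputs are assumptions and do not describe what PLY actually computes.

```python
import hashlib

def grammar_signature(start, precedence, tokens, docstrings):
    """Digest the pieces of a parser specification into one hex string."""
    digest = hashlib.sha256()               # illustrative choice of hash
    for part in (repr(start), repr(precedence), ' '.join(tokens)):
        digest.update(part.encode('utf-8'))
    for doc in docstrings:                  # grammar rules live in docstrings
        digest.update(doc.encode('utf-8'))
    return digest.hexdigest()

tokens = ('NUMBER', 'PLUS')
prec = (('left', 'PLUS'),)
sig_a = grammar_signature('expr', prec, tokens, ['expr : expr PLUS expr', 'expr : NUMBER'])
sig_b = grammar_signature('expr', prec, tokens, ['expr : expr PLUS expr', 'expr : NUMBER'])
sig_c = grammar_signature('expr', prec, tokens, ['expr : expr PLUS term', 'expr : NUMBER'])
```

Identical specifications give identical signatures, and editing a single rule changes the result, which is the property needed to decide whether cached tables are stale.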
+ +The following attributes are set in the process of collecting data: + +

+p.start +

+The grammar start symbol, if any. Taken from pdict['start']. +
+ +

+p.error_func +

+The error handling function or None. Taken from pdict['p_error']. +
+ +

+p.tokens +

+The token list. Taken from pdict['tokens']. +
+ +

+p.prec +

+The precedence specifier. Taken from pdict['precedence']. +
+ +

+p.preclist +

+A parsed version of the precedence specified. A list of tuples of the form +(token,assoc,level) where token is the terminal symbol, +assoc is the associativity (e.g., 'left') and level +is a numeric precedence level. +
+ +

+p.grammar +

+A list of tuples (name, rules) representing the grammar rules. name is the +name of a Python function or method in pdict that starts with "p_". +rules is a list of tuples (filename,line,prodname,syms) representing +the grammar rules found in the documentation string of that function. filename and line contain location +information that can be used for debugging. prodname is the name of the +production. syms is the right-hand side of the production. If you have a +function like this + +
+def p_expr(p):
+    '''expr : expr PLUS expr
+            | expr MINUS expr
+            | expr TIMES expr
+            | expr DIVIDE expr'''
+
+ +then the corresponding entry in p.grammar might look like this: + +
+('p_expr', [ ('calc.py',10,'expr', ['expr','PLUS','expr']),
+             ('calc.py',11,'expr', ['expr','MINUS','expr']),
+             ('calc.py',12,'expr', ['expr','TIMES','expr']),
+             ('calc.py',13,'expr', ['expr','DIVIDE','expr'])
+           ])
+
+
+ +
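The mapping from a documentation string to the rule tuples above can be sketched as follows. This is a simplified standalone parser with no error checking; the function name is hypothetical, and it takes the line number of the first docstring line as an argument rather than discovering it by reflection.

```python
def parse_rules(doc, filename, line):
    """Split a grammar docstring into (filename, line, prodname, syms) tuples.

    `line` is the line number of the first line of `doc`.
    """
    rules = []
    lastname = None
    for offset, text in enumerate(doc.splitlines()):
        parts = text.split()
        if not parts:
            continue                      # skip blank lines
        if parts[0] == '|':
            # Alternative: reuse the name from the previous rule
            name, syms = lastname, parts[1:]
        else:
            # "name : sym sym ..." (parts[1] is the ':' separator)
            name, syms = parts[0], parts[2:]
            lastname = name
        rules.append((filename, line + offset, name, syms))
    return rules

doc = '''expr : expr PLUS expr
            | expr MINUS expr
            | expr TIMES expr
            | expr DIVIDE expr'''
rules = parse_rules(doc, 'calc.py', 10)
```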

+p.pfuncs +

+A sorted list of tuples (line, file, name, doc) representing all of +the p_ functions found. line and file give location +information. name is the name of the function. doc is the +documentation string. This list is sorted in ascending order by line number. +
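Collecting and ordering the rule functions can be approximated with ordinary reflection. The sketch below is standalone; the helper name is hypothetical, and reading the location from the function's code object is an assumption about how such information can be obtained, not a statement about PLY's code.

```python
def collect_pfuncs(pdict):
    """Return (line, file, name, doc) tuples for p_ functions, sorted by line."""
    found = []
    for name, item in pdict.items():
        if not name.startswith('p_') or name == 'p_error':
            continue
        if not callable(item):
            continue
        # Bound methods keep their code object on the underlying function.
        code = getattr(item, '__func__', item).__code__
        found.append((code.co_firstlineno, code.co_filename, name, item.__doc__))
    found.sort()                          # ascending by line number
    return found

def p_statement(p):
    'statement : ID EQUALS expr'

def p_expr(p):
    'expr : NUMBER'

def p_error(p):
    pass

# Dictionary order deliberately differs from definition order
pdict = {'p_expr': p_expr, 'p_error': p_error,
         'tokens': ('ID',), 'p_statement': p_statement}
pfuncs = collect_pfuncs(pdict)
```

Sorting by line number matters because the order in which rules appear in the source determines, among other things, which rule is treated as the default start rule.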
+ +

+p.files +

+A dictionary holding all of the source filenames that were encountered +while collecting parser information. Only the keys of this dictionary have +any meaning. +
+ +

+p.error +

+An attribute that indicates whether or not any critical errors +occurred in validation. If this is set, it means that some kind +of problem was detected and that no further processing should be +performed. +
+ + +

9. High-level operation

+ +Using all of the above classes requires some attention to detail. The yacc() +function carries out a very specific sequence of operations to create a grammar. +This same sequence should be emulated if you build an alternative PLY interface. + +
    +
  1. A ParserReflect object is created and raw grammar specification data is +collected. +
  2. A Grammar object is created and populated with information +from the specification data. +
  3. An LRGeneratedTable object is created to run the LALR algorithm over +the Grammar object. +
  4. Productions in the LRGeneratedTable are bound to callables using the bind_callables() +method. +
  5. An LRParser object is created from the information in the +LRGeneratedTable object. +
+ + + + + + + + + + diff --git a/doc/ply.html b/doc/ply.html index 13a2631..f9fe036 100644 --- a/doc/ply.html +++ b/doc/ply.html @@ -12,7 +12,7 @@

PLY (Python Lex-Yacc)

-PLY Version: 2.5 +PLY Version: 3.0

@@ -97,7 +97,10 @@

1. Introduction

nested scoping, and code generation for the SPARC processor. Approximately 30 different compiler implementations were completed in this course. Most of PLY's interface and operation has been influenced by common -usability problems encountered by students. +usability problems encountered by students. Since 2001, PLY has +continued to be improved as feedback has been received from users. +PLY-3.0 represents a major refactoring of the original implementation +with an eye towards future enhancements.

Since PLY was primarily developed as an instructional tool, you will @@ -245,11 +248,7 @@

3.1 Lex Example

# A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -266,11 +265,14 @@

3.1 Lex Example

t.lexer.skip(1) # Build the lexer -lex.lex() +lexer = lex.lex() -To use the lexer, you first need to feed it some input text using its input() method. After that, repeated calls to token() produce tokens. The following code shows how this works: +To use the lexer, you first need to feed it some input text using +its input() method. After that, repeated calls +to token() produce tokens. The following code shows how this +works:
@@ -282,11 +284,11 @@ 

3.1 Lex Example

''' # Give the lexer some input -lex.input(data) +lexer.input(data) # Tokenize -while 1: - tok = lex.token() +while True: + tok = lexer.token() if not tok: break # No more input print tok
@@ -310,7 +312,16 @@

3.1 Lex Example

-The tokens returned by lex.token() are instances +Lexers also support the iteration protocol. So, you can write the above loop as follows: + +
+
+for tok in lexer:
+    print tok
+
+
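Supporting iteration on top of a token() method takes only two extra methods. The toy class below is a standalone sketch written for Python 3; it splits its input on whitespace instead of using regular expressions, and it is not PLY's Lexer.

```python
class ToyLexer:
    """Minimal stand-in showing how token() maps onto the iteration protocol."""
    def __init__(self, text):
        self._words = text.split()
        self._pos = 0

    def token(self):
        # Return the next "token", or None when the input is exhausted.
        if self._pos >= len(self._words):
            return None
        word = self._words[self._pos]
        self._pos += 1
        return word

    def __iter__(self):
        return self

    def __next__(self):
        tok = self.token()
        if tok is None:
            raise StopIteration
        return tok

collected = list(ToyLexer('3 + 4'))
```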
+ +The tokens returned by lexer.token() are instances of LexToken. This object has attributes tok.type, tok.value, tok.lineno, and tok.lexpos. The following code shows an example of @@ -319,8 +330,8 @@

3.1 Lex Example

 # Tokenize
-while 1:
-    tok = lex.token()
+while True:
+    tok = lexer.token()
     if not tok: break      # No more input
     print tok.type, tok.value, tok.lineno, tok.lexpos
 
@@ -429,7 +440,7 @@

3.3 Specification of tokens

... } -tokens = ['LPAREN','RPAREN',...,'ID'] + reserved.values() +tokens = ['LPAREN','RPAREN',...,'ID'] + list(reserved.values()) def t_ID(t): r'[a-zA-Z_][a-zA-Z_0-9]*' @@ -530,11 +541,10 @@

3.6 Line numbers and positional information

# input is the input text string # token is a token instance def find_column(input,token): - i = token.lexpos - while i > 0: - if input[i] == '\n': break - i -= 1 - column = (token.lexpos - i)+1 + last_cr = input.rfind('\n',0,token.lexpos) + if last_cr < 0: + last_cr = 0 + column = (token.lexpos - last_cr) + 1 return column
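The column arithmetic can be checked in isolation. The sketch below is a standalone variant that takes the character offset directly instead of a token object, and computes the start of the line as one position past the preceding newline. With the last_cr form shown above, a token on the first line gets the same answer, but a token that directly follows a newline is reported one column further right than in this variant.

```python
def find_column(text, lexpos):
    """Return the 1-based column of the character at offset `lexpos`."""
    # rfind returns -1 when there is no earlier newline, so adding 1
    # yields 0, the start of the first line.
    line_start = text.rfind('\n', 0, lexpos) + 1
    return (lexpos - line_start) + 1

text = 'x = 1\ny = 22\n'
```

For this input, offset 0 (the `x`) and offset 6 (the `y`) both map to column 1.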
@@ -607,36 +617,34 @@

3.10 Building and using the lexer

To build the lexer, the function lex.lex() is used. This function uses Python reflection (or introspection) to read the regular expression rules -out of the calling context and build the lexer. Once the lexer has been built, two functions can +out of the calling context and build the lexer. Once the lexer has been built, two methods can be used to control the lexer.

-If desired, the lexer can also be used as an object. The lex() returns a Lexer object that -can be used for this purpose. For example: +The preferred way to use PLY is to invoke the above methods directly on the lexer object returned by the +lex() function. The legacy interface to PLY involves module-level functions lex.input() and lex.token(). +For example:
-lexer = lex.lex()
-lexer.input(sometext)
+lex.lex()
+lex.input(sometext)
 while 1:
-    tok = lexer.token()
+    tok = lex.token()
     if not tok: break
     print tok
 

-This latter technique should be used if you intend to use multiple lexers in your application. Simply define each -lexer in its own module and use the object returned by lex() as appropriate. - -

-Note: The global functions lex.input() and lex.token() are bound to the input() -and token() methods of the last lexer created by the lex module. +In this example, the module-level functions lex.input() and lex.token() are bound to the input() +and token() methods of the last lexer created by the lex module. This interface may go away at some point so +it's probably best not to use it.

3.11 The @TOKEN decorator

@@ -785,11 +793,7 @@

3.14 Alternative specification of lexers

# A regular expression rule with some action code def t_NUMBER(t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -826,7 +830,7 @@

3.14 Alternative specification of lexers

-The object option can be used to define lexers as a class instead of a module. For example: +The module option can also be used to define lexers from instances of a class. For example:
@@ -856,11 +860,7 @@ 

3.14 Alternative specification of lexers

# Note addition of self parameter since we're in a class def t_NUMBER(self,t): r'\d+' - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t # Define a rule so we can track line numbers @@ -878,12 +878,12 @@

3.14 Alternative specification of lexers

# Build the lexer def build(self,**kwargs): - self.lexer = lex.lex(object=self, **kwargs) + self.lexer = lex.lex(module=self, **kwargs) # Test its output def test(self,data): self.lexer.input(data) - while 1: + while True: tok = self.lexer.token() if not tok: break print tok @@ -895,18 +895,80 @@

3.14 Alternative specification of lexers

-When building a lexer from class, you should construct the lexer from -an instance of the class, not the class object itself. Also, for -reasons that are subtle, you should NOT -invoke lex.lex() inside the __init__() method of -your class. If you do, it may cause bizarre behavior if someone tries -to duplicate a lexer object. + +When building a lexer from a class, you should construct the lexer from +an instance of the class, not the class object itself. This is because +PLY only works properly if the lexer actions are defined by bound methods. + +

+When using the module option to lex(), PLY collects symbols +from the underlying object using the dir() function. There is no +direct access to the __dict__ attribute of the object supplied as a +module value. + +

+Finally, if you want to keep things nicely encapsulated, but don't want to use a +full-fledged class definition, lexers can be defined using closures. For example: + +

+
+import ply.lex as lex
+
+# List of token names.   This is always required
+tokens = (
+  'NUMBER',
+  'PLUS',
+  'MINUS',
+  'TIMES',
+  'DIVIDE',
+  'LPAREN',
+  'RPAREN',
+)
+
+def MyLexer():
+    # Regular expression rules for simple tokens
+    t_PLUS    = r'\+'
+    t_MINUS   = r'-'
+    t_TIMES   = r'\*'
+    t_DIVIDE  = r'/'
+    t_LPAREN  = r'\('
+    t_RPAREN  = r'\)'
+
+    # A regular expression rule with some action code
+    def t_NUMBER(t):
+        r'\d+'
+        t.value = int(t.value)    
+        return t
+
+    # Define a rule so we can track line numbers
+    def t_newline(t):
+        r'\n+'
+        t.lexer.lineno += len(t.value)
+
+    # A string containing ignored characters (spaces and tabs)
+    t_ignore  = ' \t'
+
+    # Error handling rule
+    def t_error(t):
+        print "Illegal character '%s'" % t.value[0]
+        t.lexer.skip(1)
+
+    # Build the lexer from my environment and return it    
+    return lex.lex()
+
+
+

3.15 Maintaining state

-In your lexer, you may want to maintain a variety of state information. This might include mode settings, symbol tables, and other details. There are a few -different ways to handle this situation. One way to do this is to keep a set of global variables in the module -where you created the lexer. For example: +In your lexer, you may want to maintain a variety of state +information. This might include mode settings, symbol tables, and +other details. As an example, suppose that you wanted to keep +track of how many NUMBER tokens had been encountered. + +

+One way to do this is to keep a set of global variables in the module +where you created the lexer. For example:

@@ -915,28 +977,22 @@ 

3.15 Maintaining state

r'\d+' global num_count num_count += 1 - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t
-Alternatively, you can store this information inside the Lexer object created by lex(). To this, you can use the lexer attribute -of tokens passed to the various rules. For example: +If you don't like the use of a global variable, another place to store +information is inside the Lexer object created by lex(). +To do this, you can use the lexer attribute of tokens passed to +the various rules. For example:
 def t_NUMBER(t):
     r'\d+'
     t.lexer.num_count += 1     # Note use of lexer attribute
-    try:
-         t.value = int(t.value)    
-    except ValueError:
-         print "Line %d: Number %s is too large!" % (t.lineno,t.value)
-	 t.value = 0
+    t.value = int(t.value)    
     return t
 
 lexer = lex.lex()
@@ -944,17 +1000,20 @@ 

3.15 Maintaining state

-This latter approach has the advantage of storing information inside -the lexer object itself---something that may be useful if multiple instances -of the same lexer have been created. However, it may also feel kind -of "hacky" to the OO purists. Just to put their mind at some ease, all +This latter approach has the advantage of being simple and working +correctly in applications where multiple instantiations of a given +lexer exist in the same application. However, this might also feel +like a gross violation of encapsulation to OO purists. +Just to put your mind at some ease, all internal attributes of the lexer (with the exception of lineno) have names that are prefixed by lex (e.g., lexdata,lexpos, etc.). Thus, -it should be perfectly safe to store attributes in the lexer that -don't have names starting with that prefix. +it is perfectly safe to store attributes in the lexer that +don't have names starting with that prefix or a name that conflicts with one of the +predefined methods (e.g., input(), token(), etc.).

-A third approach is to define the lexer as a class as shown in the previous example: +If you don't like assigning values on the lexer object, you can define your lexer as a class as +shown in the previous section:

@@ -963,11 +1022,7 @@ 

3.15 Maintaining state

def t_NUMBER(self,t): r'\d+' self.num_count += 1 - try: - t.value = int(t.value) - except ValueError: - print "Line %d: Number %s is too large!" % (t.lineno,t.value) - t.value = 0 + t.value = int(t.value) return t def build(self, **kwargs): @@ -975,10 +1030,6 @@

3.15 Maintaining state

def __init__(self): self.num_count = 0 - -# Create a lexer -m = MyLexer() -lexer = lex.lex(object=m)
@@ -986,10 +1037,28 @@

3.15 Maintaining state

going to be creating multiple instances of the same lexer and you need to manage a lot of state. +

+State can also be managed through closures. For example, in Python 3: + +

+
+def MyLexer():
+    num_count = 0
+    ...
+    def t_NUMBER(t):
+        r'\d+'
+        nonlocal num_count
+        num_count += 1
+        t.value = int(t.value)    
+        return t
+    ...
+
+
+
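The closure fragment above leaves open how a caller reads the counter back once the lexer has been built. The following is a minimal sketch of one way to do that, written in Python 3 and deliberately free of any PLY import. The names make_counter_rule and get_count are hypothetical, and the simplified t_NUMBER takes the matched text rather than a token object.

```python
def make_counter_rule():
    """Stand-in for the MyLexer() closure; illustrates only the state pattern."""
    num_count = 0

    def t_NUMBER(text):
        # Hypothetical simplified rule: receives matched text, not a PLY token.
        nonlocal num_count          # rebinding the enclosing variable needs nonlocal
        num_count += 1
        return int(text)

    def get_count():
        # Reading the enclosing variable does not require nonlocal.
        return num_count

    return t_NUMBER, get_count


t_NUMBER, get_count = make_counter_rule()
values = [t_NUMBER(text) for text in ("3", "42", "7")]

# A second call creates an independent counter.
other_rule, other_count = make_counter_rule()
other_rule("1")
```

Each call to the factory gets its own num_count, so two lexers built this way would not interfere with one another.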

3.16 Lexer cloning

-If necessary, a lexer object can be quickly duplicated by invoking its clone() method. For example: +If necessary, a lexer object can be duplicated by invoking its clone() method. For example:

@@ -1009,9 +1078,15 @@ 

3.16 Lexer cloning

cloned lexers could be used to handle different input files.

-Special considerations need to be made when cloning lexers that also maintain their own -internal state. Namely, you need to be aware that the newly created lexers will share all -of this state with the original lexer. For example, if you defined a lexer as a class and did this: +Creating a clone is different than calling lex.lex() in that +PLY doesn't regenerate any of the internal tables or regular expressions. So, + +

+Special considerations need to be made when cloning lexers that also +maintain their own internal state using classes or closures. Namely, +you need to be aware that the newly created lexers will share all of +this state with the original lexer. For example, if you defined a +lexer as a class and did this:

@@ -1024,8 +1099,9 @@ 

3.16 Lexer cloning

Then both a and b are going to be bound to the same object m and any changes to m will be reflected in both lexers. It's -important to emphasize that clone() is not meant to make a totally new copy of a -lexer. If you want to do that, call lex() again to create a new lexer. +important to emphasize that clone() is only meant to create a new lexer +that reuses the regular expressions and environment of another lexer. If you +need to make a totally new copy of a lexer, then call lex() again.
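The sharing described here is what one would expect from a shallow copy. The sketch below assumes that behavior and uses stand-in classes (Rules and ToyLexer are hypothetical and are not PLY's own classes) to show which state is shared and which is not.

```python
import copy


class Rules:
    """Hypothetical container for user-defined lexer state."""
    def __init__(self):
        self.num_count = 0


class ToyLexer:
    """Stand-in for a lexer object; NOT PLY's Lexer implementation."""
    def __init__(self, rules):
        self.rules = rules      # reference to the user's object
        self.lexpos = 0         # per-lexer scanning position

    def clone(self):
        # Assumption: cloning behaves like a shallow copy.
        return copy.copy(self)


m = Rules()
a = ToyLexer(m)
b = a.clone()

b.rules.num_count += 1   # visible through a as well: same Rules object
b.lexpos = 10            # plain attribute on the copy: a is unaffected

fresh = ToyLexer(Rules())  # analogous to calling lex() again on a new instance
```

Scalar attributes such as the position are independent after the copy, while anything reached through a shared reference is not.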

3.17 Internal lexer state

@@ -1045,8 +1121,9 @@

3.17 Internal lexer state

lexer.lineno

-The current value of the line number attribute stored in the lexer. This can be modified as needed to -change the line number. +The current value of the line number attribute stored in the lexer. PLY only specifies that the attribute +exists---it never sets, updates, or performs any processing with it. If you want to track line numbers, +you will need to add code yourself (see the section on line numbers and positional information).

@@ -1066,7 +1143,6 @@

3.17 Internal lexer state

3.18 Conditional lexing and start conditions

- In advanced parsing applications, it may be useful to have different lexing states. For instance, you may want the occurrence of a certain token or syntactic construct to trigger a different kind of lexing. @@ -1329,9 +1405,10 @@

4. Parsing basics

In the grammar, symbols such as NUMBER, +, -, *, and / are known -as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to more -complex rules, typically comprised of a collection of tokens. These identifiers are known as non-terminals. +as terminals and correspond to raw input tokens. Identifiers such as term and factor refer to +grammar rules comprised of a collection of terminals and other rules. These identifiers are known as non-terminals.

+ The semantic behavior of a language is often specified using a technique known as syntax directed translation. In syntax directed translation, attributes are attached to each symbol in a given grammar @@ -1357,9 +1434,12 @@

4. Parsing basics

-A good way to think about syntax directed translation is to simply think of each symbol in the grammar as some -kind of object. The semantics of the language are then expressed as a collection of methods/operations on these -objects. +A good way to think about syntax directed translation is to +view each symbol in the grammar as a kind of object. Associated +with each symbol is a value representing its "state" (for example, the +val attribute above). Semantic +actions are then expressed as a collection of functions or methods +that operate on the symbols and associated values.

Yacc uses a parsing technique known as LR-parsing or shift-reduce parsing. LR parsing is a @@ -1368,64 +1448,78 @@

4. Parsing basics

grammar symbols are replaced by the grammar symbol on the left-hand-side.

-LR parsing is commonly implemented by shifting grammar symbols onto a stack and looking at the stack and the next -input token for patterns. The details of the algorithm can be found in a compiler text, but the -following example illustrates the steps that are performed if you wanted to parse the expression -3 + 5 * (10 - 20) using the grammar defined above: +LR parsing is commonly implemented by shifting grammar symbols onto a +stack and looking at the stack and the next input token for patterns that +match one of the grammar rules. +The details of the algorithm can be found in a compiler textbook, but the +following example illustrates the steps that are performed if you +wanted to parse the expression +3 + 5 * (10 - 20) using the grammar defined above. In the example, +the special symbol $ represents the end of input. +

 Step Symbol Stack           Input Tokens            Action
 ---- ---------------------  ---------------------   -------------------------------
-1    $                      3 + 5 * ( 10 - 20 )$    Shift 3
-2    $ 3                      + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
-3    $ factor                 + 5 * ( 10 - 20 )$    Reduce term   : factor
-4    $ term                   + 5 * ( 10 - 20 )$    Reduce expr : term
-5    $ expr                   + 5 * ( 10 - 20 )$    Shift +
-6    $ expr +                   5 * ( 10 - 20 )$    Shift 5
-7    $ expr + 5                   * ( 10 - 20 )$    Reduce factor : NUMBER
-8    $ expr + factor              * ( 10 - 20 )$    Reduce term   : factor
-9    $ expr + term                * ( 10 - 20 )$    Shift *
-10   $ expr + term *                ( 10 - 20 )$    Shift (
-11   $ expr + term * (                10 - 20 )$    Shift 10
-12   $ expr + term * ( 10                - 20 )$    Reduce factor : NUMBER
-13   $ expr + term * ( factor            - 20 )$    Reduce term : factor
-14   $ expr + term * ( term              - 20 )$    Reduce expr : term
-15   $ expr + term * ( expr              - 20 )$    Shift -
-16   $ expr + term * ( expr -              20 )$    Shift 20
-17   $ expr + term * ( expr - 20              )$    Reduce factor : NUMBER
-18   $ expr + term * ( expr - factor          )$    Reduce term : factor
-19   $ expr + term * ( expr - term            )$    Reduce expr : expr - term
-20   $ expr + term * ( expr                   )$    Shift )
-21   $ expr + term * ( expr )                  $    Reduce factor : (expr)
-22   $ expr + term * factor                    $    Reduce term : term * factor
-23   $ expr + term                             $    Reduce expr : expr + term
-24   $ expr                                    $    Reduce expr
-25   $                                         $    Success!
-
-
- -When parsing the expression, an underlying state machine and the current input token determine what to do next. -If the next token looks like part of a valid grammar rule (based on other items on the stack), it is generally shifted -onto the stack. If the top of the stack contains a valid right-hand-side of a grammar rule, it is -usually "reduced" and the symbols replaced with the symbol on the left-hand-side. When this reduction occurs, the -appropriate action is triggered (if defined). If the input token can't be shifted and the top of stack doesn't match -any grammar rules, a syntax error has occurred and the parser must take some kind of recovery step (or bail out). - -

-It is important to note that the underlying implementation is built around a large finite-state machine that is encoded -in a collection of tables. The construction of these tables is quite complicated and beyond the scope of this discussion. -However, subtle details of this process explain why, in the example above, the parser chooses to shift a token -onto the stack in step 9 rather than reducing the rule expr : expr + term. - -

5. Yacc reference

- - -This section describes how to use write parsers in PLY.
+1                           3 + 5 * ( 10 - 20 )$    Shift 3
+2    3                        + 5 * ( 10 - 20 )$    Reduce factor : NUMBER
+3    factor                   + 5 * ( 10 - 20 )$    Reduce term   : factor
+4    term                     + 5 * ( 10 - 20 )$    Reduce expr : term
+5    expr                     + 5 * ( 10 - 20 )$    Shift +
+6    expr +                     5 * ( 10 - 20 )$    Shift 5
+7    expr + 5                     * ( 10 - 20 )$    Reduce factor : NUMBER
+8    expr + factor                * ( 10 - 20 )$    Reduce term   : factor
+9    expr + term                  * ( 10 - 20 )$    Shift *
+10   expr + term *                  ( 10 - 20 )$    Shift (
+11   expr + term * (                  10 - 20 )$    Shift 10
+12   expr + term * ( 10                  - 20 )$    Reduce factor : NUMBER
+13   expr + term * ( factor              - 20 )$    Reduce term : factor
+14   expr + term * ( term                - 20 )$    Reduce expr : term
+15   expr + term * ( expr                - 20 )$    Shift -
+16   expr + term * ( expr -                20 )$    Shift 20
+17   expr + term * ( expr - 20                )$    Reduce factor : NUMBER
+18   expr + term * ( expr - factor            )$    Reduce term : factor
+19   expr + term * ( expr - term              )$    Reduce expr : expr - term
+20   expr + term * ( expr                     )$    Shift )
+21   expr + term * ( expr )                    $    Reduce factor : (expr)
+22   expr + term * factor                      $    Reduce term : term * factor
+23   expr + term                               $    Reduce expr : expr + term
+24   expr                                      $    Reduce expr
+25                                             $    Success!
+ + + +When parsing the expression, an underlying state machine and the +current input token determine what happens next. If the next token +looks like part of a valid grammar rule (based on other items on the +stack), it is generally shifted onto the stack. If the top of the +stack contains a valid right-hand-side of a grammar rule, it is +usually "reduced" and the symbols replaced with the symbol on the +left-hand-side. When this reduction occurs, the appropriate action is +triggered (if defined). If the input token can't be shifted and the +top of stack doesn't match any grammar rules, a syntax error has +occurred and the parser must take some kind of recovery step (or bail +out). A parse is only successful if the parser reaches a state where +the symbol stack is empty and there are no more input tokens. + +

+It is important to note that the underlying implementation is built +around a large finite-state machine that is encoded in a collection of +tables. The construction of these tables is non-trivial and +beyond the scope of this discussion. However, subtle details of this +process explain why, in the example above, the parser chooses to shift +a token onto the stack in step 9 rather than reducing the +rule expr : expr + term. + +

5. Yacc

+ +The ply.yacc module implements the parsing component of PLY. +The name "yacc" stands for "Yet Another Compiler Compiler" and is +borrowed from the Unix tool of the same name.

5.1 An example

- Suppose you wanted to make a grammar for simple arithmetic expressions as previously described. Here is how you would do it with yacc.py: @@ -1475,26 +1569,26 @@

5.1 An example

print "Syntax error in input!" # Build the parser -yacc.yacc() - -# Use this if you want to build the parser using SLR instead of LALR -# yacc.yacc(method="SLR") +parser = yacc.yacc() -while 1: +while True: try: s = raw_input('calc > ') except EOFError: break if not s: continue - result = yacc.parse(s) + result = parser.parse(s) print result -In this example, each grammar rule is defined by a Python function where the docstring to that function contains the -appropriate context-free grammar specification. Each function accepts a single -argument p that is a sequence containing the values of each grammar symbol in the corresponding rule. The values of -p[i] are mapped to grammar symbols as shown here: +In this example, each grammar rule is defined by a Python function +where the docstring to that function contains the appropriate +context-free grammar specification. The statements that make up the +function body implement the semantic actions of the rule. Each function +accepts a single argument p that is a sequence containing the +values of each grammar symbol in the corresponding rule. The values +of p[i] are mapped to grammar symbols as shown here:
@@ -1507,42 +1601,49 @@ 

5.1 An example

-For tokens, the "value" of the corresponding p[i] is the -same as the p.value attribute assigned -in the lexer module. For non-terminals, the value is determined by -whatever is placed in p[0] when rules are reduced. This -value can be anything at all. However, it probably most common for -the value to be a simple Python type, a tuple, or an instance. In this example, we -are relying on the fact that the NUMBER token stores an integer value in its value -field. All of the other rules simply perform various types of integer operations and store -the result. - -

-Note: The use of negative indices have a special meaning in yacc---specially p[-1] does -not have the same value as p[3] in this example. Please see the section on "Embedded Actions" for further -details. -

-The first rule defined in the yacc specification determines the starting grammar -symbol (in this case, a rule for expression appears first). Whenever -the starting rule is reduced by the parser and no more input is available, parsing -stops and the final value is returned (this value will be whatever the top-most rule -placed in p[0]). Note: an alternative starting symbol can be specified using the start keyword argument to +For tokens, the "value" of the corresponding p[i] is the +same as the p.value attribute assigned in the lexer +module. For non-terminals, the value is determined by whatever is +placed in p[0] when rules are reduced. This value can be +anything at all. However, it is probably most common for the value to be +a simple Python type, a tuple, or an instance. In this example, we +are relying on the fact that the NUMBER token stores an +integer value in its value field. All of the other rules simply +perform various types of integer operations and propagate the result. +

+ +

+Note: Negative indices have a special meaning in +yacc---specifically, p[-1] does not have the same value +as p[3] in this example. Please see the section on "Embedded +Actions" for further details. +

+ +

+The first rule defined in the yacc specification determines the +starting grammar symbol (in this case, a rule for expression +appears first). Whenever the starting rule is reduced by the parser +and no more input is available, parsing stops and the final value is +returned (this value will be whatever the top-most rule placed +in p[0]). Note: an alternative starting symbol can be +specified using the start keyword argument to yacc(). -

The p_error(p) rule is defined to catch syntax errors. See the error handling section -below for more detail. +

The p_error(p) rule is defined to catch syntax errors. +See the error handling section below for more detail.

-To build the parser, call the yacc.yacc() function. This function -looks at the module and attempts to construct all of the LR parsing tables for the grammar -you have specified. The first time yacc.yacc() is invoked, you will get a message -such as this: +To build the parser, call the yacc.yacc() function. This +function looks at the module and attempts to construct all of the LR +parsing tables for the grammar you have specified. The first +time yacc.yacc() is invoked, you will get a message such as +this:

 $ python calcparse.py
-yacc: Generating LALR parsing table...  
+Generating LALR tables
 calc > 
 
@@ -1554,7 +1655,8 @@

5.1 An example

executions, yacc will reload the table from parsetab.py unless it has detected a change in the underlying grammar (in which case the tables and parsetab.py file are -regenerated). Note: The names of parser output files can be changed if necessary. See the notes that follow later. +regenerated). Note: The names of parser output files can be changed +if necessary. See the PLY Reference for details.

If any errors are detected in your grammar specification, yacc.py will produce @@ -1569,7 +1671,16 @@

5.1 An example

  • Undefined rules and tokens -The next few sections now discuss a few finer points of grammar construction. +The next few sections discuss grammar specification in more detail. + +

    +The final part of the example shows how to actually run the parser +created by +yacc(). To run the parser, you simply have to call +parse() with a string of input text. This will run all +of the grammar rules and return the result of the entire parse. The +returned result is the value assigned to p[0] in the starting +grammar rule.

    5.2 Combining Grammar Rule Functions

    @@ -1640,8 +1751,15 @@

    5.2 Combining Grammar Rule Functions

    -

    5.3 Character Literals

    +If parsing performance is a concern, you should resist the urge to put +too much conditional processing into a single grammar rule as shown in +these examples. When you add checks to see which grammar rule is +being handled, you are actually duplicating the work that the parser +has already performed (i.e., the parser already knows exactly what rule it +matched). You can eliminate this overhead by using a +separate p_rule() function for each grammar rule. +
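To make the trade-off concrete, here is a hedged sketch contrasting the two styles. The function names and the PLUS/MINUS rules are illustrative only, and a real grammar module would keep just one of the two styles. No PLY import is needed to exercise the functions because a plain list can stand in for the p argument.

```python
# Combined style: one function handles two productions and must
# re-discover at run time which one was matched.
def p_expression_combined(p):
    '''expression : expression PLUS term
                  | expression MINUS term'''
    if p[2] == '+':
        p[0] = p[1] + p[3]
    elif p[2] == '-':
        p[0] = p[1] - p[3]


# Separate style: one function per production, so no run-time check.
def p_expression_plus(p):
    'expression : expression PLUS term'
    p[0] = p[1] + p[3]


def p_expression_minus(p):
    'expression : expression MINUS term'
    p[0] = p[1] - p[3]


# Plain lists stand in for the p sequence: slot 0 receives the result.
combined = [None, 7, '-', 2]
p_expression_combined(combined)

separate = [None, 7, '-', 2]
p_expression_minus(separate)

added = [None, 7, '+', 2]
p_expression_plus(added)
```

Both styles compute the same values; the separate functions simply skip the comparison on p[2].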

    5.3 Character Literals

    If desired, a grammar may contain tokens defined as single character literals. For example: @@ -1700,12 +1818,13 @@

    5.4 Empty Productions

    -Note: You can write empty rules anywhere by simply specifying an empty right hand side. However, I personally find that -writing an "empty" rule and using "empty" to denote an empty production is easier to read. +Note: You can write empty rules anywhere by simply specifying an empty +right hand side. However, I personally find that writing an "empty" +rule and using "empty" to denote an empty production is easier to read +and more clearly states your intentions.

    5.5 Changing the starting symbol

    - Normally, the first rule found in a yacc specification defines the starting grammar rule (top level rule). To change this, simply supply a start specifier in your file. For example: @@ -1723,8 +1842,10 @@

    5.5 Changing the starting symbol

    -The use of a start specifier may be useful during debugging since you can use it to have yacc build a subset of -a larger grammar. For this purpose, it is also possible to specify a starting symbol as an argument to yacc(). For example: +The use of a start specifier may be useful during debugging +since you can use it to have yacc build a subset of a larger grammar. +For this purpose, it is also possible to specify a starting symbol as +an argument to yacc(). For example:
    @@ -1735,9 +1856,11 @@ 

    5.5 Changing the starting symbol

    5.6 Dealing With Ambiguous Grammars

    -The expression grammar given in the earlier example has been written in a special format to eliminate ambiguity. -However, in many situations, it is extremely difficult or awkward to write grammars in this format. A -much more natural way to express the grammar is in a more compact form like this: +The expression grammar given in the earlier example has been written +in a special format to eliminate ambiguity. However, in many +situations, it is extremely difficult or awkward to write grammars in +this format. A much more natural way to express the grammar is in a +more compact form like this:
    @@ -1750,15 +1873,18 @@ 

    5.6 Dealing With Ambiguous Grammars

    -Unfortunately, this grammar specification is ambiguous. For example, if you are parsing the string -"3 * 4 + 5", there is no way to tell how the operators are supposed to be grouped. -For example, does the expression mean "(3 * 4) + 5" or is it "3 * (4+5)"? +Unfortunately, this grammar specification is ambiguous. For example, +if you are parsing the string "3 * 4 + 5", there is no way to tell how +the operators are supposed to be grouped. For example, does the +expression mean "(3 * 4) + 5" or is it "3 * (4+5)"?

    -When an ambiguous grammar is given to yacc.py it will print messages about "shift/reduce conflicts" -or a "reduce/reduce conflicts". A shift/reduce conflict is caused when the parser generator can't decide -whether or not to reduce a rule or shift a symbol on the parsing stack. For example, consider -the string "3 * 4 + 5" and the internal parsing stack: +When an ambiguous grammar is given to yacc.py it will print +messages about "shift/reduce conflicts" or "reduce/reduce conflicts". +A shift/reduce conflict is caused when the parser generator can't +decide whether or not to reduce a rule or shift a symbol on the +parsing stack. For example, consider the string "3 * 4 + 5" and the +internal parsing stack:

    @@ -1773,20 +1899,25 @@ 

    5.6 Dealing With Ambiguous Grammars

    -In this case, when the parser reaches step 6, it has two options. One is to reduce the -rule expr : expr * expr on the stack. The other option is to shift the -token + on the stack. Both options are perfectly legal from the rules -of the context-free-grammar. +In this case, when the parser reaches step 6, it has two options. One +is to reduce the rule expr : expr * expr on the stack. The +other option is to shift the token + on the stack. Both +options are perfectly legal from the rules of the +context-free-grammar.

    -By default, all shift/reduce conflicts are resolved in favor of shifting. Therefore, in the above -example, the parser will always shift the + instead of reducing. Although this -strategy works in many cases (including the ambiguous if-then-else), it is not enough for arithmetic -expressions. In fact, in the above example, the decision to shift + is completely wrong---we should have -reduced expr * expr since multiplication has higher mathematical precedence than addition. +By default, all shift/reduce conflicts are resolved in favor of +shifting. Therefore, in the above example, the parser will always +shift the + instead of reducing. Although this strategy +works in many cases (for example, the case of +"if-then" versus "if-then-else"), it is not enough for arithmetic expressions. In fact, +in the above example, the decision to shift + is completely +wrong---we should have reduced expr * expr since +multiplication has higher mathematical precedence than addition. -

    To resolve ambiguity, especially in expression grammars, yacc.py allows individual -tokens to be assigned a precedence level and associativity. This is done by adding a variable +

    To resolve ambiguity, especially in expression +grammars, yacc.py allows individual tokens to be assigned a +precedence level and associativity. This is done by adding a variable precedence to the grammar file like this:

    @@ -1798,17 +1929,19 @@

    5.6 Dealing With Ambiguous Grammars

    -This declaration specifies that PLUS/MINUS have -the same precedence level and are left-associative and that -TIMES/DIVIDE have the same precedence and are left-associative. -Within the precedence declaration, tokens are ordered from lowest to highest precedence. Thus, -this declaration specifies that TIMES/DIVIDE have higher -precedence than PLUS/MINUS (since they appear later in the +This declaration specifies that PLUS/MINUS have the +same precedence level and are left-associative and that +TIMES/DIVIDE have the same precedence and are +left-associative. Within the precedence declaration, tokens +are ordered from lowest to highest precedence. Thus, this declaration +specifies that TIMES/DIVIDE have higher precedence +than PLUS/MINUS (since they appear later in the precedence specification).

    -The precedence specification works by associating a numerical precedence level value and associativity direction to -the listed tokens. For example, in the above example you get: +The precedence specification works by associating a numerical +precedence level value and associativity direction to the listed +tokens. For example, in the above example you get:

    @@ -1819,9 +1952,10 @@ 

    5.6 Dealing With Ambiguous Grammars

    -These values are then used to attach a numerical precedence value and associativity direction -to each grammar rule. This is always determined by looking at the precedence of the right-most terminal symbol. -For example: +These values are then used to attach a numerical precedence value and +associativity direction to each grammar rule. This is always +determined by looking at the precedence of the right-most terminal +symbol. For example:
    @@ -1839,7 +1973,7 @@ 

    5.6 Dealing With Ambiguous Grammars

      -
    1. If the current token has higher precedence, it is shifted. +
    2. If the current token has higher precedence than the rule on the stack, it is shifted.
    3. If the grammar rule on the stack has higher precedence, the rule is reduced.
    4. If the current token and the grammar rule have the same precedence, the rule is reduced for left associativity, whereas the token is shifted for right associativity. @@ -1847,21 +1981,28 @@

      5.6 Dealing With Ambiguous Grammars

      favor of shifting (the default).
    -For example, if "expression PLUS expression" has been parsed and the next token -is "TIMES", the action is going to be a shift because "TIMES" has a higher precedence level than "PLUS". On the other -hand, if "expression TIMES expression" has been parsed and the next token is "PLUS", the action -is going to be reduce because "PLUS" has a lower precedence than "TIMES." +For example, if "expression PLUS expression" has been parsed and the +next token is "TIMES", the action is going to be a shift because +"TIMES" has a higher precedence level than "PLUS". On the other hand, +if "expression TIMES expression" has been parsed and the next token is +"PLUS", the action is going to be reduce because "PLUS" has a lower +precedence than "TIMES."
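The resolution rules listed above can be sketched as a small decision function. This is an illustration of the rules as stated in this section, not PLY's internal code: the helper names are hypothetical, terminals are assumed to be the upper-case symbols, and a missing precedence on either side is treated as "no information" (shift).

```python
precedence = (
    ('left', 'PLUS', 'MINUS'),
    ('left', 'TIMES', 'DIVIDE'),
)


def build_table(spec):
    """Map each token to (associativity, level); later entries get higher levels."""
    table = {}
    for level, (assoc, *tokens) in enumerate(spec, start=1):
        for tok in tokens:
            table[tok] = (assoc, level)
    return table


def rule_precedence(rhs, table):
    """Precedence of a rule comes from its right-most terminal symbol."""
    for sym in reversed(rhs):
        if sym.isupper():              # assumption: terminals are upper-case
            return table.get(sym)
    return None


def resolve(rhs, lookahead, table):
    """Return 'shift' or 'reduce' for a shift/reduce conflict."""
    rule = rule_precedence(rhs, table)
    tok = table.get(lookahead)
    if rule is None or tok is None:
        return 'shift'                 # no precedence information: default
    if tok[1] > rule[1]:
        return 'shift'                 # token binds tighter
    if rule[1] > tok[1]:
        return 'reduce'                # rule binds tighter
    return 'reduce' if rule[0] == 'left' else 'shift'


table = build_table(precedence)
after_plus = resolve(['expression', 'PLUS', 'expression'], 'TIMES', table)
after_times = resolve(['expression', 'TIMES', 'expression'], 'PLUS', table)
same_level = resolve(['expression', 'PLUS', 'expression'], 'MINUS', table)
```

With the table above, the first call shifts, the second reduces, and the third reduces because PLUS and MINUS share a left-associative level.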

    -When shift/reduce conflicts are resolved using the first three techniques (with the help of -precedence rules), yacc.py will report no errors or conflicts in the grammar. +When shift/reduce conflicts are resolved using the first three +techniques (with the help of precedence rules), yacc.py will +report no errors or conflicts in the grammar (although it will print +some information in the parser.out debugging file).

    -One problem with the precedence specifier technique is that it is sometimes necessary to -change the precedence of an operator in certain contents. For example, consider a unary-minus operator -in "3 + 4 * -5". Normally, unary minus has a very high precedence--being evaluated before the multiply. -However, in our precedence specifier, MINUS has a lower precedence than TIMES. To deal with this, -precedence rules can be given for fictitious tokens like this: +One problem with the precedence specifier technique is that it is +sometimes necessary to change the precedence of an operator in certain +contexts. For example, consider a unary-minus operator in "3 + 4 * +-5". Mathematically, the unary minus is normally given a very high +precedence--being evaluated before the multiply. However, in our +precedence specifier, MINUS has a lower precedence than TIMES. To +deal with this, precedence rules can be given for so-called "fictitious tokens" +like this:

    @@ -1950,9 +2091,25 @@ 

    5.6 Dealing With Ambiguous Grammars

    the rule assignment : ID EQUALS expression.

    -It should be noted that reduce/reduce conflicts are notoriously difficult to spot -simply looking at the input grammer. To locate these, it is usually easier to look at the -parser.out debugging file with an appropriately high level of caffeination. +It should be noted that reduce/reduce conflicts are notoriously +difficult to spot simply by looking at the input grammar. When a +reduce/reduce conflict occurs, yacc() will try to help by +printing a warning message such as this: + +

    +
    +WARNING: 1 reduce/reduce conflict
    +WARNING: reduce/reduce conflict in state 15 resolved using rule (assignment -> ID EQUALS NUMBER)
    +WARNING: rejected rule (expression -> NUMBER)
    +
    +
    + +This message identifies the two rules that are in conflict. However, +it may not tell you how the parser arrived at such a state. To try +and figure it out, you'll probably have to look at your grammar and +the contents of the +parser.out debugging file with an appropriately high level of +caffeination.
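    When a grammar produces many such warnings, it can help to collect the state numbers before opening parser.out. The following is a small sketch, assuming the warnings have been captured as text and follow exactly the wording of the sample above. The helper name summarize_conflicts is hypothetical, and the message format may differ between versions.

```python
import re

# Patterns modeled on the sample warning text shown above (an assumption).
CONFLICT_RE = re.compile(
    r"reduce/reduce conflict in state (\d+) resolved using rule \((.+)\)")
REJECTED_RE = re.compile(r"rejected rule \((.+)\)")

def summarize_conflicts(log_text):
    """Return one dict per reduce/reduce conflict found in log_text."""
    conflicts = []
    for line in log_text.splitlines():
        match = CONFLICT_RE.search(line)
        if match:
            conflicts.append({
                "state": int(match.group(1)),
                "chosen": match.group(2),
                "rejected": None,
            })
            continue
        match = REJECTED_RE.search(line)
        if match and conflicts:
            # Attach the rejected rule to the most recent conflict.
            conflicts[-1]["rejected"] = match.group(1)
    return conflicts

sample = (
    "WARNING: 1 reduce/reduce conflict\n"
    "WARNING: reduce/reduce conflict in state 15 resolved using rule "
    "(assignment -> ID EQUALS NUMBER)\n"
    "WARNING: rejected rule (expression -> NUMBER)\n"
)
for conflict in summarize_conflicts(sample):
    print(conflict["state"], conflict["chosen"], "|", conflict["rejected"])
```

    The state numbers it reports are the ones to search for in parser.out.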

    5.7 The parser.out file

    @@ -2212,10 +2369,15 @@

    5.7 The parser.out file

    -In the file, each state of the grammar is described. Within each state the "." indicates the current -location of the parse within any applicable grammar rules. In addition, the actions for each valid -input token are listed. When a shift/reduce or reduce/reduce conflict arises, rules not selected -are prefixed with an !. For example: +The different states that appear in this file are a representation of +every possible sequence of valid input tokens allowed by the grammar. +When receiving input tokens, the parser is building up a stack and +looking for matching rules. Each state keeps track of the grammar +rules that might be in the process of being matched at that point. Within each +rule, the "." character indicates the current location of the parse +within that rule. In addition, the actions for each valid input token +are listed. When a shift/reduce or reduce/reduce conflict arises, +rules not selected are prefixed with an !. For example:
    @@ -2232,10 +2394,19 @@ 

    5.7 The parser.out file

    5.8 Syntax Error Handling

    +If you are creating a parser for production use, the handling of +syntax errors is important. As a general rule, you don't want a +parser to simply throw up its hands and stop at the first sign of +trouble. Instead, you want it to report the error, recover if possible, and +continue parsing so that all of the errors in the input get reported +to the user at once. This is the standard behavior found in compilers +for languages such as C, C++, and Java. -When a syntax error occurs during parsing, the error is immediately +In PLY, when a syntax error occurs during parsing, the error is immediately detected (i.e., the parser does not read any more tokens beyond the -source of the error). Error recovery in LR parsers is a delicate +source of the error). However, at this point, the parser enters a +recovery mode that can be used to try and continue further parsing. +As a general rule, error recovery in LR parsers is a delicate topic that involves ancient rituals and black-magic. The recovery mechanism provided by yacc.py is comparable to Unix yacc so you may want consult a book like O'Reilly's "Lex and Yacc" for some of the finer details.
    @@ -2407,7 +2578,7 @@

    5.8.3 Signaling an error from a production

     def p_production(p):
         'production : some production ...'
    -    raise yacc.SyntaxError
    +    raise SyntaxError
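    As a slightly fuller sketch of the same idea, a rule can raise the built-in SyntaxError only when some check on its symbols fails. The grammar rule, the RESERVED set, and the tuple result below are hypothetical and chosen purely for illustration. The function is exercised with a plain list standing in for the object yacc passes to a rule.

```python
# Hypothetical set of names that may not be assigned to.
RESERVED = {'if', 'else', 'while'}

def p_assignment(p):
    'assignment : ID EQUALS expression'
    if p[1] in RESERVED:
        # Signal a syntax error from inside the production.
        raise SyntaxError
    p[0] = ('assign', p[1], p[3])

# Stand-in for the rule's symbols: [result, ID, EQUALS, expression].
symbols = [None, 'x', '=', 42]
p_assignment(symbols)
print(symbols[0])

try:
    p_assignment([None, 'while', '=', 1])
except SyntaxError:
    print("syntax error signalled")
```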
     
    @@ -2438,8 +2609,9 @@

    5.8.4 General comments on error handling

    5.9 Line Number and Position Tracking

    -Position tracking is often a tricky problem when writing compilers. By default, PLY tracks the line number and position of -all tokens. This information is available using the following functions: +Position tracking is often a tricky problem when writing compilers. +By default, PLY tracks the line number and position of all tokens. +This information is available using the following functions:
    • p.lineno(num). Return the line number for symbol num
    @@ -2457,9 +2629,11 @@

      5.9 Line Number and Position Tracking

    -As an optional feature, yacc.py can automatically track line numbers and positions for all of the grammar symbols -as well. However, this -extra tracking requires extra processing and can significantly slow down parsing. Therefore, it must be enabled by passing the +As an optional feature, yacc.py can automatically track line +numbers and positions for all of the grammar symbols as well. +However, this extra tracking requires extra processing and can +significantly slow down parsing. Therefore, it must be enabled by +passing the tracking=True option to yacc.parse(). For example:
    @@ -2468,8 +2642,9 @@

    5.9 Line Number and Position Tracking

    -Once enabled, the lineno() and lexpos() methods work for all grammar symbols. In addition, two -additional methods can be used: +Once enabled, the lineno() and lexpos() methods work +for all grammar symbols. Two additional methods can also be +used: