Add PEG module documentation (initial version)

xthebat · Jul 8, 2023 · 846aa58 · 846aa58
1 parent e529cca
commit 846aa58
Showing 1 changed file with 194 additions and 0 deletions.
diff --git a/modules/dasPEG/readme.md b/modules/dasPEG/readme.md
@@ -0,0 +1,194 @@
+# The *das-PEG*  parser generator library
+
+The ***das-PEG*** (Parsing Expression Grammar) parser generator library is a
+versatile tool that helps you design deterministic parsers with unlimited
+lookahead capability. PEG is a type of parsing algorithm that describes a
+formal language in terms of an _ordered_ set of rules. 
+
+## How does it work?
+
+The *das-PEG* parser generator interprets your grammar and constructs a
+recursive descent parser from it. The grammar is declared in a specific
+das-like syntax inside a `parse` macro. The PEG algorithm employs a unique
+approach, using **prioritized choice** to handle ambiguity. This feature
+ensures that the first matching rule is always chosen, which is why PEG is
+sometimes referred to as "scannerless" – it integrates lexing and parsing into
+a single activity.
+
+### Example
+
+**The grammar** is a series of of rules; each rule is introduced with `var
+<rule-name>: <rule-type>`. **The types are significant**, as they are used  for
+binding variables.
+
+Following each rule is a series of `rule(...)` calls - the alternatives that
+the rule will try to match. For example, in this simplified arithmetic parser —
+
+```    
+parse calc
+    var add: int
+
+    rule(add as a, "+", mul as m) <|
+        return a + m       // <- a & m are int variables in this context
+                           //     the action should also return an int
+    rule(add as a, "-", mul as m) <|
+        return a - m
+        
+    rule(mul as m) <|
+        return m
+        
+    var mul: int
+    ...
+```
+
+— the rules for plus and minus will be tried in order. If neither of these
+matches then the last rule is checked against. 
+
+**Caching.** The `a` in `add as a` not parsed several times. The parser keeps
+the caches for every rule and reuses its results. This technique is known as
+*packrat parsing.*
+
+### Built-in rules
+
+*das-PEG* provides a number of build-in rules designed to increase the speed
+and convenience of using the parser.
+
+- string_ - stream of charachters enclosed inside `"..."`
+- double_ - floating point number
+- number - decimal integer
+- WS - *whitespace*
+- EOF - *end of file*
+
+The basic building blocks to the more complicated rules are 
+
+- **Literal match**. String literals can be enclosed in `""` and matched against
+- **Range match**. Ranges are specified via `range('1', '9')` syntax
+- **Text extraction**. Text extraction is specified with the interpolation
+  syntax and is used to match the rule and extract its corresponding `string`.
+  `rule("{name}", "=" , Value as v) <| $ { variables[name] = v; }`
+
+### Binding to the rules
+
+If you have the rule `A: type` you can bind its result with the `A as <name>`
+syntax. This allows you to access the name inside the action block with its
+type being the type of rule `A`.
+
+The modifiers can alter the meaning of binding constructs.
+
+### Rule Modifiers 
+
+There are several modifiers that can be applied to the rules. 
+
+- **Repetition** (__*__)
+
+  This modifier matches the rule *zero or more times*. Returns its result
+  inside the array. Take, for example, the array from json parser -
+
+  ```
+     var ElementList: array<JsonValue?>
+
+     rule(*CommaSeparatedElement as els, Element as last) <|
+     print("Parsed element list\n")
+     els |> push <| last
+     return <- els
+
+     var CommaSeparatedElement: JsonValue?
+     ...
+  ```
+
+  - its contents are represented as `ElementList`, a list of comma separated
+  elements ending with just the last `Element` (without comma).
+
+  Here the rule `*CommaSeparatedElement as els` binds several elements and
+  produces `array<JsonValue?>` - because `CommaSeparatedElement` is a
+  `JsonValue?`.
+
+- **Certain repetition** (**+**) - like simple *repetition* but matches *one or more times*
+
+- **Optional** (**MB**)
+
+  Matches the rule zero or more times. Returns its result inside the array.
+
+  ```
+  var Array: array<JsonValue?>
+
+      // Optional element list;
+      // If items are present (not []) then no trailing comma is allowed
+
+  rule("[", WS, MB(ElementList) as list, "]", WS) <|
+        return <- [[array<JsonValue?>]] if list |> empty
+        return <- list[0] // Take the value from optional
+  ```
+
+  Here you can see the way to get the result. If the `optional` was parsed then array is populated with a value, otherwise it's empty.
+
+- **Positive lookahead** (**PEEK**) matches the rule but does not advance the index through the input.
+
+- **Negaitve lookhead** (**!**)  peeks the next rule and checks that it does not match.
+
+
+### Available generator options
+
+The options are specified anywhere in the body of the parse macro with the `options(...)` syntax.
+
+```
+parser multilingual
+    options(utf-8)
+    ...
+```
+
+- `utf-8` - enables utf-8 decoding support
+- `trace` - enable line info tracking and failure reporting
+
+## Performance
+
+PEG parsers, including the *das-PEG* parser generator, have certain performance
+characteristics that can impact the efficiency of your parsing tasks.
+
+**Importance of Ordering**: Unlike some other parser types that may require
+conflict resolution, PEG parsers operate in a deterministic way, always trying
+the first alternative in order. Therefore it is advisable to place alternatives
+in their expected order of frequency. This way, the more common cases will be
+handled faster.
+
+**Linear Time Complexity**: In general, PEG parsers exhibit super-linear time
+complexity with respect to the size of the input. Using caching techniques
+alleviates some of the performance penalties. However, this can be affected by
+the specifics of the grammar. For example, excessive use of repetitions '*'
+(zero or more) or '+' (one or more) operators in the grammar can lead to
+super-linear performance because these operators may require the parser to
+repeatedly attempt the same parsing operation.
+
+**Memory Use**: Due to the use of caching this type of parsers can consume some
+additional memory.
+
+### Benchmarks
+
+...
+
+## Warnings
+
+- **Incomplete left-recursion support**
+
+  *das-PEG* currently provides limited support for indirectly left-recursive grammars.
+
+- **Stack overflow**
+
+  By default in interpreted mode the stack for a das program is quite small
+  (16KB) and can easily overflow. Specify the bigger value in the beginning of
+  the module to overcome this issue. `options stack = 1000000000` - 1MB
+  extremely big.
+
+- **Stateful actions are prohibited**
+
+  Actions should avoid any side effects or dependence on external state. This
+  is because the sequence of action execution is not fixed and could vary due
+  to the **backtracking nature** of PEG parsers.
+
+## Inspiration
+
+The creation of the das PEG parser generator was substantially influenced by
+Guido van Rossum's work on Python's pegen library. His series of detailed
+articles on the internal mechanics of PEG parser generators provided a
+blueprint for generating high-performance parsers, which the das PEG library
+strives to replicate.