Add PEG module documentation (initial version)
otakubeam committed Jul 8, 2023
1 parent e529cca commit 846aa58
Showing 1 changed file with 194 additions and 0 deletions.
194 changes: 194 additions & 0 deletions modules/dasPEG/readme.md
@@ -0,0 +1,194 @@
# The *das-PEG* parser generator library

The ***das-PEG*** (Parsing Expression Grammar) parser generator library is a
versatile tool that helps you design deterministic parsers with unlimited
lookahead capability. A PEG is a type of formal grammar that describes a
language in terms of an _ordered_ set of rules.

## How does it work?

The *das-PEG* parser generator interprets your grammar and constructs a
recursive descent parser from it. The grammar is declared in a specific
das-like syntax inside a `parse` macro. PEG handles ambiguity with
**prioritized choice**: alternatives are tried in order and the first one that
matches is always chosen. PEG parsers are also sometimes referred to as
"scannerless" because they integrate lexing and parsing into a single activity.
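
To make prioritized choice concrete, here is a minimal sketch in Python. It is not das-PEG code and does not reflect how the library is implemented; the helper names `literal` and `choice` are made up for this illustration. It only shows that the first matching alternative wins, so the order of alternatives changes what is parsed.

```python
# Language-neutral sketch of PEG prioritized choice; not the das-PEG API.
# A parser here is a function (text, pos) -> end position, or None on failure.

def literal(s):
    """Match the exact string s at the current position."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def choice(*alternatives):
    """Try alternatives in order and commit to the first one that matches."""
    def parse(text, pos):
        for alt in alternatives:
            end = alt(text, pos)
            if end is not None:
                return end
        return None
    return parse

# Order matters: "if" shadows "ifdef" when it is listed first.
short_first = choice(literal("if"), literal("ifdef"))
long_first = choice(literal("ifdef"), literal("if"))

print(short_first("ifdef", 0))  # 2
print(long_first("ifdef", 0))   # 5
```

In a grammar where one alternative is a prefix of another, listing the longer alternative first is usually what you want.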

### Example

**The grammar** is a series of rules; each rule is introduced with `var
<rule-name>: <rule-type>`. **The types are significant**, as they are used for
binding variables.

Following each rule is a series of `rule(...)` calls: the alternatives that
the rule will try to match. For example, consider this simplified arithmetic
parser:

```
parse calc
    var add: int
        rule(add as a, "+", mul as m) <|
            return a + m // <- a & m are int variables in this context
                         // the action should also return an int
        rule(add as a, "-", mul as m) <|
            return a - m
        rule(mul as m) <|
            return m
    var mul: int
        ...
```

The rules for plus and minus are tried in order. If neither of them matches,
the last rule is tried.

**Caching.** The `a` in `add as a` is not parsed several times. The parser
keeps a cache for every rule and reuses its results when the same rule is tried
again at the same input position. This technique is known as *packrat parsing.*
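
The caching idea can be sketched in a few lines of Python. This is an illustration of packrat memoization in general, not das-PEG's internal code; the class `Packrat` and its methods are hypothetical names. Results are stored under the key (rule name, input position), so a rule that was already tried at a position is not run again when a later alternative asks for it.

```python
# Sketch of packrat caching; hypothetical names, not das-PEG internals.

class Packrat:
    def __init__(self, text):
        self.text = text
        self.cache = {}  # (rule name, position) -> end position or None
        self.calls = 0   # how many times a rule body actually ran

    def memo(self, name, pos, body):
        key = (name, pos)
        if key not in self.cache:
            self.calls += 1
            self.cache[key] = body(pos)
        return self.cache[key]

    def number(self, pos):
        def body(p):
            end = p
            while end < len(self.text) and self.text[end].isdigit():
                end += 1
            return end if end > p else None
        return self.memo("number", pos, body)

    def expect(self, s, pos):
        if pos is not None and self.text.startswith(s, pos):
            return pos + len(s)
        return None

    def expr(self, pos):
        # expr <- number "+" number / number "-" number / number
        for op in ("+", "-"):
            end = self.expect(op, self.number(pos))
            if end is not None:
                end = self.number(end)
                if end is not None:
                    return end
        return self.number(pos)

p = Packrat("12-3")
print(p.expr(0), p.calls)  # 4 2
```

The first alternative parses `12` and then fails on `+`. The second alternative reuses the cached result for `number` at position 0, so the rule body runs only twice in total: once for `12` and once for `3`.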

### Built-in rules

*das-PEG* provides a number of built-in rules designed to increase the speed
and convenience of using the parser.

- `string_` - stream of characters enclosed inside `"..."`
- `double_` - floating point number
- `number` - decimal integer
- `WS` - *whitespace*
- `EOF` - *end of file*

The basic building blocks of more complicated rules are:

- **Literal match**. A string literal enclosed in `""` is matched against the input as is
- **Range match**. Ranges are specified via the `range('1', '9')` syntax
- **Text extraction**. Text extraction is specified with the interpolation
syntax and is used to match the rule and extract its corresponding `string`:
`rule("{name}", "=", Value as v) <| $ { variables[name] = v; }`
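
The three building blocks can be illustrated with a small Python sketch. It is not das-PEG code; `literal`, `char_range`, `one_or_more`, `extract` and `assignment` are hypothetical helpers written for this example. The sketch assumes that text extraction means "match a rule and hand the matched substring to the action", which is how the example above uses `name`.

```python
# Sketch of literal match, range match and text extraction; not the das-PEG API.

def literal(s):
    """Literal match: succeed if s appears at pos."""
    def parse(text, pos):
        return pos + len(s) if text.startswith(s, pos) else None
    return parse

def char_range(lo, hi):
    """Range match: succeed if the next character is within lo..hi."""
    def parse(text, pos):
        if pos < len(text) and lo <= text[pos] <= hi:
            return pos + 1
        return None
    return parse

def one_or_more(rule):
    def parse(text, pos):
        end = rule(text, pos)
        if end is None:
            return None
        while True:
            nxt = rule(text, end)
            if nxt is None:
                return end
            end = nxt
    return parse

def extract(rule, text, pos):
    """Text extraction: match rule and return (end, matched text), or None."""
    end = rule(text, pos)
    return None if end is None else (end, text[pos:end])

name = one_or_more(char_range("a", "z"))
digits = one_or_more(char_range("0", "9"))

def assignment(text, variables):
    """assignment <- name "=" digits, storing the value under the extracted name."""
    got = extract(name, text, 0)
    if got is None:
        return False
    pos, ident = got
    pos = literal("=")(text, pos)
    if pos is None:
        return False
    got = extract(digits, text, pos)
    if got is None or got[0] != len(text):
        return False
    variables[ident] = int(got[1])
    return True

variables = {}
print(assignment("x=42", variables), variables)  # True {'x': 42}
```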

### Binding to the rules

If you have the rule `A: type` you can bind its result with the `A as <name>`
syntax. This allows you to access the name inside the action block with its
type being the type of rule `A`.

Modifiers can alter the meaning of binding constructs, for example by changing
the type of the bound value to an array.

### Rule Modifiers

There are several modifiers that can be applied to the rules.

- **Repetition** (__*__)

This modifier matches the rule *zero or more times* and returns the results
in an array. Take, for example, the array rule from a JSON parser:

```
var ElementList: array<JsonValue?>
    rule(*CommaSeparatedElement as els, Element as last) <|
        print("Parsed element list\n")
        els |> push <| last
        return <- els
var CommaSeparatedElement: JsonValue?
    ...
```

The array contents are represented as `ElementList`: a list of comma-separated
elements ending with the last `Element` (without a comma).

Here the rule `*CommaSeparatedElement as els` binds several elements and
produces `array<JsonValue?>`, because `CommaSeparatedElement` is a
`JsonValue?`.

- **Certain repetition** (**+**) - like simple *repetition* but matches *one or more times*

- **Optional** (**MB**)

Matches the rule *zero or one time*. Returns its result inside an array.

```
var Array: array<JsonValue?>
    // Optional element list;
    // If items are present (not []) then no trailing comma is allowed
    rule("[", WS, MB(ElementList) as list, "]", WS) <|
        return <- [[array<JsonValue?>]] if list |> empty
        return <- list[0] // Take the value from optional
```

This shows how to get the result: if the optional rule was parsed, the array is populated with a single value; otherwise it is empty.

- **Positive lookahead** (**PEEK**) matches the rule but does not advance the index through the input.

- **Negative lookahead** (**!**) peeks at the next rule and checks that it does not match, without advancing through the input.
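
The semantics of the modifiers above can be summarized with a Python sketch. This is not das-PEG code: `star`, `plus`, `maybe`, `peek` and `not_` are hypothetical stand-ins for `*`, `+`, `MB`, `PEEK` and `!`. Each parser returns a `(value, end)` pair or `None` on failure, so you can see both what gets bound and whether the input position advances.

```python
# Sketch of the rule modifiers as parser combinators; not the das-PEG API.

def char_range(lo, hi):
    def parse(text, pos):
        if pos < len(text) and lo <= text[pos] <= hi:
            return text[pos], pos + 1
        return None
    return parse

def star(rule):
    """Zero or more matches; the bound value is a list of results."""
    def parse(text, pos):
        values = []
        while True:
            got = rule(text, pos)
            if got is None or got[1] == pos:  # stop on failure or empty match
                return values, pos
            values.append(got[0])
            pos = got[1]
    return parse

def plus(rule):
    """One or more matches; fails if there is not at least one."""
    def parse(text, pos):
        values, end = star(rule)(text, pos)
        return (values, end) if values else None
    return parse

def maybe(rule):
    """Zero or one match; the bound value is a list with 0 or 1 items."""
    def parse(text, pos):
        got = rule(text, pos)
        return ([], pos) if got is None else ([got[0]], got[1])
    return parse

def peek(rule):
    """Positive lookahead: succeed if rule matches, but consume nothing."""
    def parse(text, pos):
        got = rule(text, pos)
        return None if got is None else (got[0], pos)
    return parse

def not_(rule):
    """Negative lookahead: succeed only if rule fails; consume nothing."""
    def parse(text, pos):
        return (None, pos) if rule(text, pos) is None else None
    return parse

digit = char_range("0", "9")
print(star(digit)("12a", 0))  # (['1', '2'], 2)
print(plus(digit)("a", 0))    # None
print(maybe(digit)("a", 0))   # ([], 0)
print(peek(digit)("7", 0))    # ('7', 0)
print(not_(digit)("a", 0))    # (None, 0)
```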


### Available generator options

The options are specified anywhere in the body of the parse macro with the `options(...)` syntax.

```
parse multilingual
    options(utf-8)
    ...
```

- `utf-8` - enables utf-8 decoding support
- `trace` - enables line info tracking and failure reporting

## Performance

PEG parsers, including the *das-PEG* parser generator, have certain performance
characteristics that can impact the efficiency of your parsing tasks.

**Importance of Ordering**: Unlike some other parser types that may require
conflict resolution, PEG parsers operate in a deterministic way, always trying
the first alternative in order. Therefore it is advisable to place alternatives
in their expected order of frequency. This way, the more common cases will be
handled faster.

**Linear Time Complexity**: With caching (packrat parsing), a PEG parser
generally runs in time linear in the size of the input; without caching,
backtracking can make parsing super-linear. The actual behavior still depends
on the specifics of the grammar. For example, excessive use of the repetition
operators `*` (zero or more) or `+` (one or more) can lead to super-linear
performance because these operators may require the parser to repeatedly
attempt the same parsing operation.
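
The effect of caching on running time can be seen in a small Python experiment. This is a generic illustration, not a das-PEG benchmark; the grammar and the function `count_calls` are made up for this sketch. The grammar deliberately starts every alternative of `E` with `T`, so without a cache each nesting level re-parses its contents three times.

```python
# Counting rule invocations with and without a packrat cache.
# Hypothetical grammar for illustration:
#   E <- T "+" E / T "-" E / T
#   T <- "(" E ")" / "a"

def count_calls(text, use_cache):
    cache = {}
    calls = 0

    def rule(name, pos, body):
        nonlocal calls
        key = (name, pos)
        if use_cache and key in cache:
            return cache[key]
        calls += 1
        result = body(pos)
        if use_cache:
            cache[key] = result
        return result

    def expect(ch, pos):
        if pos is not None and text.startswith(ch, pos):
            return pos + 1
        return None

    def term_body(pos):
        inner = expect("(", pos)
        if inner is not None:
            end = expect(")", expr(inner))
            if end is not None:
                return end
        return expect("a", pos)

    def expr_body(pos):
        for op in "+-":
            after_op = expect(op, term(pos))
            if after_op is not None:
                end = expr(after_op)
                if end is not None:
                    return end
        return term(pos)

    def term(pos):
        return rule("T", pos, term_body)

    def expr(pos):
        return rule("E", pos, expr_body)

    matched = expr(0) == len(text)
    return matched, calls

nested = "(" * 8 + "a" + ")" * 8
print(count_calls(nested, use_cache=True))   # (True, 18)
print(count_calls(nested, use_cache=False))  # (True, 39364)
```

With the cache, each rule runs once per input position it is tried at. Without it, the number of invocations roughly triples with every extra level of nesting.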

**Memory Use**: Because parse results are cached, this type of parser consumes
additional memory, which grows with the size of the input.

### Benchmarks

...

## Warnings

- **Incomplete left-recursion support**

*das-PEG* currently provides limited support for indirectly left-recursive grammars.

- **Stack overflow**

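By default in interpreted mode the stack for a das program is quite small
(16KB) and can easily overflow. Specify a larger value at the beginning of
the module to overcome this issue, for example `options stack = 1000000000`.
Note that, assuming the value is given in bytes, 1000000000 is about 1GB, which
is extremely large; about 1MB would be `options stack = 1000000`.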

- **Stateful actions are prohibited**

Actions should avoid any side effects or dependence on external state. This
is because the sequence of action execution is not fixed and could vary due
to the **backtracking nature** of PEG parsers.
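
The problem with stateful actions can be reproduced in a short Python sketch. It is a generic illustration of backtracking, not das-PEG code, and the names `number`, `item` and `item_pure` are hypothetical. It assumes an action runs as soon as its rule matches, even if the enclosing alternative fails later; whether and when das-PEG runs a given action may differ, for instance because of caching.

```python
# Side effects in actions can fire for alternatives that are later abandoned.

def number(text, pos, on_match=None):
    end = pos
    while end < len(text) and text[end].isdigit():
        end += 1
    if end == pos:
        return None
    if on_match is not None:
        on_match(text[pos:end])  # the "action" runs as soon as the rule matches
    return end

def item(text, seen):
    """item <- number "!" / number "?" ; the action records every number."""
    for suffix in ("!", "?"):
        end = number(text, 0, on_match=seen.append)
        if end is not None and text.startswith(suffix, end):
            return end + 1
    return None

def item_pure(text):
    """Same grammar, but the action only returns a value."""
    for suffix in ("!", "?"):
        end = number(text, 0)
        if end is not None and text.startswith(suffix, end):
            return text[:end]
    return None

seen = []
item("42?", seen)
print(seen)               # ['42', '42']
print(item_pure("42?"))   # 42
```

The stateful version records `42` twice because the first alternative matched the number before failing on `!`. The pure version builds its result only from returned values, so backtracking leaves no trace.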

## Inspiration

The creation of the *das-PEG* parser generator was substantially influenced by
Guido van Rossum's work on Python's pegen library. His series of detailed
articles on the internal mechanics of PEG parser generators provided a
blueprint for generating high-performance parsers, which the *das-PEG* library
strives to follow.
