Parsing expression grammar
A parsing expression grammar, or PEG, is a type of analytic formal grammar that describes a formal language in terms of a set of rules for recognizing strings in the language. A parsing expression grammar essentially represents a recursive descent parser in a pure schematic form that only expresses syntax and is independent of the way an actual parser might be implemented or what it might be used for.
Definition
Formally, a parsing expression grammar consists of:
- A finite set N of nonterminal symbols.
- A finite set Σ of terminal symbols that is disjoint from N.
- A finite set P of parsing rules.
Each parsing rule in P has the form A ← e, where A is a nonterminal and e is a parsing expression. A parsing expression is a hierarchical expression similar to a regular expression, which is constructed in the following fashion:
- An atomic parsing expression consists of:
- any terminal,
- any nonterminal, or
- the empty string ε.
- Given any existing parsing expressions e, e1, and e2, a new parsing expression can be constructed using the following operators:
- Sequence: e1 e2
- Ordered choice: e1 / e2
- Zero-or-more: e*
- One-or-more: e+
- Optional: e?
- And-predicate: &e
- Not-predicate: !e
Unlike in a context-free grammar or other generative grammars, in a parsing expression grammar there must be exactly one rule in the grammer having a given nonterminal on its left-hand side. That is, rules act as definitions in a PEG, and each nonterminal must have one and only one definition.
Interpretation of parsing expressions
Each nonterminal in a parsing expression grammar essentially represents a parsing function in a recursive descent parser, and the corresponding parsing expression represents the "code" comprising the function. Each parsing function conceptually takes an input string as its argument, and yields one of the following results:
- success, in which the function may optionally move forward or "consume" one or more characters of the input string supplied to it, or
- failure, in which case no input is consumed.
A nonterminal may succeed without actually consuming any input, and this is considered an outcome distinct from failure.
An atomic parsing expression consisting of a single terminal succeeds if the first character of the input string matches that terminal, and in that case consumes the input character; otherwise the expression yields a failure result. An atomic parsing expression consisting of the empty string always trivially succeeds without consuming any input. An atomic parsing expression consisting of a nonterminal A represents a recursive call to the nonterminal-function A.
The sequence operator e1 e2 first invokes e1, and if e1 succeeds, subsequently invokes e2 on the remainder of the input string left unconsumed by e1, and returns the result. If either e1 or e2 fails, then the sequence expression e1 e2 fails.
The choice operator e1 / e2 first invokes e1, and if e1 succeeds, returns its result immediately. Otherwise, if e1 fails, then the choice operator backtracks to the original input position at which it invoked e1, but then calls e2 instead, returning e2's result.
The zero-or-more, one-or-more, and optional operators consume zero or more, one or more, or zero or one consecutive repetitions of their sub-expression e, respectively. Unlike in regular expressions or context-free grammars, however, these operators always behave greedily, consuming as much input as possible. For example, the expression a* will always consume as many a's as are consecutively available in the input string, and the expression (a* a) will always fail because the first part (a*) will never leave any a's for the second part to match.
Finally, the and-predicate and not-predicate operators implement syntactic predicates. The expression &e invokes the sub-expression e, and then succeeds if e succeeds and fails if e fails, but in either case never consumes any input. Conversely, the expression !e succeeds if e fails and fails if e succeeds, again consuming no input in either case. Because these can use an arbitrarily complex sub-expression e to "look ahead" into the input string without actually consuming it, they provide a powerful syntactic lookahead and disambiguation facility.
Examples
The parsing expression (a/b)* matches and consumes an arbitrary-length sequence of a's and b's. The rule S ← a S b describes the simple context-free "matching language" {anbn}. The following parsing expression grammar describes the classic non-context-free language {anbncn} where n > 0:
S ← &(A b) a* B !c
A ← (a A b)?
B ← (b B c)?
The following recursive rule matches standard C-style if/then/else statements in such a way that the optional "else" clause always binds to the innermost "if", because of the implicit prioritization of the '/' operator. (In a context-free grammar, this construct yields the classic dangling else ambiguity.)
S ← if C then S else S / if C then S
The parsing expression foo &(bar) matches and consumes the text "foo" but only if it is followed by the text "bar". The parsing expression foo !bar matches the text "foo" but only if it is not followed by the text "bar". The expression !(a+ b) a matches a single "a" but only if it is not the first in an arbitrarily-long sequence of a's followed by a b.
The following recursive rule matches Pascal-style nested comment syntax, (* which can (* nest *) like this *):
C ← "(*" I* "*)" I ← C / (!"(*" Z) Z ← any single character
Implementing parsers from parsing expression grammars
Any parsing expression grammar can of course be converted directly into a recursive descent parser. Due to the unlimited lookahead capability that the grammar formalism provides, however, the resulting parser could exhibit exponential time performance in the worst case.
By memoizing the results of intermediate parsing steps and ensuring that each parsing function is only invoked at most once at a given input position, however, it is possible to convert any parsing expression grammar into a packrat parser, which always runs in linear time at the cost of substantially greater storage space requirements.
It is also possible to build LL parsers and LR parsers from parsing expression grammars, but the unlimited lookahead capability of the grammar formalism is lost in this case.