Parsing Expression Grammar

A parsing expression grammar (PEG) is a type of formal grammar for unambiguous languages, devised by Bryan Ford. PEGs are generally used to parse computer languages, and they are particularly suited to this purpose because of their lack of ambiguity.

Each rule in a PEG has the general form

name ← expression

where the expression is a sequence of terminals, nonterminals, and EBNF-like tokens that specifies how to parse a string as the nonterminal named name.

Many expressions will be of the form

expression / expression / ...

which gives a list of options for parsing this expression. This instructs a parser to try parsing a string using the first option, and proceed to a subsequent option only if all previous options failed.

PEGs are similar to context-free grammars; the significant difference is the prioritization of choices. When recognizing a string using a PEG, you must always take the first available option. Thus, there will be at most one way to parse any given string.

PEGs are also similar to regular expressions as they are implemented within programming languages, because they match a string against an expression, deterministically.

Example

This is a PEG for mathematical formulas that use the basic four operations:

Value <- [0-9.]+ / '(' Expr ')'
Sum <- Product (('+' / '-') Product)*
Product <- Value (('*' / '/') Value)*
Expr <- Sum

Advantages

Any PEG can be parsed in linear time by using a packrat parser, which memoizes the results of matching each expression at each position in the string.

Parsers for languages expressed as a CFG, such as LR parsers, require a separate tokenization step to be done first, which breaks up the input based on the location of spaces, punctuation, etc. The tokenization is necessary because of the way these parsers use lookahead to parse CFGs that meet certain requirements in linear time. PEGs do not require tokenization to be a separate step, and tokenization rules can be written in the same way as any other grammar rule.

Many CFGs contain inherent ambiguities, even when they're intended to describe unambiguous languages. The "dangling else" problem in C, C++, and Java is one example. These problems are often resolved by applying a rule outside of the grammar. In a PEG, these ambiguities never arise, because of prioritization.

Disadvantages

PEGs cannot express left-recursive rules, where a rule refers to itself without moving forward in the string. For example, in the arithmetic grammar above, it would be tempting to move some rules around so that the precedence order of products and sums could be expressed in one line:

Value <- [0-9.]+ / '(' Expr ')'
Sum <- Expr (('+' / '-') Expr)*
Product <- Expr (('*' / '/') Expr)*
Expr <- Product / Sum / Value

The problem with this is that it says that to match an Expr, you need to first see if a Product matches there, and to match a Product, you need to first see if an Expr matches there. This is not possible.

However, left-recursive rules can always be rewritten to not be left-recursive. In the simplest case, for example, a left-recursive rule repeats a certain expression indefinitely, as in the CFG rule

 string-of-a -> string-of-a 'a' | 'a'

But this can be rewritten in a PEG using the plus operator:

 string-of-a <- 'a'+

PEG expressions

The syntax for specifying a PEG is based on EBNF. Specifically, a PEG expression is one of:

The empty string.
A terminal (a string literal or a range of characters).
A nonterminal (a reference to another rule).
A sequence of expressions: expr1 expr2...
A prioritized choice of expressions: expr1 / expr2 / ...
An expression followed by an operator:
- expr?, which first tries to match the expression, and otherwise matches the empty string.
- expr*, which matches 0 or more instances of the expression, with longer matches getting higher priority.
- expr+, which matches 1 or more instances of the expression, with longer matches getting higher priority.
&expr, which matches expr but consumes no input. (In other words, it matches the empty string if and only if expr matches at this point in the string.)
!expr, which matches the empty string, but only if expr does not match at this point in the string.
., which matches a single character.

A more formal description of the syntax of a PEG, including a PEG for parsing PEGs, appears in the paper that introduced PEGs.

Parsing Expression Languages vs. Context-Free Languages

The only context that is used in parsing a string with a PEG is the knowledge of whether an option with a higher priority succeeded. Thus, there are some context-sensitive languages that can be recognized by a PEG, but not by a CFG (by definition). On the other hand, any ambiguous language can only be recognized by a CFG, not a PEG.

PEG Parser Generators

Currently, there are two programs that can be used to create a PEG parser:

Pappy, a parser generator for Haskell.
Rats!, a parser generator for Java.

Both of these parser generators build semantic actions into the PEG syntax: after an expression, you can write a segment of code that specifies what value that expression should have. The alternative is to have the parser output an abstract syntax tree, and then write code that processes that tree to arrive at a value.

External Links

Parsing Expression Grammars: A Recognition-Based Syntactic Foundation
A slideshow presentation by Bryan Ford about PEGs
The PEG Board at the Institute for End User Computing, promoting the use of PEGs instead of regular expressions.
Pappy
Rats!