Scannerless parsing
![]() | This article has multiple issues. Please help improve it or discuss these issues on the talk page. (Learn how and when to remove these messages)
|
In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word level) and phrase level structure of the language.
Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, simple application-specific scripting languages, and Raku.
Advantages
- Only one metalanguage is needed
- Non-regular lexical structure is handled easily
- "Token classification" is unneeded which removes the need for design accommodations such as "the lexer hack" and language reserved words (such as "while" in C)
- Grammars can be compositional (can be merged without human intervention) [1]
Disadvantages
- Since the lexical scanning and syntactic parsing are combined, the resulting parser tends to be harder to understand and debug for more-complex languages
Implementations
- SGLR is a parser for the modular Syntax Definition Formalism SDF, and is part of the ASF+SDF Meta-Environment and the Stratego/XT program transformation system.
- JSGLR, a pure Java implementation of SGLR, also based on SDF.
- TXL supports character-level parsing.
- dparser generates ANSI C code for scannerless GLR parsers.
- Spirit allows for both scannerless and scanner-based parsing.
- SBP is a scannerless parser for boolean grammars (a superset of context-free grammars), written in Java.
- Laja is a two phase scannerless parser generator with support for mapping the grammar rules into objects, written in Java.
- The Raku Grammars feature of the general purpose programming language Raku.
- PyParsing is a scannerless parser written in pure Python.
- META II Has built in token parsers functions.
- TREE-META Like META II also is scannerless having builtin lexer functions.
- CWIC Compiler for Writing and Implementing Compilers. Has token rules as part of its language. Rules in CWIC were compiled into boolean functions returning success or failure.
Notes
- ^ This is because parsing at the character level makes the language recognized by the parser a single context-free language defined on characters, as opposed to a context-free language of sequences of strings in regular languages. Some lexerless parsers handle the entire class of context-free languages, which is closed under composition.
Further reading
- Visser, E. (August 1997). Scannerless Generalized-LR Parsing. The Netherlands: University of Amsterdam. CiteSeerX 10.1.1.37.7828.