Bottom-up parsing
In computer science, bottom-up parsing is any method of discerning the hierarchical structure and meaning of a linear input text, by identifying and processing its lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last.
The bottom-up name comes from the concept of a parse tree, in which the most detailed parts are at the bushy bottom of the (upside-down) tree, and larger structures composed from them are in successively higher layers, until at the top or "root" of the tree a single unit describes the entire input stream. A bottom-up parse discovers and processes that tree starting from the bottom left end, and incrementally works its way upwards and rightwards. A parser may act on the structure hierarchy's low, mid, and highest levels without ever creating an actual data tree; the tree is then merely implicit in the parser's actions.
The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parse discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards.
Bottom-up parsing lazily waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers.
Generic Parsing Example
Besides computer languages, bottom-up parsing can also be applied to human natural languages. The parse tree of a sentence in English tends to be quite shallow. The difference between bottom-up and top-down methods is hard to see with shallow examples. So the somewhat-deeper example here will use a computer language.
Consider the Java program fragment:
- A = B + C*2; E = 1;
- (Need a pretty picture of the tree here.)
A bottom-up parser will locate and process the pieces of the example in the following order:
- variable A
- variable B
- value B
- product expression B
- sum expression B
- product expression B
- value B
- variable C
- value C
- product expression C
- value C
- constant 2
- value 2
- product expression C*2
- sum expression B + (C*2)
- assignment, storing value of B + (C*2) into A
- a statement, the assignment A = B + (C*2)
- statement sequence, of one statement
- a statement, the assignment A = B + (C*2)
- assignment, storing value of B + (C*2) into A
- sum expression B + (C*2)
- product expression C*2
- value 2
- variable E
- constant 1
- value 1
- product expression 1
- sum expression 1
- assignment, storing value 1 into E
- a statement, the assignment E = 1
- assignment, storing value 1 into E
- sum expression 1
- product expression 1
- value 1
- statement sequence, of two statements
Whereas a top-down parser will locate and process the pieces of this example in the order:
- statement sequence, of an unknown number of statements
- statement sequence's next (first) statement
- assignment action, some destination gets sum expression value e1
- variable A is the destination
- sum expression e1's first part is product subexpression e2
- product expression e2's first part is value e3
- value e3 is variable B
- product expression e2's first part is value e3
- e1 is a sum of e2 (B) and product subexpression e4
- product expression e4's first part is value e5
- value e5 is variable C
- e4 is a product of e5 (C) and value e6
- value e6 is the constant 2
- product expression e4's first part is value e5
- assignment action, some destination gets sum expression value e1
- statement sequence's next (second) statement
- assignment action, some destination gets sum expression value e7
- variable E is the destination
- sum expression e7's first part is product subexpression e8
- product expression e8's first part is value e9
- e9 is the constant 1
- product expression e8's first part is value e9
- assignment action, some destination gets sum expression value e7
The final parse trees (if written out) are similar. The details will vary with different grammars for the same language. But they will be close to the above patterns.
Type of bottom-up parsers
It is common for bottom-up parsers to take the form of general parsing engines, which can either parse, or generate a parser for, a specific programming language given a specification of its grammar. Perhaps the most well known generalized parser generators are YACC and GNU bison.
The common classes of bottom-up parsers are:
- LR parser
- LR(0) - No lookahead symbol
- SLR(1) - Simple with one lookahead symbol. Used by XPL.
- LALR(1) - Lookahead bottom up, not as powerful as full LR(1) but simpler to implement. YACC deals with this kind of grammar.
- LR(1) - Most general grammar, but most complex to implement.
- LR() - (where is a positive integer) indicates an LR parser with lookahead symbols; while grammars can be designed that require more than 1 lookahead, practical grammars try to avoid this because increasing can theoretically require exponentially more code and data space (in practice, this may not be as bad). Also, the class of LR() languages is the same as that of LR(1) languages.
- Precedence parsers
- Simple precedence parser
- Operator-precedence parser
- Extended precedence parser
- Mixed Strategy Precedence parser, used by the original version of XPL.
Bottom-up parsing can also be done by backtracking.
Shift-reduce parsers
The most common bottom-up parsers are the shift-reduce parsers. These parsers examine the input tokens and either shift (push) them onto a stack or reduce elements at the top of the stack, replacing a right-hand side by a left-hand side.
Action table
Often an action (or parse) table is constructed which helps the parser determine what to do next. The following is a description of what can be held in an action table.
Actions
- Shift - push token onto stack
- Reduce - remove handle from stack and push on corresponding nonterminal
- Accept - recognize sentence when stack contains only the distinguished symbol and input is empty
- Error - happens when none of the above is possible; means original input was not a sentence
Shift and reduce
A shift-reduce parser uses a stack to hold the grammar symbols while awaiting reduction. During the operation of the parser, symbols from the input are shifted onto the stack. If a prefix of the symbols on top of the stack matches the right-hand side (RHS) of a grammar rule which is the correct rule to use within the current context, then the parser reduces the RHS of the rule to its left-hand side (LHS), replacing the RHS symbols on top of the stack with the nonterminal occurring on the LHS of the rule. This shift-reduce process continues until the parser terminates, reporting either success or failure. It terminates with success when the input is legal and is accepted by the parser. It terminates with failure if an error is detected in the input.
The parser is a stack automaton which is in one of several discrete states. In reality, in the case of LR parsing, the parse stack contains states, rather than grammar symbols. However, since each state corresponds to a unique grammar symbol, the state stack can be mapped onto the grammar symbol stack mentioned earlier.
Algorithm: Shift-reduce parsing
- Start with the sentence to be parsed as the initial sentential form
- Until the sentential form is the start symbol do:
- Scan through the input until we recognise something that corresponds to the RHS of one of the production rules (this is called a handle)
- Apply a production rule in reverse; i.e., replace the RHS of the rule which appears in the sentential form with the LHS of the rule (an action known as a reduction)
In step 2.1 above we are "shifting" the input symbols to one side as we move through them; hence a parser which operates by repeatedly applying steps 2.1 and 2.2 above is known as a shift-reduce parser.
A shift-reduce parser is most commonly implemented using a stack, where we proceed as follows:
- start with an empty stack
- a "shift" action corresponds to pushing the current input symbol onto the stack
- a "reduce" action occurs when we have a handle on top of the stack. To perform the reduction, we pop the handle off the stack and replace it with the nonterminal on the LHS of the corresponding rule.
Example
Take the grammar:
Sentence --> NounPhrase VerbPhrase NounPhrase --> Art Noun VerbPhrase --> Verb | Adverb Verb Art --> the | a | ... Verb --> jumps | sings | ... Noun --> dog | cat | ...
And the input:
- the dog jumps
Then the bottom up parsing is:
Stack Input Sequence () (the dog jumps) (the) (dog jumps) SHIFT word onto stack (Art) (dog jumps) REDUCE using grammar rule (Art dog) (jumps) SHIFT.. (Art Noun) (jumps) REDUCE.. (NounPhrase) (jumps) REDUCE (NounPhrase jumps) () SHIFT (NounPhrase Verb) () REDUCE (NounPhrase VerbPhrase)() REDUCE (Sentence) () SUCCESS
See also
External links
- Parsing Simulator This simulator is used to generate examples of shift-reduce parsing
- An example of shift-reduce parsing (which is a type of bottom up parsing), with a small grammar, state diagram, and C language code to implement the parser
- Course notes on shift reduce parsing
- A good non-technical tutorial in the context of natural (human) languages [dead link] (Instead use an archived version of the page)
- A discussion of shift-reduce conflicts in bottom up parsers. A knowledgeable but technical article.
- Yet another bottom-up parsing illustration