Bottom-up parsing
In computer science, bottom-up parsing is any method of discerning the hierarchical structure and meaning of a linear input text, by identifying and processing its lowest-level small details first, before its mid-level structures, and leaving the highest-level overall structure to last.
The bottom-up name comes from the concept of a parse tree, in which the most detailed parts are at the bushy bottom of the (upside-down) tree, and larger structures composed from them are in successively higher layers, until at the top or "root" of the tree a single unit describes the entire input stream. A bottom-up parse discovers and processes that tree starting from the bottom left end, and incrementally works its way upwards and rightwards. A parser may act on the structure hierarchy's low, mid, and highest levels without ever creating an actual data tree; the tree is then merely implicit in the parser's actions.
The opposite of this are top-down parsing methods, in which the input's overall structure is decided (or guessed at) first, before dealing with mid-level parts, leaving the lowest-level small details to last. A top-down parse discovers and processes the hierarchical tree starting from the top, and incrementally works its way downwards and rightwards.
Bottom-up parsing lazily waits until it has scanned and parsed all parts of some construct before committing to what the combined construct is. Top-down parsing eagerly decides what a construct is much earlier, when it has only scanned the leftmost symbol of that construct and has not yet parsed any of its parts. If a language grammar has multiple rules that may start with the same leftmost symbols but have different endings, then that grammar can be efficiently handled by a deterministic bottom-up parse but cannot be handled top-down without guesswork and backtracking. So bottom-up parsers handle a somewhat larger range of computer language grammars than do deterministic top-down parsers.
Generic Parsing Example
Besides computer languages, bottom-up parsing can also be applied to human natural languages. The parse tree of a sentence in English tends to be quite shallow. The difference between bottom-up and top-down methods is hard to see with shallow examples. So the somewhat-deeper example here will use a computer language.
Consider the Java program fragment:
- A = B + C*2; E = 1;
To avoid having to manually write lots of parentheses, the groupings of mathematical subexpressions are implied by the "precedence" of the operators. The * operator is given grouping precedence or priority over +, which has grouping precedence over assignment (=), which has precedence over statement separators (;). The above is an abbreviation for and is equivalent to
- A = (B + (C*2)); E = 1;
- (Need a pretty picture of the tree here.)
A bottom-up parser will locate and process the pieces of the original example in the following order:
- variable A
- variable B
- fetch B
- expression B
- variable C
- fetch C
- expression C
- constant 2
- expression 2
- expression C*2
- expression B + (C*2)
- assignment, storing value of B + (C*2) at address of A
- one statement, the assignment A = B + (C*2)
- variable E
- constant 1
- expression 1
- assignment, storing value 1 at address of E
- one statement, the assignment E = 1
- statement sequence, of two statements
Whereas a top-down parser will locate and process the pieces of this example in the order:
- statement sequence, of an unknown number of statements
- statement sequence's next (first) statement
- assignment action, of unknown expression value e1 to unknown destination
- variable A is the destination
- e1's first part is unknown subexpression e2
- e2 is fetched value of something
- variable B, fetched
- e2 is fetched value of something
- e1 is a sum of e2 (B) and unknown sub-expression e3
- e3's first part is unknown subexpression e4
- e4 is fetched value of something
- variable C, fetched
- e4 is fetched value of something
- e3 is a product of e4 (C) and unknown sub-expression e5
- e5's first (and only) part is unknown subexpression e6
- e6 is the constant 2
- e5 is 2
- e5's first (and only) part is unknown subexpression e6
- e3 is C*2
- e3's first part is unknown subexpression e4
- e1 is B + (C*2)
- assignment, storing value of B + (C*2) at address of A
- assignment action, of unknown expression value e1 to unknown destination
- statement sequence's next (second) statement
- assignment action, of unknown expression value e7 to unknown destination
- variable E is the destination
- expression e7 with unknown parts and unknown value
- e7's first (and only) part is unknown subexpression e8
- e8 is the constant 1
- e7 is 1
- e7's first (and only) part is unknown subexpression e8
- assignment, storing value 1 at address of E
- assignment action, of unknown expression value e7 to unknown destination
The final parse trees (if written out) are similar. The details will vary with different grammars for the same language, and with different bottom-up or top-down methods. But they will all be close to the above patterns.
In both bottom-up and top-down, the conclusive action steps are in bottom-up order and are very close to the output code desired for stack machines such as the Java Virtual Machine.
Type of bottom-up parsers
It is common for bottom-up parsers to take the form of general parsing engines, which can either parse, or generate a parser for, a specific programming language given a specification of its grammar. Perhaps the most well known generalized parser generators are YACC and GNU bison.
The common classes of bottom-up parsers are:
- LR parser
- LR(0) - No lookahead symbol
- SLR(1) - Simple with one lookahead symbol. Used by XPL.
- LALR(1) - Lookahead bottom up, not as powerful as full LR(1) but simpler to implement. YACC deals with this kind of grammar.
- LR(1) - Most general grammar, but most complex to implement.
- LR() - (where is a positive integer) indicates an LR parser with lookahead symbols; while grammars can be designed that require more than 1 lookahead, practical grammars try to avoid this because increasing can theoretically require exponentially more code and data space (in practice, this may not be as bad). Also, the class of LR() languages is the same as that of LR(1) languages.
- Precedence parsers
- Simple precedence parser
- Operator-precedence parser
- Extended precedence parser
- Mixed Strategy Precedence parser, used by the original version of XPL.
Bottom-up parsing can also be done by backtracking.
Shift-reduce parsers
The most common bottom-up parsers are the shift-reduce parsers. These parsers examine the input tokens and either shift (push) them onto a stack or reduce elements at the top of the stack, replacing a right-hand side by a left-hand side.
Action table
Often an action (or parse) table is constructed which helps the parser determine what to do next. The following is a description of what can be held in an action table.
Actions
- Shift - push token onto stack
- Reduce - remove handle from stack and push on corresponding nonterminal
- Accept - recognize sentence when stack contains only the distinguished symbol and input is empty
- Error - happens when none of the above is possible; means original input was not a sentence
Shift and reduce
A shift-reduce parser uses a stack to hold the grammar symbols while awaiting reduction. During the operation of the parser, symbols from the input are shifted onto the stack. If a prefix of the symbols on top of the stack matches the right-hand side (RHS) of a grammar rule which is the correct rule to use within the current context, then the parser reduces the RHS of the rule to its left-hand side (LHS), replacing the RHS symbols on top of the stack with the nonterminal occurring on the LHS of the rule. This shift-reduce process continues until the parser terminates, reporting either success or failure. It terminates with success when the input is legal and is accepted by the parser. It terminates with failure if an error is detected in the input.
The parser is a stack automaton which is in one of several discrete states. In reality, in the case of LR parsing, the parse stack contains states, rather than grammar symbols. However, since each state corresponds to a unique grammar symbol, the state stack can be mapped onto the grammar symbol stack mentioned earlier.
Algorithm: Shift-reduce parsing
- Start with the sentence to be parsed as the initial sentential form
- Until the sentential form is the start symbol do:
- Scan through the input until we recognise something that corresponds to the RHS of one of the production rules (this is called a handle)
- Apply a production rule in reverse; i.e., replace the RHS of the rule which appears in the sentential form with the LHS of the rule (an action known as a reduction)
In step 2.1 above we are "shifting" the input symbols to one side as we move through them; hence a parser which operates by repeatedly applying steps 2.1 and 2.2 above is known as a shift-reduce parser.
A shift-reduce parser is most commonly implemented using a stack, where we proceed as follows:
- start with an empty stack
- a "shift" action corresponds to pushing the current input symbol onto the stack
- a "reduce" action occurs when we have a handle on top of the stack. To perform the reduction, we pop the handle off the stack and replace it with the nonterminal on the LHS of the corresponding rule.
Example
Take the grammar:
Sentence --> NounPhrase VerbPhrase NounPhrase --> Art Noun VerbPhrase --> Verb | Adverb Verb Art --> the | a | ... Verb --> jumps | sings | ... Noun --> dog | cat | ...
And the input:
- the dog jumps
Then the bottom up parsing is:
Stack Input Sequence () (the dog jumps) (the) (dog jumps) SHIFT word onto stack (Art) (dog jumps) REDUCE using grammar rule (Art dog) (jumps) SHIFT.. (Art Noun) (jumps) REDUCE.. (NounPhrase) (jumps) REDUCE (NounPhrase jumps) () SHIFT (NounPhrase Verb) () REDUCE (NounPhrase VerbPhrase)() REDUCE (Sentence) () SUCCESS
See also
External links
- Parsing Simulator This simulator is used to generate examples of shift-reduce parsing
- An example of shift-reduce parsing (which is a type of bottom up parsing), with a small grammar, state diagram, and C language code to implement the parser
- Course notes on shift reduce parsing
- A good non-technical tutorial in the context of natural (human) languages [dead link] (Instead use an archived version of the page)
- A discussion of shift-reduce conflicts in bottom up parsers. A knowledgeable but technical article.
- Yet another bottom-up parsing illustration