Top Down vs. Bottom Up Parsing
A grammar describes the strings of tokens that are syntactically legal in a programming language (PL).
A recogniser simply accepts or rejects strings.
A generator produces sentences in the language described by the grammar.
A parser constructs a derivation or parse tree for a sentence (if possible).
Two common types of parsers: top-down and bottom-up.
UMBC
CSEE
The first one, with its left recursion, causes problems for top down parsers. For a given parsing technique, we may have to transform the grammar to work with it.
How hard is the parsing task?
Parsing an arbitrary context-free grammar is O(n^3), i.e., it can take time proportional to the cube of the number of symbols in the input. This is bad!
If we constrain the grammar somewhat, we can always parse in linear time. This is good!
Linear-time parsing
LL(n): Left-to-right scan, Leftmost derivation, look ahead at most n symbols.
LR(n): Left-to-right scan, Rightmost derivation, look ahead at most n symbols.
LL parsers recognize LL grammars and use a top-down strategy.
LR parsers recognize LR grammars and use a bottom-up strategy.
The simplest method is a full-backup, recursive descent parser.
It is often used for parsing simple languages.
Write recursive recognizers (subroutines) for each grammar rule.
If a rule succeeds, perform some action (e.g., build a tree node or emit code).
If a rule fails, return failure. The caller may try another choice or fail.
On failure, the parser backs up.
For a rule such as term -> factor { (* | /) factor }, we could use the following recursive descent parsing subprogram (this one is written in C):
void term() {
    factor();                  /* parse first factor */
    while (next_token == ast_code || next_token == slash_code) {
        lexical();             /* get next token */
        factor();              /* parse next factor */
    }
}
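The routine above relies on a lexer and a factor() routine that are not shown. The following is a minimal, self-contained sketch of how the pieces might fit together. Only term(), lexical(), factor(), next_token, ast_code and slash_code come from the slide; the other token codes, the one-character lexer, the simplification that a factor is just an integer literal, and the wrapper parse_term() are all assumptions made for illustration.

```c
#include <ctype.h>

/* ast_code and slash_code are named in the routine above; the other
   codes and all numeric values are assumptions for this sketch. */
enum { int_code, ast_code, slash_code, end_code, bad_code };

static const char *input;   /* remaining input text (hypothetical lexer state) */
static int next_token;      /* code of the current token */
static int error_count;     /* number of syntax errors seen so far */

/* Hypothetical lexer: integer literals, '*', '/', spaces, end of string. */
static void lexical(void) {
    while (*input == ' ') input++;
    if (*input == '\0') { next_token = end_code; return; }
    if (isdigit((unsigned char)*input)) {
        while (isdigit((unsigned char)*input)) input++;
        next_token = int_code;
    } else if (*input == '*') { input++; next_token = ast_code; }
    else if (*input == '/') { input++; next_token = slash_code; }
    else { input++; next_token = bad_code; }
}

/* factor -> integer literal (a simplification assumed for this sketch) */
static void factor(void) {
    if (next_token == int_code) lexical();
    else error_count++;
}

/* term -> factor { (* | /) factor } */
static void term(void) {
    factor();                  /* parse first factor */
    while (next_token == ast_code || next_token == slash_code) {
        lexical();             /* get next token */
        factor();              /* parse next factor */
    }
}

/* Returns 1 if text is a well-formed term, 0 otherwise. */
int parse_term(const char *text) {
    input = text;
    error_count = 0;
    lexical();                 /* prime next_token before parsing */
    term();
    return error_count == 0 && next_token == end_code;
}
```

With these assumptions, parse_term("2*3/4") accepts, while parse_term("2*") is rejected because the factor after the operator is missing.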
Left-recursive grammars
Some grammars cause problems for top down parsers. Top down parsers do not work with left-recursive grammars.
E.g., one with a rule like: E -> E + T
We can transform a left-recursive grammar into one which is not.
A grammar is left recursive if it has rules like
X -> X β
or if it has indirect left recursion, as in
X -> A β
A -> X
A top-down parser can limit backtracking if the grammar has only one rule per non-terminal.
The technique of rule factoring can be used to eliminate multiple rules for a non-terminal.
We can manually or automatically rewrite a grammar to remove left-recursion, making it suitable for a top-down parser.
Consider the left-recursive grammar
S -> S α | β
S generates all strings starting with a β and followed by any number of α's.
We can rewrite it using right recursion:
S -> β S'
S' -> α S' | ε
In general
S -> S α1 | ... | S αn | β1 | ... | βm
All strings derived from S start with one of β1, ..., βm and continue with several instances of α1, ..., αn.
Rewrite as
S -> β1 S' | ... | βm S'
S' -> α1 S' | ... | αn S' | ε
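To see why the rewrite matters for a top-down parser, here is a small sketch of a recognizer for the simplest case, S -> S α | β rewritten as S -> β S', S' -> α S' | ε. The terminals 'a' and 'b' standing in for α and β, and the function names, are assumptions chosen for illustration. A routine written directly from the left-recursive rule would call itself before consuming any input and never terminate; the right-recursive form consumes one character per call.

```c
#include <stddef.h>

/* S' -> a S' | (empty)
   Returns a pointer to the first character not consumed. */
static const char *parse_s_rest(const char *p) {
    if (*p == 'a') return parse_s_rest(p + 1);   /* S' -> a S' */
    return p;                                    /* S' -> empty */
}

/* S -> b S'
   Returns NULL if the input does not start with 'b'. */
static const char *parse_s(const char *p) {
    if (*p != 'b') return NULL;
    return parse_s_rest(p + 1);
}

/* Returns 1 if the whole string is derived from S, 0 otherwise. */
int accepts(const char *text) {
    const char *end = parse_s(text);
    return end != NULL && *end == '\0';
}
```

Under these assumptions the recognizer accepts "b" and "baaa" and rejects "ab", matching the description that every string starts with β and continues with zero or more α's.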
The grammar
S -> A α | δ
A -> S β
is also left-recursive because S ->+ S β α, where ->+ means "can be rewritten in one or more steps".
This indirect left recursion can also be automatically eliminated.
In practice, backtracking is eliminated by restricting the grammar, allowing us to successfully predict which rule to use.
Predictive Parser
A predictive parser uses information from the first terminal symbol of each expression to decide which production to use.
A predictive parser is also known as an LL(k) parser because it does a Left-to-right parse, a Leftmost derivation, and k-symbol lookahead.
Grammars in which it is possible to decide which production to use by examining only the first token (as in the previous example) are called LL(1).
LL(1) grammars are widely used in practice.
The syntax of a PL can be adjusted to enable it to be described with an LL(1) grammar.
Example: consider the grammar
S -> if E then S else S
S -> begin S L
S -> print E
L -> end
L -> ; S L
E -> num = num
An S expression starts with an IF, BEGIN, or PRINT token, an L expression starts with an END or a SEMICOLON token, and an E expression has only one production.
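Because the first token determines the production, each non-terminal of this grammar can become one C function that switches on the current token and never backs up. The following is a sketch under stated assumptions: the token names (T_IF and so on), the helper eat(), and the entry point parse_program() are hypothetical, and the input is taken to be an already-tokenized array ending in T_EOF rather than source text.

```c
/* Hypothetical token codes for the terminals of the grammar. */
enum token { T_IF, T_THEN, T_ELSE, T_BEGIN, T_END, T_PRINT,
             T_SEMI, T_NUM, T_EQ, T_EOF };

static const enum token *tok;  /* current position in the token stream */
static int ok;                 /* cleared on the first syntax error */

/* Consume the expected terminal, or record an error without advancing. */
static void eat(enum token expected) {
    if (ok && *tok == expected) tok++;
    else ok = 0;
}

static void S(void);
static void L(void);
static void E(void);

static void S(void) {
    if (!ok) return;
    switch (*tok) {            /* one lookahead token picks the rule */
    case T_IF:    eat(T_IF); E(); eat(T_THEN); S(); eat(T_ELSE); S(); break;
    case T_BEGIN: eat(T_BEGIN); S(); L(); break;
    case T_PRINT: eat(T_PRINT); E(); break;
    default:      ok = 0;      /* no S rule starts with this token */
    }
}

static void L(void) {
    if (!ok) return;
    switch (*tok) {
    case T_END:  eat(T_END); break;
    case T_SEMI: eat(T_SEMI); S(); L(); break;
    default:     ok = 0;
    }
}

/* E has a single production, so no decision is needed. */
static void E(void) { eat(T_NUM); eat(T_EQ); eat(T_NUM); }

/* tokens must end with T_EOF. Returns 1 on success, 0 on a syntax error. */
int parse_program(const enum token *tokens) {
    tok = tokens;
    ok = 1;
    S();
    return ok && *tok == T_EOF;
}
```

For example, the token sequence for "begin print num = num ; print num = num end" is accepted, while an if statement without its else branch is rejected at the point where T_ELSE is expected.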
A grammar must be left-factored before it is used for predictive parsing.
Left-factoring involves rewriting the rules so that, if a non-terminal has more than one rule, each begins with a distinct terminal.
Consider a rule of the form
A -> a B1 | a B2 | a B3 | ... | a Bn
A top-down parser generated from this grammar is not efficient, as it requires backtracking. To avoid this problem we left factor the grammar:
Collect all productions with the same left-hand side that begin with the same symbols on the right-hand side.
Combine the common strings into a single production and append a new non-terminal symbol to the end of this new production.
Create new productions using this new non-terminal for each of the suffixes to the common production.
After left factoring, the above grammar is transformed into:
A -> a A1
A1 -> B1 | B2 | B3 | ... | Bn
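The three steps above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: every grammar symbol is a single character, all alternatives of the rule are passed in together and share a non-empty prefix, and the new non-terminal is named by appending 1 to the left-hand side. The function names common_prefix_len() and left_factor() are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Length of the longest prefix shared by all n alternatives (n >= 1). */
size_t common_prefix_len(const char *alts[], size_t n) {
    size_t len = strlen(alts[0]);
    for (size_t i = 1; i < n; i++) {
        size_t j = 0;
        while (j < len && alts[i][j] != '\0' && alts[i][j] == alts[0][j]) j++;
        len = j;               /* the prefix can only shrink */
    }
    return len;
}

/* Writes the factored rules into out:
       <lhs> -> <prefix> <lhs>1
       <lhs>1 -> <suffix> | <suffix> | ...
   An empty suffix is written as "eps". */
void left_factor(char lhs, const char *alts[], size_t n,
                 char *out, size_t cap) {
    size_t k = common_prefix_len(alts, n);
    size_t used = (size_t)snprintf(out, cap, "%c -> %.*s %c1\n%c1 ->",
                                   lhs, (int)k, alts[0], lhs, lhs);
    for (size_t i = 0; i < n && used < cap; i++) {
        const char *suffix = alts[i] + k;
        used += (size_t)snprintf(out + used, cap - used, "%s %s",
                                 i ? " |" : "", *suffix ? suffix : "eps");
    }
}
```

Calling left_factor('A', alts, 3, out, sizeof out) with the alternatives "ax", "ay" and "az" produces the two rules A -> a A1 and A1 -> x | y | z, which mirrors the transformation shown above.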
LL(1) means that for each non-terminal and token there is only one production.
This can be specified via 2D tables:
One dimension for the current non-terminal to expand
One dimension for the next token
A table entry contains one production
Left-factored grammar
E -> T X
X -> + E | ε
T -> ( E ) | int Y
Y -> * T | ε
We use a stack to keep track of pending non-terminals.
We reject when we encounter an error state.
We accept when we encounter end-of-input.
Consider the [E, int] entry:
When the current non-terminal is E and the next input is int, use production E -> T X.
This production can generate an int in the first place.
Consider the [Y, +] entry:
When the current non-terminal is Y and the current token is +, get rid of Y.
Y can be followed by + only in a derivation in which Y -> ε.
Blank entries indicate error situations.
Consider the [E, *] entry:
There is no way to derive a string starting with * from non-terminal E.
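The table and stack described above can be combined into a small driver loop. The sketch below encodes the table for the left-factored grammar as a C function; the entries are the ones that follow from the grammar by the usual LL(1) construction, which is an assumption here since the table itself is not reproduced in the text. The single-character encoding ('i' for int, '$' for end-of-input) and the names table() and ll1_parse() are also assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Parsing table for  E -> T X,  X -> + E | eps,
                      T -> ( E ) | int Y,  Y -> * T | eps.
   Returns the right-hand side to push, "" for an epsilon production,
   or NULL for a blank (error) entry. */
static const char *table(char nonterminal, char token) {
    switch (nonterminal) {
    case 'E':
        return (token == 'i' || token == '(') ? "TX" : NULL;
    case 'X':
        if (token == '+') return "+E";
        return (token == ')' || token == '$') ? "" : NULL;
    case 'T':
        if (token == 'i') return "iY";
        return token == '(' ? "(E)" : NULL;
    case 'Y':
        if (token == '*') return "*T";
        return (token == '+' || token == ')' || token == '$') ? "" : NULL;
    }
    return NULL;
}

/* tokens must end with '$'. Returns 1 to accept, 0 to reject. */
int ll1_parse(const char *tokens) {
    char stack[256];
    size_t top = 0;
    stack[top++] = '$';                     /* bottom-of-stack marker */
    stack[top++] = 'E';                     /* start symbol */
    while (top > 0) {
        char x = stack[--top];
        if (strchr("EXTY", x) != NULL) {    /* non-terminal: expand it */
            const char *rhs = table(x, *tokens);
            size_t n;
            if (rhs == NULL) return 0;      /* blank entry: error */
            n = strlen(rhs);
            if (top + n > sizeof stack) return 0;
            while (n > 0) stack[top++] = rhs[--n];  /* push in reverse */
        } else {                            /* terminal: must match input */
            if (x != *tokens) return 0;
            if (x == '$') return 1;         /* end-of-input, stack empty */
            tokens++;
        }
    }
    return 0;
}
```

With this encoding, ll1_parse("i*i$") and ll1_parse("(i)$") accept, while ll1_parse("*i$") rejects immediately because the [E, *] entry is blank.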
YACC uses bottom-up parsing. Bottom-up parsers use two important operations: shift and reduce.
(In abstract terms, we simulate a pushdown automaton as a finite state automaton.)
Input: the given string to be parsed and the set of productions.
Goal: Trace a rightmost derivation in reverse by starting with the input string and working backwards to the start symbol.
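The shift and reduce operations can be illustrated with a deliberately naive sketch. It is not how YACC decides between shifting and reducing (YACC builds LALR(1) tables); instead it reduces whenever the top of the stack matches a right-hand side, trying the longest one first, and shifts otherwise. That simple policy happens to work for the small grammar assumed here, E -> E + T | T and T -> i, which is a hypothetical example and not taken from the text. The names reduce() and shift_reduce_parse() are likewise assumptions.

```c
#include <string.h>

/* Productions of the assumed grammar, longest right-hand side first
   so that the longest handle on top of the stack is chosen. */
static const struct { char lhs; const char *rhs; } rules[] = {
    { 'E', "E+T" }, { 'E', "T" }, { 'T', "i" },
};

/* Tries one reduction at the top of the stack; returns 1 if it reduced. */
static int reduce(char *stack, size_t *top) {
    for (size_t r = 0; r < sizeof rules / sizeof rules[0]; r++) {
        size_t n = strlen(rules[r].rhs);
        if (*top >= n && memcmp(stack + *top - n, rules[r].rhs, n) == 0) {
            *top -= n;                       /* pop the handle */
            stack[(*top)++] = rules[r].lhs;  /* push the non-terminal */
            return 1;
        }
    }
    return 0;
}

/* input may contain only 'i' and '+'. Returns 1 to accept, 0 to reject. */
int shift_reduce_parse(const char *input) {
    char stack[256];
    size_t top = 0;
    for (;;) {
        while (reduce(stack, &top)) { }      /* reduce as long as possible */
        if (*input == '\0') break;
        if (*input != 'i' && *input != '+') return 0;
        if (top == sizeof stack) return 0;
        stack[top++] = *input++;             /* shift the next token */
    }
    return top == 1 && stack[0] == 'E';      /* only the start symbol left */
}
```

On the input "i+i" the stack goes through i, T, E, E+, E+i, E+T, E. Read backwards, these steps are the rightmost derivation E => E+T => E+i => T+i => i+i.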