Parsing
A parser is an algorithm that determines whether a given input string is in a language and, as a side-effect,
usually produces a parse tree for the input. There is a procedure for generating a parser from a given context-free
grammar.
Recursive-Descent Parsing
Recursive-descent parsing is one of the simplest parsing techniques that is used in practice. Recursive-descent
parsers are also called top-down parsers, since they construct the parse tree top down (rather than bottom up).
The basic idea of recursive-descent parsing is to associate each non-terminal with a procedure. The goal of each
such procedure is to read a sequence of input characters that can be generated by the corresponding non-
terminal, and return a pointer to the root of the parse tree for the non-terminal. The structure of the procedure is
dictated by the productions for the corresponding non-terminal.
The procedure attempts to "match" the right hand side of some production for a non-terminal.
To match a terminal symbol, the procedure compares the terminal symbol to the input; if they agree, then
the procedure is successful, and it consumes the terminal symbol in the input (that is, moves the input
cursor over one symbol).
To match a non-terminal symbol, the procedure simply calls the corresponding procedure for that non-
terminal symbol (which may be a recursive call, hence the name of the technique).
Consider the following grammar for expressions (we'll look at the reasons for the peculiar structure of this
grammar later):
We create procedures for each of the non-terminals. According to production 1, the procedure to match
expressions (<E>) must match a term (by calling the procedure for <T>), and then more expressions (by calling
the procedure <E*>).
procedure E;
    T; Estar;
Some procedures, such as <E*>, must examine the input to determine which production to choose.
procedure Estar;
    if NextInputChar = "+" or NextInputChar = "-" then begin
        read(NextInputChar);
        T; Estar;
    end;
    /* otherwise match epsilon: consume nothing */
We will append a special marker symbol (ENDM) to the input string; this marker symbol notifies the parser that
the entire input has been seen. We should also modify the procedure for the start symbol, E, to recognize the end
marker after seeing an expression.
procedure Estar;
    if NextInputChar = "+" or NextInputChar = "-" then begin
        read(NextInputChar);
        T; Estar;
    end;

procedure T;
    F; Tstar;

procedure Tstar;
    if NextInputChar = "*" or NextInputChar = "/" then begin
        read(NextInputChar);
        F; Tstar;
    end;

procedure F;
    if NextInputChar = "(" then begin
        read(NextInputChar);
        E;
        if NextInputChar = ")" then
            read(NextInputChar)
        else print("syntax error");
    end
    else if NextInputChar is a number then
        read(NextInputChar)
    else print("syntax error");
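As a rough illustration, the procedures above can be transcribed into Python almost line for line. This is only a sketch: the tokenizer, the class name Recognizer, the helpers peek, read, error and accepts, and the choice to raise SyntaxError rather than print a message are all assumptions made for the example. The end marker is checked in a separate parse method so that the recursive call to E inside F is unaffected.

```python
import re

END = "ENDM"  # end marker appended after the last real token


def tokenize(text):
    """Split the input into numbers and single-character symbols, then append END."""
    return re.findall(r"\d+|\S", text) + [END]


class Recognizer:
    """One method per non-terminal; each consumes the tokens its non-terminal generates."""

    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def read(self):
        self.pos += 1

    def error(self, expected):
        raise SyntaxError(f"expected {expected}, found {self.peek()!r}")

    def parse(self):
        self.E()
        if self.peek() != END:
            self.error("end of input")

    def E(self):
        self.T()
        self.Estar()

    def Estar(self):
        if self.peek() in ("+", "-"):
            self.read()
            self.T()
            self.Estar()
        # otherwise match epsilon: consume nothing

    def T(self):
        self.F()
        self.Tstar()

    def Tstar(self):
        if self.peek() in ("*", "/"):
            self.read()
            self.F()
            self.Tstar()
        # otherwise match epsilon: consume nothing

    def F(self):
        if self.peek() == "(":
            self.read()
            self.E()
            if self.peek() != ")":
                self.error("')'")
            self.read()
        elif self.peek().isdecimal():
            self.read()
        else:
            self.error("'(' or a number")


def accepts(text):
    """Return True if text is an expression of the grammar, False otherwise."""
    try:
        Recognizer(text).parse()
        return True
    except SyntaxError:
        return False
```

Calling accepts("1 + (2 * 3) / 4") exercises every procedure, while an input such as "1 +" fails inside F because the lookahead is already the end marker.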
As an example, consider the following input: 1 + (2 * 3) / 4. We just call the procedure corresponding to the start
symbol.
NextInputChar = "1"
Call E
    Call T
        Call F
            NextInputChar = "+"             /* Match 1 with F */
        Call Tstar                          /* Match epsilon */
    Call Estar
        NextInputChar = "("                 /* Match + */
        Call T
            Call F
                /* Match (, looking for E ) */
                NextInputChar = "2"
                Call E
                    Call T
                        Call F
                            /* Match 2 with F */
                            NextInputChar = "*"
                        Call Tstar
                            /* Match * */
                            NextInputChar = "3"
                            Call F
                                /* Match 3 with F */
                                NextInputChar = ")"
                            Call Tstar
                                /* Match epsilon */
                    Call Estar              /* Match epsilon */
                NextInputChar = "/"         /* Match ")" */
            Call Tstar
                NextInputChar = "4"         /* Match "/" */
                Call F
                    /* Match 4 with F */
                    NextInputChar = ENDM
                Call Tstar                  /* Match epsilon */
        Call Estar                          /* Match epsilon */
/* Match ENDM */
In our expression parser, we choose the epsilon production only if NextInputChar does not match the
first terminal on the right hand side of any other production for that non-terminal.
We never attempt to read beyond the end marker (ENDM), which is matched only at the end of an
expression. In all other circumstances, the presence of the end marker signals a syntax error.
As written, our recursive-descent parser only determines whether or not the input string is in the language
of the grammar; it does not give the structure of the string according to the grammar. We could easily
build a parse tree incrementally during parsing.
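One way to build a tree during parsing is to have each procedure return the node it recognized. The sketch below is an assumption-laden illustration, not the notes' own code: it returns a simplified tree of nested tuples (operator, left, right) rather than a full parse tree with a node for every non-terminal, and the names parse, peek and read are invented for the example. Passing the left operand into Estar and Tstar keeps the operators left-associative.

```python
import re

END = "ENDM"  # end marker appended after the last real token


def parse(text):
    """Parse an expression and return a tree of nested (operator, left, right) tuples."""
    tokens = re.findall(r"\d+|\S", text) + [END]
    pos = 0

    def peek():
        return tokens[pos]

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def E():
        return Estar(T())

    def Estar(left):
        if peek() in ("+", "-"):
            op = read()
            right = T()
            return Estar((op, left, right))
        return left  # epsilon: the tree built so far is complete

    def T():
        return Tstar(F())

    def Tstar(left):
        if peek() in ("*", "/"):
            op = read()
            right = F()
            return Tstar((op, left, right))
        return left  # epsilon: the tree built so far is complete

    def F():
        if peek() == "(":
            read()
            node = E()
            if peek() != ")":
                raise SyntaxError(f"expected ')', found {peek()!r}")
            read()
            return node
        if peek().isdecimal():
            return int(read())
        raise SyntaxError(f"expected '(' or a number, found {peek()!r}")

    tree = E()
    if peek() != END:
        raise SyntaxError(f"expected end of input, found {peek()!r}")
    return tree


print(parse("1 + (2 * 3) / 4"))  # ('+', 1, ('/', ('*', 2, 3), 4))
```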
In order to implement a recursive-descent parser for a grammar, for each nonterminal in the grammar, it must be
possible to determine which production to apply for that non-terminal by looking only at the current input
symbol. (We want to avoid having the compiler or other text processing program scan ahead in the input to
determine what action to take next.)
The lookahead symbol is simply the next terminal that we will try to match in the input. We use a single
lookahead symbol to decide what production to match.
Consider a production: A --> X1...Xm. We need to know the set of possible lookahead symbols that indicate this
production is to be chosen.
This set is clearly those terminal symbols that can be produced by the symbols X1...Xm (which may be
either terminals or non-terminals).
Since a lookahead is only a single terminal symbol, we want the first (i.e., leftmost) symbol that could be
produced by X1...Xm.
We denote the set of symbols that could be produced first by X1...Xm as First(X1...Xm).
First Sets
To distinguish two productions with the same non-terminal on the left hand side, we examine the First sets for
their corresponding right hand sides. Given the production A --> X1...Xm we must determine First(X1...Xm).
If X1 can generate epsilon, then X1 can (in effect) be erased, and First(X1...Xm) depends on X2.
Similarly, if both X1 and X2 can produce epsilon, we consider X3, then X4, etc.
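The rule above can be sketched as a small fixed-point computation. The grammar below is an assumption: it is the expression grammar implied by the parsing procedures (E, E*, T, T*, F), written as a Python dictionary in which an empty list stands for an epsilon production. The names EPS, GRAMMAR, first_of_sequence and compute_first are invented for this illustration.

```python
EPS = "epsilon"

# Assumed expression grammar; an empty right hand side is an epsilon production.
GRAMMAR = {
    "E": [["T", "E*"]],
    "E*": [["+", "T", "E*"], ["-", "T", "E*"], []],
    "T": [["F", "T*"]],
    "T*": [["*", "F", "T*"], ["/", "F", "T*"], []],
    "F": [["(", "E", ")"], ["number"]],
}


def first_of_sequence(symbols, first):
    """First(X1...Xm): scan left to right while each Xi can produce epsilon."""
    result = set()
    for symbol in symbols:
        # A terminal's First set is just the terminal itself.
        symbol_first = first.get(symbol, {symbol})
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result  # Xi cannot be erased, so later symbols are irrelevant
    result.add(EPS)  # every Xi (or an empty sequence) can produce epsilon
    return result


def compute_first(grammar):
    """Repeat until no First set grows any further."""
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
```

For this grammar FIRST["F"] comes out as {"(", "number"}, and FIRST["E*"] contains "+", "-" and epsilon.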
Follow Sets
Suppose we are attempting to compute the lookahead symbols that suggest the production A --> X1...Xm. What
if each of the Xi can produce epsilon?
If the entire right hand side of a production can produce epsilon, then the lookahead for A is determined by those
terminal symbols that can follow A in a parse. We denote the set of terminal symbols that can follow a non-
terminal A in a parse as Follow(A).
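We inspect the grammar for all occurrences of the non-terminal A. In each production, A is either:
followed by further symbols (as in S -> Y1...Ym A Z1...Zk), in which case Follow(A) includes the
terminals in First(Z1...Zk), and also Follow(S) if Z1...Zk can produce epsilon; or
at the end of a production for some non-terminal S (as in S -> Y1...YmA), in which case Follow(A)
includes Follow(S).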
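A minimal sketch of this computation follows, under the same assumptions as before: the grammar is the expression grammar implied by the parsing procedures, here extended with a start production S -> E ENDM, and all function and variable names are invented for the example. For every occurrence of a non-terminal, the code looks at what comes after it in the production and falls back to the Follow set of the left hand side when the remainder can vanish.

```python
EPS = "epsilon"
END = "ENDM"

# Assumed expression grammar with an explicit start production.
GRAMMAR = {
    "S": [["E", END]],
    "E": [["T", "E*"]],
    "E*": [["+", "T", "E*"], ["-", "T", "E*"], []],
    "T": [["F", "T*"]],
    "T*": [["*", "F", "T*"], ["/", "F", "T*"], []],
    "F": [["(", "E", ")"], ["number"]],
}


def first_of_sequence(symbols, first):
    """First set of a symbol sequence; contains EPS if the whole sequence can vanish."""
    result = set()
    for symbol in symbols:
        symbol_first = first.get(symbol, {symbol})
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def compute_first(grammar):
    first = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for nonterminal, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nonterminal]:
                    first[nonterminal] |= new
                    changed = True
    return first


def compute_follow(grammar, first):
    follow = {nonterminal: set() for nonterminal in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:
                        continue  # only non-terminals have Follow sets
                    rest = first_of_sequence(rhs[i + 1:], first)
                    new = rest - {EPS}
                    if EPS in rest:
                        # Nothing visible need follow the symbol in this
                        # production, so whatever follows lhs can follow it.
                        new |= follow[lhs]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR, compute_first(GRAMMAR))
```

Under these assumptions Follow(E) and Follow(E*) both come out as { ), ENDM }, which is why a closing parenthesis or the end marker selects the epsilon production for E*.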
First(<F>) = { (, number }
That is, any symbol that can be the first symbol produced by the right hand side of a production will predict that
production. Further, if the entire right hand side can produce epsilon, then symbols that can immediately follow
the left hand side of a production will also predict that production.
If two productions with the same left hand side,
1. A --> X1...Xm
2. A --> Y1...Yn
have predict sets with a symbol in common, then we cannot in general know which production to select by
looking at a single input symbol.
Recursive-descent parsing can only parse those CFG's that have disjoint predict sets for productions that share a
common left hand side. CFG's that obey this restriction are called LL(1).
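The LL(1) condition can be checked mechanically once First and Follow are known. The sketch below makes several assumptions: the production list is the expression grammar implied by the parsing procedures plus a start production S -> E ENDM, the FIRST and FOLLOW tables were worked out by hand for that grammar, and the names predict and is_ll1 are invented for the example.

```python
EPS = "epsilon"

# Assumed productions as (left hand side, right hand side); [] is epsilon.
PRODUCTIONS = [
    ("S", ["E", "ENDM"]),
    ("E", ["T", "E*"]),
    ("E*", ["+", "T", "E*"]),
    ("E*", ["-", "T", "E*"]),
    ("E*", []),
    ("T", ["F", "T*"]),
    ("T*", ["*", "F", "T*"]),
    ("T*", ["/", "F", "T*"]),
    ("T*", []),
    ("F", ["(", "E", ")"]),
    ("F", ["number"]),
]

# First and Follow sets worked out by hand for the grammar above.
FIRST = {
    "S": {"(", "number"},
    "E": {"(", "number"},
    "E*": {"+", "-", EPS},
    "T": {"(", "number"},
    "T*": {"*", "/", EPS},
    "F": {"(", "number"},
}
FOLLOW = {
    "S": set(),
    "E": {")", "ENDM"},
    "E*": {")", "ENDM"},
    "T": {"+", "-", ")", "ENDM"},
    "T*": {"+", "-", ")", "ENDM"},
    "F": {"*", "/", "+", "-", ")", "ENDM"},
}


def first_of_sequence(symbols):
    result = set()
    for symbol in symbols:
        symbol_first = FIRST.get(symbol, {symbol})  # a terminal is its own First set
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result
    result.add(EPS)
    return result


def predict(lhs, rhs):
    """First of the right hand side, plus Follow(lhs) if the right hand side can vanish."""
    first = first_of_sequence(rhs)
    result = first - {EPS}
    if EPS in first:
        result |= FOLLOW[lhs]
    return result


def is_ll1(productions):
    """True if productions sharing a left hand side have pairwise disjoint predict sets."""
    claimed = {}  # left hand side -> lookahead symbols already used
    for lhs, rhs in productions:
        lookaheads = predict(lhs, rhs)
        if lookaheads & claimed.setdefault(lhs, set()):
            return False
        claimed[lhs] |= lookaheads
    return True
```

With these tables is_ll1(PRODUCTIONS) holds, whereas a pair such as E -> T + E and E -> T fails the check because both productions are predicted by every symbol in First(T).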
From experience we know that it is usually possible to create an LL(1) CFG for a programming language.
However, not all CFG's are LL(1) and a CFG that is not LL(1) may be parsable using some other (usually more
complex) parsing technique.
Recursive-descent parsing can only parse grammars that have disjoint predict sets for productions that share a
common left hand side.
Common prefix: any grammar containing two productions for the same non-terminal that share a common
prefix on the right hand side cannot be LL(1). The problem is that any symbol that predicts the first
production must also predict the second; since the predict sets for the two productions are not disjoint, the
grammar is not LL(1).
This grammar has left recursion, and therefore cannot be LL(1). We can replace the use of left recursion with
right recursion as follows:
The resulting grammar is still not LL(1); productions 1-3 share a common prefix, as do productions 4-6. We can
eliminate the common prefix by deferring the decision as to which production to pick until after seeing the
common prefix. This technique is called factoring the common prefix.
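Factoring can be sketched in a few lines. The example grammar here is hypothetical (E -> T + E, E -> T - E, E -> T), chosen only because its productions share the prefix T, and the helper names common_prefix and left_factor are invented. The sketch handles only the simple case in which every production of the non-terminal shares one prefix; a fuller version would first group productions by their leading symbol.

```python
def common_prefix(sequences):
    """Longest list of symbols that starts every sequence."""
    prefix = []
    for column in zip(*sequences):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix


def left_factor(nonterminal, productions, new_name):
    """Replace A -> p t1 | p t2 | ... with A -> p A' and A' -> t1 | t2 | ..."""
    prefix = common_prefix(productions)
    if len(productions) < 2 or not prefix:
        return {nonterminal: productions}  # nothing to factor
    tails = [rhs[len(prefix):] for rhs in productions]  # [] means epsilon
    return {nonterminal: [prefix + [new_name]], new_name: tails}


factored = left_factor(
    "E",
    [["T", "+", "E"], ["T", "-", "E"], ["T"]],
    "E'",
)
# factored == {"E": [["T", "E'"]], "E'": [["+", "E"], ["-", "E"], []]}
```

After factoring, the choice among the three alternatives is made by the new non-terminal, whose productions begin with different symbols (or are empty), so a single lookahead symbol is enough to decide.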
Table-Driven Parsing
In recursive-descent parsing, the decision as to which production to choose for a particular non-terminal is hard-
coded into the procedure for the non-terminal. The procedure uses the Predict sets (computed from the First and
Follow sets) for the grammar to decide which production to choose based on the lookahead symbol.
The problem with recursive-descent parsing is that it is inflexible; changes in the grammar can cause significant
(and in some cases non-obvious) changes to the parser.
Since recursive-descent parsing uses an implicit stack of procedure calls, it is possible to replace the parsing
procedures and implicit stack with an explicit stack and a single parsing procedure that manipulates the stack.
In this scheme, we encode the actions the parsing procedure should take in a table. This table can be generated
automatically (with the grammar as input), which is why this approach adapts more easily to changes in the
grammar.
A Table-Driven Parser
The parse table encodes the choice of production as a function of the current non-terminal of interest and the
lookahead symbol.
T: Non-terminals x Terminals -> Productions U {Error}
The entry T[A,x] gives the production number to choose when A is the non-terminal of interest and x is the
current input symbol. The table is a mapping from non-terminals x terminals to productions.
T[A,x] == A -> X1..Xm if x in Predict(A->X1..Xm)
otherwise T[A,x] == Error
The driver procedure is very simple. It stacks symbols that are to be matched or expanded. Terminal symbols on
the stack must match an input symbol; non-terminal symbols are expanded via the Predict function (which is
encoded in the parse table).
The table for this expression grammar is (where a blank entry corresponds to an error):
      (       )       +       -       *       /       Number  ENDM
------------------------------------------------------------------
S     1                                               1
------------------------------------------------------------------
E     2                                               2
------------------------------------------------------------------
E*            5       3       4                               5
------------------------------------------------------------------
T     6                                               6
------------------------------------------------------------------
T*            9       9       9       7       8               9
------------------------------------------------------------------
F     10                                              11
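A table like the one above can be filled in mechanically from the predict sets. In the sketch below the predict sets are assumptions: they were worked out by hand for a production numbering that is consistent with the table (1: S -> E ENDM, 2: E -> T E*, 3-5: the E* productions, 6: T -> F T*, 7-9: the T* productions, 10-11: the F productions). The names PREDICT, build_table and lookup are invented for the example.

```python
# Production number -> (left hand side, predict set); assumed, worked out by hand.
PREDICT = {
    1: ("S", {"(", "number"}),
    2: ("E", {"(", "number"}),
    3: ("E*", {"+"}),
    4: ("E*", {"-"}),
    5: ("E*", {")", "ENDM"}),
    6: ("T", {"(", "number"}),
    7: ("T*", {"*"}),
    8: ("T*", {"/"}),
    9: ("T*", {"+", "-", ")", "ENDM"}),
    10: ("F", {"("}),
    11: ("F", {"number"}),
}


def build_table(predict):
    """Set T[A, x] to production p whenever x is in Predict(p) and p has left hand side A."""
    table = {}
    for number, (lhs, lookaheads) in predict.items():
        for terminal in lookaheads:
            if (lhs, terminal) in table:
                # Two productions claim the same entry: the grammar is not LL(1).
                raise ValueError(f"conflict at T[{lhs}, {terminal}]")
            table[(lhs, terminal)] = number
    return table


def lookup(table, nonterminal, terminal):
    """Missing entries correspond to the blank (error) cells of the table."""
    return table.get((nonterminal, terminal), "Error")


TABLE = build_table(PREDICT)
```

Here lookup(TABLE, "E*", ")") gives 5 and lookup(TABLE, "F", "+") gives "Error", matching the filled and blank cells respectively.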
Driver Procedure
Under table-driven parsing, there is a single procedure that "interprets" the parse table. This "driver" procedure
takes the following form:
procedure Parser;
    /* Push the start symbol S onto the stack */
    Push(S, stack)
    /* Initialize lookahead symbol */
    scanner(NextInputSymbol)
    while not Empty(stack) do
        top = Top(stack)
        if top is a nonterminal then
            action = ParseTable[top, NextInputSymbol]
            if action > 0 then
                /* Pop top symbol */
                Pop(stack)
                /* Push RHS of production in reverse order, so that
                   its leftmost symbol ends up on top of the stack */
                for each symbol on RHS #action, from right to left, do
                    Push(symbol, stack)
            else
                print("syntax error"); return
        else if NextInputSymbol == top then
            /* Match terminal symbol in input */
            Pop(stack)
            /* Get next terminal symbol in input */
            scanner(NextInputSymbol)
        else
            print("syntax error"); return
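The driver can be sketched in Python as follows. Several things here are assumptions rather than part of the notes: the production numbering (chosen to be consistent with the parse table), the tokenizer, the decision to raise SyntaxError instead of printing, and the fact that parse returns the list of production numbers it applied. Note that the right hand side is pushed in reverse so that its leftmost symbol is expanded or matched first.

```python
import re

# Assumed production numbering; [] is an epsilon production.
PRODUCTIONS = {
    1: ["E", "ENDM"],
    2: ["T", "E*"],
    3: ["+", "T", "E*"],
    4: ["-", "T", "E*"],
    5: [],
    6: ["F", "T*"],
    7: ["*", "F", "T*"],
    8: ["/", "F", "T*"],
    9: [],
    10: ["(", "E", ")"],
    11: ["number"],
}

# Parse table: non-terminal -> lookahead -> production number (missing = error).
TABLE = {
    "S": {"(": 1, "number": 1},
    "E": {"(": 2, "number": 2},
    "E*": {")": 5, "+": 3, "-": 4, "ENDM": 5},
    "T": {"(": 6, "number": 6},
    "T*": {")": 9, "+": 9, "-": 9, "*": 7, "/": 8, "ENDM": 9},
    "F": {"(": 10, "number": 11},
}


def tokenize(text):
    """Turn digit runs into the terminal 'number'; keep other symbols; append ENDM."""
    lexemes = re.findall(r"\d+|\S", text)
    return ["number" if lexeme.isdecimal() else lexeme for lexeme in lexemes] + ["ENDM"]


def parse(text):
    """Return the production numbers applied, in order, or raise SyntaxError."""
    tokens = tokenize(text)
    pos = 0
    stack = ["S"]
    applied = []
    while stack:
        top = stack[-1]
        lookahead = tokens[pos]
        if top in TABLE:  # non-terminal: expand it using the table
            action = TABLE[top].get(lookahead)
            if action is None:
                raise SyntaxError(f"unexpected {lookahead!r} while expanding {top}")
            stack.pop()
            stack.extend(reversed(PRODUCTIONS[action]))  # leftmost symbol on top
            applied.append(action)
        elif top == lookahead:  # terminal: match it against the input
            stack.pop()
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {lookahead!r}")
    return applied


print(parse("7"))  # [1, 2, 6, 11, 9, 5]
```

Because the grammar lives entirely in PRODUCTIONS and TABLE, changing the language means regenerating those two structures; the parse function itself stays the same.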
Example Parse