Parsing

Recursive-descent parsing is a top-down parsing technique that uses procedures associated with each non-terminal in a context-free grammar. Each procedure attempts to match the right-hand side of productions for its non-terminal by calling other procedures or comparing terminals to input. Lookahead is used to determine which production to choose when alternatives exist. First and follow sets are computed to determine the lookahead symbols for each non-terminal. A grammar is LL(1) if lookahead of one symbol is sufficient to determine the production to select.

Uploaded by

Washington Brown
Copyright
© All Rights Reserved

Parsing

A parser is an algorithm that determines whether a given input string is in a language and, as a side-effect,
usually produces a parse tree for the input. There is a procedure for generating a parser from a given context-free
grammar.

Recursive-Descent Parsing
Recursive-descent parsing is one of the simplest parsing techniques that is used in practice. Recursive-descent
parsers are top-down parsers: they construct the parse tree from the root down (rather than bottom up).

The basic idea of recursive-descent parsing is to associate each non-terminal with a procedure. The goal of each
such procedure is to read a sequence of input characters that can be generated by the corresponding non-
terminal, and return a pointer to the root of the parse tree for the non-terminal. The structure of the procedure is
dictated by the productions for the corresponding non-terminal.

The procedure attempts to "match" the right hand side of some production for a non-terminal.

To match a terminal symbol, the procedure compares the terminal symbol to the input; if they agree, then
the procedure is successful, and it consumes the terminal symbol in the input (that is, moves the input
cursor over one symbol).

To match a non-terminal symbol, the procedure simply calls the corresponding procedure for that non-
terminal symbol (which may be a recursive call, hence the name of the technique).

Recursive-Descent Parser for Expressions

Consider the following grammar for expressions (we'll look at the reasons for the peculiar structure of this
grammar later):

1. <E> --> <T> <E*>
2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
3. <T> --> <F> <T*>
4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
5. <F> --> ( <E> ) | number

We create a procedure for each of the non-terminals. According to production 1, the procedure to match
expressions (<E>) must match a term (by calling the procedure for <T>), and then the rest of the expression (by
calling the procedure for <E*>).

procedure E;
    T; Estar;

Some procedures, such as the one for <E*>, must examine the input to determine which production to choose.

procedure Estar;
    if NextInputChar = "+" or NextInputChar = "-" then
        read(NextInputChar);
        T; Estar;

We will append a special marker symbol (ENDM) to the input string; this marker symbol notifies the parser that
the entire input has been seen. The parser must recognize the end marker after seeing a complete expression.
Because procedure F also calls E for a parenthesized subexpression (which is followed by ")" rather than ENDM),
the check belongs in a separate top-level procedure, Parse, rather than in E itself.

Top-Down Parser for Expressions

procedure Parse;
    E;
    if NextInputChar = ENDM then /* done */
    else print("syntax error")

procedure E;
    T; Estar;

procedure Estar;
    if NextInputChar = "+" or NextInputChar = "-" then
        read(NextInputChar);
        T; Estar;

procedure T;
    F; Tstar;

procedure Tstar;
    if NextInputChar = "*" or NextInputChar = "/" then
        read(NextInputChar);
        F; Tstar;

procedure F;
    if NextInputChar = "(" then
        read(NextInputChar);
        E;
        if NextInputChar = ")" then
            read(NextInputChar)
        else print("syntax error")
    else if NextInputChar = number then
        read(NextInputChar)
    else print("syntax error")
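
The pseudocode above can be turned into a runnable recognizer along the following lines. This is only a sketch under some assumptions that are not in the notes: the names Recognizer, ParseError and accepts are made up for illustration, numbers may have several digits, a syntax error raises an exception instead of printing, and the end-marker check sits in a separate top-level method (parse) so that the call to E inside parentheses does not demand ENDM.

```python
import re

ENDM = "ENDM"  # assumed spelling of the end marker token


class ParseError(Exception):
    """Raised where the pseudocode prints "syntax error"."""


class Recognizer:
    def __init__(self, text):
        # Numbers may span several digits; any other non-blank character is a token.
        self.tokens = re.findall(r"\d+|\S", text) + [ENDM]
        self.pos = 0

    @property
    def next_token(self):
        return self.tokens[self.pos]

    def read(self):
        self.pos += 1

    def parse(self):
        # Top level: one expression followed by the end marker.
        self.E()
        if self.next_token != ENDM:
            raise ParseError("expected end of input")

    def E(self):
        self.T()
        self.Estar()

    def Estar(self):
        if self.next_token in ("+", "-"):
            self.read()
            self.T()
            self.Estar()
        # otherwise choose the epsilon production: just return

    def T(self):
        self.F()
        self.Tstar()

    def Tstar(self):
        if self.next_token in ("*", "/"):
            self.read()
            self.F()
            self.Tstar()

    def F(self):
        if self.next_token == "(":
            self.read()
            self.E()
            if self.next_token != ")":
                raise ParseError("expected )")
            self.read()
        elif self.next_token.isdigit():
            self.read()
        else:
            raise ParseError(f"unexpected {self.next_token!r}")


def accepts(text):
    """Return True if text is in the language of the expression grammar."""
    try:
        Recognizer(text).parse()
        return True
    except ParseError:
        return False
```

For example, accepts("1 + (2 * 3) / 4") returns True, while accepts("1 +") and accepts("(1 + 2") return False.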

Tracing the Parser

As an example, consider the following input: 1 + (2 * 3) / 4. We just call the procedure corresponding to the start
symbol.

NextInputChar = "1"
Call E
  Call T
    Call F
      NextInputChar = "+" /* Match 1 with F */
    Call Tstar /* Match epsilon */
  Call Estar
    NextInputChar = "(" /* Match + */
    Call T
      Call F
        /* Match (, looking for E ) */
        NextInputChar = "2"
        Call E
          Call T
            Call F
              /* Match 2 with F */
              NextInputChar = "*"
            Call Tstar
              /* Match * */
              NextInputChar = "3"
              Call F
                /* Match 3 with F */
                NextInputChar = ")"
              Call Tstar
                /* Match epsilon */
          Call Estar /* Match epsilon */
        NextInputChar = "/" /* Match ")" */
      Call Tstar
        NextInputChar = "4" /* Match "/" */
        Call F
          /* Match 4 with F */
          NextInputChar = ENDM
        Call Tstar /* Match epsilon */
    Call Estar /* Match epsilon */
/* Match ENDM */

Observations about Recursive-Descent Parser


In procedures Estar and Tstar, we match one of the productions with an arithmetic operator if we see such
an operator in the input; otherwise we simply return. A procedure that returns without matching any
symbols is, in effect, choosing the epsilon production.

In our expression parser, we only choose the epsilon production if the NextInputChar doesn't match the
first terminal on the right hand side of the production.

We never attempt to read beyond the end marker (ENDM), which is matched only at the end of an
expression. In all other circumstances, the presence of the end marker signals a syntax error.

As written, our recursive-descent parser only determines whether or not the input string is in the language
of the grammar; it does not give the structure of the string according to the grammar. We could easily
build a parse tree incrementally during parsing.
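
As a rough illustration of how a parse tree could be built during parsing, the sketch below has each procedure return a nested tuple whose first element is the non-terminal and whose remaining elements are its children. The tuple representation, the function name parse_tree, and the use of Python's built-in SyntaxError are assumptions made for this example; the notes do not prescribe a tree format.

```python
import re

ENDM = "ENDM"  # assumed spelling of the end marker token


def parse_tree(text):
    """Return the parse tree of text as nested tuples, or raise SyntaxError."""
    tokens = re.findall(r"\d+|\S", text) + [ENDM]
    pos = 0

    def peek():
        return tokens[pos]

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def E():
        return ("E", T(), Estar())

    def Estar():
        if peek() in ("+", "-"):
            # Tuple elements are evaluated left to right: operator, term, rest.
            return ("E*", read(), T(), Estar())
        return ("E*", "epsilon")

    def T():
        return ("T", F(), Tstar())

    def Tstar():
        if peek() in ("*", "/"):
            return ("T*", read(), F(), Tstar())
        return ("T*", "epsilon")

    def F():
        if peek() == "(":
            read()
            inner = E()
            if peek() != ")":
                raise SyntaxError("expected )")
            read()
            return ("F", "(", inner, ")")
        if peek().isdigit():
            return ("F", read())
        raise SyntaxError(f"unexpected {peek()!r}")

    tree = E()
    if peek() != ENDM:
        raise SyntaxError("expected end of input")
    return tree
```

Each node mirrors one production, so the epsilon choices of Estar and Tstar show up as explicit ("E*", "epsilon") and ("T*", "epsilon") leaves.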

Lookahead in Recursive-Descent Parsing

In order to implement a recursive-descent parser for a grammar, for each nonterminal in the grammar, it must be
possible to determine which production to apply for that non-terminal by looking only at the current input
symbol. (We want to avoid having the compiler or other text processing program scan ahead in the input to
determine what action to take next.)

The lookahead symbol is simply the next terminal that we will try to match in the input. We use a single
lookahead symbol to decide what production to match.

Consider a production: A --> X1...Xm. We need to know the set of possible lookahead symbols that indicate this
production is to be chosen.

This set is clearly those terminal symbols that can be produced by the symbols X1...Xm (which may be
either terminals or non-terminals).

Since a lookahead is only a single terminal symbol, we want the first (i.e., leftmost) symbol that could be
produced by X1...Xm.

We denote the set of symbols that could be produced first by X1...Xm as First(X1...Xm).

First Sets
To distinguish two productions with the same non-terminal on the left hand side, we examine the First sets for
their corresponding right hand sides. Given the production A --> X1...Xm we must determine First(X1...Xm).

We first consider the leftmost symbol, X1.

If this is a terminal symbol, then First(X1...Xm) = {X1}.

If X1 is a non-terminal, then First(X1...Xm) includes the First set of each right hand side of a production for X1.

In our expression grammar above:

First(<E>) = First(<T> <E*>)
First(<T> <E*>) = First(<T>)
First(<T>) = First(<F> <T*>)
First(<F> <T*>) = First(<F>) = { (, number }

If X1 can generate epsilon, then X1 can (in effect) be erased, and First(X1...Xm) depends on X2.

If X2 is a terminal, it is included in First(X1...Xm).


If X2 is a non-terminal, we compute the First sets for each of its corresponding right hand sides.

Similarly, if both X1 and X2 can produce epsilon, we consider X3, then X4, etc.
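
The rules above can be sketched as a small fixed-point computation: keep applying them until no First set grows. In this sketch the grammar encoding (a dictionary from non-terminal to a list of right hand sides, with an empty list for epsilon), the marker EPS, and the function names are assumptions for illustration. EPS is kept in a First set to record that the symbol string can produce epsilon.

```python
EPS = "epsilon"  # marks "can produce the empty string"

# The expression grammar; an empty right hand side stands for epsilon.
GRAMMAR = {
    "E": [["T", "E*"]],
    "E*": [["+", "T", "E*"], ["-", "T", "E*"], []],
    "T": [["F", "T*"]],
    "T*": [["*", "F", "T*"], ["/", "F", "T*"], []],
    "F": [["(", "E", ")"], ["number"]],
}


def first_of_sequence(symbols, first):
    """First set of the string X1...Xm, given the First sets found so far."""
    result = set()
    for sym in symbols:
        # A terminal's First set is just the terminal itself.
        sym_first = first[sym] if sym in first else {sym}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result  # this symbol cannot be erased, so stop here
    result.add(EPS)  # every symbol can be erased (or the string is empty)
    return result


def compute_first(grammar):
    """Iterate until no First set changes."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for rhs in productions:
                new = first_of_sequence(rhs, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


FIRST = compute_first(GRAMMAR)
```

With this encoding, FIRST["E"], FIRST["T"] and FIRST["F"] all come out as the set containing ( and number, matching the derivation above.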

Follow Sets

Suppose we are attempting to compute the lookahead symbols that suggest the production A --> X1...Xm. What
if each of the Xi can produce epsilon?

If the entire right hand side of a production can produce epsilon, then the lookahead for A is determined by those
terminal symbols that can follow A in a parse. We denote the set of terminal symbols that can follow a non-
terminal A in a parse as Follow(A).

We inspect the grammar for all occurrences of the non-terminal A. In each production, A is either:

followed by a terminal symbol x, so x is in Follow(A).

followed by a non-terminal symbol B, so Follow(A) includes First(B). If B can produce epsilon, we must also
consider whatever follows B, applying these same rules.

at the end of a production for some non-terminal S (as in S --> Y1...YmA), in which case Follow(A)
includes Follow(S).

First and Follow Sets for Expression Grammar


Computing the First and Follow sets for our expression grammar (as augmented with a new start symbol that
includes the ENDM in the production):

1. <S> --> <E> ENDM
2. <E> --> <T> <E*>
3. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
4. <T> --> <F> <T*>
5. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
6. <F> --> ( <E> ) | number

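First(<E>) = First(<T> <E*>) = First(<T>) = First(<F>) = { (, number }

The non-epsilon productions for <E*> begin with + and -. Because <E*> can also produce epsilon, its lookahead
symbols additionally include Follow(<E*>):

Follow(<E*>) = Follow(<E>) = { ), ENDM }
Lookahead symbols for <E*> = { +, - } U Follow(<E*>) = { +, -, ), ENDM }

First(<T>) = First(<F> <T*>) = First(<F>)

Likewise, the non-epsilon productions for <T*> begin with * and /, and its lookahead symbols include Follow(<T*>).
Since <E*> can produce epsilon, Follow(<T>) contains both the terminals that begin <E*> and Follow(<E>):

Follow(<T*>) = Follow(<T>) = { +, - } U Follow(<E>) = { +, -, ), ENDM }
Lookahead symbols for <T*> = { *, / } U Follow(<T*>) = { *, /, +, -, ), ENDM }

First(<F>) = { (, number }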
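
The Follow sets for the augmented grammar can be checked mechanically. The sketch below applies the three Follow rules repeatedly until nothing changes. To stay short it takes the First sets as given (hard-coded, with EPS marking a non-terminal that can produce epsilon); the grammar encoding and all names are assumptions made for this example.

```python
EPS, ENDM = "epsilon", "ENDM"

# Augmented expression grammar; an empty right hand side stands for epsilon.
GRAMMAR = {
    "S": [["E", ENDM]],
    "E": [["T", "E*"]],
    "E*": [["+", "T", "E*"], ["-", "T", "E*"], []],
    "T": [["F", "T*"]],
    "T*": [["*", "F", "T*"], ["/", "F", "T*"], []],
    "F": [["(", "E", ")"], ["number"]],
}

# First sets of the non-terminals, taken as given for this sketch.
FIRST = {
    "S": {"(", "number"}, "E": {"(", "number"},
    "T": {"(", "number"}, "F": {"(", "number"},
    "E*": {"+", "-", EPS}, "T*": {"*", "/", EPS},
}


def first_of_sequence(symbols):
    """First set of a string of symbols; contains EPS if it can vanish."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})  # a terminal's First set is itself
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar):
    follow = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for lhs, productions in grammar.items():
            for rhs in productions:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:
                        continue  # terminals have no Follow set
                    rest = first_of_sequence(rhs[i + 1:])
                    new = rest - {EPS}
                    if EPS in rest:
                        # Everything after sym can vanish: inherit Follow(lhs).
                        new |= follow[lhs]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FOLLOW = compute_follow(GRAMMAR)
```

Under these assumptions FOLLOW["E*"] equals FOLLOW["E"] and FOLLOW["T*"] equals FOLLOW["T"], in agreement with the sets listed above.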

LL(1) Grammars for Recursive-Descent Parsing


The set of lookahead symbols that will cause the selection (i.e., prediction) of the production A --> X1...Xm is

Predict(A --> X1...Xm) = First(X1...Xm) U Follow(A)   if X1...Xm can produce epsilon
Predict(A --> X1...Xm) = First(X1...Xm)               otherwise

That is, any symbol that can be the first symbol produced by the right hand side of a production will predict that
production. Further, if the entire right hand side can produce epsilon, then symbols that can immediately follow
the left hand side of a production will also predict that production.

If, for two productions

1. A --> X1...Xm
2. A --> Y1...Yn

we have some symbol s for which

1. s is in Predict(A --> X1...Xm)
2. s is in Predict(A --> Y1...Yn)

then we cannot in general know which production to select by looking at a single input symbol.

Recursive-descent parsing with a single lookahead symbol can only parse those CFGs that have disjoint predict
sets for productions that share a common left hand side. CFGs that obey this restriction are called LL(1).

From experience we know that it is usually possible to create an LL(1) CFG for a programming language.
However, not all CFGs are LL(1), and a CFG that is not LL(1) may be parsable using some other (usually more
complex) parsing technique.
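
The LL(1) condition can be tested directly from its definition: compute the Predict set of every production and check that productions with the same left hand side have pairwise disjoint sets. The following is a minimal sketch; the grammar encoding, the EPS marker and the function names are assumptions for illustration, and First and Follow are computed together by fixed-point iteration.

```python
from itertools import combinations

EPS = "epsilon"


def first_of(symbols, first):
    """First set of a symbol string; contains EPS if the string can vanish."""
    out = set()
    for s in symbols:
        f = first.get(s, {s})  # a terminal's First set is itself
        out |= f - {EPS}
        if EPS not in f:
            return out
    return out | {EPS}


def first_and_follow(grammar):
    first = {a: set() for a in grammar}
    follow = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, productions in grammar.items():
            for rhs in productions:
                updates = [(first[a], first_of(rhs, first))]
                for i, x in enumerate(rhs):
                    if x in grammar:  # only non-terminals have Follow sets
                        rest = first_of(rhs[i + 1:], first)
                        new = rest - {EPS}
                        if EPS in rest:
                            new |= follow[a]
                        updates.append((follow[x], new))
                for target, new in updates:
                    if not new <= target:
                        target |= new  # grows the stored set in place
                        changed = True
    return first, follow


def predict_sets(grammar):
    first, follow = first_and_follow(grammar)
    predict = {}
    for a, productions in grammar.items():
        for rhs in productions:
            f = first_of(rhs, first)
            extra = follow[a] if EPS in f else set()
            predict[(a, tuple(rhs))] = (f - {EPS}) | extra
    return predict


def is_ll1(grammar):
    predict = predict_sets(grammar)
    return all(
        not (predict[(a, tuple(p))] & predict[(a, tuple(q))])
        for a, productions in grammar.items()
        for p, q in combinations(productions, 2)
    )


LL1_GRAMMAR = {
    "S": [["E", "ENDM"]],
    "E": [["T", "E*"]],
    "E*": [["+", "T", "E*"], ["-", "T", "E*"], []],
    "T": [["F", "T*"]],
    "T*": [["*", "F", "T*"], ["/", "F", "T*"], []],
    "F": [["(", "E", ")"], ["number"]],
}
LEFT_RECURSIVE = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["number"]],
}
```

Here is_ll1(LL1_GRAMMAR) is True, while is_ll1(LEFT_RECURSIVE) is False because all three productions for E are predicted by ( and number.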

Creating LL(1) Grammars

Recursive-descent parsing can only parse grammars that have disjoint predict sets for productions that share a
common left hand side.

Two common properties of grammars that violate this condition are:


Left recursion: any grammar containing productions with left recursion, that is, productions of the form A
--> A X1...Xm, cannot be LL(1). The problem is that any symbol that predicts this production the first
time will, of necessity, continue to predict this production forever: the procedure for A calls itself without
consuming any input, so the symbol is never matched.

Common prefix: any grammar containing two productions for the same non-terminal that share a common
prefix on the right hand side cannot be LL(1). The problem is that any symbol that predicts the first
production must also predict the second; since the predict sets for the two productions are not disjoint, the
grammar is not LL(1).

Creating an LL(1) Grammar


Consider the following grammar for expressions:

1. <E> --> <E> + <T>
2. <E> --> <E> - <T>
3. <E> --> <T>
4. <T> --> <T> * <F>
5. <T> --> <T> / <F>
6. <T> --> <F>
7. <F> --> ( <E> )
8. <F> --> number

This grammar has left recursion, and therefore cannot be LL(1). We can replace the use of left recursion with
right recursion as follows:

1. <E> --> <T> + <E>
2. <E> --> <T> - <E>
3. <E> --> <T>
4. <T> --> <F> * <T>
5. <T> --> <F> / <T>
6. <T> --> <F>
7. <F> --> ( <E> )
8. <F> --> number

The resulting grammar is still not LL(1); productions 1-3 share a common prefix, as do productions 4-6. We can
eliminate the common prefix by deferring the decision as to which production to pick until after seeing the
common prefix. This technique is called factoring the common prefix.

1. <E> --> <T> <E*>
2. <E*> --> + <T> <E*> | - <T> <E*> | epsilon
3. <T> --> <F> <T*>
4. <T*> --> * <F> <T*> | / <F> <T*> | epsilon
5. <F> --> ( <E> ) | number
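
The same LL(1) grammar can also be reached in a single step with the standard transformation for immediate left recursion, which rewrites A --> A alpha | beta as A --> beta A* and A* --> alpha A* | epsilon. Note that this is a different route from the two-step one taken in the text (right recursion, then factoring), although for this grammar it ends at the same place. The sketch below is illustrative only: the grammar encoding and the function name are assumptions, and it handles immediate left recursion only, not recursion through other non-terminals.

```python
def eliminate_immediate_left_recursion(grammar):
    """Rewrite A --> A alpha | beta as A --> beta A*, A* --> alpha A* | epsilon."""
    result = {}
    for a, productions in grammar.items():
        # alphas: what follows the leading A in each left-recursive production
        alphas = [rhs[1:] for rhs in productions if rhs and rhs[0] == a]
        # betas: the productions that do not start with A
        betas = [rhs for rhs in productions if not rhs or rhs[0] != a]
        if not alphas:
            result[a] = [list(rhs) for rhs in productions]
            continue
        tail = a + "*"  # new non-terminal, named as in the text
        result[a] = [beta + [tail] for beta in betas]
        result[tail] = [alpha + [tail] for alpha in alphas] + [[]]  # [] is epsilon
    return result


LEFT_RECURSIVE = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["number"]],
}

LL1 = eliminate_immediate_left_recursion(LEFT_RECURSIVE)
```

Applied to the left-recursive expression grammar, the result has exactly the five groups of productions listed above, with an empty list standing for epsilon.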

Table-Driven Parsing

In recursive-descent parsing, the decision as to which production to choose for a particular non-terminal is hard-
coded into the procedure for the non-terminal. The procedure uses the Predict sets (computed from the First and
Follow sets) for the grammar to decide which production to choose based on the lookahead symbol.

The problem with recursive-descent parsing is that it is inflexible; changes in the grammar can cause significant
(and in some cases non-obvious) changes to the parser.

Since recursive-descent parsing uses an implicit stack of procedure calls, it is possible to replace the parsing
procedures and implicit stack with an explicit stack and a single parsing procedure that manipulates the stack.

In this scheme, we encode the actions the parsing procedure should take in a table. This table can be generated
automatically (with the grammar as input), which is why this approach adapts more easily to changes in the
grammar.

A Table-Driven Parser
The parse table encodes the choice of production as a function of the current non-terminal of interest and the
lookahead symbol.

    T: Non-terminals x Terminals -> Productions U {Error}

The entry T[A,x] gives the number of the production to choose when A is the non-terminal of interest and x is the
current input symbol.

    T[A,x] = A --> X1...Xm   if x is in Predict(A --> X1...Xm)
    T[A,x] = Error           otherwise

The driver procedure is very simple. It stacks symbols that are to be matched or expanded. Terminal symbols on
the stack must match an input symbol; non-terminal symbols are expanded via the Predict function (which is
encoded in the parse table).

Parse Table for Expressions

Here is an LL(1) expression grammar, augmented to include the end marker:

1. <S> --> <E> ENDM


2. <E> --> <T> <E*>
3. <E*> --> + <T> <E*>
4. <E*> --> - <T> <E*>
5. <E*> --> epsilon
6. <T> --> <F> <T*>
7. <T*> --> * <F> <T*>
8. <T*> --> / <F> <T*>
9. <T*> --> epsilon
10. <F> --> ( <E> )
11. <F> --> number

The table for this expression grammar is (where a blank entry corresponds to an error):

      (     )     +     -     *     /     number  ENDM
------------------------------------------------------
S     1                                   1
------------------------------------------------------
E     2                                   2
------------------------------------------------------
E*          5     3     4                         5
------------------------------------------------------
T     6                                   6
------------------------------------------------------
T*          9     9     9     7     8             9
------------------------------------------------------
F     10                                  11

This table is constructed from the Predict sets described earlier.
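
Filling the table from the Predict sets is mechanical: for every production number and every terminal in its Predict set, store the number under (non-terminal, terminal). The sketch below hard-codes the Predict sets for the eleven productions (derived from the First and Follow sets of this grammar); the dictionary layout and the name build_table are assumptions for illustration. A missing key plays the role of a blank (error) entry.

```python
ENDM = "ENDM"

# Production number -> (left hand side, right hand side); [] stands for epsilon.
PRODUCTIONS = {
    1: ("S", ["E", ENDM]),
    2: ("E", ["T", "E*"]),
    3: ("E*", ["+", "T", "E*"]),
    4: ("E*", ["-", "T", "E*"]),
    5: ("E*", []),
    6: ("T", ["F", "T*"]),
    7: ("T*", ["*", "F", "T*"]),
    8: ("T*", ["/", "F", "T*"]),
    9: ("T*", []),
    10: ("F", ["(", "E", ")"]),
    11: ("F", ["number"]),
}

# Predict set of each production, taken as given for this sketch.
PREDICT = {
    1: {"(", "number"},
    2: {"(", "number"},
    3: {"+"},
    4: {"-"},
    5: {")", ENDM},
    6: {"(", "number"},
    7: {"*"},
    8: {"/"},
    9: {"+", "-", ")", ENDM},
    10: {"("},
    11: {"number"},
}


def build_table(productions, predict):
    """Map (non-terminal, terminal) to a production number."""
    table = {}
    for number, (lhs, _rhs) in productions.items():
        for terminal in predict[number]:
            key = (lhs, terminal)
            if key in table:
                # Two productions predicted by the same symbol: not LL(1).
                raise ValueError(f"conflict at {key}")
            table[key] = number
    return table


TABLE = build_table(PRODUCTIONS, PREDICT)
```

The conflict check doubles as an LL(1) test: it fires exactly when two productions for the same non-terminal have overlapping Predict sets.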

Driver Procedure
Under table-driven parsing, there is a single procedure that "interprets" the parse table. This "driver" procedure
takes the following form:
procedure Parser;
    /* Push the start symbol S onto the stack */
    Push(S,stack)
    /* Initialize lookahead symbol */
    scanner(NextInputSymbol)
    while not Empty(stack) do
        top = Top(stack)
        if top is a nonterminal then
            action = ParseTable[top,NextInputSymbol]
            if action > 0 then
                /* Pop top symbol */
                Pop(stack)
                /* Push RHS of production from right to left,
                   so that its leftmost symbol ends up on top */
                for each symbol on RHS #action, rightmost first, do
                    Push(symbol,stack)
            else
                print("syntax error"); return
        else if NextInputSymbol == top then
            /* Match terminal symbol in input */
            Pop(stack)
            /* Get next terminal symbol in input */
            scanner(NextInputSymbol)
        else
            print("syntax error"); return
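
A runnable version of the driver might look like the sketch below. It hard-codes the parse table and the right hand sides of the eleven productions of the augmented expression grammar; the names RHS, TABLE, scan and parse, the tokenizer, and the choice to return the list of actions (or None on a syntax error) are all assumptions made for illustration.

```python
import re

ENDM = "ENDM"

# Right hand side of each numbered production; [] stands for epsilon.
RHS = {
    1: ["E", ENDM], 2: ["T", "E*"],
    3: ["+", "T", "E*"], 4: ["-", "T", "E*"], 5: [],
    6: ["F", "T*"],
    7: ["*", "F", "T*"], 8: ["/", "F", "T*"], 9: [],
    10: ["(", "E", ")"], 11: ["number"],
}

# Parse table: (non-terminal, lookahead) -> production number.
TABLE = {
    ("S", "("): 1, ("S", "number"): 1,
    ("E", "("): 2, ("E", "number"): 2,
    ("E*", "+"): 3, ("E*", "-"): 4, ("E*", ")"): 5, ("E*", ENDM): 5,
    ("T", "("): 6, ("T", "number"): 6,
    ("T*", "*"): 7, ("T*", "/"): 8,
    ("T*", "+"): 9, ("T*", "-"): 9, ("T*", ")"): 9, ("T*", ENDM): 9,
    ("F", "("): 10, ("F", "number"): 11,
}
NONTERMINALS = {"S", "E", "E*", "T", "T*", "F"}


def scan(text):
    """Turn text into terminal symbols; digit runs become 'number'."""
    raw = re.findall(r"\d+|\S", text)
    return ["number" if tok.isdigit() else tok for tok in raw] + [ENDM]


def parse(text):
    """Return the list of actions taken, or None on a syntax error."""
    tokens = scan(text)
    pos = 0
    stack = ["S"]  # top of the stack is the end of the list
    actions = []
    while stack:
        top = stack[-1]
        lookahead = tokens[pos]
        if top in NONTERMINALS:
            action = TABLE.get((top, lookahead))
            if action is None:
                return None  # blank table entry
            stack.pop()
            # Push the RHS reversed so its leftmost symbol is on top.
            stack.extend(reversed(RHS[action]))
            actions.append(action)
        elif top == lookahead:
            stack.pop()  # match a terminal and advance the input
            pos += 1
            actions.append("Pop")
        else:
            return None
    return actions
```

Because ENDM sits at the bottom of the stack, the loop ends exactly when the end marker has been matched, so the input index never runs past the token list.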

Example Parse

Let's trace the parse for the input 1 + (2 * 3) / 4 ENDM:


Step Stack Contents          Current input         Action
1:   S                       1 + (2 * 3) / 4 ENDM  1
2:   E ENDM                  1 + (2 * 3) / 4 ENDM  2
3:   T E* ENDM               1 + (2 * 3) / 4 ENDM  6
4:   F T* E* ENDM            1 + (2 * 3) / 4 ENDM  11
5:   N T* E* ENDM            1 + (2 * 3) / 4 ENDM  Pop
6:   T* E* ENDM              + (2 * 3) / 4 ENDM    9
7:   E* ENDM                 + (2 * 3) / 4 ENDM    3
8:   + T E* ENDM             + (2 * 3) / 4 ENDM    Pop
9:   T E* ENDM               (2 * 3) / 4 ENDM      6
10:  F T* E* ENDM            (2 * 3) / 4 ENDM      10
11:  ( E ) T* E* ENDM        (2 * 3) / 4 ENDM      Pop
12:  E ) T* E* ENDM          2 * 3) / 4 ENDM       2
13:  T E* ) T* E* ENDM       2 * 3) / 4 ENDM       6
14:  F T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       11
15:  N T* E* ) T* E* ENDM    2 * 3) / 4 ENDM       Pop
16:  T* E* ) T* E* ENDM      * 3) / 4 ENDM         7
17:  * F T* E* ) T* E* ENDM  * 3) / 4 ENDM         Pop
18:  F T* E* ) T* E* ENDM    3) / 4 ENDM           11
19:  N T* E* ) T* E* ENDM    3) / 4 ENDM           Pop
20:  T* E* ) T* E* ENDM      ) / 4 ENDM            9
21:  E* ) T* E* ENDM         ) / 4 ENDM            5
22:  ) T* E* ENDM            ) / 4 ENDM            Pop
23:  T* E* ENDM              / 4 ENDM              8
24:  / F T* E* ENDM          / 4 ENDM              Pop
25:  F T* E* ENDM            4 ENDM                11
26:  N T* E* ENDM            4 ENDM                Pop
27:  T* E* ENDM              ENDM                  9
28:  E* ENDM                 ENDM                  5
29:  ENDM                    ENDM                  Pop
30:  Done!

Here N abbreviates the terminal number, a numeric action is the number of the production used to expand the
non-terminal on top of the stack, and Pop means that the terminal on top of the stack matched the input.
