
AUTOMATA THEORY AND COMPILER DESIGN - 21CS51

Dr. Sampada K S, Associate Professor


DEPT. OF CSE | RNSIT


MODULE 3
Context Free Grammars: Definition and designing CFGs, Derivations Using a Grammar, Parse Trees,
Ambiguity and Elimination of Ambiguity, Elimination of Left Recursion, Left Factoring.

Syntax Analysis Phase of Compilers: part-1: Role of Parser, Top-Down Parsing

PART 1: CONTEXT FREE GRAMMARS


The pumping lemma showed that there are languages that are not regular. There are many classes of
languages “larger” than the class of regular languages; one of these classes is the class of “context-free”
languages. They are called context-free because we may substitute a string for a variable regardless of
the context in which the variable appears (the name also implies that context-sensitive languages exist).
CFGs are useful in many applications:
– Describing the syntax of programming languages
– Parsing
– Describing the structure of documents, e.g. XML
A CFG G may then be represented by these four components, denoted G= (V,T,P,S) where
– V is the set of variables
– T is the set of terminals
– P is the set of production rules
– S is the start symbol
CFG Examples: Language of palindromes
– The language L = { w | w = w^R } is defined by the following context-free grammar over the
alphabet {0,1}:
P → ε
P → 0
P → 1
P → 0P0
P → 1P1

FORMAL DEFINITIONS:
• There is a finite set of symbols that form the strings, i.e. there is a finite alphabet. The alphabet
symbols are called terminals.
• There is a finite set of variables, sometimes called non-terminals or syntactic categories. Each
variable represents a language (i.e. a set of strings).
– In the palindrome example, the only variable is P.

• One of the variables is the start symbol. Other variables may exist to help define the language.
• There is a finite set of productions or production rules that represent the recursive definition of
the language. Each production is defined:
– Has a single variable that is being defined to the left of the production
– Has the production symbol →
– Has a string of zero or more terminals and/or variables, called the body of the production. To
form strings, we substitute one of a variable’s bodies for the variable wherever it appears.
Sample CFG
1. E → I // Expression is an identifier
2. E → E+E // Add two expressions
3. E → E*E // Multiply two expressions
4. E → (E) // Add parentheses
5. I → L // Identifier is a letter
6. I → ID // Identifier + digit
7. I → IL // Identifier + letter
8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 // Digits
9. L → a | b | c | … | A | B | … | Z // Letters
Grammar from finite automata:
Procedure:
1. If δ(qi, a) = qj, then introduce the production qi → a qj
2. If q ∈ F (i.e. q is a final state of the FA), then introduce the production q → ε
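This procedure can be sketched in a few lines of Python. The sketch below is our own illustration (the helper name and the sample DFA, which accepts strings with an even number of a's, are not from the notes):

# Build a right-linear grammar from a finite automaton using the two rules above.
def fa_to_grammar(delta, finals):
    # delta maps (state, symbol) -> next state; finals is the set of final states
    productions = []
    for (qi, a), qj in delta.items():
        productions.append(f"{qi} -> {a}{qj}")   # rule 1: qi -> a qj
    for q in finals:
        productions.append(f"{q} -> epsilon")    # rule 2: q -> epsilon
    return productions

# DFA for an even number of a's: q0 is both the start and the final state.
delta = {("q0", "a"): "q1", ("q1", "a"): "q0"}
print(fa_to_grammar(delta, {"q0"}))
# ['q0 -> aq1', 'q1 -> aq0', 'q0 -> epsilon']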
Examples (Solutions can be found in subsequent sheets)
1. Obtain a grammar consisting of any number of a’s.
2. Obtain a grammar to generate strings consisting of at least one a.
3. Obtain a grammar consisting of any number of a’s and b’s.
4. Obtain a grammar consisting of at least two a’s.
5. Obtain a grammar consisting of an even number of a’s.
6. Obtain a grammar consisting of a multiple of 3 a’s.
7. Obtain a grammar to generate strings of a’s and b’s whose length is a multiple of 3.
8. Obtain a grammar to generate strings of any number of a’s and b’s with at least one a.
9. Obtain a grammar to generate strings of any number of a’s and b’s with at least one b.
10. Obtain a grammar to generate strings of any number of a’s and b’s with at least one b
and one a.
11. Obtain a grammar to accept the language L = {w | |w| mod 3 > 0, where w ∈ {a}*}

Grammar from regular expressions
Examples: (Solutions can be found in subsequent sheets)
1. Obtain a grammar to generate strings of a’s and b’s having the substring ab.
2. Obtain a grammar to generate strings of a’s and b’s ending with ab.
3. Obtain a grammar to generate strings of a’s and b’s starting with ab.
4. Obtain a grammar to generate the language L = {w : n_a(w) mod 2 = 0, where w ∈ {a,b}*}
Derivation
The process of obtaining strings of terminals/non-terminals from the start symbol by applying some or all
productions is called derivation.
Let A → αBγ and B → β be productions in grammar G, where α, β and γ are strings of terminals and/or non-
terminals, and A and B are non-terminals.
– If a string is obtained by applying only one production, the step is called a one-step derivation
and is denoted by ⇒
– If a string is obtained by applying one or more productions, this is denoted by ⇒+
– If a string is obtained by applying zero or more productions, this is denoted by ⇒*
For example, consider the grammar
1. E → I
2. E → E+E
3. E → E*E
4. E → (E)
5. I → L
6. I → ID
7. I → IL
8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
9. L → a | b | c | … | A | B | … | Z
The string a*(a+b1) can be derived as:
– E ⇒ E*E ⇒ I*E ⇒ L*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(L+E) ⇒ a*(a+E)
⇒ a*(a+I) ⇒ a*(a+ID) ⇒ a*(a+LD) ⇒ a*(a+bD) ⇒ a*(a+b1)
• Note that at each step of the derivation we could have chosen any one of the variables to replace,
using any one of its production bodies.

Leftmost Derivation
• In the previous example we used a derivation called a leftmost derivation. We can specifically
denote a leftmost derivation using the subscript “lm”, as in ⇒lm or ⇒*lm.
• A leftmost derivation is simply one in which we always replace the leftmost variable in the
current sentential form by one of its production bodies, working our way from left to right.

– E ⇒ E*E ⇒ I*E ⇒ L*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(L+E) ⇒ a*(a+E)
⇒ a*(a+I) ⇒ a*(a+ID) ⇒ a*(a+LD) ⇒ a*(a+bD) ⇒ a*(a+b1)

Obtain leftmost derivation for the string aaabbabbbba using the following grammar:
S → aB | bA
A → aS | bAA | a
B → bS | aBB | b

Rightmost Derivation
• Not surprisingly, we also have a rightmost derivation, which we specifically denote via
⇒rm or ⇒*rm.
• A rightmost derivation is one in which we always replace the rightmost variable by one of its
production bodies, working our way from right to left.
Rightmost Derivation Example
• a*(a+b1) was already derived above using a leftmost derivation.
• The rightmost derivation makes the replacements in a different order:
– E ⇒rm E*E ⇒rm E*(E) ⇒rm E*(E+E) ⇒rm E*(E+I) ⇒rm E*(E+ID) ⇒rm E*(E+I1) ⇒rm
E*(E+L1) ⇒rm E*(E+b1) ⇒rm E*(I+b1) ⇒rm E*(L+b1) ⇒rm E*(a+b1) ⇒rm I*(a+b1) ⇒rm
L*(a+b1) ⇒rm a*(a+b1)
• Any derivation has an equivalent leftmost and rightmost derivation. That is, A ⇒* α iff A ⇒*lm
α and A ⇒*rm α.

Language of a Context Free Grammar


The language represented by a CFG G = (V, T, P, S), denoted L(G), is a Context Free
Language (CFL) and consists of the terminal strings that have derivations from the start symbol:
L(G) = { w ∈ T* | S ⇒*G w }
• Note that the CFL L(G) consists solely of terminals from G.
Examples
Obtain grammar to generate the following language
1. L = {a^n b^n | n >= 0}
2. L = {a^n b^n | n >= 1}
3. L = {a^(n+1) b^n | n >= 0}
4. L = {a^n b^(n+1) | n >= 0}
5. L = {a^n b^(n+2) | n >= 0}
6. L = {a^n b^(2n) | n >= 0}
7. L = {w w^R, where w ∈ {a,b}*}
8. L = {0^n 1^m 2^n | m >= 1 and n >= 0}
9. L = {w : n_a(w) = n_b(w)}
10. L = {0^i b^j | i ≠ j, i >= 0 and j >= 0}
11. L = {a^(n+2) b^m | m > n}
12. Obtain a grammar to generate the language L = L1.L2 where
L1 = {a^n b^m | n >= 0; m > n}
L2 = {0^n 1^(2n) | n >= 0}

Parse trees
– A parse tree is a top-down representation of a derivation. It is a good way to visualize the
derivation process.
• Let G = (V, T, P, S) be a CFG. A parse tree (derivation tree) has the following properties:
– The root has label S
– Every vertex has a label from V ∪ T ∪ {ε}
– Every leaf has a label from T ∪ {ε}, and every interior vertex has a label from V
• If a vertex is labelled A and X1, X2, X3, …, Xn are the children of A from the left, then
A → X1X2X3…Xn must be a production in P
• Sample parse tree for the palindrome CFG for 1110111:
P → ε | 0 | 1 | 0P0 | 1P1

P
├─ 1
├─ P
│  ├─ 1
│  ├─ P
│  │  ├─ 1
│  │  ├─ P
│  │  │  └─ 0
│  │  └─ 1
│  └─ 1
└─ 1

(read top-down: P ⇒ 1P1 ⇒ 11P11 ⇒ 111P111 ⇒ 1110111)

Using a leftmost derivation generate the parse tree for a*(a+b1)

E
├─ E
│  └─ I
│     └─ L
│        └─ a
├─ *
└─ E
   ├─ (
   ├─ E
   │  ├─ E
   │  │  └─ I
   │  │     └─ L
   │  │        └─ a
   │  ├─ +
   │  └─ E
   │     └─ I
   │        ├─ I
   │        │  └─ L
   │        │     └─ b
   │        └─ D
   │           └─ 1
   └─ )

• The yield of the parse tree is the string that results when we concatenate the leaves from left to
right (e.g., doing a leftmost depth first search).
– The yield is always a string that is derived from the root and is guaranteed to be a string in
the language L.

Ambiguities in grammars and languages


Ambiguous Grammars
• A CFG is ambiguous if one or more terminal strings have multiple leftmost derivations from the
start symbol.
– Equivalently: multiple rightmost derivations, or multiple parse trees.
• Examples
– E → E+E | E*E
– E+E*E can be parsed as
• E ⇒ E+E ⇒ E+E*E

• E ⇒ E*E ⇒ E+E*E


Examples
Is the following grammar ambiguous?
– S → AS | ε
– A → A1 | 0A1 | 01
Inherent Ambiguity
A CFL L is said to be inherently ambiguous if all of its grammars are ambiguous.
Example:
Consider the grammar for the string aabbccdd:
S → AB | C
A → aAb | ab
B → cBd | cd
C → aCd | aDd
D → bDc | bc

• Parse tree for string aabbccdd

Removing Ambiguity
• No algorithm can tell us if an arbitrary CFG is ambiguous in the first place

– Halting / Post Correspondence Problem
• Why care?
– Ambiguity can be a problem in things like programming languages where we want
agreement between the programmer and compiler over what happens
• Solutions
– Apply precedence
– e.g. instead of: E → E+E | E*E
– use: E → T | E + T, T → F | T * F
• These rules force + to appear higher in the parse tree than *, which means we
multiply before adding.
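For example, completing this grammar with F → id (our assumption, for illustration), the string id+id*id has exactly one leftmost derivation, with + forced to the top of the parse tree:
E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + T * F ⇒ id + F * F ⇒ id + id * F ⇒ id + id * id
Because * can only be introduced below a T, the multiplication is grouped beneath the addition, matching the intended precedence.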


PART 2: SYNTAX ANALYSIS PHASE OF COMPILERS: PART-1


ROLE OF THE PARSER
The parser obtains a string of tokens from the lexical analyzer and verifies that it can be generated
by the grammar for the source program. The parser should report any syntax errors in an intelligible
fashion. The two types of parsers commonly employed are:
1. Top-down parsers, which build parse trees from the top (root) to the bottom (leaves).

2. Bottom-up parsers, which build parse trees from the leaves and work up to the root.

Accordingly, there are two parsing methods: top-down parsing and bottom-up parsing. More generally,
there are three types of parsers for grammars: universal, top-down and bottom-up.

Syntax error handling:


Types or Sources of Error – There are three types of error: logic, run-time and compile-time error:

Logic errors occur when programs operate incorrectly but do not terminate abnormally (or crash).
Unexpected or undesired outputs or other behavior may result from a logic error, even if it is not
immediately recognized as such. Logic errors occur when executed code does not produce the expected
result; they are best handled by meticulous program debugging.
A run-time error is an error that takes place during the execution of a program and usually happens
because of adverse system parameters or invalid input data. The lack of sufficient memory to run an
application or a memory conflict with another program are examples of this.
Compile-time errors arise at compile time, before the execution of the program. A syntax error or a
missing file reference that prevents the program from compiling successfully is an example of this.

Classification of Compile-time error –

 Lexical: misspellings of identifiers, keywords or operators
 Syntactical: a missing semicolon or unbalanced parentheses
 Semantical: an incompatible value assignment or a type mismatch between operator and operand
 Logical: unreachable code, an infinite loop.

Finding or reporting an error – The viable-prefix property of a parser allows early
detection of syntax errors.

Goal: detect an error as soon as possible without consuming further unnecessary input.
How: detect an error as soon as the prefix of the input does not match a prefix of any string in the
language.
Example: for(;) is reported as an error, since a for statement requires two semicolons inside the
parentheses.

Error Recovery –
The minimal requirement for a compiler is to simply stop, issue a message, and cease compilation.
There are some common recovery methods, as follows.
1. Panic mode recovery:
This is the easiest way of error recovery, and it prevents the parser from falling into an infinite
loop while recovering from an error. The parser discards input symbols one at a time until one of a
designated set of synchronizing tokens (typically statement or expression terminators such as end or
the semicolon) is found. This is adequate when the presence of multiple errors in the same
statement is rare. Example: consider the erroneous expression (1 + + 2) + 3. Panic-mode recovery:
skip ahead to the next integer and then continue. Bison uses the special terminal error to describe
how much input to skip.
E → int | E + E | ( E ) | error int | ( error )

2. Phase level recovery :


When an error is discovered, the parser performs local correction on the remaining input. If a parser
encounters an error, it makes the necessary corrections on the remaining input so that it can
continue to parse the rest of the statement. One can correct an error by deleting an extra semicolon,
replacing a comma with a semicolon, or reintroducing a missing semicolon. Utmost care must be taken
to avoid going into an infinite loop during correction. Whenever an erroneous prefix is found in the
remaining input, it is replaced with some string, and in this way the parser can continue its
execution.

3. Error productions :
The error production method can be used if the user is aware of common mistakes encountered in the
grammar, by augmenting the grammar with productions that generate the erroneous constructs. When such
a production is used, an error message can be generated during the parsing process, and parsing can
continue. Example: writing 5x instead of 5*x.

4. Global correction :
In order to recover from erroneous input, the parser analyzes the whole program and tries to find the
closest error-free match for it. The closest match is one that requires as few insertions,
deletions, and changes of tokens as possible. This method is not practical due to its high time and space
complexity.

CONTEXT-FREE GRAMMARS:
A context-free grammar has four components:

A set of non-terminals (V). Non-terminals are syntactic variables that denote sets of strings. The
non-terminals define sets of strings that help define the language generated by the grammar.

A set of tokens, known as terminal symbols (Σ). Terminals are the basic symbols from which
strings are formed.

A set of productions (P). The productions of a grammar specify the manner in which the terminals
and non-terminals can be combined to form strings. Each production consists of a non-terminal
called the left side of the production, an arrow, and a sequence of tokens and/or non-terminals, called
the right side of the production.

One of the non-terminals is designated as the start symbol (S); from where the production begins.

The strings are derived from the start symbol by repeatedly replacing a non-terminal (initially the
start symbol) by the right side of a production, for that non-terminal.

Example
We take the palindrome language, which cannot be described by means of a regular
expression. That is, L = { w | w = w^R } is not a regular language. But it can be described by means
of a CFG, as illustrated below:
G = ( V, Σ, P, S )
Where:
V = { Q, Z, N }
Σ = { 0, 1 }

P = { Q → Z | N | ε | 0 | 1, Z → 0Q0, N → 1Q1 }
S = Q
This grammar describes the palindrome language, containing strings such as: 1001, 11100111, 00100,
1010101, 11111, etc.
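As a quick check, 1001 is derived as Q ⇒ N ⇒ 1Q1 ⇒ 1Z1 ⇒ 10Q01 ⇒ 1001 (using Q → ε in the last step), and the odd-length palindrome 11111 as Q ⇒ N ⇒ 1Q1 ⇒ 1N1 ⇒ 11Q11 ⇒ 11111 (using Q → 1).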

Derivation
A derivation is basically a sequence of production rules, in order to get the input string. During parsing,
we take two decisions for some sentential form of input:

 Deciding the non-terminal which is to be replaced.


 Deciding the production rule, by which, the non-terminal will be replaced.
To decide which non-terminal to be replaced with production rule, we can have two options.

Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-most
derivation. The sentential form derived by the left-most derivation is called the left-sentential form.

Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-most
derivation. The sentential form derived from the right-most derivation is called the right-sentential form.

Example: Production rules:


E→E+E
E→E*E
E → id
Input string: id + id * id

The left-most derivation is:

E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Notice that the left-most non-terminal is always processed first.

The right-most derivation is:

E ⇒ E + E
⇒ E + E * E
⇒ E + E * id
⇒ E + id * id
⇒ id + id * id


Parse Tree
A parse tree is a graphical depiction of a derivation. It is convenient to see how strings are derived from
the start symbol. The start symbol of the derivation becomes the root of the parse tree. Let us see this by
an example from the last topic.

The left-most derivation is:

E ⇒ E * E
⇒ E + E * E
⇒ id + E * E
⇒ id + id * E
⇒ id + id * id
Step 1:

E ⇒ E * E

Step 2:

⇒ E + E * E

Step 3:

⇒ id + E * E


Step 4:

⇒ id + id * E

Step 5:

⇒ id + id * id

In a parse tree:
 All leaf nodes are terminals.
 All interior nodes are non-terminals.
 In-order traversal gives original input string.
A parse tree depicts associativity and precedence of operators. The deepest sub-tree is traversed first,
therefore the operator in that sub-tree gets precedence over the operator which is in the parent nodes.

Ambiguity
A grammar G is said to be ambiguous if it has more than one parse tree (equivalently, more than one
leftmost or rightmost derivation) for at least one string.

Example
E→E+E

E→E–E
E → id
For the string id + id – id, the above grammar generates two parse trees:

Note that ambiguity is a property of grammars; a language is inherently ambiguous only if every

grammar for it is ambiguous. Ambiguity in a grammar is not good for compiler construction. No method
can detect and remove ambiguity automatically, but it can be removed either by re-writing the whole
grammar without ambiguity, or by setting and following associativity and precedence constraints.
Associativity
If an operand has operators on both sides, the side on which the operator takes this operand is decided by
the associativity of those operators. If the operation is left-associative, the operand is taken by the
left operator; if the operation is right-associative, the right operator takes the operand.

Example

Operations such as Addition, Multiplication, Subtraction, and Division are left associative. If the
expression contains:
id op id op id
it will be evaluated as:
(id op id) op id
For example, (id + id) + id

Operations like Exponentiation are right associative, i.e., the order of evaluation in the same expression
will be:
id op (id op id)
For example, id ^ (id ^ id)
Precedence
If two different operators share a common operand, the precedence of operators decides which will take
the operand. That is, 2+3*4 can have two different parse trees, one corresponding to (2+3)*4 and another

corresponding to 2+(3*4). By setting precedence among operators, this problem can be easily removed.
As in the previous example, mathematically * (multiplication) has precedence over + (addition), so the
expression 2+3*4 will always be interpreted as:
2 + (3 * 4)
These methods decrease the chances of ambiguity in a language or its grammar.
Consider VN = {S, E}, VT = {if, then, else} and a grammar G = (VT, VN, S, P) such that
S → if E then S | if E then S else S
are the productions of G. Then the string
w = if E then if E then S else S
has two parse trees, as shown in Figure 5.

Two parse trees for the same if-then-else statement.


In all programming languages with if-then-else statements of this form, the second parse tree is preferred.
Hence the general rule is: match each else with the closest previous unmatched then. This disambiguating
rule can be incorporated directly into a grammar by using the following observations.
 A statement appearing between a then and a else must be matched. (Otherwise there will be an
ambiguity.)
 Thus statements must split into kinds: matched and unmatched.
 A matched statement is
o either an if-then-else statement containing no unmatched statements
o or any statement which is not an if-then-else statement and not an if-then statement.
 Then an unmatched statement is
o an if-then statement (with no else-part)
o an if-then-else statement where unmatched statements are allowed in the else-part (but not
in the then-part).
stmt → matched-stmt | unmatched-stmt

matched-stmt → if expr then matched-stmt else matched-stmt

matched-stmt → non-alternative-stmt

unmatched-stmt → if expr then stmt

unmatched-stmt → if expr then matched-stmt else unmatched-stmt

TOP-DOWN PARSING
A program that performs syntax analysis is called a parser. A syntax analyzer takes tokens as input
and outputs error messages if the program syntax is wrong. The parser uses symbol look-ahead and
an approach called top-down parsing without backtracking. Top-down parsers check whether a
string can be generated by a grammar by creating a parse tree starting from the initial symbol and
working down. Bottom-up parsers, however, check whether a string can be generated from a
grammar by creating a parse tree from the leaves and working up. Early parser generators such as
YACC create bottom-up parsers, whereas many Java parser generators such as JavaCC create
top-down parsers.
Example of top-down parser:
Consider the grammar


4.3.1. RECURSIVE DESCENT PARSING


Typically, top-down parsers are implemented as a set of recursive functions that descend through a
parse tree for a string. This approach is known as recursive descent parsing, also known as LL(k)
parsing, where the first L stands for left-to-right, the second L stands for leftmost derivation, and k
indicates k-symbol lookahead. Therefore, a parser using the single-symbol look-ahead method and
top-down parsing without backtracking is called an LL(1) parser. In the following sections, we will
also use an extended BNF notation in which some regular expression operators are
incorporated.
A syntax expression defines sentences of the form α, or α | β. A syntax of the form αβ defines
sentences that consist of a sentence of the form α followed by a sentence of the form β. A syntax of
the form [α] defines zero or one occurrence of the form α.

A syntax of the form {α} defines zero or more occurrences of the form α.


A usual implementation of an LL(1) parser is:
o initialize its data structures,
o get the lookahead token by calling scanner routines, and
o call the routine that implements the start symbol.
Here is an example.
proc syntaxAnalysis()

begin
initialize(); // initialize global data and structures
nextToken(); // get the lookahead token
program(); // parser routine that implements the start symbol
end;

The above algorithm is non-deterministic. General recursive descent may require backtracking.

Back-tracking
Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For the input string read, a top-down parser will behave like this:

It will start with S from the production rules and will match its yield to the left-most letter of the input, i.e.
‘r’. The first production of S (S → rXd) matches it. So the top-down parser advances to the next
input letter (i.e. ‘e’). The parser tries to expand non-terminal ‘X’ and checks its first production from the
left (X → oa). It does not match the next input symbol, so the top-down parser backtracks to obtain the
next production rule of X, (X → ea).

Now the parser matches all the input letters in an ordered manner. The string is accepted.
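A small backtracking parser for this grammar can be sketched in Python. The representation and function name are our own; the point is that each alternative is tried left to right, and a failed alternative simply yields no result, which models the backtracking step described above:

GRAMMAR = {
    "S": [["r", "X", "d"], ["r", "Z", "d"]],
    "X": [["o", "a"], ["e", "a"]],
    "Z": [["a", "i"]],
}

def parse(symbol, s, i):
    # Return the input positions reachable after deriving `symbol` from s[i:].
    if symbol not in GRAMMAR:                    # terminal: must match s[i]
        return [i + 1] if s[i:i + 1] == symbol else []
    results = []
    for alt in GRAMMAR[symbol]:                  # try alternatives left to right
        positions = [i]
        for sym in alt:
            positions = [j for p in positions for j in parse(sym, s, p)]
        results.extend(positions)                # empty list = backtrack
    return results

print(len("read") in parse("S", "read", 0))      # True: S => rXd => read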


Example − Write down the algorithm using Recursive procedures to implement the following
Grammar.
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
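A possible implementation in Python, one recursive procedure per non-terminal (a sketch under our own naming; the original solution sheet is a scanned page and is not reproduced here):

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # input terminated by the endmarker
        self.pos = 0

    def look(self):
        return self.tokens[self.pos]

    def match(self, t):                # consume the expected terminal
        if self.look() == t:
            self.pos += 1
        else:
            raise SyntaxError(f"expected {t}, got {self.look()}")

    def E(self):                       # E -> T E'
        self.T(); self.Eprime()

    def Eprime(self):                  # E' -> + T E' | epsilon
        if self.look() == "+":
            self.match("+"); self.T(); self.Eprime()

    def T(self):                       # T -> F T'
        self.F(); self.Tprime()

    def Tprime(self):                  # T' -> * F T' | epsilon
        if self.look() == "*":
            self.match("*"); self.F(); self.Tprime()

    def F(self):                       # F -> ( E ) | id
        if self.look() == "(":
            self.match("("); self.E(); self.match(")")
        else:
            self.match("id")

p = Parser(["id", "+", "id", "*", "id"])
p.E(); p.match("$")                    # accepts id + id * id

Because each procedure decides what to do from the single lookahead token, no backtracking is needed: this is exactly the LL(1) situation described above.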


4.3.2. Left Recursion


A grammar becomes left-recursive if it has any non-terminal ‘A’ whose derivation contains ‘A’ itself as
the left-most symbol. Left-recursive grammar is considered to be a problematic situation for top-down
parsers. Top-down parsers start parsing from the Start symbol, which in itself is non-terminal. So, when
the parser encounters the same non-terminal in its derivation, it becomes hard for it to judge when to stop
parsing the left non-terminal and it goes into an infinite loop.

Example:
(1) A => Aα | β
(2) S => Aα | β
A => Sd
(1) is an example of immediate left recursion, where A is a non-terminal symbol and α, β represent
strings of terminals and/or non-terminals.
(2) is an example of indirect-left recursion.

A top-down parser will first parse the A, which in-turn will yield a string consisting of A itself and the
parser may go into a loop forever.

Removal of Left Recursion
One way to remove left recursion is to use the following technique:
The production
A => Aα | β
is converted into following productions
A => βA'
A'=> αA' | ε
This does not impact the strings derived from the grammar, but it removes immediate left recursion.

Second method is to use the following algorithm, which should eliminate all direct and indirect left
recursions.
START
Arrange the non-terminals in some order A1, A2, A3, …, An

for each i from 1 to n
{
  for each j from 1 to i-1
  {
    replace each production of the form Ai ⟹ Ajγ
    with Ai ⟹ δ1γ | δ2γ | δ3γ | … | δnγ,
    where Aj ⟹ δ1 | δ2 | … | δn are the current Aj-productions
  }
  eliminate the immediate left recursion among the Ai-productions
}
END

Example

The production set


S => Aα | β
A => Sd
after applying the above algorithm, should become
S => Aα | β
A => Aαd | βd
and then, remove immediate left recursion using the first technique.
A => βdA'
A' => αdA' | ε
Now none of the productions has either direct or indirect left recursion.
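The first technique is mechanical enough to code directly. Below is a hedged Python sketch, using a deliberately simple string representation in which each non-terminal is a single character (our own choice, for illustration only):

def remove_immediate_left_recursion(A, alternatives):
    # Split A's alternatives into left-recursive bodies (A alpha) and the rest (beta).
    recursive = [alt[len(A):] for alt in alternatives if alt.startswith(A)]
    rest = [alt for alt in alternatives if not alt.startswith(A)]
    if not recursive:
        return {A: alternatives}
    Aprime = A + "'"
    return {
        A: [beta + Aprime for beta in rest],              # A  -> beta A'
        Aprime: [alpha + Aprime for alpha in recursive]   # A' -> alpha A' | epsilon
                + ["epsilon"],
    }

print(remove_immediate_left_recursion("A", ["Aad", "bd"]))
# {'A': ["bdA'"], "A'": ["adA'", 'epsilon']}

The printed result is exactly the grammar obtained by hand above: A => βdA', A' => αdA' | ε.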

4.3.3. Left Factoring
If two or more production rules of a non-terminal have a common prefix string, then the top-down parser
cannot decide which of the productions it should take to parse the string in hand.

Example

If a top-down parser encounters a production like

A ⟹ αβ | αγ | …
then it cannot determine which production to follow to parse the string, as both productions start
with the same terminal (or non-terminal). To remove this confusion, we use a technique called left
factoring.

Left factoring transforms the grammar to make it useful for top-down parsers. In this technique, we make
one production for each common prefix, and the rest of the derivation is added by new productions.

Example

The above productions can be written as

A ⟹ αA'
A' ⟹ β | γ | …
Now the parser has only one production per prefix, which makes it easier to take decisions.
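As a concrete instance, the dangling-if grammar discussed earlier in this module,
S → if E then S else S | if E then S
has the common prefix "if E then S" and left-factors to
S → if E then S S'
S' → else S | ε
so the parser can commit to the common prefix and defer the else decision until one more token is seen.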

4.3.4. FIRST AND FOLLOW
To compute FIRST(X) for all grammar symbols X, apply the following rules until no more
terminals or ε can be added to any FIRST set.
1. If X is a terminal, then FIRST(X) is {X}.
2. If X → ε is a production, then add ε to FIRST(X).
3. If X is a nonterminal and X → Y1Y2…Yk is a production, then place a in FIRST(X)
if, for some i, a is in FIRST(Yi) and ε is in all of FIRST(Y1), …, FIRST(Yi-1); that is, Y1…Yi-1 ⇒* ε.
If ε is in FIRST(Yj) for all j = 1, 2, …, k, then add ε to FIRST(X).

For example, everything in FIRST(Y1) is surely in FIRST(X). If Y1 does not derive ε, then we
add nothing more to FIRST(X), but if Y1 ⇒* ε, then we add FIRST(Y2), and so on.

To compute FOLLOW(A) for all nonterminals A, apply the following rules until nothing
can be added to any FOLLOW set.

1. Place $ in FOLLOW(S), where S is the start symbol and $ is the input right endmarker.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything
in FOLLOW(A) is in FOLLOW(B).
Consider the following example to understand the concept of First and Follow.
Find the first and follow of all nonterminals in the Grammar-
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

        E         E'       T         T'        F
FIRST   {(, id}   {+, ε}   {(, id}   {*, ε}    {(, id}
FOLLOW  {), $}    {), $}   {+, ), $} {+, ), $} {+, *, ), $}

For example, id and the left parenthesis are added to FIRST(F) by rule 3 in the definition of FIRST with
i = 1 in each case, since FIRST(id) = {id} and FIRST('(') = {(} by rule 1. Then by rule 3 with i = 1, the
production T → FT' implies that id and the left parenthesis belong to FIRST(T) also.
To compute FOLLOW, we put $ in FOLLOW(E) by rule 1 for FOLLOW. By rule 2 applied to the
production F → (E), the right parenthesis is also in FOLLOW(E). By rule 3 applied to the
production E → TE', $ and the right parenthesis are in FOLLOW(E').

Calculate the FIRST and FOLLOW functions for the given grammar:
S → (L) | a
L → SL'
L' → ,SL' | ε
The FIRST and FOLLOW functions are as follows:

        S            L        L'
FIRST   {(, a}       {(, a}   {',', ε}
FOLLOW  {$, ',', )}  {)}      {)}
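The FIRST computation can be coded as a straightforward fixed-point iteration. Below is a sketch in Python for the expression grammar above; "eps" stands for ε, and the grammar representation is our own choice:

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], ["eps"]],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], ["eps"]],
    "F":  [["(", "E", ")"], ["id"]],
}

FIRST = {nt: set() for nt in GRAMMAR}

def first_of(symbol):
    if symbol == "eps":
        return {"eps"}
    if symbol not in GRAMMAR:           # terminal
        return {symbol}
    return FIRST[symbol]

changed = True
while changed:                          # repeat until no FIRST set grows
    changed = False
    for nt, alts in GRAMMAR.items():
        for alt in alts:
            before = len(FIRST[nt])
            for sym in alt:             # rule 3: scan Y1 Y2 ... Yk
                f = first_of(sym)
                FIRST[nt] |= f - {"eps"}
                if "eps" not in f:
                    break
            else:                       # every Yi can derive eps
                FIRST[nt].add("eps")
            changed |= len(FIRST[nt]) != before

print(sorted(FIRST["E"]), sorted(FIRST["E'"]))   # ['(', 'id'] ['+', 'eps']

FOLLOW is computed the same way, iterating the three FOLLOW rules until no set changes.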

4.3.5. LL(1) GRAMMAR


A context-free grammar G = (VT, VN, S, P) whose parsing table has no multiple entries is said to
be LL(1). In the name LL(1),
 the first L stands for scanning the input from left to right,
 the second L stands for producing a leftmost derivation,
 and the 1 stands for using one input symbol of lookahead at each step to make parsing action
decisions.
A language is said to be LL(1) if it can be generated by an LL(1) grammar. It can be shown that LL(1)
grammars are not ambiguous and not left-recursive.
Moreover, we have the following theorem to characterize LL(1) grammars and show their importance
in practice.
A context-free grammar G = (VT, VN, S, P) is LL(1) if and only if for every pair of distinct
productions A → α and A → β we have:
1. For no terminal a do both α and β derive strings beginning with a, i.e. FIRST(α) ∩ FIRST(β) = Φ.
2. At most one of α and β can derive the empty string.
3. If β ⇒* ε, then α does not derive any string beginning with a terminal in FOLLOW(A).
Likewise, if α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A); i.e.
if α ⇒* ε then FIRST(β) ∩ FOLLOW(A) = Φ.
EXAMPLE:

CONSTRUCTION OF PREDICTIVE PARSING TABLES
For any grammar G, the following algorithm can be used to construct the predictive parsing table.
The algorithm is
Input: Grammar G

Output: Parsing table M

Method:


1. For each production A → α of the grammar, do steps 2 and 3.

2. For each terminal a in FIRST(α), add A → α to M[A, a].

3. If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in

FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].

4. Make each undefined entry of M an error.


The above algorithm can be applied to any grammar G to produce a parsing table M. For some
grammars, however, for example if G is left-recursive or ambiguous, M will have at least one
multiply-defined entry. A grammar whose parsing table has no multiply-defined entries is said to
be LL(1). It can be shown that the above algorithm produces, for every LL(1) grammar
G, a parsing table M that parses all and only the sentences of G. LL(1) grammars have several
distinctive properties. No ambiguous or left-recursive grammar can be LL(1) (so eliminate all left
recursion and apply left factoring first).
Example 4.32: For the grammar given below, the algorithm produces the parsing table shown below. Blanks
are error entries; non-blanks indicate a production with which to expand a nonterminal.
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

        E         E'       T         T'        F
FIRST   {(, id}   {+, ε}   {(, id}   {*, ε}    {(, id}
FOLLOW  {), $}    {), $}   {+, ), $} {+, ), $} {+, *, ), $}

                              INPUT SYMBOLS
Non-Terminal  id       +          *          (        )        $
E             E→TE'                          E→TE'
E'                     E'→+TE'                        E'→ε     E'→ε
T             T→FT'                          T→FT'
T'                     T'→ε       T'→*FT'             T'→ε     T'→ε
F             F→id                           F→(E)
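Given FIRST and FOLLOW, filling the table is a direct transcription of steps 2 and 3. A Python sketch follows (the sets below are copied by hand from the tables above; "eps" stands for ε):

FOLLOW = {"E": {")", "$"}, "E'": {")", "$"},
          "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
          "F": {"+", "*", ")", "$"}}

# (head, body, FIRST of the body)
PRODUCTIONS = [
    ("E",  "TE'",  {"(", "id"}),
    ("E'", "+TE'", {"+"}),      ("E'", "eps", {"eps"}),
    ("T",  "FT'",  {"(", "id"}),
    ("T'", "*FT'", {"*"}),      ("T'", "eps", {"eps"}),
    ("F",  "(E)",  {"("}),      ("F",  "id",  {"id"}),
]

M = {}
for A, body, first in PRODUCTIONS:
    for a in first - {"eps"}:
        M[A, a] = body            # step 2: a is in FIRST(alpha)
    if "eps" in first:
        for b in FOLLOW[A]:       # step 3: eps is in FIRST(alpha)
            M[A, b] = body

print(M["E", "id"], M["E'", ")"], M["T'", "+"])   # TE' eps eps

Every (non-terminal, terminal) pair left out of M is an error entry (step 4); since no pair receives two different bodies, the grammar is LL(1).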

The main difficulty in using predictive parsing is in writing a grammar for the source
language such that a predictive parser can be constructed from the grammar. Although left-recursion
elimination and left factoring are easy to do, they make the resulting grammar hard to read
and difficult to use for translation purposes.
The following grammar, which abstracts the dangling-else problem, is repeated here from Example 4.22:

The parsing table for this grammar appears in Fig. 4.18.


The entry M[S', e] contains both S' → eS and S' → ε.
The grammar is ambiguous and the ambiguity is manifested by a choice in what production to use when
an e (else) is seen.

We can resolve this ambiguity by choosing S' → eS. This choice corresponds to associating each else with
the closest previous then.
Problems with top down parser
The various problems associated with top down parser are:
Ambiguity in the grammar, Left recursion, Non-left factored grammar, Backtracking

Ambiguity in the grammar: A grammar having two or more leftmost derivations (or two or more rightmost
derivations) for some string is called an ambiguous grammar. For example, the following grammar is ambiguous:
E → E + E | E – E | E * E | E / E | ( E ) | id
The ambiguous grammar is not suitable for top-down parser. So, ambiguity has to be eliminated from the
grammar.
Left recursion: A grammar G is said to be left-recursive if it has a non-terminal A such that there is a
derivation of the form:
A ⇒+ Aα
where α is a string of terminals and non-terminals. That is, whenever the first symbol of a partial derivation
is the same as the symbol from which the partial derivation is obtained, the grammar is said to be left-
recursive. For example,
consider the following grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
The above grammar is unambiguous but, it is having left recursion and hence, it is not suitable for top
down parser. So, left recursion has to be eliminated
Non-left factored grammar: If A-production has two or more alternate productions and they have a
common prefix, then the parser has some confusion in selecting the appropriate production for expanding
the non-terminal A. For example, consider the following grammar that recognizes the if-statement:
S → if E then S else S | if E then S
Observe the following points:

 On seeing the token “if” from the lexical analyzer, we cannot tell whether to use the first
production or to use the second production to expand the non-terminal S.
 This is because both productions begin with a common prefix. That is, left
factoring is a must for parsing using a top-down parser.
A grammar in which no two productions of any non-terminal A have a common prefix of
symbols on the right-hand side of the A-productions is called a left-factored grammar.
Backtracking: The backtracking is necessary for top down parser for following reasons:

1) During parsing, the productions are applied one by one. But, if two or more alternative productions are
there, they are applied in order from left to right one at a time.
2) When a particular production fails to expand the non-terminal properly, we have to apply the
alternate production. Before trying the alternate production, it is necessary to undo the activities done
using the current production. This is possible only using backtracking.
Even though backtracking parsers are more powerful than predictive parsers, they are also much slower,
requiring exponential time in general and therefore, backtracking parsers are not suitable for practical
compilers.

Nonrecursive Predictive Parsing


A nonrecursive predictive parser can be built by maintaining a stack explicitly, rather than implicitly via
recursive calls. The parser mimics a leftmost derivation. If w is the input that has been matched so far,
then the stack holds a sequence of grammar symbols α such that S ⇒*lm wα.

The table-driven parser in Fig. 4.19 has an input buffer, a stack containing a sequence of grammar
symbols, a parsing table constructed by Algorithm 4.31, and an output stream. The input buffer contains
the string to be parsed, followed by the endmarker $. We reuse the symbol $ to mark the bottom of the
stack, which initially contains the start symbol of the grammar on top of $.
The parser is controlled by a program that considers X, the symbol on top of the stack, and a, the current
input symbol. If X is a nonterminal, the parser chooses an X-production by consulting entry M[X, a] of the
parsing table M. (Additional code could be executed here, for example, code to construct a node in a parse
tree.) Otherwise, it checks for a match between the terminal X and current input symbol a.
The behavior of the parser can be described in terms of its configurations, which give the stack contents
and the remaining input. The next algorithm describes how configurations are manipulated.

METHOD : Initially, the parser is in a configuration with w$ in the input buffer and the start symbol S of
G on top of the stack, above $. The program in Fig. 4.20 uses the predictive parsing table M to produce a
predictive parse for the input.
Algorithm 4.34: Table-driven predictive parsing.
INPUT: A string w and a parsing table M for grammar G.
OUTPUT: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
1. set ip to point to the first symbol of w;
2. set X to the top stack symbol;
3. while ( X ≠ $ )
{ /* stack is not empty */
   let a be the symbol pointed to by ip;
   3.1 if ( X = a ) pop the stack and advance ip;
   3.2 else if ( X is a terminal ) error();
   3.3 else if ( M[X, a] is an error entry ) error();
   3.4 else if ( M[X, a] = X → Y1Y2…Yk )
   {
       output the production X → Y1Y2…Yk;
       pop the stack;
       push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
   }
   set X to the top stack symbol;
}

Example 4.35 : Consider grammar


E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

        E         E'       T         T'        F
FIRST   {(, id}   {+, ε}   {(, id}   {*, ε}    {(, id}
FOLLOW  {), $}    {), $}   {+, ), $} {+, ), $} {+, *, ), $}

                              INPUT SYMBOLS
Non-Terminal  id       +          *          (        )        $
E             E→TE'                          E→TE'
E'                     E'→+TE'                        E'→ε     E'→ε
T             T→FT'                          T→FT'
T'                     T'→ε       T'→*FT'             T'→ε     T'→ε
F             F→id                           F→(E)

On input id + id * id,
the nonrecursive predictive parser of Algorithm 4.34 makes the sequence of moves in Fig. 4.21. These
moves correspond to a leftmost derivation (see Fig. 4.12 for the full derivation):

Note that the sentential forms in this derivation correspond to the input that has already been matched (in
the MATCHED column) followed by the stack contents. The matched input is shown only to highlight
the correspondence. For the same reason, the top of the stack is shown to the left; when we consider bottom-up
parsing, it will be more natural to show the top of the stack to the right. The input pointer points to the
leftmost symbol of the string in the INPUT column.
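The driver loop of Algorithm 4.34 is short enough to sketch in Python against the table of Example 4.35. This is an illustrative sketch; the table is re-entered here by hand, and empty bodies represent the ε-productions:

M = {("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
     ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
     ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
     ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
     ("T'", ")"): [], ("T'", "$"): [],
     ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"]}

NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def parse(tokens):
    stack = ["$", "E"]               # start symbol on top of $
    tokens = tokens + ["$"]
    i = 0
    while stack[-1] != "$":
        X, a = stack[-1], tokens[i]
        if X == a:                   # terminal on top: match and advance ip
            stack.pop(); i += 1
        elif X not in NONTERMINALS or (X, a) not in M:
            raise SyntaxError(f"error at {a}")
        else:                        # expand: pop X, push the body reversed
            print("output", X, "->", " ".join(M[X, a]) or "eps")
            stack.pop()
            stack.extend(reversed(M[X, a]))
    return i == len(tokens) - 1

print(parse(["id", "+", "id", "*", "id"]))   # prints the leftmost derivation, then True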

ERROR RECOVERY IN PREDICTIVE PARSING

The stack of a nonrecursive predictive parser makes explicit the terminals and nonterminals that
the parser hopes to match with the remainder of the input. We shall therefore refer to symbols
on the parser stack in the following discussion. An error is detected during predictive
parsing when the terminal on top of the stack does not match the next input symbol or when
nonterminal A is on top of the stack, a is the next input symbol, and the parsing table entry
M[A,a] is empty.
Panic-mode error recovery is based on the idea of skipping symbols on the input until a token

in a selected set of synchronizing tokens appears. Its effectiveness depends on the choice of
synchronizing set. The sets should be chosen so that the parser recovers quickly from errors that
are likely to occur in practice. Some heuristics are as follows

 As a starting point, we can place all symbols in FOLLOW(A) into the synchronizing set for
nonterminal A.
 If we skip tokens until an element of FOLLOW(A) is seen and pop A from the stack, it is
likely that parsing can continue.
 It is not enough to use FOLLOW(A) as the synchronizing set for A. For example, if
semicolons terminate statements, as in C, then keywords that begin statements may not
appear in the FOLLOW set of the nonterminal generating expressions.
 A missing semicolon after an assignment may therefore result in the keyword beginning the
next statement being skipped.
 Often, there is a hierarchical structure on constructs in a language; e.g., expressions
appear within statements, which appear within blocks, and so on.
 We can add to the synchronizing set of a lower construct the symbols that begin
higher constructs.
 For example, we might add keywords that begin statements to the synchronizing sets for the
non-terminals generating expressions.
 If we add symbols in FIRST(A) to the synchronizing set for nonterminal A, then it may be
possible to resume parsing according to A if a symbol in FIRST(A) appears in the input.
 If a nonterminal can generate the empty string, then the production deriving ε can be
used as a default. Doing so may postpone some error detection, but cannot cause an error to
be missed. This approach reduces the number of nonterminals that have to be considered
during error recovery.
 If a terminal on top of the stack cannot be matched, a simple idea is to pop the
terminal, issue a message saying that the terminal was inserted, and continue parsing. In
effect, this approach takes the synchronizing set of a token to consist of all other tokens.

Panic Mode Recovery
This involves careful selection of synchronizing tokens for each non terminal.
Some points to follow are,
 Place all symbols in FOLLOW(A) into the synchronizing set of A. In this case parser skips
symbols until a symbol from FOLLOW(A) is seen. Then A is popped off the stack and parsing
continues
 The symbols that begin the higher constructs must be added to the synchronizing set of the
lower constructs
 Add FIRST(A) to the synchronizing set of A, so that parsing can continue when a symbol in
FIRST(A) is encountered during skipping of symbols
 If some non-terminal derives ε, then the production deriving ε can be used as a default
 If the symbol on top of the stack cannot be matched, one method is to pop the symbol and issue a
message saying that the symbol was inserted.
Example:
The Grammar:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E)
F → id
        E         E'       T         T'        F
FIRST   {(, id}   {+, ε}   {(, id}   {*, ε}    {(, id}
FOLLOW  {), $}    {), $}   {+, ), $} {+, ), $} {+, *, ), $}

Predictive Parser table after modification for error handling is

                              INPUT SYMBOLS
Non-Terminal  id       +          *          (        )        $
E             E→TE'                          E→TE'    SYNCH    SYNCH
E'                     E'→+TE'                        E'→ε     E'→ε
T             T→FT'    SYNCH                 T→FT'    SYNCH    SYNCH
T'                     T'→ε       T'→*FT'             T'→ε     T'→ε
F             F→id     SYNCH      SYNCH      F→(E)    SYNCH    SYNCH

Parsing and Error Recovery moves made by Predictive Parser


STACK      INPUT       REMARK

$E         )id*+id$    error, skip )
$E         id*+id$     id is in FIRST(E)
$E'T       id*+id$
$E'T'F     id*+id$
$E'T'id    id*+id$
$E'T'      *+id$
$E'T'F*    *+id$
$E'T'F     +id$        error, M[F, +] = synch
$E'T'      +id$        F has been popped
$E'        +id$
$E'T+      +id$
$E'T       id$
$E'T'F     id$
$E'T'id    id$
$E'T'      $
$E'        $
$          $

Phrase-level Recovery


Phrase-level error recovery is implemented by filling in the blank entries in the predictive parsing table
with pointers to error routines. These routines may change, insert, or delete symbols on the input and
issue appropriate error messages. They may also pop from the stack. Alteration of stack symbols or the
pushing of new symbols onto the stack is questionable for several reasons. First, the steps carried out by
the parser might then not correspond to the derivation of any word in the language at all. Second, we
must ensure that there is no possibility of an infinite loop. Checking that any recovery action eventually
results in an input symbol being consumed (or the stack being shortened if the end of the input has been
reached) is a good way to protect against such loops.
