Chapter 3 Syntax Analysis (Parsing)
Compiler Design
Objective
Understand ambiguous grammars and how to deal with ambiguity from CFGs.
The Role of the Parser
[Figure: source program → Lexical Analyzer → token → Parser → parse tree → Rest of Front End → intermediate representation; the Parser calls getNextToken on the Lexical Analyzer; the components share the Symbol table.]
Parser
performs context-free syntax analysis
guides context-sensitive analysis
constructs an intermediate representation
produces meaningful error messages
The parser is expected to report any syntax errors in an intelligible fashion and to recover from commonly occurring errors to continue processing the remainder of the program.
Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the rest of the front end.
Contd.
The parser obtains a string of tokens from the lexical analyzer, as shown in the
figure above, and verifies that the string of token names can be generated by the
grammar for the source language.
A grammar gives a precise, yet easy-to-understand, syntactic specification of a
programming language.
From certain classes of grammars, we can automatically construct an efficient
parser.
The structure that a grammar imparts to a language is useful for translating
source programs into correct object code and for detecting errors.
A grammar allows a language to be evolved or developed iteratively, by adding
new constructs to perform new tasks.
Contd.
There are three general types of parsers for grammars: universal, top-down, and
bottom-up.
1. Universal parsing methods such as the Cocke-Younger-Kasami algorithm and
Earley's algorithm can parse any grammar.
These general methods are, however, too inefficient to use in production
compilers.
The methods commonly used in compilers can be classified as being either
top-down or bottom-up.
2. Top-down methods:- As implied by their name, top-down methods build
parse trees from the top (root) to the bottom (leaves).
3. Bottom-up methods:- Start from the leaves and work their way up to the
root to build the parse tree.
In either case, the input to the parser is scanned from left to right, one symbol at
a time.
The most efficient top-down and bottom-up methods work only for sub-classes
of grammars.
Error Handling
Common programming errors include:
Lexical errors, syntactic errors, semantic errors, and logical errors
Error handler goals
Report the presence of errors clearly and accurately
Recover from each error quickly enough to detect subsequent errors
Add minimal overhead to the processing of correct programs
Common error-recovery strategies include:
1. Panic-mode recovery:- Discard input symbols one at a time until one of a designated
set of synchronizing tokens is found.
2. Phrase-level recovery:- Replace a prefix of the remaining input by some string that
allows the parser to continue.
3. Error productions:- Augment the grammar with productions that generate the
erroneous constructs.
4. Global correction:- Choose a minimal sequence of changes to obtain a globally
least-cost correction.
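Panic-mode recovery, the first strategy listed above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the statement shape `print ( id ) ;`, the choice of `;` as the only synchronizing token, and the names `skip_to_sync` and `parse_statements` are all hypothetical and do not come from the slides.

```python
SYNC_TOKENS = {";"}  # assumed synchronizing set for this sketch

def skip_to_sync(tokens, pos, sync=SYNC_TOKENS):
    """Discard tokens one at a time from pos until a synchronizing token.

    Returns the index just past that token, or len(tokens) if none is found.
    """
    while pos < len(tokens) and tokens[pos] not in sync:
        pos += 1
    return min(pos + 1, len(tokens))

def parse_statements(tokens):
    """Recognize a sequence of `print ( id ) ;` statements.

    Returns (number of well-formed statements, positions where errors began).
    """
    expected = ["print", "(", "id", ")", ";"]
    count, errors, pos = 0, [], 0
    while pos < len(tokens):
        if tokens[pos:pos + len(expected)] == expected:
            count += 1
            pos += len(expected)
        else:
            errors.append(pos)               # report the error once
            pos = skip_to_sync(tokens, pos)  # then resynchronize
    return count, errors

tokens = "print ( id ) ; print id ) ; print ( id ) ;".split()
print(parse_statements(tokens))
```

The second statement is malformed, so the parser records one error, discards tokens up to the next `;`, and still recognizes the third statement. This is the sense in which recovery lets later errors be detected.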
Context-Free Grammars (CFGs)
A CFG is used as a tool to describe the syntax of a programming language.
A CFG includes 4 components:
Terminals = { id, +, -, *, /, (, ) }
symbol.
2. These symbols are non-terminals:
body.
6. A set of productions A → α1, A → α2, A → α3, ..., A → αk with a common head A (call them A-productions) may be written A → α1 | α2 | α3 | ... | αk.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example:- Using these conventions, the grammar of Example 4 of slide 9 can be
rewritten concisely as:
E → E + T | E - T | T
T → T * F | T / F | F
• The notational conventions tell us that E, T, and F are non-terminals, with E the start
symbol.
• The remaining symbols are terminals.
Derivations
A derivation is a description of how a string is generated from the start symbol of a
grammar.
1. A leftmost derivation always picks the leftmost non-terminal to replace (see slide 13)
2. A rightmost derivation always picks the rightmost non-terminal to replace (see slide 14)
Some derivations are neither leftmost nor rightmost (see slide 15)
For example: Use the CFG below to generate print(id);
Terminals = { id, num, if, then, else, while, do, print, =, >, {, }, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
(2) S → while (B) do S
(3) S → { L }
(4) E → id
(5) E → num
(6) B → E > E
(7) L → S
(8) L → SL
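A derivation with this CFG can be replayed mechanically. The sketch below is an illustration, not part of the slides: it encodes rules (1) to (8) as Python data and applies a given sequence of rule numbers, always to the leftmost non-terminal. The names `RULES`, `NON_TERMINALS`, and `leftmost_derivation` are hypothetical, and sentential forms are printed with a space between symbols.

```python
# Rules (1)-(8) of the CFG above: rule number -> (head, body symbols).
RULES = {
    1: ("S", ["print", "(", "E", ")", ";"]),
    2: ("S", ["while", "(", "B", ")", "do", "S"]),
    3: ("S", ["{", "L", "}"]),
    4: ("E", ["id"]),
    5: ("E", ["num"]),
    6: ("B", ["E", ">", "E"]),
    7: ("L", ["S"]),
    8: ("L", ["S", "L"]),
}
NON_TERMINALS = {"S", "E", "B", "L"}

def leftmost_derivation(rule_numbers, start="S"):
    """Apply each rule to the leftmost non-terminal; return every sentential form."""
    form = [start]
    steps = [" ".join(form)]
    for n in rule_numbers:
        head, body = RULES[n]
        # Position of the leftmost non-terminal, or None if only terminals remain.
        i = next((k for k, sym in enumerate(form) if sym in NON_TERMINALS), None)
        if i is None or form[i] != head:
            raise ValueError(f"rule {n} does not apply to the leftmost non-terminal")
        form = form[:i] + body + form[i + 1:]
        steps.append(" ".join(form))
    return steps

for step in leftmost_derivation([1, 4]):
    print(step)
```

Applying rule (1) and then rule (4) yields the forms `S`, `print ( E ) ;`, and `print ( id ) ;`, which is the requested derivation of print(id);.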
Leftmost Derivations
A string of terminals and non-terminals α that can be derived from the initial symbol of the
grammar is called a sentential form.
Thus the strings “{ S L }”, “while(id>E) do S”, and “print(E);” of the above example are
all sentential forms.
A derivation is “leftmost” if, at each step in the derivation, the leftmost non-terminal is
selected for replacement.
All of the above examples are leftmost derivations.
A sentential form that occurs in a leftmost derivation is called a left-sentential form.
Example 1: We can use leftmost derivations to generate while(id > num) do print(id); from this
CFG as follows:
S => while(B) do S
=> while(E>E) do S
=> while(id>E) do S
=> while(id>num) do S
=> while(id>num) do print(E);
=> while(id>num) do print(id);
Example 2: We can also generate { print(id); print(num); } from the CFG as follows:
S => {L}
=> {SL}
=> { print(E); L }
=> { print(id); L }
=> { print(id); S }
=> { print(id); print(E); }
=> { print(id); print(num); }
Rightmost Derivations
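A derivation is “rightmost” if, at each step in the derivation, the rightmost non-terminal is
selected for replacement.
Example 1:
S => while(B) do S
=> while(B) do print(E);
=> while(B) do print(id);
=> while(E>E) do print(id);
=> while(E>num) do print(id);
Example 2: Try to derive { print(num); print(id); } from S
S => {L}
=> {SL}
=> {SS}
=> { S print(E); }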
Some derivations are neither leftmost nor rightmost:
S => while(B) do S
=> while(E>E) do S
=> while(E>E) do print(E);
Some strings that are not derivable from this CFG:
1. print(id)
2. { print(id); print(id) }
3. while (id) do print(id);
4. print(id > id);
Example:
Non-Terminals = { S, E, B, L }
E → id | num
B → E > E
L → S | SL
Start Symbol = S
Parse Trees
A parse tree is a graphical representation of a derivation that filters out the order in
which productions are applied to replace non-terminals.
Each interior node is labeled with the non-terminal A in the head of a
production; the children of the node are labeled, from left to right, by the symbols in
the body of the production by which this A was replaced during the derivation.
We start with the initial symbol S of the grammar as the root of the tree.
The children of the root are the symbols that were used to rewrite the initial symbol in the
derivation.
The children of each internal node N are the symbols on the right-hand side of a rule that has
N as the left-hand side (e.g. B → E > E, where E > E is the right-hand side and B is the
left-hand side of the rule).
Examples
Example 1: -(id+id)
E => -E => -(E) => -(E+E) => -(id+E)=>-(id+id)
Example 2: id+id*id
E => E+E => E+E*E => E+id*E => E+id*id => id+id*id
[Figure: two parse trees, a) and b), for Example 2]
Ambiguous Grammars
A grammar is ambiguous if there is at least one string derivable from the grammar that
has more than one parse tree, or more than one leftmost derivation, or more
than one rightmost derivation.
Example 2 of slide 18 has two parse trees (parse trees a and b) for the same string, so its
grammar is ambiguous.
Ambiguous grammars are bad, because the parse trees don’t tell us the exact meaning of the
string.
For example, in Example 2 of the previous slide, in Fig a. the string means id*(id+id),
Rules = E → E + T
E → E - T
E → T
[Figure: A parse tree for id*id(id+id)]
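One way to make the definition of ambiguity concrete is to count parse trees. The sketch below assumes the expression grammar E → E + E | E * E | id, which is only a fragment of the grammar used in the examples, and the function name `count_parse_trees` is hypothetical. It counts the distinct parse trees of a token sequence by trying every operator as the root of the tree.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count parse trees of tokens under E -> E + E | E * E | id (assumed grammar)."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees that derive tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                # tokens[k] is the root operator; multiply the choices on each side.
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees("id + id * id".split()))
```

Any count above 1 means the string witnesses the ambiguity of the grammar: for id + id * id the two trees correspond to the groupings (id+id)*id and id+(id*id).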
Contd.
We need to make sure that all additions appear higher in the tree than multiplications
(Why?)
How can we do this?
Once we replace an E with E*E using rule 4, we don’t want to rewrite any of the
Es we’ve just created using rule 2, since that would place an addition (+) lower in the
tree than a multiplication (*).
subtractions.
18:
id+id+id = (id+id)+id or
id-id-id = (id-id)-id
In other words, we need to make sure that the right sub-tree of an addition or
We replaced the ambiguous grammar of Example 2 on slide 18 with the CFG and parse tree shown
on slide 19 to obtain an unambiguous CFG and parse tree.
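The effect of the layered grammar (E for + and -, T for * and /, F for operands) can be seen in a small hand-written parser. This is a sketch under assumptions, not the slides' method: the production F → id | ( E ) is assumed, the left-recursive rules E → E + T and T → T * F are replaced by loops (a standard way to keep left associativity in a recursive-descent parser), and the result is a nested tuple (operator, left, right) instead of a full parse tree. The name `parse_expression` is hypothetical.

```python
def parse_expression(tokens):
    """Parse tokens with E -> E+T | E-T | T and T -> T*F | T/F | F.

    F -> id | ( E ) is an assumption. Returns nested (op, left, right) tuples.
    """
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = peek()
        if tok == "id":
            advance()
            return "id"
        if tok == "(":
            advance()
            tree = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return tree
        raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        tree = factor()
        while peek() in ("*", "/"):   # loop stands in for T -> T*F | T/F
            op = advance()
            tree = (op, tree, factor())
        return tree

    def expr():
        tree = term()
        while peek() in ("+", "-"):   # loop stands in for E -> E+T | E-T
            op = advance()
            tree = (op, tree, term())
        return tree

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree

print(parse_expression("id + id * id".split()))
print(parse_expression("id - id - id".split()))
```

In the first result the multiplication sits below the addition, and in the second the left operand of the outer subtraction is itself a subtraction, matching (id-id)-id. Each input has exactly one tree.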
Extended Backus Naur Form (EBNF)
Another term for a CFG is a Backus Naur Form (BNF).
There is an extension to BNF notation, called Extended Backus Naur Form, or EBNF
EBNF rules allow us to mix and match CFG notation and regular expression notation
in the right-hand side of CFG rules
For example, consider the following CFG, which describes simpleJava statement
blocks and stylized simpleJava print statements:
2. Show that the following CFGs are ambiguous by giving two parse trees for the same
string.
2.1) Terminals = { a, b }
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → STS
S → b
T → aT
2.2) Terminals = { if, then, else, print, id }
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → if id then S T
S → print id
T → else S
T → ε
Contd.
3. Construct a CFG for each of the following:
b. The set of all strings over { (, ), [, ] } which form balanced parentheses. That is,
(), ()(), ((()())()), [()()] and ([()[]()]) are in the language but )( , ][ , (() and ([ are
not.
c. The set of all strings over {num, +, -, *, /} which are legal binary postfix
expressions. Thus num num +, num num num + *, num num - num * are all in
the language, while num *, num * num and num num num - are not in the
language.
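Before writing a CFG for exercise 3c, it can help to pin down which strings count as legal. The checker below is a sketch of that membership test, not a solution to the exercise: it tracks how many complete operands are available as the tokens are read. The name `is_legal_postfix` is hypothetical, and treating a lone num as legal is an assumption the exercise does not settle.

```python
OPERATORS = {"+", "-", "*", "/"}

def is_legal_postfix(tokens):
    """Return True if tokens form a legal binary postfix expression."""
    depth = 0  # number of complete operands seen and not yet consumed
    for tok in tokens:
        if tok == "num":
            depth += 1
        elif tok in OPERATORS:
            if depth < 2:      # a binary operator needs two operands
                return False
            depth -= 1         # two operands are replaced by one result
        else:
            return False       # symbol outside the alphabet
    return depth == 1          # exactly one value must remain

for text in ["num num +", "num num num + *", "num *", "num num num -"]:
    print(text, is_legal_postfix(text.split()))
```

The examples from the exercise behave as stated: the first two are accepted, while num * fails for lack of a second operand and num num num - leaves an extra operand behind.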
There are a few things to note about the form of JavaCC rules:
In CFGs, we have followed the common convention of using uppercase letters for
non-terminals and lowercase letters for terminals.
JavaCC uses the reverse convention, i.e. uppercase letters for terminals and
lowercase letters for non-terminals.
JavaCC non-terminals are usually not a single letter, but a more meaningful identifier.
Non-terminals represent method calls in the generated parser, hence the () after each
non-terminal.