Chapter 3 - Syntax Analysis
Syntax Analysis
Basic Topics of Chapter Three
Ambiguity
Eliminating ambiguity
3.1. Introduction of syntax analysis
Syntax analysis creates the syntactic structure of the given source program.
Parser: a program that takes tokens and a grammar (a CFG) as input and validates the
stream of tokens against the grammar.
The syntax is specified by a context-free grammar.
[Figure: position of the parser in the compiler model. The source program feeds the Lexical Analyzer; the Parser requests tokens with getNextToken; the Parser produces a Parse tree for the Rest of the Front End, which emits the Intermediate representation; the lexical analyzer and the parser both consult the Symbol table.]
The Main Responsibilities of Syntax Analysis
Major tasks conducted during parsing (syntax analysis):
The parser obtains a stream of tokens from the lexical analyzer and verifies that the
stream of token names can be generated by the grammar for the source language.
It determines the syntactic validity of a source string; if the string is valid, a parse tree is built for use by the rest of the front end.
It collects information about various tokens into the symbol table and performs type checking.
Syntactic errors: include misplaced semicolons or extra or missing braces; that is, "{" or "}".
Logical errors: can be anything from incorrect reasoning on the part of the
programmer to the use, in a C program, of the assignment operator = instead of the
comparison operator ==.
Error-Recovery Strategies
Once an error is detected, how should the parser recover?
a. Panic-mode recovery
b. Phrase-level recovery
c. Error-productions, and
d. Global-correction.
i. Panic Mode Recovery
Once an error is found, the parser intends to find designated set of synchronizing tokens
by discarding input symbols one at a time.
Synchronizing tokens are delimiters, semicolon or } whose role in source program is clear.
When parser finds an error in the statement, it ignores the rest of the statement by not
processing the input.
The compiler will discard all subsequent tokens till a semi-colon is encountered.
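The skip-to-a-synchronizing-token idea can be sketched in a few lines of Python. The token list and the error position here are hypothetical; a real parser would obtain both from its scanning loop.

```python
def panic_mode_recover(tokens, pos, sync_tokens=(";", "}")):
    """Skip input symbols one at a time until a synchronizing token
    (e.g. ';' or '}') is found; resume parsing just past it."""
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1                    # discard one input symbol
    return pos + 1                  # position just past the synchronizing token

# hypothetical token stream; an error is detected at the stray '@' (index 2)
tokens = ["x", "=", "@", "1", ";", "y", "=", "2", ";"]
resume = panic_mode_recover(tokens, 2)
print(tokens[resume:])              # parsing resumes at ['y', '=', '2', ';']
```

The price of this strategy is that everything between the error and the synchronizing token is skipped, so errors inside the discarded region go unreported.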
ii. Phrase-Level Recovery
On discovering an error, a parser may perform local correction on the remaining
input; that is, it may
o replace a prefix of the remaining input by some string that allows the parser to continue.
This works only for the most common mistakes, which can be easily identified.
Basically there are a number of grammar types, but for compiler design, that is, for the syntactic structure of
a programming language, we use the CFG.
CFG: used to define the syntactic structure of programming language constructs, like:
algebraic expressions, if-else statements, while loops, array representations, etc.
Example (natural language): I am going.
(This is a pronoun, a helping verb, and a verb followed by -ing, terminated by a
full stop, respectively.)
In a programming language, suppose we write a sentence in this form:
int a, b, c;
a data type, then variable names separated by commas, terminated by a semicolon.
Therefore, a CFG is used to check the syntax of a programming language.
Formal Definition of a CFG
A CFG is a 4-tuple G = (T, N, S, P): a set of terminals T, a set of non-terminals N, a start symbol S (one of the non-terminals), and a set of productions P.
Conventionally, the productions for the start symbol are listed first.
iv. Productions
The productions of a grammar specify the manner in which the terminals and non-terminals
can be combined to form strings.
Each Production Consists of:
a. A non-terminal called the head or left side of the production;
this production defines some of the strings denoted by the head.
b. The symbol →. Sometimes ::= has been used in place of the arrow.
c. A body or right side consisting of zero or more terminals and non-terminals.
Example #1: Simple Arithmetic Expressions
The nonterminal symbols are expression, term and factor, and expression is the
start symbol .
Example #2: CFG (Algebraic grammar)
• G: S → AB
A → aAA
A → aA
A → a
B → bB
B → b
Q1. Identify S, T, N and P.
Q2. Check whether each of the following input strings is accepted by the given G:
ab, aab, aaab, aabba
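Q2 can be checked mechanically. The sketch below is a naive exhaustive-search membership test, not how real parsers work, but enough to answer the question for a tiny grammar like this one:

```python
def derives(symbols, tokens, grammar):
    """True iff the sentential form `symbols` derives exactly `tokens`.
    Naive exhaustive search; it terminates here because every alternative
    of this grammar starts with a terminal, so each step consumes input."""
    if not symbols:
        return not tokens                     # both exhausted: success
    head, rest = symbols[0], symbols[1:]
    if head in grammar:                       # non-terminal: try each body
        return any(derives(list(body) + rest, tokens, grammar)
                   for body in grammar[head])
    # terminal: must match the next input token
    return bool(tokens) and tokens[0] == head and derives(rest, tokens[1:], grammar)

G = {"S": ["AB"], "A": ["aAA", "aA", "a"], "B": ["bB", "b"]}
for w in ["ab", "aab", "aaab", "aabba"]:
    print(w, derives(["S"], list(w), G))
# ab, aab and aaab are accepted; aabba is not (a 'b' may not precede an 'a')
```

Here A derives only strings of a's and B only strings of b's, so L(G) is a^m b^n with m, n ≥ 1, which is why aabba is rejected.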
Example #3: CFG (Algebraic grammar)
G: E → E+E | E-E | E*E | E/E | (E) | id
sentence: id+id*id
(this sentence can be derived from the above grammar, hence it is a valid sentence)
(a sentence that cannot be generated by the above grammar is an invalid sentence)
NB: so the job of a grammar is to validate the correctness of a sentence or find the
errors of the sentence.
Example #4: CFG (“If else” Grammar)
• S → if expression then statement
| if expression then statement else statement
Q. How can we write a sentence with the help of this grammar?
- If the statement can be derived with the help of the grammar, the statement is a valid sentence.
Example sentences:
if (a<b) then printf("yes");
if (a<b) then printf("yes")
else
printf("no");
CFG - Terminology
L(G) is the language of G (the language generated by G) which is a set of
sentences.
Associativity and precedence are captured in the following grammar of expressions, terms,
and factors.
E represents expressions consisting of terms separated by
+ signs,
T represents terms consisting of factors separated by * signs, and
F represents factors that can be either parenthesized expressions or identifiers:
Cont’d
G: E → E + T | T
T→T*F|F
F → (E) | id
Expression grammar above belongs to the class of LR grammars that are suitable for
bottom-up parsing.
This grammar can be adapted to handle additional operators and additional levels of
precedence.
E → TE’
E’ →+TE’ | Ɛ
T →FT’
T’ → *FT’ | Ɛ
F →(E) | id
Cont’d
The following grammar treats + and * alike, so it is useful for illustrating techniques
for handling ambiguities during parsing:
Grammar above permits more than one parse tree for expressions like: a + b*c.
3.6. Derivations and Parse Trees
• Definition: Let G = (N, T, P, S) be a CFG.
– If a vertex with label A has children with labels X1, X2, …, Xn from left to right, then
A → X1X2…Xn must be a production in P.
– If a vertex has label ε, then that vertex is a leaf and the only child of its parent.
– More generally, a derivation tree can be defined with any non-terminal as the root of the tree.
Derivations
A derivation is a sequence of applications of production rules.
It is used to generate the input string from the start symbol through these production rules.
We have to decide:
• which non-terminal to replace
• Production rule by which the Non-terminals will be replaced
If there is a production A → α then we say that A derives α, denoted A ⇒ α.
αAβ ⇒ αγβ if A → γ is a production.
If α1 ⇒ α2 ⇒ … ⇒ αn then α1 ⇒* αn.
Given a grammar G and a string w of terminals in L(G), we can write S ⇒* w.
If S ⇒* α, where α is a string of terminals and non-terminals of G, then we say that α is a
sentential form of G.
There are two options for Derivation
a. Left-Most Derivation (LMD) and b. Right-Most Derivation (RMD)
a. Left-Most Derivations (LMD)
• If we always choose the left-most non-terminal in each derivation step, this
derivation is called as left-most derivation.
• In LMD- input string scanned and replaced with the production rule from left to
right.
• In a sentential form only the leftmost non terminal is replaced then it becomes
leftmost derivation.
Every leftmost step can be written as
wAγ ⇒lm wδγ
where w is a string of terminals and A → δ is a production.
LMD: Example #1
G: E → E+E | (E) | -E | id   Input string: id+id
E ⇒ E+E ⇒ id+E ⇒ id+id (E+E derives from E; the leftmost E is replaced first)
LMD: Example #2
G: E → E+E | (E) | -E | id   Input string: -(id+id)
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
b. Right-Most Derivation (RMD)
In RMD, the input string is scanned and replaced with the production rules from
right to left.
We will see that the top-down parsers try to find the left-most derivation of the
given source program.
We will see that the bottom-up parsers try to find the right-most derivation of
the given source program in the reverse order.
RMD: Example #1
G: E → E+E | (E) | -E | id   Input string: -(id+id)
Right-Most Derivation:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
Parse Trees
A parse tree shows how the start symbol of a grammar derives a string in the language.
If A is a non-terminal labeling an internal node and x1, x2, …, xn are the labels of the
children of that node, then A → x1x2…xn is a production.
Parse Tree: Example #1
G: E → E+E | (E) | -E | id   Input string: -(id+id)
Parse tree:
Parse Tree: Example #2
Construct parse tree for the given grammar:
Parse tree:
Exercise #2
• Construct parse tree for the following grammar:
1. G: T → T+T
| T*T
X → a
Y → b
Z → c | d
3.7. Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be
ambiguous; equivalently, an ambiguous grammar has more than one leftmost derivation
(or more than one rightmost derivation) for some sentence.
Drawback of Ambiguity:
o Parsing complexity
Consider the grammar:
G: E → E+E | E*E | id   Input string: id+id*id
Two different leftmost derivations exist:
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Two parse trees: one grouping the string as id+(id*id), the other as (id+id)*id.]
Ambiguity : Example #2
string → string + string
| string - string
| 0 | 1 | … | 9
• The string 9-5+2 has two parse trees.
Ambiguity (cont.)
For most parsers, the grammar must be unambiguous.
An unambiguous grammar gives a unique parse tree for each sentence.
We should eliminate the ambiguity in the grammar during the design phase of the
compiler.
Ambiguity is resolved with:
Associativity rules
Precedence rules
Associativity of Operators
If an operand has an operator on both sides, the side whose operator takes this
operand gives the associativity of that operator.
In a+b+c, b is taken by the left +.
+, -, *, / are left associative.
^, = are right associative.
e.g. 1+2+3: first we evaluate (1+2)+3 (left associative)
1^2^3 = 1^(2^3) (right associative)
a=b=c (right associative)
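These conventions are easy to confirm in Python, whose own operators follow the same rules (Python writes exponentiation as ** rather than ^):

```python
# Left-associative '-' groups left to right:
assert 10 - 4 - 3 == (10 - 4) - 3 == 3      # not 10 - (4 - 3), which is 9
# Right-associative exponentiation groups right to left:
assert 2 ** 3 ** 2 == 2 ** (3 ** 2) == 512  # not (2 ** 3) ** 2, which is 64
# '*' has higher precedence than '+', so a + b*c means a + (b*c):
assert 2 + 3 * 4 == 2 + (3 * 4) == 14
print("associativity and precedence behave as described")
```

A language's grammar encodes exactly these groupings, which is why the unambiguous expression grammar below puts * one level deeper than +.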
Grammar to generate strings with right associative operators
right → letter = right | letter
letter → a | b | … | z
Precedence of Operator
Whenever an operator has a higher precedence than the other operators,
it means that the first operator gets its operands before the operators with lower
precedence do.
Since multiplication and division have the same precedence, we must use
associativity,
which means they are grouped left to right, as if the expression a*b/c were (a*b)/c.
3.10 Eliminating Ambiguity
AMBIGUITY. The context-free grammar G = (T, N, S, P) is
unambiguous if every sentence of G has a unique parse tree.
We shall eliminate the ambiguity from the following "dangling else" grammar; of the
two possible parse trees for a sentence with two thens and one else, the tree that
matches the else with the closest then is preferred.
Hence the general rule is: match each else with the closest previous unmatched then.
This disambiguating rule can be incorporated directly into a grammar by using the
following observations.
Eliminating Ambiguity(cont’d)
A statement appearing between a then and an else must be matched. (Otherwise there
will be an ambiguity.)
A matched statement is
o an if-then-else statement where unmatched statements are allowed in the else-part (but
not in the then-part).
Top-down parsing is one of the methods that we will study for generating parse trees.
A top-down parser cannot handle a left-recursive grammar G, so we must transform G into an equivalent grammar
which is not left recursive and which generates the same language as G.
How to eliminate left recursion?
A simple rule for direct left recursion elimination:
A → Aα | β
– We may replace it with
A → βA'
A' → αA' | ε
where A' is a new non-terminal.
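The replacement rule above can be sketched as a small Python transformation. A production body is represented here as a list of symbols, with [] standing for ε; this representation is an assumption for illustration.

```python
def eliminate_direct_left_recursion(head, bodies):
    """Apply  A -> A a | b   =>   A -> b A',  A' -> a A' | epsilon.
    A body is a list of symbols; [] denotes an epsilon body."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # left-recursive tails
    betas  = [b for b in bodies if not b or b[0] != head]    # the rest
    new = head + "'"
    return {
        head: [b + [new] for b in betas],          # A  -> b1 A' | b2 A' | ...
        new:  [a + [new] for a in alphas] + [[]],  # A' -> a1 A' | ... | epsilon
    }

# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
print(eliminate_direct_left_recursion("E", [["E", "+", "T"], ["T"]]))
```

Running it on E → E + T | T reproduces exactly the transformed grammar used in Example #1 below.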
Eliminate left recursion: Example #1
Consider the following grammar, which generates arithmetic expressions.
E →E + T | T
T→T*F | F
F→(E) | id
has two left recursive productions. Applying the above trick leads to
E →TE’
E’ →+TE’ |∈
T →FT’
T’ →*FT’ |∈
F→(E) | id
Elimination of left recursion(cont’d)
The Case of Several Left Recursive A-productions.
S→A
A → Ad / Ae / aB / ac
B → bBc / f
• Solution:
• The grammar after eliminating left recursion is:
S → A
A → aBA' | acA'
A' → dA' | eA' | ε
B → bBc | f
ii. Right Recursion
A production of a grammar is said to have right recursion if the rightmost variable of
its RHS is the same as the variable on its LHS.
A grammar containing a production having RR is called as Right Recursive Grammar.
Example: S → aS / ∈ (Right Recursive Grammar)
Note: Right recursion does not create any problem for the Top down parsers.
Therefore, there is no need of eliminating right recursion from the grammar.
The recursion which is neither left recursion nor right recursion is called as general recursion.
Example: S → aSb / ∈
Left factoring
Left factoring is a process by which the grammar with common prefixes is
transformed to make it useful for Top down parsers.
If the RHS of more than one production starts with the same symbol, then such a
grammar is called as grammar with common prefixes.
• Ex: A → αβ1 / αβ2 / αβ3 (Grammar with common prefixes)
This kind of grammar creates a problematic situation for Top down parsers.
Top down parsers can not decide which production must be chosen to parse the string in
hand.
The grammar obtained after the process of left factoring is called as left factored
grammar.
Example: A → αβ1 / αβ2 / αβ3 becomes A → αA', A' → β1 / β2 / β3.
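Left factoring can likewise be sketched as a transformation on production bodies. The list-of-symbols representation and the helper names are assumptions for illustration; the sketch factors a single prefix shared by all alternatives.

```python
def common_prefix(bodies):
    """Longest sequence of symbols shared by the front of every body."""
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) != 1:
            break
        prefix.append(column[0])
    return prefix

def left_factor(head, bodies):
    """Apply  A -> a b1 | a b2 | ...  =>  A -> a A',  A' -> b1 | b2 | ...
    [] denotes an epsilon body."""
    alpha = common_prefix(bodies)
    if not alpha:
        return {head: bodies}                 # nothing to factor
    new = head + "'"
    return {
        head: [alpha + [new]],
        new:  [b[len(alpha):] for b in bodies],
    }

# A -> a b1 | a b2 | a b3   becomes   A -> a A',  A' -> b1 | b2 | b3
print(left_factor("A", [["a", "b1"], ["a", "b2"], ["a", "b3"]]))
```

After factoring, the top-down parser can consume the shared prefix first and postpone the choice between alternatives until the distinguishing symbol is seen.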
Parsing is a process that constructs a syntactic structure (i.e. parse tree) from the stream
of tokens.
The way the production rules are implemented (derivation) divides parsing into two types:
top-down parsing and bottom-up parsing.
Backtracking means (If a choice of a production rule does not work, we backtrack to
try other alternatives.)
Not efficient
Recursive descent parsing: Example #1
Consider the grammar with input string “cad”:
S→cAd
A→ab | a
Backtracking is needed.
Recursive Descent Parsing: Algorithm
A typical procedure for a non-terminal:
Procedure A() {
  choose an A-production, A → X1X2…Xk;
  for (i = 1 to k) {
    if (Xi is a non-terminal)
      call procedure Xi();
    else if (Xi equals the current input symbol a)
      advance the input to the next symbol;
    else
      /* an error has occurred: report it and/or backtrack */
  }
}
Recursive Descent Parsing: Algorithm (cont’d)
If one failed the input pointer needs to be reset and another alternative should be
tried.
E → TE’
E’→+TE’ | Ɛ
T → FT’
T’ →*FT’ | Ɛ
F → (E) | id
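A recursive-descent parser for this grammar can be sketched directly in Python, one procedure per non-terminal. Because the grammar is left-factored and not left recursive, each choice is decided by the current token and no backtracking is needed.

```python
class Parser:
    """Recursive-descent parser for:
         E -> T E'     E' -> + T E' | epsilon
         T -> F T'     T' -> * F T' | epsilon
         F -> ( E ) | id"""
    def __init__(self, tokens):
        self.tokens = tokens + ["$"]   # '$' marks the end of input
        self.pos = 0

    def look(self):
        return self.tokens[self.pos]

    def match(self, t):
        if self.look() != t:
            raise SyntaxError(f"expected {t}, got {self.look()}")
        self.pos += 1

    def E(self):
        self.T(); self.Eprime()

    def Eprime(self):
        if self.look() == "+":
            self.match("+"); self.T(); self.Eprime()   # E' -> + T E'
        # else: E' -> epsilon

    def T(self):
        self.F(); self.Tprime()

    def Tprime(self):
        if self.look() == "*":
            self.match("*"); self.F(); self.Tprime()   # T' -> * F T'
        # else: T' -> epsilon

    def F(self):
        if self.look() == "(":
            self.match("("); self.E(); self.match(")") # F -> ( E )
        else:
            self.match("id")                           # F -> id

def accepts(tokens):
    p = Parser(tokens)
    try:
        p.E()
        return p.look() == "$"     # all input must be consumed
    except SyntaxError:
        return False

print(accepts(["id", "+", "id", "*", "id"]))   # True
print(accepts(["id", "+", "*", "id"]))         # False
```

Each procedure mirrors one production; the ε-alternatives of E' and T' are taken simply by returning without consuming input.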
Recursive descent parsing: Exercise #1
Q. Construct a recursive descent parser for the following grammar.
A → abC | aBd | aAD
B → bB | ∈
C → d | ∈
D → a | b | ∈
First(α) is the set of terminals that begin the strings derived from α.
Follow (A), for a nonterminal A, to be the set of terminals a that can appear immediately to the
right of A in a sentential form.
Example: #2
Rules of Computing FOLLOW
Rules in computing FOLLOW ( X) where X is a nonterminal
1) If X is a part of a production and is succeeded by a terminal,
for example:
A → Xa; then Follow(X) = { a }
2) If X is the start symbol for a grammar, for ex:
X → AB
A→a
B → b;
then add $ to FOLLOW (X); FOLLOW(X)= { $ }
Rules of Computing FOLLOW(cont’d)
3) If X is a part of a production and followed by another non terminal, get the FIRST of that
succeeding nonterminal.
Ex: A →XD D → aB ;
S→ABCDE
A →a|∈
B →b|∈
C →c
D →d|∈
E →e|∈
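FIRST and FOLLOW for a grammar like this one can be computed by iterating the rules above to a fixed point. The sketch below uses a dict of productions with [] for an ε-body; the representation is an assumption for illustration.

```python
EPS = "ε"

def first_follow(grammar, start):
    """Compute FIRST and FOLLOW for every non-terminal by iterating
    to a fixed point. grammar maps head -> list of bodies."""
    nonterms = set(grammar)
    FIRST = {A: set() for A in grammar}
    FOLLOW = {A: set() for A in grammar}
    FOLLOW[start].add("$")                  # rule 2: $ follows the start symbol

    def first_of(seq):                      # FIRST of a symbol sequence
        out = set()
        for X in seq:
            if X not in nonterms:           # terminal: it begins the string
                out.add(X)
                return out
            out |= FIRST[X] - {EPS}
            if EPS not in FIRST[X]:
                return out
        out.add(EPS)                        # the whole sequence can vanish
        return out

    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                f = first_of(body)
                if not f <= FIRST[A]:
                    FIRST[A] |= f; changed = True
                for i, X in enumerate(body):        # rules 1 and 3 for FOLLOW
                    if X not in nonterms:
                        continue
                    trailer = first_of(body[i + 1:])
                    add = trailer - {EPS}
                    if EPS in trailer:              # everything after X can vanish
                        add |= FOLLOW[A]
                    if not add <= FOLLOW[X]:
                        FOLLOW[X] |= add; changed = True
    return FIRST, FOLLOW

G = {"S": [["A", "B", "C", "D", "E"]],
     "A": [["a"], []], "B": [["b"], []], "C": [["c"]],
     "D": [["d"], []], "E": [["e"], []]}
FIRST, FOLLOW = first_follow(G, "S")
print(FIRST["S"], FOLLOW["A"], FOLLOW["C"])
# FIRST(S) = {a, b, c}; FOLLOW(A) = {b, c}; FOLLOW(C) = {d, e, $}
```

Because A and B are nullable while C is not, FIRST(S) stops collecting at c, and FOLLOW(C) picks up $ since everything after C can vanish.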
A grammar G is LL(1) if and only if whenever A→α|β are two distinct productions
of G, the following conditions hold:
– For no terminal a do α and β both derive strings beginning with a
– At most one of α or β can derive empty string
– If α=> ɛ then β does not derive any string beginning with a terminal in Follow(A).
How LL(1) Parser works?
How LL(1) Parser works?…
input buffer
– our string to be parsed.
– We will assume that its end is marked with a special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most derivation) of
the string in the input buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S initial stack
– when the stack is emptied (i.e. only $ is left in the stack), the parsing is completed.
Predictive Parsing Tables Construction
The general idea is to use FIRST and FOLLOW to construct the parsing table M:
1. For each production A → α, add A → α to M[A, a] for every terminal a in FIRST(α).
2. If FIRST(α) contains ε, add A → α to M[A, b] for every terminal b in FOLLOW(A).
3. If FIRST(α) contains ε and $ is in FOLLOW(A), add A → α to M[A, $] as well.
4. If after performing the above, there is no production in M[A,a] then set M[A,a] to error.
Predictive Parsing:
Tables Construction– Example #1
Consider grammar G:
E → TE’
E’ → +TE’ | Ɛ
T →FT’
T’ →*FT’ | Ɛ
F → (E) | id
and their First and Follow
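The table-driven parse loop that uses such a table can be sketched as follows; the entries below are the ones the FIRST/FOLLOW construction yields for this grammar, written as a dict for illustration.

```python
# LL(1) table: (non-terminal, lookahead) -> body to push; [] is epsilon.
TABLE = {
    ("E",  "id"): ["T", "E'"],      ("E",  "("): ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T",  "id"): ["F", "T'"],      ("T",  "("): ["F", "T'"],
    ("T'", "+"):  [],               ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"):  [],               ("T'", "$"): [],
    ("F",  "id"): ["id"],           ("F",  "("): ["(", "E", ")"],
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens):
    """Stack starts as [$, S]; pop a terminal when it matches the
    lookahead, or replace a non-terminal using the table entry."""
    stack = ["$", "E"]
    tokens = tokens + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        if top in NONTERMS:
            body = TABLE.get((top, tokens[i]))
            if body is None:
                return False                  # blank (error) entry
            stack.extend(reversed(body))      # push body, leftmost symbol on top
        elif top == tokens[i]:
            i += 1                            # terminal matched
        else:
            return False
    return i == len(tokens)

print(ll1_parse(["id", "+", "id", "*", "id"]))   # True
print(ll1_parse(["id", "+", "+"]))               # False
```

Each table lookup corresponds to emitting one step of the leftmost derivation, which is exactly the output described for the LL(1) parser above.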
Predictive Parsing:
Tables Construction– Example #1…
• In bottom-up parsing, we start from a sentence and then apply production rules in reverse
in order to reach the start symbol.
Top-down parsing attempts to find the leftmost derivation for a given string.
Bottom-up parsing attempts to reduce the input string to the start symbol of the grammar.
We can think of bottom-up parsing as the process of "reducing" a token string to the start
symbol of the grammar.
At each reduction, the token string matching the RHS of a production is replaced
by the non-terminal on the LHS of that production.
The key decisions during bottom-up parsing are about when to reduce and about
what production to apply.
When the stack contains only the start symbol and the input is fully consumed, the parser
halts and announces successful completion of parsing.
Solution 1
Shift-reduce parsing: Example #1 (grammar E → E+E | E*E | id, input id*id+id)
Stack    Input        Action
$        id*id+id$    shift id
$id      *id+id$      reduce by E → id
$E       *id+id$      shift *
$E*      id+id$       shift id
$E*id    +id$         reduce by E → id
$E*E     +id$         reduce by E → E*E
$E       +id$         shift +
$E+      id$          shift id
$E+id    $            reduce by E → id
$E+E     $            reduce by E → E+E
$E       $            accept
Shift reduce parsing: Exercise #1
S→S+S|S-S|(S) |a Parse the input string a-(a+a) using shift reduce parsing.
E→ 2E2|3E3|4
Let w = γn, where γn is the nth right-sentential form of some as-yet-unknown rightmost
derivation S = γ0 ⇒ γ1 ⇒ … ⇒ γn = w.
Replace the handle βn in γn by the LHS An of the production An → βn to get the (n-1)th
right-sentential form γn-1.
Stack contents and the next input symbol may not decide the action:
– shift/reduce conflict: The parser cannot decide whether to shift or to reduce.
– reduce/reduce conflict: The parser cannot decide which of several reductions to make.
If a shift-reduce parser cannot be used for a grammar, that grammar is called as non-LR(k)
grammar.
Operator grammar
No production body is ε, and no production body has two adjacent non-terminals.
E → E + E | E * E | id is an operator grammar.
– if two operators have equal precedence, then we check the Associativity of that
particular operator.
Using Operator - Precedence Relations
E → E+E| E*E | id
PRECEDENCE TABLE
Then the input string id+id*id with the precedence relations inserted will be:
$ <· id ·> + <· id ·> * <· id ·> $
Basic principle
Scan the input string from left to right, try to detect ·>, and put a pointer on its location.
Grammar:
Q2. Consider the following grammar and construct the operator precedence parser.
E → EAE | id
A → + | *
It cannot handle the unary minus (the lexical analyzer should handle the unary minus).
Operator precedence parsers use precedence functions that map terminal symbols to
integers.
1. Create function symbols fa and ga for each terminal a and for $.
2. Partition the symbols into groups so that fa and gb are in the same group
if a =· b (there can be symbols in the same group even if they are not connected by this relation).
3. Create a directed graph whose nodes are the groups; then for each pair of symbols a and
b: place an edge from the group of gb to the group of fa if a <· b; otherwise, if a ·> b,
place an edge from the group of fa to that of gb.
4. If the graph constructed has a cycle, then no precedence functions exist.
5. When there are no cycles, take f(a) and g(b) to be the lengths of the longest paths from the groups of fa
and gb, respectively.
Consider the following table:
We can make the look-ahead parameter explicit and discuss LR(k) parsers,
where k is the look-ahead size.
LR(k) parsers are of interest in that they are the most powerful class of
deterministic bottom-up parsers using at most K look-ahead tokens.
Deterministic parsers must uniquely determine the correct parsing action at each
step; the possible actions are to:
1) shift (S),
2) reduce (R),
3) accept (A) the source code, or
4) signal a syntactic error (E).
LR Parsers (Cont.)
An LR parser makes shift-reduce decisions by maintaining states to keep track of
where we are in a parse.
States represent sets of items.
LR(k) Parsers:
4 types of LR(k) parsers:
i. LR(0)
ii. SLR(1) –Simple LR
iii. LALR(1) – Look Ahead LR and
iv. CLR(1) – Canonical LR
LR Parsers (Cont.)
In order to construct parsing table of LR(0) and SLR(1) we use canonical collection of
LR(0) items
In order to construct parsing table of LALR(1) and CLR(1) we use canonical
collection of LR(1) items.
i. LR(0) Item
LR(0) and all other LR-style parsing are based on the idea of: an item of the form:
A→X1…Xi.Xi+1…Xj
The dot symbol . in an item may appear anywhere in the right-hand side of a
production.
It marks how much of the production has already been matched.
An LR(0) item (item for short) of a grammar G is a production of G with a dot at
some position of the RHS.
The production A → XYZ yields the four items:
A → .XYZ   we have not seen any of the RHS yet
A → X.YZ   we have seen X
A → XY.Z   we have seen X and Y
A → XYZ.   we have seen everything
The production A → ε generates only one item, A → . .
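Generating the items of a production is mechanical; a small sketch, using the string "." as a hypothetical marker symbol:

```python
def items_of(head, body):
    """All LR(0) items of one production: the marker '.' placed at
    every possible position in the right-hand side."""
    return [(head, body[:i] + ["."] + body[i:]) for i in range(len(body) + 1)]

for head, dotted in items_of("A", ["X", "Y", "Z"]):
    print(head, "->", " ".join(dotted))
# A -> . X Y Z
# A -> X . Y Z
# A -> X Y . Z
# A -> X Y Z .
```

Note that `items_of("A", [])` yields the single item A → . , matching the ε case above.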
Constructing Canonical LR(0) item sets
• Augmented Grammar
– If G is a grammar with start symbol S, then G', the augmented grammar for G, is the grammar with
new start symbol S' and a production S' → S.
– The purpose of this new starting production is to indicate to the parser when it should stop parsing
and announce acceptance of the input.
– Example: let a grammar be
E → BB
B → cB | d
Its augmented grammar adds the production E' → E.
Closure of a state
Closure of a state adds items for all productions whose LHS occurs immediately after the dot in some item.
• Closure operation
– Let I be a set of items for a grammar G.
– If A → α.Bβ is in I and B → γ is a production, then B → .γ is in closure(I).
• Intuitively, A → α.Bβ indicates that we expect to see a string derivable from Bβ in the input.
• If B → γ is a production, then we might see a string derivable from γ at this point.
Example
• For the grammar
E’ →E
E→E+T|T
T→T*F|F
F → ( E ) | id
If I is { E' → .E }, then closure(I) is
• E’ →.E
E → .E + T
E → .T
T → .T * F
T → .F
F →.id
F → .(E)
Constructing canonical LR(0) item sets…
Goto operation
Example: for the grammar E → BB, B → cB | d:
Step 1. Augment the grammar:
E' → E
E → BB
B → cB | d
Step 2. Draw the canonical collection of LR(0) items (apply closure and goto).
Step 3. Number the productions.
Find CLOSURE(I).
Constructing canonical LR(0) item sets: Example#2
(Cont.)
First, E' → .E is put in CLOSURE(I) by rule 1.
Then, the E-productions with dots at the left end are added: E → .E+T and E → .T.
Now there is a T immediately to the right of a dot in E → .T, so we add T → .T*F and T → .F.
Next, T → .F forces us to add F → .(E) and F → .id.
Thus I0 = closure({[E' → .E]}) = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }.
Goto Next State
Given an item set (state) s, we can compute its next state, s', under a symbol X:
move the dot past X in every item of s in which X immediately follows the dot, then take the closure.
For example, goto on + from the item E → E.+T yields:
E → E+.T
T → .T*F (by closure)
T → .F (by closure)
F → .(E) (by closure)
F → .id (by closure)
We can build all the states of the Transition Diagram this way.
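Both operations can be sketched compactly in Python for the expression grammar. An item is represented as a (head, body, dot-position) triple; this representation is an assumption for illustration.

```python
GRAMMAR = {
    "E'": [["E"]],
    "E":  [["E", "+", "T"], ["T"]],
    "T":  [["T", "*", "F"], ["F"]],
    "F":  [["(", "E", ")"], ["id"]],
}

def closure(items):
    """If A -> alpha . B beta is in the set, add B -> . gamma for every
    B-production, until nothing new appears."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:   # dot before a non-terminal
                for gamma in GRAMMAR[body[dot]]:
                    item = (body[dot], tuple(gamma), 0)
                    if item not in items:
                        items.add(item); changed = True
    return items

def goto(items, X):
    """Next state under symbol X: advance the dot over X, then close."""
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X}
    return closure(moved)

I0 = closure({("E'", ("E",), 0)})
print(len(I0))               # 7 items, matching the worked closure above
I1 = goto(I0, "E")           # { E' -> E.,  E -> E.+T }
I6 = goto(I1, "+")           # E -> E+.T plus the T- and F-items by closure
```

Repeatedly applying goto to every state under every symbol, starting from I0, yields the whole canonical LR(0) collection.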
LR(0) Transition Diagram (Cont.)
Each state in the Transition Diagram,
C = CLOSURE({[S' → .S]});
repeat
  for each set of items I in C and each grammar symbol X
    such that GOTO(I, X) is not empty and not already in C:
      add GOTO(I, X) to C;
until no new sets of items are added to C;
Example: for the grammar
E' → E
E → E + T | T
T → T * F | F
F → (E) | id
[Transition diagram of the canonical LR(0) collection, states I0 to I11:
I0 = closure({[E' → .E]}) = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
I1 = goto(I0, E) = { E' → E., E → E.+T } (accept on $)
I2 = goto(I0, T) = { E → T., T → T.*F }
I3 = goto(I0, F) = { T → F. }
I4 = goto(I0, () = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
I5 = goto(I0, id) = { F → id. }
I6 = goto(I1, +) = { E → E+.T, T → .T*F, T → .F, F → .(E), F → .id }
I7 = goto(I2, *) = { T → T*.F, F → .(E), F → .id }
I8 = goto(I4, E) = { F → (E.), E → E.+T }
I9 = goto(I6, T) = { E → E+T., T → T.*F }
I10 = goto(I7, F) = { T → T*F. }
I11 = goto(I8, )) = { F → (E). }]
LR(0) Parsing Table
LR(0) Stack Implementation: Example: id*id
SLR(1) has the same Transition Diagram and Goto table as LR(0)
BUT with different Action table because it looks ahead 1 token.
SLR(1) Look-ahead
SLR(1) parsers are built by first constructing the:
• Transition Diagram and Goto table (as for LR(0)), and then the Action table using FOLLOW sets.
Slow construction.
• LL(k) ≤ LR(k)
Exercise
Q1. Construct LL(1) parse table for the expression grammar
Steps to be followed
Q2. Given the grammar
S → (L) | a
L → L,S | S
parse the input string (a,(a,a)) using shift-reduce parsing.