CD Unit Ii
UNIT- II
Syntax Analysis
Role of Parser:
• The parser receives a string of tokens from the lexical analyzer, constructs a parse tree, and
passes it to the rest of the compiler for further processing.
• Checking and translation actions can be carried out during parsing, so the parse tree need not
be constructed explicitly.
• The parser reports any syntax errors. It also recovers from commonly occurring errors so that
it can continue parsing.
Notational conventions:
1. Lowercase letters early in the alphabet (a, b, c), operators, digits, punctuation symbols
(parentheses, comma etc.) and boldface strings such as if and id are terminals.
2. Uppercase letters early in the alphabet (A, B, C) and lowercase italic names such as expr or
stmt are non-terminals. The letter S is usually the start symbol.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols, i.e. either
terminals or non-terminals.
4. Lowercase letters late in the alphabet, u, v, ..., z, represent (possibly empty) strings of terminals.
5. Lowercase Greek letters α, β, γ represent (possibly empty) strings of grammar symbols.
6. If A → α1, A → α2, ..., A → αk are productions with A on the left, then we write A → α1 | α2 | ... | αk.
7. Unless stated otherwise, the left side of the first production is the start symbol.
Example:
expression → expression + term
expression → expression – term
expression → term
term → term * factor
term → term / factor
term → factor
factor → (expression)
factor → id
Using the above conventions given grammar is rewritten as
E→E+T|E-T|T
T→T*F|T/F|F
F → (E) | id
Derivations:
• The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules.
• In a derivation, we start with the start symbol; each rewriting step replaces a non-terminal
by the body of one of its productions.
• This derivational view corresponds to the top-down construction of a parse tree, but the
precision afforded by derivations will be helpful when bottom-up parsing is discussed.
• At each step in a derivation, there are two choices to be made: which non-terminal to replace,
and which of its productions to use. Based on the first choice, derivations are of two types:
1. leftmost derivation
2. rightmost derivation
• In a leftmost derivation the leftmost non-terminal is always chosen, and we write α ⇒lm β.
• In a rightmost derivation the rightmost non-terminal is always chosen, and we write α ⇒rm β.
Example: Construct leftmost and rightmost derivations of the string id + id for the grammar
E → E + E | E * E | (E) | id
Leftmost derivation is
E ⇒lm E + E ⇒lm id + E ⇒lm id + id
Rightmost derivation is
E ⇒rm E + E ⇒rm E + id ⇒rm id + id
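A leftmost derivation can be replayed mechanically by always rewriting the first non-terminal of the current sentential form. The following is a minimal sketch under stated assumptions: every grammar symbol is a single character, uppercase letters are non-terminals, and the character i stands in for the token id. The helper name leftmost_step is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Replace the leftmost non-terminal (an uppercase letter) of `form`
   with `body` and write the new sentential form to `out`.
   Returns 1 on success, 0 if `form` contains no non-terminal or
   `out` is too small. `out` must not overlap `form`.
   Assumption: 'i' stands for the token id. */
int leftmost_step(const char *form, const char *body, char *out, size_t cap)
{
    const char *p = form;
    while (*p != '\0' && !isupper((unsigned char)*p))
        p++;                               /* find the leftmost non-terminal */
    if (*p == '\0')
        return 0;                          /* only terminals left: a sentence */

    size_t prefix = (size_t)(p - form);
    size_t need = prefix + strlen(body) + strlen(p + 1) + 1;
    if (need > cap)
        return 0;

    memcpy(out, form, prefix);             /* symbols before the non-terminal */
    strcpy(out + prefix, body);            /* body of the chosen production   */
    strcat(out, p + 1);                    /* symbols after the non-terminal  */
    return 1;
}
```

Calling leftmost_step on "E" with the bodies "E+E", "i" and "i" in turn reproduces the leftmost derivation of id + id shown above.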
Parse Tree:
• A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non-terminals.
• Each interior node is labelled with the non-terminal in the head of a production.
• The leaves of a parse tree are labelled by non-terminals or terminals.
• Parse tree of the string id + id * id for given grammar E → E + E | E * E | (E) | id is
Ambiguity:
• A grammar that produces more than one parse tree for some input string is said to be
ambiguous.
• Equivalently, an ambiguous grammar is one that produces more than one leftmost derivation or
more than one rightmost derivation for some input string.
• The grammar below permits two distinct leftmost derivations for the input string "id + id * id".
E → E + E | E * E | (E) | id
First leftmost derivation:
E ⇒lm E + E
  ⇒lm id + E
  ⇒lm id + E * E
  ⇒lm id + id * E
  ⇒lm id + id * id
Second leftmost derivation:
E ⇒lm E * E
  ⇒lm E + E * E
  ⇒lm id + E * E
  ⇒lm id + id * E
  ⇒lm id + id * id
Eliminating Ambiguity:
• Sometimes an ambiguous grammar can be rewritten to eliminate the ambiguity.
• As an example, eliminate the ambiguity from the following dangling-else grammar:
stmt → if expr then stmt
| if expr then stmt else stmt
| other
• According to this grammar, the compound conditional statement
if E1 then S1 else if E2 then S2 else S3
has a single parse tree, but the grammar is ambiguous because a statement such as
if E1 then if E2 then S1 else S2
has two parse trees: the else can be matched with either then.
• To eliminate the ambiguity, each else is matched with the closest unmatched then, and the
grammar is rewritten as shown below.
stmt → matched_stmt
| open_stmt
matched_stmt → if expr then matched_stmt else matched_stmt
| other
open_stmt → if expr then stmt
| if expr then matched_stmt else open_stmt
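The effect of the rewritten grammar is that every else pairs with the nearest if that is still waiting for one. A small sketch of that pairing rule follows; the token encoding is an assumption made for illustration (i for "if expr then", e for "else", any other character for an other statement), and match_else is a hypothetical helper, not part of any parser.

```c
/* Pair each else with the closest unmatched if.
   toks:    'i' = "if expr then", 'e' = "else", anything else = other
   partner: caller-supplied array with at least strlen(toks) entries;
            partner[k] receives the index of the if matched by the
            else at position k, or -1 for every other position.
   Returns 1 on success, 0 if an else has no open if or nesting
   exceeds the fixed limit of this sketch. */
int match_else(const char *toks, int *partner)
{
    int open[64];                 /* indices of ifs still without an else */
    int depth = 0;

    for (int k = 0; toks[k] != '\0'; k++) {
        partner[k] = -1;
        if (toks[k] == 'i') {
            if (depth == 64)
                return 0;
            open[depth++] = k;
        } else if (toks[k] == 'e') {
            if (depth == 0)
                return 0;         /* else without a preceding if */
            partner[k] = open[--depth];
        }
    }
    return 1;
}
```

For "iises" (if E1 then if E2 then S1 else S2) the else at position 3 is paired with the inner if at position 1, which is the parse the unambiguous grammar allows.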
Left Factoring:
• When the choice between two alternative A-productions cannot be made by reading the initial
symbols of the input, the situation is called non-deterministic.
• Top-down parsing methods cannot handle non-determinism, so it is eliminated by left
factoring.
• To apply left factoring, each pair of non-deterministic productions A → αβ1 | αβ2 is
replaced by
A → α A'
A' → β1 | β2
Example: Eliminate non-determinism (by left factoring) from the grammar below
S → iE+S | iE+SeS | a
E → b
Left factoring process
The alternatives iE+S and iE+SeS share the common prefix iE+S, so the grammar becomes
S → iE+S S' | a
S' → eS | ε
E → b
The resulting grammar has no non-determinism.
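The core of left factoring is finding the longest common prefix α of two alternatives. The sketch below handles exactly one pair of alternatives whose symbols are single characters; the function names and the output format (with eps written for ε) are assumptions of this illustration.

```c
#include <stddef.h>
#include <stdio.h>

/* Length of the longest common prefix of two production bodies. */
size_t common_prefix(const char *b1, const char *b2)
{
    size_t n = 0;
    while (b1[n] != '\0' && b1[n] == b2[n])
        n++;
    return n;
}

/* Left-factor head -> b1 | b2 and write the result to `out` as
   "A -> alpha A' ; A' -> beta1 | beta2".
   Returns 0 (and leaves `out` untouched) if there is no common prefix. */
int left_factor(char head, const char *b1, const char *b2,
                char *out, size_t cap)
{
    size_t n = common_prefix(b1, b2);
    if (n == 0)
        return 0;                                  /* nothing to factor */

    const char *rest1 = b1[n] != '\0' ? b1 + n : "eps";
    const char *rest2 = b2[n] != '\0' ? b2 + n : "eps";
    snprintf(out, cap, "%c -> %.*s%c' ; %c' -> %s | %s",
             head, (int)n, b1, head, head, rest1, rest2);
    return 1;
}
```

Applied to the alternatives iE+S and iE+SeS of S, the common prefix has length 4 and the remainders are ε and eS, matching the factored grammar above.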
Parsers
• The steps below trace recursive descent parsing with backtracking; the trace corresponds to
the grammar S → cAd, A → ab | a with the input w = cad.
• The leftmost leaf, labelled c, matches the first symbol of the input w, so we advance the input
pointer to a, the second symbol of w, and consider the next leaf, labelled A.
• We expand A using its first alternative. Its leaf a matches the second symbol, so we advance the
pointer to d, the third input symbol, and compare d against the next leaf, labelled b. Since b does
not match d, we report failure and go back to A to try another alternative.
• In going back to A, we must reset the input pointer to position 2 and then proceed with the
alternative production. A local variable is used to store the input pointer position.
• With the alternative production, leaf a matches the 2nd symbol of w and leaf d matches the 3rd
symbol of w; the parser then halts and announces successful completion of parsing.
• The biggest drawback of brute-force recursive descent parsing with backtracking is that a
left-recursive grammar can send the parser into an infinite loop, which may cause the compiler
to hang or crash.
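The backtracking steps above can be sketched in C. The grammar S → cAd, A → ab | a is an assumption reconstructed from the trace, and the names match_body and parse_S are hypothetical. The local variable saved plays the role of the stored input pointer.

```c
#include <stddef.h>
#include <string.h>

/* Try to match the terminal string `body` at w[*pos]; advance on success. */
static int match_body(const char *w, size_t *pos, const char *body)
{
    size_t n = strlen(body);
    if (strncmp(w + *pos, body, n) != 0)
        return 0;
    *pos += n;
    return 1;
}

/* Returns 1 if w is derived from S, where
       S -> c A d
       A -> a b | a
   The alternatives of A are tried in order, with backtracking. */
int parse_S(const char *w)
{
    static const char *const a_alternatives[] = { "ab", "a" };
    size_t pos = 0;

    if (!match_body(w, &pos, "c"))
        return 0;

    size_t saved = pos;           /* input pointer before expanding A */
    for (size_t i = 0; i < 2; i++) {
        pos = saved;              /* backtrack: reset pointer, next alternative */
        if (match_body(w, &pos, a_alternatives[i]) &&
            match_body(w, &pos, "d") &&
            w[pos] == '\0')
            return 1;
    }
    return 0;
}
```

On the input cad the first alternative ab fails at d, the pointer is reset to position 2, and the second alternative a succeeds.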
Predictive Parsing:
• A predictive parser does not allow grammars that are left recursive or non-deterministic, and
it needs no backtracking.
• A predictive parser is a special type of recursive descent parser.
• It can predict which production is suitable for completing the parse based on the current input
symbol.
Recursive Predictive Parsing:
• A recursive descent parsing program consists of a set of procedures, one for each
non-terminal.
• It has an input buffer that contains the string to be parsed, followed by the end marker $.
• It maintains the stack implicitly via recursive calls and uses one input symbol of lookahead at
each step to make parsing decisions.
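#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Assumed grammar (the listing does not state it; E is called but not shown):
       E -> id T
       T -> + id T | ε                                              */
enum token { ID, PLUS, END, BAD };

static const char *input = "id+id$";   /* string to parse, ended by $ */
static enum token lookahead;

/* Scan the next token and advance the input pointer. */
static enum token next_token(void)
{
    if (strncmp(input, "id", 2) == 0) { input += 2; return ID; }
    if (*input == '+') { input += 1; return PLUS; }
    return *input == '$' ? END : BAD;
}

/* Consume the lookahead if it is the expected token t. */
static void match(enum token t)
{
    if (lookahead == t)
        lookahead = next_token();
    else {
        printf("error\n");
        exit(EXIT_FAILURE);
    }
}

static void T(void)
{
    if (lookahead == PLUS) {        /* T -> + id T */
        match(PLUS);
        match(ID);
        T();
    }
    /* otherwise T -> ε: return without consuming input */
}

static void E(void)                 /* E -> id T */
{
    match(ID);
    T();
}

int main(void)
{
    lookahead = next_token();
    E();
    if (lookahead == END)
        printf("Success\n");
    else
        printf("error\n");
    return 0;
}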
∴ first(F) = { (, id }
∴ first(E) = first(T) = first(F) = { (, id }
first(E'):
E' → + T E' (rule 2): first(E') = first(+TE') = first(+) = { + }
E' → ε (rule 1): first(E') = first(ε) = { ε }
∴ first(E') = { +, ε }
first(T'):
T' → * F T' (rule 2): first(T') = first(*FT') = first(*) = { * }
T' → ε (rule 1): first(T') = first(ε) = { ε }
∴ first(T') = { *, ε }
∴ follow(E) = { $ } ⋃ { ) } = { $, ) }
Follow(E'):
E → TE' (rule 3): follow(E') = follow(E) = { $, ) }
E' → +TE' (rule 3): follow(E') = follow(E') = { $, ) }
∴ follow(E') = { $, ) } ⋃ { $, ) } = { $, ) }
∴ follow(E) = follow(E') = { $, ) }
Follow(T):
E → TE' (rule 2, and rule 3 because ε is in first(E')):
follow(T) = { first(E') - { ε } } ⋃ follow(E) = { + } ⋃ { $, ) } = { +, $, ) }
E' → +TE' (rule 2, and rule 3 because ε is in first(E')):
follow(T) = { first(E') - { ε } } ⋃ follow(E') = { + } ⋃ { $, ) } = { +, $, ) }
∴ follow(T) = { +, $, ) } ⋃ { +, $, ) } = { +, $, ) }
Follow(T'):
T → FT' (rule 3): follow(T') = follow(T) = { +, $, ) }
T' → *FT' (rule 3): follow(T') = follow(T') = { +, $, ) }
∴ follow(T') = { +, $, ) }
∴ follow(F) = { *, +, $, ) } ⋃ { *, +, $, ) } = { *, +, $, ) }
Non-terminal   First      Follow
E              (, id      $, )
E'             +, ε       $, )
T              (, id      +, $, )
T'             *, ε       +, $, )
F              (, id      *, +, $, )
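The sets in the table can be recomputed by iterating the FIRST and FOLLOW rules until nothing changes. The following sketch assumes the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, which is the grammar the computations above refer to. Sets are stored as bitmasks, and all identifiers (E1 for E', T1 for T', first_of, compute_first_follow) are hypothetical.

```c
enum {
    E, E1, T, T1, F, NNT,                 /* non-terminals; E1 = E', T1 = T' */
    PLUS = NNT, STAR, LP, RP, ID, END,    /* terminals; END is $             */
    EPS                                   /* pseudo-symbol for ε             */
};
#define BIT(s) (1u << (s))

struct Prod { int head, len, body[3]; };

static const struct Prod prods[] = {
    { E,  2, { T, E1 } },       { E1, 3, { PLUS, T, E1 } }, { E1, 0, { 0 } },
    { T,  2, { F, T1 } },       { T1, 3, { STAR, F, T1 } }, { T1, 0, { 0 } },
    { F,  3, { LP, E, RP } },   { F,  1, { ID } },
};
enum { NPROD = sizeof prods / sizeof prods[0] };

unsigned first[NNT], follow[NNT];

/* FIRST of the symbol string body[from .. len-1], using the current first[]. */
static unsigned first_of(const int *body, int from, int len)
{
    unsigned set = 0;
    for (int i = from; i < len; i++) {
        unsigned f = body[i] < NNT ? first[body[i]] : BIT(body[i]);
        set |= f & ~BIT(EPS);
        if (!(f & BIT(EPS)))
            return set;                   /* this symbol cannot vanish */
    }
    return set | BIT(EPS);                /* every symbol can derive ε */
}

void compute_first_follow(void)
{
    for (int a = 0; a < NNT; a++)
        first[a] = follow[a] = 0;

    for (int changed = 1; changed; ) {    /* FIRST: iterate to a fixed point */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            unsigned old = first[prods[p].head];
            first[prods[p].head] |= first_of(prods[p].body, 0, prods[p].len);
            if (first[prods[p].head] != old)
                changed = 1;
        }
    }

    follow[E] = BIT(END);                 /* $ follows the start symbol */
    for (int changed = 1; changed; ) {    /* FOLLOW: iterate to a fixed point */
        changed = 0;
        for (int p = 0; p < NPROD; p++) {
            const struct Prod *q = &prods[p];
            for (int i = 0; i < q->len; i++) {
                int b = q->body[i];
                if (b >= NNT)
                    continue;             /* FOLLOW is defined for non-terminals */
                unsigned rest = first_of(q->body, i + 1, q->len);
                unsigned add = rest & ~BIT(EPS);
                if (rest & BIT(EPS))
                    add |= follow[q->head];
                if ((follow[b] | add) != follow[b]) {
                    follow[b] |= add;
                    changed = 1;
                }
            }
        }
    }
}
```

After compute_first_follow() returns, first[] and follow[] hold the same sets as the table, for example follow[F] contains *, +, $ and ).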
LL(1) grammar:
• Non-recursive predictive parsers can be constructed for a class of grammars called LL(1).
• The first 'L' in LL(1) stands for scanning the input from left to right, the second 'L' for
producing a leftmost derivation, and the '1' for using one input symbol of lookahead at each step
to make parsing decisions.
• A grammar G is LL(1) if and only if, whenever A → α | β are two distinct productions of G, the
following conditions hold:
1. For no terminal a do both α and β derive strings beginning with a.
2. At most one of α and β can derive the empty string.
3. If β derives the empty string, then α does not derive any string beginning with a terminal in
follow(A); likewise with α and β interchanged.
• The next algorithm collects information from the FIRST and FOLLOW sets into a predictive
parsing table M[A, a], a two-dimensional array where A is a non-terminal and a is a terminal or
the end marker $.
Algorithm for construction of parsing table
INPUT: Grammar G.
OUTPUT: Parsing table M .
Method: For each production A → α of the grammar, do the following:
1. For each terminal a in first(α), add A → α to M[A, a].
2. If ε is in first(α), then for each terminal b in follow(A) (including $), add A → α to M[A, b].
3. If, after this, there is no production at all in M[A, a], set M[A, a] to error.
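The three steps can be sketched directly in C. To keep the example short, FIRST of each production body and FOLLOW of each non-terminal are hard-coded from the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id; the bitmask encoding and every identifier are assumptions of this sketch.

```c
enum { E, E1, T, T1, F, NNT };                        /* E1 = E', T1 = T'  */
enum { PLUS, STAR, LP, RP, ID, END, NT, EPS = NT };   /* terminals, then ε */
#define BIT(t) (1u << (t))
#define ERROR_ENTRY (-1)

struct Prod { int head; unsigned first_body; const char *text; };

static const struct Prod prods[] = {
    { E,  BIT(LP) | BIT(ID), "E -> T E'"    },   /* 0 */
    { E1, BIT(PLUS),         "E' -> + T E'" },   /* 1 */
    { E1, BIT(EPS),          "E' -> eps"    },   /* 2 */
    { T,  BIT(LP) | BIT(ID), "T -> F T'"    },   /* 3 */
    { T1, BIT(STAR),         "T' -> * F T'" },   /* 4 */
    { T1, BIT(EPS),          "T' -> eps"    },   /* 5 */
    { F,  BIT(LP),           "F -> ( E )"   },   /* 6 */
    { F,  BIT(ID),           "F -> id"      },   /* 7 */
};
enum { NPROD = sizeof prods / sizeof prods[0] };

static const unsigned follow[NNT] = {
    [E]  = BIT(END) | BIT(RP),
    [E1] = BIT(END) | BIT(RP),
    [T]  = BIT(PLUS) | BIT(END) | BIT(RP),
    [T1] = BIT(PLUS) | BIT(END) | BIT(RP),
    [F]  = BIT(STAR) | BIT(PLUS) | BIT(END) | BIT(RP),
};

int M[NNT][NT];      /* M[A][a] = index into prods, or ERROR_ENTRY */

/* Fill M and return the number of multiply-defined entries
   (0 means the grammar is LL(1)). */
int build_table(void)
{
    int conflicts = 0;

    for (int A = 0; A < NNT; A++)            /* step 3: default is error */
        for (int a = 0; a < NT; a++)
            M[A][a] = ERROR_ENTRY;

    for (int p = 0; p < NPROD; p++) {
        unsigned cols = prods[p].first_body & ~BIT(EPS);   /* step 1 */
        if (prods[p].first_body & BIT(EPS))
            cols |= follow[prods[p].head];                 /* step 2 */
        for (int a = 0; a < NT; a++) {
            if (!(cols & BIT(a)))
                continue;
            if (M[prods[p].head][a] != ERROR_ENTRY)
                conflicts++;
            M[prods[p].head][a] = p;
        }
    }
    return conflicts;
}
```

For this grammar build_table() reports no conflicts, and the ε-productions land in the FOLLOW columns, for example M[E1][RP] and M[E1][END] both hold E' → ε.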
Production E → TE'
first(TE') = first(T) = { (, id }
Therefore E → TE' is placed in M[ E, ( ] and M[ E, id ].
Production E' → +TE'
first(+TE') = { + }
Therefore E' → +TE' is placed in M[ E', + ].
Production E' → ε
first(ε) = { ε }
Therefore E' → ε is placed in M[ E', a ] for each a in follow(E') = { $, ) }:
E' → ε is placed in M[ E', $ ] and M[ E', ) ].
Production T → FT'
first(FT') = first(F) = { (, id }
Therefore T → FT' is placed in M[ T, ( ] and M[ T, id ].
Production T' → *FT'
first(*FT') = { * }
Therefore T' → *FT' is placed in M[ T', * ].
Production T' → ε
first(ε) = { ε }
Therefore T' → ε is placed in M[ T', a ] for each a in follow(T') = { +, $, ) }:
T' → ε is placed in M[ T', + ], M[ T', $ ] and M[ T', ) ].
Production F → (E)
first((E)) = first(() = { ( }
Therefore F → (E) is placed in M[ F, ( ].
Production F → id
first(id) = { id }
Therefore F → id is placed in M[ F, id ].
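• The parsing program considers X, the symbol on top of the stack, and a, the current input
symbol. If X is a non-terminal, the parser chooses an X-production by consulting entry M[X, a] of
the parsing table M; otherwise X is a terminal, and the parser checks that it matches the input
symbol a.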
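A table-driven (non-recursive) predictive parser following this description can be sketched as below. The table is hard-coded for the expression grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id, and the single-character encoding (e for E', t for T', i for id) is an assumption made to keep the example small.

```c
#include <stddef.h>
#include <string.h>

/* Body of the production stored in M[X, a], or NULL for an error entry.
   Assumed encoding: e = E', t = T', i = id; "" is an ε-body. */
static const char *table(char X, char a)
{
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return a == '+' ? "+Te"
                   : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return a == '*' ? "*Ft"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return a == 'i' ? "i" : a == '(' ? "(E)" : NULL;
    default:  return NULL;
    }
}

/* Returns 1 if w (a token string ending in '$') is accepted, else 0. */
int ll1_parse(const char *w)
{
    char stack[128];
    size_t top = 0;

    stack[top++] = '$';
    stack[top++] = 'E';                       /* start symbol above $ */

    while (top > 0) {
        char X = stack[top - 1], a = *w;
        if (X == '$')
            return a == '$';                  /* accept only at end of input */
        if (strchr("EeTtF", X) == NULL) {     /* X is a terminal */
            if (X != a)
                return 0;
            top--;                            /* pop X and advance the input */
            w++;
            continue;
        }
        const char *body = table(X, a);       /* X is a non-terminal */
        if (body == NULL)
            return 0;                         /* error entry */
        top--;                                /* pop X */
        size_t n = strlen(body);
        if (top + n >= sizeof stack)
            return 0;
        while (n > 0)
            stack[top++] = body[--n];         /* push the body, rightmost first */
    }
    return 0;
}
```

The body is pushed in reverse so that its leftmost symbol ends up on top of the stack, which is what makes the parser trace out a leftmost derivation.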
Types of Grammar:
Type 0 (Unrestricted Grammar):
• If there is no restriction on a grammar, then the grammar is categorized as type 0 or
unrestricted grammar.
• In this grammar, there is no limit on the non-terminals and terminals in a production.
α → β
α ∊ (N+T)+
β ∊ (N+T)*
Example: aAb → bB
aA → ε
Type 1 (Context Sensitive):
• Applying some restrictions to a type 0 grammar gives a type 1 or context sensitive grammar.
• Context sensitive means that before and after the non-terminal there should be a terminal or
non-terminal.
• The right hand side of a production is never shorter than its left hand side.
α → β (∴ |α| <= |β|)
α, β ∊ (N+T)+
Example: aAb → bbb
aA → bB
Type 2 (Context Free):
• Applying the context free restriction to a type 1 grammar gives a context free grammar.
• Context free means that the context before or after the non-terminal on the left hand side
should be empty.
• In this grammar, the left hand side of a production should be exactly one non-terminal.
α → β (∴ |α| = 1)
β ∊ (N+T)*
Example: A → BCD
B → a
Type 3 (Regular):
• If a grammar is left linear, right linear or middle linear, then it is called a linear grammar.
• If a grammar is left linear or right linear, but not middle linear, then it is called a regular
grammar.
Examples: A → xB/y (∴ RL)
A → Bx/y (∴ LL)
A, B ∊ N
x, y ∊ T*
Bottom up parsing:
• A bottom-up parse corresponds to the construction of a parse tree for an input string, beginning
at the leaves and working up towards the root.
• The largest class of grammars for which shift-reduce parsers can be built is the LR grammars.
• It is too much work to build an LR parser by hand; tools such as automated parser generators
make it easy to construct an LR parser from a suitable grammar.
Example: Sequence of parse trees in the bottom-up approach for the input id * id with the grammar
E→E+T|T
T→T*F|F
F → (E) | id
Stack        Input          Action
$            id1 * id2 $    shift
$ id1        * id2 $        reduce F → id
$ F          * id2 $        reduce T → F
$ T          * id2 $        shift
$ T *        id2 $          shift
$ T * id2    $              reduce F → id
$ T * F      $              reduce T → T * F
$ T          $              reduce E → T
$ E          $              accept
Reduce/Reduce conflict:
• If the parser cannot decide which of several reductions to make, a reduce/reduce
conflict occurs.
Stack        Input
$ E+T*F      $
• To resolve the conflict above, the action is chosen based on the rightmost elements of the stack,
which should be reduced first.
• Such conflicts are encountered for grammars that are not LR, which includes all ambiguous
grammars.
• The program driving the LR parser behaves as follows. It determines sm, the state currently
on top of the stack, and ai, the current input symbol. It then consults action[sm, ai], the
parsing action table entry for state sm and input ai, which can have one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept,
4. error.
• The function goto takes a state and a grammar symbol as arguments and produces a state.
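The driver loop described above can be sketched in C. The tables below are hard-coded and are the usual SLR(1) tables for the expression grammar 1. E → E+T, 2. E → T, 3. T → T*F, 4. T → F, 5. F → (E), 6. F → id; the production numbering, the encoding of actions (positive = shift to that state, negative = reduce by that production, ACC = accept, 0 = error) and all identifiers are assumptions of this sketch.

```c
enum { ID, PLUS, STAR, LP, RP, END, NTERM };   /* action columns */
enum { NT_E, NT_T, NT_F, NNT };                /* goto columns   */
#define ACC 100

static const int action[12][NTERM] = {
    /*         id   +   *   (   )   $  */
    /*  0 */ {  5,  0,  0,  4,  0,  0 },
    /*  1 */ {  0,  6,  0,  0,  0, ACC },
    /*  2 */ {  0, -2,  7,  0, -2, -2 },
    /*  3 */ {  0, -4, -4,  0, -4, -4 },
    /*  4 */ {  5,  0,  0,  4,  0,  0 },
    /*  5 */ {  0, -6, -6,  0, -6, -6 },
    /*  6 */ {  5,  0,  0,  4,  0,  0 },
    /*  7 */ {  5,  0,  0,  4,  0,  0 },
    /*  8 */ {  0,  6,  0,  0, 11,  0 },
    /*  9 */ {  0, -1,  7,  0, -1, -1 },
    /* 10 */ {  0, -3, -3,  0, -3, -3 },
    /* 11 */ {  0, -5, -5,  0, -5, -5 },
};

static const int goto_tab[12][NNT] = {
    /*  0 */ { 1, 2, 3 },  /*  1 */ { 0, 0, 0 },  /*  2 */ { 0, 0, 0 },
    /*  3 */ { 0, 0, 0 },  /*  4 */ { 8, 2, 3 },  /*  5 */ { 0, 0, 0 },
    /*  6 */ { 0, 9, 3 },  /*  7 */ { 0, 0, 10 }, /*  8 */ { 0, 0, 0 },
    /*  9 */ { 0, 0, 0 },  /* 10 */ { 0, 0, 0 },  /* 11 */ { 0, 0, 0 },
};

/* Head and body length of productions 1..6 (index 0 unused). */
static const int head[7]     = { 0, NT_E, NT_E, NT_T, NT_T, NT_F, NT_F };
static const int body_len[7] = { 0, 3, 1, 3, 1, 3, 1 };

/* tokens: sequence of terminal codes ending with END.
   Returns 1 on accept, 0 on a syntax error. */
int lr_parse(const int *tokens)
{
    int stack[128];              /* stack of states */
    int top = 0;
    stack[0] = 0;

    for (;;) {
        int s = stack[top];      /* state on top of the stack */
        int a = *tokens;         /* current input symbol */
        int act = action[s][a];

        if (act == ACC)
            return 1;
        if (act == 0)
            return 0;                         /* error entry */
        if (act > 0) {                        /* shift */
            if (top + 1 >= 128)
                return 0;
            stack[++top] = act;
            tokens++;
        } else {                              /* reduce by A -> β */
            int p = -act;
            top -= body_len[p];               /* pop |β| states */
            int exposed = stack[top];
            stack[++top] = goto_tab[exposed][head[p]];
        }
    }
}
```

Only states are kept on the stack here; the grammar symbols can be recovered from the states, so a real parser would store semantic values alongside them rather than the symbols themselves.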
Augmented Grammar:
• If G is a grammar with start symbol S then G’, the augmented grammar for G, is G with a
new start symbol S’ and production S’→S.
GOTO ( I0 , E)
I1 : E’ → E .
E→E.+T
GOTO ( I0 , T)
I2 : E → T .
T →T . * F
GOTO ( I0 , F)
I3 : T →F .
GOTO ( I0 , id )
I5 : F → id .
GOTO ( I1 , + )
I6 : E → E + . T
T→.T*F
T→.F
F → . (E)
F → . id
GOTO ( I2 , * )
I7 : T →T * . F
F → . (E)
F → . id
GOTO ( I4 , E )
I8 : F →( E . )
E →E . + T
GOTO ( I6 , T)
I9 : E →E + T .
T →T . * F
GOTO ( I7 , F )
I10 : T → T * F .
GOTO ( I8 , ) )
I11 : F →( E ) .
Augmented grammar G’ of G:
1. S’ → S
2. S → CC
3. C → cC
4. C → d
To find lookahead:
Take S’ → .S, $ because S’ is the starting symbol of the augmented grammar
Rule for finding the lookahead of B:
For an item A → α.Bβ, a
Lookahead of B = first(βa)
Then, for each production B → γ, the item added is
B → .γ, first(βa)
Lookahead of S:
S’ → .S, $
Lookahead of S = first(ε$)=$
Then production is
S → .CC, $
Lookahead of C:
S → .CC, $
Lookahead of C = first(C$) = {c, d}
Then production is
C → .cC, c/d
C → .d, c/d
States:
I0 : S’ → .S, $
S → .CC, $
C → .cC, c/d
C → .d, c /d
GOTO ( I0 , S )
I1: S’ → S., $
GOTO ( I0 , C )
I2: S → C.C, $
C → .cC, $
C → .d, $
GOTO ( I0 , c )
I3: C → c.C, c/d
C → .cC, c/d
C → .d, c/d
GOTO ( I0 , d)
I4: C → d., c/d
GOTO ( I2 , C )
I5: S → CC., $
GOTO ( I2 , c )
I6: C → c.C, $
C → .cC, $
C → .d, $
GOTO ( I2 , d )
I7: C → d., $
GOTO ( I3 , C )
I8: C → cC., c/d
GOTO ( I6 , C )
I9: C → cC., $
Example: Construct an LALR parsing table for the grammar S → CC, C → cC | d
Procedure:
Given Grammar G:
1. S → CC
2. C → cC
3. C → d
Augmented grammar G’ of G:
1. S’ → S
2. S → CC
3. C → cC
4. C → d
To find lookahead:
Take S’ → .S, $ because S’ is the starting symbol of the augmented grammar
Lookahead of S:
S’ → .S, $
Lookahead of S = first(ε$)=$
Then production is
S → .CC, $
Lookahead of C:
S → .CC, $
Lookahead of C = first(C$) = {c, d}
Then production is
GEETHANJALI INSTITUTE OF SCIENCE AND TECHNOLOGY, NELLORE Y.v.R 32
III BTECH II-SEM, CSE: COMPILER DESIGN
C → .cC, c/d
C → .d, c/d
States:
I0 : S’ → .S, $
S → .CC, $
C → .cC, c/d
C → .d, c /d
GOTO ( I0 , S )
I1: S’ → S., $
GOTO ( I0 , C )
I2: S → C.C, $
C → .cC, $
C → .d, $
GOTO ( I0 , c )
I3: C → c.C, c/d
C → .cC, c/d
C → .d, c/d
GOTO ( I0 , d)
I4: C → d., c/d
GOTO ( I2 , C )
I5: S → CC., $
GOTO ( I2 , c )
I6: C → c.C, $
C → .cC, $
C → .d, $
GOTO ( I2 , d )
I7: C → d., $
GOTO ( I3 , C )
I8: C → cC., c/d
GOTO ( I6 , C )
I9: C → cC., $
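• States I3 and I6 above are identical except for their lookaheads, so these two states are
merged to form a new state I36. In the same manner,
I4 and I7 merge and form a new state I47
I8 and I9 merge and form a new state I89
• After merging the states, the LALR parsing table is (ri = reduce by production i of G,
- = error)
State   c     d     $     S    C
0       s36   s47   -     1    2
1       -     -     acc   -    -
2       s36   s47   -     -    5
36      s36   s47   -     -    89
47      r3    r3    r3    -    -
5       -     -     r1    -    -
89      r2    r2    r2    -    -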
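The merging step compares the cores of two LR(1) states (their items with the lookaheads ignored) and, if the cores agree, unites the lookaheads item by item. A minimal sketch follows; the struct layout, the bitmask encoding of the lookaheads c, d and $, and the assumption that both states list their items in the same order are all choices made for this illustration.

```c
#include <stddef.h>

enum { LA_C = 1, LA_D = 2, LA_END = 4 };   /* lookahead bits for c, d, $ */

/* LR(1) item: production number, dot position, set of lookaheads. */
struct Item  { int prod; int dot; unsigned lookahead; };
struct State { size_t n; struct Item items[4]; };

/* Two states have the same core if they contain the same
   (production, dot) pairs. Items are assumed to be in the same order. */
int same_core(const struct State *a, const struct State *b)
{
    if (a->n != b->n)
        return 0;
    for (size_t i = 0; i < a->n; i++)
        if (a->items[i].prod != b->items[i].prod ||
            a->items[i].dot  != b->items[i].dot)
            return 0;
    return 1;
}

/* Merge b into a by uniting the lookaheads of corresponding items.
   Returns 0 and leaves a unchanged if the cores differ. */
int merge_states(struct State *a, const struct State *b)
{
    if (!same_core(a, b))
        return 0;
    for (size_t i = 0; i < a->n; i++)
        a->items[i].lookahead |= b->items[i].lookahead;
    return 1;
}
```

With productions numbered 2. C → cC and 3. C → d, state I3 holds the items (2, dot 1), (2, dot 0), (3, dot 0) with lookaheads c/d and I6 holds the same items with lookahead $; merging them yields I36 with lookaheads c/d/$ on every item.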
Parser Generator:
• Parser generators are tools that build a parser from a description of the syntax of a language.
• Yacc is a widely used parser generator. Yacc stands for "yet another compiler-compiler";
it is a UNIX utility.
Use of Yacc:
• We write the Yacc specification in an input file, say program.y, and pass it to the Yacc compiler.
• The Yacc compiler transforms program.y into a C program file, y.tab.c. This file is then compiled
by the C compiler into an executable file called a.out.
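%{
/* Declarations part (assumed here; the listing itself starts at the first %%). */
#include <ctype.h>
#include <stdio.h>
int yylex(void);
void yyerror(const char *s);
%}
%token DIGIT
%%
line   : expr '\n'         { printf("%d\n", $1); }
       ;
expr   : expr '+' term     { $$ = $1 + $3; }
       | term
       ;
term   : term '*' factor   { $$ = $1 * $3; }
       | factor
       ;
factor : '(' expr ')'      { $$ = $2; }
       | DIGIT
       ;
%%
/* Lexical analyzer: returns DIGIT with its value in yylval,
   or the character itself for any other input. */
int yylex(void) {
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
/* yyerror and main are needed unless the Yacc library (-ly) supplies them. */
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }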
Declarations Part:
• There are two sections in the declarations part of a Yacc program; both are optional. In the
first section, we put ordinary C declarations, delimited by %{ and %}.
#include<ctype.h>
• This causes the C preprocessor to include the standard header file <ctype.h>, which contains
the predicate isdigit.
%token DIGIT
• Declares DIGIT to be a token. Tokens declared in this section can then be used in the
second and third parts of the Yacc specification.
• Note that the non terminal term in the first production is the third grammar symbol of the
body, while + is the second. We have omitted the semantic action for the second
production altogether, since copying the value is the default action for productions with a
single grammar symbol in the body. In general,{$$=$1;}is the default semantic action.
• Notice that we have added a new starting production to the Yacc specification
line: expr'\n' {printf("%d\n",$1);}
• This production says that an input to the desk calculator is to be an expression followed
by a newline character. The semantic action associated with this production prints the
decimal value of the expression followed by a new line character.
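%{
/* Declarations part (assumed here; the listing itself starts at the first %%). */
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double          /* semantic values are doubles */
int yylex(void);
void yyerror(const char *s);
%}
%token NUMBER
%left '+' '-'                   /* lowest precedence, left associative */
%left '*' '/'                   /* higher precedence, left associative */
%%
lines : lines expr '\n'    { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      ;
expr  : expr '+' expr      { $$ = $1 + $3; }
      | expr '-' expr      { $$ = $1 - $3; }
      | expr '*' expr      { $$ = $1 * $3; }
      | expr '/' expr      { $$ = $1 / $3; }
      | '(' expr ')'       { $$ = $2; }
      | NUMBER
      ;
%%
/* Lexical analyzer: skips blanks, reads a number into yylval,
   or returns the character itself. */
int yylex(void) {
    int c;
    while ((c = getchar()) == ' ')
        ;
    if (c == '.' || isdigit(c)) {
        ungetc(c, stdin);
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}
/* yyerror and main are needed unless the Yacc library (-ly) supplies them. */
void yyerror(const char *s) { fprintf(stderr, "%s\n", s); }
int main(void) { return yyparse(); }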