CD Unit-II
2.1 INTRODUCTION
Syntax analysis is the second phase of the compiler design process, in which the given input
string is checked for conformance to the rules and structure of the formal grammar. It
analyses the syntactical structure and checks whether the given input is in the correct syntax
of the programming language or not.
Syntax analysis comes after the lexical analysis phase. It is also known as parsing, and its
output is the parse tree (or syntax tree). The parse tree is developed with the help of the pre-
defined grammar of the language. The syntax analyser checks whether a given program
fulfils the rules implied by a context-free grammar. If it does, the parser creates the
parse tree of that source program; otherwise, it displays error messages.
Context-free grammars have the following components:
A set of terminal symbols which are the characters that appear in the language/strings
generated by the grammar. Terminal symbols never appear on the left-hand side of the
production rule and are always on the right-hand side.
A set of nonterminal symbols (or variables) which are placeholders for patterns of
terminal symbols that can be generated by the nonterminal symbols. These are the
symbols that will always appear on the left-hand side of the production rules, though
they can be included on the right-hand side. The strings that a CFG produces will
contain only symbols from the set of terminal symbols.
A set of production rules which are the rules for replacing nonterminal symbols.
Production rules have the following form: variable → string of variables and terminals.
A start symbol which is a special nonterminal symbol that appears in the initial string
generated by the grammar.
Example:
L = { wcwR | w ∈ (a, b)* }
Production rules:
1. S → aSa
2. S → bSb
3. S → c
Now check whether the string abbcbba can be derived from the given CFG.
S ⇒ aSa
S ⇒ abSba
S ⇒ abbSbba
S ⇒ abbcbba
By applying the production S → aSa once, the production S → bSb twice, and finally the
production S → c, we get the string abbcbba.
Derivation
A derivation is a sequence of production rule applications used to obtain the input string from
the start symbol. During parsing, we take two decisions for a sentential form of the input:
Deciding which non-terminal is to be replaced.
Deciding the production rule by which the non-terminal will be replaced.
To decide which non-terminal to replace, and in what order, we have two options.
Left-most Derivation
If the sentential form of an input is scanned and replaced from left to right, it is called left-
most derivation. The sentential form derived by the left-most derivation is called the left-
sentential form.
Right-most Derivation
If we scan and replace the input with production rules, from right to left, it is known as right-
most derivation. The sentential form derived from the right-most derivation is called the
right-sentential form.
Example
Production rules:
E→E+E
E→E-E
E → id
Input string: id + id - id
Parse Tree
A parse tree is a graphical depiction of a derivation. It is a convenient way to see how strings
are derived from the start symbol. The start symbol of the derivation becomes the root of the
parse tree. For the string id + id - id, two parse trees can be constructed, one from the
left-most derivation and one from the right-most derivation.
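For example, one possible left-most derivation and one possible right-most derivation of
id + id - id under this (ambiguous) grammar are:
Left-most derivation:
E ⇒ E + E ⇒ id + E ⇒ id + E - E ⇒ id + id - E ⇒ id + id - id
Right-most derivation:
E ⇒ E - E ⇒ E - id ⇒ E + E - id ⇒ E + id - id ⇒ id + id - id
Because the grammar is ambiguous, these two derivations yield different parse trees.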
Fig: Types of Top-Down Parsing Techniques
Backtracking
Backtracking is a technique in which, to expand a non-terminal symbol, we choose one
alternative and, if a mismatch occurs, we try another alternative, if any.
Example
Consider the grammar G: S → cAd, A → ab | a, and the input string w = cad.
The parse tree can be constructed using the following top-down approach :
Step1:
Initially create a tree with single node labeled S. An input pointer points to ‘c’, the first
symbol of w. Expand the tree with the production of S.
Step2:
The leftmost leaf ‘c’ matches the first symbol of w, so advance the input pointer to the second
symbol of w ‘a’ and consider the next leaf ‘A’. Expand A using the first alternative.
Step3:
The second symbol 'a' of w also matches the second leaf of the tree, so advance the input
pointer to the third symbol of w, 'd'. But the third leaf of the tree is 'b', which does not match
the input symbol 'd'. Hence we discard the chosen production, reset the input pointer to the
second symbol of w, and try the second alternative for A (i.e., A → a). This is called
backtracking.
A backtracking parser tries different production rules to find a match for the input string,
backtracking each time a mismatch occurs. Backtracking is more powerful than predictive
parsing, but it is slower and in general requires exponential time, so it is not preferred for
practical compilers.
Predictive parser
As the name suggests, a predictive parser tries to predict the next construction using
one or more lookahead symbols from the input string. There are two types of predictive parsers.
1. Recursive Descent parser
2. LL(1) parser
The following problems arise in top-down parsing:
1. Ambiguity
2. Backtracking
3. Left Recursion
4. Left Factoring
Ambiguity
A grammar is said to be ambiguous if there exists more than one leftmost derivation, more
than one rightmost derivation, or more than one parse tree for a given input string. If the
grammar is not ambiguous, it is called unambiguous.
Example:
S → aSb | SS
S → ε
For the string aabb, the above grammar generates two parse trees:
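The two trees correspond to two different left-most derivations, for example:
S ⇒ aSb ⇒ aaSbb ⇒ aabb
S ⇒ SS ⇒ S ⇒ aSb ⇒ aaSbb ⇒ aabb (taking the first S to ε)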
An ambiguous grammar is not suitable for compiler construction. No method can
automatically detect and remove ambiguity, but it can often be removed by re-writing the
grammar without ambiguity.
Backtracking
As described above, a backtracking parser expands a non-terminal using one alternative and,
on a mismatch, rewinds the input and tries another alternative, if any (see the example for the
grammar S → cAd, A → ab | a given earlier).
Left Recursion
A grammar is left-recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α. Top-down parsers cannot handle left-recursive grammars, so left
recursion must be eliminated.
There is a formal technique for eliminating left-recursion from productions of the form
A → Aα | β
Introduce a new nonterminal A' and rewrite the rule as
A → βA'
A' → αA' | ε
Thus the production:
E → E + T | T
is left-recursive with "E" playing the role of A, "+ T" playing the role of α, and "T" playing
the role of β. Introducing the new nonterminal E', the production can be replaced by:
E → T E'
E' → + T E' | ε
Of course, there may be more than one left-recursive part on the right-hand side. The general
rule is to replace
A → Aα1 | Aα2 | ... | Aαm | β1 | β2 | ... | βn
by
A → β1A' | β2A' | ... | βnA'
A' → α1A' | α2A' | ... | αmA' | ε
Step one describes a rule to eliminate direct left recursion from a production. To eliminate
left-recursion from an entire grammar may be more difficult because of indirect left-
recursion. For example,
A --> B x y | x
B --> C D
C --> A | c
D --> d
is indirectly recursive because
A ⇒ B x y ⇒ C D x y ⇒ A D x y.
That is, A ⇒ ... ⇒ Aα, where α is D x y.
There is an algorithm that eliminates left-recursion from an entire grammar: it orders the
nonterminals, substitutes productions to expose any indirect left recursion, and then calls a
procedure which eliminates direct left-recursion (as described in step one).
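As an illustration, following the standard algorithm with the nonterminals ordered A, B, C, D:
substituting A into C → A | c gives C → B x y | x | c, and substituting B → C D then gives
C → C D x y | x | c
which is directly left-recursive. Eliminating the direct left recursion:
C → x C' | c C'
C' → D x y C' | ε
The other productions (A → B x y | x, B → C D, D → d) are unchanged, and the resulting
grammar has no left recursion.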
Left Factoring
Left Factoring is a grammar transformation technique. It consists in "factoring out" prefixes
which are common to two or more productions.
Left factoring is removing the common left factor that appears in two productions of the same
non-terminal. It is done to avoid backtracking by the parser. Suppose the parser has a
one-symbol look-ahead; consider this example:
A -> qB | qC
where A, B, C are non-terminals and q is a string of terminals. In this case, the parser cannot
decide which of the two productions to choose and may have to backtrack. After left
factoring, the grammar is converted to:
A -> qD
D -> B | C
In this case, a parser with a look-ahead will always choose the right production.
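Another classic illustration is the dangling-else style grammar
S → iEtS | iEtSeS | a
E → b
(where i, t, e stand for if, then, else). The two S-productions share the common prefix iEtS,
so left factoring gives:
S → iEtSS′ | a
S′ → eS | ε
E → b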
Left recursion is the case in which the left-most symbol of a production of a non-terminal is
that non-terminal itself (direct left recursion), or in which the non-terminal rewrites to itself
through some other non-terminal definitions (indirect left recursion). Consider the following
grammar, which is free of left recursion and is used below to build a recursive descent parser:
E -> TE′
E′-> +TE′ | €
T -> FT′
T′-> *FT′ | €
F -> (E) | id
Recursive procedures for the recursive descent parser for the given grammar are given
below.
procedure E( )
{
T( );
E′( );
if (Lookahead == '$')
declare as Success ;
else error;
}
procedure T ( )
{
F( );
T′( );
}
Procedure E′( )
{
if (Lookahead= = '+')
{
match('+');
T ( );
E′( );
}
else if (Lookahead == ')' || Lookahead == '$')
{
/* E′ → ε , so do nothing */
}
else error;
}
procedure T′( )
{
if (Lookahead == '*')
{
match('*');
F ( );
T′( );
}
else if (Lookahead == '+' || Lookahead == ')' || Lookahead == '$')
{
/* T′ → ε , so do nothing */
}
else error;
}
procedure F( )
{
if (Lookahead == '(' )
{
match( '(' );
E( );
match( ')' );
}
else if (Lookahead == 'id')
{
match('id');
}
else error;
}
procedure match(token t)
{
if (Lookahead == t)
Lookahead = next token;
else error;
}
Procedure Error ( )
{
printf(" Error!");
}
Procedure NULL( )
{
printf(" Empty!");
}
A table-driven predictive parser has an input buffer, a stack, a parsing table, and an output
stream. The input buffer contains the string to be parsed, followed by $, a symbol used as a
right end marker to indicate the end of the input string. The stack contains a sequence of
grammar symbols with $ on the bottom, indicating the bottom of the stack. Initially, the stack
contains the start symbol of the grammar on top of $. The parsing table is a two dimensional
array M[A,a] where A is a nonterminal, and a is a terminal or the symbol $. The parser is
controlled by a program that behaves as follows. The program considers X, the symbol on the
top of the stack, and a, the current input symbol. These two symbols determine the action of
the parser. There are three possibilities.
1. If X = a = $, the parser halts and announces successful completion of parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to the next
input symbol.
3. If X is a nonterminal, the program consults entry M[X,a] of the parsing table M. This
entry will be either an X-production of the grammar or an error entry. If, for example,
M[X,a] = {X → UVW}, the parser replaces X on top of the stack by WVU (with U on
top). As output, we shall assume that the parser just prints the production used; any
other code could be executed here. If M[X,a] = error, the parser calls an error recovery
routine.
Algorithm for predictive LL(1) Parsing.
Input: A string w and a parsing table M for grammar G.
Output: If w is in L(G), a leftmost derivation of w; otherwise, an error indication.
Method: Initially, the parser is in a configuration in which it has $S on the stack with S,
the start symbol of G on top, and w$ in the input buffer. The program that utilizes the
predictive parsing table M to produce a parse for the input is shown in Fig.
Set ip to point to the first symbol of w$.
repeat
let X be the top stack symbol and a the symbol pointed to by ip;
if X is a terminal or $ then
if X = a then
pop X from the stack and advance ip
else error()
else /* X is a non-terminal */
if M[X, a] = X → Y1 Y2 ... Yk then begin
pop X from the stack;
push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
output the production X → Y1 Y2 ... Yk
end
else error()
until X=$ /* stack is empty */
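The following is a minimal C sketch of such a driver for the LL(1) expression grammar used
in the example below (E → TE′, E′ → +TE′ | ε, T → FT′, T′ → *FT′ | ε, F → (E) | id). The
hard-coded table, the single-character token codes ('i' for id, 'e' for E′, 't' for T′) and the
function names are assumptions made only for this illustration.

/* Table-driven LL(1) parsing sketch for the expression grammar. */
#include <stdio.h>
#include <string.h>

#define MAXSTACK 100

static char stack[MAXSTACK];
static int top = 0;

/* push a right-hand side in reverse, so its leftmost symbol ends up on top */
static void push(const char *s)
{
    int i;
    for (i = (int)strlen(s) - 1; i >= 0; i--)
        stack[top++] = s[i];
}

/* M(X, a): the right-hand side to expand with, NULL for an error entry,
   "" for an epsilon production */
static const char *M(char X, char a)
{
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "Te" : NULL;
    case 'e': return (a == '+') ? "+Te" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "Ft" : NULL;
    case 't': return (a == '*') ? "*Ft"
                   : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return (a == 'i') ? "i" : (a == '(') ? "(E)" : NULL;
    default:  return NULL;
    }
}

static int parse(const char *w)
{
    int ip = 0;
    top = 0;
    push("$");                     /* $ marks the bottom of the stack */
    push("E");                     /* start symbol on top */
    for (;;) {
        char X = stack[--top];     /* pop the top stack symbol */
        char a = w[ip];
        if (X == '$')
            return a == '$';       /* accept only when the input is exhausted */
        if (X == 'i' || X == '+' || X == '*' || X == '(' || X == ')') {
            if (X != a) return 0;  /* terminal mismatch: error */
            ip++;                  /* match: advance the input pointer */
        } else {
            const char *rhs = M(X, a);
            if (rhs == NULL) return 0;               /* error entry */
            printf("%c -> %s\n", X, *rhs ? rhs : "epsilon");
            push(rhs);             /* an epsilon production pushes nothing */
        }
    }
}

int main(void)
{
    /* 'i' stands for the token id, so this is id + id * id $ */
    printf(parse("i+i*i$") ? "accepted\n" : "rejected\n");
    return 0;
}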
Steps for predictive LL(1) parsing:
1. Eliminate left recursion and perform left factoring on the given grammar, if required.
2. Compute the FIRST and FOLLOW sets for the non-terminals.
3. Construct the predictive parsing table.
4. Parse the given input string using the stack and the parsing table.
Example:
Consider the following grammar:
E → E + T | T
T → T * F | F
F → ( E ) | id
After eliminating left recursion, the grammar becomes:
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
Solution
First( ) :
FIRST(E) = { ( , id}
FIRST(E’) ={+ , ε }
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
Predictive parsing Table
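From the FIRST and FOLLOW sets above, the table entries work out as follows (a sketch of
the table, using the same grammar):

Non-terminal |   id      |    +       |     *      |    (      |    )     |    $
E            | E → TE′   |            |            | E → TE′   |          |
E′           |           | E′ → +TE′  |            |           | E′ → ε   | E′ → ε
T            | T → FT′   |            |            | T → FT′   |          |
T′           |           | T′ → ε     | T′ → *FT′  |           | T′ → ε   | T′ → ε
F            | F → id    |            |            | F → (E)   |          |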
Stack Implementation
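A sketch of the stack-based parse of the input id + id * id using this table (stack top on the
right) proceeds as follows:

Stack      | Input           | Output / Action
$E         | id + id * id $  | E → TE′
$E′T       | id + id * id $  | T → FT′
$E′T′F     | id + id * id $  | F → id
$E′T′id    | id + id * id $  | match id
$E′T′      | + id * id $     | T′ → ε
$E′        | + id * id $     | E′ → +TE′
$E′T+      | + id * id $     | match +
$E′T       | id * id $       | T → FT′
$E′T′F     | id * id $       | F → id
$E′T′id    | id * id $       | match id
$E′T′      | * id $          | T′ → *FT′
$E′T′F*    | * id $          | match *
$E′T′F     | id $            | F → id
$E′T′id    | id $            | match id
$E′T′      | $               | T′ → ε
$E′        | $               | E′ → ε
$          | $               | accept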
Fig: Types of Bottom up Parser
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as
shift-step and reduce-step.
Shift step: The shift step refers to pushing the current input symbol onto the stack and
advancing the input pointer to the next input symbol. The shifted symbol is treated as a
single node of the parse tree.
Reduce step: When the parser finds a complete right-hand side (RHS) of a grammar rule on
top of the stack and replaces it with the corresponding left-hand side (LHS), this is known as
a reduce step. It occurs when the top of the stack contains a handle. To reduce, the handle is
popped off the stack and replaced by the LHS non-terminal symbol.
A shift-reduce parser constructs the parse tree from the leaves up to the root; it works on the
bottom-up principle, maintaining a stack of grammar symbols and an input buffer holding
the rest of the string to be parsed.
The parser performs the following basic operations:
Shift: Moving a symbol from the input buffer onto the stack; this action is called shift.
Reduce: If the handle appears on top of the stack, it is reduced by the appropriate rule;
that is, the RHS of the rule is popped off and the LHS non-terminal is pushed on.
Accept: If the stack contains only the start symbol and the input buffer is empty at the same
time, that action is called accept.
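For example, with the grammar E → E + T | T, T → T * F | F, F → ( E ) | id, a possible
sequence of moves on the input id + id * id is:

Stack          | Input           | Action
$              | id + id * id $  | shift
$ id           | + id * id $     | reduce F → id
$ F            | + id * id $     | reduce T → F
$ T            | + id * id $     | reduce E → T
$ E            | + id * id $     | shift
$ E +          | id * id $       | shift
$ E + id       | * id $          | reduce F → id
$ E + F        | * id $          | reduce T → F
$ E + T        | * id $          | shift
$ E + T *      | id $            | shift
$ E + T * id   | $               | reduce F → id
$ E + T * F    | $               | reduce T → T * F
$ E + T        | $               | reduce E → E + T
$ E            | $               | accept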
2.8 OPERATOR PRECEDENCE PARSING
Operator precedence parsing is a kind of shift-reduce parsing method. It is applied to a small
class of grammars known as operator grammars.
Operator precedence relations can only be established between the terminals of the grammar;
the non-terminals are ignored. Three relations are used:
a ⋗ b means that terminal "a" has higher precedence than terminal "b".
a ⋖ b means that terminal "a" has lower precedence than terminal "b".
a ≐ b means that terminals "a" and "b" both have the same precedence.
Precedence table:
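For example, for the operator grammar E → E + E | E * E | id, the precedence relations
between the terminals are typically:

     |  +    *    id   $
  +  |  ⋗    ⋖    ⋖    ⋗
  *  |  ⋗    ⋗    ⋖    ⋗
  id |  ⋗    ⋗         ⋗
  $  |  ⋖    ⋖    ⋖    accept

(each row gives the relation of the stack-top terminal to the next input terminal)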
2.9 LR PARSERS
LR parsing is one type of bottom-up parsing. It is used to parse a large class of grammars.
Figure: Structure of LR Parser
Input
Contains the string to be parsed and pointer
Parsing Table
Contains two parts ACTION and GOTO which is used by the parser program
1. ACTION Part
The ACTION part of the table is a two-dimensional array indexed by state and the input
symbol, i.e. ACTION[state][input]. An action table entry can have one of the following four
kinds of values in it. They are:
1. Shift X, where X is a state number.
2. Reduce X, where X is a production number.
3. Accept, signifying the completion of a successful parse.
4. Error, signifying a syntax error.
2. GO TO Part
The GO TO part of the table is a two dimensional array indexed by state and a Non terminal,
i.e. GOTO[state][NonTerminal]. A GO TO entry has a state number in the table.
Augment Grammar
The augmented grammar G′ is G with a new start symbol S′ and an additional production
S′ → S. This helps the parser identify when to stop parsing and announce acceptance of the
input. The input string is accepted if and only if the parser is about to reduce by S′ → S.
NOTE: Augmenting a grammar simply adds one extra production while preserving the
meaning of the given grammar G.
Stack
Contains string of the form s0X1s1X2…..Xmsm where sm is on the top of the stack. Each Xi
denotes a grammar symbol and si a state. Each state symbol summarizes the information
contained in the stack below it
Parser Program
It uses sm, the state on top of the stack, and the current input symbol a to determine the
next parsing action.
Types of LR Parsers
SLR parsing, CLR parsing and LALR parsing.
In SLR (1) parsing, we place the reduce moves only under the symbols in FOLLOW of the
left-hand side non-terminal of the production.
Various steps involved in the SLR (1) Parsing:
o For the given input string write a context free grammar
o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (0) items
o Draw the DFA (deterministic finite automaton) for the canonical collection
o Construct a SLR (1) parsing table
Example: Construct the SLR (1) parser for the given grammar
E→E+T
E→T
T→T*F
T→ F
F →(E)
F → id
Solution:
The canonical collection of LR (0) items is constructed as follows.
Closure
I0:
E’ → .E
E → .E + T
E → .T
T → .T * F
T → .F
F → .( E )
F → .id
I1: GOTO(I0,E)
E’ → E.
E → E.+ T
I2: GOTO(I0,T)
E → T.
T → T .* F
I3: GOTO(I0,F)
T → F.
I4: GOTO(I0,( )
F → (.E)
E→.E+T
E → .T
T → .T * F
T → .F
F → .( E )
F → .id
I5: GOTO(I0,id)
F → id.
I6: GOTO(I1, +)
E → E + .T
T → .T * F
T → .F
F → .( E )
F → .id
I7: GOTO(I2,*)
T → T * .F
F → .( E)
F → .id
I8: GOTO(I4,E)
F → ( E .)
E → E. + T
I9: GOTO(I6,T)
E → E + T.
T → T. * F
I10: GOTO(I7,F)
T → T * F.
I11: GOTO(I8,) )
F → ( E ).
SLR (1) parsing table (ACTION and GOTO parts):

STATE |   id    +     *     (     )     $   |   E    T    F
  0   |   s5                s4              |   1    2    3
  1   |         s6                    acc   |
  2   |         r2    s7          r2   r2   |
  3   |         r4    r4          r4   r4   |
  4   |   s5                s4              |   8    2    3
  5   |         r6    r6          r6   r6   |
  6   |   s5                s4              |        9    3
  7   |   s5                s4              |             10
  8   |         s6                s11       |
  9   |         r1    s7          r1   r1   |
 10   |         r3    r3          r3   r3   |
 11   |         r5    r5          r5   r5   |

Here sX means shift and go to state X, rX means reduce by production number X
(1: E → E + T, 2: E → T, 3: T → T * F, 4: T → F, 5: F → ( E ), 6: F → id), and acc means accept.
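Using this table, a trace of the parser on the input id * id + id proceeds roughly as follows:

Stack            | Input           | Action
0                | id * id + id $  | shift 5
0 id 5           | * id + id $     | reduce by F → id
0 F 3            | * id + id $     | reduce by T → F
0 T 2            | * id + id $     | shift 7
0 T 2 * 7        | id + id $       | shift 5
0 T 2 * 7 id 5   | + id $          | reduce by F → id
0 T 2 * 7 F 10   | + id $          | reduce by T → T * F
0 T 2            | + id $          | reduce by E → T
0 E 1            | + id $          | shift 6
0 E 1 + 6        | id $            | shift 5
0 E 1 + 6 id 5   | $               | reduce by F → id
0 E 1 + 6 F 3    | $               | reduce by T → F
0 E 1 + 6 T 9    | $               | reduce by E → E + T
0 E 1            | $               | accept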
CLR refers to canonical LR. CLR parsing uses the canonical collection of LR (1) items
to build the CLR (1) parsing table. CLR (1) parsing produces more states than SLR (1)
parsing.
In the CLR (1) table, we place the reduce moves only under the lookahead symbols. Various
steps involved in the CLR (1) Parsing:
o For the given input string write a context free grammar
o Check the ambiguity of the grammar
o Add Augment production in the given grammar
o Create Canonical collection of LR (1) items
o Draw the DFA
o Construct a CLR (1) parsing table
LR (1) item
An LR (1) item is a production with a dot at some position in its right-hand side, together
with a lookahead terminal, written [A → α•β, a].
The lookahead is used to determine where we place the final (reduce) item.
The lookahead $ is always added for the augment production.
S → AA
A → aA
A→b
Add Augment Production, insert '•' symbol at the first position for every production in G and
also add the lookahead.
S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I0 State:
Add all productions starting with S in to I0 State because "." is followed by the non-terminal.
So, the I0 State becomes
I0 = S` → •S, $
S → •AA, $
Add all productions starting with A in modified I0 State because "." is followed by the non-
terminal. So, the I0 State becomes.
I0= S` → •S, $
S → •AA, $
A → •aA, a/b
A → •b, a/b
I1= Go to (I0, S) = closure (S` → S•, $) = S` → S•, $
I2= Go to (I0, A) = closure ( S → A•A, $ )
Add all productions starting with A in I2 State because "." is followed by the non-terminal.
So, the I2 State becomes
I2= S → A•A, $
A → •aA, $
A → •b, $
I3= Go to (I0, a) = Closure (A → a•A, a/b)
Add all productions starting with A in I3 State because "•" is followed by the non-terminal.
So, the I3 State becomes
I3= A → a•A, a/b
A → •aA, a/b
A → •b, a/b
I4= Go to (I0, b) = Closure (A → b•, a/b) = A → b•, a/b
I5= Go to (I2, A) = Closure (S → AA•, $) = S → AA•, $
I6= Go to (I2, a) = Closure (A → a•A, $)
Add all productions starting with A in I6 State because "•" is followed by the non-terminal.
So, the I6 State becomes
I6 = A → a•A, $
A → •aA, $
A → •b, $
I7= Go to (I2, b) = Closure (A → b•, $) = A → b•, $
I8= Go to (I3, A) = Closure (A → aA•, a/b) = A → aA•, a/b
I9= Go to (I6, A) = Closure (A → aA•, $) = A → aA•, $
Drawing DFA:
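The DFA is drawn from the goto transitions among the item sets above. From it, the CLR (1)
parsing table can be sketched as follows (productions numbered 1: S → AA, 2: A → aA,
3: A → b):

STATE |   a     b     $   |   S    A
  0   |   s3    s4        |   1    2
  1   |              acc  |
  2   |   s6    s7        |        5
  3   |   s3    s4        |        8
  4   |   r3    r3        |
  5   |              r1   |
  6   |   s6    s7        |        9
  7   |              r3   |
  8   |   r2    r2        |
  9   |              r2   |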
LALR refers to the lookahead LR. To construct the LALR (1) parsing table, we use the
canonical collection of LR (1) items.
In LALR (1) parsing, the LR (1) items which have the same productions but different
lookaheads are combined to form a single set of items.
LALR (1) parsing is the same as CLR (1) parsing; the only difference is in the parsing table.
The canonical collection of LR (1) items is constructed exactly as in the CLR (1) example
above (states I0 to I9). In addition, note that
Go to (I3, a) = Closure (A → a•A, a/b) = (same as I3)
Go to (I3, b) = Closure (A → b•, a/b) = (same as I4)
If we analyse the states, the LR (0) items of I3 and I6 are the same; they differ only in their
lookaheads.
I3 = { A → a•A, a/b
A → •aA, a/b
A → •b, a/b
}
I6= { A → a•A, $
A → •aA, $
A → •b, $
}
Clearly I3 and I6 are the same in their LR (0) items and differ only in their lookaheads, so we
can combine them and call the new state I36.
I36 = { A → a•A, a/b/$
A → •aA, a/b/$
A → •b, a/b/$
}
I4 and I7 are the same except for their lookaheads, so we can combine them and call the new
state I47:
I47 = { A → b•, a/b/$ }
I8 and I9 are the same except for their lookaheads, so we can combine them and call the new
state I89:
I89 = { A → aA•, a/b/$ }
Drawing DFA:
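After merging I3/I6, I4/I7 and I8/I9, the DFA transitions collapse accordingly, and the
LALR (1) parsing table can be sketched as follows (same production numbering as above):

STATE |   a      b      $   |   S    A
  0   |   s36    s47        |   1    2
  1   |                acc  |
  2   |   s36    s47        |        5
 36   |   s36    s47        |        89
 47   |   r3     r3     r3  |
  5   |                r1   |
 89   |   r2     r2     r2  |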
YACC Specification
The YACC specification file consists of three parts, separated by %% lines:
declarations, translation rules, and supporting C routines.
Declaration section: In this section, ordinary C declarations are placed between %{ and %},
and grammar tokens are declared with %token declarations.
Translation rule section
It includes the production rules of context free grammar with corresponding actions
Rule-1 action-1
Rule-2 action-2
:
:
Rule n action n
If there is more than one alternative for a single rule, the alternatives are separated by the '|'
(pipe) character. The actions are typical C statements. If the CFG is
LHS: alternative 1 | alternative 2 | …… | alternative n
then it is written as
LHS: alternative 1 {action 1}
| alternative 2 {action 2}
:
:
| alternative n {action n}
C functions section: This consists of a main function in which the routine yyparse() is
called, together with any other required C functions (such as yylex()).
Example:
YACC Specification of a simple desk calculator:
%{
#include <ctype.h>
%}
%token DIGIT
%%
line : expr '\n'        { printf("%d\n", $1); }
     ;
expr : expr '+' term    { $$ = $1 + $3; }
     | term
     ;
term : term '*' factor  { $$ = $1 * $3; }
     | factor
     ;
factor : '(' expr ')'   { $$ = $2; }
| DIGIT
;
%%
yylex() {
    int c;
    c = getchar();
    if (isdigit(c)) {
        yylval = c - '0';
        return DIGIT;
    }
    return c;
}
The declaration keywords %left, %right and %nonassoc inform YACC that the following
tokens are to be treated as left-associative (as the binary operators +, -, * and / commonly
are), right-associative (as exponentiation often is), or non-associative (as the binary
operators < and > often are).
The order of the declarations informs YACC that tokens declared later have higher
precedence.
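A typical declarations part for this calculator, consistent with the precedence discussion
above (the exact header block and value type are assumptions for illustration), would be:

%{
#include <ctype.h>
#include <stdio.h>
#define YYSTYPE double  /* expression values are doubles */
%}
%token NUMBER
%left '+' '-'
%left '*' '/'
%right UMINUS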
%%
lines : lines expr '\n'   { printf("%g\n", $2); }
      | lines '\n'
      | /* empty */
      ;
expr : expr '+' expr            { $$ = $1 + $3; }
     | expr '-' expr            { $$ = $1 - $3; }
     | expr '*' expr            { $$ = $1 * $3; }
     | expr '/' expr            { $$ = $1 / $3; }
     | '(' expr ')'             { $$ = $2; }
     | '-' expr %prec UMINUS    { $$ = -$2; }
     | NUMBER
     ;
%%
yylex() {
    int c;
    while (( c = getchar() ) == ' ' )
        ;                            /* skip blanks */
    if ( c == '.' || isdigit(c) ) {
        ungetc(c, stdin);
        scanf("%lf", &yylval);
        return NUMBER;
    }
    return c;
}