Syntax Analyzer: CS416 Compiler Design

Syntax Analyzer
• The syntax analyzer creates the syntactic structure of the given source program.
• This syntactic structure is usually a parse tree.
• The syntax analyzer is also known as the parser.
• The syntax of a programming language is described by a context-free grammar (CFG). We will
use BNF (Backus-Naur Form) notation in the description of CFGs.
• The syntax analyzer (parser) checks whether a given source program satisfies the rules
implied by a context-free grammar or not.
– If it does, the parser creates the parse tree of that program.
– Otherwise, the parser reports error messages.
• A context-free grammar
– gives a precise syntactic specification of a programming language.
– the design of the grammar is an initial phase of the design of a compiler.
– a grammar can be directly converted into a parser by some tools.

Parser
• Parser works on a stream of tokens.
• The smallest item is a token.
source program → Lexical Analyzer → token → Parser → parse tree
(the Parser requests each token from the Lexical Analyzer with "get next token")

Parsers (cont.)
• We categorize the parsers into two groups:
1. Top-Down Parser
– the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
– the parse tree is created bottom to top, starting from the leaves.
• Both top-down and bottom-up parsers scan the input from left to right
(one symbol at a time).
• Efficient top-down and bottom-up parsers can be implemented only for
sub-classes of context-free grammars.
– LL for top-down parsing (scan left to right, construct a left-most derivation)
– LR for bottom-up parsing (scan left to right, construct a right-most derivation in reverse)
Syntax Error Handling
Programs may contain errors at many different levels.
For example, an error may be
• Lexical, such as misspelling an identifier, keyword, or operator
• Syntactic, such as an arithmetic expression with unbalanced parentheses
• Semantic, such as an operator applied to an incompatible operand
• Logical, such as an infinitely recursive call

Error Recovery Strategies
• Panic Mode
• Phrase level
• Error productions
• Global correction

Context-Free Grammars
• Inherently recursive structures of a programming language are defined
by a context-free grammar.
• In a context-free grammar, we have:
– A finite set of terminals (in our case, this will be the set of tokens)
– A finite set of non-terminals (syntactic-variables)
– A finite set of production rules of the following form
• A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the
empty string)
– A start symbol (one of the non-terminal symbols)
• Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id

Derivations
E ⇒ E+E
• E+E derives from E
– we can replace E by E+E
– to be able to do this, we need a production rule E → E+E in our grammar.
E ⇒ E+E ⇒ id+E ⇒ id+id

• A sequence of replacements of non-terminal symbols is called a derivation; the sequence above is a derivation of id+id from E.
• In general, a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our grammar,
where α and β are arbitrary strings of terminal and non-terminal symbols
α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)
⇒ : derives in one step
⇒* : derives in zero or more steps
⇒+ : derives in one or more steps

Derivation Example
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
• At each derivation step, we can choose any of the non-terminals in the sentential form
of G for the replacement.
• If we always choose the left-most non-terminal in each derivation step, the derivation
is called a left-most derivation.
• If we always choose the right-most non-terminal in each derivation step, the
derivation is called a right-most derivation.
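The two derivations above can be replayed mechanically. The following Python sketch is an illustration only: the names PRODUCTIONS, leftmost_step, rightmost_step and derive are hypothetical helpers, not part of the course material. A sentential form is a list of symbols, and each step replaces the chosen non-terminal by one right-hand side of E.

```python
# Productions of E from the example grammar: E -> E+E | E-E | E*E | E/E | -E | (E) | id
PRODUCTIONS = {
    "E": [["E", "+", "E"], ["E", "-", "E"], ["E", "*", "E"], ["E", "/", "E"],
          ["-", "E"], ["(", "E", ")"], ["id"]],
}

def _replace(form, index, rhs):
    """Rewrite the non-terminal at `index` using the right-hand side `rhs`."""
    head = form[index]
    if rhs not in PRODUCTIONS[head]:
        raise ValueError(f"{head} -> {''.join(rhs)} is not a production")
    return form[:index] + rhs + form[index + 1:]

def leftmost_step(form, rhs):
    """Replace the left-most non-terminal of the sentential form."""
    index = min(i for i, sym in enumerate(form) if sym in PRODUCTIONS)
    return _replace(form, index, rhs)

def rightmost_step(form, rhs):
    """Replace the right-most non-terminal of the sentential form."""
    index = max(i for i, sym in enumerate(form) if sym in PRODUCTIONS)
    return _replace(form, index, rhs)

def derive(step, choices, start="E"):
    """Apply `step` once per chosen right-hand side and record every form."""
    form = [start]
    forms = ["".join(form)]
    for rhs in choices:
        form = step(form, rhs)
        forms.append("".join(form))
    return forms

# The same sequence of productions drives both derivations of -(id+id).
STEPS = [["-", "E"], ["(", "E", ")"], ["E", "+", "E"], ["id"], ["id"]]
LEFT = derive(leftmost_step, STEPS)
RIGHT = derive(rightmost_step, STEPS)

print(" => ".join(LEFT))
print(" => ".join(RIGHT))
```

Both runs end in the same sentence; they differ only in the fourth sentential form, where the left-most run has rewritten the left E and the right-most run the right E.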

Left-Most and Right-Most Derivations
Left-Most Derivation
E ⇒lm -E ⇒lm -(E) ⇒lm -(E+E) ⇒lm -(id+E) ⇒lm -(id+id)
Right-Most Derivation
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)
• We will see that the top-down parsers try to find the left-most
derivation of the given source program.
• We will see that the bottom-up parsers try to find the right-most
derivation of the given source program in the reverse order.
Parse Tree
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
[Figure: the parse tree grown one derivation step at a time; the final tree is E( -, E( (, E( E(id), +, E(id) ), ) ) ).]

Ambiguity
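• A grammar that produces more than one parse tree for a sentence is
called an ambiguous grammar.
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Parse tree 1: the root is E → E + E; the left E is id, and the right E expands to E * E with leaves id and id.]
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Parse tree 2: the root is E → E * E; the left E expands to E + E with leaves id and id, and the right E is id.]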
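One way to see the ambiguity concretely is to count parse trees. The sketch below is a hypothetical helper (count_parse_trees is not from the slides) restricted to the sub-grammar E → E+E | E*E | id used in this example; it tries every operator as the root of the tree and multiplies the counts of the two sides.

```python
from functools import lru_cache

def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct trees in which E derives tokens[i:j].
        total = 1 if j - i == 1 and tokens[i] == "id" else 0
        # Try each operator position k as the root production E -> E op E.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))

print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A sentence with a count greater than one witnesses that the grammar is ambiguous; a count of one for a particular sentence proves nothing about the grammar as a whole.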

Ambiguity (cont.)
stmt → if expr then stmt |
if expr then stmt else stmt | otherstmts
if E1 then if E2 then S1 else S2
[Figure: two parse trees for this statement.
Tree 1: the else belongs to the outer if: if E1 then (if E2 then S1) else S2.
Tree 2: the else belongs to the inner if: if E1 then (if E2 then S1 else S2).]
Ambiguity – Operator Precedence
• Ambiguous grammars (because of ambiguous operators) can be
disambiguated according to the precedence and associativity rules.
E → E+E | E*E | E^E | id | (E)
⇓ disambiguate the grammar
precedence (highest first): ^ (right to left)
                            * (left to right)
                            + (left to right)
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
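The effect of the disambiguated grammar can be made visible by parsing an expression and printing its grouping. The following is a minimal sketch under stated assumptions: the function parenthesize and its helpers are hypothetical, single names such as a, b, c stand in for the token id, and the left-recursive rules for E and T are written as loops, which preserves left associativity.

```python
def parenthesize(tokens):
    """Parse with E -> E+T | T, T -> T*F | F, F -> G^F | G, G -> id | (E)
    and return a fully parenthesized string that shows the grouping."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else "$"

    def eat(expected=None):
        nonlocal pos
        tok = peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, found {tok!r}")
        pos += 1
        return tok

    def parse_e():                 # E -> E+T | T, as a loop (left-associative)
        left = parse_t()
        while peek() == "+":
            eat("+")
            right = parse_t()
            left = f"({left}+{right})"
        return left

    def parse_t():                 # T -> T*F | F, as a loop (left-associative)
        left = parse_f()
        while peek() == "*":
            eat("*")
            right = parse_f()
            left = f"({left}*{right})"
        return left

    def parse_f():                 # F -> G^F | G, right-recursive (right-associative)
        base = parse_g()
        if peek() == "^":
            eat("^")
            exponent = parse_f()
            return f"({base}^{exponent})"
        return base

    def parse_g():                 # G -> id | (E)
        if peek() == "(":
            eat("(")
            inner = parse_e()
            eat(")")
            return inner
        tok = eat()
        if tok in ("+", "*", "^", ")", "$"):
            raise SyntaxError(f"unexpected {tok!r}")
        return tok

    result = parse_e()
    if peek() != "$":
        raise SyntaxError(f"unexpected {peek()!r}")
    return result

print(parenthesize("a + b * c".split()))
print(parenthesize("a ^ b ^ c".split()))
```

Because each precedence level has its own non-terminal, every input has exactly one grouping: * binds tighter than +, and ^ groups to the right.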

Left Recursion
• A grammar is left recursive if it has a non-terminal A such that there is
a derivation
A ⇒+ Aα for some string α
• Top-down parsing techniques cannot handle left-recursive grammars.

• So, we have to convert our left-recursive grammar into an equivalent
grammar which is not left-recursive.
• The left-recursion may appear in a single step of the derivation
(immediate left-recursion), or may appear in more than one step of
the derivation.

Immediate Left-Recursion
A → Aα | β   where β does not start with A
⇓ eliminate immediate left recursion
A → β A’
A’ → α A’ | ε   (an equivalent grammar)
In general,
A → Aα1 | ... | Aαm | β1 | ... | βn   where β1 ... βn do not start with A
⇓ eliminate immediate left recursion
A → β1 A’ | ... | βn A’
A’ → α1 A’ | ... | αm A’ | ε   (an equivalent grammar)

Immediate Left-Recursion -- Example
E → E+T | T
T → T*F | F
F → id | (E)
⇓ eliminate immediate left recursion
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
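The transformation is mechanical enough to write down as a short function. This is a sketch, not the course's code: the function name is hypothetical, a right-hand side is a list of symbols, the empty list stands for ε, and the new non-terminal is written with an ASCII apostrophe (E') in place of the prime used on the slides.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | A am | b1 | ... | bn without immediate left recursion.

    Returns a dict that maps each non-terminal to its list of alternatives.
    """
    # The alpha_i: what follows A in the left-recursive alternatives.
    alphas = [alt[1:] for alt in alternatives if alt[:1] == [nonterminal]]
    # The beta_j: the alternatives that do not start with A.
    betas = [alt for alt in alternatives if alt[:1] != [nonterminal]]
    if not alphas:
        return {nonterminal: alternatives}      # nothing to eliminate
    new = nonterminal + "'"
    return {
        nonterminal: [beta + [new] for beta in betas],          # A  -> beta_j A'
        new: [alpha + [new] for alpha in alphas] + [[]],        # A' -> alpha_i A' | epsilon
    }

print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
print(eliminate_immediate_left_recursion("T", [["T", "*", "F"], ["F"]]))
```

The sketch handles only immediate left recursion; left recursion that appears after more than one derivation step needs the more general procedure.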

Parsing Strategies
• Top Down
• Bottom Up
Example grammar:
S → aABe
A → Abc | b
B → d
Input string: abbcde

Top-Down Parsing
• The parse tree is created top to bottom.
• Top-down parser
– Recursive-Descent Parsing
• Backtracking is needed (if a choice of a production rule does not work, we backtrack to try other
alternatives).
• It is a general parsing technique, but not widely used.
• It is not efficient.
– Predictive Parsing
• no backtracking
• efficient
• needs a special form of grammars (LL(1) grammars).
• Recursive Predictive Parsing is a special form of Recursive Descent parsing without backtracking.
• Non-Recursive (Table Driven) Predictive Parser is also known as LL(1) parser.
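A recursive predictive parser has one procedure per non-terminal and picks a production by looking only at the current token. The sketch below assumes the non-left-recursive expression grammar E → TE’, E’ → +TE’ | ε, T → FT’, T’ → *FT’ | ε, F → (E) | id; the function recognize and its inner names are hypothetical, and the sketch only accepts or rejects without building a tree.

```python
def recognize(tokens):
    """Return True if `tokens` (for example ["id", "+", "id"]) is derived from E."""
    tokens = list(tokens) + ["$"]      # $ marks the end of the input
    pos = 0

    def match(expected):
        nonlocal pos
        if tokens[pos] != expected:
            raise SyntaxError(f"expected {expected!r}, found {tokens[pos]!r}")
        pos += 1

    def parse_e():                     # E -> T E'
        parse_t()
        parse_e_prime()

    def parse_e_prime():               # E' -> + T E' | epsilon
        if tokens[pos] == "+":
            match("+")
            parse_t()
            parse_e_prime()
        # any other token: choose the epsilon alternative and consume nothing

    def parse_t():                     # T -> F T'
        parse_f()
        parse_t_prime()

    def parse_t_prime():               # T' -> * F T' | epsilon
        if tokens[pos] == "*":
            match("*")
            parse_f()
            parse_t_prime()

    def parse_f():                     # F -> ( E ) | id
        if tokens[pos] == "(":
            match("(")
            parse_e()
            match(")")
        else:
            match("id")

    try:
        parse_e()
        match("$")                     # the whole input must be consumed
        return True
    except SyntaxError:
        return False

print(recognize(["id", "+", "id", "*", "id"]))
print(recognize(["id", "+"]))
```

No procedure ever undoes a choice, which is what distinguishes this from recursive descent with backtracking.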

Recursive-Descent Parsing (uses Backtracking)
• Backtracking is needed.
• It tries to find the left-most derivation.
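S → cAd
A → ab | a
Input string: w = cad
[Figure: two parse trees for S → cAd. In the first, A is expanded with A → ab, which fails on the input, so the parser backtracks. In the second, A is expanded with A → a, which succeeds.]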
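The backtracking in this example can be sketched with Python generators. This is an illustration under assumptions, not the slides' algorithm as stated: the names GRAMMAR, derive and accepts are hypothetical, input symbols are single characters, and the sketch would recurse forever on a left-recursive grammar.

```python
# S -> cAd, A -> ab | a (alternatives are tried in the order listed)
GRAMMAR = {"S": [["c", "A", "d"]], "A": [["a", "b"], ["a"]]}

def derive(symbols, text, pos):
    """Yield every input position reachable after matching `symbols` from `pos`."""
    if not symbols:
        yield pos
        return
    first, rest = symbols[0], symbols[1:]
    if first in GRAMMAR:
        # Non-terminal: try each alternative; exhausting one alternative
        # and moving on to the next is the backtracking step.
        for alternative in GRAMMAR[first]:
            for mid in derive(alternative, text, pos):
                yield from derive(rest, text, mid)
    elif pos < len(text) and text[pos] == first:
        # Terminal: it must equal the current input symbol.
        yield from derive(rest, text, pos + 1)

def accepts(text, start="S"):
    """True if some sequence of choices consumes the whole input."""
    return any(end == len(text) for end in derive([start], text, 0))

print(accepts("cad"))
print(accepts("cabd"))
```

For w = cad the first alternative A → ab matches a and then fails on d, so the generator produces nothing and the loop falls through to A → a.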

Predictive Parser
a grammar → eliminate left recursion → left factor → a grammar suitable for predictive parsing (an LL(1) grammar)
(there is no 100% guarantee that the result is LL(1))
• When rewriting a non-terminal in a derivation step, a predictive parser
can uniquely choose a production rule by looking only at the current symbol
in the input string.
A → α1 | ... | αn      input: ... a .......   (a is the current token)

Compute FIRST for Any String X
• If X is a terminal symbol ⇒ FIRST(X) = {X}
• If X is a non-terminal symbol and X → ε is a production rule
⇒ ε is in FIRST(X).
• If X is a non-terminal symbol and X → Y1Y2..Yn is a production rule
⇒ if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1,
then a is in FIRST(X).
⇒ if ε is in all FIRST(Yj) for j=1,...,n,
then ε is in FIRST(X).
• If X is ε ⇒ FIRST(X) = {ε}
• If X is a string Y1Y2..Yn
⇒ if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1,
then a is in FIRST(X).
⇒ if ε is in all FIRST(Yj) for j=1,...,n,
then ε is in FIRST(X).

FIRST Example
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(F) = {(, id}      FIRST(TE’) = {(, id}
FIRST(T’) = {*, ε}      FIRST(+TE’) = {+}
FIRST(T) = {(, id}      FIRST(ε) = {ε}
FIRST(E’) = {+, ε}      FIRST(FT’) = {(, id}
FIRST(E) = {(, id}      FIRST(*FT’) = {*}
                        FIRST(ε) = {ε}
                        FIRST((E)) = {(}
                        FIRST(id) = {id}
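The FIRST rules can be applied repeatedly until no set changes. The following Python sketch does this for the example grammar; the names GRAMMAR, first_of_string and compute_first are hypothetical, the empty list stands for an ε production, and E' is written with an ASCII apostrophe.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}

def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols, given the FIRST sets found so far."""
    result = set()
    for sym in symbols:
        sym_first = first[sym] if sym in first else {sym}   # terminal a: FIRST(a) = {a}
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result            # this symbol cannot vanish, so stop here
    result.add(EPS)                  # every symbol can vanish (or the string is empty)
    return result

def compute_first(grammar):
    """Iterate the FIRST rules until nothing more can be added."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

FIRST = compute_first(GRAMMAR)
for nt in GRAMMAR:
    print(f"FIRST({nt}) = {sorted(FIRST[nt])}")
print(sorted(first_of_string(["T", "E'"], FIRST)))
```

The same first_of_string helper gives FIRST of a right-hand side such as TE’ once the sets for the non-terminals are known.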

Compute FOLLOW (for non-terminals)
• If S is the start symbol ⇒ $ is in FOLLOW(S)
• If A → αBβ is a production rule
⇒ everything in FIRST(β) except ε is in FOLLOW(B)
• If ( A → αB is a production rule ) or
( A → αBβ is a production rule and ε is in FIRST(β) )
⇒ everything in FOLLOW(A) is in FOLLOW(B).
We apply these rules until nothing more can be added to any FOLLOW set.

FOLLOW Example
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
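The FOLLOW rules are also a fixed-point computation. To keep the sketch short, the FIRST sets of the non-terminals are copied from the FIRST example rather than recomputed; the names GRAMMAR, FIRST, first_of_string and compute_follow are hypothetical, and the empty list stands for an ε production.

```python
EPS, END = "ε", "$"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# FIRST sets of the non-terminals, taken from the FIRST example.
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}

def first_of_string(symbols):
    """FIRST of a string of grammar symbols (a terminal is its own FIRST)."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)                      # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for alt in alternatives:
                for i, b in enumerate(alt):
                    if b not in grammar:
                        continue                # FOLLOW is defined for non-terminals only
                    beta_first = first_of_string(alt[i + 1:])
                    new = beta_first - {EPS}    # rule 2: FIRST(beta) without epsilon
                    if EPS in beta_first:       # rule 3: beta is empty or can vanish
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow

FOLLOW = compute_follow(GRAMMAR, "E")
for nt in GRAMMAR:
    print(f"FOLLOW({nt}) = {sorted(FOLLOW[nt])}")
```

When B is the last symbol of a right-hand side, the string after it is empty, its FIRST is {ε}, and the third rule applies.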

Constructing LL(1) Parsing Table -- Algorithm
• for each production rule A → α of a grammar G
– for each terminal a in FIRST(α)
⇒ add A → α to M[A,a]
– if ε is in FIRST(α)
⇒ for each terminal a in FOLLOW(A), add A → α to M[A,a]
– if ε is in FIRST(α) and $ is in FOLLOW(A)
⇒ add A → α to M[A,$]
• All other undefined entries of the parsing table are error entries.

Constructing LL(1) Parsing Tables
• Two functions are used in the construction of LL(1) parsing tables:
– FIRST and FOLLOW
• FIRST(α) is the set of terminal symbols that occur as first symbols
in strings derived from α, where α is any string of grammar symbols.
• If α derives ε, then ε is also in FIRST(α).
• FOLLOW(A) is the set of terminals that occur immediately after
(follow) the non-terminal A in the strings derived from the start
symbol.
– a terminal a is in FOLLOW(A) if S ⇒* αAaβ
– $ is in FOLLOW(A) if S ⇒* αA

Constructing LL(1) Parsing Table -- Example
E → TE’      FIRST(TE’) = {(, id}    ⇒ E → TE’ into M[E,(] and M[E,id]
E’ → +TE’    FIRST(+TE’) = {+}       ⇒ E’ → +TE’ into M[E’,+]
E’ → ε       FIRST(ε) = {ε}          ⇒ none
             but since ε is in FIRST(ε)
             and FOLLOW(E’) = {$, )}  ⇒ E’ → ε into M[E’,$] and M[E’,)]
T → FT’      FIRST(FT’) = {(, id}    ⇒ T → FT’ into M[T,(] and M[T,id]
T’ → *FT’    FIRST(*FT’) = {*}       ⇒ T’ → *FT’ into M[T’,*]
T’ → ε       FIRST(ε) = {ε}          ⇒ none
             but since ε is in FIRST(ε)
             and FOLLOW(T’) = {$, ), +}  ⇒ T’ → ε into M[T’,$], M[T’,)] and M[T’,+]
F → (E)      FIRST((E)) = {(}        ⇒ F → (E) into M[F,(]
F → id       FIRST(id) = {id}        ⇒ F → id into M[F,id]
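The table-filling steps above can be sketched in a few lines. The FIRST and FOLLOW sets are copied from the earlier examples rather than recomputed, and the names GRAMMAR, FIRST, FOLLOW, first_of_string and build_table are hypothetical. Each table cell holds a list of right-hand sides so that a multiply-defined entry would be visible; the empty list stands for ε.

```python
EPS = "ε"
GRAMMAR = {
    "E": [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# Sets taken from the FIRST and FOLLOW examples.
FIRST = {"E": {"(", "id"}, "E'": {"+", EPS}, "T": {"(", "id"},
         "T'": {"*", EPS}, "F": {"(", "id"}}
FOLLOW = {"E": {"$", ")"}, "E'": {"$", ")"}, "T": {"+", ")", "$"},
          "T'": {"+", ")", "$"}, "F": {"+", "*", ")", "$"}}

def first_of_string(symbols):
    """FIRST of a right-hand side (a terminal is its own FIRST)."""
    result = set()
    for sym in symbols:
        sym_first = FIRST.get(sym, {sym})
        result |= sym_first - {EPS}
        if EPS not in sym_first:
            return result
    result.add(EPS)
    return result

def build_table(grammar):
    """Return M as a dict: (non-terminal, terminal) -> list of right-hand sides."""
    table = {}
    for a, alternatives in grammar.items():
        for alpha in alternatives:
            alpha_first = first_of_string(alpha)
            lookaheads = alpha_first - {EPS}
            if EPS in alpha_first:
                lookaheads |= FOLLOW[a]     # covers $ when $ is in FOLLOW(A)
            for terminal in lookaheads:
                table.setdefault((a, terminal), []).append(alpha)
    return table

TABLE = build_table(GRAMMAR)
for (a, terminal), rules in sorted(TABLE.items()):
    print(f"M[{a},{terminal}] = {rules}")
```

Any key that is absent from the dictionary corresponds to an error entry of the table.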

LL(1) Parser – Example2
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
Blanks are error entries. Non-blanks indicate a production with which to
expand the top non-terminal on the stack.

      id           +            *            (            )            $
E     E → TE’                                E → TE’
E’                 E’ → +TE’                              E’ → ε       E’ → ε
T     T → FT’                                T → FT’
T’                 T’ → ε       T’ → *FT’                 T’ → ε       T’ → ε
F     F → id                                 F → (E)
Model of Predictive Parser

LL(1) Parser
input buffer
– our string to be parsed. We will assume that its end is marked with a special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input
buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the start symbol S ($S is the initial stack).
– when the stack is emptied (i.e., only $ is left in the stack), the parsing is completed.
parsing table
– a two-dimensional array M[A,a]
– each row is a non-terminal symbol
– each column is a terminal symbol or the special symbol $
– each entry holds a production rule.

LL(1) Parser – Parser Actions
• The symbol at the top of the stack (say X) and the current symbol in the input string (say
a) determine the parser action.
• There are four possible parser actions.
1. If X and a are both $ ⇒ the parser halts (successful completion).
2. If X and a are the same terminal symbol (different from $)
⇒ the parser pops X from the stack and moves to the next symbol in the input buffer.
3. If X is a non-terminal
⇒ the parser looks at the parsing table entry M[X,a]. If M[X,a] holds a production rule
X → Y1Y2...Yk, it pops X from the stack and pushes Yk,Yk-1,...,Y1 onto the stack. The parser
also outputs the production rule X → Y1Y2...Yk to represent a step of the derivation.
4. None of the above ⇒ error
– all empty entries in the parsing table are errors.
– if X is a terminal symbol different from a, this is also an error case.

LL(1) Parser – Example1
S → aBa
B → bB | ε

LL(1) Parsing Table
      a          b          $
S     S → aBa
B     B → ε      B → bB

Input: abba$

stack    input    output
$S       abba$    S → aBa
$aBa     abba$
$aB      bba$     B → bB
$aBb     bba$
$aB      ba$      B → bB
$aBb     ba$
$aB      a$       B → ε
$a       a$
$        $        accept, successful completion

LL(1) Parser – Example1 (cont.)
Outputs: S → aBa   B → bB   B → bB   B → ε
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
[Figure: parse tree S( a, B( b, B( b, B( ε ) ) ), a )]

stack      input     output
$E         id+id$    E → TE’
$E’T       id+id$    T → FT’
$E’T’F     id+id$    F → id
$E’T’id    id+id$
$E’T’      +id$      T’ → ε
$E’        +id$      E’ → +TE’
$E’T+      +id$
$E’T       id$       T → FT’
$E’T’F     id$       F → id
$E’T’id    id$
$E’T’      $         T’ → ε
$E’        $         E’ → ε
$          $         accept
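The trace above follows directly from the four parser actions. Below is a minimal table-driven sketch in Python, with the parsing table for the expression grammar written out by hand; the names TABLE, NONTERMINALS and ll1_parse are hypothetical, the empty list stands for ε, and E’ is written with an ASCII apostrophe.

```python
# M[A, a] for E -> TE', E' -> +TE' | ε, T -> FT', T' -> *FT' | ε, F -> (E) | id
TABLE = {
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens, start="E"):
    """Return the productions used (a left-most derivation) or raise SyntaxError."""
    buffer = list(tokens) + ["$"]
    stack = ["$", start]                 # the top of the stack is the end of the list
    pos, output = 0, []
    while True:
        top, current = stack[-1], buffer[pos]
        if top == "$" and current == "$":
            return output                # action 1: successful completion
        if top == current:
            stack.pop()                  # action 2: match a terminal
            pos += 1
        elif top in NONTERMINALS and (top, current) in TABLE:
            rhs = TABLE[(top, current)]  # action 3: expand X -> Y1 ... Yk
            stack.pop()
            stack.extend(reversed(rhs))  # push Yk ... Y1 so that Y1 ends up on top
            output.append(f"{top} -> {' '.join(rhs) or 'ε'}")
        else:
            raise SyntaxError(f"unexpected {current!r} with {top!r} on the stack")

for rule in ll1_parse(["id", "+", "id"]):
    print(rule)
```

The printed rules appear in the same order as the output column of the trace.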

LL(1) Grammars
• A grammar whose parsing table has no multiply-defined entries is said
to be an LL(1) grammar.
LL(1):
– first L: input scanned from left to right
– second L: left-most derivation
– 1: one input symbol used as a look-ahead symbol to determine the parser action
• An entry in the parsing table of a grammar may contain more than one production
rule. In this case, we say that the grammar is not an LL(1) grammar.

A Grammar which is not LL(1)
S → iCtSE | a      FOLLOW(S) = { $, e }
E → eS | ε         FOLLOW(E) = { $, e }
C → b              FOLLOW(C) = { t }

FIRST(iCtSE) = {i}
FIRST(a) = {a}
FIRST(eS) = {e}
FIRST(ε) = {ε}
FIRST(b) = {b}

      a        b        e                 i             t        $
S     S → a                               S → iCtSE
E                       E → eS, E → ε                            E → ε
C              C → b

two production rules for M[E,e]
Problem ⇒ ambiguity
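The conflict can be found mechanically by filling the table and looking for cells with more than one rule. The sketch below copies the FIRST and FOLLOW sets listed above instead of computing them; the names PRODUCTIONS, FOLLOW and find_conflicts are hypothetical.

```python
EPS = "ε"
# (head, body, FIRST(body)) for each production, with FIRST taken from the listing above.
PRODUCTIONS = [
    ("S", "iCtSE", {"i"}),
    ("S", "a", {"a"}),
    ("E", "eS", {"e"}),
    ("E", EPS, {EPS}),
    ("C", "b", {"b"}),
]
FOLLOW = {"S": {"$", "e"}, "E": {"$", "e"}, "C": {"t"}}

def find_conflicts(productions, follow):
    """Fill the LL(1) table and return only the multiply-defined entries."""
    table = {}
    for head, body, first in productions:
        lookaheads = first - {EPS}
        if EPS in first:
            lookaheads |= follow[head]      # an ε body is selected on FOLLOW(head)
        for terminal in lookaheads:
            table.setdefault((head, terminal), []).append(f"{head} -> {body}")
    return {cell: rules for cell, rules in table.items() if len(rules) > 1}

CONFLICTS = find_conflicts(PRODUCTIONS, FOLLOW)
print(CONFLICTS)
```

An empty result would mean the grammar is LL(1); here the single conflicting cell is the dangling-else choice between E → eS and E → ε on the token e.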
