Ch4a Modified

Uploaded by

Maheen Munir
Copyright
© © All Rights Reserved


Syntax Analysis
Part I
Chapter 4
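The Role of the Parser

[Figure: position of the parser in the compiler front end. The source program enters the lexical analyzer, which hands a token to the parser each time the parser calls getNextToken. The parser passes a parse tree to the rest of the front end, which produces the intermediate representation. The symbol table is shared by these phases.]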

The Parser
• A parser implements a context-free grammar as a
recognizer of strings
• The role of the parser in a compiler is twofold:
1. To check syntax (= string recognizer)
• And to report syntax errors accurately
2. To invoke semantic actions
• For static semantics checking, e.g. type checking of
expressions, functions, etc.
• For syntax-directed translation of the source code to an
intermediate representation

Syntax-Directed Translation
• One of the major roles of the parser is to produce
an intermediate representation (IR) of the source
program using syntax-directed translation
methods
• Possible IR output:
– Abstract syntax trees (ASTs)
– Control-flow graphs (CFGs) with triples, three-address
code, or register transfer list notation
– WHIRL (SGI Pro64 compiler) has 5 IR levels!

Error Handling
• A good compiler should assist in identifying and
locating errors
– Lexical errors (e.g., misspelled tokens): important; the
compiler can easily recover and continue
– Syntax errors (e.g., misplaced semicolons, extra or missing
braces): most important for the compiler; it can almost
always recover
– Static semantic errors (e.g., type mismatches between
operators and operands): important; the compiler can
sometimes recover
– Dynamic semantic errors: hard or impossible to detect at
compile time; runtime checks are required
– Logical errors: hard or impossible to detect

Error Recovery Strategies

• Panic mode
– Discard input until a token in a set of designated
synchronizing tokens is found
• Phrase-level recovery
– Perform local correction on the input to repair the
error
– A typical local correction is to replace a comma by a
semicolon, delete an extraneous semicolon, or insert a
missing semicolon.
• Error productions
– Augment (expand) the grammar with productions for
erroneous constructs
• Global correction
– Choose a minimal sequence of changes to obtain a
global least-cost correction
– Unfortunately, these methods are in general too costly
to implement in terms of time and space, so these
techniques are currently only of theoretical interest.

Grammars (Recap)
• A context-free grammar is a 4-tuple
G = (N, T, P, S) where
– T is a finite set of tokens (terminal symbols)
– N is a finite set of nonterminals
– P is a finite set of productions of the form
α → β
where α ∈ N and β ∈ (N ∪ T)*
– S ∈ N is a designated start symbol

Notational Conventions Used


• Terminals
a, b, c, … ∈ T
specific terminals: 0, 1, id, +
• Nonterminals
A, B, C, … ∈ N
specific nonterminals: expr, term, stmt
• Grammar symbols
X, Y, Z ∈ (N ∪ T)
• Strings of terminals
u, v, w, x, y, z ∈ T*
• Strings of grammar symbols
α, β, γ ∈ (N ∪ T)*

Chomsky Hierarchy: Language Classification
• A grammar G is said to be
– Regular if it is right linear, where each production is of
the form
A → wB or A → w
or left linear, where each production is of the form
A → Bw or A → w
– Context free if each production is of the form
A → α
where A ∈ N and α ∈ (N ∪ T)*
– Context sensitive if each production is of the form
αAβ → αγβ
where A ∈ N, α, γ, β ∈ (N ∪ T)*, |γ| > 0
– Unrestricted

Chomsky Hierarchy

L(regular) ⊂ L(context free) ⊂ L(context sensitive) ⊂ L(unrestricted)

where L(T) = { L(G) | G is of type T },
that is, the set of all languages generated by grammars G of type T

Examples:
Every finite language is regular! (construct an FSA for strings in L(G))
L1 = { aⁿbⁿ | n ≥ 1 } is context free
L2 = { aⁿbⁿcⁿ | n ≥ 1 } is context sensitive

Chomsky Hierarchy (Contd.)

• Venn diagram of grammar types:

[Figure: nested sets, from outermost to innermost: Type 0 – Phrase-structure (Unrestricted), Type 1 – Context-Sensitive, Type 2 – Context-Free, Type 3 – Regular.]

Derivations (Recap)
• The one-step derivation is defined by
αAβ ⇒ αγβ
where A → γ is a production in the grammar
• In addition, we define
– ⇒ is leftmost (⇒lm) if α does not contain a nonterminal
– ⇒ is rightmost (⇒rm) if β does not contain a nonterminal
– Reflexive-transitive closure ⇒* (zero or more steps)
– Positive closure ⇒+ (one or more steps)
• The language generated by G is defined by
L(G) = {w ∈ T* | S ⇒+ w}

Derivation (Example)
Grammar G = ({E}, {+,*,(,),-,id}, P, E) with
productions P = E → E + E
                E → E * E
                E → ( E )
                E → - E
                E → id
Example derivations:
E ⇒ - E ⇒ - id
E ⇒rm E + E ⇒rm E + id ⇒rm id + id
E ⇒* E
E ⇒* id + id
E ⇒+ id * id + id

Parsing
• Universal (any context-free grammar)
– Cocke-Younger-Kasami
– Earley
– These methods are too inefficient for use in production compilers
• Top-down (context-free grammar with restrictions)
– Recursive descent (predictive parsing)
– LL (Left-to-right, Leftmost derivation) methods
• Bottom-up (context-free grammar with restrictions)
– Operator precedence parsing
– LR (Left-to-right, Rightmost derivation) methods
• SLR, canonical LR, LALR

Top-Down Parsing
• LL methods (Left-to-right, Leftmost
derivation) and recursive-descent parsing

Grammar:
E → T + T
T → ( E )
T → - E
T → id

Leftmost derivation:
E ⇒lm T + T
  ⇒lm id + T
  ⇒lm id + id

[Figure: four parse-tree snapshots, one per derivation step, growing from the root E down to the leaves id + id.]

Problems for top-down parsing with backtracking:
(1) Left recursion (can cause a top-down parser to go into an
infinite loop)
Def. A grammar is said to be left-recursive if it has a
nonterminal A such that there is a derivation A ⇒+ Aα for some α.

(2) Backtracking: the parser must undo not only its movement
through the input but also any semantic actions, such as entries
made in the symbol table.

(3) The order in which the alternatives are tried (for the grammar
shown above, try w = cabd where A → a is applied first)
Ambiguity
• For some strings there exists more than one
parse tree
• Or more than one leftmost derivation
• Or more than one rightmost derivation
• Example: id+id*id
Elimination of ambiguity (cont.)

• Idea:
– A statement appearing between a then and an
else must be matched

Left Recursion
• Productions of the form
A → A α
  | β
  | γ
are left recursive
• When one of the productions in a grammar
is left recursive then a predictive parser
loops forever on certain inputs
Elimination of Left Recursion
• A grammar is left recursive if it has a
nonterminal A such that there is a derivation
A ⇒+ Aα
• Top-down parsing methods cannot handle
left-recursive grammars
• A simple rule for direct left recursion
elimination:
– For a rule like:
• A → A α | β
– We may replace it with
• A → β A'
• A' → α A' | ε

Elimination of Left Recursion

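[Figure: the left-recursive parse tree for A → A α | β, which grows down its left edge and derives β α … α, is transformed (===>) into a right-recursive tree built with A → β A' and A' → α A' | ε that derives the same string β α … α.]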

e.g.
E → E + T | T
T → T * F | F
F → (E) | id

After transformation:

E → TE'     E' → +TE' | ε
T → FT'     T' → *FT' | ε
F → (E) | id

General form (with left recursion):

A → A α1 | A α2 | ... | A αn | β1 | β2 | ... | βm

After transformation:

==> A → β1 A' | β2 A' | ... | βm A'
    A' → α1 A' | α2 A' | ... | αn A' | ε
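The transformation of the general form can be sketched in a few lines of Python. This is only an illustration under assumed conventions that are not part of the slides: a production body is a tuple of symbols, ε is the empty tuple, and the new nonterminal is named by appending a prime. The function name is hypothetical.

```python
EPSILON = ()  # assumed encoding of an ε right-hand side


def eliminate_immediate_left_recursion(nonterminal, productions):
    """Split A → Aα1 | ... | Aαn | β1 | ... | βm into rules for A and A'.

    `productions` is a list of tuples of symbols. The result maps A (and A',
    if one was needed) to its new list of production bodies.
    """
    # The α parts: bodies that start with A, with that leading A removed.
    alphas = [p[1:] for p in productions if p and p[0] == nonterminal]
    # The β parts: every body that does not start with A.
    betas = [p for p in productions if not p or p[0] != nonterminal]
    if not alphas:  # no A → Aα rule, so nothing to transform
        return {nonterminal: list(productions)}
    new_nt = nonterminal + "'"
    return {
        nonterminal: [beta + (new_nt,) for beta in betas],
        new_nt: [alpha + (new_nt,) for alpha in alphas] + [EPSILON],
    }


# E → E + T | T
result = eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)])
for head, bodies in result.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

For E → E + T | T this yields E → T E' and E' → + T E' | ε, the same rules as in the transformed expression grammar.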

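What about left recursion that arises through a derivation of two or more steps?

e.g.,
S → Aa | b
A → Ac | Sd | ε
Substituting the S-productions into A → Sd gives:
S → Aa | b
A → Ac | Aad | bd | ε
After eliminating the immediate left recursion in A:
S → Aa | b
A → bdA' | A'
A' → cA' | adA' | ε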

A General Systematic Left Recursion Elimination Method
Input: Grammar G with no cycles or ε-productions
Arrange the nonterminals in some order A1, A2, …, An
for i = 1, …, n do
    for j = 1, …, i-1 do
        replace each
            Ai → Aj γ
        with
            Ai → δ1 γ | δ2 γ | … | δk γ
        where
            Aj → δ1 | δ2 | … | δk
    enddo
    eliminate the immediate left recursion in Ai
enddo
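The loop structure above translates almost line for line into Python. The sketch below assumes a grammar stored as a dict from nonterminal to a list of bodies (tuples of symbols, with the empty tuple for ε) and names new nonterminals with a prime; the function name and these encodings are illustrative choices, not something the slides prescribe. It is run on the S → Aa | b, A → Ac | Sd | ε example, which has an ε-production but happens to be handled correctly.

```python
def eliminate_left_recursion(grammar, order):
    """Return a copy of `grammar` without left recursion.

    grammar: dict nonterminal -> list of bodies (tuples of symbols, ε is ()).
    order:   the chosen ordering A1, ..., An of the nonterminals.
    Assumes the grammar has no cycles.
    """
    g = {head: list(bodies) for head, bodies in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for body in g[ai]:
                if body and body[0] == aj:
                    # Ai → Aj γ becomes Ai → δ γ for every Aj → δ
                    expanded.extend(delta + body[1:] for delta in g[aj])
                else:
                    expanded.append(body)
            g[ai] = expanded
        # Eliminate the immediate left recursion in Ai.
        alphas = [b[1:] for b in g[ai] if b and b[0] == ai]
        if alphas:
            betas = [b for b in g[ai] if not b or b[0] != ai]
            new_nt = ai + "'"
            g[ai] = [beta + (new_nt,) for beta in betas]
            g[new_nt] = [alpha + (new_nt,) for alpha in alphas] + [()]
    return g


grammar = {
    "S": [("A", "a"), ("b",)],
    "A": [("A", "c"), ("S", "d"), ()],
}
result = eliminate_left_recursion(grammar, ["S", "A"])
for head, bodies in result.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

With the order S, A the sketch leaves S unchanged and produces A → b d A' | A' and A' → c A' | a d A' | ε.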

Non-backtracking (recursive-descent) parsing (Left Factoring)
• When a nonterminal has two or more productions
whose right-hand sides start with the same
grammar symbols, the grammar is not LL(1) and
cannot be used for predictive parsing
• Replace productions
A → α β1 | α β2 | … | α βn | γ
with
A → α AR | γ
AR → β1 | β2 | … | βn
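One left-factoring step can be sketched as follows. The encoding (bodies as tuples, ε as the empty tuple) and the use of primes instead of the subscript R for the new nonterminal are assumptions made for this illustration, as is the input grammar S → i E t S | i E t S e S | a, which is taken here to be the unfactored form of the conditional-statement grammar used later in these slides. Alternatives are grouped by their first symbol and the longest common prefix of each group is factored out.

```python
from collections import defaultdict


def common_prefix(bodies):
    """Longest sequence of symbols shared at the start of every body."""
    prefix = []
    for symbols in zip(*bodies):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return tuple(prefix)


def left_factor_once(head, productions):
    """Apply one left-factoring step to the alternatives of `head`."""
    groups = defaultdict(list)
    for body in productions:
        groups[body[:1]].append(body)  # the key is () for an ε body
    result = {head: []}
    new_nt = head
    for key, bodies in groups.items():
        if not key or len(bodies) == 1:
            result[head].extend(bodies)  # the γ alternatives stay unchanged
            continue
        alpha = common_prefix(bodies)
        new_nt += "'"  # hypothetical naming: A', A'', ...
        result[head].append(alpha + (new_nt,))
        result[new_nt] = [body[len(alpha):] for body in bodies]
    return result


stmt = [("i", "E", "t", "S"), ("i", "E", "t", "S", "e", "S"), ("a",)]
result = left_factor_once("S", stmt)
for head, bodies in result.items():
    print(head, "→", " | ".join(" ".join(body) or "ε" for body in bodies))
```

A single step is enough for this grammar; in general the step has to be repeated on the new nonterminals until no two alternatives share a prefix.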

Example (Left Factoring)

S → i E t S | i E t S e S | a
E → b

Here, i, t, and e stand for if, then, and else; E and S
stand for "conditional expression" and "statement."
Left-factored, this grammar becomes:

S → i E t S S' | a
S' → e S | ε
E → b
Top-Down Parsing
• A top-down parser tries to create a parse tree
from the root towards the leaves, scanning the input
from left to right
• It can also be viewed as finding a leftmost
derivation for an input string
• Example: id+id*id

E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id

[Figure: a sequence of parse-tree snapshots connected by ⇒lm steps. Starting from the root E, the parser expands E → TE', T → FT', F → id, T' → ε, and E' → +TE' in turn.]

Predictive Parsing
• Eliminate left recursion from the grammar
• Left factor the grammar
• Compute FIRST and FOLLOW
• Two variants:
– Recursive (recursive-descent parsing)
– Non-recursive (table-driven parsing)

Recursive-Descent Parsing

FIRST
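• FIRST(α) = the set of terminals that begin the
strings derived from α (plus ε if α ⇒* ε)
For every production A → α, add to FIRST(A):
– a, if α begins with a terminal a ∈ T
– ε, if α = ε
– FIRST(B) − {ε}, if α begins with a nonterminal B ∈ N;
if B ⇒* ε, continue with the symbol that follows B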
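FIRST is usually computed as a fixed point: keep applying the rules until no set grows. The following is a minimal sketch under assumed conventions that are not from the slides (grammar as a dict of tuple bodies, the string "ε" standing for the empty string, any symbol without productions treated as a terminal); the function names are hypothetical.

```python
EPS = "ε"


def first_of_string(symbols, first):
    """FIRST of a string X1 X2 ... Xk, given FIRST of every nonterminal."""
    result = set()
    for x in symbols:
        if x not in first:  # a terminal begins the string
            result.add(x)
            return result
        result |= first[x] - {EPS}
        if EPS not in first[x]:
            return result
    result.add(EPS)  # every Xi can vanish, or the string is empty
    return result


def compute_first(grammar):
    """grammar: dict nonterminal -> list of bodies (tuples; ε is ())."""
    first = {head: set() for head in grammar}
    changed = True
    while changed:  # iterate until no FIRST set grows
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                first[head] |= first_of_string(body, first)
                if len(first[head]) != before:
                    changed = True
    return first


GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("id",)],
}
first = compute_first(GRAMMAR)
for head in GRAMMAR:
    print(f"FIRST({head}) =", sorted(first[head]))
```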

LL(1) Grammar
• A grammar G is LL(1) if it is not left recursive
and for each collection of productions
A → α1 | α2 | … | αn
for nonterminal A the following holds:

1. FIRST(αi) ∩ FIRST(αj) = ∅ for all i ≠ j

2. if αi ⇒* ε then
2.a. αj does not derive ε (αj ⇏* ε) for all j ≠ i
2.b. FIRST(αj) ∩ FOLLOW(A) = ∅
for all j ≠ i
Compute Follow(A) for all nonterminals A
1. Place $ in Follow(S), where S is the start symbol and $ is the
input buffer end-marker.
2. If there is a production A → α B β,
a) if β begins with a nonterminal then everything in First(β) except for ε is
placed in Follow(B).
b) if β begins with a terminal then that terminal is included in Follow(B).
3. If there is a production A → α B, or a production A → α B β
where First(β) contains ε, then everything in Follow(A) is in
Follow(B).
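The three FOLLOW rules can also be run to a fixed point. The sketch below is an illustration only: the grammar encoding, the helper names, and the choice to pass precomputed FIRST sets as a literal dict are assumptions. The FIRST sets are the ones for the expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id.

```python
EPS, END = "ε", "$"


def first_of_string(symbols, first):
    """FIRST of a string of grammar symbols; terminals are not keys of `first`."""
    result = set()
    for x in symbols:
        if x not in first:
            result.add(x)
            return result
        result |= first[x] - {EPS}
        if EPS not in first[x]:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, first, start):
    follow = {head: set() for head in grammar}
    follow[start].add(END)  # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, b in enumerate(body):
                    if b not in grammar:  # FOLLOW is defined for nonterminals only
                        continue
                    beta_first = first_of_string(body[i + 1:], first)
                    new = beta_first - {EPS}  # rule 2
                    if EPS in beta_first:  # rule 3, which also covers an empty β
                        new |= follow[head]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow


GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("F", "T'")],
    "T'": [("*", "F", "T'"), ()],
    "F": [("(", "E", ")"), ("id",)],
}
FIRST = {
    "E": {"(", "id"}, "E'": {"+", EPS},
    "T": {"(", "id"}, "T'": {"*", EPS},
    "F": {"(", "id"},
}
follow = compute_follow(GRAMMAR, FIRST, "E")
for head in GRAMMAR:
    print(f"FOLLOW({head}) =", sorted(follow[head]))
```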

Non-LL(1) Examples

Grammar           Not LL(1) because:

S → S a | a       Left recursive

S → a S | a       FIRST(a S) ∩ FIRST(a) ≠ ∅

S → a R | ε
R → S | ε         For R: S ⇒* ε and ε ⇒* ε

S → a R a
R → S | ε         For R: FIRST(S) ∩ FOLLOW(R) ≠ ∅

Recursive-Descent Parsing
(Recap)
• Grammar must be LL(1)
• Every nonterminal has one (recursive) procedure
responsible for parsing the nonterminal’s
syntactic category of input tokens
• When a nonterminal has multiple productions,
each production is implemented in a branch of a
selection statement based on input look-ahead
information

Using FIRST and FOLLOW in a Recursive-Descent Parser

Grammar:
expr → term rest
rest → + term rest
     | - term rest
     | ε
term → id

procedure rest();
begin
    if lookahead in FIRST(+ term rest) then
        match('+'); term(); rest()
    else if lookahead in FIRST(- term rest) then
        match('-'); term(); rest()
    else if lookahead in FOLLOW(rest) then
        return
    else error()
end;

where FIRST(+ term rest) = { + }
      FIRST(- term rest) = { - }
      FOLLOW(rest) = { $ }
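A runnable counterpart of procedure rest(), together with procedures for expr and term, might look like the sketch below. The class name, the token representation (a list of strings such as "id", "+", "-"), and the accepts helper are assumptions made for the illustration.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks the end of the input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.lookahead != expected:
            raise ParseError(f"expected {expected!r}, found {self.lookahead!r}")
        self.pos += 1

    def expr(self):  # expr → term rest
        self.term()
        self.rest()

    def rest(self):
        if self.lookahead in ("+", "-"):  # FIRST(+ term rest), FIRST(- term rest)
            self.match(self.lookahead)
            self.term()
            self.rest()
        elif self.lookahead == "$":  # FOLLOW(rest) = { $ }: choose rest → ε
            return
        else:
            raise ParseError(f"unexpected {self.lookahead!r}")

    def term(self):  # term → id
        self.match("id")


def accepts(tokens):
    """True if the token list is a sentence of the expr grammar."""
    parser = Parser(tokens)
    try:
        parser.expr()
        parser.match("$")
        return True
    except ParseError:
        return False


print(accepts(["id", "+", "id", "-", "id"]))
print(accepts(["id", "+"]))
```

Each nonterminal gets one method, and the lookahead token alone decides which alternative of rest is taken, so no backtracking is needed.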

Non-Recursive Predictive
Parsing: Table-Driven Parsing
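• Given an LL(1) grammar G = (N, T, P, S),
construct a table M[A,a] for A ∈ N, a ∈ T ∪ {$},
and use a driver program with a stack

[Figure: model of a table-driven predictive parser. The input buffer holds a + b $, the stack holds X Y Z $ with X on top, and the predictive parsing program (driver) consults the parsing table M and produces the output.]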
Predictive Parsing Table Algorithm

Example grammar:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

A → α           FIRST(α)    FOLLOW(A)
E → T E'        ( id        $ )
E' → + T E'     +           $ )
E' → ε          ε           $ )
T → F T'        ( id        + $ )
T' → * F T'     *           + $ )
T' → ε          ε           + $ )
F → ( E )       (           * + $ )
F → id          id          * + $ )

      id          +             *              (           )         $
E     E → T E'                                 E → T E'
E'                E' → + T E'                              E' → ε    E' → ε
T     T → F T'                                 T → F T'
T'                T' → ε        T' → * F T'                T' → ε    T' → ε
F     F → id                                   F → ( E )
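The table can be filled in mechanically with the standard LL(1) construction: for each production A → α, enter it under every terminal in FIRST(α), and if ε is in FIRST(α), also under every symbol in FOLLOW(A). The sketch below assumes that rule and takes the FIRST and FOLLOW columns of the example as literal input; the function name and data layout are illustrative, not from the slides.

```python
EPS = "ε"


def build_table(rows, follow):
    """rows: list of (head, body, FIRST(body)); follow: dict head -> FOLLOW(head).

    Returns a dict (nonterminal, terminal) -> list of bodies. A list with more
    than one body is a conflict, which means the grammar is not LL(1).
    """
    table = {}
    for head, body, first in rows:
        targets = first - {EPS}
        if EPS in first:  # the body can vanish: use FOLLOW(head) as well
            targets |= follow[head]
        for terminal in targets:
            table.setdefault((head, terminal), []).append(body)
    return table


ROWS = [  # (A, α, FIRST(α)) for the expression grammar
    ("E", ("T", "E'"), {"(", "id"}),
    ("E'", ("+", "T", "E'"), {"+"}),
    ("E'", (), {EPS}),
    ("T", ("F", "T'"), {"(", "id"}),
    ("T'", ("*", "F", "T'"), {"*"}),
    ("T'", (), {EPS}),
    ("F", ("(", "E", ")"), {"("}),
    ("F", ("id",), {"id"}),
]
FOLLOW = {
    "E": {"$", ")"}, "E'": {"$", ")"},
    "T": {"+", "$", ")"}, "T'": {"+", "$", ")"},
    "F": {"*", "+", "$", ")"},
}
table = build_table(ROWS, FOLLOW)
conflicts = {key: bodies for key, bodies in table.items() if len(bodies) > 1}
print(len(table), "entries,", len(conflicts), "conflicts")
```

Keys that are absent from the dict play the role of the blank (error) entries in the table.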
Answer

Ambiguous grammar:
S → i E t S S' | a
S' → e S | ε
E → b

A → α             FIRST(α)    FOLLOW(A)
S → i E t S S'    i           e $
S → a             a           e $
S' → e S          e           e $
S' → ε            ε           e $
E → b             b           t

Error: duplicate table entry

      a        b        e           i                 t     $
S     S → a                         S → i E t S S'
S'                      S' → ε                              S' → ε
                        S' → e S
E              E → b
Predictive Parsing Algorithm
Initially, w$ is in the input buffer and S is on top of the stack, above $.
In the loop, a denotes the input symbol that ip currently points to.
set ip to point to the first symbol of w;
set X to the top stack symbol;
while (X != $) {
    if (X is a) pop the stack and advance ip;
    else if (X is a terminal) error();
    else if (M[X, a] is an error entry) error();
    else if (M[X, a] = X → Y1Y2…Yk) {
        output the production X → Y1Y2…Yk;
        pop the stack;
        push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
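A direct Python rendering of this driver is sketched below. The table is hard-coded for the expression grammar E → T E', E' → + T E' | ε, T → F T', T' → * F T' | ε, F → ( E ) | id; representing M as a dict, treating missing keys as error entries, and raising SyntaxError are assumptions of this illustration.

```python
TABLE = {  # M[A, a]; keys that are absent are error entries
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def predictive_parse(tokens, start="E"):
    """Return the productions used, in leftmost-derivation order."""
    tokens = list(tokens) + ["$"]
    stack = ["$", start]  # the top of the stack is the end of the list
    ip = 0
    output = []
    while stack[-1] != "$":
        x, a = stack[-1], tokens[ip]
        if x == a:  # the terminal on top matches the input
            stack.pop()
            ip += 1
        elif x not in NONTERMINALS:  # a different terminal: mismatch
            raise SyntaxError(f"expected {x!r}, found {a!r}")
        elif (x, a) not in TABLE:  # error entry
            raise SyntaxError(f"unexpected {a!r} while expanding {x}")
        else:
            body = TABLE[(x, a)]
            output.append((x, tuple(body)))
            stack.pop()
            stack.extend(reversed(body))  # push Yk ... Y1, leaving Y1 on top
    if tokens[ip] != "$":
        raise SyntaxError(f"trailing input starting at {tokens[ip]!r}")
    return output


for head, body in predictive_parse(["id", "+", "id", "*", "id"]):
    print(head, "→", " ".join(body) or "ε")
```

The printed productions are the ones the driver outputs while parsing id + id * id, in the order they are applied.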
Example: Table-Driven Parsing

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

MATCHED         STACK        INPUT            ACTION
                E$           id + id * id$
                TE'$         id + id * id$    E → T E'
                FT'E'$       id + id * id$    T → F T'
                id T'E'$     id + id * id$    F → id
id              T'E'$        + id * id$       match id
id              E'$          + id * id$       T' → ε
id              +TE'$        + id * id$       E' → + T E'
id +            TE'$         id * id$         match +
id +            FT'E'$       id * id$         T → F T'
id +            id T'E'$     id * id$         F → id
id + id         T'E'$        * id$            match id
id + id         * FT'E'$     * id$            T' → * F T'
id + id *       FT'E'$       id$              match *
id + id *       id T'E'$     id$              F → id
id + id * id    T'E'$        $                match id
id + id * id    E'$          $                T' → ε
id + id * id    $            $                E' → ε
Panic Mode Recovery
Add synchronizing actions to undefined entries based on FOLLOW

Pro: Can be automated
Cons: Error messages are needed

FOLLOW(E) = { ) $ }
FOLLOW(E') = { ) $ }
FOLLOW(T) = { + ) $ }
FOLLOW(T') = { + ) $ }
FOLLOW(F) = { + * ) $ }

      id          +             *              (           )         $
E     E → T E'                                 E → T E'    synch     synch
E'                E' → + T E'                              E' → ε    E' → ε
T     T → F T'    synch                        T → F T'    synch     synch
T'                T' → ε        T' → * F T'                T' → ε    T' → ε
F     F → id      synch         synch          F → ( E )   synch     synch

synch: the driver pops the current nonterminal A and skips input until a
synch token is found, or skips input until one of FIRST(A) is found

Synch Application
"synch" marks the synchronizing tokens obtained from the
FOLLOW set of the nonterminal in question.
If the parser looks up entry M[A,a] and finds that it is
blank, the input symbol a is skipped.
If the entry is synch, the nonterminal on top of the
stack is popped.
If a token on top of the stack does not match the input
symbol, then we pop the token from the stack.
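The three recovery rules can be added to a table-driven driver as sketched below, using the panic-mode table for the expression grammar. One extra rule is an assumption of this sketch and is not stated in the rules above: when the entry is synch but the nonterminal is the only symbol left above $, the input symbol is skipped instead of popping, so that the parser does not stop with input still unread. The error-message wording and the function name are likewise hypothetical.

```python
SYNCH = "synch"
TABLE = {  # blank entries are simply absent
    ("E", "id"): ["T", "E'"], ("E", "("): ["T", "E'"],
    ("E", ")"): SYNCH, ("E", "$"): SYNCH,
    ("E'", "+"): ["+", "T", "E'"], ("E'", ")"): [], ("E'", "$"): [],
    ("T", "id"): ["F", "T'"], ("T", "("): ["F", "T'"],
    ("T", "+"): SYNCH, ("T", ")"): SYNCH, ("T", "$"): SYNCH,
    ("T'", "+"): [], ("T'", "*"): ["*", "F", "T'"],
    ("T'", ")"): [], ("T'", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
    ("F", "+"): SYNCH, ("F", "*"): SYNCH, ("F", ")"): SYNCH, ("F", "$"): SYNCH,
}
NONTERMINALS = {"E", "E'", "T", "T'", "F"}


def parse_with_recovery(tokens, start="E"):
    """Parse to the end of the input and return the list of error messages."""
    tokens = list(tokens) + ["$"]
    stack, ip, errors = ["$", start], 0, []
    while stack[-1] != "$":
        x, a = stack[-1], tokens[ip]
        entry = TABLE.get((x, a)) if x in NONTERMINALS else None
        if x == a:  # terminal matches
            stack.pop()
            ip += 1
        elif x not in NONTERMINALS:  # terminal mismatch: pop the token
            errors.append(f"popped terminal {x!r} at {a!r}")
            stack.pop()
        elif a != "$" and (entry is None or (entry == SYNCH and len(stack) == 2)):
            # Blank entry, or synch with nothing else to resume: skip input.
            errors.append(f"skipped input {a!r}")
            ip += 1
        elif entry is None or entry == SYNCH:  # synch: pop the nonterminal
            errors.append(f"popped {x} at {a!r}")
            stack.pop()
        else:  # ordinary production
            stack.pop()
            stack.extend(reversed(entry))
    if tokens[ip] != "$":
        errors.append(f"unexpected trailing input {tokens[ip]!r}")
    return errors


print(parse_with_recovery([")", "id", "*", "+", "id"]))
```

On the erroneous input ) id * + id the sketch reports two errors, one for skipping the leading ) and one for popping F when + arrives.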
Example

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

STACK        INPUT             ACTION
E$           ) id * + id $     error, skip )
E$           id * + id $       id is in FIRST(E)
TE'$         id * + id $
FT'E'$       id * + id $
id T'E'$     id * + id $
T'E'$        * + id $
* FT'E'$     * + id $
FT'E'$       + id $            error, M[F, +] = synch
T'E'$        + id $            F has been popped
E'$          + id $
+TE'$        + id $
TE'$         id $
FT'E'$       id $
id T'E'$     id $
T'E'$        $
E'$          $
$            $
Phrase-Level Recovery
Fill the empty table slots with pointers to error routines; these
routines may change, insert, or delete input symbols and issue an
error message.
Change the input stream by inserting missing tokens
For example: id id is changed into id * id
After the insertion, parsing can continue with M[T', *] = T' → * F T'

      id          +             *              (           )         $
E     E → T E'                                 E → T E'    synch     synch
E'                E' → + T E'                              E' → ε    E' → ε
T     T → F T'    synch                        T → F T'    synch     synch
T'    insert *    T' → ε        T' → * F T'                T' → ε    T' → ε
F     F → id      synch         synch          F → ( E )   synch     synch

insert *: the driver inserts the missing * and retries the production

Error Productions

E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id

Add "error production":
T' → F T'
to ignore missing *, e.g.: id id

Pro: Powerful recovery method
Cons: Cannot be automated

      id          +             *              (           )         $
E     E → T E'                                 E → T E'    synch     synch
E'                E' → + T E'                              E' → ε    E' → ε
T     T → F T'    synch                        T → F T'    synch     synch
T'    T' → F T'   T' → ε        T' → * F T'                T' → ε    T' → ε
F     F → id      synch         synch          F → ( E )   synch     synch