Chapter 8 - Syntax Analysis
Syntax Analysis
Outline
Role of parser
Context free grammars
Top down parsing
Bottom up parsing
Parser generators
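The role of the parser
[Figure: position of the parser in the front end. Source program -> Lexical Analyzer -> Parser -> Rest of Front End -> Intermediate representation. The parser requests tokens with getNextToken and receives one token per request; it passes a parse tree to the rest of the front end. A symbol table is shown alongside these components.]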
Uses of grammars
E -> E + T | T
T -> T * F | F
F -> (E) | id
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
Error handling
Common programming errors
Lexical – misspelling an identifier, keyword, or operator
Syntactic – an arithmetic expression with unbalanced
parentheses
Semantic – an operator applied to an incompatible operand
Logical – an infinitely recursive call
Error handler goals
Report the presence of errors clearly and accurately
Recover from each error quickly enough to detect subsequent
errors
Add minimal overhead to the processing of correct programs
Error-recovery strategies
Panic mode recovery
Discard input symbols one at a time until one of a
designated set of synchronization tokens is found
Phrase level recovery
Replace a prefix of the remaining input by some string
that allows the parser to continue
Error productions
Augment the grammar with productions that generate
the erroneous constructs
Global correction
Choose a minimal sequence of changes to obtain a
globally least-cost correction
Context free grammars
Terminals
Nonterminals
Start symbol
Productions
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id
[Figure: sequence of parse trees for the first steps of a leftmost (lm) derivation from E, expanding E -> TE’, T -> FT’, F -> id, T’ -> Ɛ, and E’ -> +TE’ in turn.]
Recursive descent parsing
Consists of a set of procedures, one for each
nonterminal
Execution begins with the procedure for start symbol
A typical procedure for a non-terminal
void A() {
    choose an A-production, A -> X1 X2 ... Xk;
    for (i = 1 to k) {
        if (Xi is a nonterminal)
            call procedure Xi();
        else if (Xi equals the current input symbol a)
            advance the input to the next symbol;
        else
            report a syntax error; /* an error has occurred */
    }
}
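As an illustration of one procedure per nonterminal, here is a minimal Python sketch for the expression grammar E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id. The names Parser, ParseError, and accepts are hypothetical, and the input is assumed to be a list of already-tokenized strings. Each procedure picks its production by looking at one input symbol, so no backtracking is needed for this grammar.

```python
class ParseError(Exception):
    """Raised when the current input symbol cannot be matched."""


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks the end of input
        self.pos = 0

    @property
    def look(self):
        return self.tokens[self.pos]

    def match(self, terminal):
        if self.look != terminal:
            raise ParseError(f"expected {terminal!r}, found {self.look!r}")
        self.pos += 1

    def E(self):            # E -> T E'
        self.T()
        self.E_prime()

    def E_prime(self):      # E' -> + T E' | epsilon
        if self.look == "+":
            self.match("+")
            self.T()
            self.E_prime()

    def T(self):            # T -> F T'
        self.F()
        self.T_prime()

    def T_prime(self):      # T' -> * F T' | epsilon
        if self.look == "*":
            self.match("*")
            self.F()
            self.T_prime()

    def F(self):            # F -> ( E ) | id
        if self.look == "(":
            self.match("(")
            self.E()
            self.match(")")
        else:
            self.match("id")


def accepts(tokens):
    """Return True if the token list is a sentence of the grammar."""
    parser = Parser(tokens)
    try:
        parser.E()          # execution begins with the start symbol
        parser.match("$")   # all input must be consumed
        return True
    except ParseError:
        return False
```

For example, accepts(["id", "+", "id", "*", "id"]) succeeds, while accepts(["id", "+"]) fails inside F because no id follows the +.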
Recursive descent parsing (cont)
General recursive descent may require backtracking
The previous code needs to be modified to allow
backtracking
In its general form it cannot choose an A-production easily,
so we need to try all alternatives
If one alternative fails, the input pointer must be reset and
another alternative tried
Recursive descent parsers can’t be used for left-recursive
grammars: the procedure for a left-recursive nonterminal would call itself without consuming any input
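Example
S -> cAd
A -> ab | a
Input: cad
[Figure: three parse trees. S is expanded to c A d; A is first expanded with A -> ab, which fails to match the input at b; after backtracking, A is expanded with A -> a, which matches cad.]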
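The backtracking in this example can be sketched in Python as follows. This is only an illustration for the grammar S -> cAd, A -> ab | a; the function names parse_A and parse_S are hypothetical. parse_A yields the input position reached by each alternative in turn, so when the first alternative leads to a dead end, the caller simply tries the next one from the same starting position.

```python
def parse_A(s, i):
    """Yield each input position reachable after A -> ab | a starting at i."""
    if s.startswith("ab", i):   # first alternative: A -> ab
        yield i + 2
    if s.startswith("a", i):    # second alternative: A -> a
        yield i + 1


def parse_S(s):
    """S -> c A d, trying each alternative for A and backtracking on failure."""
    if not s.startswith("c"):
        return False
    for j in parse_A(s, 1):
        # The input pointer is implicitly reset: every alternative starts at 1.
        if s.startswith("d", j) and j + 1 == len(s):
            return True
    return False
```

On the input cad the alternative A -> ab does not apply, and A -> a leaves the pointer at d, so parse_S("cad") returns True.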
First and Follow
First(α) is the set of terminals that begin strings derived from α
If α =>* ɛ then ɛ is also in First(α)
In predictive parsing when we have A -> α | β, if First(α)
and First(β) are disjoint sets then we can select the
appropriate A-production by looking at the next input symbol
Follow(A), for any nonterminal A, is the set of terminals a that
can appear immediately to the right of A in some sentential form
If we have S =>* αAaβ for some α and β, then a is in
Follow(A)
If A can be the rightmost symbol in some sentential form,
then $ is in Follow(A)
Computing First
To compute First(X) for all grammar symbols X, apply the
following rules until no more terminals or ɛ can be
added to any First set:
1. If X is a terminal then First(X) = {X}.
2. If X is a nonterminal and X -> Y1 Y2 … Yk is a production
for some k >= 1, then place a in First(X) if for some i, a is
in First(Yi) and ɛ is in all of First(Y1), …, First(Yi-1), that
is, Y1 … Yi-1 =>* ɛ. If ɛ is in First(Yj) for all j = 1, …, k, then add ɛ
to First(X).
3. If X -> ɛ is a production then add ɛ to First(X).
Example!
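The three rules can be run as a fixed-point iteration. The sketch below is one possible Python rendering, not a definitive implementation; the dictionary layout of GRAMMAR, the EPS marker, and the name compute_first are assumptions made for illustration. It uses the expression grammar E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id, with an empty list standing for an Ɛ-production.

```python
EPS = "eps"  # stands for the empty string

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],      # [] is the epsilon production
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def compute_first(grammar):
    """Return First(X) for every nonterminal X of the grammar."""
    first = {nt: set() for nt in grammar}

    def first_of(symbol):
        # Rule 1: First of a terminal is the terminal itself.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:                      # repeat until nothing can be added
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(first[head])
                all_derive_eps = True
                for symbol in body:     # Rule 2: scan Y1 Y2 ... Yk
                    f = first_of(symbol)
                    first[head] |= f - {EPS}
                    if EPS not in f:
                        all_derive_eps = False
                        break
                if all_derive_eps:      # Rules 2 and 3: whole body derives eps
                    first[head].add(EPS)
                if len(first[head]) != before:
                    changed = True
    return first
```

Calling compute_first(GRAMMAR) gives {(, id} for E, T, and F, {+, eps} for E’, and {*, eps} for T’.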
Computing follow
To compute Follow(A) for all nonterminals A, apply the
following rules until nothing can be added to any
Follow set:
1. Place $ in Follow(S), where S is the start symbol
2. If there is a production A -> αBβ then everything in
First(β) except ɛ is in Follow(B).
3. If there is a production A -> αB, or a production
A -> αBβ where First(β) contains ɛ, then everything
in Follow(A) is in Follow(B)
Example!
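A minimal Python sketch of these rules, again as a fixed-point iteration. To keep it short, the First sets of the expression grammar (E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id) are entered by hand rather than computed; the data layout and the names first_of_string and compute_follow are assumptions for illustration.

```python
EPS = "eps"  # stands for the empty string

GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}

# First sets of the nonterminals, assumed to be already known.
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}


def first_of_string(symbols):
    """First of a string of grammar symbols (eps for the empty string)."""
    result = set()
    for symbol in symbols:
        f = FIRST.get(symbol, {symbol})   # a terminal is its own First set
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                          # rule 1
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, symbol in enumerate(body):
                    if symbol not in grammar:       # only nonterminals
                        continue
                    before = len(follow[symbol])
                    rest = first_of_string(body[i + 1:])
                    follow[symbol] |= rest - {EPS}  # rule 2
                    if EPS in rest:                 # rule 3
                        follow[symbol] |= follow[head]
                    if len(follow[symbol]) != before:
                        changed = True
    return follow
```

With E as the start symbol this yields Follow(E) = Follow(E’) = {), $}, Follow(T) = Follow(T’) = {+, ), $}, and Follow(F) = {+, *, ), $}.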
LL(1) Grammars
Predictive parsers are those recursive descent parsers needing no
backtracking
Grammars for which we can create predictive parsers are called
LL(1)
The first L means scanning input from left to right
The second L means leftmost derivation
And 1 stands for using one input symbol for lookahead
A grammar G is LL(1) if and only if whenever A -> α | β are two
distinct productions of G, the following conditions hold:
For no terminal a do both α and β derive strings beginning with a
At most one of α and β can derive the empty string
If α =>* ɛ then β does not derive any string beginning with a
terminal in Follow(A); likewise, if β =>* ɛ then α does not derive
any string beginning with a terminal in Follow(A)
Construction of predictive
parsing table
For each production A -> α in the grammar do the
following:
1. For each terminal a in First(α), add A -> α to M[A,a]
2. If ɛ is in First(α), then for each terminal b in
Follow(A) add A -> α to M[A,b]. If ɛ is in First(α) and
$ is in Follow(A), add A -> α to M[A,$] as well
If after performing the above there is no production
in M[A,a], then set M[A,a] to error
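The construction can be sketched in Python as below. This is an illustration under stated assumptions: the First and Follow sets of the expression grammar are entered by hand, $ is treated as an ordinary member of Follow(A), missing dictionary keys play the role of error entries, and the names build_table and first_of_string are hypothetical. A cell that would receive two productions raises an error, which signals that the grammar is not LL(1).

```python
EPS = "eps"  # stands for the empty string

PRODUCTIONS = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")),
    ("E'", ()),                 # E' -> eps
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")),
    ("T'", ()),                 # T' -> eps
    ("F", ("(", "E", ")")),
    ("F", ("id",)),
]

FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    """First of a string of grammar symbols (eps for the empty string)."""
    result = set()
    for symbol in symbols:
        f = FIRST.get(symbol, {symbol})
        result |= f - {EPS}
        if EPS not in f:
            return result
    result.add(EPS)
    return result


def build_table(productions):
    """Return M as a dict keyed by (nonterminal, terminal)."""
    table = {}
    for head, body in productions:
        targets = first_of_string(body)              # step 1
        if EPS in targets:                           # step 2
            targets = (targets - {EPS}) | FOLLOW[head]
        for terminal in targets:
            if (head, terminal) in table:
                raise ValueError(f"conflict at M[{head},{terminal}]")
            table[(head, terminal)] = body
    return table
```

For this grammar the table has 13 filled cells; for example M[E’, )] holds E’ -> eps and M[E, +] is an error entry.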
Example
E -> TE’
E’ -> +TE’ | Ɛ
T -> FT’
T’ -> *FT’ | Ɛ
F -> (E) | id

Nonterminal   First      Follow
F             {(, id}    {+, *, ), $}
T             {(, id}    {+, ), $}
E             {(, id}    {), $}
E’            {+, ɛ}     {), $}
T’            {*, ɛ}     {+, ), $}
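Nonterminal   Input symbol
              id          +             *             (           )          $
E             E -> TE’                                E -> TE’
E’                        E’ -> +TE’                              E’ -> Ɛ    E’ -> Ɛ
T             T -> FT’                                T -> FT’
T’                        T’ -> Ɛ       T’ -> *FT’                T’ -> Ɛ    T’ -> Ɛ
F             F -> id                                 F -> (E)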
Nonterminal   Input symbol
              a         b         e           i               t    $
S             S -> a                          S -> iEtSS’
S’                                S’ -> Ɛ                          S’ -> Ɛ
                                  S’ -> eS
E                       E -> b
M[S’,e] holds two productions, so this grammar is not LL(1)
Table driven predictive parsing
• A predictive parser can be built without recursion by maintaining a stack
explicitly.
• The table-driven parser has an input buffer, a stack
containing a sequence of grammar symbols, a parsing
table and an output stream.
• The input buffer contains the string to be parsed
followed by $.
• Initially the stack contains the start symbol of the
grammar on top of $.
• The parsing table determines which production to apply for the
nonterminal on top of the stack and the current input symbol.
Model of a table-driven predictive
parser
[Figure: the input buffer (a + b $) is read by the predictive parsing program, which uses a stack (X, Y, Z above $) and the parsing table M, and writes to the output.]
Predictive parsing algorithm
set ip to point to the first symbol of w;
set X to the top stack symbol;
while (X != $) { /* stack is not empty */
    let a be the symbol pointed to by ip;
    if (X == a) pop the stack and advance ip;
    else if (X is a terminal) error();
    else if (M[X,a] is an error entry) error();
    else if (M[X,a] = X -> Y1 Y2 … Yk) {
        output the production X -> Y1 Y2 … Yk;
        pop the stack;
        push Yk, …, Y2, Y1 onto the stack, with Y1 on top;
    }
    set X to the top stack symbol;
}
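The loop above translates almost line by line into Python. The sketch below assumes the parsing table of the expression grammar (E -> TE’, E’ -> +TE’ | Ɛ, T -> FT’, T’ -> *FT’ | Ɛ, F -> (E) | id), entered by hand as a dictionary in which missing keys are error entries; the names TABLE and predictive_parse are hypothetical. The final check that all input was consumed is an addition not present in the pseudocode.

```python
NONTERMINALS = {"E", "E'", "T", "T'", "F"}

# M[X, a] -> body of the production to apply; () is an epsilon body.
TABLE = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "+"): (), ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}


def predictive_parse(tokens, start="E"):
    """Return the list of productions used, in leftmost-derivation order."""
    tokens = list(tokens) + ["$"]
    ip = 0
    stack = ["$", start]                  # start symbol on top of $
    output = []
    while stack[-1] != "$":
        X, a = stack[-1], tokens[ip]
        if X == a:                        # terminal matches: pop and advance
            stack.pop()
            ip += 1
        elif X not in NONTERMINALS:       # terminal that does not match
            raise SyntaxError(f"expected {X!r}, found {a!r}")
        elif (X, a) not in TABLE:         # error entry
            raise SyntaxError(f"no entry M[{X},{a}]")
        else:
            body = TABLE[(X, a)]
            output.append((X, body))
            stack.pop()
            stack.extend(reversed(body))  # leftmost body symbol ends on top
    if tokens[ip] != "$":                 # added check: leftover input
        raise SyntaxError(f"unexpected {tokens[ip]!r}")
    return output
```

For the input id + id * id the parser outputs eleven productions, starting with E -> TE’ and ending with E’ -> eps.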
Procedure of predictive parser
• The current symbol of the input string is
tracked by a pointer ‘ip’.
• At every step consider the pair (α, a), where ‘α’ is the
symbol on top of the stack and ‘a’ is the symbol pointed to by
‘ip’.
• If ‘α’ is a nonterminal, look up the table cell
M[α,a] for the production.
1. If M[α,a] is a valid production, pop the
stack and push the symbols of the production body in reverse order, so that the leftmost symbol ends up on top.
2. If M[α,a] is error or blank, report an error.
• If ‘α’ is a terminal that matches ‘a’, pop it from the stack and
advance the input pointer ‘ip’ to the
next symbol in the input string; if it does not match, report an error.
• The output is the sequence of productions applied.
• The following example illustrates the top-down
predictive parser using the parsing table.
String: id + id * id
Grammar: the LL(1) expression grammar from the
previous slides
MATCHED      STACK        INPUT          ACTION
             E$           id+id*id$
             TE’$         id+id*id$      E -> TE’
             FT’E’$       id+id*id$      T -> FT’
             idT’E’$      id+id*id$      F -> id
id           T’E’$        +id*id$        Match id
id           E’$          +id*id$        T’ -> Ɛ
id           +TE’$        +id*id$        E’ -> +TE’
id+          TE’$         id*id$         Match +
id+          FT’E’$       id*id$         T -> FT’
id+          idT’E’$      id*id$         F -> id
id+id        T’E’$        *id$           Match id
id+id        *FT’E’$      *id$           T’ -> *FT’
id+id*       FT’E’$       id$            Match *
id+id*       idT’E’$      id$            F -> id
id+id*id     T’E’$        $              Match id
id+id*id     E’$          $              T’ -> Ɛ
id+id*id     $            $              E’ -> Ɛ
Error recovery in predictive parsing
Panic mode
Place all symbols in Follow(A) into the synchronization set for
nonterminal A: skip tokens until an element of Follow(A) is seen,
then pop A from the stack.
Add to the synchronization set of a lower-level construct the symbols
that begin higher-level constructs
Add symbols in First(A) to the synchronization set of nonterminal
A
If a nonterminal can generate the empty string, then the production
deriving ɛ can be used as a default
If a terminal on top of the stack cannot be matched, pop the
terminal and issue a message saying that the terminal was inserted
Fig. Synchronizing tokens added to the parsing table
Introduction
Constructs parse tree for an input string beginning at
the leaves (the bottom) and working towards the root
(the top)
Example: id*id
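[Figure: snapshots of the bottom-up parse of id*id. The forest of subtrees grows as id*id is reduced step by step to F*id, T*id, T*F, T, and finally E.]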
Shift-reduce parser
The general idea is to shift some symbols of input to
the stack until a reduction can be applied
At each reduction step, a specific substring matching
the body of a production is replaced by the
nonterminal at the head of the production
The key decisions during bottom-up parsing are about
when to reduce and about what production to apply
A reduction is a reverse of a step in a derivation
The goal of a bottom-up parser is to construct a
derivation in reverse:
E=>T=>T*F=>T*id=>F*id=>id*id
Handle pruning
A Handle is a substring that matches the body of a
production and whose reduction represents one step
along the reverse of a rightmost derivation
1-2 E → E + T | T
3-4 T → T * F | F
5-6 F → ( E ) | id
Producing the parse table
FirstTerm(A) = {a | A =>+ aα or A =>+ Baα}
LastTerm(A) = {a | A =>+ αa or A =>+ αaB}
Example:
FirstTerm (E) = {+, *, id, (}
FirstTerm (T) = {*, id, (}
FirstTerm (F) = {id, (}
Precedence Functions vs
Relations
     +   -   *   /   ↑   (   )   id   $
f    2   2   4   4   4   0   6   6    0
g    1   1   3   3   5   5   0   5    0
Constructing precedence
functions
[Figure: precedence graph with nodes f id, g id, f *, g *, f +, g +, f $, g $.]
     +   *   id   $
f    2   4   4    0
g    1   3   5    0
Advantages
The advantages of operator precedence parsing are:
The implementation is simple.
The parser is powerful enough for expressions in
programming languages
Disadvantages
The disadvantages of operator precedence parsing are:
Tokens that have two different precedences (for example,
unary and binary minus) are difficult to handle.
Only a small class of grammars can be parsed using this
parser.
Extracting Precedence relations from parse tables
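1-2 E → E + T | T
3-4 T → T * F | F
5-6 F → ( E ) | id
[Figure: parse tree with E at the root expanded to E + T, T expanded to T * F, and a leaf id. It yields the relations + <. * and * <. id.]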
Extracting Precedence relations from parse tables
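1-2 E → E + T | T
3-4 T → T * F | F
5-6 F → ( E ) | id
[Figure: parse tree in which T * F appears below the T of another T * F, with F deriving id. It yields the relations * .> * and id .> *.]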
LR Parser
LR parsing is an efficient bottom-up syntax analysis technique
that can be used to parse a large class of context-free
grammars. In the name LR(0):
1. L stands for left-to-right scanning of the input
2. R stands for constructing a rightmost derivation in reverse
3. 0 stands for the number of input symbols of lookahead
Advantages of LR parsing:
It recognizes virtually all programming language
constructs for which a CFG can be written
It is able to detect syntactic errors
It is an efficient non-backtracking shift-reduce
parsing method.
Types of LR parser
SLR
CLR
LALR
LR Parsing
The most prevalent type of bottom-up parser
LR(k); we are mostly interested in parsers with k<=1
Why LR parsers?
Table driven
Can be constructed to recognize virtually all programming language
constructs for which a context-free grammar can be written
Most general non-backtracking shift-reduce parsing method
Can detect a syntactic error as soon as it is possible to do so
The class of grammars for which we can construct LR parsers is a
proper superset of the class for which we can construct LL parsers
States of an LR parser
States represent sets of items
An LR(0) item of G is a production of G with the dot at
some position of the body:
For A->XYZ we have following items
A->.XYZ
A->X.YZ
A->XY.Z
A->XYZ.
In a state having A->.XYZ we hope to see a string
derivable from XYZ next on the input.
What about A->X.YZ?
Constructing canonical LR(0)
item sets
Augmented grammar:
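G with the addition of a production S’->S, where S’ is a new start symbol
Closure of item sets:
If I is a set of items, closure(I) is the set of items constructed from I by
the following rules:
Add every item in I to closure(I)
If A->α.Bβ is in closure(I) and B->γ is a production, add the item B->.γ to closure(I) if it is not already there; repeat until no more items can be added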
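A minimal Python sketch of the closure operation for LR(0) items, using the augmented expression grammar E’ -> E, E -> E + T | T, T -> T * F | F, F -> (E) | id. Representing an item as a pair (production index, dot position) is an assumption made for illustration, as are the names GRAMMAR and closure. The loop adds every item in I and then keeps adding B -> .γ for each nonterminal B that appears immediately after a dot, which is the standard completion of the closure rules.

```python
GRAMMAR = [
    ("E'", ("E",)),            # 0: augmented production
    ("E", ("E", "+", "T")),    # 1
    ("E", ("T",)),             # 2
    ("T", ("T", "*", "F")),    # 3
    ("T", ("F",)),             # 4
    ("F", ("(", "E", ")")),    # 5
    ("F", ("id",)),            # 6
]
NONTERMINALS = {head for head, _ in GRAMMAR}


def closure(items):
    """items: iterable of (production index, dot position) pairs."""
    result = set(items)                 # add every item in I
    work = list(result)
    while work:
        prod, dot = work.pop()
        body = GRAMMAR[prod][1]
        # Dot in front of a nonterminal B: add B -> .gamma for all B-productions.
        if dot < len(body) and body[dot] in NONTERMINALS:
            for index, (head, _) in enumerate(GRAMMAR):
                if head == body[dot] and (index, 0) not in result:
                    result.add((index, 0))
                    work.append((index, 0))
    return frozenset(result)
```

closure({(0, 0)}) is the initial state: it contains E’ -> .E together with the six items that have the dot at the start of the other productions.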
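Example
E’ -> E
E -> E + T | T
T -> T * F | F
F -> (E) | id
[Figure: LR(0) automaton for the grammar above. Its item sets and transitions are listed below.]
I0 = closure({[E’->.E]}): E’->.E, E->.E+T, E->.T, T->.T*F, T->.F, F->.(E), F->.id
I1: E’->E., E->E.+T
I2: E->T., T->T.*F
I3: T->F.
I4: F->(.E), E->.E+T, E->.T, T->.T*F, T->.F, F->.(E), F->.id
I5: F->id.
I6: E->E+.T, T->.T*F, T->.F, F->.(E), F->.id
I7: T->T*.F, F->.(E), F->.id
I8: E->E.+T, F->(E.)
I9: E->E+T., T->T.*F
I10: T->T*F.
I11: F->(E).
Transitions:
I0: E to I1, T to I2, F to I3, ( to I4, id to I5
I1: + to I6, $ to acc
I2: * to I7
I4: E to I8, T to I2, F to I3, ( to I4, id to I5
I6: T to I9, F to I3, ( to I4, id to I5
I7: F to I10, ( to I4, id to I5
I8: ) to I11, + to I6
I9: * to I7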
Use of LR(0) automaton
Example: id*id
Line   Stack       Symbols   Input    Action
(1)    0           $         id*id$   Shift to 5
(2)    0 5         $id       *id$     Reduce by F->id
(3)    0 3         $F        *id$     Reduce by T->F
(4)    0 2         $T        *id$     Shift to 7
(5)    0 2 7       $T*       id$      Shift to 5
(6)    0 2 7 5     $T*id     $        Reduce by F->id
(7)    0 2 7 10    $T*F      $        Reduce by T->T*F
(8)    0 2         $T        $        Reduce by E->T
(9)    0 1         $E        $        accept
LR-Parsing model
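[Figure: the input buffer a1 … ai … an $ is read by the LR parsing program, which maintains a stack of states (sm on top, then sm-1, …, down to $), consults the ACTION and GOTO tables, and produces the output.]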
LR parsing algorithm
let a be the first symbol of w$;
while(1) { /*repeat forever */
let s be the state on top of the stack;
if (ACTION[s,a] = shift t) {
push t onto the stack;
let a be the next input symbol;
} else if (ACTION[s,a] = reduce A->β) {
pop |β| symbols off the stack;
let state t now be on top of the stack;
push GOTO[t,A] onto the stack;
output the production A->β;
} else if (ACTION[s,a]=accept) break; /* parsing is done */
else call error-recovery routine;
}
Example
(0) E’->E
(1) E -> E + T
(2) E -> T
(3) T -> T * F
(4) T -> F
(5) F -> (E)
(6) F -> id

STATE   ACTION                              GOTO
        id    +     *     (     )     $     E    T    F
0       S5                S4                1    2    3
1             S6                      Acc
2             R2    S7          R2    R2
3             R4    R4          R4    R4
4       S5                S4                8    2    3
5             R6    R6          R6    R6
6       S5                S4                     9    3
7       S5                S4                          10
8             S6                S11
9             R1    S7          R1    R1
10            R3    R3          R3    R3
11            R5    R5          R5    R5

id*id+id?
Line   Stack       Symbols   Input       Action
(1)    0                     id*id+id$   Shift to 5
(2)    0 5         id        *id+id$     Reduce by F->id
(3)    0 3         F         *id+id$     Reduce by T->F
(4)    0 2         T         *id+id$     Shift to 7
(5)    0 2 7       T*        id+id$      Shift to 5
(6)    0 2 7 5     T*id      +id$        Reduce by F->id
(7)    0 2 7 10    T*F       +id$        Reduce by T->T*F
(8)    0 2         T         +id$        Reduce by E->T
(9)    0 1         E         +id$        Shift to 6
(10)   0 1 6       E+        id$         Shift to 5
(11)   0 1 6 5     E+id      $           Reduce by F->id
(12)   0 1 6 3     E+F       $           Reduce by T->F
(13)   0 1 6 9     E+T       $           Reduce by E->E+T
(14)   0 1         E         $           accept
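The driver loop and the parsing table of this example can be combined into a small Python sketch. The encoding is an assumption made for illustration: ACTION maps a state to a dictionary from terminals to strings such as "s5", "r2", or "acc", PRODUCTIONS records each production's head and body length, and missing entries are errors. The function name lr_parse is hypothetical.

```python
# Production number -> (head, length of body), for productions (1) to (6).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: {"+": "r4", "*": "r4", ")": "r4", "$": "r4"},
    4: {"id": "s5", "(": "s4"},
    5: {"+": "r6", "*": "r6", ")": "r6", "$": "r6"},
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: {"+": "r3", "*": "r3", ")": "r3", "$": "r3"},
    11: {"+": "r5", "*": "r5", ")": "r5", "$": "r5"},
}

GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3,
        (4, "E"): 8, (4, "T"): 2, (4, "F"): 3,
        (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}


def lr_parse(tokens):
    """Return the production numbers used, in order of reduction."""
    tokens = list(tokens) + ["$"]
    stack = [0]                       # stack of states, initial state 0
    ip = 0
    reductions = []
    while True:
        s, a = stack[-1], tokens[ip]
        act = ACTION[s].get(a)
        if act is None:               # blank entry: syntax error
            raise SyntaxError(f"unexpected {a!r} in state {s}")
        if act == "acc":
            return reductions
        if act[0] == "s":             # shift t
            stack.append(int(act[1:]))
            ip += 1
        else:                         # reduce A -> beta
            number = int(act[1:])
            head, length = PRODUCTIONS[number]
            del stack[len(stack) - length:]            # pop |beta| states
            stack.append(GOTO[(stack[-1], head)])
            reductions.append(number)
```

For id*id+id the reductions come out as 6, 4, 6, 3, 2, 6, 4, 1, matching the reduce actions in the trace.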
Constructing SLR parsing table
Method
Construct C={I0,I1, … , In}, the collection of sets of LR(0) items for G’
State i is constructed from Ii:
If [A->α.aβ] is in Ii and Goto(Ii,a)=Ij, where a is a terminal, then set ACTION[i,a] to “shift j”
If [A->α.] is in Ii and A is not S’, then set ACTION[i,a] to “reduce A->α” for all a in
Follow(A)
If [S’->S.] is in Ii, then set ACTION[i,$] to “accept”
If any conflict appears then we say that the grammar is not
SLR(1).
If GOTO(Ii,A) = Ij then GOTO[i,A]=j
All entries not defined by the above rules are made “error”
The initial state of the parser is the one constructed from the
set of items containing [S’->.S]
Example grammar which is not
SLR(1)
S -> L=R | R
L -> *R | id
R -> L
I0: S’->.S, S->.L=R, S->.R, L->.*R, L->.id, R->.L
I1: S’->S.
I2: S->L.=R, R->L.
I3: S->R.
I4: L->*.R, R->.L, L->.*R, L->.id
I5: L->id.
I6: S->L=.R, R->.L, L->.*R, L->.id
I7: L->*R.
I8: R->L.
I9: S->L=R.
Action for state 2 on input =: both “shift 6” and “reduce R->L” (a shift/reduce conflict)
More powerful LR parsers
Canonical-LR or just LR method
Use lookahead symbols for items: LR(1) items
Results in a large collection of items
LALR: lookaheads are introduced in LR(0) items
Canonical LR(1) items
In LR(1) items each item has the form [A->α.β,a]
An LR(1) item [A->α.β,a] is valid for a viable prefix γ if
there is a rightmost derivation S =>* δAw => δαβw, where
γ = δα
Either a is the first symbol of w, or w is ε and a is $
Example:
S->BB
S =>* aaBab => aaaBab (rightmost derivation)
SetOfItems Goto(I,X) {
initialize J to be the empty set;
for (each item [A->α.Xβ,a] in I)
add item [A->αX.β,a] to set J;
return closure(J);
}
void items(G’){
initialize C to Closure({[S’->.S,$]});
repeat
for (each set of items I in C)
for (each grammar symbol X)
if (Goto(I,X) is not empty and not in C)
add Goto(I,X) to C;
until no new sets of items are added to C;
}
Example
S’->S
S->CC
C->cC
C->d
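The Goto and items procedures can be tried out on this grammar with the Python sketch below. Two things are assumptions, not taken from the slides: the closure step, which follows the standard LR(1) definition (for [A->α.Bβ,a] and each production B->γ, add [B->.γ,b] for every terminal b in First(βa)), and the shortcut in first_of, which is valid only because this grammar has no ε-productions. An item is encoded as (production index, dot position, lookahead); the function names are hypothetical.

```python
GRAMMAR = [
    ("S'", ("S",)),        # 0
    ("S", ("C", "C")),     # 1
    ("C", ("c", "C")),     # 2
    ("C", ("d",)),         # 3
]
NONTERMINALS = {"S'", "S", "C"}
FIRST = {"S": {"c", "d"}, "C": {"c", "d"}}   # entered by hand


def first_of(symbols):
    """First of a nonempty string; valid because nothing derives epsilon."""
    head = symbols[0]
    return FIRST.get(head, {head})


def closure(items):
    result = set(items)
    work = list(result)
    while work:
        prod, dot, look = work.pop()
        body = GRAMMAR[prod][1]
        if dot < len(body) and body[dot] in NONTERMINALS:
            for b in first_of(body[dot + 1:] + (look,)):
                for index, (head, _) in enumerate(GRAMMAR):
                    item = (index, 0, b)
                    if head == body[dot] and item not in result:
                        result.add(item)
                        work.append(item)
    return frozenset(result)


def goto(items, symbol):
    """Move the dot over symbol in every item where that is possible."""
    moved = {(p, d + 1, a) for p, d, a in items
             if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == symbol}
    return closure(moved)


def lr1_items():
    collection = [closure({(0, 0, "$")})]
    for state in collection:              # the list grows while we iterate
        for symbol in ("S", "C", "c", "d"):
            target = goto(state, symbol)
            if target and target not in collection:
                collection.append(target)
    return collection
```

For this grammar lr1_items() produces ten sets of items. Two of them share the core C->d. and differ only in lookaheads (c/d in one, $ in the other).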
Canonical LR(1) parsing table
Method
Construct C={I0,I1, … , In}, the collection of sets of LR(1) items for G’
State i is constructed from Ii:
If [A->α.aβ, b] is in Ii and Goto(Ii,a)=Ij, where a is a terminal, then set ACTION[i,a] to
“shift j”
If [A->α., a] is in Ii and A is not S’, then set ACTION[i,a] to “reduce A->α”
If [S’->S.,$] is in Ii, then set ACTION[i,$] to “accept”
If any conflict appears then we say that the grammar is not
LR(1).
If GOTO(Ii,A) = Ij then GOTO[i,A]=j
All entries not defined by the above rules are made “error”
The initial state of the parser is the one constructed from the
set of items containing [S’->.S,$]
Example
S’->S
S->CC
C->cC
C->d
LALR Parsing Table
For the previous example we had:
I4
C->d. , c/d
I7
C->d. , $
These two sets have the same core, so they are merged into:
I47
C->d. , c/d/$
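Merging item sets with the same core can be sketched in a few lines of Python. The encoding is hypothetical: each state number maps to a set of (core item, lookahead) pairs, with the core item written as a plain string, and only three states of the example are included to keep the illustration short.

```python
from collections import defaultdict

# A few LR(1) states of the S -> CC example, keyed by state number.
STATES = {
    4: {("C->d.", "c"), ("C->d.", "d")},
    5: {("S->CC.", "$")},
    7: {("C->d.", "$")},
}


def merge_by_core(states):
    """Union the lookaheads of all states whose items have the same core."""
    groups = defaultdict(list)
    for number, items in states.items():
        core = frozenset(item for item, _ in items)   # drop the lookaheads
        groups[core].append(number)
    merged = {}
    for numbers in groups.values():
        name = "I" + "".join(str(n) for n in numbers)
        merged[name] = set().union(*(states[n] for n in numbers))
    return merged
```

Here states 4 and 7 collapse into I47 with lookaheads c, d, and $, while state 5 has a core of its own and stays as I5.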
(1) E->E+E
(2) E->E*E
(3) E->(E)
(4) E->id
STATE   id   +    *    (    )    $    E
0       S3             S2             1
1            S4   S5             Acc
2       S3             S2             6
3            R4   R4        R4   R4
4       S3             S2             7
5       S3             S2             8