04 Parsing
Spring 2017
1 / 330
Outline
1. Parsing
First and follow sets
Top-down parsing
Bottom-up parsing
References
2 / 330
INF5110 – Compiler Construction
Parsing
Spring 2017
3 / 330
Overview
6 / 330
First and Follow sets
• general concept for grammars
• certain types of analyses (e.g. parsing):
• info needed about possible “forms” of derivable words,
First-set of A
which terminal symbols can appear at the start of strings derived
from a given nonterminal A
Follow-set of A
Which terminals can follow A in some sentential form.
Definition (Nullable)
Given a grammar G . A non-terminal A ∈ ΣN is nullable, if A ⇒∗ ε.
8 / 330
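The nullability condition can be computed as a simple fixed point. A minimal sketch in Python; the grammar encoding (a dict from non-terminal to right-hand sides as tuples, with () for an ε-production) is an assumption made here for illustration:

```python
# Nullable non-terminals as a fixed-point computation.
# Grammar encoding (hypothetical): dict non-terminal -> list of right-hand
# sides, each a tuple of symbols; () encodes an epsilon-production.
def nullable_set(grammar):
    nullable = set()
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            if A in nullable:
                continue
            # A is nullable if some rule A -> X1 ... Xn has only nullable Xi
            # (terminals are never in the set, so any terminal blocks a rule)
            if any(all(X in nullable for X in rhs) for rhs in rhss):
                nullable.add(A)
                changed = True
    return nullable
```

On the expression grammar used later in this chapter, exactly the primed non-terminals with an ε-alternative come out nullable.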
Examples
9 / 330
Remarks
10 / 330
A more algorithmic/recursive definition
X → X1 X2 . . . Xn
α = X1 . . . Xn ,
12 / 330
Pseudo code
for all non-terminals A do
    First[A] := {}
end
while there are changes to any First[A] do
    for each production A → X1 . . . Xn do
        k := 1;
        continue := true
        while continue = true and k ≤ n do
            First[A] := First[A] ∪ First[Xk] ∖ {ε}
            if ε ∉ First[Xk] then continue := false
            k := k + 1
        end;
        if continue = true
        then First[A] := First[A] ∪ {ε}
    end;
end
13 / 330
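The same iteration as runnable Python; the dict-of-tuples grammar encoding and the string "ε" as empty-word marker are assumptions for illustration:

```python
# First sets by fixed-point iteration, following the pseudo code above.
EPS = "ε"

def first_sets(grammar, terminals):
    first = {A: set() for A in grammar}
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                acc = set()
                all_nullable = True          # "continue" in the pseudo code
                for X in rhs:
                    # First of a terminal is the terminal itself
                    fx = {X} if X in terminals else first[X]
                    acc |= fx - {EPS}
                    if EPS not in fx:
                        all_nullable = False
                        break
                if all_nullable:             # every Xk contributed ε
                    acc.add(EPS)
                if not acc <= first[A]:
                    first[A] |= acc
                    changed = True
    return first
```

Run on the expression grammar below, this reproduces the First column of the table on the following slides.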
If only we could do away with special cases for the empty
words . . .
¹ A production of the form A → ε.
14 / 330
Example expression grammar (from before)
15 / 330
Example expression grammar (expanded)
16 / 330
nr | rule | pass 1 | pass 2 | pass 3
2 | exp → term | | | First[exp] = {(, n}
3 | addop → + | First[addop] = {+} | |
4 | addop → − | First[addop] = {+, −} | |
6 | term → factor | | First[term] = {(, n} |
7 | mulop → ∗ | First[mulop] = {∗} | |
8 | factor → ( exp ) | First[factor] = {(} | |
9 | factor → n | First[factor] = {(, n} | |
17 / 330
“Run” of the algo
18 / 330
Collapsing the rows & final result
• results per pass:

         pass 1    pass 2    pass 3
exp                          {(, n}
addop    {+, −}
term               {(, n}
mulop    {∗}
factor   {(, n}

• final result:

         First[_]
exp      {(, n}
addop    {+, −}
term     {(, n}
mulop    {∗}
factor   {(, n}
19 / 330
Work-list formulation
for all non-terminals A do
    First[A] := {};
    WL := P   // all productions
end
while WL ≠ ∅ do
    remove one (A → X1 . . . Xn) from WL
    if First[A] ≠ First[A] ∪ First[X1]
    then First[A] := First[A] ∪ First[X1];
         add all productions (A′ → X1′ . . . Xm′) with X1′ = A back to WL
    else skip
end
20 / 330
Follow sets
Follow (A) = { a ∈ ΣT + { $ } ∣ S $ ⇒∗G α1 A a α2 }.
21 / 330
Follow sets, recursively
22 / 330
More imperative representation in pseudo code
Follow[S] := {$}
for all non-terminals A ≠ S do
    Follow[A] := {}
end
while there are changes to any Follow-set do
    for each production A → X1 . . . Xn do
        for each Xi which is a non-terminal do
            Follow[Xi] := Follow[Xi] ∪ (First(Xi+1 . . . Xn) ∖ {ε})
            if ε ∈ First(Xi+1 Xi+2 . . . Xn)
            then Follow[Xi] := Follow[Xi] ∪ Follow[A]
        end
    end
end
23 / 330
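The Follow iteration can be sketched in Python in the same style; the First sets are assumed given (e.g. computed as above), and "$"/"ε" are plain string markers used for illustration:

```python
# Follow sets by fixed-point iteration, mirroring the pseudo code above.
EPS = "ε"

def first_of_word(word, first, terminals):
    """First set of a sentential form X_{i+1} ... X_n (may be empty)."""
    out = set()
    for X in word:
        fx = {X} if X in terminals else first[X]
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)   # every symbol nullable (or the word is empty)
    return out

def follow_sets(grammar, terminals, start, first):
    follow = {A: set() for A in grammar}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for A, rhss in grammar.items():
            for rhs in rhss:
                for i, X in enumerate(rhs):
                    if X in terminals:
                        continue
                    rest = first_of_word(rhs[i + 1:], first, terminals)
                    new = rest - {EPS}
                    if EPS in rest:          # rest nullable: Follow[A] flows in
                        new |= follow[A]
                    if not new <= follow[X]:
                        follow[X] |= new
                        changed = True
    return follow
```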
Example expression grammar (expanded)
24 / 330
nr | rule | pass 1 | pass 2
2 | exp → term | |
6 | term → factor | |
8 | factor → ( exp ) | |
25 / 330
“Run” of the algo
26 / 330
Illustration of first/follow sets
28 / 330
Some forms of grammars are less desirable than others
• left-recursive production:
A → Aα
more precisely: example of immediate left-recursion
• 2 productions with common “left factor”:
29 / 330
Some simple examples for both
• left-recursion
30 / 330
Transforming the expression grammar
• obviously left-recursive
• remember: this variant used for proper associativity!
31 / 330
After removing left recursion
• still unambiguous
• unfortunate: associativity now different!
• note also: ε-productions & nullability
32 / 330
Left-recursion removal
Left-recursion removal
A transformation process to turn a CFG into one without left
recursion
• price: ε-productions
• 3 cases to consider
• immediate (or direct) recursion
• simple
• general
• indirect (or mutual) recursion
33 / 330
Left-recursion removal: simplest case
Before:  A → Aα ∣ β
After:   A → βA′
         A′ → αA′ ∣ ε
34 / 330
Schematic representation
A → Aα ∣ β            A → βA′
                      A′ → αA′ ∣ ε

(left: the left-recursive tree, a spine of A-nodes growing down-left with α-children and β at the bottom; right: after the transformation, A over β and A′, with a spine of A′-nodes growing down-right over α, ending in ε)
35 / 330
Remarks
A → β{α}
• two negative aspects of the transformation
1. generated language unchanged, but: change in resulting
structure (parse tree), in other words a change in associativity,
which may result in a change of meaning
2. introduction of ε-productions
• more concrete example for such a production: grammar for
expressions
36 / 330
Left-recursion removal: immediate recursion (multiple)
Before:  A → Aα1 ∣ . . . ∣ Aαn ∣ β1 ∣ . . . ∣ βm
After:   A → β1 A′ ∣ . . . ∣ βm A′
         A′ → α1 A′ ∣ . . . ∣ αn A′ ∣ ε

Afterwards, A generates (β1 ∣ . . . ∣ βm )(α1 ∣ . . . ∣ αn )∗
37 / 330
Removal of: general left recursion
Assume non-terminals A1 , . . . , Am
for i := 1 to m do
    for j := 1 to i − 1 do
        replace each grammar rule of the form Ai → Aj β    // j < i
        by the rule Ai → α1 β ∣ α2 β ∣ . . . ∣ αk β ,
        where Aj → α1 ∣ α2 ∣ . . . ∣ αk
        are the current rules for Aj    // current, i.e., at this stage of the algorithm
    end
    { corresponds to i = j }
    remove, if necessary, immediate left recursion for Ai
end
38 / 330
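The immediate (simple) case of the removal can be sketched directly in code; the function name and the primed fresh non-terminal A' are illustrative assumptions, and () again encodes an ε-production:

```python
# Immediate left-recursion removal (the "simplest case" scheme):
# split the rules of A into left-recursive ones A -> A alpha and the
# rest A -> beta, then rebuild A plus a fresh primed non-terminal A'.
def remove_immediate_left_recursion(A, rhss):
    alphas = [r[1:] for r in rhss if r and r[0] == A]   # A -> A alpha
    betas = [r for r in rhss if not r or r[0] != A]     # A -> beta
    if not alphas:
        return {A: rhss}                # no immediate left recursion
    Ap = A + "'"                        # fresh name (assumed unused)
    return {A: [b + (Ap,) for b in betas],
            Ap: [a + (Ap,) for a in alphas] + [()]}     # () is epsilon
```

Applied to A → Aa ∣ c this yields A → cA′ and A′ → aA′ ∣ ε, matching the "Before/After" scheme above.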
Example (for the general case)
let A = A1 , B = A2

A → Ba ∣ Aa ∣ c
B → Bb ∣ Ab ∣ d

i = 1: remove immediate left recursion in A:

A → BaA′ ∣ cA′
A′ → aA′ ∣ ε
B → Bb ∣ Ab ∣ d

i = 2, j = 1: replace B → Ab using the current rules for A:

A → BaA′ ∣ cA′
A′ → aA′ ∣ ε
B → Bb ∣ BaA′ b ∣ cA′ b ∣ d

i = 2: remove immediate left recursion in B:

A → BaA′ ∣ cA′
A′ → aA′ ∣ ε
B → cA′ bB ′ ∣ dB ′
B ′ → bB ′ ∣ aA′ bB ′ ∣ ε
39 / 330
Left factor removal
Simple situation

Before:  A → αβ ∣ αγ ∣ . . .
After:   A → αA′ ∣ . . .
         A′ → β ∣ γ
43 / 330
Example: sequence of statements
Before After
44 / 330
Example: conditionals
Before
if-stmt → if ( exp ) stmt-seq end
        ∣ if ( exp ) stmt-seq else stmt-seq end
After
if-stmt → if ( exp ) stmt-seq else-or-end
else-or-end → else stmt-seq end ∣ end
45 / 330
Example: conditionals (without else)
Before
if-stmt → if ( exp ) stmt-seq
        ∣ if ( exp ) stmt-seq else stmt-seq
After
if-stmt → if ( exp ) stmt-seq else-or-empty
else-or-empty → else stmt-seq ∣ ε
46 / 330
Not all factorization doable in “one step”
Starting point
A → abcB ∣ abC ∣ aE
After 1 step
A → abA′ ∣ aE
A′ → cB ∣ C
After 2 steps
A → aA′′
A′′ → bA′ ∣ E
A′ → cB ∣ C
48 / 330
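Repeated left factoring can be sketched as a loop that keeps extracting the longest common prefix of alternatives sharing a first symbol. The fresh primed names are an assumption; on the grammar above, the result matches the two-step factorization up to naming:

```python
# Iterated left factoring on a grammar given as dict: non-terminal ->
# list of right-hand sides (tuples of symbols).
def lcp(seqs):
    """Longest common prefix of a list of tuples."""
    p = seqs[0]
    for s in seqs[1:]:
        i = 0
        while i < min(len(p), len(s)) and p[i] == s[i]:
            i += 1
        p = p[:i]
    return p

def left_factor(grammar):
    g = {A: list(rhss) for A, rhss in grammar.items()}
    changed = True
    while changed:
        changed = False
        for A in list(g):
            rhss = g[A]
            groups = {}
            for r in rhss:                       # group by first symbol
                groups.setdefault(r[:1], []).append(r)
            for head, grp in groups.items():
                if head and len(grp) > 1:        # a real common left factor
                    p = lcp(grp)
                    Ap = A + "'"
                    while Ap in g:               # fresh primed name
                        Ap += "'"
                    g[A] = [r for r in rhss if r not in grp] + [p + (Ap,)]
                    g[Ap] = [r[len(p):] for r in grp]
                    changed = True
                    break
    return g
```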
Outline
1. Parsing
First and follow sets
Top-down parsing
Bottom-up parsing
References
49 / 330
What’s a parser generally doing
50 / 330
Lexer, parser, and the rest
(pipeline: source program → lexer → token stream → parser → parse tree → rest of the front end → intermediate rep.; the parser requests tokens from the lexer via “get next token”; lexer and parser both consult the symbol table)
51 / 330
Top-down vs. bottom-up
Bottom-up: the parse tree is grown from the leaves to the root.
Top-down: the parse tree is grown from the root to the leaves.

• while the parse tree is mostly conceptual: parsing builds up the concrete data structure of the AST bottom-up vs. top-down.
52 / 330
Parsing restricted classes of CFGs
• parser: better be “efficient”
• full complexity of CFLs: not really needed in practice²
• classification of CF languages vs. CF grammars, e.g.:
  • left-recursion-freedom: condition on a grammar
  • ambiguous language vs. ambiguous grammar
• classification of grammars ⇒ classification of languages
  • a CF language is (inherently) ambiguous, if there’s no unambiguous grammar for it
  • a CF language is top-down parseable, if there exists a grammar that allows top-down parsing . . .
54 / 330
Relationship of some grammar (not language) classes
(diagram: within the class of all CF grammars, the unambiguous ones contain the hierarchies LL(0) ⊆ LL(1) ⊆ LL(k) and LR(0) ⊆ SLR ⊆ LALR(1) ⊆ LR(1) ⊆ LR(k), with LL(k) contained in LR(k); ambiguous grammars lie outside)
55 / 330
General task (once more)
56 / 330
Schematic view on “parser machine”
(schematic: an input tape “. . . if 1 + 2 ∗ ( 3 + 4 ) . . .”, a reading “head” moving left-to-right, a finite control with states q0 , q1 , . . . , qn , and unbounded extra memory in the form of a stack)
57 / 330
Derivation of an expression
... 1 + 2 ∗ ( 3 + 4 ) ...
exp
58 / 330
Remarks concerning the derivation
Note:
• input = stream of tokens
• there: 1 . . . stands for the token class number (for readability/concreteness); in the grammar: just number
• in full detail: a pair of token class and token value ⟨number, 1⟩
Notation:
• underline: the place (occurrence of non-terminal where
production is used)
• crossed out:
  • terminal = token is considered treated; the parser “moves on”
  • later implemented as a match or eat procedure
97 / 330
Not as a “film” but at a glance: reduction sequence
(tokens left of “•” are already matched)

exp ⇒
term exp ′ ⇒
factor term′ exp ′ ⇒
number term′ exp ′ ⇒
number • term′ exp ′ ⇒
number • exp ′ ⇒
number • addop term exp ′ ⇒
number • + term exp ′ ⇒
number + • term exp ′ ⇒
number + • factor term′ exp ′ ⇒
number + • number term′ exp ′ ⇒
number + number • term′ exp ′ ⇒
number + number • mulop factor term′ exp ′ ⇒
number + number • ∗ factor term′ exp ′ ⇒
number + number ∗ • factor term′ exp ′ ⇒
number + number ∗ • ( exp ) term′ exp ′ ⇒
number + number ∗ ( • exp ) term′ exp ′ ⇒
. . .
98 / 330
Best viewed as a tree
exp
├── term
│   ├── factor: Nr (1)
│   └── term′: ε
└── exp′
    ├── addop: +
    ├── term
    │   ├── factor: Nr (2)
    │   └── term′
    │       ├── mulop: ∗
    │       ├── factor: ( exp )  (subtree for 3 + 4)
    │       └── term′: ε
    └── exp′: ε
99 / 330
Non-determinism?
exp ⇒∗ 1 + 2 ∗ (3 + 4)
• i.e.: the input stream of tokens “guides” the derivation process (at least it fixes the target)
• but: how much “guidance” does the target word (in general) give?
137 / 330
Two principle sources of non-determinism here
Using production A → β
S ⇒∗ α1 A α2 ⇒ α1 β α2 ⇒∗ w
2 choices to make
1. where, i.e., on which occurrence of a non-terminal in α1 Aα2 to apply a productionᵃ
2. which production to apply (for the chosen non-terminal).

ᵃ Note that α1 and α2 may contain non-terminals, including further occurrences of A.
138 / 330
Left-most derivation
139 / 330
Non-determinism vs. ambiguity
• Note: the “where-to-reduce” non-determinism ≠ ambiguity of a grammar³
• in a way (“theoretically”): where to reduce next is irrelevant:
• the order in the sequence of derivations does not matter
• what does matter: the derivation tree (aka the parse tree)
Using production A → β
S ⇒∗ α1 A α2 ⇒ α1 β α2 ⇒∗ w
3
A CFG is ambiguous, if there exists a word (of terminals) with 2 different
parse trees.
140 / 330
Non-determinism vs. ambiguity
• Note: the “where-to-reduce” non-determinism ≠ ambiguity of a grammar³
• in a way (“theoretically”): where to reduce next is irrelevant:
• the order in the sequence of derivations does not matter
• what does matter: the derivation tree (aka the parse tree)
Using production A → β
S ⇒∗l w1 A α2 ⇒ w1 β α2 ⇒∗l w
3
A CFG is ambiguous, if there exists a word (of terminals) with 2 different
parse trees.
141 / 330
What about the “which-right-hand side” non-determinism?
A→β ∣ γ
Look-ahead of length k
resolve the “which-right-hand-side” non-determinism inspecting only
fixed-length prefix of w2 (for all situations as above)
LL(k) grammars
CF-grammars which can be parsed doing that.a
a
Of course, one can always write a parser that “just makes some decision” based on looking ahead k symbols. The question is: will it then accept exactly the words of the grammar, and only those.
143 / 330
Parsing LL(1) grammars
• this lecture: we don’t do LL(k) with k > 1
• LL(1): particularly easy to understand and to implement
(efficiently)
• not as expressive as LR(1) (see later), but still kind of decent
• two flavors for LL(1) parsing here (both are top-down parsers)
• recursive descent4
• table-based LL(1) parser
• predictive parsers
4
If one wants to be very precise: it’s recursive descent with one look-ahead
and without back-tracking. It’s the single most common case for recursive
descent parsers. Longer look-aheads are possible, but less common.
Technically, even allowing back-tracking can be done using recursive descent as
principle (even if not done in practice). 144 / 330
Sample expr grammar again
145 / 330
Look-ahead of 1: straightforward, but not trivial
• look-ahead of 1:
• not much of a look-ahead, anyhow
• just the “current token”
⇒ read the next token, and, based on that, decide
• but: what if there’s no more symbols?
⇒ read the next token if there is, and decide based on the token
or else the fact that there’s none left5
5
Sometimes “special terminal” $ used to mark the end (as mentioned).
146 / 330
Recursive descent: general set-up
Idea
For each non-terminal nonterm, write one procedure which:
• succeeds, if starting at the current token position, the “rest” of the token stream starts with a syntactically correct word of terminals representing nonterm
• fails otherwise
147 / 330
Recursive descent
void factor () {
  switch (tok) {
    case LPAREN: eat(LPAREN); expr(); eat(RPAREN); break;
    case NUMBER: eat(NUMBER); break;
  }
}
148 / 330
Recursive descent
let factor () = (* function for factors *)
  match !tok with
    LPAREN -> eat(LPAREN); expr(); eat(RPAREN)
  | NUMBER -> eat(NUMBER)
149 / 330
Slightly more complex
• previous 2 rules for factor : situation not always as immediate
as that
LL(1) principle (again)
given a non-terminal, the next token must determine the choice of right-hand sideᵃ

ᵃ It must be the next token/terminal in the sense of First, but it need not be a token directly mentioned on the right-hand sides of the corresponding rules.
1. left-factoring:
151 / 330
Recursive descent for left-factored if -stmt
procedure ifstmt()
begin
  match("if");
  match("(");
  exp();
  match(")");
  stmt();
  if token = "else"
  then match("else");
       stmt()
  end
end;
152 / 330
Left recursion is a no-go
Left-recursion
Left-recursive grammar never works for recursive descent.
⁶ And it would not help to look ahead more than 1 token either.
153 / 330
Removing left recursion may help
exp → term exp ′
exp ′ → addop term exp ′ ∣ ε
addop → + ∣ −
term → factor term′
term′ → mulop factor term′ ∣ ε
mulop → ∗
factor → ( exp ) ∣ n

procedure exp()
begin
  term();
  exp′()
end

procedure exp′()
begin
  case token of
    "+": match("+");
         term();
         exp′()
    "−": match("−");
         term();
         exp′()
  end
end
154 / 330
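The whole left-recursion-free expression grammar translates into one procedure per non-terminal. A compact Python sketch of such a recursive descent recognizer; the token handling (a plain list plus an index, "n" standing for the number token, "$" for end of input) is a simplifying assumption:

```python
# Recursive descent recognizer for the left-recursion-free expression grammar.
class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]   # "$" marks the end of input
        self.i = 0

    def tok(self):
        return self.toks[self.i]

    def match(self, t):              # the match/eat procedure from the slides
        if self.tok() != t:
            raise SyntaxError(f"expected {t}, got {self.tok()}")
        self.i += 1

    def exp(self):                   # exp -> term exp'
        self.term()
        self.exp_prime()

    def exp_prime(self):             # exp' -> addop term exp' | epsilon
        if self.tok() in ("+", "-"):
            self.match(self.tok())
            self.term()
            self.exp_prime()

    def term(self):                  # term -> factor term'
        self.factor()
        self.term_prime()

    def term_prime(self):            # term' -> mulop factor term' | epsilon
        if self.tok() == "*":
            self.match("*")
            self.factor()
            self.term_prime()

    def factor(self):                # factor -> ( exp ) | n
        if self.tok() == "(":
            self.match("(")
            self.exp()
            self.match(")")
        else:
            self.match("n")
```

A successful run consumes the whole input; a token that fits no alternative makes some match call fail.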
Recursive descent works, alright, but . . .
(parse tree for 1 + 2 ∗ (3 + 4), as before)

• no left-recursion
• associativity / precedence ok
• clean and straightforward rules
• recursive descent parsing ok

1 + 2 ∗ (3 + 4)

(intended abstract syntax: + at the root over 1 and a ∗-node; the ∗-node over 2 and the tree for (3 + 4))
157 / 330
The simple “original” expression grammar (even nicer)
Flat expression grammar
exp → exp op exp ∣ ( exp ) ∣ number
op → + ∣ − ∣ ∗
1 + 2 ∗ (3 + 4)
(one possible parse tree for 1 + 2 ∗ (3 + 4) in this grammar: exp over “exp op exp”, with the ∗-split nested under the +-split)
158 / 330
Associativity problematic

(parse tree sketch for (3 + 4) + 5)
159 / 330
Associativity problematic

(parse tree sketch for (3 − 4) − 5)
160 / 330
Now use the grammar without left-rec (but right-rec
instead)
No left-rec.
exp → term exp ′
exp ′ → addop term exp ′ ∣ ε
addop → + ∣ −
term → factor term′
term′ → mulop factor term′ ∣ ε
mulop → ∗
factor → ( exp ) ∣ n
(the parse tree for 3 − 4 − 5 now leans to the right)
161 / 330
Now use the grammar without left-rec (but right-rec
instead)
No left-rec.
exp → term exp ′
exp ′ → addop term exp ′ ∣ ε
addop → + ∣ −
term → factor term′
term′ → mulop factor term′ ∣ ε
mulop → ∗
factor → ( exp ) ∣ n

(the right-leaning tree for 3 − 4 − 5 corresponds to 3 − (4 − 5))
162 / 330
But if we need a “left-associative” AST?
• we want (3 − 4) − 5, not 3 − (4 − 5)

(right-leaning tree for 3 − 4 − 5, annotated with the intended left-associative evaluation: 3 − 4 = −1, then −1 − 5 = −6)
163 / 330
Code to “evaluate” ill-associated such trees correctly
function exp′(valsofar : int) : int;
begin
  if token = ’+’ or token = ’−’
  then
    case token of
      ’+’: match(’+’);
           valsofar := valsofar + term();
      ’−’: match(’−’);
           valsofar := valsofar − term();
    end case;
    return exp′(valsofar);
  else return valsofar
end;
164 / 330
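The accumulator idea, stripped to its core in Python: the already-computed value is threaded through, so the result is left-associative even though the grammar recursion is right-leaning. The flat token-list input (numbers and operator strings) is a simplifying assumption:

```python
# Left-associative evaluation with an accumulator, mirroring exp'(valsofar):
# each operator immediately combines the value-so-far with the next term.
def eval_exp(tokens):
    it = iter(tokens)
    val = next(it)           # first term (just a number here)
    for op in it:
        rhs = next(it)       # the term following the operator
        if op == "+":
            val = val + rhs
        elif op == "-":
            val = val - rhs
    return val
```

This yields (3 − 4) − 5 = −6 for the example above, not 3 − (4 − 5).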
“Designing” the syntax, its parsing, & its AST
• trade offs:
1. starting from: design of the language, how much of the syntax is left “implicit”⁷
2. which language class? Is LL(1) good enough, or something
stronger wanted?
3. how to parse? (top-down, bottom-up, etc.)
4. parse-tree/concrete syntax trees vs. ASTs
7
Lisp is famous/notorious in that its surface syntax is more or less an
explicit notation for the ASTs. Not that it was originally planned like this . . .
165 / 330
AST vs. CST
166 / 330
AST: How “far away” from the CST?
• AST: only thing relevant for later phases ⇒ better be clean . . .
• AST “=” CST?
• building AST becomes straightforward
• possible choice, if the grammar is not designed “weirdly”,
(compare: the annotated right-leaning CST for 3 − 4 − 5, the CST in the right-recursive grammar, and the tree exp over “exp op exp” in the flat grammar)
169 / 330
AST: How “far away” from the CST?
(AST for 3 − 4 − 5: a “−” node over another “−” node (over the first two numbers) and the third number)
170 / 330
AST: How “far away” from the CST?
• AST: only thing relevant for later phases ⇒ better be clean . . .
• AST “=” CST?
• building AST becomes straightforward
• possible choice, if the grammar is not designed “weirdly”,
(the same AST, with nodes carrying labels such as exp ∶ number and op ∶ −)
Recipe
• turn each non-terminal into an abstract class
• turn each right-hand side of a given non-terminal into a (non-abstract) subclass of the class for the considered non-terminal
• choose fields & constructors of concrete classes appropriately
• terminal: concrete class as well, with a field/constructor for the token’s value
174 / 330
Example in Java
public class ParentheticExp extends Exp {   // exp -> ( exp )
    public Exp exp;
    public ParentheticExp(Exp e) { exp = e; }
}
176 / 330
3 − (4 − 5)
177 / 330
Pragmatic deviations from the recipe
op → + ∣ − ∣ ∗
as simply integers, for instance arranged like
public class BinExp extends Exp {   // exp -> exp op exp
    public Exp left, right;
    public int op;
    public BinExp(Exp l, int o, Exp r) { left = l; op = o; right = r; }
    public final static int PLUS = 0, MINUS = 1, TIMES = 2;
}
Do it systematically
A clean grammar is the specification of the syntax of the language and thus of the parser. It is also a means of communicating with humans (at least with pros who can, of course, read BNF) what the syntax is. A clean grammar is a very systematic and structured thing which consequently can and should be systematically and cleanly represented in an AST, including a judicious and systematic choice of names and conventions (non-terminal exp represented by class Exp, non-terminal stmt by class Stmt, etc.).
BNF vs. EBNF

but remember:
• EBNF is just a notation: just because we do not see (left or right) recursion in { . . . } does not mean there is no recursion.
• not all parser generators support EBNF
• however: often easy to translate into loops⁸
• does not offer a general solution if associativity etc. is problematic

⁸ That results in a parser which is somehow not “pure recursive descent”. It’s “recursive descent, but sometimes, let’s use a while-loop, if more convenient concerning, for instance, associativity”.
180 / 330
Pseudo-code representing the EBNF productions
procedure exp;
begin
  term;   { recursive call }
  while token = "+" or token = "−"
  do
    match(token);
    term   { recursive call }
  end
end

procedure term;
begin
  factor;   { recursive call }
  while token = "∗"
  do
    match(token);
    factor   { recursive call }
  end
end
181 / 330
How to produce “something” during RD parsing?
Recursive descent
So far: RD = top-down (parse-)tree traversal via recursive
procedure.a Possible outcome: termination or failure.
a
Modulo the fact that the tree being traversed is “conceptual” and not the
input of the traversal procedure; instead, the traversal is “steered” by stream of
tokens.
182 / 330
Evaluating an exp during RD parsing
function exp() : int;
var temp : int
begin
  temp := term();   { recursive call }
  while token = "+" or token = "−"
    case token of
      "+": match("+");
           temp := temp + term();
      "−": match("−");
           temp := temp − term();
    end
  end
  return temp;
end
183 / 330
Building an AST: expression
function exp() : syntaxTree;
var temp, newtemp : syntaxTree
begin
  temp := term();   { recursive call }
  while token = "+" or token = "−"
    case token of
      "+": match("+");
           newtemp := makeOpNode("+");
           leftChild(newtemp) := temp;
           rightChild(newtemp) := term();
           temp := newtemp;
      "−": match("−");
           newtemp := makeOpNode("−");
           leftChild(newtemp) := temp;
           rightChild(newtemp) := term();
           temp := newtemp;
    end
  end
  return temp;
end
185 / 330
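The same tree-building idea fits in a few lines of Python, with tuples (op, left, right) standing in for makeOpNode/leftChild/rightChild; the flat token list is a simplifying assumption:

```python
# Building a left-associative expression tree while scanning left-to-right:
# each new operator node takes the tree built so far as its left child.
def parse_exp(tokens):
    """tokens: list like [3, "-", 4, "-", 5]; returns a nested tuple tree."""
    it = iter(tokens)
    tree = next(it)                  # a term (just a number here)
    for op in it:
        right = next(it)
        tree = (op, tree, right)     # new op node becomes the root
    return tree
```

For 3 − 4 − 5 this produces the tree for (3 − 4) − 5, i.e. the left-associative reading.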
Building an AST: conditionals
186 / 330
Building an AST: remarks and “invariant”
187 / 330
LL(1) parsing
• remember LL(1) grammars & LL(1) parsing principle:
• M[A, a] = w
• we assume: pure BNF
9
Often, the entry in the parse table does not contain a full rule as here; needed is only the right-hand side. In that case the table is of type ΣN × ΣT → (Σ∗ + error). We follow the convention of the book.
188 / 330
Construction of the parsing table
Table recipe
1. If A → α ∈ P and α ⇒∗ aβ, then add A → α to table entry M[A, a]
2. If A → α ∈ P and α ⇒∗ ε and S $ ⇒∗ βAaγ (where a is a token (= terminal) or $), then add A → α to table entry M[A, a]
Table recipe (again, now using our old friends First and
Follow )
Assume A → α ∈ P.
1. If a ∈ First(α), then add A → α to M[A, a].
2. If α is nullable and a ∈ Follow (A), then add A → α to M[A, a].
189 / 330
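The second recipe translates directly into code. A Python sketch, assuming First and Follow sets are given as dicts and "ε" marks the empty word; in an LL(1) grammar every table entry holds at most one rule, so conflicts would show up as entries with more than one:

```python
# LL(1) table construction from First and Follow sets.
EPS = "ε"

def ll1_table(grammar, terminals, first, follow):
    def first_of(word):
        """First set of a right-hand side alpha."""
        out = set()
        for X in word:
            fx = {X} if X in terminals else first[X]
            out |= fx - {EPS}
            if EPS not in fx:
                return out
        return out | {EPS}           # alpha is nullable

    table = {}
    for A, rhss in grammar.items():
        for rhs in rhss:
            fw = first_of(rhs)
            for a in fw - {EPS}:     # rule 1: a in First(alpha)
                table.setdefault((A, a), []).append(rhs)
            if EPS in fw:            # rule 2: alpha nullable, a in Follow(A)
                for a in follow[A]:
                    table.setdefault((A, a), []).append(rhs)
    return table
```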
Example: if-statements
            First        Follow
stmt        other, if    $, else
if-stmt     if           $, else
else-part   else, ε      $, else
exp         0, 1         )
190 / 330
Example: if statement: “LL(1) parse table”
193 / 330
Expressions
Original grammar
exp → exp addop term ∣ term
addop → + ∣ −
term → term mulop factor ∣ factor
mulop → ∗
factor → ( exp ) ∣ number
            First        Follow
exp         (, number    $, )
exp ′       +, −, ε      $, )
addop       +, −         (, number
term        (, number    $, ), +, −
term′       ∗, ε         $, ), +, −
mulop       ∗            (, number
factor      (, number    $, ), +, −, ∗
194 / 330
Expressions
Left-rec removed
exp → term exp ′
exp ′ → addop term exp ′ ∣ ε
addop → + ∣ −
term → factor term′
term′ → mulop factor term′ ∣ ε
mulop → ∗
factor → ( exp ) ∣ n

            First        Follow
exp         (, number    $, )
exp ′       +, −, ε      $, )
addop       +, −         (, number
term        (, number    $, ), +, −
term′       ∗, ε         $, ), +, −
mulop       ∗            (, number
factor      (, number    $, ), +, −, ∗
196 / 330
Expressions: LL(1) parse table
197 / 330
Error handling
198 / 330
Error messages
• important:
  • try to avoid error messages that only occur because of an already reported error!
  • report the error as early as possible, if possible at the first point where the program cannot be extended to a correct program.
  • make sure that, after an error, one doesn’t end up in an infinite loop without reading any input symbols.
• What’s a good error message?
  • assume: the method factor() chooses the alternative ( exp ), but when control returns from method exp(), it does not find a )
  • one could report: left parenthesis missing
  • but this may often be confusing, e.g. if the program text is: ( a + b c )
  • here the exp() method will terminate after ( a + b, as c cannot extend the expression. You should therefore rather give the message error in expression or left parenthesis missing.
199 / 330
Handling of syntax errors using recursive descent
200 / 330
Syntax errors with sync stack
201 / 330
Procedures for expression with "error recovery"
202 / 330
Outline
1. Parsing
First and follow sets
Top-down parsing
Bottom-up parsing
References
203 / 330
Bottom-up parsing: intro
(diagram, as before: within the class of all CF grammars, the unambiguous ones contain the hierarchies LL(0) ⊆ LL(1) ⊆ LL(k) and LR(0) ⊆ SLR ⊆ LALR(1) ⊆ LR(1) ⊆ LR(k), with LL(k) contained in LR(k); ambiguous grammars lie outside)
205 / 330
LR-parsing and its subclasses
• right-most derivation (but left-to-right parsing)
• in general: bottom-up parsing more powerful than top-down
• typically: tool-supported (unlike recursive descent, which may
well be hand-coded)
• based on parsing tables + explicit stack
• thankfully: left-recursion no longer problematic
• typical tools: yacc and its descendants (like bison, CUP, etc)
• another name: shift-reduce parser
• stack alphabet: tokens + non-terminals
206 / 330
Example grammar
S′ → S
S → ABt7 ∣ . . .
A → t4 t5 ∣ t1 B ∣ . . .
B → t2 t3 ∣ At6 ∣ . . .
10
That will later be relied upon when constructing a DFA for “scanning” the
stack, to control the reactions of the stack machine. This restriction leads to a
unique, well-defined initial state.
207 / 330
Parse tree for t1 . . . t7
(parse tree for t1 . . . t7 : S′ over S; S over A, B, and t7 ; A → t1 B with B → t2 t3 ; B → A t6 with A → t4 t5 . Also shown: the parse tree for number ∗ number in the expression grammar.)
210 / 330
Bottom-up parse: Growing the parse tree
(the parse tree for number ∗ number, grown bottom-up step by step from the leaves of the input number ∗ number)
211 / 330
Reduction in reverse = right derivation
• underlined part:
• different in reduction vs. derivation
• represents the “part being replaced”
• for derivation: right-most non-terminal
• for reduction: indicates the so-called handle (or part of it)
• consequently: all intermediate words are right-sentential forms
217 / 330
Handle
Definition (Handle)
Assume S ⇒∗r αAw ⇒r αβw . A production A → β at position k
following α is a handle of αβw. We write ⟨A → β, k⟩ for such a
handle.
Note:
• w (right of a handle) contains only terminals
• w : corresponds to the future input still to be parsed!
• αβ will correspond to the stack content (β the part touched
by reduction step).
• the ⇒r -derivation-step in reverse:
• one reduce-step in the LR-parser-machine
• adding (implicitly in the LR-machine) a new parent to children
β (= bottom-up!)
• “handle”-part β can be empty (= ε)
218 / 330
Schematic picture of parser machine (again)
(figure: schematic parser machine: reading “head” moving left-to-right over the input tape . . . if 1 + 2 ∗ ( 3 + 4 ) . . . ; finite control with states q0 , . . . , qn ; unbounded extra memory in the form of a stack)
219 / 330
General LR “parser machine” configuration
• Stack:
• contains: terminals + non-terminals (+ $)
• containing: what has been read already but not yet “processed”
• position on the “tape” (= token stream)
• represented here as word of terminals not yet read
• end of “rest of token stream”: $, as usual
• state of the machine
• in the following schematic illustrations: not yet part of the
discussion
• later: part of the parser table, currently we explain without
referring to the state of the parser-engine
• currently we assume: tree and rest of the input given
• the trick ultimately will be: how to achieve the same without
that tree already given (just parsing left-to-right)
220 / 330
Schematic run (reduction: from top to bottom)
$            t1 t2 t3 t4 t5 t6 t7 $
$ t1         t2 t3 t4 t5 t6 t7 $
$ t1 t2      t3 t4 t5 t6 t7 $
$ t1 t2 t3   t4 t5 t6 t7 $
$ t1 B       t4 t5 t6 t7 $
$ A          t4 t5 t6 t7 $
$ A t4       t5 t6 t7 $
$ A t4 t5    t6 t7 $
$ A A        t6 t7 $
$ A A t6     t7 $
$ A B        t7 $
$ A B t7     $
$ S          $
$ S′         $
221 / 330
2 basic steps: shift and reduce
Shift: move the next input symbol (terminal) over to the top of
the stack (“push”).
Reduce: remove the symbols of the right-most subtree from the
stack and replace it by the non-terminal at the root of the
subtree (replace = “pop + push”).
• easy to do if one has the parse tree already!
• reduce step: popped resp. pushed part = right- resp. left-hand
side of handle
222 / 330
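The two steps can be sketched as plain stack operations; a minimal sketch (ours, not from the slides), assuming the correct action sequence is already known, e.g. read off a given parse tree, for the grammar E′ → E, E → E + n ∣ n:

```python
# Minimal sketch (ours, not from the slides): shift and reduce as pure
# stack operations. Grammar: E' -> E, E -> E + n | n. The action
# sequence is assumed known, e.g. read off a given parse tree.
def shift(stack, tokens):
    # move the next input token onto the stack ("push")
    stack.append(tokens.pop(0))

def reduce_by(stack, lhs, rhs_len):
    # pop the handle's right-hand side, push the non-terminal
    # (replace = "pop + push")
    del stack[len(stack) - rhs_len:]
    stack.append(lhs)

stack, tokens = [], list("n+n")
shift(stack, tokens)        # $ n      | + n $
reduce_by(stack, "E", 1)    # $ E      | + n $   (E -> n)
shift(stack, tokens)        # $ E +    | n $
shift(stack, tokens)        # $ E + n  | $
reduce_by(stack, "E", 3)    # $ E      | $       (E -> E + n)
reduce_by(stack, "E'", 1)   # $ E'     | $       (E' -> E)
```

The hard part, addressed on the following slides, is deciding which of the two steps to apply without the tree already given.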
Example: LR parsing for addition (given the tree)
E′ → E
E → E +n ∣ n
(figure: parse tree for n + n, rooted at E′)
parse stack   input    action
1  $          n + n $  shift
2  $ n        + n $    red.: E → n
3  $ E        + n $    shift
4  $ E +      n $      shift
5  $ E + n    $        reduce E → E + n
6  $ E        $        red.: E′ → E
7  $ E′       $        accept
S′ → S
S → ( S ) S ∣ ε
side remark: unlike previous grammar, here:
• production with two non-terminals in the right
⇒ difference between left-most and right-most derivations (and
mixed ones)
224 / 330
Parentheses: tree, run, and right-most derivation
(figure: parse tree for ( ), rooted at S′)
parse stack   input   action
1  $          ( ) $   shift
2  $ (        ) $     reduce S → ε
3  $ ( S      ) $     shift
4  $ ( S )    $       reduce S → ε
5  $ ( S ) S  $       reduce S → ( S ) S
6  $ S        $       reduce S′ → S
7  $ S′       $       accept
Note: the 2 reduction steps for the ε-production S → ε (rows 2 and 4)
Right-most derivation and right-sentential forms
S ′ ⇒r S ⇒r ( S ) S ⇒r ( S ) ⇒r ( )
225 / 330
Right-sentential forms & the stack
Right-sentential form: right-most derivation
S ⇒∗r α
• right-sentential forms:
• part of the “run”
• but: split between stack and input
parse stack   input    action
1  $          n + n $  shift
2  $ n        + n $    red.: E → n
3  $ E        + n $    shift
4  $ E +      n $      shift
5  $ E + n    $        reduce E → E + n
6  $ E        $        red.: E′ → E
7  $ E′       $        accept
right-most derivation: E′ ⇒r E ⇒r E + n ⇒r n + n
reduction sequence: n + n ↪ E + n ↪ E ↪ E′
split into stack ∥ input: E′ ⇒r E ⇒r E + n ∼ E + ∥ n ∼ E ∥ + n ⇒r n ∥ + n ∼ ∥ n + n
226 / 330
Viable prefixes of right-sentential forms and handles
• right-sentential form: E + n
• viable prefixes of RSF
• prefixes of that RSF on the stack
• here: 3 viable prefixes of that RSF: E , E +, E + n
• handle: remember the definition earlier
• here: for instance in the sentential form n + n
• handle is production E → n on the left occurrence of n in
n + n (let’s write n1 + n2 for now)
• note: in the stack machine:
• the left n1 on the stack
• rest + n2 on the input (unread, because of LR(0))
• if the parser engine detects handle n1 on the stack, it does a
reduce-step
• However (later): reaction depends on current state of the
parser engine
227 / 330
A typical situation during LR-parsing
228 / 330
General design for an LR-engine
229 / 330
But what are the states of an LR-parser?
General idea:
Construct an NFA (and ultimately DFA) which works on the stack
(not the input). The alphabet consists of terminals and
non-terminals ΣT ∪ ΣN . The language
is regular!
230 / 330
LR(0) parsing as easy pre-stage
• LR(0): in practice too simple, but easy conceptual step
towards LR(1), SLR(1) etc.
• LR(1): in practice good enough, LR(k) not used for k > 1
LR(0) item
production with specific “parser position” . in its right-hand side
LR(0) item
A → β.γ
8 items
S′ → .S
S′ → S.
S → .(S )S
S → ( .S ) S
S → ( S. ) S
S → ( S ) .S
S → ( S ) S.
S → .
A → β.γ
• β on the stack
• γ: to be treated next (terminals on the input, but can contain
also non-terminals)
234 / 330
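The items can be enumerated mechanically by placing the dot at every position of every right-hand side; a small sketch (representation is ours), reproducing the 8 items of the parentheses grammar:

```python
# Sketch (representation ours: production = (lhs, rhs-tuple)): enumerate
# all LR(0) items of the augmented parentheses grammar by placing the
# dot at every position in every right-hand side.
productions = [
    ("S'", ("S",)),
    ("S", ("(", "S", ")", "S")),
    ("S", ()),                              # S -> epsilon
]

def lr0_items(prods):
    # item = (lhs, symbols before the dot, symbols after the dot)
    return [(lhs, rhs[:i], rhs[i:])
            for lhs, rhs in prods
            for i in range(len(rhs) + 1)]

items = lr0_items(productions)              # 2 + 5 + 1 = 8 items
```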
State transitions of the NFA
• X ∈Σ
• two kind of transitions
11
We have explained shift steps so far as: parser eats one terminal (= input
token) and pushes it on the stack.
235 / 330
Transitions for non-terminals and ε
• so far: we never pushed a non-terminal from the input to the
stack, we replace in a reduce-step the right-hand side by a
left-hand side
• however: the replacement in a reduce-step can be seen as
1. pop right-hand side off the stack,
2. instead, “assume” corresponding non-terminal on input &
3. eat the non-terminal and push it on the stack.
• two kinds of transitions
1. the ε-transitions correspond to the “pop” half
2. the X -transitions (for non-terminals) correspond to the
“eat-and-push” part
• assume production X → β and initial item X → .β
236 / 330
Initial and final states
initial states:
• we make our lives easier
• we assume (as said): one extra start symbol say S ′
(augmented grammar)
⇒ initial item S ′ → .S as (only) initial state
final states:
• NFA has a specific task, “scanning” the stack, not scanning
the input
• acceptance condition of the overall machine: a bit more
complex
• input must be empty
• stack must be empty except the (new) start symbol
• NFA has a word to say about acceptance
• but not in form of being in an accepting state
• so: no accepting states
• but: accepting action (see later)
237 / 330
NFA: parentheses
S
start S′ → .S S′ → S.
S→ .(S )S S→ . S→ ( S ) S.
(
S→ ( .S ) S S→ ( S. ) S S
S
)
S→ ( S ) .S
238 / 330
Remarks on the NFA
239 / 330
NFA: addition
E
start E′ → .E E′ → E.
n
E→ .E + n E→ .n E→ n.
E→ E. + n E→ E + .n E→ E + n.
+ n
240 / 330
Determinizing: from NFA to DFA
• standard subset-construction12
• states then contains sets of items
• especially important: ε-closure
• also: direct construction of the DFA possible
12
Technically, we don’t require here a total transition function, we leave out
any error state.
241 / 330
DFA: parentheses
0
S′ → .S 1
S
start S→ .(S )S S′ → S.
S→ .
( 2
S→ ( .S ) S 3
S
( S→ .(S )S S→ ( S. ) S
S→ .
)
4
(
S→ ( S ) .S 5
S
S→ .(S )S S→ ( S ) S.
S→ .
242 / 330
DFA: addition
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
243 / 330
Direct construction of an LR(0)-DFA
ε-closure
• if A → α.Bγ is an item in a state where
• there are productions B → β1 ∣ β2 . . . ⇒
• add items B → .β1 , B → .β2 . . . to the state
• continue that process, until saturation
initial state
S ′ → .S
start
plus closure
244 / 330
Direct DFA construction: transitions
...
A1 → α1 .X β1 A1 → α1 X .β1
X
... A2 → α2 X .β2
A2 → α2 .X β2 plus closure
...
245 / 330
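The direct DFA construction thus needs only two ingredients, the ε-closure of an item set and the transition (goto) function; a sketch for the parentheses grammar (item encoding is ours):

```python
# Sketch of the two ingredients of the direct LR(0)-DFA construction,
# for the parentheses grammar (item encoding ours: (lhs, rhs, dot)).
GRAMMAR = {
    "S'": [("S",)],
    "S": [("(", "S", ")", "S"), ()],        # second production: S -> epsilon
}

def closure(items):
    # saturate: for every item A -> alpha.B gamma add B -> .beta
    items = set(items)
    changed = True
    while changed:
        changed = False
        for lhs, rhs, dot in list(items):
            if dot < len(rhs) and rhs[dot] in GRAMMAR:
                for prod in GRAMMAR[rhs[dot]]:
                    if (rhs[dot], prod, 0) not in items:
                        items.add((rhs[dot], prod, 0))
                        changed = True
    return frozenset(items)

def goto(items, X):
    # move the dot over X in every item expecting X, then close again
    return closure({(l, r, d + 1)
                    for l, r, d in items if d < len(r) and r[d] == X})

state0 = closure({("S'", ("S",), 0)})       # initial state, plus closure
```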
How does the DFA do the shift/reduce and the rest?
246 / 330
Stack contents and state of the automaton
247 / 330
State transition allowing a shift
X → α.aβ
• construction thus has transition as follows
s t
... ...
a
X→ α.aβ X→ αa.β
... ...
• shift is possible
• if shift is the correct operation and a is terminal symbol
corresponding to the current token: state afterwards = t
248 / 330
State transition: analogous for non-terminals
s t
X → α.Bβ ... B ...
X→ α.Bβ X→ αB.β
249 / 330
State (not transition) where a reduce is possible
• remember: complete items (those with a dot . at the end)
• assume top state s containing complete item A → γ.
s
...
A→ γ.
E′ → E
E → E +n ∣ n
(figure: parse tree for n + n, rooted at E′)
parse stack   input    action
1  $          n + n $  shift
2  $ n        + n $    red.: E → n
3  $ E        + n $    shift
4  $ E +      n $      shift
5  $ E + n    $        reduce E → E + n
6  $ E        $        red.: E′ → E
7  $ E′       $        accept
252 / 330
DFA of addition example
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
253 / 330
LR(0) grammars
LR(0) grammar
The top-state alone determines the next step.
254 / 330
Simple parentheses
A → (A) ∣ a
(DFA, states 0–5:
0: A′ → .A , A → .( A ) , A → .a
1: A′ → A.
2: A → a.
3: A → ( .A ) , A → .( A ) , A → .a
4: A → ( A. )
5: A → ( A ).
transitions: 0 →A 1 , 0 →( 3 , 0 →a 2 ; 3 →( 3 , 3 →a 2 , 3 →A 4 ; 4 →) 5)
• for shift:
• many shift transitions in 1 state allowed
• shift counts as one action (including “shifts” on non-terms)
• but for reduction: also the production must be clear
255 / 330
Simple parentheses is LR(0)
(same DFA as on the previous slide)
state   possible action
0       only shift
1       only red. (with A′ → A)
2       only red. (with A → a)
3       only shift
4       only shift
5       only red. (with A → ( A ))
256 / 330
NFA for simple parentheses (bonus slide)
A
start A′ → .A A′ → A.
a
A→ .(A) A→ .a A→ a.
(
A→ ( .A ) A→ ( A. ) A→ (A).
A )
257 / 330
Parsing table for an LR(0) grammar
• table structure: slightly different for SLR(1), LALR(1), and
LR(1) (see later)
• note: the “goto” part: “shift” on non-terminals (only 1
non-terminal A here)
• corresponding to the A-labelled transitions
• see the parser run on the next slide
1 $0 ((a))$ shift
2 $ 0 (3 (a))$ shift
3 $ 0 (3 (3 a))$ shift
4 $ 0 (3 (3 a 2 ))$ reduce A → a
5 $ 0 (3 (3 A 4 ))$ shift
6 $ 0 (3 (3 A 4 )5 )$ reduce A → ( A )
7 $ 0 (3 A 4 )$ shift
8 $ 0 (3 A 4 )5 $ reduce A → ( A )
9 $0 A 1 $ accept
(figure: resulting parse tree for ( ( a ) ), rooted at A′)
• As said:
• the reduction “contains” the parse-tree
• reduction: builds it bottom up
• reduction in reverse: contains a right-most derivation (which is
“top-down”)
• accept action: corresponds to the parent-child edge A′ → A of
the tree
260 / 330
Parsing of erroneous input
• empty slots in the table: “errors”
Invariant
important general invariant for LR-parsing: never shift something
“illegal” onto the stack
261 / 330
LR(0) parsing algo, given DFA
262 / 330
DFA parentheses again: LR(0)?
S′ → S
S → (S )S ∣
0
S′ → .S 1
S
start S→ .(S )S S′ → S.
S→ .
( 2
S→ ( .S ) S 3
S
( S→ .(S )S S→ ( S. ) S
S→ .
)
4
(
S→ ( S ) .S 5
S
S→ .(S )S S→ ( S ) S.
S→ .
263 / 330
DFA parentheses again: LR(0)?
S′ → S
S → (S )S ∣
0
S′ → .S 1
S
start S→ .(S )S S′ → S.
S→ .
( 2
S→ ( .S ) S 3
S
( S→ .(S )S S→ ( S. ) S
S→ .
)
4
(
S→ ( S ) .S 5
S
S→ .(S )S S→ ( S ) S.
S→ .
E′ → E
E → E + number ∣ number
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
265 / 330
DFA addition again: LR(0)?
E′ → E
E → E + number ∣ number
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
266 / 330
Decision? If only we knew the ultimate tree already . . .
. . . especially the parts still to come
(figure: the would-be parse tree, rooted at E′)
267 / 330
Addition grammar (again)
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
268 / 330
One look-ahead
269 / 330
Resolving LR(0) reduce/reduce conflicts
270 / 330
Resolving LR(0) reduce/reduce conflicts
271 / 330
Resolving LR(0) shift/reduce conflicts
272 / 330
Resolving LR(0) shift/reduce conflicts
273 / 330
Revisit addition one more time
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
• Follow (E ′ ) = {$}
⇒ • shift for +
• reduce with E ′ → E for $ (which corresponds to accept, in
case the input is empty)
274 / 330
SLR(1) algo
let s be the current state, on top of the parse stack
1. s contains A → α.X β, where X is a terminal and X is the next
token on the input, then
• shift X from input to top of stack. the new state pushed on
the stack: state t with a transition s →X t 14
2. s contains a complete item (say A → γ.) and the next token in
the input is in Follow (A): reduce by rule A → γ:
• A reduction by S ′ → S: accept, if input is empty15
• else:
pop: remove γ (including “its” states) from the stack
back up: assume to be in state u which is now head state
push: push A to the stack, new head state t with a
transition u →A t
3. if next token is such that neither 1. nor 2. applies: error
3. if next token is such that neither 1. or 2. applies: error
14
Cf. to the LR(0) algo: since we checked the existence of the transition
before, the else-part is missing now.
15
Cf. to the LR(0) algo: This happens now only if next token is $. Note that
the follow set of S ′ in the augmented grammar is always only $
275 / 330
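The SLR(1) algorithm can be sketched as a table-driven loop; the action/goto tables below are written out by hand from the addition DFA, and the encoding ("s" = shift, "r" = reduce, "acc" = accept) is ours:

```python
# Table-driven SLR(1) loop, sketched for E' -> E, E -> E + n | n.
# ACTION/GOTO are written out by hand from the DFA on the slides;
# the encoding ("s" = shift, "r" = reduce, "acc" = accept) is ours.
ACTION = {
    (0, "n"): ("s", 2),
    (1, "+"): ("s", 3), (1, "$"): "acc",       # reduce E' -> E = accept
    (2, "+"): ("r", "E", 1), (2, "$"): ("r", "E", 1),
    (3, "n"): ("s", 4),
    (4, "+"): ("r", "E", 3), (4, "$"): ("r", "E", 3),
}
GOTO = {(0, "E"): 1}

def parse(tokens):
    tokens = list(tokens) + ["$"]
    states = [0]                               # stack of DFA states
    while True:
        act = ACTION.get((states[-1], tokens[0]))
        if act == "acc":
            return True
        if act is None:                        # empty slot: error
            return False
        if act[0] == "s":                      # shift: push state, eat token
            states.append(act[1])
            tokens.pop(0)
        else:                                  # reduce: pop |rhs|, then goto
            _, lhs, rhs_len = act
            del states[len(states) - rhs_len:]
            states.append(GOTO[(states[-1], lhs)])
```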
LR(0) parsing algo, given DFA
276 / 330
Parsing table for SLR(1)
0
1
E′ → .E
E E′ → E.
start E→ .E + n
E→ E. + n
E→ .n
n +
2 3 4
n
E→ n. E→ E + .n E→ E + n.
16
by which it, strictly speaking, would no longer be an SLR(1)-table :-)
278 / 330
SLR(1) parser run (= “reduction”)
1 $0 n+n+n$ shift: 2
2 $ 0 n2 +n+n$ reduce: E → n
3 $0 E1 +n+n$ shift: 3
4 $0 E1 +3 n+n$ shift: 4
5 $0 E1 +3 n4 +n$ reduce: E → E + n
6 $0 E1 +n$ shift: 3
7 $0 E1 +3 n$ shift: 4
8 $0 E1 +3 n4 $ reduce: E → E + n
9 $0 E1 $ accept
279 / 330
Corresponding parse tree
(figure: parse tree for n + n + n, rooted at E′)
280 / 330
Revisit the parentheses again: SLR(1)?
0
S′ → .S 1
S
start S→ .(S )S S′ → S.
S→ .
( 2
S→ ( .S ) S 3
S
( S→ .(S )S S→ ( S. ) S
S→ .
)
4
(
S→ ( S ) .S 5
S
S→ .(S )S S→ ( S ) S.
S→ . 281 / 330
SLR(1) parse table
282 / 330
Parentheses: SLR(1) parser run (= “reduction”)
284 / 330
Ambiguity & LR-parsing
• in principle: LR(k) (and LL(k)) grammars: unambiguous
• definition/construction: free of shift/reduce and reduce/reduce
conflict (given the chosen level of look-ahead)
• However: ambiguous grammar tolerable, if (remaining)
conflicts can be solved “meaningfully” otherwise:
286 / 330
Simplified conditionals
Follow-sets
Follow
S′   {$}
S    {$, else}
I    {$, else}
287 / 330
DFA of LR(0) items
(DFA, states 0–7:
0: S′ → .S , S → .I , S → .other , I → .if S , I → .if S else S
1: S′ → S.
2: S → I.
3: S → other.
4: I → if .S , I → if .S else S , S → .I , S → .other , I → .if S , I → .if S else S
5: I → if S. , I → if S .else S
6: I → if S else .S , S → .I , S → .other , I → .if S , I → .if S else S
7: I → if S else S.
transitions: 0 →S 1 , 0 →I 2 , 0 →other 3 , 0 →if 4 ; 4 →S 5 , 4 →I 2 , 4 →other 3 , 4 →if 4 ; 5 →else 6 ; 6 →S 7 , 6 →I 2 , 6 →other 3 , 6 →if 4)
Simple conditionals: parse table
289 / 330
Parser run (= reduction)
(figure: parse tree with nested I and S nodes built during the reduction)
292 / 330
Use of ambiguous grammars
E′ → E
E → E + E ∣ E ∗ E ∣ number
293 / 330
DFA for + and ×
(DFA, states 0–6:
0: E′ → .E , E → .E + E , E → .E ∗ E , E → .n
1: E′ → E. , E → E. + E , E → E. ∗ E
2: E → n.
3: E → E + .E , E → .E + E , E → .E ∗ E , E → .n
4: E → E ∗ .E , E → .E + E , E → .E ∗ E , E → .n
5: E → E + E. , E → E. + E , E → E. ∗ E
6: E → E ∗ E. , E → E. + E , E → E. ∗ E
transitions: 0 →E 1 , 0 →n 2 ; 1 →+ 3 , 1 →∗ 4 ; 3 →E 5 , 3 →n 2 ; 4 →E 6 , 4 →n 2 ; 5 →+ 3 , 5 →∗ 4 ; 6 →+ 3 , 6 →∗ 4)
294 / 330
States with conflicts
• state 5
• stack contains ...E + E
• for input $: reduce, since shift not allowed from $
• for input +: reduce, as + is left-associative
• for input ∗: shift, as ∗ has precedence over +
• state 6:
• stack contains ...E ∗ E
• for input $: reduce, since shift not allowed from $
• for input +: reduce, as ∗ has precedence over +
• for input ∗: reduce, as ∗ is left-associative
• see also the table on the next slide
295 / 330
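The decisions in states 5 and 6 follow a simple precedence/associativity scheme, as used by yacc-style %left declarations; a sketch (names and encoding are ours):

```python
# Sketch of yacc-style shift/reduce conflict resolution for the
# ambiguous grammar E -> E + E | E * E | n (names and encoding ours).
# op_on_stack: operator of the handle E op E on the stack;
# lookahead: next input operator (or "$").
PREC = {"+": 1, "*": 2}

def resolve(op_on_stack, lookahead):
    if lookahead == "$":
        return "reduce"        # no shift possible on $
    if PREC[lookahead] > PREC[op_on_stack]:
        return "shift"         # lookahead binds tighter
    return "reduce"            # lower or equal precedence: left-assoc

# state 5 (stack ... E + E): shift only on *, reduce on + and $
# state 6 (stack ... E * E): reduce on +, * and $
```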
Parse table + and ×
296 / 330
For comparison: unambiguous grammar for + and ∗
Follow
E′ {$} (as always for start symbol)
E {$, +}
T {$, +, ∗}
297 / 330
DFA for unambiguous + and ×
(DFA, states 0–7:
0: E′ → .E , E → .E + T , E → .T , T → .T ∗ n , T → .n
1: E′ → E. , E → E. + T
2: E → E + .T , T → .T ∗ n , T → .n
3: T → n.
4: E → T. , T → T. ∗ n
5: T → T ∗ .n
6: E → E + T. , T → T. ∗ n
7: T → T ∗ n.
transitions: 0 →E 1 , 0 →T 4 , 0 →n 3 ; 1 →+ 2 ; 2 →T 6 , 2 →n 3 ; 4 →∗ 5 ; 5 →n 7 ; 6 →∗ 5)
298 / 330
DFA remarks
299 / 330
LR(1) parsing
A help to remember
SLR(1) is “improved” LR(0) parsing; LALR(1) is “crippled” LR(1)
parsing.
300 / 330
Limits of SLR(1) grammars
301 / 330
non-SLR(1): Reduce/reduce conflict
302 / 330
Situation can be saved: more look-ahead
303 / 330
LALR(1) (and LR(1)): Being more precise with the
follow-sets
LR(1) items
[A → α.β, a] (9)
• a: terminal/token, including $
18
Not to mention if we wanted look-ahead of k > 1, which in practice is not
done, though.
305 / 330
LALR(1)-DFA (or LR(1)-DFA)
306 / 330
Remarks on the DFA
307 / 330
Full LR(1) parsing
SLR(1) LALR(1)
LR(0)-item-based parsing, with LR(1)-item-based parsing, but
afterwards adding some extra afterwards throwing away
“pre-compiled” info (about precision by collapsing states,
follow-sets) to increase expressivity to save space
308 / 330
LR(1) transitions: arbitrary symbol
X -transition
X
[A → α.X β, a] [A → αX .β, a]
309 / 330
LR(1) transitions: ε
ε-transition
for all
B → β1 ∣ β2 . . . and all b ∈ First(γa):
[ A → α.Bγ , a ] →ε [ B → .β , b ]
special case (γ = ε):
[ A → α.B , a ] →ε [ B → .β , a ]
310 / 330
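The look-ahead b ranges over First(γa); a small sketch of that computation (FIRST and NULLABLE for single symbols are assumed precomputed, names are ours):

```python
# Sketch of First(gamma a), as needed for the epsilon-transitions of the
# LR(1)-item construction. FIRST and NULLABLE for single symbols are
# assumed precomputed (values here are for E' -> E, E -> E + n | n).
FIRST = {"n": {"n"}, "+": {"+"}, "E": {"n"}}
NULLABLE = {"E": False}

def first_seq(gamma, a):
    # First of the string gamma followed by the look-ahead token a
    out = set()
    for X in gamma:
        out |= FIRST[X]
        if not NULLABLE.get(X, False):
            return out                 # X cannot vanish: done
    out.add(a)                         # all of gamma can derive epsilon
    return out
```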
LALR(1) vs LR(1)
LR(1)
LALR(1)
311 / 330
Core of LR(1)-states
312 / 330
LALR(1)-DFA as a collapse
313 / 330
Concluding remarks of LR / bottom up parsing
19
If designing a new language, there’s also the option to massage the
language itself. Note also: there are inherently ambiguous languages for which
there is no unambiguous grammar.
314 / 330
LR/bottom-up parsing overview
LR(0)
• advantages: defines states also used by SLR and LALR
• remarks: not really used, many conflicts, very weak
SLR(1)
• advantages: clear improvement over LR(0) in expressiveness,
even if using the same number of states; table typically with
50K entries
• remarks: weaker than LALR(1), but often good enough; OK for
hand-made parsers for small grammars
LALR(1)
• advantages: almost as expressive as LR(1), but number of
states as LR(0)!
• remarks: method of choice for most generated LR-parsers
LR(1)
• advantages: the method covering all bottom-up,
one-look-ahead parseable grammars
• remarks: large number of states (typically 11M entries),
mostly LALR(1) preferred
Remember: once the table (specific for LR(0), . . . ) is set up, the parsing
algorithms all work the same
315 / 330
Error handling
316 / 330
Error handling
Minimal requirement
Upon “stumbling over” an error (= deviation from the grammar):
give a reasonable & understandable error message, indicating also
error location. Potentially stop parsing.
317 / 330
Error messages
• important:
• avoid error messages that only occur because of an already
reported error!
• report error as early as possible, if possible at the first point
where the program cannot be extended to a correct program.
• make sure that, after an error, one doesn’t end up in an
infinite loop without reading any input symbols.
• What’s a good error message?
• assume: the method factor() chooses the alternative
( exp ), but when control returns from method exp(), it
does not find a )
• one could report: left parenthesis missing
• But this may often be confusing, e.g. if the program text
is: ( a + b c )
• here the exp() method will terminate after ( a + b, as c
cannot extend the expression. You should therefore rather
give the message error in expression or left
parenthesis missing.
318 / 330
Error recovery in bottom-up parsing
• panic recovery in LR-parsing
• simple form
• the only one we shortly look at
• upon error: recovery ⇒
• pops parts of the stack
• ignore parts of the input
• until “on track again”
• but: how to do that?
• additional problem: non-determinism
• table: constructed conflict-free under normal operation
• upon error (and clearing parts of the stack + input): no
guarantee it’s clear how to continue
⇒ heuristic needed (like panic mode recovery)
320 / 330
Possible error situation
321 / 330
Panic mode recovery
Algo
1. Pop states from the stack until a state is found with non-empty
goto entries
2. • If there is a legal action on the current input token from one of
the goto-states, push the token on the stack and restart the parse.
• If there are several such states: prefer a shift to a reduce
• Among possible reduce actions: prefer one whose associated
non-terminal is least general
3. if there is no legal action on the current input token from any of
the goto-states: advance the input until there is a legal action (or
until the end of the input is reached)
322 / 330
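The three steps can be sketched as follows (toy goto/action tables, ours, not a real yacc table):

```python
# Panic-mode sketch with toy tables (ours, not a real yacc table).
# GOTO: (state, non-terminal) -> state; ACTION: (state, token) -> action.
GOTO = {(0, "exp"): 1}
ACTION = {(0, "n"): "shift", (1, "+"): "shift", (1, "$"): "accept"}

def panic_recover(states, tokens):
    # 1. pop states until one with a non-empty goto entry remains
    while states and all(s != states[-1] for s, _ in GOTO):
        states.pop()
    goto_states = [t for (s, _), t in GOTO.items() if s == states[-1]]
    # 3. advance the input until some goto-state has a legal action
    while tokens and all((g, tokens[0]) not in ACTION for g in goto_states):
        tokens.pop(0)
    if tokens:
        # 2. restart the parse from such a goto-state
        states.append(next(g for g in goto_states
                           if (g, tokens[0]) in ACTION))
    return states, tokens
```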
Example again
323 / 330
Example again
324 / 330
Panic mode may loop forever
325 / 330
Typical yacc parser table
some variant of the expression grammar again
command → exp
exp → exp + term ∣ term
term → term ∗ factor ∣ factor
factor → number ∣ ( exp )
326 / 330
Panicking and looping
328 / 330
Outline
1. Parsing
First and follow sets
Top-down parsing
Bottom-up parsing
References
329 / 330
References I
330 / 330