Syntax Analysis: COP5621 Compiler Construction
Syntax Analysis
Part I
Chapter 4
The Parser
• A parser implements a context-free (C-F) grammar
• The role of the parser is twofold:
1. To check syntax (= string recognizer)
– And to report syntax errors accurately
2. To invoke semantic actions
– For static semantics checking, e.g. type checking of
expressions, functions, etc.
– For syntax-directed translation of the source code to
an intermediate representation
Syntax-Directed Translation
• One of the major roles of the parser is to produce
an intermediate representation (IR) of the source
program using syntax-directed translation methods
• Possible IR output:
– Abstract syntax trees (ASTs)
– Control-flow graphs (CFGs) with triples, three-address
code, or register transfer list notation
– WHIRL (SGI Pro64 compiler) has 5 IR levels!
Error Handling
• A good compiler should assist in identifying and
locating errors
– Lexical errors: important, compiler can easily recover
and continue
– Syntax errors: most important for compiler, can almost
always recover
– Static semantic errors: important, can sometimes
recover
– Dynamic semantic errors: hard or impossible to detect
at compile time, runtime checks are required
– Logical errors: hard or impossible to detect
Viable-Prefix Property
• The viable-prefix property of parsers allows
early detection of syntax errors
– Goal: detection of an error as soon as possible
without further consuming unnecessary input
– How: detect an error as soon as the prefix of the
input does not match a prefix of any string in
the language
[Figure: two erroneous inputs, "… for (;) …" and "… DO 10 I = 1;0 …", each annotated with the prefix read so far and the point at which the error is detected.]
Grammars (Recap)
• A grammar is a 4-tuple
G = (N, T, P, S) where
– T is a finite set of tokens (terminal symbols)
– N is a finite set of nonterminals
– P is a finite set of productions of the form
α → β
where α ∈ (N∪T)* N (N∪T)* and β ∈ (N∪T)*
(in a context-free grammar every left-hand side is a single nonterminal: α ∈ N)
– S ∈ N is a designated start symbol
Derivations (Recap)
• The one-step derivation is defined by
α A β ⇒ α γ β
where A → γ is a production in the grammar
• In addition, we define
– ⇒ is leftmost ⇒lm if α does not contain a nonterminal
– ⇒ is rightmost ⇒rm if β does not contain a nonterminal
– Reflexive transitive closure ⇒* (zero or more steps)
– Transitive closure ⇒+ (one or more steps)
• The language generated by G is defined by
L(G) = {w ∈ T* | S ⇒+ w}
Derivation (Example)
Grammar G = ({E}, {+,*,(,),-,id}, P, E) with
productions P =
E → E + E
E → E * E
E → ( E )
E → - E
E → id
Example derivations:
E ⇒ - E ⇒ - id
E ⇒rm E + E ⇒rm E + id ⇒rm id + id
E ⇒* E
E ⇒* id + id
E ⇒+ id * id + id
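The one-step relation can be made concrete with a small sketch. The helper names `leftmost_step` and `derive` are illustrative only; the code rewrites the leftmost nonterminal of a sentential form, so it reproduces a leftmost derivation of id + id from the grammar above.

```python
NONTERMINALS = {"E"}

def leftmost_step(form, lhs, rhs):
    """Rewrite the leftmost nonterminal of `form` with production lhs -> rhs."""
    for i, symbol in enumerate(form):
        if symbol in NONTERMINALS:
            if symbol != lhs:
                raise ValueError(f"leftmost nonterminal is {symbol}, not {lhs}")
            return form[:i] + rhs + form[i + 1:]
    raise ValueError("sentential form contains no nonterminal")

def derive(start, productions):
    """Return every sentential form of the leftmost derivation."""
    trace = [[start]]
    for lhs, rhs in productions:
        trace.append(leftmost_step(trace[-1], lhs, rhs))
    return trace

# E =>lm E + E =>lm id + E =>lm id + id
trace = derive("E", [("E", ["E", "+", "E"]), ("E", ["id"]), ("E", ["id"])])
for form in trace:
    print(" ".join(form))
```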
Chomsky Hierarchy
Examples:
Every finite language is regular! (construct an FSA for strings in L(G))
L1 = { a^n b^n | n ≥ 1 } is context free
L2 = { a^n b^n c^n | n ≥ 1 } is context sensitive
Parsing
• Universal (any C-F grammar)
– Cocke-Younger-Kasami
– Earley
• Top-down (C-F grammar with restrictions)
– Recursive descent (predictive parsing)
– LL (Left-to-right, Leftmost derivation) methods
• Bottom-up (C-F grammar with restrictions)
– Operator precedence parsing
– LR (Left-to-right, Rightmost derivation) methods
• SLR, canonical LR, LALR
Top-Down Parsing
• LL methods (Left-to-right, Leftmost
derivation) and recursive-descent parsing
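Grammar:
E → T + T
T → ( E )
T → - E
T → id
Leftmost derivation:
E ⇒lm T + T
⇒lm id + T
⇒lm id + id
[Figure: four snapshots of the parse tree for id + id, grown top-down in step with the derivation: the root E, then E with children T + T, then the left T expanded to id, then the right T expanded to id.]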
Immediate Left-Recursion Elimination
Rewrite every left-recursive production
A → A α
| β
| γ
| A δ
into a right-recursive production:
A → β AR
| γ AR
AR → α AR
| δ AR
| ε
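The rewrite rule above can be sketched as a short function. This is an illustration under assumed conventions, not a reference implementation: a production body is a list of symbols, the empty list stands for ε, and the new nonterminal is named by appending "R" as on the slide.

```python
def eliminate_immediate(nt, alternatives):
    """Remove immediate left recursion from the productions of `nt`.

    Each alternative is a list of symbols; [] denotes epsilon.
    """
    # Tails alpha of the left-recursive alternatives nt -> nt alpha
    recursive = [alt[1:] for alt in alternatives if alt[:1] == [nt]]
    if not recursive:
        return {nt: alternatives}
    # Alternatives beta that do not start with nt
    others = [alt for alt in alternatives if alt[:1] != [nt]]
    new_nt = nt + "R"
    return {
        nt: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }

# A -> A alpha | beta | gamma | A delta
result = eliminate_immediate(
    "A", [["A", "alpha"], ["beta"], ["gamma"], ["A", "delta"]]
)
```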
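Example: general left-recursion elimination with the nonterminals ordered A, B, C (i = 1, 2, 3), for the grammar
A → B C | a
B → C A | A b
C → A B | C C | a
i = 1:
nothing to do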
i = 2, j = 1:
B → C A | A b
⇒
B → C A | B C b | a b
⇒(imm)
B → C A BR | a b BR
BR → C b BR | ε
i = 3, j = 1:
C → A B | C C | a
⇒
C → B C B | a B | C C | a
i = 3, j = 2:
C → B C B | a B | C C | a
⇒
C → C A BR C B | a b BR C B | a B | C C | a
⇒(imm)
C → a b BR C B CR | a B CR | a CR
CR → A BR C B CR | C CR | ε
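The steps of this example can be replayed mechanically. The sketch below assumes the starting grammar A → B C | a, B → C A | A b, C → A B | C C | a (inferred from the substitutions shown) and the order A, B, C; the function names are illustrative. For each Ai it substitutes the productions of every earlier Aj into alternatives that start with Aj, then removes immediate left recursion.

```python
def eliminate_immediate(nt, alts):
    """Remove immediate left recursion; [] denotes epsilon."""
    recursive = [alt[1:] for alt in alts if alt[:1] == [nt]]
    if not recursive:
        return {nt: alts}
    others = [alt for alt in alts if alt[:1] != [nt]]
    new_nt = nt + "R"
    return {nt: [alt + [new_nt] for alt in others],
            new_nt: [tail + [new_nt] for tail in recursive] + [[]]}

def eliminate_left_recursion(grammar, order):
    g = {nt: [list(alt) for alt in alts] for nt, alts in grammar.items()}
    for i, a_i in enumerate(order):
        for a_j in order[:i]:
            expanded = []
            for alt in g[a_i]:
                if alt[:1] == [a_j]:
                    # Replace Ai -> Aj gamma by Ai -> delta gamma for each Aj -> delta
                    expanded += [list(delta) + alt[1:] for delta in g[a_j]]
                else:
                    expanded.append(alt)
            g[a_i] = expanded
        g.update(eliminate_immediate(a_i, g[a_i]))
    return g

GRAMMAR = {
    "A": [["B", "C"], ["a"]],
    "B": [["C", "A"], ["A", "b"]],
    "C": [["A", "B"], ["C", "C"], ["a"]],
}
result = eliminate_left_recursion(GRAMMAR, ["A", "B", "C"])
```

The resulting productions for B, BR, C and CR agree with the hand derivation above.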
Left Factoring
• When a nonterminal has two or more productions
whose right-hand sides start with the same
grammar symbols, the grammar is not LL(1) and
cannot be used for predictive parsing
• Replace productions
A → α β1 | α β2 | … | α βn | γ
with
A → α AR | γ
AR → β1 | β2 | … | βn
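A single left-factoring step might look as follows. This is a sketch under assumptions: alternatives are lists of symbols, [] is ε, only the first group of alternatives sharing a leading symbol is factored, and the function would have to be reapplied until nothing changes. The names `common_prefix` and `left_factor_once` are hypothetical.

```python
def common_prefix(sequences):
    """Longest prefix shared by all symbol lists."""
    prefix = []
    for symbols in zip(*sequences):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix

def left_factor_once(nt, alternatives):
    """Factor the first group of alternatives that start with the same symbol."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)
    for first_symbol, group in groups.items():
        if first_symbol is not None and len(group) > 1:
            alpha = common_prefix(group)
            new_nt = nt + "R"
            rest = [alt for alt in alternatives if alt not in group]
            return {
                nt: [alpha + [new_nt]] + rest,
                new_nt: [alt[len(alpha):] for alt in group],
            }
    return {nt: alternatives}

# A -> alpha beta1 | alpha beta2 | gamma
factored = left_factor_once(
    "A", [["alpha", "beta1"], ["alpha", "beta2"], ["gamma"]]
)
# Dangling-else style statement grammar
stmt = left_factor_once(
    "S", [["if", "E", "then", "S"],
          ["if", "E", "then", "S", "else", "S"],
          ["other"]]
)
```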
Predictive Parsing
• Eliminate left recursion from grammar
• Left factor the grammar
• Compute FIRST and FOLLOW
• Two variants:
– Recursive (recursive-descent parsing)
– Non-recursive (table-driven parsing)
FIRST (Revisited)
• FIRST(α) = { the set of terminals that begin all
strings derived from α }
FIRST(a) = {a} if a ∈ T
FIRST(ε) = {ε}
FIRST(A) = ∪ FIRST(α) over all productions A → α ∈ P
FIRST(X1X2…Xk) =
for each i = 1, …, k: if for all j = 1, …, i-1 : ε ∈ FIRST(Xj) then
add non-ε in FIRST(Xi) to FIRST(X1X2…Xk)
if for all j = 1, …, k : ε ∈ FIRST(Xj) then
add ε to FIRST(X1X2…Xk)
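These rules can be evaluated by iterating to a fixed point. The sketch below is one possible formulation, not the only one; it assumes a grammar stored as a dictionary from nonterminal to a list of alternatives (lists of symbols, [] for ε) and uses the expression grammar E → T ER, ER → + T ER | ε, T → F TR, TR → * F TR | ε, F → ( E ) | id as input.

```python
EPS = "ε"

def first_of_string(symbols, first):
    """FIRST(X1 X2 ... Xk) from the FIRST sets of the individual symbols."""
    result = set()
    for symbol in symbols:
        result |= first[symbol] - {EPS}
        if EPS not in first[symbol]:
            return result
    result.add(EPS)  # every Xi can derive epsilon (or k = 0)
    return result

def compute_first(grammar, terminals):
    first = {t: {t} for t in terminals}
    first.update({nt: set() for nt in grammar})
    changed = True
    while changed:  # repeat until no FIRST set grows
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                new = first_of_string(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

GRAMMAR = {
    "E": [["T", "ER"]],
    "ER": [["+", "T", "ER"], []],
    "T": [["F", "TR"]],
    "TR": [["*", "F", "TR"], []],
    "F": [["(", "E", ")"], ["id"]],
}
FIRST = compute_first(GRAMMAR, {"+", "*", "(", ")", "id"})
```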
FOLLOW
• FOLLOW(A) = { the set of terminals that can
immediately follow nonterminal A }
FOLLOW(A) =
for all (B → α A β) ∈ P do
add FIRST(β)\{ε} to FOLLOW(A)
for all (B → α A β) ∈ P and ε ∈ FIRST(β) do
add FOLLOW(B) to FOLLOW(A)
for all (B → α A) ∈ P do
add FOLLOW(B) to FOLLOW(A)
if A is the start symbol S then
add $ to FOLLOW(A)
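The FOLLOW rules can likewise be run to a fixed point. In this sketch the FIRST sets of the expression grammar E → T ER, ER → + T ER | ε, T → F TR, TR → * F TR | ε, F → ( E ) | id are written out by hand as an assumption, to keep the example short; the name `compute_follow` is illustrative.

```python
EPS, END = "ε", "$"

GRAMMAR = {
    "E": [["T", "ER"]],
    "ER": [["+", "T", "ER"], []],
    "T": [["F", "TR"]],
    "TR": [["*", "F", "TR"], []],
    "F": [["(", "E", ")"], ["id"]],
}
# FIRST sets, assumed precomputed for this grammar
FIRST = {"E": {"(", "id"}, "ER": {"+", EPS}, "T": {"(", "id"},
         "TR": {"*", EPS}, "F": {"(", "id"}}
FIRST.update({t: {t} for t in ["+", "*", "(", ")", "id"]})

def first_of_string(symbols):
    result = set()
    for symbol in symbols:
        result |= FIRST[symbol] - {EPS}
        if EPS not in FIRST[symbol]:
            return result
    result.add(EPS)
    return result

def compute_follow(grammar, start):
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for alt in alternatives:
                for i, symbol in enumerate(alt):
                    if symbol not in grammar:
                        continue  # only nonterminals have FOLLOW sets
                    beta_first = first_of_string(alt[i + 1:])
                    new = beta_first - {EPS}
                    if EPS in beta_first:  # beta is empty or derives epsilon
                        new |= follow[lhs]
                    if not new <= follow[symbol]:
                        follow[symbol] |= new
                        changed = True
    return follow

FOLLOW = compute_follow(GRAMMAR, "E")
```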
LL(1) Grammar
• A grammar G is LL(1) if it is not left recursive
and for each collection of productions
A → α1 | α2 | … | αn
for nonterminal A the following holds:
1. FIRST(αi) ∩ FIRST(αj) = ∅ for all i ≠ j
2. if αi ⇒* ε then
2.a. αj does not derive ε (αj ⇏* ε) for all i ≠ j
2.b. FIRST(αj) ∩ FOLLOW(A) = ∅ for all i ≠ j
Non-LL(1) Examples
Grammar              Not LL(1) because:
S → S a | a          Left recursive
S → a S | a          FIRST(a S) ∩ FIRST(a) ≠ ∅
S → a R | ε          For R: S ⇒* ε and ε ⇒* ε
R → S | ε
S → a R a            For R: FIRST(S) ∩ FOLLOW(R) ≠ ∅
R → S | ε
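The pairwise LL(1) conditions can be checked mechanically once FIRST and FOLLOW are known. The sketch below takes the FIRST set of each alternative of one nonterminal and that nonterminal's FOLLOW set as given inputs (worked out by hand for the examples above); it does not test for left recursion, and the name `ll1_violations` is hypothetical.

```python
from itertools import combinations

EPS = "ε"

def ll1_violations(alt_firsts, follow_a):
    """Return (condition, alternative, alternative) triples that are violated.

    alt_firsts maps a printable alternative to its FIRST set.
    """
    found = []
    for (a, fa), (b, fb) in combinations(alt_firsts.items(), 2):
        if (fa - {EPS}) & (fb - {EPS}):
            found.append(("1", a, b))      # common first terminal
        if EPS in fa and EPS in fb:
            found.append(("2.a", a, b))    # both derive epsilon
        if EPS in fa and (fb - {EPS}) & follow_a:
            found.append(("2.b", a, b))
        if EPS in fb and (fa - {EPS}) & follow_a:
            found.append(("2.b", b, a))
    return found

# S -> a S | a, with FOLLOW(S) = {$}
case2 = ll1_violations({"a S": {"a"}, "a": {"a"}}, {"$"})
# R -> S | ε in S -> a R | ε: FIRST(S) = {a, ε}, FOLLOW(R) = {$}
case3 = ll1_violations({"S": {"a", EPS}, "ε": {EPS}}, {"$"})
# R -> S | ε in S -> a R a: FIRST(S) = {a}, FOLLOW(R) = {a}
case4 = ll1_violations({"S": {"a"}, "ε": {EPS}}, {"a"})
```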
Recursive-Descent Parsing (Recap)
• Grammar must be LL(1)
• Every nonterminal has one (recursive) procedure
responsible for parsing the nonterminal’s syntactic
category of input tokens
• When a nonterminal has multiple productions,
each production is implemented in a branch of a
selection statement based on input look-ahead
information
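The procedure for nonterminal rest, with productions rest → + term rest | - term rest | ε, selects its branch on the look-ahead token, where
FIRST(+ term rest) = { + }
FIRST(- term rest) = { - }
FOLLOW(rest) = { $ }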
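A recursive-descent parser built on these sets could be sketched as below. The grammar expr → term rest, rest → + term rest | - term rest | ε, term → id is an assumption consistent with the FIRST and FOLLOW sets listed; the class and method names are illustrative. Each nonterminal gets one method, and `rest` picks its branch from the look-ahead token.

```python
class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input
        self.pos = 0

    @property
    def lookahead(self):
        return self.tokens[self.pos]

    def match(self, expected):
        if self.lookahead != expected:
            raise SyntaxError(
                f"expected {expected!r}, found {self.lookahead!r} "
                f"at token {self.pos}")
        self.pos += 1

    def expr(self):                 # expr -> term rest
        self.term()
        self.rest()

    def rest(self):
        if self.lookahead == "+":   # FIRST(+ term rest) = { + }
            self.match("+")
            self.term()
            self.rest()
        elif self.lookahead == "-":  # FIRST(- term rest) = { - }
            self.match("-")
            self.term()
            self.rest()
        elif self.lookahead == "$":  # FOLLOW(rest) = { $ }: rest -> epsilon
            return
        else:
            raise SyntaxError(
                f"unexpected {self.lookahead!r} at token {self.pos}")

    def term(self):                 # term -> id
        self.match("id")

def parse(tokens):
    parser = Parser(tokens)
    parser.expr()
    parser.match("$")
    return True
```

For instance, `parse(["id", "+", "id", "-", "id"])` succeeds, while `parse(["id", "id"])` raises `SyntaxError` at the second token.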
Non-Recursive Predictive Parsing: Table-Driven Parsing
• Given an LL(1) grammar G = (N, T, P, S)
construct a table M[A,a] for A ∈ N, a ∈ T
and use a driver program with a stack
[Figure: the predictive parsing program (driver) reads the input buffer (a + b $), maintains a stack (X Y Z $), consults the parsing table M, and produces the output.]
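A minimal driver loop might look like the following sketch. The table is hard-coded for the expression grammar E → T ER, ER → + T ER | ε, T → F TR, TR → * F TR | ε, F → ( E ) | id; the dictionary layout and the function name `parse` are assumptions made for illustration. The output is the sequence of productions applied, which corresponds to a leftmost derivation.

```python
NONTERMINALS = {"E", "ER", "T", "TR", "F"}

# M[A, a] -> right-hand side of the production to apply ([] is epsilon)
TABLE = {
    ("E", "id"): ["T", "ER"], ("E", "("): ["T", "ER"],
    ("ER", "+"): ["+", "T", "ER"], ("ER", ")"): [], ("ER", "$"): [],
    ("T", "id"): ["F", "TR"], ("T", "("): ["F", "TR"],
    ("TR", "+"): [], ("TR", "*"): ["*", "F", "TR"],
    ("TR", ")"): [], ("TR", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}

def parse(tokens, start="E"):
    tokens = list(tokens) + ["$"]
    stack = ["$", start]          # top of stack is the end of the list
    pos = 0
    output = []
    while stack:
        top = stack.pop()
        a = tokens[pos]
        if top in NONTERMINALS:
            rhs = TABLE.get((top, a))
            if rhs is None:       # blank table entry
                raise SyntaxError(f"no entry M[{top}, {a}] at token {pos}")
            output.append((top, rhs))
            stack.extend(reversed(rhs))  # leftmost symbol ends up on top
        elif top == a:            # terminal (or $) matches the input
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {a!r} at token {pos}")
    return output

productions = parse(["id", "+", "id", "*", "id"])
```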
Example Table
Grammar:
E → T ER
ER → + T ER | ε
T → F TR
TR → * F TR | ε
F → ( E ) | id

A → α            FIRST(α)    FOLLOW(A)
E → T ER         ( id        $ )
ER → + T ER      +           $ )
ER → ε           ε           $ )
T → F TR         ( id        + $ )
TR → * F TR      *           + $ )
TR → ε           ε           + $ )
F → ( E )        (           * + $ )
F → id           id          * + $ )

Parsing table M:
   | id       | +           | *           | (         | )      | $
E  | E → T ER |             |             | E → T ER  |        |
ER |          | ER → + T ER |             |           | ER → ε | ER → ε
T  | T → F TR |             |             | T → F TR  |        |
TR |          | TR → ε      | TR → * F TR |           | TR → ε | TR → ε
F  | F → id   |             |             | F → ( E ) |        |
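The parsing table can be filled in from the FIRST and FOLLOW columns. The sketch below assumes the standard construction (enter A → α under every terminal in FIRST(α), and under every symbol in FOLLOW(A) when ε ∈ FIRST(α)) and takes the rows of the first table as hand-written input; `build_table` is an illustrative name.

```python
EPS = "ε"

# (A, alpha, FIRST(alpha), FOLLOW(A)) for each production; [] is epsilon
ROWS = [
    ("E",  ["T", "ER"],      {"(", "id"}, {"$", ")"}),
    ("ER", ["+", "T", "ER"], {"+"},       {"$", ")"}),
    ("ER", [],               {EPS},       {"$", ")"}),
    ("T",  ["F", "TR"],      {"(", "id"}, {"+", "$", ")"}),
    ("TR", ["*", "F", "TR"], {"*"},       {"+", "$", ")"}),
    ("TR", [],               {EPS},       {"+", "$", ")"}),
    ("F",  ["(", "E", ")"],  {"("},       {"*", "+", "$", ")"}),
    ("F",  ["id"],           {"id"},      {"*", "+", "$", ")"}),
]

def build_table(rows):
    table = {}
    for lhs, rhs, first, follow in rows:
        lookaheads = first - {EPS}
        if EPS in first:
            lookaheads |= follow
        for a in lookaheads:
            if (lhs, a) in table:
                # Two productions compete for one cell: grammar is not LL(1)
                raise ValueError(f"conflict at M[{lhs}, {a}]")
            table[(lhs, a)] = rhs
    return table

M = build_table(ROWS)
```

A cell that never receives a production stays absent from the dictionary, which plays the role of a blank (error) entry.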
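Parsing table M with synch entries added to the blank cells M[A,a] for a ∈ FOLLOW(A):
   | id       | +           | *           | (         | )      | $
E  | E → T ER |             |             | E → T ER  | synch  | synch
ER |          | ER → + T ER |             |           | ER → ε | ER → ε
T  | T → F TR | synch       |             | T → F TR  | synch  | synch
TR |          | TR → ε      | TR → * F TR |           | TR → ε | TR → ε
F  | F → id   | synch       | synch       | F → ( E ) | synch  | synch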
synch:
the driver pops the current nonterminal A and skips input until a
synch token is found, or skips input until a token in FIRST(A) is found
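One way a driver could use the synch entries is sketched below. The policy is an assumption chosen for illustration, and only one of several reasonable ones: a blank entry skips the current input token, a synch entry pops the nonterminal, and a mismatched terminal on the stack is popped as if it had been inserted. The table is the expression-grammar table with synch entries placed at FOLLOW(A).

```python
SYNCH = "synch"
FOLLOW = {"E": {"$", ")"}, "ER": {"$", ")"}, "T": {"+", "$", ")"},
          "TR": {"+", "$", ")"}, "F": {"*", "+", "$", ")"}}
TABLE = {
    ("E", "id"): ["T", "ER"], ("E", "("): ["T", "ER"],
    ("ER", "+"): ["+", "T", "ER"], ("ER", ")"): [], ("ER", "$"): [],
    ("T", "id"): ["F", "TR"], ("T", "("): ["F", "TR"],
    ("TR", "+"): [], ("TR", "*"): ["*", "F", "TR"],
    ("TR", ")"): [], ("TR", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
for nt, follow in FOLLOW.items():
    for a in follow:
        TABLE.setdefault((nt, a), SYNCH)  # fill blank cells only

def parse_with_recovery(tokens, start="E"):
    tokens = list(tokens) + ["$"]
    stack, pos, errors = ["$", start], 0, []
    while stack:
        top, a = stack[-1], tokens[pos]
        if top in FOLLOW:                     # nonterminal on top
            entry = TABLE.get((top, a))
            if entry is None and a != "$":    # blank: skip the input token
                errors.append(("skip", a, pos))
                pos += 1
            elif entry is None or entry == SYNCH:
                errors.append(("pop", top, pos))
                stack.pop()
            else:
                stack.pop()
                stack.extend(reversed(entry))
        elif top == a:                        # terminal or $ matches
            stack.pop()
            pos += 1
        elif top == "$":                      # leftover input: skip it
            errors.append(("skip", a, pos))
            pos += 1
        else:                                 # missing terminal: pop it
            errors.append(("missing", top, pos))
            stack.pop()
    return errors

errors = parse_with_recovery(["id", "*", "+", "id"])
```

On the input id * + id the driver reports one error (F is popped at the + token) and then parses the remaining input normally.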
Phrase-Level Recovery
Change input stream by inserting missing tokens
For example: id id is changed into id * id
Pro:
Can be automated
Cons:
Recovery not always intuitive
   | id       | +           | *           | (         | )      | $
E  | E → T ER |             |             | E → T ER  | synch  | synch
ER |          | ER → + T ER |             |           | ER → ε | ER → ε
T  | T → F TR | synch       |             | T → F TR  | synch  | synch
TR | insert * | TR → ε      | TR → * F TR |           | TR → ε | TR → ε
F  | F → id   | synch       | synch       | F → ( E ) | synch  | synch
insert *: after * is inserted in front of id, parsing can then continue with M[TR, *] = TR → * F TR
Error Productions
Grammar:
E → T ER
ER → + T ER | ε
T → F TR
TR → * F TR | ε
F → ( E ) | id
Add “error production”:
TR → F TR
to ignore missing *, e.g.: id id
Pro:
Powerful recovery method
Cons:
Cannot be automated
   | id        | +           | *           | (         | )      | $
E  | E → T ER  |             |             | E → T ER  | synch  | synch
ER |           | ER → + T ER |             |           | ER → ε | ER → ε
T  | T → F TR  | synch       |             | T → F TR  | synch  | synch
TR | TR → F TR | TR → ε      | TR → * F TR |           | TR → ε | TR → ε
F  | F → id    | synch       | synch       | F → ( E ) | synch  | synch
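The effect of the error production can be sketched with a small driver in which the entry M[TR, id] = TR → F TR is kept in a separate dictionary, so that applying it also records a diagnostic. The layout, the message text and the names `ERROR_ENTRIES` and `parse` are assumptions for illustration; synch entries are left out to keep the example short.

```python
NONTERMINALS = {"E", "ER", "T", "TR", "F"}
TABLE = {
    ("E", "id"): ["T", "ER"], ("E", "("): ["T", "ER"],
    ("ER", "+"): ["+", "T", "ER"], ("ER", ")"): [], ("ER", "$"): [],
    ("T", "id"): ["F", "TR"], ("T", "("): ["F", "TR"],
    ("TR", "+"): [], ("TR", "*"): ["*", "F", "TR"],
    ("TR", ")"): [], ("TR", "$"): [],
    ("F", "id"): ["id"], ("F", "("): ["(", "E", ")"],
}
# Error production TR -> F TR, entered at M[TR, id]
ERROR_ENTRIES = {("TR", "id"): ["F", "TR"]}

def parse(tokens, start="E"):
    tokens = list(tokens) + ["$"]
    stack, pos, diagnostics = ["$", start], 0, []
    while stack:
        top, a = stack.pop(), tokens[pos]
        if top in NONTERMINALS:
            key = (top, a)
            if key in ERROR_ENTRIES:
                # Parse continues as if the * were present
                diagnostics.append(f"missing * before token {pos}")
                rhs = ERROR_ENTRIES[key]
            elif key in TABLE:
                rhs = TABLE[key]
            else:
                raise SyntaxError(f"no entry M[{top}, {a}] at token {pos}")
            stack.extend(reversed(rhs))
        elif top == a:
            pos += 1
        else:
            raise SyntaxError(f"expected {top!r}, found {a!r} at token {pos}")
    return diagnostics

report = parse(["id", "id"])
```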