Lecture 4 - Syntax Analysis (1)
Introduction to syntax analysis
• The parser (syntax analyzer) receives the source code in the form of tokens from the lexical analyzer and performs syntax analysis, which creates a tree-like intermediate representation that depicts the grammatical structure of the token stream.
• Syntax analysis is also called parsing.
• It involves analyzing the structure of source code to ensure it adheres to the grammatical rules of the language.
• A typical representation is an abstract syntax tree (AST), where:
• Each interior node represents an operation
• The children of the node represent the arguments of the operation
Syntactic Analysis
• Input: sequence of tokens from scanner
• Output: abstract syntax tree
• Actually,
• parser first builds a parse tree
• AST is then built by translating the parse
tree
• the parse tree is rarely built explicitly; it is usually only implicit in, say, how the parser pushes symbols onto its stack
Introduction to syntax analysis
Parse Tree / Abstract Syntax Tree (AST)
[Figure: a parse tree and the corresponding AST for an arithmetic expression built from the tokens NUM(2), NUM(3), and NUM(4), with + as an interior node]
Another example
• Source Code
• if (x == y) { a=1; }
• Parser input (token stream):
• IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
• Parser output (AST):
• an IF-THEN node with two children: an == node over the two IDs (the condition), and an = node over ID and INT (the assignment)
Introduction to syntax analysis
Example:
For a simple expression like a + b * c, the syntax
analysis might involve:
• Tokenizing: a, +, b, *, c
• Using grammar rules to recognize that:
• an expression consists of terms and operators;
• the multiplication operator has higher precedence than the addition operator.
• Constructing an AST (see the sketch below).
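A minimal sketch of what such an AST could look like in code (Python-style, illustrative only; the node class names are assumptions, not part of the lecture):

# Hypothetical AST node classes; names are illustrative.
class Var:
    def __init__(self, name):
        self.name = name

class BinOp:
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

# AST for a + b * c: the '*' node is a child of '+',
# reflecting the higher precedence of multiplication.
ast = BinOp('+', Var('a'), BinOp('*', Var('b'), Var('c')))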
Syntax Analysis Analogy
Syntax analysis for natural languages
• Identify the function of each word
• Recognize if a sentence is grammatically
correct
• Example: I gave Ali the card.
Position of Syntax Analyzer
[Figure: the syntax analyzer sits between the lexical analyzer, from which it receives tokens, and the rest of the front end, to which it passes the syntax tree]
Context-free Grammars: Definition
Formally, a context-free grammar G is a 4-tuple G = (V, T, P, S), where:
• V is a finite set of variables (or nonterminals). These describe sets of "related" strings.
• T is a finite set of terminals (i.e., tokens).
• P is a finite set of productions, each of the form A → α, where A ∈ V is a variable and α ∈ (V ∪ T)* is a sequence of terminals and nonterminals.
• S ∈ V is the start symbol.
Context-free Grammars: An Example
A grammar for palindromic bit-strings:
G = (V, T, P, S), where:
• V = { S, B }
• T = { 0, 1 }
• P = { S → B,
        S → ε,
        S → 0 S 0,
        S → 1 S 1,
        B → 0,
        B → 1 }
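A hedged Python sketch (illustrative, not from the slides): the same grammar encoded as data, with a small expander that performs random derivations; every string it prints is a palindrome over {0, 1}. The dictionary encoding and function name are assumptions.

import random

GRAMMAR = {
    "S": [["B"], [], ["0", "S", "0"], ["1", "S", "1"]],  # [] encodes the ε-production
    "B": [["0"], ["1"]],
}

def derive(symbol="S"):
    """Randomly expand `symbol` into a terminal string of the grammar."""
    body = random.choice(GRAMMAR[symbol])
    out = []
    for x in body:
        out.append(derive(x) if x in GRAMMAR else x)   # expand nonterminals, copy terminals
    return "".join(out)

print(derive())   # e.g. '10101' -- always a palindrome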
Context-free Grammars: Terminology
• Derivation: Suppose that
• α and β are strings of grammar symbols, and
• A → γ is a production.
• Then, αAβ ⇒ αγβ ("αAβ derives αγβ").
Derivations: Example
• Grammar for palindromes: G = (V, T, P, S),
• V = { S },
• T = { 0, 1 },
• P = { S → 0S0 | 1S1 | 0 | 1 | ε }.
• A derivation of the string 10101:
• S
• ⇒ 1S1      (using S → 1S1)
• ⇒ 10S01    (using S → 0S0)
• ⇒ 10101    (using S → 1)
Leftmost and Rightmost Derivations
• A leftmost derivation is one where, at each step, the leftmost nonterminal is replaced.
• (analogous for rightmost derivation)
• Example: a grammar for arithmetic expressions:
• E → E + E | E * E | id
• Leftmost derivation:
• E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• Rightmost derivation:
• E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Context-free Grammars: Terminology
• The language of a grammar G = (V, T, P, S) is
• L(G) = { w | w ∈ T* and S ⇒* w }.
• The language of a grammar contains only strings of terminal symbols.
Parse Trees
• A parse tree is a tree representation of a derivation.
• Constructing a parse tree:
• The root is the start symbol S of the grammar.
• Given the parse tree built so far, if the next derivation step replaces a nonterminal X using a production X → X1 … Xn, then the parse tree is extended by giving the node labeled X the children X1, …, Xn (in order).
Approaches to Parsing
• Top-down parsing:
• attempts to figure out the derivation for
the input string, starting from the start
symbol.
• Bottom-up parsing:
• starting with the input string, attempts to
“derive in reverse” and end up with the
start symbol;
• forms the basis for parsers obtained from
parser-generator tools such as yacc,
bison.
Top-down Parsing
• “top-down:” starting with the start symbol
of the grammar, try to derive the input
string.
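As an illustration (not in the original slides), here is a hedged Python sketch of a recursive-descent parser, one common form of top-down parsing, for the expression grammar E → T E′, E′ → + T E′ | ε, T → F T′, T′ → * F T′ | ε, F → ( E ) | id that also appears later in the FIRST-set example; the function and token names are assumptions.

# Recursive-descent (top-down) recognizer; one function per nonterminal.
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else '$'
    def eat(t):
        nonlocal pos
        if peek() != t:
            raise SyntaxError(f"expected {t}, got {peek()}")
        pos += 1
    def E():
        T(); Eprime()
    def Eprime():
        if peek() == '+':
            eat('+'); T(); Eprime()      # E' -> + T E'
        # else: E' -> ε
    def T():
        F(); Tprime()
    def Tprime():
        if peek() == '*':
            eat('*'); F(); Tprime()      # T' -> * F T'
        # else: T' -> ε
    def F():
        if peek() == '(':
            eat('('); E(); eat(')')      # F -> ( E )
        else:
            eat('id')                    # F -> id
    E()
    eat('$')                             # the whole input must be consumed

parse(['id', '+', 'id', '*', 'id', '$'])   # accepts; raises SyntaxError on bad input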
Bottom-up Parsing
Parsing Problems: Ambiguity
• A grammar is ambiguous if some string in its language has more than one parse tree (equivalently, more than one leftmost or rightmost derivation).
• Example: with E → E + E | E * E | id, the string id + id * id has two different parse trees.
Dealing with Ambiguity
1. Transform the grammar to an equivalent
unambiguous grammar.
2. Use disambiguating rules along with the
ambiguous grammar to specify which
parse to use.
Comment: It is not possible to determine
algorithmically whether:
• Two given CFGs are equivalent;
• A given CFG is ambiguous.
Removing Ambiguity: Operators
• Basic idea: use additional nonterminals to enforce associativity and precedence:
• Use one nonterminal for each precedence level:
• E → E * E | E + E | id
• needs 2 nonterminals (2 levels of precedence).
• Modify the productions so that each nonterminal recurses in the direction of associativity, with the next ("higher precedence") nonterminal as the other operand:
• E → E + E becomes E → E + T (+ is left-associative)
Example
• Original grammar:
• E → E * E | E / E | E + E | E – E | ( E ) | id
• precedence levels: { *, / } > { +, – }
• associativity: *, /, +, – are all left-associative.
• Transformed grammar:
• E → E + T | E – T | T    (precedence level for +, –)
• T → T * F | T / F | F    (precedence level for *, /)
• F → ( E ) | id
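To see how the transformed grammar enforces precedence and associativity, consider id + id * id; it now has exactly one parse, and a leftmost derivation (added here for illustration) places the * below the +:

E ⇒ E + T ⇒ T + T ⇒ F + T ⇒ id + T ⇒ id + T * F ⇒ id + F * F ⇒ id + id * F ⇒ id + id * id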
FIRST Sets
Defn: For any string of grammar symbols α,
FIRST(α) = { a | a is a terminal and α ⇒* aβ for some β }.
If α ⇒* ε, then ε is also in FIRST(α).
Example:
E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E′) = { +, ε }
FIRST(T′) = { *, ε }
Computing FIRST Sets
Given a string of grammar symbols α:
• if α is a terminal, or α = ε, then FIRST(α) = { α }.
• if α is a nonterminal A with productions A → α1 | … | αn, then:
• FIRST(A) = FIRST(α1) ∪ … ∪ FIRST(αn).
• if α is a sequence of symbols Y1 … Yk, then:
• for i = 1 to k do:
– add each a ∈ FIRST(Yi), such that a ≠ ε, to FIRST(α);
– if ε ∉ FIRST(Yi) then break;
• if ε is in each of FIRST(Y1), …, FIRST(Yk), then add ε to FIRST(α).
Computing FIRST sets: cont’d
• For each nonterminal A in the grammar, initialize FIRST(A) = ∅.
• repeat {
•   for each nonterminal A in the grammar {
•     compute FIRST(A); /* as described previously */
•   }
• } until there is no change to any FIRST set.
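A hedged Python sketch of this fixed-point computation (not from the slides), applied to the expression grammar of the FIRST-set example; the dictionary encoding and the name EPS for ε are assumptions.

EPS = "eps"   # stands for the empty string ε
# Expression grammar from the FIRST-set example; E' and T' are written Ep, Tp.
GRAMMAR = {
    "E":  [["T", "Ep"]],
    "Ep": [["+", "T", "Ep"], [EPS]],
    "T":  [["F", "Tp"]],
    "Tp": [["*", "F", "Tp"], [EPS]],
    "F":  [["(", "E", ")"], ["id"]],
}

def first_sets(grammar):
    first = {A: set() for A in grammar}
    def first_of_seq(seq):
        out = set()
        for Y in seq:
            fy = first[Y] if Y in grammar else {Y}   # a terminal's FIRST is itself
            out |= fy - {EPS}
            if EPS not in fy:
                return out
        out.add(EPS)                                  # every Yi can derive ε
        return out
    changed = True
    while changed:                                    # iterate to a fixed point
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                new = first_of_seq(body)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

print(first_sets(GRAMMAR))
# expected: FIRST(E)=FIRST(T)=FIRST(F)={'(', 'id'}, FIRST(Ep)={'+', eps}, FIRST(Tp)={'*', eps}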
Example (FIRST Sets)
X → YZ | a
Y → b | ε
Z → c | ε
• X → a, so add a to FIRST(X).
• X → YZ and b ∈ FIRST(Y), so add b to FIRST(X).
• Y → ε, i.e. ε ∈ FIRST(Y), so add the non-ε symbols of FIRST(Z) to FIRST(X):
• ► add c to FIRST(X).
• ε ∈ FIRST(Y) and ε ∈ FIRST(Z), so add ε to FIRST(X).
• Final: FIRST(X) = { a, b, c, ε }.
FOLLOW Sets
Definition: Given a grammar G = (V, T, P, S), for any nonterminal A ∈ V:
• FOLLOW(A) = { a ∈ T | S ⇒* αAaβ for some α, β },
i.e., FOLLOW(A) contains those terminals that can appear immediately after A in something derivable from the start symbol S.
• If S ⇒* αA, then $ is also in FOLLOW(A). ($ ≡ EOF, "end of input.")
Example:
E → E + E | id
FOLLOW(E) = { +, $ }.
Computing FOLLOW Sets
Given a grammar G = (V, T, P, S):
1. add $ to FOLLOW(S);
2. repeat {
• for each production A → αBβ in P, add every non-ε symbol of FIRST(β) to FOLLOW(B);
• for each production A → αBβ in P, where ε ∈ FIRST(β), add everything in FOLLOW(A) to FOLLOW(B);
• for each production A → αB in P, add everything in FOLLOW(A) to FOLLOW(B);
} until no change to any FOLLOW set.
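A hedged Python sketch of the FOLLOW computation (not from the slides). It reuses the GRAMMAR/EPS encoding and the first_sets function from the FIRST-set sketch above, so those names are assumptions carried over from there.

def follow_sets(grammar, start, first):
    """Fixed-point FOLLOW computation; `grammar`, EPS and `first` use the
    encoding of the FIRST-set sketch above."""
    follow = {A: set() for A in grammar}
    follow[start].add("$")                        # rule 1: $ follows the start symbol
    def first_of_seq(seq):
        out = set()
        for Y in seq:
            fy = first[Y] if Y in grammar else {Y}
            out |= fy - {EPS}
            if EPS not in fy:
                return out
        out.add(EPS)
        return out
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, B in enumerate(body):
                    if B not in grammar:           # only nonterminals have FOLLOW sets
                        continue
                    beta = [s for s in body[i + 1:] if s != EPS]
                    f_beta = first_of_seq(beta)    # FIRST of what comes after B
                    new = f_beta - {EPS}
                    if EPS in f_beta:              # β can vanish: FOLLOW(A) flows into FOLLOW(B)
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow

# e.g. follow_sets(GRAMMAR, "E", first_sets(GRAMMAR))
#   -> FOLLOW(E) = FOLLOW(Ep) = {')', '$'}, FOLLOW(T) = FOLLOW(Tp) = {'+', ')', '$'}, ...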
Example (FOLLOW Sets)
X → YZ | a
Y → b | ε
Z → c | ε
• X is the start symbol: add $ to FOLLOW(X).
• X → YZ, so add everything in FOLLOW(X) to FOLLOW(Z):
• ► add $ to FOLLOW(Z).
• X → YZ, so add every non-ε symbol of FIRST(Z) to FOLLOW(Y):
• ► add c to FOLLOW(Y).
• X → YZ and ε ∈ FIRST(Z), so add everything in FOLLOW(X) to FOLLOW(Y):
• ► add $ to FOLLOW(Y).
Shift-reduce Parsing
• An instance of bottom-up parsing
• Basic idea: repeat
1. in the string being processed, find a
substring α such that A → α is a
production;
2. replace the substring α by A (i.e., reverse
a derivation step).
until we get the start symbol.
• Technical issues: Figuring out
1. which substring to replace; and
2. which production to reduce with.
Shift-reduce Parsing: Example
Grammar: S → aABe
A → Abc | b
B → d
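A possible reduction sequence for the input abbcde (added for illustration; it is consistent with the grammar above and with the discussion of handles on the next slide):

abbcde ⇒ aAbcde    (reduce the leftmost b using A → b)
       ⇒ aAde      (reduce Abc using A → Abc)
       ⇒ aABe      (reduce d using B → d)
       ⇒ S         (reduce aABe using S → aABe)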
Shift-Reduce Parsing: cont’d
• Need to choose reductions carefully:
• abbcde ⇒ aAbcde ⇒ aAbcBe ⇒ … doesn't work.
• A handle of a string s is a substring β such that:
• β matches the RHS of a rule A → β; and
• replacing β by A (the LHS of the rule) represents a step in the reverse of a rightmost derivation of s.
Shift-reduce Parsing: Implementation
• Data structures:
• a stack, its bottom marked by '$'. Initially empty.
• the input string, its right end marked by '$'. Initially w.
• Actions:
• repeat
• Shift some (≥ 0) input symbols onto the stack, until a handle β appears on top of the stack.
• Reduce β to the LHS of the appropriate production.
• until ready to accept.
Conflicts
• Can’t decide whether to shift or to reduce ―
both seem OK (“shift-reduce conflict”).
• Example: S → if E then S | if E then
S else S | …
LR Parsing
• A kind of shift-reduce parsing. An LR(k)
parser:
• scans the input L-to-R;
• produces a Rightmost derivation (in
reverse); and
• uses k tokens of lookahead.
• Advantages:
• very general and flexible, and handles a
wide class of grammars;
• efficiently implementable.
• Disadvantages:
• difficult to implement by hand (use parser-generator tools such as yacc or bison instead).
LR Parsing: Schematic
[Figure: schematic of an LR parser: a driver routine reading the input, a stack of states and grammar symbols, and the parse table (action and goto functions)]
LR Parsing: Roadmap
• LR parsing algorithm:
• parse table structure
• parsing actions
LR Parse Tables
• The parse table has two parts: the action
function and the goto function.
LR Parser Actions: shift
• Suppose:
• the parser configuration is (s0 X1 s1 … Xm sm,  ai … an), and
• action[sm, ai] = 'shift sn'.
• Effects of the shift move:
• push the next input symbol ai; and
• push the state sn.
• New configuration: (s0 X1 s1 … Xm sm ai sn,  ai+1 … an).
LR Parser Actions: reduce
• Suppose:
• the parser configuration is (s0 X1 s1 … Xm sm,  ai … an), and
• action[sm, ai] = 'reduce A → β'.
• Effects of the reduce move:
• pop n states and n grammar symbols off the stack (2n symbols total), where n = |β|;
• suppose the (newly uncovered) state on top of the stack is t, and goto[t, A] = u;
• push A, then u.
• New configuration: (s0 X1 s1 … Xm-n sm-n A u,  ai … an).
LR Parsing Algorithm
1. set ip to the start of the input string w$.
2. while TRUE do:
1. let s = the state on top of the parser stack, a = the input symbol pointed at by ip.
2. if action[s, a] == 'shift t' then: (i) push the input symbol a on the stack, then the state t; (ii) advance ip.
3. if action[s, a] == 'reduce A → β' then: (i) pop 2 * |β| symbols off the stack; (ii) suppose t is the state that now gets uncovered on the stack; (iii) push the LHS grammar symbol A and the state u = goto[t, A].
4. if action[s, a] == 'accept' then accept;
5. else signal a syntax error.
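A hedged Python sketch of this driver loop (illustrative; the encoding of the tables as the action and goto_ dictionaries is an assumption, e.g. tables built by the SLR(1) construction described later):

def lr_parse(tokens, action, goto_):
    """Generic LR driver (sketch).  `tokens` must end with '$'.
    action[(state, terminal)] is ('shift', t), ('reduce', A, n) with n = |body|,
    or ('accept',); goto_[(state, A)] gives the state entered after reducing to A."""
    stack = [0]                                # parser stack; state 0 is the start state
    ip = 0                                     # index of the current input symbol
    while True:
        s, a = stack[-1], tokens[ip]
        act = action.get((s, a))
        if act is None:
            raise SyntaxError(f"syntax error at {a!r}")
        if act[0] == 'shift':
            stack += [a, act[1]]               # push symbol a, then state t
            ip += 1                            # advance ip
        elif act[0] == 'reduce':
            _, A, n = act
            if n:                              # pop 2*n entries (n states + n symbols)
                del stack[-2 * n:]
            stack += [A, goto_[(stack[-1], A)]]   # push A, then goto[t, A]
        else:                                  # ('accept',)
            return True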
LR parsing: Viable Prefixes
• Goal: to be able to identify handles, and so produce a rightmost derivation in reverse.
• Given a configuration (s0 X1 s1 … Xm sm,  ai … an):
• X1 X2 … Xm ai … an is obtainable on a rightmost derivation.
• X1 X2 … Xm is called a viable prefix.
• The set of viable prefixes of a grammar is recognizable using a finite automaton.
• This automaton is used to recognize handles.
Viable Prefix Automata
• An LR(0) item of a grammar G is a production of G with a dot "•" somewhere in the RHS.
• Example: the rule A → a A b gives these LR(0) items:
• A → • a A b
• A → a • A b
• A → a A • b
• A → a A b •
• Intuition: 'A → α • β' denotes that:
• we've seen something derivable from α; and
• it would be legal to see something derivable from β at this point.
Overall Approach
Given a grammar G with start symbol S:
• Construct the augmented grammar by
adding a new start symbol S′ and a new
production S′ → S.
Viable Prefix NFA for LR(0) items
• Each state is labeled by an LR(0) item. The initial state is labeled S′ → • S.
• Transitions:
1. (A → α • X β)  --X-->  (A → α X • β), where X is a terminal or nonterminal.
2. (A → α • X β)  --ε-->  (X → • γ), where X is a nonterminal and X → γ is a production.
Viable Prefix NFA: Example
Grammar:
S → 0 S 1
S → ε
[Figure: the NFA of LR(0) items for this grammar]
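For reference (added here; the slide itself shows only the grammar and a figure), the LR(0) items of the augmented grammar S′ → S, S → 0 S 1, S → ε are:

S′ → • S        S′ → S •
S → • 0 S 1     S → 0 • S 1     S → 0 S • 1     S → 0 S 1 •
S → •           (the only item of the ε-production)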
Viable Prefix NFA → DFA
• Given a set of LR(0) items I, the set closure(I) is constructed as follows:
• repeat
• add every item in I to closure(I);
• if A → α • B β ∈ closure(I) and B is a nonterminal, then for each production B → γ, add the item B → • γ to closure(I);
• until no new items can be added to closure(I).
• Intuition:
• A → α • B β ∈ closure(I) means something derivable from B β is legal at this point. This means that something derivable from B (and hence from γ, for any production B → γ) is also legal here.
Viable Prefix NFA → DFA (cont'd)
• Given a set of LR(0) items I, the set goto(I, X) is defined as
• goto(I, X) = closure({ A → α X • β | A → α • X β ∈ I })
• Intuition:
• if A → α • X β ∈ I then (a) we've seen something derivable from α; and (b) something derivable from X β would be legal at this point.
• Suppose we now see something derivable from X.
• The parser should "go to" a state where (a) we've seen something derivable from α X; and (b) something derivable from β would be legal.
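A hedged Python sketch of closure and goto (illustrative, not the lecture's code); items are encoded as (lhs, body, dot) triples, and the grammar is the augmented 0 S 1 grammar from the NFA example.

# Items are tuples (A, body, dot): production A -> body with the dot before body[dot].
AUG_GRAMMAR = {
    "S'": [("S",)],                 # augmented start production S' -> S
    "S":  [("0", "S", "1"), ()],    # S -> 0 S 1 | ε  (ε encoded as the empty body)
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for (A, body, dot) in list(result):
            if dot < len(body) and body[dot] in AUG_GRAMMAR:   # dot before a nonterminal B
                B = body[dot]
                for prod in AUG_GRAMMAR[B]:
                    item = (B, prod, 0)                        # add B -> . gamma
                    if item not in result:
                        result.add(item)
                        changed = True
    return frozenset(result)

def goto(items, X):
    moved = {(A, body, dot + 1)
             for (A, body, dot) in items
             if dot < len(body) and body[dot] == X}            # advance the dot over X
    return closure(moved)

I0 = closure({("S'", ("S",), 0)})   # initial DFA state
print(goto(I0, "0"))                # the state reached after seeing '0'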
Example
[Figure: the DFA of sets of LR(0) items (the viable-prefix DFA) for the example grammar]
SLR(1) Parse Table Construction I
Given a grammar G with start symbol S:
• Construct the augmented grammar G′ with start symbol S′.
• Construct the set of states { I0, I1, …, In } of the viable-prefix DFA for the augmented grammar G′.
• Each DFA state Ii corresponds to a parser state si.
• The initial parser state s0 corresponds to the DFA state I0 obtained from the item S′ → • S.
SLR(1) Parse Table Construction II
• Parsing actions for parser state si:
• action table entries:
• if DFA state Ii contains an item A → α • a β, where a is a terminal, and goto(Ii, a) = Ij: set action[i, a] = shift j.
• if DFA state Ii contains an item A → α •, where A ≠ S′: for each b ∈ FOLLOW(A), set action[i, b] = reduce A → α.
• if DFA state Ii contains the item S′ → S •: set action[i, $] = accept.
• goto table entries:
• for each nonterminal A, if goto(Ii, A) = Ij, then goto[i, A] = j.
• any entry not defined by these steps is an error state.
SLR(1) Shortcomings
• SLR(1) parsing uses reduce actions too liberally. Because of this it fails on many reasonable grammars.
• Example (simple pointer assignments):
S → R | L = R
L → * R | id
R → L
• The SLR parse table has a state containing the items { S → L • = R, R → L • }, and FOLLOW(L) = { =, $ }:
• on input '=', the parser cannot decide between shifting '=' and reducing by R → L, i.e., a shift-reduce conflict.
Improving LR Parsing
• SLR(1) parsing weaknesses can be
addressed by incorporating lookahead into
the LR items in parser states.
• The lookahead makes it possible to
remove some “spurious” reduce actions
in the parse table.
• The LALR(1) parsers produced by bison
and yacc incorporate such lookahead
items.
Error Handling
Possible reactions to lexical and syntax errors:
• ignore the error. Unacceptable!
• crash, or quit, on first error. Unacceptable!
• continue to process the input. No code
generation.
• attempt to repair the error: transform an
erroneous program into a similar but legal
input.
• attempt to correct the error: try to guess
what the programmer meant. Not
worthwhile.
Error Reporting
• Error messages should refer to the source
program.
• prefer “line 11: X redefined” to “conflict
in hash bucket 53”
• Error messages should, as far as possible,
indicate the location and nature of the error.
• avoid “syntax error” or “illegal
character”
• Error messages should be specific.
• prefer “x not declared in function foo”
to “missing declaration”
• They should not be redundant.
Error Recovery
• Lexical errors: pass the illegal character to
the parser and let it deal with the error.
• Syntax errors: “panic mode error
recovery”
• Essential idea: skip part of the input
and pretend as though we saw
something legal, then hope to be able to
continue.
• Pop the stack until we find a state s such
that goto[s,A] is defined for some
nonterminal A.
• discard input tokens until we find some token a that can legitimately follow A (i.e., a ∈ FOLLOW(A)); then push the state goto[s, A] and resume parsing.