Lecture 4 - Syntax Analysis

Syntax Analysis

Compiler construction
Introduction to syntax analysis
• The parser (syntax analyzer) receives the source code in the form of tokens from the lexical analyzer and performs syntax analysis, which creates a tree-like intermediate representation depicting the grammatical structure of the token stream.
• Syntax analysis is also called parsing.
• It involves analyzing the structure of the source code to ensure it adheres to the grammatical rules of the language.
• A typical representation is an abstract syntax tree, where:
• Each interior node represents an operation
• The children of the node represent the arguments of the operation
Syntactic Analysis
• Input: sequence of tokens from the scanner
• Output: abstract syntax tree
• Actually:
• the parser first builds a parse tree
• the AST is then built by translating the parse tree
• the parse tree is rarely built explicitly; it is only implicit in, say, how the parser pushes symbols onto its stack
Introduction to syntax analysis
Parse Tree/Abstract Syntax Tree (AST)
• Parse Tree: A tree representation of the syntactic structure of the input based on the grammar rules.
• Abstract Syntax Tree (AST): A simplified version of the parse tree that abstracts away certain syntactic details, focusing instead on the logical structure of the program.
Example
• Source Code
• 4*(2+3)
• Parser input
• NUM(4) TIMES LPAR NUM(2) PLUS NUM(3) RPAR
• Parser output (AST):

            *
           / \
     NUM(4)   +
             / \
       NUM(2)   NUM(3)
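The AST above can be written down directly as a small data structure. The sketch below (Python, with hypothetical node names, not part of the slides) builds the tree for 4*(2+3) and evaluates it by visiting each operation node after its argument subtrees:

```python
# A minimal sketch of an AST representation; Num/BinOp are illustrative names.
from dataclasses import dataclass
from typing import Union

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str                 # '+' or '*'
    left: "Expr"
    right: "Expr"

Expr = Union[Num, BinOp]

def evaluate(node: Expr) -> int:
    """Post-order walk: evaluate the children (arguments), then apply the operation."""
    if isinstance(node, Num):
        return node.value
    lhs, rhs = evaluate(node.left), evaluate(node.right)
    return lhs + rhs if node.op == '+' else lhs * rhs

# AST for 4 * (2 + 3); note the parentheses leave no trace in the tree.
ast = BinOp('*', Num(4), BinOp('+', Num(2), Num(3)))
```

Note that the grouping imposed by the parentheses survives only as tree shape: + is a child of *, so it is evaluated first.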
Another example

• Source Code
• if (x == y) { a=1; }
• Parser input
• IF LPAR ID EQ ID RPAR LBR ID AS INT SEMI RBR
• Parser output (AST):

          IF-THEN
          /     \
        ==       =
       /  \     / \
     ID    ID  ID  INT
Introduction to syntax analysis
Example:
For a simple expression like a + b * c, the syntax analysis might involve:
• Tokenizing: a, +, b, *, c
• Using grammar rules to recognize:
An expression consists of terms and operators.
The multiplication operator has higher precedence than the addition operator.
• Constructing an AST.
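The tokenizing step can be sketched as a tiny regex-based scanner (an illustration, not the course's lexer; the token names are assumptions):

```python
# A small illustrative tokenizer showing the token stream the parser
# would receive for "a + b * c".
import re

TOKEN_SPEC = [
    ("ID",    r"[A-Za-z_]\w*"),
    ("PLUS",  r"\+"),
    ("TIMES", r"\*"),
    ("SKIP",  r"\s+"),        # whitespace: matched but not emitted
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(src):
    for m in TOKEN_RE.finditer(src):
        if m.lastgroup != "SKIP":
            yield (m.lastgroup, m.group())

tokens = list(tokenize("a + b * c"))
# tokens == [('ID','a'), ('PLUS','+'), ('ID','b'), ('TIMES','*'), ('ID','c')]
```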
Syntax Analysis Analogy
Syntax analysis for natural languages
• Identify the function of each word
• Recognize if a sentence is grammatically
correct
• Example: I gave Ali the card.
Position of Syntax Analyzer
Overview

Main Task: Take a token sequence from the scanner and verify that it is a syntactically correct program.
Secondary Tasks:
• Process declarations and set up symbol table information accordingly, in preparation for semantic analysis.
• Construct a syntax tree in preparation for intermediate code generation.
Context-free Grammars
• A context-free grammar for a language specifies the syntactic structure of programs in that language.
• Components of a grammar:
• a finite set of tokens (obtained from the scanner);
• a set of variables representing "related" sets of strings, e.g., declarations, statements, expressions;
• a set of rules that show the structure of these strings;
• an indication of the "top-level" set of strings (the start symbol).
Context-free Grammars: Definition
Formally, a context-free grammar G is a 4-tuple G = (V, T, P, S), where:
• V is a finite set of variables (or nonterminals). These describe sets of "related" strings.
• T is a finite set of terminals (i.e., tokens).
• P is a finite set of productions, each of the form
• A → α
• where A ∈ V is a variable, and α ∈ (V ∪ T)* is a sequence of terminals and nonterminals.
• S ∈ V is the start symbol.
Context-free Grammars: An Example
A grammar for palindromic bit-strings:
G = (V, T, P, S), where:
• V = { S, B }
• T = { 0, 1 }
• P = { S → B,
S → ε,
S → 0 S 0,
S → 1 S 1,
B → 0,
B → 1
}
Context-free Grammars: Terminology
• Derivation: Suppose that
• α and β are strings of grammar symbols, and
• A → γ is a production.
• Then, αAβ ⇒ αγβ ("αAβ derives αγβ").

• ⇒ : "derives in one step"
• ⇒* : "derives in 0 or more steps"
• α ⇒* α (0 steps)
• α ⇒* β if α ⇒ γ and γ ⇒* β (≥ 1 steps)
Derivations: Example
• Grammar for palindromes: G = (V, T, P, S),
• V = { S },
• T = { 0, 1 },
• P = { S → 0S0 | 1S1 | 0 | 1 | ε }.
• A derivation of the string 10101:
• S
• ⇒ 1S1 (using S → 1S1)
• ⇒ 1 0S0 1 (using S → 0S0)
• ⇒ 1 0 1 0 1 (using S → 1)
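The derivation above is literally string rewriting, and can be replayed mechanically (a sketch; the single uppercase S stands for the nonterminal, everything else is terminal):

```python
# Each derivation step rewrites the leftmost occurrence of a nonterminal
# by the right-hand side of one of its productions.
def apply(sentential_form: str, lhs: str, rhs: str) -> str:
    """Replace the leftmost occurrence of `lhs` by `rhs` (one derivation step)."""
    return sentential_form.replace(lhs, rhs, 1)

form = "S"
form = apply(form, "S", "1S1")   # S     =>  1S1
form = apply(form, "S", "0S0")   # 1S1   =>  10S01
form = apply(form, "S", "1")     # 10S01 =>  10101
```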
Leftmost and Rightmost Derivations
• A leftmost derivation is one where, at each step, the leftmost nonterminal is replaced.
• (analogous for rightmost derivation)
• Example: a grammar for arithmetic expressions:
• E → E + E | E * E | id
• Leftmost derivation:
• E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id
• Rightmost derivation:
• E ⇒ E + E ⇒ E + E * E ⇒ E + E * id ⇒ E + id * id ⇒ id + id * id
Context-free Grammars: Terminology
• The language of a grammar G = (V, T, P, S) is
• L(G) = { w | w ∈ T* and S ⇒* w }.
• The language of a grammar contains only strings of terminal symbols.

• Two grammars G1 and G2 are equivalent if
• L(G1) = L(G2).
Parse Trees
• A parse tree is a tree representation of a derivation.
• Constructing a parse tree:
• The root is the start symbol S of the grammar.
• Given a parse tree for α X β, if the next derivation step is α X β ⇒ α Y1…Yn β, then the parse tree is obtained by adding children Y1, …, Yn to the node labeled X.
Approaches to Parsing
• Top-down parsing:
• attempts to figure out the derivation for the input string, starting from the start symbol.

• Bottom-up parsing:
• starting with the input string, attempts to "derive in reverse" and end up with the start symbol;
• forms the basis for parsers obtained from parser-generator tools such as yacc and bison.
Top-down Parsing
• "top-down": starting with the start symbol of the grammar, try to derive the input string.

• Parsing process: use the current state of the parser, and the next input token, to guide the derivation process.

• Implementation: use a finite state automaton augmented with a runtime stack ("pushdown automaton").
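A minimal sketch of such a top-down parser is recursive descent: one function per nonterminal, with the runtime stack playing the role of the pushdown stack. It uses the non-left-recursive expression grammar that appears later in the FIRST-set example (E → T E′, E′ → + T E′ | ε, T → F T′, T′ → * F T′ | ε, F → ( E ) | id); the ε-productions of E′ and T′ become loops:

```python
# Recursive-descent sketch; returns the AST as nested prefix tuples.
class Parser:
    def __init__(self, tokens):
        self.toks = tokens + ["$"]     # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.toks[self.pos]

    def eat(self, tok):
        assert self.peek() == tok, f"expected {tok}, got {self.peek()}"
        self.pos += 1

    def E(self):                       # E  -> T E'
        left = self.T()
        while self.peek() == "+":      # E' -> + T E' | ε, written as a loop
            self.eat("+")
            left = ("+", left, self.T())
        return left

    def T(self):                       # T  -> F T'
        left = self.F()
        while self.peek() == "*":      # T' -> * F T' | ε, written as a loop
            self.eat("*")
            left = ("*", left, self.F())
        return left

    def F(self):                       # F  -> ( E ) | id
        if self.peek() == "(":
            self.eat("(")
            node = self.E()
            self.eat(")")
            return node
        node = self.peek()             # an identifier token
        self.eat(node)
        return node

ast = Parser(["a", "+", "b", "*", "c"]).E()
# ast == ("+", "a", ("*", "b", "c")): * binds tighter than +
```

Writing the ε-alternatives as loops also makes + and * left-associative, matching the disambiguation discussed below.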
Bottom-up Parsing
• "bottom-up": work backwards from the input string to obtain a derivation for it.

• Parsing process: use the parser state to keep track of:
• what has been seen so far, and
• given this, what the rest of the input might look like.

• Implementation: use a finite state automaton augmented with a runtime stack ("pushdown automaton").

Parsing: Top-down vs. Bottom-up
Parsing Problems: Ambiguity
• A grammar G is ambiguous if some string in L(G) has more than one parse tree.
• Equivalently: if some string in L(G) has more than one leftmost (rightmost) derivation.
• Example: The grammar
• E → E + E | E * E | id
• is ambiguous, since "id+id*id" has multiple parses.
Dealing with Ambiguity
1. Transform the grammar to an equivalent
unambiguous grammar.
2. Use disambiguating rules along with the
ambiguous grammar to specify which
parse to use.
Comment: It is not possible to determine
algorithmically whether:
• Two given CFGs are equivalent;
• A given CFG is ambiguous.

Removing Ambiguity: Operators
• Basic idea: use additional nonterminals to enforce associativity and precedence:
• Use one nonterminal for each precedence level:
• E → E * E | E + E | id
• needs 2 nonterminals (2 levels of precedence).
• Modify productions so that the "lower precedence" nonterminal recurses in the direction of associativity:
• E → E + E becomes E → E + T (+ is left-associative)
Example
• Original grammar:
• E → E * E | E / E | E + E | E – E | ( E ) | id
• precedence levels: { *, / } > { +, – }
• associativity: *, /, +, – are all left-associative.

• Transformed grammar:
• E → E + T | E – T | T (precedence level for: +, –)
• T → T * F | T / F | F (precedence level for: *, /)
• F → ( E ) | id


Bottom-up parsing: Approach
1. Preprocess the grammar to compute some
info about it.
(FIRST and FOLLOW sets)
2. Use this info to construct a pushdown
automaton for the grammar:
• the automaton uses a table (“parsing
table”) to guide its actions;
• constructing a parser amounts to
constructing this table.

FIRST Sets
Defn: For any string of grammar symbols α,
FIRST(α) = { a | a is a terminal and α ⇒* aβ for some β }.
If α ⇒* ε then ε is also in FIRST(α).
Example: E → T E′
E′ → + T E′ | ε
T → F T′
T′ → * F T′ | ε
F → ( E ) | id
FIRST(E) = FIRST(T) = FIRST(F) = { (, id }
FIRST(E′) = { +, ε }
FIRST(T′) = { *, ε }
Computing FIRST Sets
Given a sequence of grammar symbols A:
• if A is a terminal or A = ε then FIRST(A) = { A }.
• if A is a nonterminal with productions A → α1 | … | αn then:
• FIRST(A) = FIRST(α1) ∪ … ∪ FIRST(αn).
• if A is a sequence of symbols Y1 … Yk then:
• for i = 1 to k do:
– add each a ∈ FIRST(Yi), such that a ≠ ε, to FIRST(A).
– if ε ∉ FIRST(Yi) then break;
• if ε is in each of FIRST(Y1), …, FIRST(Yk), then add ε to FIRST(A).
Computing FIRST sets: cont'd
• For each nonterminal A in the grammar, initialize FIRST(A) = ∅.
• repeat {
• for each nonterminal A in the grammar {
• compute FIRST(A); /* as described previously */
• }
• } until there is no change to any FIRST set.
Example (FIRST Sets)
X → YZ | a
Y → b | ε
Z → c | ε

• X → a, so add a to FIRST(X).
• X → YZ, b ∈ FIRST(Y), so add b to FIRST(X).
• Y → ε, i.e. ε ∈ FIRST(Y), so add the non-ε symbols of FIRST(Z) to FIRST(X).
• ► add c to FIRST(X).
• ε ∈ FIRST(Y) and ε ∈ FIRST(Z), so add ε to FIRST(X).
• Final: FIRST(X) = { a, b, c, ε }.
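The fixpoint computation of FIRST sets can be sketched directly from the rules above (Python; ε is represented by the string "ε", and an empty right-hand side encodes an ε-production):

```python
# Grammar of the example: X -> YZ | a, Y -> b | ε, Z -> c | ε.
GRAMMAR = {
    "X": [["Y", "Z"], ["a"]],
    "Y": [["b"], []],          # [] is the ε-production
    "Z": [["c"], []],
}

def first_sets(grammar):
    first = {A: set() for A in grammar}    # initialize FIRST(A) = ∅

    def first_of(seq):
        """FIRST of a sequence of grammar symbols Y1 ... Yk."""
        out = set()
        for sym in seq:
            if sym not in grammar:         # terminal: FIRST = {sym}, stop
                return out | {sym}
            out |= first[sym] - {"ε"}      # add non-ε symbols of FIRST(Yi)
            if "ε" not in first[sym]:      # Yi cannot vanish: stop
                return out
        return out | {"ε"}                 # every Yi can derive ε

    changed = True
    while changed:                         # iterate to a fixpoint
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                new = first_of(rhs)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first

FIRST = first_sets(GRAMMAR)
# FIRST["X"] == {"a", "b", "c", "ε"}
```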
FOLLOW Sets
Definition: Given a grammar G = (V, T, P, S), for any nonterminal A ∈ V:
• FOLLOW(A) = { a ∈ T | S ⇒* αAaβ for some α, β }.
i.e., FOLLOW(A) contains those terminals that can appear after A in something derivable from the start symbol S.
• if S ⇒* αA then $ is also in FOLLOW(A). ($ ≡ EOF, "end of input.")
Example:
E → E + E | id
FOLLOW(E) = { +, $ }.
Computing FOLLOW Sets
Given a grammar G = (V, T, P, S):
1. add $ to FOLLOW(S);
2. repeat {
• for each production A → αBβ in P, add every non-ε symbol in FIRST(β) to FOLLOW(B).
• for each production A → αBβ in P, where ε ∈ FIRST(β), add everything in FOLLOW(A) to FOLLOW(B).
• for each production A → αB in P, add everything in FOLLOW(A) to FOLLOW(B).
} until no change to any FOLLOW set.
Example (FOLLOW Sets)
X → YZ | a
Y → b | ε
Z → c | ε
• X is the start symbol: add $ to FOLLOW(X);
• X → YZ, so add everything in FOLLOW(X) to FOLLOW(Z).
• ► add $ to FOLLOW(Z).
• X → YZ, so add every non-ε symbol in FIRST(Z) to FOLLOW(Y).
• ► add c to FOLLOW(Y).
• X → YZ and ε ∈ FIRST(Z), so add everything in FOLLOW(X) to FOLLOW(Y).
• ► add $ to FOLLOW(Y).
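The FOLLOW rules can be run to a fixpoint the same way. The sketch below is self-contained (it recomputes FIRST internally) and reproduces the example's results:

```python
# Grammar of the example: X -> YZ | a, Y -> b | ε, Z -> c | ε.
GRAMMAR = {"X": [["Y", "Z"], ["a"]], "Y": [["b"], []], "Z": [["c"], []]}
START = "X"

def fixpoint(step):
    """Run `step` until it reports no change."""
    while step():
        pass

def compute_first(grammar):
    first = {A: set() for A in grammar}
    def first_of(seq):
        out = set()
        for s in seq:
            if s not in grammar:           # terminal
                return out | {s}
            out |= first[s] - {"ε"}
            if "ε" not in first[s]:
                return out
        return out | {"ε"}                 # whole sequence can derive ε
    def step():
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                new = first_of(rhs)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
        return changed
    fixpoint(step)
    return first, first_of

def compute_follow(grammar, start):
    first, first_of = compute_first(grammar)
    follow = {A: set() for A in grammar}
    follow[start].add("$")                         # rule 1: $ follows the start symbol
    def step():
        changed = False
        for A, prods in grammar.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in grammar:           # skip terminals
                        continue
                    beta = rhs[i + 1:]
                    new = first_of(beta) - {"ε"}   # rule 2: non-ε FIRST(β)
                    if "ε" in first_of(beta):      # rules 3/4: β missing or nullable
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
        return changed
    fixpoint(step)
    return follow

FOLLOW = compute_follow(GRAMMAR, START)
# FOLLOW == {"X": {"$"}, "Y": {"c", "$"}, "Z": {"$"}}
```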
Shift-reduce Parsing
• An instance of bottom-up parsing.
• Basic idea: repeat
1. in the string being processed, find a substring α such that A → α is a production;
2. replace the substring α by A (i.e., reverse a derivation step);
until we get the start symbol.
• Technical issues: Figuring out
1. which substring to replace; and
2. which production to reduce with.
Shift-reduce Parsing: Example

Grammar: S → aABe
A → Abc | b
B → d

Input: abbcde
⇒ aAbcde (using A → b)
⇒ aAde (using A → Abc)
⇒ aABe (using B → d)
⇒ S (using S → aABe)
Shift-Reduce Parsing: cont'd
• Need to choose reductions carefully:
• abbcde ⇒ aAbcde ⇒ aAbcBe ⇒ …
• doesn't work.
• A handle of a string s is a substring β s.t.:
• β matches the RHS of a rule A → β; and
• replacing β by A (the LHS of the rule) represents a step in the reverse of a rightmost derivation of s.
Shift-reduce Parsing: Implementation
• Data Structures:
• a stack, its bottom marked by '$'. Initially empty.
• the input string, its right end marked by '$'. Initially w.
• Actions:
• repeat
• Shift some (≥ 0) symbols from the input string onto the stack, until a handle β appears on top of the stack.
• Reduce β to the LHS of the appropriate production.
• until ready to accept.
• Acceptance: when the input is empty and the stack contains only the start symbol.
Example

Grammar: S → aABe
A → Abc | b
B → d

Stack (→)   Input      Action
$           abbcde$    shift
$a          bbcde$     shift
$ab         bcde$      reduce: A → b
$aA         bcde$      shift
$aAb        cde$       shift
$aAbc       de$        reduce: A → Abc
$aA         de$        shift
$aAd        e$         reduce: B → d
$aAB        e$         shift
$aABe       $          reduce: S → aABe
$S          $          accept
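The trace above can be reproduced with a simple loop: shift one token, then reduce while a handle sits on top of the stack. Deciding *not* to reduce A → b when c comes next takes one token of lookahead; here that choice is hard-coded as a sketch, where a real LR parser would read it off its parse table:

```python
# Grammar: S -> aABe, A -> Abc | b, B -> d.  Symbols are single characters.
RULES = [("A", "Abc"), ("A", "b"), ("B", "d"), ("S", "aABe")]

def parse(inp):
    stack, pending, trace = "", list(inp), []
    while True:
        reduced = True
        while reduced:                        # reduce while a handle is on top
            reduced = False
            for lhs, rhs in RULES:
                # lookahead hack: a 'b' followed by 'c' belongs to A -> Abc,
                # so it must not be reduced by A -> b
                if (lhs, rhs) == ("A", "b") and pending[:1] == ["c"]:
                    continue
                if stack.endswith(rhs):
                    stack = stack[:-len(rhs)] + lhs
                    trace.append(f"reduce {lhs} -> {rhs}")
                    reduced = True
                    break
        if not pending:
            break
        stack += pending.pop(0)               # shift one token
        trace.append("shift")
    return stack, trace

final, trace = parse("abbcde")
# final == "S"; trace ends with "reduce S -> aABe"
```

Without the lookahead guard the parser would reduce the second b as well and get stuck, which is exactly the "choose reductions carefully" problem the slide illustrates.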
Conflicts
• Can’t decide whether to shift or to reduce ―
both seem OK (“shift-reduce conflict”).
• Example: S → if E then S | if E then
S else S | …

• Can’t decide which production to reduce


with ― several may fit (“reduce-reduce
conflict”).
• Example: Stmt → id ( args ) | Expr
• Expr → id ( args )
LR Parsing
• A kind of shift-reduce parsing. An LR(k) parser:
• scans the input L-to-R;
• produces a Rightmost derivation (in reverse); and
• uses k tokens of lookahead.
• Advantages:
• very general and flexible, and handles a wide class of grammars;
• efficiently implementable.
• Disadvantages:
• difficult to implement by hand (use parser-generator tools instead).
LR Parsing: Schematic
• The driver program is the same for all LR parsers (SLR(1), LALR(1), LR(1), …). Only the parse table changes.
• Different LR parsing algorithms involve different tradeoffs between parsing power and parse table size.
LR Parsing: the parser stack
• The parser stack holds strings of the form
• s0 X1 s1 X2 s2 … Xm sm (sm is on top)
• where the si are parser states and the Xi are grammar symbols.
• (Note: the Xi and si always come in pairs, with the state component si on top.)

• A parser configuration is a pair
• ⟨stack contents, unexpended input⟩
LR Parsing: Roadmap
• LR parsing algorithm:
• parse table structure
• parsing actions

• Parse table construction:


• viable prefix automaton
• parse table construction from this
automaton
• improving parsing power: different LR
parsing algorithms
LR Parse Tables
• The parse table has two parts: the action function and the goto function.

• At each point, the parser's next move is given by action[sm, ai], where:
• sm is the state on top of the parser stack, and
• ai is the next input token.

• The goto function is used only during reduce moves.
LR Parser Actions: shift
• Suppose:
• the parser configuration is ⟨s0 X1 s1 … Xm sm, ai … an⟩, and
• action[sm, ai] = 'shift sn'.
• Effects of shift move:
• push the next input symbol ai; and
• push the state sn.
• New configuration: ⟨s0 X1 s1 … Xm sm ai sn, ai+1 … an⟩
LR Parser Actions: reduce
• Suppose:
• the parser configuration is ⟨s0 X1 s1 … Xm sm, ai … an⟩, and
• action[sm, ai] = 'reduce A → β'.
• Effects of reduce move:
• pop n states and n grammar symbols off the stack (2n symbols total), where n = |β|.
• suppose the (newly uncovered) state on top of the stack is t, and goto[t, A] = u.
• push A, then u.
• New configuration: ⟨s0 X1 s1 … Xm−n sm−n A u, ai … an⟩
LR Parsing Algorithm
1. set ip to the start of the input string w$.
2. while TRUE do:
1. let s = state on top of the parser stack, a = input symbol pointed at by ip.
2. if action[s,a] == 'shift t' then: (i) push the input symbol a on the stack, then the state t; (ii) advance ip.
3. if action[s,a] == 'reduce A → β' then: (i) pop 2·|β| symbols off the stack; (ii) suppose t is the state that now gets uncovered on the stack; (iii) push the LHS grammar symbol A and the state u = goto[t, A].
4. if action[s,a] == 'accept' then accept;
5. else signal a syntax error.
LR parsing: Viable Prefixes
• Goal: to be able to identify handles, and so
produce a rightmost derivation in reverse.
• Given a configuration s0 X1s1 … Xmsm, ai … an:
• X1 X2 … Xm ai … an is obtainable on a rightmost
derivation.
• X1 X2 … Xm is called a viable prefix.
• The set of viable prefixes of a grammar is recognizable using a finite automaton.
• This automaton is used to recognize handles.
Viable Prefix Automata
• An LR(0) item of a grammar G is a production of G with a dot "•" somewhere in the RHS.
• Example: The rule A → a A b gives these LR(0) items:
• A → • a A b
• A → a • A b
• A → a A • b
• A → a A b •
• Intuition: 'A → α • β' denotes that:
• we've seen something derivable from α; and
• it would be legal to see something derivable from β next.
Overall Approach
Given a grammar G with start symbol S:
• Construct the augmented grammar by adding a new start symbol S′ and a new production S′ → S.

• Construct a finite state automaton whose start state is labeled by the LR(0) item S′ → • S.

• Use this automaton to construct the parsing table.
Viable Prefix NFA for LR(0) items
• Each state is labeled by an LR(0) item. The initial state is labeled S′ → • S.

• Transitions:
1. [A → α • X β] —X→ [A → α X • β], where X is a terminal or nonterminal.
2. [A → α • X β] —ε→ [X → • γ], where X is a nonterminal, and X → γ is a production.
Viable Prefix NFA: Example
Grammar:
S → 0 S 1
S → ε
Viable Prefix NFA → DFA
• Given a set of LR(0) items I, the set closure(I) is constructed as follows:
• repeat
• add every item in I to closure(I);
• if A → α • B β ∈ closure(I) and B is a nonterminal, then for each production B → γ, add the item B → • γ to closure(I).
• until no new items can be added to closure(I).

• Intuition:
• A → α • B β ∈ closure(I) means something derivable from B is legal at this point. This means that something derivable from each γ, where B → γ, is also legal.
Viable Prefix NFA → DFA (cont'd)
• Given a set of LR(0) items I, the set goto(I,X) is defined as
• goto(I, X) = closure({ A → α X • β | A → α • X β ∈ I })

• Intuition:
• if A → α • X β ∈ I then (a) we've seen something derivable from α; and (b) something derivable from X would be legal at this point.
• Suppose we now see something derivable from X.
• The parser should "go to" a state where (a) we've seen something derivable from α X; and (b) something derivable from β would be legal.
Example

• Let I0 = { S′ → • S }.
• I1 = closure(I0) = { S′ → • S, /* from I0 */
S → • 0 S 1, S → • }
• goto(I1, 0) = closure( { S → 0 • S 1 } )
= { S → 0 • S 1, S → • 0 S 1, S → • }
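closure and goto translate almost line-for-line into code. A sketch for this grammar, with items represented as (lhs, rhs, dot-position) triples:

```python
# Grammar: S' -> S, S -> 0S1 | ε.  RHSs are tuples of symbols; () is ε.
GRAMMAR = {"S'": [("S",)], "S": [("0", "S", "1"), ()]}

def closure(items):
    out = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:   # dot sits before a nonterminal B
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)           # add B -> • γ
                if item not in out:
                    out.add(item)
                    work.append(item)
    return frozenset(out)

def goto(items, X):
    # advance the dot over X in every item of the form A -> α • X β, then close
    moved = {(lhs, rhs, dot + 1)
             for lhs, rhs, dot in items
             if dot < len(rhs) and rhs[dot] == X}
    return closure(moved)

I1 = closure({("S'", ("S",), 0)})
# I1 = { S' -> •S,  S -> •0S1,  S -> • }
I2 = goto(I1, "0")
# I2 = { S -> 0•S1,  S -> •0S1,  S -> • }
```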
Viable Prefix DFA for LR(0) Items
1. Given a grammar G with start symbol S, construct the augmented grammar with new start symbol S′ and new production S′ → S.
2. C = { closure({ S′ → • S }) }; // C = a set of sets of items = set of parser states
3. repeat {
for each set of items I ∈ C {
for each grammar symbol X {
if ( goto(I,X) ≠ ∅ && goto(I,X) ∉ C ) {
// new state
add goto(I,X) to C;
}
}
}
} until no change to C;
SLR(1) Parse Table Construction I
Given a grammar G with start symbol S:
• Construct the augmented grammar G′ with start symbol S′.
• Construct the set of states {I0, I1, …, In} for the Viable Prefix DFA for the augmented grammar G′.
• Each DFA state Ii corresponds to a parser state si.
• The initial parser state s0 corresponds to the DFA state I0 obtained from the item S′ → • S.
SLR(1) Parse Table Construction II
• Parsing action for parser state si:
• action table entries:
• if DFA state Ii contains an item A → α • a β where a is a terminal, and goto(Ii, a) = Ij: set action[i, a] = shift j.
• if DFA state Ii contains an item A → α •, where A ≠ S′: for each b ∈ FOLLOW(A), set action[i, b] = reduce A → α.
• if state Ii contains the item S′ → S •: set action[i, $] = accept.
• goto table entries:
• for each nonterminal A, if goto(Ii, A) = Ij, then goto[i, A] = j.
• any entry not defined by these steps is an error state.
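Putting the pieces together, below is a compact sketch of the whole SLR(1) construction for the augmented grammar S′ → S, S → 0S1 | ε, ending with a table-driven parse. FOLLOW(S) = { 1, $ } is written out by hand for this tiny grammar:

```python
GRAMMAR = {"S'": [("S",)], "S": [("0", "S", "1"), ()]}
FOLLOW = {"S": {"1", "$"}}          # computed by hand for this grammar

def closure(items):
    out, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for prod in GRAMMAR[rhs[dot]]:
                item = (rhs[dot], prod, 0)
                if item not in out:
                    out.add(item); work.append(item)
    return frozenset(out)

def goto(I, X):
    return closure({(l, r, d + 1) for l, r, d in I if d < len(r) and r[d] == X})

# Canonical collection of LR(0) item sets (the Viable Prefix DFA states).
start = closure({("S'", ("S",), 0)})
states, work = [start], [start]
while work:
    I = work.pop()
    for X in ("0", "1", "S"):       # all grammar symbols
        J = goto(I, X)
        if J and J not in states:
            states.append(J); work.append(J)

# Fill the action/goto tables per the SLR(1) rules above.
action, goto_tab = {}, {}
for i, I in enumerate(states):
    for lhs, rhs, dot in I:
        if dot < len(rhs):                        # dot before a symbol X
            X = rhs[dot]
            j = states.index(goto(I, X))
            if X in GRAMMAR:
                goto_tab[i, X] = j                # goto entry (nonterminal)
            else:
                action[i, X] = ("shift", j)       # shift entry (terminal)
        elif lhs == "S'":
            action[i, "$"] = ("accept",)          # S' -> S •
        else:                                     # A -> α •: reduce on FOLLOW(A)
            for b in FOLLOW[lhs]:
                action[i, b] = ("reduce", lhs, rhs)

def parse(inp):
    toks = list(inp) + ["$"]
    stack = [0]                      # states only; grammar symbols are implicit
    while True:
        act = action.get((stack[-1], toks[0]))
        if act is None:
            return False             # error entry
        if act[0] == "accept":
            return True
        if act[0] == "shift":
            stack.append(act[1]); toks.pop(0)
        else:                        # reduce A -> β: pop |β| states, push goto
            _, lhs, rhs = act
            del stack[len(stack) - len(rhs):]
            stack.append(goto_tab[stack[-1], lhs])
```

Running it, parse accepts exactly the strings 0ⁿ1ⁿ, as expected for this grammar.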
SLR(1) Shortcomings
• SLR(1) parsing uses reduce actions too liberally. Because of this it fails on many reasonable grammars.
• Example (simple pointer assignments):
S → R | L = R
L → *R | id
R → L
The SLR parse table has a state { S → L • = R, R → L • }, and FOLLOW(L) = { =, $ }.
⇒ shift-reduce conflict.
Improving LR Parsing
• SLR(1) parsing weaknesses can be
addressed by incorporating lookahead into
the LR items in parser states.
• The lookahead makes it possible to
remove some “spurious” reduce actions
in the parse table.
• The LALR(1) parsers produced by bison
and yacc incorporate such lookahead
items.

• This improves parsing power, but at the cost of larger parse tables.
Error Handling
Possible reactions to lexical and syntax errors:
• ignore the error. Unacceptable!
• crash, or quit, on first error. Unacceptable!
• continue to process the input. No code
generation.
• attempt to repair the error: transform an
erroneous program into a similar but legal
input.
• attempt to correct the error: try to guess
what the programmer meant. Not
worthwhile.
Error Reporting
• Error messages should refer to the source
program.
• prefer “line 11: X redefined” to “conflict
in hash bucket 53”
• Error messages should, as far as possible,
indicate the location and nature of the error.
• avoid “syntax error” or “illegal
character”
• Error messages should be specific.
• prefer “x not declared in function foo”
to “missing declaration”
• They should not be redundant.
Error Recovery
• Lexical errors: pass the illegal character to the parser and let it deal with the error.
• Syntax errors: "panic mode error recovery"
• Essential idea: skip part of the input and pretend as though we saw something legal, then hope to be able to continue.
• Pop the stack until we find a state s such that goto[s,A] is defined for some nonterminal A.
• Discard input tokens until we find some token a that can legitimately follow A; then resume parsing.
