Lec 4

1. Introduction 2
2. Lexical analysis 31
3. LL parsing 58

What is a compiler?

Course page: http://www.cs.purdue.edu/homes/palsberg/cs352/F00/index.html

Copyright (c) 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].
Why study compilers?

Many of the same issues arise in interpreters, and along the way you end up learning a lot:
- algorithms: graph algorithms, dynamic programming
- theory: DFAs for scanning
- interesting problems
- real results

Abstract view:

    source code --> [ compiler ] --> machine code
                         |
                         v
                       errors

Each of these shapes your feelings about the correct contents of this course. A compiler is a big step up from an assembler -- it accepts higher-level notations.
A traditional compiler is split into a front end and a back end:

    source code --> [ front end ] --IR--> [ back end ] --> machine code
                         |                     |
                         v                     v
                       errors                errors

With a well-defined intermediate representation (IR), front ends for several languages (e.g., FORTRAN, C++, CLU, Smalltalk) can share back ends for several machines (target1, target2, target3).

Implications:
- must encode all language-specific knowledge in each front end
- must encode all target-specific knowledge in each back end
- m languages and n targets then need m front ends and n back ends (m + n components) rather than m x n complete compilers
- a well-defined IR simplifies retargeting
The front end:

    source code --> [ scanner ] --tokens--> [ parser ] --> IR
                         |                       |
                         v                       v
                       errors                  errors

Scanner responsibilities:
- map the character stream into tokens (words), the basic unit of syntax;
  e.g., x = x + y; becomes <id,x> = <id,x> + <id,y> ;
- report lexical errors

Parser responsibilities:
- recognize context-free syntax
- report syntax errors
- produce IR

Example grammar (sheep noise):

    <SheepNoise> ::= baa <SheepNoise>
                   | baa

This grammar defines the set of noises that a sheep makes under normal circumstances. <SheepNoise> is the start symbol S.
Example expression grammar:

    1. <goal> ::= <expr>
    2. <expr> ::= <expr> <op> <term>
    3.          | <term>
    4. <term> ::= number
    5.          | id
    6. <op>   ::= +
    7.          | -

S = <goal>
T = { number, id, +, - }
N = { <goal>, <expr>, <term>, <op> }

Given a grammar, valid sentences can be derived by repeated substitution; e.g., to derive the sentence x + 2 - y:

    <goal>  =>  <expr>
            =>  <expr> <op> <term>
            =>  <expr> <op> y
            =>  <expr> - y
            =>  <expr> <op> <term> - y
            =>  <expr> <op> 2 - y
            =>  <expr> + 2 - y
            =>  x + 2 - y

To recognize a valid sentence in some CFG, we reverse this process and build up a parse tree. The parse tree for x + 2 - y contains much bookkeeping; the abstract syntax tree keeps only the essentials:

          -
         / \
        +   <id:y>
       / \
   <id:x> <num:2>

Abstract syntax trees (ASTs) are often used as an IR between front end and back end.
The back end:

    IR --> [ instruction selection ] --> [ register allocation ] --> machine code
                    |                            |
                    v                            v
                  errors                       errors

Instruction selection responsibilities:
- produce compact, fast code
- pattern matching over the IR, often via dynamic programming
- automation has been less successful here

Register allocation responsibilities:
- manage a limited set of resources (registers)
- have each value in a register when it is used

An optimizer sits between front end and back end; its goal is to reduce runtime while preserving meaning. Typical transformations include code motion and reduction of operator strength. Modern optimizers are usually built as a set of passes (Pass 1, Pass 2, Pass 3, Pass 4), each taking IR in and producing improved IR out, reporting errors along the way.

[Figure: phases of a modern compiler -- Source Program -> Lex (Tokens) -> Parsing (Reductions) -> Semantic Analysis (Abstract Syntax) -> Translate (IR Trees, using Environments and Frames) -> Canonicalize (IR Trees) -> Instruction Selection (Assem) -> Control Flow Analysis (Flow Graph) -> Data Flow Analysis (Interference Graph) -> Register Allocation -> Code Emission (Assembly Language) -> Assembler (Machine Language) -> Linker, with Tables shared throughout.]
Phases of the compiler:

Parsing: Build a piece of abstract syntax tree for each phrase.
Semantic analysis: Determine what each phrase means; relate uses of variables to their declarations; check types.
Frame layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way.
Translate: Produce intermediate representation trees (IR trees), a notation that is independent of any particular source language or target machine.
Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for convenience of later phases.
Instruction selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions.
Control flow analysis: Analyse the sequence of instructions into a control flow graph showing all possible flows of control the program might follow when it runs.
Data flow analysis: Gather information about the flow of data through the variables of the program; e.g., liveness analysis calculates the places where each variable holds a still-needed ("live") value.
Register allocation: Choose registers for variables and temporary values; variables not simultaneously live can share the same register.

A straight-line programming language (no loops or conditionals), given by grammar productions and the abstract-syntax constructor each corresponds to:

    Stm     -> Stm ; Stm            (CompoundStm)
    Stm     -> id := Exp            (AssignStm)
    Stm     -> print ( ExpList )    (PrintStm)
    Exp     -> id                   (IdExp)
    Exp     -> num                  (NumExp)
    Exp     -> Exp Binop Exp        (OpExp)
    Exp     -> ( Stm , Exp )        (EseqExp)
    ExpList -> Exp , ExpList        (PairExpList)
    ExpList -> Exp                  (LastExpList)
    Binop   -> +                    (Plus)
    Binop   -> -                    (Minus)
    Binop   -> *                    (Times)
    Binop   -> /                    (Div)
Tree representation / Java classes for trees

Consider the straight-line program

    a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

which, when run, prints:

    8 7
    80

Its tree has a CompoundStm at the root: on the left the AssignStm for a := 5 + 3, and on the right another CompoundStm containing the AssignStm for b (whose right-hand side is an EseqExp wrapping the PrintStm over the PairExpList/LastExpList of a and a - 1, followed by the OpExp for 10 * a) and the final PrintStm for b. In Java, each grammar symbol becomes an abstract class (Stm, Exp, ExpList) and each production a concrete subclass (CompoundStm, AssignStm, PrintStm, IdExp, NumExp, OpExp, EseqExp, PairExpList, LastExpList), with the operators encoded as constants Plus, Minus, Times, Div in OpExp.
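The Java classes themselves can be written as follows — a sketch in the style of the straight-line interpreter trees described above (field names are illustrative):

```java
// Straight-line program trees: one abstract class per grammar symbol,
// one concrete subclass per production.
abstract class Stm {}
class CompoundStm extends Stm { Stm stm1, stm2;
  CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; } }
class AssignStm extends Stm { String id; Exp exp;
  AssignStm(String i, Exp e) { id = i; exp = e; } }
class PrintStm extends Stm { ExpList exps;
  PrintStm(ExpList e) { exps = e; } }

abstract class Exp {}
class IdExp extends Exp { String id; IdExp(String i) { id = i; } }
class NumExp extends Exp { int num; NumExp(int n) { num = n; } }
class OpExp extends Exp {
  static final int Plus = 1, Minus = 2, Times = 3, Div = 4;
  Exp left, right; int oper;
  OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; } }
class EseqExp extends Exp { Stm stm; Exp exp;
  EseqExp(Stm s, Exp e) { stm = s; exp = e; } }

abstract class ExpList {}
class PairExpList extends ExpList { Exp head; ExpList tail;
  PairExpList(Exp h, ExpList t) { head = h; tail = t; } }
class LastExpList extends ExpList { Exp head;
  LastExpList(Exp h) { head = h; } }

class BuildProg {
  // a := 5 + 3; b := (print(a, a-1), 10*a); print(b)
  static Stm prog() {
    return new CompoundStm(
      new AssignStm("a", new OpExp(new NumExp(5), OpExp.Plus, new NumExp(3))),
      new CompoundStm(
        new AssignStm("b",
          new EseqExp(
            new PrintStm(new PairExpList(new IdExp("a"),
              new LastExpList(new OpExp(new IdExp("a"), OpExp.Minus,
                                        new NumExp(1))))),
            new OpExp(new NumExp(10), OpExp.Times, new IdExp("a")))),
        new PrintStm(new LastExpList(new IdExp("b")))));
  }
}
```

Building the example program is then a single nested constructor expression, as in `BuildProg.prog()` above.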
Chapter 2: Lexical analysis

The scanner:

    source code --> [ scanner ] --tokens--> [ parser ] --> IR
                         |
                         v
                       errors

It maps the character stream into tokens, the basic unit of syntax.

Specifying patterns

A scanner must recognize various parts of the language's syntax. Some parts are easy, e.g., white space:

    <ws> ::= <ws> ' '
           | <ws> '\t'
           | ' '
           | '\t'

Other parts are much harder: numbers, identifiers, keywords, operators. The ones we can specify with patterns like the above are regular languages.
Operations on languages:

    Operation               Definition
    union of L and M        L ∪ M = { s | s ∈ L or s ∈ M }
      (written L ∪ M)
    concatenation of L, M   LM = { st | s ∈ L and t ∈ M }
      (written LM)
    Kleene closure of L     L* = the union of L^i over all i >= 0
      (written L*)
    positive closure of L   L+ = the union of L^i over all i >= 1
      (written L+)

Regular expressions

Notations used to describe a regular language (or a regular set) include regular expressions and regular grammars. Regular expressions (REs) over an alphabet Σ:

1. ε is a RE denoting the set { ε }
2. if a ∈ Σ, then a is a RE denoting { a }
3. if r and s are REs, denoting L(r) and L(s), then:
   - (r) is a RE denoting L(r)
   - r | s is a RE denoting L(r) ∪ L(s)
   - rs is a RE denoting L(r)L(s)
   - r* is a RE denoting L(r)*

If we adopt a precedence for the operators (closure highest, then concatenation, then alternation), many parentheses can be dropped.
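The three RE operators can be checked directly against Java's `java.util.regex`, whose syntax for |, concatenation, *, and grouping matches the notation above (a small illustrative helper, not part of the lecture):

```java
import java.util.regex.Pattern;

class ReDemo {
    // True iff the whole input is in the language denoted by the RE.
    static boolean matches(String re, String input) {
        return Pattern.matches(re, input);
    }
}
```

For instance, `ReDemo.matches("(a|b)*abb", "aababb")` holds, while `"abab"` is rejected because it does not end in abb.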
Examples:

    letter  -> ( a | b | c | ... | z | A | B | C | ... | Z )
    digit   -> ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )
    integer -> digit digit*
    real    -> ( integer | decimal ) ...

Algebraic properties of REs:

    r | s = s | r                 | is commutative
    r | (s | t) = (r | s) | t     | is associative
    (rs)t = r(st)                 concatenation is associative
    r(s | t) = rs | rt            concatenation distributes over |
    (s | t)r = sr | tr
    εr = r,  rε = r               ε is the identity for concatenation
    r** = r*                      * is idempotent
Examples:

1. a | b denotes { a, b }
2. (a | b)(a | b) denotes { aa, ab, ba, bb },
   i.e., (a | b)(a | b) = aa | ab | ba | bb
3. a* denotes { ε, a, aa, aaa, ... }
4. (a | b)* denotes the set of all strings of a's and b's (including ε)

So the RE for identifiers is:

    identifier -> letter ( letter | digit )*

with letter -> ( a | b | ... | z | A | B | ... | Z ) and digit -> ( 0 | 1 | ... | 9 ) as before.

Recognizers

From a regular expression we can construct a deterministic finite automaton (DFA). The recognizer for identifiers:

    state 0 --letter--> state 1          (start)
    state 1 --letter--> state 1
    state 1 --digit--->  state 1
    state 1 --other--->  state 2         (accept)
    state 0 --digit, other--> state 3    (error)
Code for the recognizer (sketch):

    char = next_char();
    state = 0;                        /* start state          */
    done = false;
    token_value = "";                 /* empty string         */
    while (!done) {
        class = char_class[char];
        state = next_state[class][state];
        switch (state) {
        case 1:                       /* building an id       */
            token_value += char;
            char = next_char();
            break;
        case 2:                       /* accept               */
            token_type = identifier;
            done = true;
            break;
        case 3:                       /* error                */
            token_type = error;
            done = true;
            break;
        }
    }
    return token_type;

Tables for the recognizer:

    char_class:  a..z, A..Z -> letter;  0..9 -> digit;  else -> other

    next_state:
        class    0  1  2  3
        letter   1  1  -  -
        digit    3  1  -  -
        other    3  2  -  -
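The loop and tables can be combined into a small runnable recognizer; this Java sketch mirrors the table above (the state and class encodings are the ones used here, names illustrative):

```java
// Table-driven recognizer for identifiers: letter (letter | digit)*.
// States: 0 = start, 1 = in identifier, 2 = accept, 3 = error.
class IdRecognizer {
    static int charClass(char c) {        // 0 = letter, 1 = digit, 2 = other
        if (Character.isLetter(c)) return 0;
        if (Character.isDigit(c)) return 1;
        return 2;
    }
    // nextState[class][state], mirroring the table above (-1 = no move)
    static final int[][] nextState = {
        { 1, 1, -1, -1 },   // letter
        { 3, 1, -1, -1 },   // digit
        { 3, 2, -1, -1 },   // other
    };
    // True iff input starts with a valid identifier terminated by a
    // non-identifier character (which drives the DFA into the accept state).
    static boolean accepts(String input) {
        int state = 0;
        for (int i = 0; i < input.length() && state != 2 && state != 3; i++)
            state = nextState[charClass(input.charAt(i))][state];
        return state == 2;
    }
}
```

So `accepts("x1 ")` succeeds (letter, digit, then the blank drives the DFA into the accept state) while `accepts("1x ")` fails in the error state.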
Scanner generators automatically construct code from regular expression-like descriptions:
- construct a DFA
- use state minimization techniques
- emit code for the scanner (lex, for example, emits C code for the scanner)

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact: For any RE r, there is a grammar g such that L(r) = L(g).

The grammars that generate regular sets are called regular grammars; their productions have the form

    1. A -> aA
    2. A -> a

where A is any non-terminal and a is any terminal symbol.
More regular languages

Example: the set of strings containing an even number of zeros and an even number of ones. A four-state DFA (s0 accepting) does the job: 1-transitions move between s0 <-> s1 and s2 <-> s3, while 0-transitions move between s0 <-> s2 and s1 <-> s3. The corresponding RE is:

    ( 00 | 11 )* ( ( 01 | 10 ) ( 00 | 11 )* ( 01 | 10 ) ( 00 | 11 )* )*

More regular expressions

What about the RE (a | b)* abb? A natural transition diagram is:

    s0 --a--> s0,  s0 --b--> s0,  s0 --a--> s1,
    s1 --b--> s2,  s2 --b--> s3 (accept)

State s0 has multiple transitions on a! This is a non-deterministic finite automaton (NFA):

    state    a           b
    s0       { s0, s1 }  { s0 }
    s1       --          { s2 }
    s2       --          { s3 }
Finite automata

An NFA consists of:

1. a set of states S = { s0, ..., sn }
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of states
4. a distinguished start state s0
5. a set of final (accepting) states F

A DFA is a special case of an NFA in which:

1. no state has an ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled a leaving s.

A DFA accepts x iff there exists a unique path through the transition graph from s0 to an accepting state such that the labels along the edges spell x.

Any NFA can be converted into a DFA by simulating sets of simultaneous states.
NFA to DFA using the subset construction: example 1

Simulate the NFA for (a | b)* abb by tracking the set of NFA states reachable after each input symbol:

    input     set of NFA states
    (start)   { s0 }
    a         { s0, s1 }
    b         { s0, s2 }
    b         { s0, s3 }  (accept)

with any other input from { s0, s1 } or { s0, s2 } falling back into { s0 } or { s0, s1 }. Each distinct set of NFA states becomes one DFA state.

Constructing a DFA from a regular expression:

    RE  --> NFA with ε moves     (Thompson's construction)
    NFA --> DFA                  (subset construction)
    DFA --> minimized DFA        (state minimization)
    DFA --> RE                   (construct the R^k_ij, for the round trip)
Thompson's construction builds the NFA for a RE from NFAs for its pieces:

- N(ε): start --ε--> accept
- N(a): start --a--> accept
- N(AB): N(A), then an ε-edge from N(A)'s accepting state into N(B)
- N(A | B): a new start state with ε-edges into N(A) and N(B), and ε-edges from their accepting states into a new accepting state
- N(A*): new start and accepting states, with ε-edges allowing zero trips (start straight to accept) or repeated trips through N(A)

Applying this to (a | b)* abb yields the NFA with states 0-10:

    0 --ε--> 1,  0 --ε--> 7
    1 --ε--> 2,  1 --ε--> 4      (the a | b alternation)
    2 --a--> 3,  4 --b--> 5
    3 --ε--> 6,  5 --ε--> 6
    6 --ε--> 1   (loop back for *),  6 --ε--> 7   (exit the *)
    7 --a--> 8,  8 --b--> 9,  9 --b--> 10 (accept)
NFA to DFA: the subset construction

Input: NFA N
Output: a DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)

Method: let s be a state in N and T be a set of states, and use:

    ε-closure(s): set of NFA states reachable from NFA state s on
                  ε-transitions alone
    ε-closure(T): set of NFA states reachable from some NFA state s in T
                  on ε-transitions alone

    add state T = ε-closure(s0) unmarked to Dstates
    while there is an unmarked state T in Dstates:
        mark T
        for each input symbol a:
            U = ε-closure(move(T, a))
            if U is not in Dstates then add U unmarked to Dstates
            Dtrans[T, a] = U
        endfor
    endwhile

NFA to DFA using subset construction: example 2

Applying this to the NFA for (a | b)* abb (states 0-10) gives the DFA states

    A = { 0, 1, 2, 4, 7 }
    B = { 1, 2, 3, 4, 6, 7, 8 }
    C = { 1, 2, 4, 5, 6, 7 }
    D = { 1, 2, 4, 5, 6, 7, 9 }
    E = { 1, 2, 4, 5, 6, 7, 10 }

and transitions

    state   a   b
    A       B   C
    B       B   D
    C       B   C
    D       B   E
    E       B   C
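The subset construction is short enough to run directly; this Java sketch encodes the Thompson NFA for (a | b)* abb (states 0-10, as above) and discovers the five DFA states A-E:

```java
import java.util.*;

// Subset construction over the (a|b)*abb NFA; symbols: 0 = a, 1 = b, -1 = ε.
class SubsetConstruction {
    static final int EPS = -1;
    static Map<Integer, Map<Integer, List<Integer>>> nfa = new HashMap<>();
    static void edge(int from, int sym, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(sym, k -> new ArrayList<>()).add(to);
    }
    static { // Thompson NFA for (a|b)*abb
        edge(0,EPS,1); edge(0,EPS,7); edge(1,EPS,2); edge(1,EPS,4);
        edge(2,0,3); edge(4,1,5); edge(3,EPS,6); edge(5,EPS,6);
        edge(6,EPS,1); edge(6,EPS,7); edge(7,0,8); edge(8,1,9); edge(9,1,10);
    }
    static Set<Integer> closure(Set<Integer> t) {         // ε-closure(T)
        Deque<Integer> work = new ArrayDeque<>(t);
        Set<Integer> out = new TreeSet<>(t);
        while (!work.isEmpty())
            for (int s : nfa.getOrDefault(work.pop(), Map.of())
                            .getOrDefault(EPS, List.of()))
                if (out.add(s)) work.push(s);
        return out;
    }
    static Set<Integer> move(Set<Integer> t, int a) {     // move(T, a)
        Set<Integer> out = new TreeSet<>();
        for (int s : t)
            out.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(a, List.of()));
        return out;
    }
    // Dstates, in discovery order: the index loop plays the role of marking.
    static List<Set<Integer>> dstates() {
        List<Set<Integer>> states = new ArrayList<>();
        states.add(closure(Set.of(0)));
        for (int i = 0; i < states.size(); i++)
            for (int a = 0; a <= 1; a++) {
                Set<Integer> u = closure(move(states.get(i), a));
                if (!u.isEmpty() && !states.contains(u)) states.add(u);
            }
        return states;
    }
}
```

Running `dstates()` yields five sets, starting with ε-closure(0) = { 0, 1, 2, 4, 7 }, matching A-E in the table.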
Limits of regular languages

Not all languages are regular; one cannot construct DFAs to recognize languages such as

    L = { p^k q^k }               (matched pairs)
    L = { wcw^r | w ∈ Σ* }        (strings reflected around a centre c)

But this is a little subtle. One can construct DFAs for, e.g., sets of pairs of 0's and 1's such as (01 | 10)*.

So what else can be hard about scanning? Lexical quirks:

reserved words
- PL/I had no reserved words

significant blanks
- FORTRAN ignores blanks

string constants
- special characters must be handled inside strings

finite closures
- some languages limit identifier lengths
- counting length adds states
- FORTRAN 66 allowed at most 6 characters
How bad can it get?

FORTRAN is the classic worst case: blanks are insignificant and keywords are not reserved, so the scanner may not know how to tokenize a statement until it has seen most of the line.
Chapter 3: LL Parsing
The role of the parser:

    source code --> [ scanner ] --tokens--> [ parser ] --> IR
                                                 |
                                                 v
                                               errors

The parser performs context-free syntax analysis, guides context-sensitive analysis, constructs an intermediate representation, and produces meaningful error messages.

Syntax analysis

A context-free grammar G = (Vt, Vn, S, P) consists of:

- Vt, the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.
- Vn, the non-terminal symbols: variables denoting sets of (sub)strings occurring in the language.
- S, the goal (start) symbol: a distinguished non-terminal denoting the language L(G).
- P, the finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
Notation and terminology

- a, b, c, ... ∈ Vt     (terminals)
- A, B, C, ... ∈ Vn     (non-terminals)
- U, V, W, ... ∈ V      (grammar symbols)
- α, β, γ, ... ∈ V*     (strings of grammar symbols)
- u, v, w, ... ∈ Vt*    (strings of terminals)

If A -> γ is a production, then αAβ => αγβ is a single derivation step using it; =>* and =>+ denote derivations of zero or more, and one or more, steps. If S =>* β, then β is said to be a sentential form of G.

L(G) = { w ∈ Vt* | S =>+ w }; w ∈ L(G) is called a sentence of G.

Note, L(G) = { β ∈ V* | S =>* β } ∩ Vt*.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF). Example:

    1. <goal> ::= <expr>
    2. <expr> ::= <expr> <op> <expr>
    3.          | num
    4.          | id
    5. <op>   ::= +
    6.          | -
    7.          | *
    8.          | /

This describes simple expressions over num and id, using BNF conventions: non-terminals in angle brackets, terminals as-is, alternatives separated by |.
term ::
*
0
9
0
+
op :: , - . / 0 1 2 3
Using our example CFG:
expr :: term op term R
goal expr
4 5 6 7
Ø S T U Ü
expr op expr
W X
Ü Ü
Ü Ü Ü
` a b c d e f g h
k l m n
Ü Ü
Ü Ü
REs are more concise and simpler for tokens than a grammar
| } ~
more efficient scanners can be built from REs (DFAs) than grammars
¡ ¢ £
P ·
63 64
Derivations

At each step of a derivation we choose a non-terminal to replace, and different choices give different derivations. Two are of particular interest:

- leftmost derivation: replace the leftmost non-terminal at each step
- rightmost derivation: replace the rightmost non-terminal at each step

Rightmost derivation for x + 2 * y:

    <goal> => <expr>
           => <expr> <op> <expr>
           => <expr> <op> <id,y>
           => <expr> * <id,y>
           => <expr> <op> <expr> * <id,y>
           => <expr> <op> <num,2> * <id,y>
           => <expr> + <num,2> * <id,y>
           => <id,x> + <num,2> * <id,y>

Again, <goal> =>* id + num * id.
Precedence

With the grammar above, a treewalk evaluation of the parse of x + 2 * y can add before it multiplies, computing (x + 2) * y. Should be x + (2 * y)! The grammar has no notion of precedence, or implied order of evaluation.

To add precedence, force the parser to recognize high-precedence subexpressions first by adding levels to the grammar:

    1. <goal>   ::= <expr>
    2. <expr>   ::= <expr> + <term>
    3.            | <expr> - <term>
    4.            | <term>
    5. <term>   ::= <term> * <factor>
    6.            | <term> / <factor>
    7.            | <factor>
    8. <factor> ::= num
    9.            | id

Now, for x + 2 * y, the rightmost derivation is:

    <goal> => <expr>
           => <expr> + <term>
           => <expr> + <term> * <factor>
           => <expr> + <term> * <id,y>
           => <expr> + <factor> * <id,y>
           => <expr> + <num,2> * <id,y>
           => <term> + <num,2> * <id,y>
           => <factor> + <num,2> * <id,y>
           => <id,x> + <num,2> * <id,y>

Again, <goal> =>* id + num * id, but this time the parse tree multiplies before it adds: x + (2 * y).
Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous. Example (the dangling else):

    <stmt> ::= if <expr> then <stmt>
             | if <expr> then <stmt> else <stmt>
             | ...

Consider deriving the sentential form

    if E1 then if E2 then S1 else S2

It has two derivations: the else can attach to either if. We may be able to eliminate ambiguities by rearranging the grammar:

    <stmt>      ::= <matched>
                  | <unmatched>
    <matched>   ::= if <expr> then <matched> else <matched>
                  | ...
    <unmatched> ::= if <expr> then <stmt>
                  | if <expr> then <matched> else <unmatched>

This generates the same language as the ambiguous grammar, but applies the common-sense rule: match each else with the closest unmatched then. This ambiguity is purely grammatical; it is a context-free ambiguity.
Ambiguity may also arise from context-sensitive confusion over the meaning of tokens (e.g., overloaded names). Rather than complicate parsing, we will handle this separately.

Parsing: the big picture

    grammar --> [ parser generator ] --> parser
    tokens  --> [ parser ] --> code (IR)

Our goal is a flexible parser generator system.
Top-down parsers start at the root of the parse tree and grow toward the leaves; bottom-up parsers start at the leaves, in a state valid for legal first tokens, and grow toward the root.

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar, and repeats:

1. At a node labelled A, select a production A -> α and construct the appropriate child for each symbol of α.
2. When a terminal is added to the fringe that doesn't match the input string, backtrack.
3. Find the next node to be expanded (must have a label in Vn).
Simple expression grammar

Recall our grammar for simple expressions:

    1. <goal>   ::= <expr>
    2. <expr>   ::= <expr> + <term>
    3.            | <expr> - <term>
    4.            | <term>
    5. <term>   ::= <term> * <factor>
    6.            | <term> / <factor>
    7.            | <factor>
    8. <factor> ::= num
    9.            | id

Consider the input string x - 2 * y.

Example

A top-down parse of x - 2 * y, showing the production applied and the sentential form, with the input position marked by ^:

    P-  goal                          ^ x - 2 * y
    1   expr                          ^ x - 2 * y
    2   expr + term                   ^ x - 2 * y
    4   term + term                   ^ x - 2 * y
    7   factor + term                 ^ x - 2 * y
    9   <id,x> + term                 ^ x - 2 * y
    -   <id,x> + term                 x ^ - 2 * y   (mismatch on +: backtrack)
    3   expr - term                   ^ x - 2 * y
    4   term - term                   ^ x - 2 * y
    7   factor - term                 ^ x - 2 * y
    9   <id,x> - term                 ^ x - 2 * y
    -   <id,x> - term                 x - ^ 2 * y
    7   <id,x> - factor               x - ^ 2 * y
    8   <id,x> - <num,2>              x - ^ 2 * y
    -   <id,x> - <num,2>              x - 2 ^ * y   (mismatch on *: backtrack)
    5   <id,x> - term * factor        x - ^ 2 * y
    7   <id,x> - factor * factor      x - ^ 2 * y
    8   <id,x> - <num,2> * factor     x - 2 * ^ y
    9   <id,x> - <num,2> * <id,y>     x - 2 * y ^   (done)

Other choices for expansion are possible; the parser must backtrack whenever it guesses wrong.
Example

Another possible parse for x - 2 * y: if the parser expands with production 2 every time,

    P-  goal                    ^ x - 2 * y
    1   expr                    ^ x - 2 * y
    2   expr + term             ^ x - 2 * y
    2   expr + term + term      ^ x - 2 * y
    2   ...                     ^ x - 2 * y

the expansion doesn't terminate: consuming no input, the parser can spin forever.

Left-recursion

Top-down parsers cannot handle left-recursion in a grammar. Formally, a grammar is left-recursive if there exists a non-terminal A such that A =>+ Aα for some string α. Our expression grammar is left-recursive, which is exactly what allows the non-terminating expansion above. For a top-down parser, any recursion must be right recursion, so we would like to convert left-recursion to right-recursion.
Eliminating left-recursion

To remove left-recursion, we can transform the grammar. Consider the grammar fragment

    <foo> ::= <foo> α
            | β

where α and β do not start with <foo>. We can rewrite this as

    <foo> ::= β <bar>
    <bar> ::= α <bar>
            | ε

where <bar> is a new non-terminal. This fragment contains no left-recursion.

Example

Our expression grammar contains two cases of left-recursion:

    <expr> ::= <expr> + <term>       <term> ::= <term> * <factor>
             | <expr> - <term>                | <term> / <factor>
             | <term>                         | <factor>

Applying the transformation gives

    <expr>  ::= <term> <expr'>       <term>  ::= <factor> <term'>
    <expr'> ::= + <term> <expr'>     <term'> ::= * <factor> <term'>
              | - <term> <expr'>               | / <factor> <term'>
              | ε                              | ε
Example

With the transformation, our expression grammar becomes:

    1.  <goal>   ::= <expr>
    2.  <expr>   ::= <term> <expr'>
    3.  <expr'>  ::= + <term> <expr'>
    4.             | - <term> <expr'>
    5.             | ε
    6.  <term>   ::= <factor> <term'>
    7.  <term'>  ::= * <factor> <term'>
    8.             | / <factor> <term'>
    9.             | ε
    10. <factor> ::= num
    11.            | id

This grammar defines the same language as the original. It is right-recursive, and a further transformation can make it free of ε productions.
How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production. Do we need arbitrary lookahead to parse CFGs?
- in general, yes: use the Earley or Cocke-Younger-Kasami algorithms
- fortunately, large subclasses of CFGs can be parsed with limited lookahead
- most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
- LL(1): left to right scan, left-most derivation, 1-token lookahead; and
- LR(1): left to right scan, right-most derivation, 1-token lookahead.

Predictive parsing

Basic idea: for any two productions A -> α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α in G, define FIRST(α) as the set of tokens that appear first in some string derived from α. Whenever two productions A -> α and A -> β both appear in the grammar, we would like

    FIRST(α) ∩ FIRST(β) = φ

This would allow the parser to make a correct choice with a lookahead of only one symbol!
Left factoring

What if a grammar does not have this property? Sometimes, we can transform it. Consider the precedence grammar:

    1. <goal>   ::= <expr>
    2. <expr>   ::= <expr> + <term>
    3.            | <expr> - <term>
    4.            | <term>
    5. <term>   ::= <term> * <factor>
    6.            | <term> / <factor>
    7.            | <factor>
    8. <factor> ::= num
    9.            | id

To choose between productions 2, 3, & 4, the parser must see past the num or id and look at the +, -, *, or /: the FIRST sets of the three alternatives are not disjoint, so this grammar fails the test.

For each non-terminal A, find the longest prefix α common to two or more of its alternatives. If α ≠ ε, then replace all of the productions

    A ::= α β1 | α β2 | ... | α βn

with

    A ::= α A'
    A' ::= β1 | β2 | ... | βn

where A' is a new non-terminal. Repeat until no two alternatives for a single non-terminal have a common prefix.
Example

There are two non-terminals that must be left factored:

    <expr> ::= <term> + <expr>       <term> ::= <factor> * <term>
             | <term> - <expr>                | <factor> / <term>
             | <term>                         | <factor>

Applying the transformation gives

    <expr>  ::= <term> <expr'>       <term>  ::= <factor> <term'>
    <expr'> ::= + <expr>             <term'> ::= * <term>
              | - <expr>                       | / <term>
              | ε                              | ε

Substituting back into the grammar yields

    1.  <goal>   ::= <expr>
    2.  <expr>   ::= <term> <expr'>
    3.  <expr'>  ::= + <expr>
    4.             | - <expr>
    5.             | ε
    6.  <term>   ::= <factor> <term'>
    7.  <term'>  ::= * <term>
    8.             | / <term>
    9.             | ε
    10. <factor> ::= num
    11.            | id
Example

Using the left-factored, right-recursive grammar, a predictive parser derives x - 2 * y with no backtracking:

    P-  goal                                     ^ x - 2 * y
    1   expr                                     ^ x - 2 * y
    2   term expr'                               ^ x - 2 * y
    6   factor term' expr'                       ^ x - 2 * y
    11  <id,x> term' expr'                       ^ x - 2 * y
    -   <id,x> term' expr'                       x ^ - 2 * y
    9   <id,x> expr'                             x ^ - 2 * y
    4   <id,x> - expr                            x ^ - 2 * y
    -   <id,x> - expr                            x - ^ 2 * y
    2   <id,x> - term expr'                      x - ^ 2 * y
    6   <id,x> - factor term' expr'              x - ^ 2 * y
    10  <id,x> - <num,2> term' expr'             x - ^ 2 * y
    -   <id,x> - <num,2> term' expr'             x - 2 ^ * y
    7   <id,x> - <num,2> * term expr'            x - 2 ^ * y
    -   <id,x> - <num,2> * term expr'            x - 2 * ^ y
    6   <id,x> - <num,2> * factor term' expr'    x - 2 * ^ y
    11  <id,x> - <num,2> * <id,y> term' expr'    x - 2 * ^ y
    -   <id,x> - <num,2> * <id,y> term' expr'    x - 2 * y ^
    9   <id,x> - <num,2> * <id,y> expr'          x - 2 * y ^
    5   <id,x> - <num,2> * <id,y>                x - 2 * y ^   (done)

The next input symbol determines each choice correctly.
Generality

Question: by left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer: given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions. Many context-free languages do not have such a grammar, e.g.:

    { a^n 0 b^n | n >= 1 } ∪ { a^n 1 b^2n | n >= 1 }

Must look past an arbitrary number of a's to discover the 0 or the 1, and so determine the derivation.

Recursive descent parsing

Now, we can produce a simple recursive descent parser from the (right-associative) grammar.
The parser code itself is a set of mutually recursive routines (one for each of goal, expr, expr', term, term', factor), each matching tokens and calling the routines for the non-terminals on its chosen right-hand side; an error routine reports failure.

To build an abstract syntax tree, we can simply insert code at the appropriate points: each routine allocates and returns the tree node for the phrase it just recognized.
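A minimal recursive descent parser for this grammar might look as follows in Java (single-character tokens and illustrative names; a real parser would take tokens from the scanner):

```java
// Recursive descent parser for:
//   expr ::= term expr'      expr' ::= + expr | - expr | ε
//   term ::= factor term'    term' ::= * term | / term | ε
//   factor ::= num | id      (here: any single letter or digit)
class RDParser {
    private final String toks;   // e.g., "x-2*y"
    private int pos = 0;
    RDParser(String toks) { this.toks = toks; }
    private char peek() { return pos < toks.length() ? toks.charAt(pos) : '$'; }

    boolean parse() { expr(); return peek() == '$'; }     // goal ::= expr
    private void expr() { term(); exprPrime(); }
    private void exprPrime() {
        if (peek() == '+' || peek() == '-') { pos++; expr(); }   // else ε
    }
    private void term() { factor(); termPrime(); }
    private void termPrime() {
        if (peek() == '*' || peek() == '/') { pos++; term(); }   // else ε
    }
    private void factor() {
        if (Character.isLetterOrDigit(peek())) pos++;            // num or id
        else throw new RuntimeException("expected num or id at " + pos);
    }
}
```

For example, `new RDParser("x-2*y").parse()` succeeds with no backtracking: each routine commits based only on the next token.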
Non-recursive predictive parsing

Observation: our recursive descent parser encodes state information in its run-time stack, or call stack. Using recursive procedure calls to implement a stack abstraction may not be particularly efficient. This suggests other implementation methods: an explicit stack, driven by parsing tables.

    stack
      |
    source code --> [ scanner ] --tokens--> [ table-driven parser ] --> IR
                                                      |
                                              parsing tables  <-- [ parser generator ] <-- grammar

Rather than writing code, we build tables; building the tables can be automated. A predictive parser driver uses a stack and a parsing table M:

    push $, then push the Start Symbol
    token = next_token()
    repeat
        X = top of stack
        if X is a terminal or $ then
            if X = token then
                pop X; token = next_token()
            else error()
        else   /* X is a non-terminal */
            if M[X, token] = X -> Y1 Y2 ... Yk then
                pop X
                push Yk, Yk-1, ..., Y1   /* so that Y1 ends up on top */
            else error()
    until X = $

This table-driven organization is used for both top-down (LL) and bottom-up (LR) parsers.
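The driver can be exercised with a deliberately tiny LL(1) grammar, S ::= ( S ) S | ε (balanced parentheses), so that the whole table fits in one function; the grammar and encoding here are illustrative, not from the lecture:

```java
import java.util.*;

class LL1Driver {
    // Parsing table M[X, token] for S ::= ( S ) S | ε, returned as the
    // RHS string to push ("" = ε production, null = error entry).
    static String prodFor(char nonterm, char lookahead) {
        if (nonterm == 'S' && lookahead == '(') return "(S)S";
        if (nonterm == 'S' && (lookahead == ')' || lookahead == '$')) return "";
        return null;
    }
    static boolean parse(String input) {
        String w = input + "$";               // $ marks end of input
        Deque<Character> stack = new ArrayDeque<>();
        stack.push('$');
        stack.push('S');                      // start symbol on top
        int i = 0;
        while (!stack.isEmpty()) {
            char x = stack.pop();
            char tok = w.charAt(i);
            if (x != 'S') {                   // terminal or $: must match input
                if (x == tok) i++; else return false;
            } else {                          // non-terminal: consult M[X, tok]
                String rhs = prodFor(x, tok);
                if (rhs == null) return false;
                for (int k = rhs.length() - 1; k >= 0; k--)
                    stack.push(rhs.charAt(k)); // push RHS so Y1 ends on top
            }
        }
        return i == w.length();
    }
}
```

`LL1Driver.parse("(())()")` accepts, while `"(()"` is rejected when the stacked ) meets end-of-input.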
Non-recursive predictive parsing

What we need now is a parsing table M. For our left-factored expression grammar (productions 1-11 above), M is indexed by non-terminal and lookahead token; entries name the production to apply (- is error):

    M          id  num   +   -   *   /   $
    goal        1    1   -   -   -   -   -
    expr        2    2   -   -   -   -   -
    expr'       -    -   3   4   -   -   5
    term        6    6   -   -   -   -   -
    term'       -    -   9   9   7   8   9
    factor     11   10   -   -   -   -   -

(We use $ to represent the end-of-file token.)

FIRST

For a string of grammar symbols α, define FIRST(α) as the set of terminal symbols that begin strings derived from α; if α =>* ε then ε ∈ FIRST(α). FIRST(α) contains the set of tokens valid in the initial position in α.

To build FIRST(X):

1. If X ∈ Vt then FIRST(X) is { X }.
2. If X ::= ε is a production, then add ε to FIRST(X).
3. If X ::= Y1 Y2 ... Yk:
   (a) Put FIRST(Y1) - { ε } in FIRST(X).
   (b) For each i, 1 < i <= k: if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Y(i-1))
       (i.e., Y1 ... Y(i-1) =>* ε), then put FIRST(Yi) - { ε } in FIRST(X).
   (c) If ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yk), then put ε in FIRST(X).

Repeat until no more additions can be made.
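The fixed-point construction of FIRST can be sketched directly; this Java version encodes the left-factored expression grammar above, with "eps" standing in for ε:

```java
import java.util.*;

class FirstSets {
    // Each production: LHS followed by its RHS symbols (LHS alone = ε production).
    static String[][] prods = {
        {"goal", "expr"},
        {"expr", "term", "expr'"},
        {"expr'", "+", "expr"}, {"expr'", "-", "expr"}, {"expr'"},
        {"term", "factor", "term'"},
        {"term'", "*", "term"}, {"term'", "/", "term"}, {"term'"},
        {"factor", "num"}, {"factor", "id"},
    };
    static Set<String> nonterms =
        Set.of("goal", "expr", "expr'", "term", "term'", "factor");

    static Map<String, Set<String>> compute() {
        Map<String, Set<String>> first = new HashMap<>();
        for (String nt : nonterms) first.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                        // repeat until no additions
            changed = false;
            for (String[] p : prods) {
                Set<String> f = first.get(p[0]);
                boolean allEps = true;           // did Y1..Y(i-1) all derive ε?
                for (int i = 1; i < p.length && allEps; i++) {
                    Set<String> fy = nonterms.contains(p[i])
                        ? first.get(p[i])
                        : Set.of(p[i]);          // rule 1: FIRST(a) = { a }
                    for (String t : fy)
                        if (!t.equals("eps")) changed |= f.add(t);
                    allEps = fy.contains("eps");
                }
                if (allEps) changed |= f.add("eps");  // rules 2 and 3(c)
            }
        }
        return first;
    }
}
```

Running it gives FIRST(expr') = { +, -, ε } and FIRST(goal) = { num, id }, matching the parse table above.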
FOLLOW

What about productions A ::= ε? FOLLOW(A) is the set of terminals that can appear immediately after A in some sentential form. Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it.

To build FOLLOW(A):

1. Put $ in FOLLOW(<goal>).
2. If A ::= αBβ:
   (a) Put FIRST(β) - { ε } in FOLLOW(B).
   (b) If β = ε (i.e., A ::= αB) or ε ∈ FIRST(β), then put FOLLOW(A) in FOLLOW(B).

Repeat until no more additions can be made.

LL(1) grammars

Revised definition: a grammar G is LL(1) iff for each set of productions A ::= α1 | α2 | ... | αn:

1. FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint, and
2. if αi =>* ε then FIRST(αj) ∩ FOLLOW(A) = φ for all 1 <= j <= n, j ≠ i.
LL(1) parse table construction

Input: grammar G
Output: parsing table M
Method: for each production A ::= α:
  (a) for each a ∈ FIRST(α), add A ::= α to M[A, a]
  (b) if ε ∈ FIRST(α):
      i.  for each b ∈ FOLLOW(A), add A ::= α to M[A, b]
      ii. if $ ∈ FOLLOW(A), add A ::= α to M[A, $]
Set each undefined entry of M to error. If any entry M[A, a] holds more than one production, then G is not LL(1).

Example: the grammar

    S ::= a S
        | a

is not LL(1): FIRST(aS) = FIRST(a) = { a }, so both productions land in M[S, a].

Note: recall a ∈ Vt, so FIRST(a) = { a }.
A grammar that is not LL(1):

    <stmt> ::= if <expr> then <stmt>
             | if <expr> then <stmt> else <stmt>
             | ...

Left-factored:

    <stmt>  ::= if <expr> then <stmt> <stmt'> | ...
    <stmt'> ::= else <stmt> | ε

Now, FIRST(<stmt'>) = { ε, else }, but also else ∈ FOLLOW(<stmt'>), so condition 2 of the LL(1) test fails: M[<stmt'>, else] holds both productions. The fix: put only the production <stmt'> ::= else <stmt> in M[<stmt'>, else]; i.e., each else matches the closest previous then.

Example: the right-recursive expression grammar

    E  ::= T E'        E' ::= + T E' | ε
    T  ::= F T'        T' ::= * F T' | ε
    F  ::= ( E ) | id

has FIRST and FOLLOW sets

    symbol   FIRST         FOLLOW
    E        { (, id }     { ), $ }
    E'       { +, ε }      { ), $ }
    T        { (, id }     { +, ), $ }
    T'       { *, ε }      { +, ), $ }
    F        { (, id }     { +, *, ), $ }

and LL(1) parse table

          id        +          *          (        )        $
    E     E::=TE'   -          -          E::=TE'  -        -
    E'    -         E'::=+TE'  -          -        E'::=ε   E'::=ε
    T     T::=FT'   -          -          T::=FT'  -        -
    T'    -         T'::=ε     T'::=*FT'  -        T'::=ε   T'::=ε
    F     F::=id    -          -          F::=(E)  -        -
Error recovery

Key notion: for each non-terminal, construct a set of terminals on which the parser can synchronize. When an error occurs looking for A, scan until an element of SYNCH(A) is found.

Building SYNCH(A):
1. a ∈ FOLLOW(A) => a ∈ SYNCH(A)
2. place keywords that start statements in SYNCH(A)
3. add symbols in FIRST(A) to SYNCH(A)

Chapter 4: LR Parsing
Goal: a bottom-up (LR) parser constructs a derivation in reverse. It starts at the leaves (the input sentence) and repeatedly reduces the right-hand side of some production to its left-hand side, until only the start symbol remains. The final result is a rightmost derivation, in reverse.
Example

Consider the grammar

    1. S ::= a A B e
    2. A ::= A b c
    3.     | b
    4. B ::= d

and the input string abbcde. A bottom-up parse runs the derivation backwards:

    a b b c d e
    a A b c d e      (reduce by 3: A ::= b)
    a A d e          (reduce by 2: A ::= A b c)
    a A B e          (reduce by 4: B ::= d)
    S                (reduce by 1: S ::= a A B e)

The trick appears to be scanning the input and finding valid sentential forms.

Handles

Informally, a handle of a right-sentential form γ is a production A ::= β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ; i.e., if S =>*rm αAw =>rm αβw then A ::= β in the position following α is a handle of αβw.

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.
Handles

Theorem: if G is unambiguous, then every right-sentential form has a unique handle.

Proof (sketch):
1. G is unambiguous => the rightmost derivation is unique
2. => a unique production A ::= β is applied at each step
3. => a unique position at which A ::= β is applied
4. => a unique handle A ::= β
Example

The left-recursive expression grammar (original form):

    1. <goal>   ::= <expr>
    2. <expr>   ::= <expr> + <term>
    3.            | <expr> - <term>
    4.            | <term>
    5. <term>   ::= <term> * <factor>
    6.            | <term> / <factor>
    7.            | <factor>
    8. <factor> ::= num
    9.            | id

A rightmost derivation of x - 2 * y:

    <goal> => <expr>
           => <expr> - <term>
           => <expr> - <term> * <factor>
           => <expr> - <term> * <id,y>
           => <expr> - <factor> * <id,y>
           => <expr> - <num,2> * <id,y>
           => <term> - <num,2> * <id,y>
           => <factor> - <num,2> * <id,y>
           => <id,x> - <num,2> * <id,y>

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning. To reverse a rightmost derivation

    S = γ0 => γ1 => γ2 => ... => γ(n-1) => γn = w

we apply the following simple algorithm, for i = n down to 1:

    1. find the handle A_i ::= β_i in γ_i
    2. replace β_i with A_i to generate γ_(i-1)
Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser. Shift-reduce parsers use a stack and an input buffer:

1. initialize the stack with $
2. Repeat until the top of the stack is the goal symbol and the input token is $:
   (a) find the handle: if we don't have a handle on top of the stack, shift an input symbol onto the stack
   (b) prune the handle: if we have a handle A ::= β on the stack, reduce -- pop |β| symbols off the stack and push A

Example: a shift-reduce parse of x - 2 * y with the left-recursive expression grammar:

    stack                          input       action
    $                              x - 2 * y   shift
    $ <id,x>                       - 2 * y     reduce 9
    $ <factor>                     - 2 * y     reduce 7
    $ <term>                       - 2 * y     reduce 4
    $ <expr>                       - 2 * y     shift
    $ <expr> -                     2 * y       shift
    $ <expr> - <num,2>             * y         reduce 8
    $ <expr> - <factor>            * y         reduce 7
    $ <expr> - <term>              * y         shift
    $ <expr> - <term> *            y           shift
    $ <expr> - <term> * <id,y>                 reduce 9
    $ <expr> - <term> * <factor>               reduce 5
    $ <expr> - <term>                          reduce 3
    $ <expr>                                   reduce 1
    $ <goal>                                   accept

1 shift per input symbol, 1 reduce per derivation step.
LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost derivation

    S = γ0 => γ1 => γ2 => ... => γn = w

we can, for each right-sentential form in the derivation, isolate the handle and determine the production by which to reduce, by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.

Critical actions of a shift-reduce parser:
1. shift -- the next token is shifted onto the stack
2. reduce -- the right end of the handle is on top of the stack; locate the left end of the handle within the stack, pop the handle off, and push the corresponding non-terminal.
Formally, a grammar G is LR(k) iff:

1. S =>*rm αAw =>rm αβw, and
2. S =>*rm γBx =>rm αβy, and
3. FIRSTk(w) = FIRSTk(y)

together imply αAy = γBx (i.e., α = γ, A = B, and x = y): the common prefix αβ plus k symbols of lookahead determine the same reduction in both derivations, for the same result. We call these parsers, with k = 1, LR(1) parsers.
Left versus right recursion

Right recursion is needed for termination in top-down parsers and yields right-associative operators; left recursion works fine bottom-up, limits required stack space, and yields left-associative operators.

Rule of thumb: right recursion for top-down parsers; left recursion for bottom-up parsers.

Parsing review

An LL(k) parser must recognize the use of a production after seeing only the first k symbols of its right-hand side. An LR(k) parser must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side with k symbols of lookahead.
The JavaCC parser generator

The JavaCC grammar can have embedded action code written in Java, just like a Yacc grammar can have embedded action code written in C.

The JavaCC input format has three parts:
- header
- token specifications for lexical analysis
- grammar

Example of a production:

    void StatementListReturn() :
    {}
    {
      ( Statement() )* "return" Expression() ";"
    }
Visitors: three approaches

Problem: how do we write code that operates on the nodes of an object structure (e.g., computing the sum of all components of a list of objects)?

First Approach: instanceof and type casts. The running code tests the dynamic type of each node and casts accordingly.
- Advantage: the code is written without touching the node classes.
- Drawback: the first approach is not object-oriented!

Second Approach: dedicated methods. Add a method for the operation to each class.
- Advantage: no type casts.
- Drawback: every new operation requires new methods in every class, and recompilation of all of them.

Third Approach: The Visitor Pattern. The Idea:
- Divide the code into an object structure and a Visitor (akin to Functional Programming!)
- Insert an accept method in each class; each accept method takes a Visitor as argument.
- A Visitor contains a visit method for each class (overloading!); the method for a class C handles objects of type C.

The accept methods invoke the method in the Visitor which can handle the current object, so the control flow goes back and forth between the accept methods in the object structure and the visit methods in the Visitor. We can now compute the sum of all components of a given object structure by writing a summing Visitor and passing it to the root's accept method.
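A minimal, self-contained instance of the pattern — an integer list with Nil/Cons and a summing Visitor (names are illustrative, not from the slides):

```java
// The Visitor interface has one (overloaded) visit method per node class.
interface Visitor { int visit(Nil n); int visit(Cons c); }

abstract class IntList { abstract int accept(Visitor v); }
class Nil extends IntList {
    int accept(Visitor v) { return v.visit(this); }   // double dispatch
}
class Cons extends IntList {
    int head; IntList tail;
    Cons(int h, IntList t) { head = h; tail = t; }
    int accept(Visitor v) { return v.visit(this); }
}

// The operation lives entirely in the Visitor, outside the node classes.
class SumVisitor implements Visitor {
    public int visit(Nil n) { return 0; }
    public int visit(Cons c) { return c.head + c.tail.accept(this); }
}
```

For the list 1, 2, 3 the call `list.accept(new SumVisitor())` bounces between accept and visit down the spine and returns 6.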
Comparison / Visitors: Summary

                            Frequent type casts?   Frequent recompilation?
    instanceof and casts    Yes                    No
    dedicated methods       No                     Yes
    the Visitor pattern     No                     No

- The Visitor pattern combines the advantages of the two other approaches.
- Requirement for using Visitors: all classes must have an accept method.
- Visitors do their job by accessing the objects they visit. As a result, the pattern often forces you to provide public operations that access internal state, which may compromise its encapsulation.
- Tools that use the Visitor pattern: JJTree (from Sun Microsystems) and the Java Tree Builder (from Purdue University), both frontends for The Java Compiler Compiler from Sun Microsystems.
The Java Tree Builder (JTB)

The Java Tree Builder (JTB) has been developed here at Purdue. JTB is a frontend for The Java Compiler Compiler: it supports the building of syntax trees which can be traversed using visitors. JTB transforms a bare JavaCC grammar into three components:
- a JavaCC grammar with embedded Java code for building a syntax tree;
- one class for every form of syntax-tree node; and
- a default visitor which can do a depth-first traversal of a syntax tree.

The produced JavaCC grammar can then be processed by the Java Compiler Compiler to give a parser which produces syntax trees. The produced syntax trees can now be traversed by a Java program by writing subclasses of the default visitor.
Example (simplified)

For a production such as

    void Assignment() :
    {}
    { PrimaryExpression() AssignmentOperator() Expression() }

JTB produces a syntax-tree class along the lines of

    public class Assignment implements Node {
      PrimaryExpression f0;
      AssignmentOperator f1;
      Expression f2;
      public void accept(Visitor v) { v.visit(this); }
    }

and the default depth-first visitor contains

    public void visit(Assignment n) {
      n.f0.accept(this);
      n.f1.accept(this);
      n.f2.accept(this);
    }

Notice the accept method; it invokes the visit method for Assignment in the Visitor. Notice the body of the visit method, which visits each of the three subtrees of the Assignment node.
Example (simplified)

Here is an example of a program which operates on syntax trees for Java 1.1 programs. The program prints the right-hand side of every assignment:

    public class VprintAssignRHS extends DepthFirstVisitor {
      public void visit(Assignment n) {
        n.f2.accept(this);
        System.out.println();
      }
    }

When this visitor is passed to the root of the syntax tree, the depth-first traversal will begin, and when Assignment nodes are reached, the method visit in VprintAssignRHS is executed.

JTB is bootstrapped.
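The mechanics that JTB automates can be seen in a tiny, self-contained sketch of the same visitor pattern. All class names here are illustrative, not JTB's generated code; the f0/f1 child fields mimic JTB's naming.

```java
// Minimal visitor pattern over a two-node syntax tree.
interface Node { void accept(Visitor v); }
interface Visitor { void visit(Num n); void visit(Add n); }

class Num implements Node {
  final int value;
  Num(int value) { this.value = value; }
  public void accept(Visitor v) { v.visit(this); }
}

class Add implements Node {
  final Node f0, f1;                       // JTB-style child fields
  Add(Node f0, Node f1) { this.f0 = f0; this.f1 = f1; }
  public void accept(Visitor v) { v.visit(this); }
}

// A depth-first visitor computing the sum of the leaves, analogous
// to subclassing JTB's default DepthFirstVisitor.
class Eval implements Visitor {
  int result;
  public void visit(Num n) { result = n.value; }
  public void visit(Add n) {
    n.f0.accept(this); int left = result;
    n.f1.accept(this); result += left;
  }
}

public class VisitorDemo {
  static int eval(Node root) {
    Eval e = new Eval();
    root.accept(e);
    return e.result;
  }
  public static void main(String[] args) {
    Node tree = new Add(new Num(1), new Add(new Num(2), new Num(3)));
    System.out.println(eval(tree));        // prints 6
  }
}
```

Each visit method decides for itself how to traverse its node's children, which is exactly the hook a JTB visitor subclass overrides.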
Semantic processing

The compilation process is driven by the syntactic structure of the program as discovered by the parser. Semantic routines:

- finish analysis by deriving context-sensitive information
- begin synthesis by generating the IR or target code
- are associated with individual productions of a context-free grammar or subtrees of a syntax tree

Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

Context-sensitive analysis

What context-sensitive questions might the compiler ask?

5. Is an expression type-consistent?
6. Does the dimension of a reference match the declaration?
7. Where can a variable be stored? (heap, stack, ...)
8. Does a pointer reference the result of a malloc()?
11. Does a function produce a constant value?

These cannot be answered with a context-free grammar.
Symbol tables

For compile-time efficiency, compilers often use a symbol table: it associates lexical names (symbols) with their attributes. Typical entries include variable names, defined constants, procedure and function names, literal constants and strings, and compiler-generated temporaries. Several alternative implementations exist (lists, trees, hash tables).

What kind of information might the compiler need?

- textual name
- data type
- dimension information (for aggregates)
- declaring procedure
- scope (lexical level of declaration)
- offset in storage

Scoping

With lexical scoping, a declaration in the innermost scope overrides declarations from outer scopes.

Key point: new declarations (usually) occur only in the current scope.

What operations do we need?

- insert a name into the current scope
- look up a name, finding its most recent declaration
- open a new scope on entry to a block or procedure; close it on exit
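The operations above can be sketched as a stack of hash tables, one per scope. This is a minimal illustration of the idea, not a prescribed design:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Scoped symbol table: a stack of maps, one map per lexical scope.
// Lookup walks from the innermost scope outward, so inner
// declarations override outer ones.
public class SymbolTable<A> {
  private final Deque<Map<String, A>> scopes = new ArrayDeque<>();

  public SymbolTable() { beginScope(); }        // global scope

  public void beginScope() { scopes.push(new HashMap<>()); }
  public void endScope()   { scopes.pop(); }

  // New declarations go only into the current (innermost) scope.
  public void insert(String name, A attrs) {
    scopes.peek().put(name, attrs);
  }

  // Most recent (innermost) declaration wins; null if undeclared.
  public A lookup(String name) {
    for (Map<String, A> s : scopes)           // head = innermost
      if (s.containsKey(name)) return s.get(name);
    return null;
  }
}
```

Opening and closing scopes is O(1); lookup cost is bounded by the nesting depth, which is why production compilers often use one hash table with per-scope undo lists instead.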
Attribute information

Attributes are the internal representation of declarations. The symbol table associates names with attributes. Names may have different attributes depending on their meaning:

- variables: type, procedure/local, frame offset
- types: type descriptor, data size/alignment
- procedures: formals (names/types), result type, block information (local decls.), frame size

Type expressions

Type expressions are a textual representation for types:

1. basic types: boolean, char, integer, real, etc.
2. type constructors:
   (a) array(T) denotes an array of elements of type T
   (b) T1 x T2 denotes the Cartesian product of type expressions T1 and T2
   (c) records: fields have names
   (d) pointer(T) denotes the type "pointer to object of type T"
   (e) D -> R denotes the type of a function mapping domain D to range R
Type descriptors

Type descriptors are compile-time structures representing type expressions, e.g., char x char -> pointer(integer). They are typically implemented as trees, or as DAGs in which identical substructure is shared.

Type compatibility

Type checking needs to determine type equivalence. Two approaches:

Name equivalence: each type name is a distinct type.

Structural equivalence: two types are equivalent iff. they have the same structure (after substituting type expressions for type names):

    s ~ t                      iff. s and t are the same basic type
    array(s1, s2) ~ array(t1, t2)   iff. s1 ~ t1 and s2 ~ t2
    s1 x s2 ~ t1 x t2               iff. s1 ~ t1 and s2 ~ t2
    pointer(s) ~ pointer(t)         iff. s ~ t
    s1 -> s2 ~ t1 -> t2             iff. s1 ~ t1 and s2 ~ t2
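The structural rules translate directly into a recursive equality test over type descriptors. A minimal sketch, with a constructor set mirroring the rules above:

```java
// Type descriptors as trees: a constructor name plus subtypes.
// Basic types ("integer", "char", ...) are leaves with no subtypes;
// "pointer", "array", "product" and "arrow" carry subtypes.
public class TypeDesc {
  final String ctor;
  final TypeDesc[] subs;

  TypeDesc(String ctor, TypeDesc... subs) {
    this.ctor = ctor;
    this.subs = subs;
  }

  // Structural equivalence: same constructor, equivalent subtypes.
  static boolean equiv(TypeDesc s, TypeDesc t) {
    if (!s.ctor.equals(t.ctor) || s.subs.length != t.subs.length)
      return false;
    for (int i = 0; i < s.subs.length; i++)
      if (!equiv(s.subs[i], t.subs[i])) return false;
    return true;
  }
}
```

Note that recursive (cyclic) types, such as a Pascal record containing a pointer to itself, would make this naive recursion loop; a real checker marks visited pairs of nodes to terminate.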
Type compatibility: example

Consider:

    type link = ^cell;
    var next : link;
        last : link;
        p    : ^cell;
        q, r : ^cell;

Under name equivalence, next and last have the same type; p, q and r have the same type; but p and next have different types.

Type compatibility: Pascal-style name equivalence

Build a compile-time structure called a type graph: each constructor or basic type creates a node, and each name creates a leaf. Type expressions are equivalent if they are represented by the same node in the graph.

Consider:

    type cell = record
        info : integer;
        next : ^cell;
    end;

The type graph for cell is a record node with children integer and pointer. Eliminating the name cell from the type graph for the record makes the pointer node refer back to the record node itself, giving a cyclic graph:

    record --- integer
        \----- pointer ---+
        ^                 |
        +-----------------+
Chapter 7: Translation and Simplification

Tiger IR trees: Expressions

    CONST(i)             integer constant i
    NAME(n)              symbolic constant n [a code label]
    TEMP(t)              temporary t [one of any number of "registers"]
    BINOP(o, e1, e2)     application of binary operator o:
                           PLUS, MINUS, MUL, DIV
                           AND, OR, XOR     [bitwise logical]
                           LSHIFT, RSHIFT   [logical shifts]
                           ARSHIFT          [arithmetic right-shift]
                         to integer operands e1 (evaluated first)
                         and e2 (evaluated second)
    MEM(e)               contents of a word of memory starting at
                         address e
    CALL(f, e1, ..., en) procedure call: the function address f is
                         evaluated before the arguments e1, ..., en
    ESEQ(s, e)           the statement s is evaluated for side
                         effects, then e is evaluated for a value
Tiger IR trees: Statements

    MOVE(TEMP t, e)      evaluate e and move its result into
                         temporary t
    MOVE(MEM(e1), e2)    evaluate e1 yielding address a; evaluate
                         e2 and store its result into memory
                         starting at a
    EXP(e)               evaluate e and discard result
    JUMP(e, l1, ..., ln) transfer control to address e; l1, ..., ln
                         are all possible values for e
    CJUMP(op, e1, e2, t, f)
                         evaluate e1 then e2, yielding a and b,
                         respectively; compare a with b using
                         relational operator op:
                           EQ, NE             [signed and unsigned integers]
                           LT, GT, LE, GE     [signed]
                           ULT, ULE, UGT, UGE [unsigned]
                         jump to t if true, f if false
    SEQ(s1, s2)          statement s1 followed by s2
    LABEL(n)             define constant value of name n as current
                         code address; NAME(n) can be used as a
                         jump target

Kinds of expressions

Expression kinds indicate how a translated Tiger expression may be used:

    Ex(exp)              expressions that compute a value
    Nx(stm)              statements: expressions that compute no value
    Cx                   conditionals (jump to true and false
                         destinations)
    RelCx(op, left, right)
    IfThenElseExp        expression/statement depending on use

Conversion operators:

    unEx                 convert to tree expression that computes
                         value of inner tree
    unNx                 convert to tree statement that computes
                         inner tree but returns no value
    unCx(t, f)           convert to tree statement that evaluates
                         inner tree and branches to true destination
                         if non-zero, false destination otherwise
Translating Tiger

Simple variables: fetch with a load:

    Ex(MEM(+(TEMP fp, CONST k)))

where fp is the home frame of the variable, found by following static links, and k is the offset of the variable in that level.

Tiger array variables: Tiger arrays are pointers to the array base, so fetch the base like any other variable:

    Ex(MEM(+(TEMP fp, CONST k)))

Thus, for e[i]:

    Ex(MEM(+(e.unEx, x(i.unEx, CONST w))))

where i is the index expression and w is the word size; in Tiger all values are word-sized.

Array creation, t[e1] of e2:

    Ex(externalCall("initArray", [e1.unEx, e2.unEx]))

Tiger record variables: again, records are pointers to the record base, so fetch the base like other variables. For field selection e.f:

    Ex(MEM(+(e.unEx, CONST o)))

where o is the byte offset of the field f in the record.

Record creation, t { f1 = e1; f2 = e2; ...; fn = en }: allocate the space with an external call, then initialize each field at its offset with a sequence of MOVEs:

    Ex(ESEQ(SEQ(MOVE(TEMP r, externalCall("allocRecord", [CONST n x w])),
                SEQ(MOVE(MEM(TEMP r), e1.unEx),
                    ...,
                    MOVE(MEM(+(TEMP r, CONST (n-1) x w)), en.unEx))),
            TEMP r))

String literals are translated as NAME(label), where the literal will be emitted into the data segment at that label.
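As an illustration, the subscript translation can be mimicked by building the tree bottom-up. This toy sketch prints IR trees as text rather than using the course's actual Tree classes; all names here are made up for the example:

```java
// Simplified IR tree construction, enough to build the subscript
// translation MEM(+(e, x(i, CONST w))) as a printable string.
public class Tr {
  static final int WORD = 4;               // assumed word size

  static String constE(int i)  { return "CONST " + i; }
  static String temp(String t) { return "TEMP " + t; }
  static String binop(String op, String a, String b) {
    return "BINOP(" + op + ", " + a + ", " + b + ")";
  }
  static String mem(String e)  { return "MEM(" + e + ")"; }

  // e[i]: fetch word i of the array whose base address is e.
  static String subscript(String e, String i) {
    return mem(binop("PLUS", e, binop("MUL", i, constE(WORD))));
  }

  public static void main(String[] args) {
    // a[j], where array variable a lives at offset -8 in the frame:
    String a = mem(binop("PLUS", temp("fp"), constE(-8)));
    System.out.println(subscript(a, temp("j")));
  }
}
```

The nesting mirrors evaluation order: the base pointer is fetched first, the scaled index is added, and the outer MEM performs the element load.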
Control structure translation

while loops

while c do s:

1. evaluate c
2. if non-zero, execute s, then evaluate c again
3. if zero, done

    test:  c.unCx(body, done)
    body:  s.unNx
           jump test
    done:
for loops

for i := e1 to e2 do s translates as:

           t1 <- e1
           t2 <- e2
           if t1 > t2 jump done
    body:  s
           t1 <- t1 + 1
           if t1 <= t2 jump body
    done:

Function calls

f(e1, ..., en) becomes:

    Ex(CALL(NAME label_f, [sl, e1.unEx, ..., en.unEx]))

where sl is the static link for the callee f, found by following n static links from the caller, n being the difference between the levels of the caller and the callee.

break statements: when translating a loop, push the done label on a stack; break simply jumps to the label on top of the stack; when done translating the loop and its body, pop the label.
Comparisons

Translate a op b as RelCx(op, a, b).

When used as a conditional, unCx(t, f) yields:

    CJUMP(op, a.unEx, b.unEx, t, f)

where t and f are labels.

When used as a value, unEx yields:

    ESEQ(SEQ(MOVE(TEMP r, CONST 1),
             SEQ(unCx(t, f),
                 SEQ(LABEL f,
                     SEQ(MOVE(TEMP r, CONST 0),
                         LABEL t)))),
         TEMP r)

Conditionals

Translate if e1 then e2 else e3 as IfThenElseExp(e1, e2, e3).

When used as a value, unEx yields:

    ESEQ(SEQ(SEQ(e1.unCx(t, f),
                 SEQ(SEQ(LABEL t,
                         SEQ(MOVE(TEMP r, e2.unEx),
                             JUMP join)),
                     SEQ(LABEL f,
                         SEQ(MOVE(TEMP r, e3.unEx),
                             JUMP join)))),
             LABEL join),
         TEMP r)

As a conditional, unCx(t, f) yields:

    SEQ(e1.unCx(tt, ff),
        SEQ(SEQ(LABEL tt, e2.unCx(t, f)),
            SEQ(LABEL ff, e3.unCx(t, f))))
Conditionals: Example

Applying unCx(t, f) to the conditional if x < 5 then a > b else 0 translates to:

    SEQ(CJUMP(LT, x.unEx, CONST 5, tt, ff),
        SEQ(SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)),
            SEQ(LABEL ff, JUMP f)))

One-dimensional fixed arrays

For a declaration such as var A : array [low..high] of integer, the reference A[e] translates to:

    MEM(+(TEMP fp, +(CONST k - low x w, x(CONST w, e.unEx))))

where k is the offset of the static array from the frame pointer fp, and w is the word size.
Multidimensional arrays

Array layout:

1. Contiguous:
   (a) Row major: rightmost subscript varies most quickly:
           A[1][1], A[1][2], ...
           A[2][1], A[2][2], ...
       Used in PL/1, Algol, Pascal, C, Ada, Modula-3.
   (b) Column major: leftmost subscript varies most quickly:
           A[1][1], A[2][1], ...
           A[1][2], A[2][2], ...
       Used in FORTRAN.
2. By vectors: a contiguous vector of pointers to (non-contiguous) subarrays.

Array allocation:

- constant bounds: allocate in static area, stack, or heap
- dynamic arrays: bounds fixed at run-time
  - allocate in stack or heap
  - a descriptor is needed
- dynamic shape: bounds can change at run-time
  - allocate in heap
  - a descriptor is needed
Multi-dimensional arrays: row-major layout

Number of elements in dimension j:

    Dj = Uj - Lj + 1

where Lj and Uj are the lower and upper bounds. The position of element A[i1, ..., in] relative to the first element is:

      (in - Ln)
    + (in-1 - Ln-1) Dn
    + (in-2 - Ln-2) Dn Dn-1
    + ...
    + (i1 - L1) D2 ... Dn

which can be rewritten as

    variable part:  i1 D2 ... Dn + i2 D3 ... Dn + ... + in-1 Dn + in
    constant part:  L1 D2 ... Dn + L2 D3 ... Dn + ... + Ln-1 Dn + Ln

so the address of A[i1, ..., in] is

    address(A) + (variable part - constant part) x element size

where the constant part can be computed once, at compile time.

case statements

case e of v1: s1 ... vn: sn

1. evaluate the expression e
2. find the value in the case list equal to the value of the expression
3. execute the statement associated with the value found
4. jump to the next statement after the case

Key issue: finding the right case:

- O(cases): linear search
- O(log2 cases): binary search
- O(1): lookup (jump) table
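The address arithmetic above, with the constant part folded out, can be sketched as follows; the bounds are illustrative, and offsets are returned in elements rather than bytes:

```java
// Row-major addressing for A[i1, ..., in] with per-dimension
// bounds L[j]..U[j]. offset() returns the element index relative
// to address(A), split into a variable part and a constant part
// exactly as in the derivation above.
public class RowMajor {
  final int[] L, D;                 // lower bounds, dimension sizes

  RowMajor(int[] lower, int[] upper) {
    L = lower;
    D = new int[lower.length];
    for (int j = 0; j < lower.length; j++)
      D[j] = upper[j] - lower[j] + 1;   // Dj = Uj - Lj + 1
  }

  // Horner-style evaluation of i1 D2...Dn + i2 D3...Dn + ... + in
  int variablePart(int[] i) {
    int v = 0;
    for (int j = 0; j < i.length; j++) v = v * D[j] + i[j];
    return v;
  }

  // The same polynomial applied to the lower bounds; a compiler
  // folds this into a single compile-time constant.
  int constantPart() { return variablePart(L); }

  int offset(int[] i) { return variablePart(i) - constantPart(); }
}
```

For a 2 x 3 array with bounds [1..2, 1..3], A[2,1] lands three elements past A[1,1], one full row later, as row-major layout requires.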
case statements (continued)

One translation, using a linear sequence of tests:

          t := expr
          jump test
    L1:   code for S1
          jump next
    L2:   code for S2
          jump next
          ...
    Ln:   code for Sn
          jump next
    test: if t = V1 jump L1
          if t = V2 jump L2
          ...
          if t = Vn jump Ln
          code to raise run-time exception
    next:

Simplification

Simplification (canonicalization) lifts ESEQs out of expressions, using rewrites such as:

    ESEQ(s1, ESEQ(s2, e))        = ESEQ(SEQ(s1, s2), e)
    BINOP(op, ESEQ(s, e1), e2)   = ESEQ(s, BINOP(op, e1, e2))
    MEM(ESEQ(s, e1))             = ESEQ(s, MEM(e1))
    JUMP(ESEQ(s, e1))            = SEQ(s, JUMP(e1))
    CJUMP(op, ESEQ(s, e1), e2, l1, l2)
                                 = SEQ(s, CJUMP(op, e1, e2, l1, l2))
    BINOP(op, e1, ESEQ(s, e2))   = ESEQ(MOVE(TEMP t, e1),
                                        ESEQ(s, BINOP(op, TEMP t, e2)))
    CJUMP(op, e1, ESEQ(s, e2), l1, l2)
                                 = SEQ(MOVE(TEMP t, e1),
                                       SEQ(s, CJUMP(op, TEMP t, e2, l1, l2)))
    MOVE(ESEQ(s, e1), e2)        = SEQ(s, MOVE(e1, e2))

The rewrites that move s leftward past e1 must first save e1 in a fresh temporary t, since s may change the value of e1.
Register allocation

Register allocation sits between instruction selection and machine code emission:

    IR -> instruction selection -> register allocation -> machine code

Register allocation:

- have each value in a register when it is used
- limited resources
- changes instruction choices
- can move loads and stores
- optimal allocation is difficult (NP-complete)

Liveness analysis

Problem:

- the IR contains an unbounded number of temporaries
- the machine has a bounded number of registers

Approach:

- temporaries with disjoint live ranges can map to the same register
- if there are not enough registers, spill some temporaries (i.e., keep them in memory)

The compiler must perform liveness analysis for each temporary: it is live if it holds a value that may still be needed.

Control flow analysis

Liveness information is computed over the control flow graph (CFG):

- nodes may be individual program statements or basic blocks
- edges represent potential flow of control
- out-edges from node n lead to successor nodes, succ[n]
- in-edges to node n come from predecessor nodes, pred[n]

Example:

        a <- 0
    L1: b <- a + 1
        c <- c + b
        a <- b x 2
        if a < N goto L1
        return c
Liveness analysis

Gathering liveness information is a form of data flow analysis operating over the CFG: the liveness of variables "flows" around the edges of the graph.

Definitions:

- def[v]: set of graph nodes that define v; def[n]: set of variables defined at node n
- use[v]: set of graph nodes that use v; use[n]: set of variables used at node n

Liveness: v is live on edge e if there is a directed path from e to a use of v that does not pass through any def[v].

v is live-in at node n if it is live on any of n's in-edges; v is live-out at n if it is live on any of n's out-edges.

Let in[n] be the variables live-in at n, and out[n] the variables live-out at n. Then:

    out[n] = union of in[s] for all s in succ[n]
    succ[n] = {}  ==>  out[n] = {}

Note:

    in[n] includes use[n]
    in[n] includes out[n] - def[n]

since a use at n makes a variable live-in, and anything live-out of n and not redefined there is also live-in. Combining:

    in[n]  = use[n] + (out[n] - def[n])
    out[n] = union of in[s] for all s in succ[n]
Iterative solution for liveness

    foreach n { in[n] <- {}; out[n] <- {} }
    repeat
      foreach n
        in'[n]  <- in[n]
        out'[n] <- out[n]
        in[n]   <- use[n] + (out[n] - def[n])
        out[n]  <- union of in[s] for all s in succ[n]
    until in'[n] = in[n] and out'[n] = out[n] for all n

Notes:

- the inner loop should be ordered to follow the "flow": liveness flows backward along control-flow arcs, from out to in
- worst-case complexity is O(N^4), for N nodes in the CFG and up to N variables; O(N) to O(N^2) in practice
- one could also do one variable at a time, tracing from its uses back to its defs, noting liveness along the way
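The iterative equations translate almost line for line into code. This sketch works over node indices and string variable names, as a simplified stand-in for a real CFG representation:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Iterative liveness: apply the two dataflow equations until the
// in/out sets stop changing (this converges to the least fixpoint).
public class Liveness {
  public static Set<String>[] solveIn(int n, List<List<Integer>> succ,
                                      Set<String>[] use, Set<String>[] def) {
    @SuppressWarnings("unchecked")
    Set<String>[] in = new Set[n], out = new Set[n];
    for (int i = 0; i < n; i++) { in[i] = new HashSet<>(); out[i] = new HashSet<>(); }
    boolean changed = true;
    while (changed) {
      changed = false;
      for (int i = n - 1; i >= 0; i--) {     // reverse order: liveness flows backward
        Set<String> newOut = new HashSet<>();
        for (int s : succ.get(i)) newOut.addAll(in[s]); // out[n] = union in[succ]
        Set<String> newIn = new HashSet<>(newOut);
        newIn.removeAll(def[i]);                        // out[n] - def[n]
        newIn.addAll(use[i]);                           // ... + use[n]
        if (!newIn.equals(in[i]) || !newOut.equals(out[i])) changed = true;
        in[i] = newIn;
        out[i] = newOut;
      }
    }
    return in;
  }
}
```

Run on the six-statement example above (treating N as a constant), the fixpoint says c is live-in at the entry node, and both a and c are live-in at L1, matching the hand computation.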
Conservative approximation

There is often more than one solution for a given dataflow problem (see example). Any solution to the dataflow equations is a conservative approximation: v in out[n] says only that v may be live-out at n; if v really is live-out (its value really may be needed), then v appears in every solution. The iterative liveness computation computes the least fixpoint: the smallest such solution.

Register allocation

Register allocation:

- have each value in a register when it is used
- limited resources
- optimal allocation is difficult (NP-complete)
Register allocation by graph coloring

Build an interference graph: nodes are temporaries; an edge joins two temporaries that are live simultaneously and so cannot share a register. A K-coloring (K = number of registers) is a valid assignment.

Simplify (Kempe's heuristic):
(a) find a node m with degree < K
(b) if G' = G - {m} can be colored, then so can G, since the nodes adjacent to m have at most K - 1 colors among them, leaving a color for m
(c) each such simplification reduces the degree of the remaining nodes, potentially enabling further simplifications

Spill: if every remaining node has significant degree (>= K):
(a) target some node (temporary) for spilling (optimistically, spilling the node will allow coloring of the remaining nodes)
(b) remove it and continue simplifying

Select: assign colors in reverse order of removal, popping nodes from a stack:
(a) if adding a non-spill node, there must be a color for it, as that was the criterion for removing it
(b) if adding a potential-spill node and no color is available (its neighbors are already K-colored), then mark it as an actual spill

Actual spills:
(a) rewrite the code to fetch the spilled temporary from memory before each use and store it after each definition
(b) recalculate liveness and repeat
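Simplify and select, without the spill and coalesce phases, fit in a few lines. This is a sketch over string-named nodes, assuming the interference graph is given as adjacency sets:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Kempe-style coloring: repeatedly remove a node of degree < K
// (simplify), then pop nodes and assign the lowest free color
// (select). Returns null if stuck (a spill would be needed).
public class Coloring {
  public static Map<String, Integer> color(Map<String, Set<String>> graph, int k) {
    Map<String, Set<String>> g = new HashMap<>();
    graph.forEach((n, adj) -> g.put(n, new HashSet<>(adj)));
    Deque<String> stack = new ArrayDeque<>();
    while (!g.isEmpty()) {
      String pick = null;
      for (String n : g.keySet())
        if (g.get(n).size() < k) { pick = n; break; }
      if (pick == null) return null;       // all significant degree: spill
      for (String m : g.get(pick)) g.get(m).remove(pick);
      g.remove(pick);
      stack.push(pick);
    }
    Map<String, Integer> colors = new HashMap<>();
    while (!stack.isEmpty()) {             // select, in reverse removal order
      String n = stack.pop();
      Set<Integer> used = new HashSet<>();
      for (String m : graph.get(n))
        if (colors.containsKey(m)) used.add(colors.get(m));
      int c = 0;
      while (used.contains(c)) c++;        // degree < K guarantees c < k
      colors.put(n, c);
    }
    return colors;
  }
}
```

A triangle needs three colors, so with k = 2 the sketch correctly gives up where a real allocator would pick a spill candidate instead.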
Coalescing

A move instruction d <- s can be deleted when the source s and the destination d do not interfere:

- coalesce them into a new node whose edges are the union of those of s and d

Unfortunately, the union is more constrained, and the new graph may no longer be K-colorable: coalescing every such move is overly aggressive.

Simplification with aggressive coalescing

The phases form a loop:

    build -> (simplify | coalesce, repeated while either applies)
          -> spill (any) -> select -> done

with a return to build after any actual spills.
Conservative coalescing

Interleave simplification with coalescing to eliminate most moves without extra spills, applying tests for coalescing that preserve colorability:

Briggs: coalesce nodes a and b only if the combined node ab has fewer than K neighbors of significant degree (>= K):

- simplify will first remove all insignificant-degree neighbors
- ab is then adjacent to fewer than K neighbors, so it too can be simplified

George: coalesce a and b only if every significant-degree neighbor of a already interferes with b:

- simplify can remove all insignificant-degree neighbors of a
- the remaining neighbors of a already interfere with b, so the coalesced node is no harder to color than b was

Iterated register coalescing:

1. Build the interference graph G; distinguish move-related from non-move-related nodes.
2. Simplify: remove non-move-related nodes of low degree, one at a time.
3. Coalesce: conservatively coalesce move-related nodes; remove the associated move instruction.
4. Freeze: when neither simplify nor coalesce applies, give up on coalescing a low-degree move-related node, treat it as non-move-related, and resume the iteration of simplify and coalesce.
5. Spill: when no low-degree nodes remain, choose a potential spill and continue.
6. Select: pop the stack, assigning colors. For actual spills:
   (a) rewrite the code to fetch actual spills before each use and store them after each definition
   (b) recalculate liveness and repeat
Spilling

Actual spills require repeating build and simplify on the whole program:

- to avoid increasing the number of spills in future rounds of build, one can simply discard the coalescences
- alternatively, preserve the coalescences made before the first potential spill was discovered, and discard only those made after that point

Iterated register coalescing, phase diagram:

    (SSA constant propagation, optional)
      -> build -> simplify <-> conservative coalesce -> freeze
      -> potential spill -> select -> actual spills? -> back to build
                                      otherwise      -> done
Precolored temporaries

Machine registers appear in the interference graph as precolored nodes:

- select and coalesce can give an ordinary temporary the same color as a precolored register, if they don't interfere
- e.g., argument registers can be reused inside procedures for a temporary
- simplify, freeze and spill cannot be performed on precolored nodes; they behave as if they had infinite degree

This also avoids needing to store large adjacency lists for precolored nodes; coalescing involving them can use the George criterion.

Temporary copies of machine registers

Since the same register may serve different purposes at different times:

- move callee-save registers to fresh temporaries on procedure entry, and back on exit, spilling between as necessary
- register pressure will spill the fresh temporaries as necessary; otherwise they can be coalesced with their precolored counterpart and the moves deleted
Caller-save and callee-save registers

Variables whose live ranges span calls should go to callee-save registers, otherwise to caller-save. When a potential spill forces a choice, choose nodes with high degree but few uses, such as the fresh callee-save temporaries; spilling one makes the original callee-save register available for coloring the cross-call variable.

Example

    enter: c := r3
           a := r1
           b := r2
           d := 0
           e := a
    loop:  d := d + b
           e := e - 1
           if e > 0 goto loop
           r1 := d
           r3 := c
           return        [ r1, r3 live out ]

The temporaries are a, b, c, d, e; assume a target machine with K = 3 registers: r1, r2 (caller-save/argument) and r3 (callee-save).
Interference graph

[Interference graph figure not reproduced: temporaries a, b, d, e interfere with each other and with the precolored nodes r1, r2, r3; c interferes across the loop.]

- No node can be simplified: b, c, d and e all have significant degree (>= K).
- No coalesce is safe yet (each candidate's combined node would be adjacent to too many significant-degree nodes).
- Must spill, based on priorities:

    Node   uses+defs       uses+defs     degree   spill priority
           outside loop    inside loop
    a      ( 2  +  10 x    0 )  /  4   =   0.50
    b      ( 1  +  10 x    1 )  /  4   =   2.75
    c      ( 2  +  10 x    0 )  /  6   =   0.33
    d      ( 2  +  10 x    2 )  /  4   =   5.50
    e      ( 1  +  10 x    3 )  /  3   =  10.33

where spill priority = (uses+defs outside loop + 10 x uses+defs inside loop) / degree, so a lower value means cheaper to spill.

- Node c has the lowest priority, so spill c.
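The priority formula in the table is trivial to compute per node; the 10x inner-loop weight is the slide's heuristic, not a fixed constant:

```java
// Spill priority: nodes with lower values are cheaper to spill.
// priority = (defs+uses outside loop + 10 * defs+uses inside loop) / degree
public class SpillPriority {
  public static double priority(int outsideLoop, int insideLoop, int degree) {
    return (outsideLoop + 10.0 * insideLoop) / degree;
  }

  public static void main(String[] args) {
    System.out.println(priority(2, 0, 6));   // node c, the cheapest spill
  }
}
```

High degree in the denominator rewards spilling nodes that constrain many neighbors; loop-weighted uses in the numerator penalize spilling hot values.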
Example (cont.)

- Spill c: remove it from the graph (as a potential spill) and continue.
- Coalesce a and e (one could also coalesce b with r2, and a with r1): the combined node ae takes the union of the neighbors.
- Cannot coalesce r1ae with b, because the move is constrained: the nodes interfere. Simplify d instead.
- The graph now has only precolored nodes, so pop nodes from the stack, coloring along the way:
  - d gets the color r3
  - a, b and e have colors by coalescing
- c was a potential spill, and no color remains for it, so it becomes an actual spill: introduce new temporaries c1 and c2 for its def and use, adding a store after the definition and a load before the use.
Example (cont.)

The program rewritten with spill code for c:

    enter: c1 := r3
           M[c_loc] := c1
           a := r1
           b := r2
           d := 0
           e := a
    loop:  d := d + b
           e := e - 1
           if e > 0 goto loop
           r1 := d
           c2 := M[c_loc]
           r3 := c2
           return

Liveness is recalculated and the interference graph rebuilt; the new temporaries c1 and c2 have tiny live ranges. As before, coalesce a with e, then b with r2; also coalesce c1 with r3 and c2 with r3 (these moves are unconstrained).
Example (cont.)

As before, coalesce ae with r1 and simplify d. Pop d from the stack: select r3 for it. All other nodes were coalesced or precolored, so the coloring is:

- a, e -> r1
- b -> r2
- c, c1, c2 -> r3
- d -> r3

Rewrite the program with this assignment:

    enter: r3 := r3
           M[c_loc] := r3
           r1 := r1
           r2 := r2
           r3 := 0
           r1 := r1
    loop:  r3 := r3 + r2
           r1 := r1 - 1
           if r1 > 0 goto loop
           r1 := r3
           r3 := M[c_loc]
           r3 := r3
           return

Example (cont.)

Finally, delete the moves whose source and destination coincide (these are the coalesced moves):

    enter: M[c_loc] := r3
           r3 := 0
    loop:  r3 := r3 + r2
           r1 := r1 - 1
           if r1 > 0 goto loop
           r1 := r3
           r3 := M[c_loc]
           return

One uncoalesced move remains (r1 := r3), plus the spill store and reload for c.
The procedure abstraction

The procedure abstraction separates the caller from the callee's environment. The essentials:

- on entry, establish the procedure's environment
- at a call, preserve the caller's environment
- on exit, tear down the procedure's environment
- in between, maintain addressability and proper lifetimes

Each call from a procedure P to a procedure Q passes through a pre-call sequence in P, Q's prologue, Q's epilogue, and a post-return sequence in P.

Separate compilation: standard linkage conventions allow programs to be composed from pieces compiled separately. The details are machine dependent, but each system has a standard linkage.

Copyright c 2000 by Antony L. Hosking (notice as on first page).
The linkage divides responsibility between caller and callee.

Frame layout (stack grows toward lower addresses):

    higher addresses
        previous frame
        incoming arguments: argument n, ..., argument 2, argument 1
        <- frame pointer
        current frame (locals, saved state, ...)
    lower addresses

    Caller                             Callee
    pre-call:                          prologue:
      1. allocate basic frame            1. save registers, state
      2. evaluate & store parameters     2. store FP (dynamic link)
      3. store return address            3. set new FP
      4. jump to child                   4. extend frame for local data
    post-return:                       epilogue:
      1. copy return value               1. store return value
      2. deallocate basic frame          2. restore state
      3. restore parameters              3. restore caller's FP
         (if copy-out)                   4. jump to return address

Each procedure activation has an associated activation record or frame (at run time). At compile time the compiler generates the linkage code; at run time, that code manipulates the frame & data areas.
Run-time storage organization

To maintain the illusion of procedures, the compiler can adopt some conventions to govern memory use.

Typical memory layout:

    high address:  stack (grows downward)
                   free memory
                   heap (grows upward)
                   static data
    low address:   code

- Code space: fixed size, statically allocated (at link time).
- Data space: fixed-size data may be statically allocated; variable-sized data must be dynamically allocated.
- Control stack: return addresses and the dynamic slice of the activation tree. The classical scheme allocates one activation record (frame) per activation.

Storage classes

Each variable must be assigned a storage class.

Static variables: addresses compiled into code; (usually) allocated at compile time; limited to fixed-size objects. The link editor must handle duplicate definitions.

Where do local variables go? When can we allocate them on a stack? The key issue is the lifetime of local names:

- downward exposure: called procedures may reference my variables (dynamic scoping vs. lexical scoping)
- upward exposure: can I return a reference to my variables? (functions that return functions, continuation-passing style)

With only downward exposure, the compiler can allocate the frames on the run-time call stack.
Access to non-local data

Two problems:

- given a name, where is the current value? (look up a name, want its most recent declaration; with lexical scoping, references to uninstantiated variables cannot occur)
- given a (level, offset) pair, what's the address?

Two common mechanisms are access links (static chains) and a display.

The display

- a table of access links for lower lexical levels
- lookup is an index from a known offset: constant-time access to any visible level
- takes a slight amount of time to maintain at each call
- either a single global display, or one per frame
- "setting up the basic frame" now includes display manipulation
- non-local gotos and exceptions must restore the display when unwinding

Calls: Saving and restoring registers

                    caller's registers  callee's registers  all registers
    callee saves           1                   3                 5
    caller saves           2                   4                 6

1. The call includes a bitmap of the caller's registers to be saved/restored (best with save/restore instructions to interpret the bitmap directly). Unstructured returns (e.g., non-local gotos, exceptions) create some problems, since the code to restore must be located and executed.
3. Non-local gotos and exceptions must unwind the dynamic chain, restoring callee-saved registers.
5. Easy.
6. Non-local gotos and exceptions must restore all registers from the "outermost callee".

Calling sequences place actuals in the callee's frame; step 5 of one such sequence: the caller sets the return address and the callee's static chain, then performs the call. An alternative is to put actuals below the callee's stack frame, in the caller's frame: common when hardware supports stack management (e.g., VAX).

MIPS registers (partial):

    4-7   a0-a3   first 4 scalar arguments
    31    ra      return address, passed in calls
MIPS procedure call convention

Philosophy: use the full, general calling sequence only when necessary; omit portions of it where possible (e.g., avoid using the fp register whenever possible). Classify routines as leaf (call no other routines) or non-leaf.

The stack frame (high memory at top):

    argument n
    ...
    argument 1            <- virtual frame pointer     \
    local variables                                     | frame offset
    saved $ra                                           | framesize
    other saved registers                               |
    argument build area   <- $sp                       /
    low memory

Pre-call:

1. Pass arguments: the first four in registers $a0-$a3, the remainder on the stack.
2. Save caller-saved registers.
3. Execute a jal instruction: it jumps to the target and saves the return address in register $ra.

Prologue:

1. Leaf procedures that use no stack storage need do nothing.
2. Otherwise:
   (a) allocate the stack frame:  subu $sp, framesize
   (b) save callee-saved registers used by the procedure
   (c) save $ra if the routine is a non-leaf
   (all frame sizes and offsets are compile-time constants)
MIPS procedure call convention

Epilogue:

1. Copy return values into the result registers (if not already there).
2. Restore saved registers.
3. Restore $ra.
4. Clean up the stack:  addu $sp, framesize
5. Return:  j $ra
237