
CS 352 Compilers: Principles and Practice

1. Introduction
2. Lexical analysis
3. LL parsing
4. LR parsing
5. JavaCC and JTB
6. Semantic analysis
7. Translation and simplification
8. Liveness analysis and register allocation
9. Activation Records

Chapter 1: Introduction

Things to do

- make sure you have a working mentor account
- start brushing up on Java
- review Java development tools
- find http://www.cs.purdue.edu/homes/palsberg/cs352/F00/index.html
- add yourself to the course mailing list by writing (on a CS computer)
  mailer add me to cs352

Copyright c 2000 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

Compilers

What is a compiler?

- a program that translates an executable program in one language into an executable program in another language
- we expect the program produced by the compiler to be better, in some way, than the original

What is an interpreter?

- a program that reads an executable program and produces the results of running that program
- usually, this involves executing the source program in some fashion

This course deals mainly with compilers

Many of the same issues arise in interpreters

Motivation

Why study compiler construction?

Why build compilers?

Why attend class?

Interest

Compiler construction is a microcosm of computer science

- artificial intelligence: greedy algorithms, learning algorithms
- algorithms: graph algorithms, union-find, dynamic programming
- theory: DFAs for scanning, parser generators, lattice theory for analysis
- systems: allocation and naming, locality, synchronization
- architecture: pipeline management, hierarchy management, instruction set use

Inside a compiler, all these things come together

Isn't it a solved problem?

Machines are constantly changing

Changes in architecture lead to changes in compilers:
- new features pose new problems
- changing costs lead to different concerns
- old solutions need re-engineering

Changes in compilers should prompt changes in architecture:
- new languages and features

Intrinsic Merit

Compiler construction is challenging and fun
- interesting problems
- primary responsibility for performance (blame)
- new architectures pose new challenges
- real results
- extremely complex interactions

Compilers have an impact on how computers are used

Compiler construction poses some of the most interesting problems in computing

Experience

You have used several compilers

What qualities are important in a compiler?

1. Correct code
2. Output runs fast
3. Compiler runs fast
4. Compile time proportional to program size
5. Support for separate compilation
6. Good diagnostics for syntax errors
7. Works well with the debugger
8. Good diagnostics for flow anomalies
9. Cross language calls
10. Consistent, predictable optimization

Each of these shapes your feelings about the correct contents of this course

Abstract view

source code -> compiler -> machine code (reporting errors along the way)

Implications:
- recognize legal (and illegal) programs
- generate correct code
- manage storage of all variables and code
- agreement on format for object (or assembly) code

Big step up from assembler: higher level notations

Traditional two pass compiler

source code -> front end -> IR -> back end -> machine code (reporting errors along the way)

Implications:
- intermediate representation (IR)
- front end maps legal code into IR
- back end maps IR onto target machine
- simplify retargeting
- allows multiple front ends
- multiple passes lead to better code

A fallacy

[Diagram: four front ends (FORTRAN, C++, CLU, Smalltalk) all producing a shared IR, which feeds three back ends producing target 1, target 2, and target 3.]

Can we build n x m compilers with n + m components?

- must encode all the knowledge in each front end
- must represent all the features in one IR
- must handle all the features in each back end

Limited success with low-level IRs

Front end

source code -> scanner -> tokens -> parser -> IR (reporting errors along the way)

Responsibilities:
- recognize legal programs
- report errors
- produce IR
- preliminary storage map
- shape the code for the back end

Much of front end construction can be automated

Front end

Scanner:
- maps characters into tokens, the basic unit of syntax
    x = x + y  becomes  <id,x> = <id,x> + <id,y>
- character string value for a token is a lexeme
- typical tokens: number, id, +, -, *, /, do, end
- eliminates white space (tabs, blanks, comments)
- a key issue is speed, so use a specialized recognizer (as opposed to lex)


Front end

Parser:
- recognize context-free syntax
- guide context-sensitive analysis
- construct IR(s)
- produce meaningful error messages
- attempt error correction

Parser generators mechanize much of the work

Front end

Context-free syntax is specified with a grammar

<sheep noise> ::= baa <sheep noise>
               |  baa

This grammar defines the set of noises that a sheep makes under normal circumstances

The format is called Backus-Naur form (BNF)

Formally, a grammar G = (S, N, T, P) where:

S is the start symbol
N is a set of non-terminal symbols
T is a set of terminal symbols
P is a set of productions or rewrite rules (P : N -> (N ∪ T)*)

Front end

Context-free syntax can be put to better use

1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <term>
3         |  <term>
4 <term> ::= num
5         |  id
6 <op>   ::= +
7         |  -

This grammar defines simple expressions with addition and subtraction over the tokens id and num

S = <goal>
T = { num, id, +, - }
N = { <goal>, <expr>, <term>, <op> }
P = { 1, 2, 3, 4, 5, 6, 7 }

Front end

Given a grammar, valid sentences can be derived by repeated substitution.

Prod'n  Result
-       goal
1       expr
2       expr op term
5       expr op y
7       expr - y
2       expr op term - y
4       expr op 2 - y
6       expr + 2 - y
3       term + 2 - y
5       x + 2 - y

To recognize a valid sentence in some CFG, we reverse this process and build up a parse


Front end

A parse can be represented by a tree called a parse or syntax tree

For x + 2 - y:

              goal
                |
              expr
            /   |    \
        expr   op    term
       /  |  \   |      |
   expr  op  term -  <id:y>
     |    |    |
  term    +  <num:2>
     |
 <id:x>

Obviously, this contains a lot of unnecessary information

Front end

So, compilers often use an abstract syntax tree

          -
        /   \
       +    <id:y>
     /   \
<id:x>  <num:2>

This is much more concise

Abstract syntax trees (ASTs) are often used as an IR between front end and back end

Back end

IR -> instruction selection -> register allocation -> machine code (reporting errors along the way)

Responsibilities:
- translate IR into target machine code
- choose instructions for each IR operation
- decide what to keep in registers at each point
- ensure conformance with system interfaces

Automation has been less successful here

Back end

Instruction selection:
- produce compact, fast code
- use available addressing modes
- pattern matching problem
  - ad hoc techniques
  - tree pattern matching
  - string pattern matching
  - dynamic programming


Back end

Register Allocation:
- have value in a register when used
- limited resources
- changes instruction choices
- can move loads and stores
- optimal allocation is difficult

Modern allocators often use an analogy to graph coloring

Traditional three pass compiler

source code -> front end -> IR -> middle end -> IR -> back end -> machine code (reporting errors along the way)

Code Improvement:
- analyzes and changes IR
- goal is to reduce runtime
- must preserve values

Optimizer (middle end)

IR -> opt 1 -> IR -> ... -> IR -> opt n -> IR (reporting errors along the way)

Modern optimizers are usually built as a set of passes

Typical passes:
- constant propagation and folding
- code motion
- reduction of operator strength
- common subexpression elimination
- redundant store elimination
- dead code elimination

The Tiger compiler

[Figure: the phases of the Tiger compiler as a pipeline.
Pass 1: Lex (Source Program -> Tokens), Parse (-> Reductions), Parsing Actions (-> Abstract Syntax).
Pass 2: Semantic Analysis (consulting Environments and Tables), Frame Layout (Frame), Translate (-> IR Trees).
Pass 3: Canonicalize (IR Trees -> IR Trees).
Pass 4: Instruction Selection (-> Assem).
Pass 5: Control Flow Analysis (-> Flow Graph).
Pass 6: Data Flow Analysis (-> Interference Graph).
Pass 7: Register Allocation (-> Register Assignment).
Pass 8: Code Emission (-> Assembly Language).
Pass 9: Assembler (-> Relocatable Object Code).
Pass 10: Linker (-> Machine Language).]


The Tiger compiler phases

Lex: Break source file into individual words, or tokens
Parse: Analyse the phrase structure of program
Parsing Actions: Build a piece of abstract syntax tree for each phrase
Semantic Analysis: Determine what each phrase means, relate uses of variables to their definitions, check types of expressions, request translation of each phrase
Frame Layout: Place variables, function parameters, etc., into activation records (stack frames) in a machine-dependent way
Translate: Produce intermediate representation trees (IR trees), a notation that is not tied to any particular source language or target machine
Canonicalize: Hoist side effects out of expressions, and clean up conditional branches, for convenience of later phases
Instruction Selection: Group IR-tree nodes into clumps that correspond to actions of target-machine instructions
Control Flow Analysis: Analyse sequence of instructions into control flow graph showing all possible flows of control program might follow when it runs
Data Flow Analysis: Gather information about flow of data through variables of program; e.g., liveness analysis calculates places where each variable holds a still-needed (live) value
Register Allocation: Choose registers for variables and temporary values; variables not simultaneously live can share same register
Code Emission: Replace temporary names in each machine instruction with registers

A straight-line programming language

A straight-line programming language (no loops or conditionals):

Stm -> Stm ; Stm              CompoundStm
Stm -> id := Exp              AssignStm
Stm -> print ( ExpList )      PrintStm
Exp -> id                     IdExp
Exp -> num                    NumExp
Exp -> Exp Binop Exp          OpExp
Exp -> ( Stm , Exp )          EseqExp
ExpList -> Exp , ExpList      PairExpList
ExpList -> Exp                LastExpList
Binop -> +                    Plus
Binop -> -                    Minus
Binop -> *                    Times
Binop -> /                    Div

e.g., a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

prints: 8 7 80

Tree representation

a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)

CompoundStm
+- AssignStm a
|    +- OpExp (NumExp 5) Plus (NumExp 3)
+- CompoundStm
   +- AssignStm b
   |    +- EseqExp
   |         +- PrintStm
   |         |    +- PairExpList
   |         |         +- IdExp a
   |         |         +- LastExpList
   |         |              +- OpExp (IdExp a) Minus (NumExp 1)
   |         +- OpExp (NumExp 10) Times (IdExp a)
   +- PrintStm
        +- LastExpList
             +- IdExp b

This is a convenient internal representation for a compiler to use.

Java classes for trees

abstract class Stm {}

class CompoundStm extends Stm {
    Stm stm1, stm2;
    CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; }
}

class AssignStm extends Stm {
    String id; Exp exp;
    AssignStm(String i, Exp e) { id = i; exp = e; }
}

class PrintStm extends Stm {
    ExpList exps;
    PrintStm(ExpList e) { exps = e; }
}

abstract class Exp {}

class IdExp extends Exp {
    String id;
    IdExp(String i) { id = i; }
}

class NumExp extends Exp {
    int num;
    NumExp(int n) { num = n; }
}

class OpExp extends Exp {
    Exp left, right; int oper;
    final static int Plus = 1, Minus = 2, Times = 3, Div = 4;
    OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
}

class EseqExp extends Exp {
    Stm stm; Exp exp;
    EseqExp(Stm s, Exp e) { stm = s; exp = e; }
}

abstract class ExpList {}

class PairExpList extends ExpList {
    Exp head; ExpList tail;
    PairExpList(Exp h, ExpList t) { head = h; tail = t; }
}

class LastExpList extends ExpList {
    Exp head;
    LastExpList(Exp h) { head = h; }
}
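
To make the tree representation concrete, here is the example program built as a tree of these classes. This follows the presentation in Appel's Tiger book, on which these notes are based; the wrapping class and main method are added here just so the fragment compiles alongside the class definitions above.

class BuildExample {
    public static void main(String[] args) {
        // a := 5 + 3; b := (print(a, a - 1), 10 * a); print(b)
        Stm prog =
            new CompoundStm(
                new AssignStm("a",
                    new OpExp(new NumExp(5), OpExp.Plus, new NumExp(3))),
                new CompoundStm(
                    new AssignStm("b",
                        new EseqExp(
                            new PrintStm(
                                new PairExpList(new IdExp("a"),
                                    new LastExpList(
                                        new OpExp(new IdExp("a"), OpExp.Minus,
                                                  new NumExp(1))))),
                            new OpExp(new NumExp(10), OpExp.Times,
                                      new IdExp("a")))),
                    new PrintStm(new LastExpList(new IdExp("b")))));
    }
}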

Chapter 2: Lexical Analysis

Scanner

source code -> scanner -> tokens -> parser -> IR (reporting errors along the way)

- maps characters into tokens, the basic unit of syntax
    x = x + y  becomes  <id,x> = <id,x> + <id,y>
- character string value for a token is a lexeme
- typical tokens: number, id, +, -, *, /, do, end
- eliminates white space (tabs, blanks, comments)
- a key issue is speed, so use a specialized recognizer (as opposed to lex)

Specifying patterns

A scanner must recognize various parts of the language's syntax

Some parts are easy:

white space
  <ws> ::= <ws> ' '
        |  <ws> '\t'
        |  ' '
        |  '\t'

keywords and operators
  specified as literal patterns: do, end

comments
  opening and closing delimiters: /* ... */

Specifying patterns

A scanner must recognize various parts of the language's syntax

Other parts are much harder:

identifiers
  alphabetic followed by k alphanumerics (_, $, &, ...)

numbers
  integers: 0 or digit from 1-9 followed by digits from 0-9
  decimals: integer '.' digits from 0-9
  reals: (integer or decimal) 'E' (+ or -) digits from 0-9
  complex: '(' real ',' real ')'

We need a powerful notation to specify these patterns


Operations on languages

Operation                  Definition
union of L and M           L ∪ M = { s | s ∈ L or s ∈ M }
  written L ∪ M
concatenation of L and M   LM = { st | s ∈ L and t ∈ M }
  written LM
Kleene closure of L        L* = the union of L^i for i = 0 to ∞
  written L*
positive closure of L      L+ = the union of L^i for i = 1 to ∞
  written L+

Regular expressions

Patterns are often specified as regular languages

Notations used to describe a regular language (or a regular set) include both regular expressions and regular grammars

Regular expressions (over an alphabet Σ):

1. ε is a RE denoting the set {ε}
2. if a ∈ Σ, then a is a RE denoting {a}
3. if r and s are REs, denoting L(r) and L(s), then:
   (r) is a RE denoting L(r)
   (r) | (s) is a RE denoting L(r) ∪ L(s)
   (r)(s) is a RE denoting L(r)L(s)
   (r)* is a RE denoting (L(r))*

If we adopt a precedence for operators, the extra parentheses can go away. We assume closure, then concatenation, then alternation as the order of precedence.

Examples

identifier
  letter -> (a | b | c | ... | z | A | B | C | ... | Z)
  digit  -> (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id     -> letter (letter | digit)*

numbers
  integer -> (+ | - | ε) (0 | (1 | 2 | 3 | ... | 9) digit*)
  decimal -> integer . digit*
  real    -> (integer | decimal) E (+ | -) digit*
  complex -> '(' real , real ')'

Numbers can get much more complicated

Most programming language tokens can be described with REs

We can use REs to build scanners automatically

Algebraic properties of REs

Axiom                    Description
r | s = s | r            | is commutative
r | (s | t) = (r | s) | t    | is associative
(rs)t = r(st)            concatenation is associative
r(s | t) = rs | rt       concatenation distributes over |
(s | t)r = sr | tr
εr = r                   ε is the identity for concatenation
rε = r
r* = (r | ε)*            relation between * and ε
r** = r*                 * is idempotent
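
As a side note (not part of the original slides), the number patterns above carry over almost symbol-for-symbol into java.util.regex, which makes them easy to experiment with; the class and constant names below are invented for this example.

import java.util.regex.Pattern;

public class NumberPatterns {
    // integer: optional sign, then 0 or a nonzero digit followed by digits
    static final Pattern INTEGER = Pattern.compile("[+-]?(0|[1-9][0-9]*)");
    // decimal: integer '.' digits
    static final Pattern DECIMAL = Pattern.compile("[+-]?(0|[1-9][0-9]*)\\.[0-9]*");
    // real: (integer or decimal) 'E' sign digits
    static final Pattern REAL =
        Pattern.compile("[+-]?(0|[1-9][0-9]*)(\\.[0-9]*)?E[+-][0-9]*");

    public static void main(String[] args) {
        System.out.println(INTEGER.matcher("-42").matches());   // true
        System.out.println(DECIMAL.matcher("3.14").matches());  // true
        System.out.println(REAL.matcher("1.5E-3").matches());   // true
        System.out.println(INTEGER.matcher("007").matches());   // false: leading zero
    }
}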

Examples

Let Σ = {a, b}

1. a | b denotes {a, b}

2. (a | b)(a | b) denotes {aa, ab, ba, bb}
   i.e., (a | b)(a | b) = aa | ab | ba | bb

3. a* denotes {ε, a, aa, aaa, ...}

4. (a | b)* denotes the set of all strings of a's and b's (including ε)
   i.e., (a | b)* = (a*b*)*

5. a | a*b denotes {a, b, ab, aab, aaab, aaaab, ...}

Recognizers

From a regular expression we can construct a deterministic finite automaton (DFA)

Recognizer for identifier:

  state 0: on letter, go to state 1; on digit or other, go to state 3 (error)
  state 1: on letter or digit, stay in state 1; on other, go to state 2 (accept)

identifier
  letter -> (a | b | c | ... | z | A | B | C | ... | Z)
  digit  -> (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)
  id     -> letter (letter | digit)*

Code for the recognizer

char <- next char();
state <- 0;                    /* code for state 0 */
done <- false;
token value <- "";             /* empty string */
while (not done) {
    class <- char class[char];
    state <- next state[class, state];
    switch (state) {
    case 1:                    /* building an id */
        token value <- token value + char;
        char <- next char();
        break;
    case 2:                    /* accept state */
        token type <- identifier;
        done <- true;
        break;
    case 3:                    /* error */
        token type <- error;
        done <- true;
        break;
    }
}
return token type;

Tables for the recognizer

Two tables control the recognizer

char class:
  char     a-z     A-Z     0-9    other
  value    letter  letter  digit  other

next state:
  class    0  1  2  3
  letter   1  1  -  -
  digit    3  1  -  -
  other    3  2  -  -

To change languages, we can just change tables


Automatic construction

Scanner generators automatically construct code from regular expression-like descriptions
- construct a dfa
- use state minimization techniques
- emit code for the scanner (table driven or direct code)

A key issue in automation is an interface to the parser

lex is a scanner generator supplied with UNIX
- emits C code for scanner
- provides macro definitions for each token (used in the parser)

Grammars for regular languages

Can we place a restriction on the form of a grammar to ensure that it describes a regular language?

Provable fact:
  For any RE r, there is a grammar g such that L(r) = L(g).

The grammars that generate regular sets are called regular grammars

Definition:
  In a regular grammar, all productions have one of two forms:

  1. A -> aA
  2. A -> a

  where A is any non-terminal and a is any terminal symbol

These are also called type 3 grammars (Chomsky)

More regular languages

Example: the set of strings containing an even number of zeros and an even number of ones

[DFA with four states s0..s3: 1-transitions pair s0 with s1 and s2 with s3; 0-transitions pair s0 with s2 and s1 with s3; s0 is both the start state and the accepting state.]

The RE is (00 | 11)* ((01 | 10)(00 | 11)* (01 | 10)(00 | 11)*)*

More regular expressions

What about the RE (a | b)* abb?

[NFA: s0 loops to itself on a and b, and also moves to s1 on a; s1 moves to s2 on b; s2 moves to s3 on b; s3 accepts.]

State s0 has multiple transitions on a!
This is a nondeterministic finite automaton:

         a           b
  s0   {s0, s1}    {s0}
  s1   -           {s2}
  s2   -           {s3}


Finite automata

A non-deterministic finite automaton (NFA) consists of:

1. a set of states S = { s0, ..., sn }
2. a set of input symbols Σ (the alphabet)
3. a transition function move mapping state-symbol pairs to sets of states
4. a distinguished start state s0
5. a set of distinguished accepting or final states F

A Deterministic Finite Automaton (DFA) is a special case of an NFA:

1. no state has a ε-transition, and
2. for each state s and input symbol a, there is at most one edge labelled a leaving s.

A DFA accepts x iff. there exists a unique path through the transition graph from s0 to an accepting state such that the labels along the edges spell x.

DFAs and NFAs are equivalent

1. DFAs are clearly a subset of NFAs

2. Any NFA can be converted into a DFA, by simulating sets of simultaneous states:
   - each DFA state corresponds to a set of NFA states
   - possible exponential blowup

NFA to DFA using the subset construction: example 1

(a | b)* abb

Simulating the NFA of the previous slide on sets of states gives a DFA; each DFA state is a set of NFA states:

                  a           b
  {s0}        {s0, s1}     {s0}
  {s0, s1}    {s0, s1}     {s0, s2}
  {s0, s2}    {s0, s1}     {s0, s3}
  {s0, s3}    {s0, s1}     {s0}

On input a b b the DFA moves {s0} -> {s0, s1} -> {s0, s2} -> {s0, s3}, the accepting state.

Constructing a DFA from a regular expression

RE -> NFA with ε moves
  build NFA for each term; connect them with ε moves

NFA with ε moves -> DFA
  construct the simulation: the "subset" construction

DFA -> minimized DFA
  merge compatible states

DFA -> RE
  construct R[k,i,j] = R[k-1,i,k] (R[k-1,k,k])* R[k-1,k,j] ∪ R[k-1,i,j]


RE to NFA

N(ε):     a start state with an ε-transition to an accepting state
N(a):     a start state with an a-transition to an accepting state
N(AB):    N(A) followed by N(B); the accepting state of N(A) feeds the start state of N(B)
N(A | B): a new start state with ε moves into both N(A) and N(B), and ε moves from each into a new accepting state
N(A*):    new start and accepting states; ε moves let the input enter N(A), loop from its accepting state back to its start, or bypass it entirely

RE to NFA: example

(a | b)* abb

a | b:      1 -ε-> 2 -a-> 3 -ε-> 6
            1 -ε-> 4 -b-> 5 -ε-> 6

(a | b)*:   0 -ε-> 1,  6 -ε-> 7,  6 -ε-> 1,  0 -ε-> 7

abb:        7 -a-> 8 -b-> 9 -b-> 10

NFA to DFA: the subset construction

Input: NFA N
Output: A DFA D with states Dstates and transitions Dtrans such that L(D) = L(N)
Method: Let s be a state in N and T be a set of states, and use the following operations:

Operation      Definition
ε-closure(s)   set of NFA states reachable from NFA state s on ε-transitions alone
ε-closure(T)   set of NFA states reachable from some NFA state s in T on ε-transitions alone
move(T, a)     set of NFA states to which there is a transition on input symbol a from some NFA state s in T

add state T = ε-closure(s0) unmarked to Dstates
while ∃ unmarked state T in Dstates
    mark T
    for each input symbol a
        U = ε-closure(move(T, a))
        if U ∉ Dstates then add U to Dstates unmarked
        Dtrans[T, a] = U
    endfor
endwhile

ε-closure(s0) is the start state of D
A state of D is accepting if it contains at least one accepting state in N

NFA to DFA using subset construction: example 2

Applying the construction to the NFA for (a | b)* abb (states 0-10, previous slide):

  A = {0, 1, 2, 4, 7}                     a   b
  B = {1, 2, 3, 4, 6, 7, 8}          A    B   C
  C = {1, 2, 4, 5, 6, 7}             B    B   D
  D = {1, 2, 4, 5, 6, 7, 9}          C    B   C
  E = {1, 2, 4, 5, 6, 7, 10}         D    B   E
                                     E    B   C
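
A compact Java sketch of the algorithm above (not the notes' own code; the representation is an assumption made for this example). States are integers, ε-transitions are marked with a sentinel symbol, and the NFA is a nested map from state to symbol to successor states.

import java.util.*;

public class SubsetConstruction {
    static final char EPS = 'ε';

    // ε-closure(T): all NFA states reachable from T on ε-transitions alone.
    static Set<Integer> closure(Set<Integer> t,
            Map<Integer, Map<Character, Set<Integer>>> nfa) {
        Deque<Integer> work = new ArrayDeque<>(t);
        Set<Integer> result = new HashSet<>(t);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int u : nfa.getOrDefault(s, Map.of()).getOrDefault(EPS, Set.of()))
                if (result.add(u)) work.push(u);
        }
        return result;
    }

    // move(T, a): states reachable from T on one a-transition.
    static Set<Integer> move(Set<Integer> t, char a,
            Map<Integer, Map<Character, Set<Integer>>> nfa) {
        Set<Integer> result = new HashSet<>();
        for (int s : t)
            result.addAll(nfa.getOrDefault(s, Map.of()).getOrDefault(a, Set.of()));
        return result;
    }

    // The subset construction: Dtrans keyed by DFA state (a set of NFA states).
    static Map<Set<Integer>, Map<Character, Set<Integer>>> subsetConstruct(
            Map<Integer, Map<Character, Set<Integer>>> nfa,
            int start, Set<Character> alphabet) {
        Set<Integer> startState = closure(Set.of(start), nfa);
        Map<Set<Integer>, Map<Character, Set<Integer>>> dtrans = new HashMap<>();
        Deque<Set<Integer>> unmarked = new ArrayDeque<>();
        unmarked.push(startState);
        while (!unmarked.isEmpty()) {
            Set<Integer> t = unmarked.pop();
            if (dtrans.containsKey(t)) continue;      // already marked
            Map<Character, Set<Integer>> row = new HashMap<>();
            dtrans.put(t, row);
            for (char a : alphabet) {
                Set<Integer> u = closure(move(t, a, nfa), nfa);
                row.put(a, u);
                if (!dtrans.containsKey(u)) unmarked.push(u);
            }
        }
        return dtrans;
    }
}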

Limits of regular languages

Not all languages are regular

One cannot construct DFAs to recognize these languages:

  L = { p^k q^k }
  L = { wcw^r | w ∈ Σ* }

Note: neither of these is a regular expression!
(DFAs cannot count!)

But, this is a little subtle. One can construct DFAs for:

- alternating 0's and 1's:  (ε | 1)(01)*(ε | 0)
- sets of pairs of 0's and 1's:  (01 | 10)+

So what is hard?

Language features that can cause problems:

reserved words
  PL/I had no reserved words
  if then then then = else; else else = then;

significant blanks
  FORTRAN and Algol68 ignore blanks
  do 10 i = 1,25
  do 10 i = 1.25

string constants
  special characters in strings
  newline, tab, quote, comment delimiters

finite closures
  some languages limit identifier lengths
  adds states to count length
  FORTRAN 66: 6 characters

These can be swept under the rug in the language design

How bad can it get?

[This slide shows a pathological FORTRAN fragment. With insignificant blanks and no reserved words, a prefix such as DO9E1=1 cannot be classified when first seen: DO9E1=1,2 opens a DO loop, while DO9E1=1.2 is an assignment to the variable DO9E1.]

Chapter 3: LL Parsing


The role of the parser

source code -> scanner -> tokens -> parser -> IR (reporting errors along the way)

Parser:
- performs context-free syntax analysis
- guides context-sensitive analysis
- constructs an intermediate representation
- produces meaningful error messages
- attempts error correction

For the next few weeks, we will look at parser construction

Syntax analysis

Context-free syntax is specified with a context-free grammar.

Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:

Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.

Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called a goal symbol.

P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left hand side.

The set V = Vt ∪ Vn is called the vocabulary of G

Notation and terminology

a, b, c, ... ∈ Vt
A, B, C, ... ∈ Vn
U, V, W, ... ∈ V
α, β, γ, ... ∈ V*
u, v, w, ... ∈ Vt*

If A -> γ then αAβ ⇒ αγβ is a single-step derivation using A -> γ

Similarly, ⇒* and ⇒+ denote derivations of zero or more and one or more steps

If S ⇒* β then β is said to be a sentential form of G

L(G) = { w ∈ Vt* | S ⇒+ w }; w ∈ L(G) is called a sentence of G

Note, L(G) = { β ∈ V* | S ⇒* β } ∩ Vt*

Syntax analysis

Grammars are often written in Backus-Naur form (BNF).

Example:

1 <goal> ::= <expr>
2 <expr> ::= <expr> <op> <expr>
3         |  num
4         |  id
5 <op>   ::= +
6         |  -
7         |  *
8         |  /

This describes simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent

1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example


Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z] ([a-zA-Z] | [0-9])*
      |  0 | [1-9][0-9]*
op   ::= + | - | * | /
expr ::= (term op)* term

Regular expressions are used to classify:
- identifiers, numbers, keywords
- REs are more concise and simpler for tokens than a grammar
- more efficient scanners can be built from REs (DFAs) than grammars

Context-free grammars are used to count:
- brackets: (), begin ... end, if ... then ... else
- imparting structure: expressions

Syntactic analysis is complicated enough: grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes compiler more manageable.

Derivations

We can view the productions of a CFG as rewriting rules.

Using our example CFG:

goal ⇒ expr
     ⇒ expr op expr
     ⇒ expr op expr op expr
     ⇒ <id,x> op expr op expr
     ⇒ <id,x> + expr op expr
     ⇒ <id,x> + <num,2> op expr
     ⇒ <id,x> + <num,2> * expr
     ⇒ <id,x> + <num,2> * <id,y>

We have derived the sentence x + 2 * y.
We denote this: goal ⇒* id + num * id

Such a sequence of rewrites is a derivation or a parse.

The process of discovering a derivation is called parsing.

Derivations

At each step, we chose a non-terminal to replace.

This choice can lead to different derivations.

Two are of particular interest:

leftmost derivation
  the leftmost non-terminal is replaced at each step

rightmost derivation
  the rightmost non-terminal is replaced at each step

The previous example was a leftmost derivation.

Rightmost derivation

For the string x + 2 * y:

goal ⇒ expr
     ⇒ expr op expr
     ⇒ expr op <id,y>
     ⇒ expr * <id,y>
     ⇒ expr op expr * <id,y>
     ⇒ expr op <num,2> * <id,y>
     ⇒ expr + <num,2> * <id,y>
     ⇒ <id,x> + <num,2> * <id,y>

Again, goal ⇒* id + num * id

65 66

Precedence

These two derivations point out a problem with the grammar.

It has no notion of precedence, or implied order of evaluation.

The parse tree built from the rightmost derivation of x + 2 * y:

              goal
                |
              expr
            /   |    \
        expr   op    expr
       /  |  \   |      |
   expr  op  expr *  <id,y>
     |    |    |
 <id,x>   +  <num,2>

Treewalk evaluation computes (x + 2) * y: the "wrong" answer!
Should be x + (2 * y)

Precedence

To add precedence takes additional machinery:

1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3           |  <expr> - <term>
4           |  <term>
5 <term>   ::= <term> * <factor>
6           |  <term> / <factor>
7           |  <factor>
8 <factor> ::= num
9           |  id

This grammar enforces a precedence on the derivation:
- terms must be derived from expressions
- forces the "correct" tree

Precedence

Now, for the string x + 2 * y:

goal ⇒ expr
     ⇒ expr + term
     ⇒ expr + term * factor
     ⇒ expr + term * <id,y>
     ⇒ expr + factor * <id,y>
     ⇒ expr + <num,2> * <id,y>
     ⇒ term + <num,2> * <id,y>
     ⇒ factor + <num,2> * <id,y>
     ⇒ <id,x> + <num,2> * <id,y>

Again, goal ⇒* id + num * id, but this time, we build the desired tree.

Precedence

              goal
                |
              expr
            /   |   \
        expr    +    term
          |        /  |   \
        term    term  *  factor
          |       |        |
      factor   factor   <id,y>
          |       |
      <id,x>  <num,2>

Treewalk evaluation computes x + (2 * y)


Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous

Example:

<stmt> ::= if <expr> then <stmt>
        |  if <expr> then <stmt> else <stmt>
        |  other stmts

Consider deriving the sentential form:

  if E1 then if E2 then S1 else S2

It has two derivations.
This ambiguity is purely grammatical.
It is a context-free ambiguity.

Ambiguity

May be able to eliminate ambiguities by rearranging the grammar:

<stmt>      ::= <matched>
             |  <unmatched>
<matched>   ::= if <expr> then <matched> else <matched>
             |  other stmts
<unmatched> ::= if <expr> then <stmt>
             |  if <expr> then <matched> else <unmatched>

This generates the same language as the ambiguous grammar, but applies the common sense rule:

  match each else with the closest unmatched then

This is most likely the language designer's intent.

Ambiguity

Ambiguity is often due to confusion in the context-free specification.

Context-sensitive confusions can arise from overloading.

Example: in many Algol-like languages, a reference like f(17) could be a function call or a subscripted variable.

Disambiguating this statement requires context:
- need values of declarations
- not context-free
- really an issue of type

Rather than complicate parsing, we will handle this separately.

Parsing: the big picture

tokens -> parser -> IR

grammar -> parser generator -> parser (code and parsing tables)

Our goal is a flexible parser generator system


Top-down versus bottom-up

Top-down parsers
- start at the root of derivation tree and fill in
- picks a production and tries to match the input
- may require backtracking
- some grammars are backtrack-free (predictive)

Bottom-up parsers
- start at the leaves and fill in
- start in a state valid for legal first tokens
- as input is consumed, change state to encode possibilities (recognize valid prefixes)
- use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.

To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string

1. At a node labelled A, select a production A -> α and construct the appropriate child for each symbol of α
2. When a terminal is added to the fringe that doesn't match the input string, backtrack
3. Find the next node to be expanded (must have a label in Vn)

The key is selecting the right production in step 1
- should be guided by input string

Simple expression grammar

Recall our grammar for simple expressions:

1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3           |  <expr> - <term>
4           |  <term>
5 <term>   ::= <term> * <factor>
6           |  <term> / <factor>
7           |  <factor>
8 <factor> ::= num
9           |  id

Consider the input string x - 2 * y

Example

(the ↑ marks the parser's position in the input)

Prod'n  Sentential form            Input
-       goal                       ↑x - 2 * y
1       expr                       ↑x - 2 * y
2       expr + term                ↑x - 2 * y
4       term + term                ↑x - 2 * y
7       factor + term              ↑x - 2 * y
9       id + term                  ↑x - 2 * y
-       id + term                  x ↑- 2 * y
-       goal (backtrack)           ↑x - 2 * y
1       expr                       ↑x - 2 * y
3       expr - term                ↑x - 2 * y
4       term - term                ↑x - 2 * y
7       factor - term              ↑x - 2 * y
9       id - term                  ↑x - 2 * y
-       id - term                  x ↑- 2 * y
-       id - term                  x - ↑2 * y
5       id - term * factor         x - ↑2 * y
7       id - factor * factor       x - ↑2 * y
8       id - num * factor          x - ↑2 * y
-       id - num * factor          x - 2 ↑* y
-       id - num * factor          x - 2 * ↑y
9       id - num * id              x - 2 * ↑y
-       id - num * id              x - 2 * y↑


Example

Another possible parse for x - 2 * y

Prod'n  Sentential form                 Input
-       goal                            ↑x - 2 * y
1       expr                            ↑x - 2 * y
2       expr + term                     ↑x - 2 * y
2       expr + term + term              ↑x - 2 * y
2       expr + term + term + term       ↑x - 2 * y
2       ...                             ↑x - 2 * y

If the parser makes the wrong choices, expansion doesn't terminate.
This isn't a good property for a parser to have.
(Parsers should terminate!)

Left-recursion

Top-down parsers cannot handle left-recursion in a grammar

Formally, a grammar is left-recursive if

  ∃ A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive

Eliminating left-recursion

To remove left-recursion, we can transform the grammar

Consider the grammar fragment:

<foo> ::= <foo> α
       |  β

where α and β do not start with <foo>

We can rewrite this as:

<foo> ::= β <bar>
<bar> ::= α <bar>
       |  ε

where <bar> is a new non-terminal

This fragment contains no left-recursion

Example

Our expression grammar contains two cases of left-recursion

<expr> ::= <expr> + <term>        <term> ::= <term> * <factor>
        |  <expr> - <term>                |  <term> / <factor>
        |  <term>                         |  <factor>

Applying the transformation gives

<expr>  ::= <term> <expr'>
<expr'> ::= + <term> <expr'>
         |  - <term> <expr'>
         |  ε

<term>  ::= <factor> <term'>
<term'> ::= * <factor> <term'>
         |  / <factor> <term'>
         |  ε

With this grammar, a top-down parser will
- terminate
- backtrack on some inputs


Example

This cleaner grammar defines the same language

1 <goal>   ::= <expr>
2 <expr>   ::= <term> + <expr>
3           |  <term> - <expr>
4           |  <term>
5 <term>   ::= <factor> * <term>
6           |  <factor> / <term>
7           |  <factor>
8 <factor> ::= num
9           |  id

It is
- right-recursive
- free of ε productions

Unfortunately, it generates different associativity
Same syntax, different meaning

Example

Our long-suffering expression grammar:

1  <goal>   ::= <expr>
2  <expr>   ::= <term> <expr'>
3  <expr'>  ::= + <term> <expr'>
4            |  - <term> <expr'>
5            |  ε
6  <term>   ::= <factor> <term'>
7  <term'>  ::= * <factor> <term'>
8            |  / <factor> <term'>
9            |  ε
10 <factor> ::= num
11           |  id

Recall, we factored out left-recursion

How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production

Do we need arbitrary lookahead to parse CFGs?
- in general, yes
- use the Earley or Cocke-Younger-Kasami algorithms
  (Aho, Hopcroft, and Ullman, Problem 2.34; Parsing, Translation and Compiling, Chapter 4)

Fortunately
- large subclasses of CFGs can be parsed with limited lookahead
- most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
LL(1): left to right scan, left-most derivation, 1-token lookahead; and
LR(1): left to right scan, right-most derivation, 1-token lookahead

Predictive parsing

Basic idea:
For any two productions A -> α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α.
That is, for some w ∈ Vt*, w ∈ FIRST(α) iff. α ⇒* wγ.

Key property:
Whenever two productions A -> α and A -> β both appear in the grammar, we would like

  FIRST(α) ∩ FIRST(β) = φ

This would allow the parser to make a correct choice with a lookahead of only one symbol!

The example grammar has this property!


Left factoring

What if a grammar does not have this property?

Sometimes, we can transform a grammar to have this property.

For each non-terminal A find the longest prefix α common to two or more of its alternatives.

if α ≠ ε then replace all of the A productions

  A ::= αβ1 | αβ2 | ... | αβn

with

  A  ::= α A'
  A' ::= β1 | β2 | ... | βn

where A' is a new non-terminal.

Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

Consider a right-recursive version of the expression grammar:

1 <goal>   ::= <expr>
2 <expr>   ::= <term> + <expr>
3           |  <term> - <expr>
4           |  <term>
5 <term>   ::= <factor> * <term>
6           |  <factor> / <term>
7           |  <factor>
8 <factor> ::= num
9           |  id

To choose between productions 2, 3, & 4, the parser must see past the num or id and look at the +, -, *, or /.

  FIRST(2) ∩ FIRST(3) ∩ FIRST(4) ≠ φ

This grammar fails the test.

Note: This grammar is right-associative.

Example

There are two nonterminals that must be left factored:

<expr> ::= <term> + <expr>
        |  <term> - <expr>
        |  <term>

<term> ::= <factor> * <term>
        |  <factor> / <term>
        |  <factor>

Applying the transformation gives us:

<expr>  ::= <term> <expr'>
<expr'> ::= + <expr>
         |  - <expr>
         |  ε

<term>  ::= <factor> <term'>
<term'> ::= * <term>
         |  / <term>
         |  ε

Example

Substituting back into the grammar yields

1  <goal>   ::= <expr>
2  <expr>   ::= <term> <expr'>
3  <expr'>  ::= + <expr>
4            |  - <expr>
5            |  ε
6  <term>   ::= <factor> <term'>
7  <term'>  ::= * <term>
8            |  / <term>
9            |  ε
10 <factor> ::= num
11           |  id

Now, selection requires only a single token lookahead.

Note: This grammar is still right-associative.


Example

Sentential form                        Input
-  goal                                ↑x - 2 * y
1  expr                                ↑x - 2 * y
2  term expr'                          ↑x - 2 * y
6  factor term' expr'                  ↑x - 2 * y
11 id term' expr'                      ↑x - 2 * y
-  id term' expr'                      x ↑- 2 * y
9  id expr'                            x ↑- 2 * y
4  id - expr                           x ↑- 2 * y
-  id - expr                           x - ↑2 * y
2  id - term expr'                     x - ↑2 * y
6  id - factor term' expr'             x - ↑2 * y
10 id - num term' expr'                x - ↑2 * y
-  id - num term' expr'                x - 2 ↑* y
7  id - num * term expr'               x - 2 ↑* y
-  id - num * term expr'               x - 2 * ↑y
6  id - num * factor term' expr'       x - 2 * ↑y
11 id - num * id term' expr'           x - 2 * ↑y
-  id - num * id term' expr'           x - 2 * y↑
9  id - num * id expr'                 x - 2 * y↑
5  id - num * id                       x - 2 * y↑

The next symbol determined each choice correctly.

Back to left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

if ∃ A ::= Aα then replace all of the A productions

  A ::= Aα | β | ... | γ

with

  A  ::= N A'
  N  ::= β | ... | γ
  A' ::= αA' | ε

where N and A' are new productions.

Repeat until there are no left-recursive productions.

Generality

Question:
  By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single token lookahead?

Answer:
  Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

  { a^n 0 b^n | n ≥ 1 } ∪ { a^n 1 b^2n | n ≥ 1 }

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.

Recursive descent parsing

Now, we can produce a simple recursive descent parser from the (right-associative) grammar.

goal:
    token <- next token();
    if (expr() = ERROR | token ≠ EOF) then
        return ERROR;

expr:
    if (term() = ERROR) then
        return ERROR;
    else return expr prime();

expr prime:
    if (token = PLUS) then
        token <- next token();
        return expr();
    else if (token = MINUS) then
        token <- next token();
        return expr();
    else return OK;


Recursive descent parsing

term:
    if (factor() = ERROR) then
        return ERROR;
    else return term prime();

term prime:
    if (token = MULT) then
        token <- next token();
        return term();
    else if (token = DIV) then
        token <- next token();
        return term();
    else return OK;

factor:
    if (token = NUM) then
        token <- next token();
        return OK;
    else if (token = ID) then
        token <- next token();
        return OK;
    else return ERROR;

Building the tree

One of the key jobs of the parser is to build an intermediate representation of the source code.

To build an abstract syntax tree, we can simply insert code at the appropriate points (a Java rendering follows):
- factor() can stack nodes id, num
- term prime() can stack nodes *, /
- term prime() can pop 3, build and push subtree
- expr prime() can stack nodes +, -
- expr prime() can pop 3, build and push subtree
- goal() can pop and return tree
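
A compact Java sketch of the same parser with tree building folded in (not the notes' own code; the token representation and Node class are assumptions made for this example). Rather than an explicit stack of nodes, it passes partial trees through return values, which is the same idea in recursive form.

import java.util.*;

public class RDParser {
    // AST node: a leaf (num/id) or an operator with two children.
    static class Node {
        String label; Node left, right;
        Node(String label) { this.label = label; }
        Node(String label, Node l, Node r) { this.label = label; left = l; right = r; }
        public String toString() {
            return left == null ? label : "(" + left + " " + label + " " + right + ")";
        }
    }

    private final List<String> tokens;     // e.g. ["x", "-", "2", "*", "y"]
    private int pos = 0;
    RDParser(List<String> tokens) { this.tokens = tokens; }

    private String token() { return pos < tokens.size() ? tokens.get(pos) : "$"; }
    private void advance() { pos++; }

    Node goal() {                           // <goal> ::= <expr>
        Node tree = expr();
        if (!token().equals("$")) throw new IllegalStateException("junk after expr");
        return tree;                        // pop and return tree
    }
    Node expr() {                           // <expr> ::= <term> <expr'>
        return exprPrime(term());
    }
    Node exprPrime(Node left) {             // <expr'> ::= + <expr> | - <expr> | ε
        if (token().equals("+") || token().equals("-")) {
            String op = token(); advance();
            return new Node(op, left, expr());   // build and push subtree
        }
        return left;                             // ε
    }
    Node term() {                           // <term> ::= <factor> <term'>
        return termPrime(factor());
    }
    Node termPrime(Node left) {             // <term'> ::= * <term> | / <term> | ε
        if (token().equals("*") || token().equals("/")) {
            String op = token(); advance();
            return new Node(op, left, term());
        }
        return left;
    }
    Node factor() {                         // <factor> ::= num | id
        String t = token(); advance();
        return new Node(t);                      // stack node id, num
    }

    public static void main(String[] args) {
        Node tree = new RDParser(List.of("x", "-", "2", "*", "y")).goal();
        System.out.println(tree);   // (x - (2 * y)) : note the right-associativity
    }
}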
Non-recursive predictive parsing

Observation:
  Our recursive descent parser encodes state information in its run-time stack, or call stack.

Using recursive procedure calls to implement a stack abstraction may not be particularly efficient.

This suggests other implementation methods:
- explicit stack, hand-coded parser
- stack-based, table-driven parser

Non-recursive predictive parsing

Now, a predictive parser looks like:

source code -> scanner -> tokens -> table-driven parser -> IR
                                    (consulting a stack and parsing tables)

Rather than writing code, we build tables.

Building tables can be automated!


Table-driven parsers

A parser generator system often looks like:

grammar -> parser generator -> parser (code) + parsing tables
source code -> scanner -> tokens -> table-driven parser -> IR

This is true for both top-down (LL) and bottom-up (LR) parsers

Non-recursive predictive parsing

Input: a string w and a parsing table M for G

push EOF onto the stack
push the Start Symbol onto the stack
token <- next token()
repeat
    let X be the top stack symbol and a the next input token
    if X is a terminal or EOF then
        if X = a then
            pop X and advance: token <- next token()
        else error()
    else    /* X is a non-terminal */
        if M[X, a] = X -> Y1 Y2 ... Yk then
            pop X
            push Yk, Yk-1, ..., Y1 in that order
        else error()
until X = EOF
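
A runnable Java sketch of this driver loop (assumed names, not the notes' code). The table M maps a (non-terminal, lookahead) pair to the right-hand side to push; symbols are plain strings, "$" plays EOF, and an ε production is an empty right-hand side.

import java.util.*;

public class LL1Driver {
    static boolean parse(List<String> input, String start, Set<String> nonterminals,
                         Map<String, Map<String, List<String>>> M) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("$");
        stack.push(start);
        int pos = 0;
        while (true) {
            String X = stack.peek();
            String a = pos < input.size() ? input.get(pos) : "$";
            if (X.equals("$")) return a.equals("$");       // accept iff input exhausted
            if (!nonterminals.contains(X)) {               // X is a terminal
                if (!X.equals(a)) return false;            // error
                stack.pop(); pos++;                        // match and advance
            } else {
                List<String> rhs = M.getOrDefault(X, Map.of()).get(a);
                if (rhs == null) return false;             // error entry in M
                stack.pop();
                for (int i = rhs.size() - 1; i >= 0; i--)  // push Yk, ..., Y1
                    stack.push(rhs.get(i));                // (ε = empty rhs)
            }
        }
    }
}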
Non-recursive predictive parsing

What we need now is a parsing table M.

Our expression grammar:

1  <goal>   ::= <expr>
2  <expr>   ::= <term> <expr'>
3  <expr'>  ::= + <expr>
4            |  - <expr>
5            |  ε
6  <term>   ::= <factor> <term'>
7  <term'>  ::= * <term>
8            |  / <term>
9            |  ε
10 <factor> ::= num
11           |  id

Its parse table (entries are production numbers; blank means error):

          id   num   +   -   *   /   $†
goal       1    1    -   -   -   -   -
expr       2    2    -   -   -   -   -
expr'      -    -    3   4   -   -   5
term       6    6    -   -   -   -   -
term'      -    -    9   9   7   8   9
factor    11   10    -   -   -   -   -

† we use $ to represent EOF

FIRST

For a string of grammar symbols α, define FIRST(α) as:

- the set of terminal symbols that begin strings derived from α:
    { a ∈ Vt | α ⇒* aβ }
- If α ⇒* ε then ε ∈ FIRST(α)

FIRST(α) contains the set of tokens valid in the initial position in α

To build FIRST(X):

1. If X ∈ Vt then FIRST(X) is {X}
2. If X -> ε then add ε to FIRST(X).
3. If X -> Y1 Y2 ... Yk:
   (a) Put FIRST(Y1) - {ε} in FIRST(X)
   (b) ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yi-1)
       (i.e., Y1 ... Yi-1 ⇒* ε)
       then put FIRST(Yi) - {ε} in FIRST(X)
   (c) If ε ∈ FIRST(Y1) ∩ ... ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.


FOLLOW

For a non-terminal A, define FOLLOW(A) as

  the set of terminals that can appear immediately to the right of A in some sentential form

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it.

A terminal symbol has no FOLLOW set.

To build FOLLOW(A):

1. Put $ in FOLLOW(<goal>)
2. If A -> αBβ:
   (a) Put FIRST(β) - {ε} in FOLLOW(B)
   (b) If β = ε (i.e., A -> αB) or ε ∈ FIRST(β) (i.e., β ⇒* ε) then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made

LL(1) grammars

Previous definition
  A grammar G is LL(1) iff. for all non-terminals A, each distinct pair of productions A -> β and A -> γ satisfy the condition FIRST(β) ∩ FIRST(γ) = φ.

What if A ⇒* ε?

Revised definition
  A grammar G is LL(1) iff. for each set of productions A -> α1 | α2 | ... | αn:

  1. FIRST(α1), FIRST(α2), ..., FIRST(αn) are all pairwise disjoint
  2. If αi ⇒* ε then FIRST(αj) ∩ FOLLOW(A) = φ, ∀ 1 ≤ j ≤ n, i ≠ j.

If G is ε-free, condition 1 is sufficient.
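
The FIRST construction above is a fixed-point computation, and a short Java sketch makes that concrete (assumed representation, not the notes' code). A grammar maps each non-terminal to its alternative right-hand sides; "ε" stands for the empty string, and any symbol that is not a key of the grammar map is treated as a terminal. FOLLOW can be computed by an analogous loop over rule 2 above.

import java.util.*;

public class FirstSets {
    static Map<String, Set<String>> first(Map<String, List<List<String>>> grammar) {
        Map<String, Set<String>> FIRST = new HashMap<>();
        for (String nt : grammar.keySet()) FIRST.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                          // repeat until no more additions
            changed = false;
            for (var e : grammar.entrySet()) {
                Set<String> fx = FIRST.get(e.getKey());
                for (List<String> rhs : e.getValue()) {
                    boolean allNullable = true;    // Y1 ... Yi-1 ⇒* ε so far?
                    for (String y : rhs) {
                        Set<String> fy = grammar.containsKey(y)
                            ? FIRST.get(y) : Set.of(y);   // terminal: FIRST(a) = {a}
                        for (String s : fy)
                            if (!s.equals("ε") && fx.add(s)) changed = true;
                        if (!fy.contains("ε")) { allNullable = false; break; }
                    }
                    if (allNullable && fx.add("ε")) changed = true;  // X ⇒* ε
                }
            }
        }
        return FIRST;
    }
}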
LL(1) grammars

Provable facts about LL(1) grammars:

1. No left-recursive grammar is LL(1)
2. No ambiguous grammar is LL(1)
3. Some languages have no LL(1) grammar
4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example

  S -> aS | a
  is not LL(1) because FIRST(aS) = FIRST(a) = {a}

  S  -> aS'
  S' -> aS' | ε
  accepts the same language and is LL(1)

LL(1) parse table construction

Input: Grammar G
Output: Parsing table M
Method:

1. ∀ productions A -> α:
   (a) ∀ a ∈ FIRST(α), add A -> α to M[A, a]
   (b) If ε ∈ FIRST(α):
       i. ∀ b ∈ FOLLOW(A), add A -> α to M[A, b]
       ii. If $ ∈ FOLLOW(A) then add A -> α to M[A, $]
2. Set each undefined entry of M to error

If ∃ M[A, a] with multiple entries then grammar is not LL(1).

Note: recall a, b ∈ Vt, so a, b ≠ ε
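
A Java sketch of this table-construction method (assumed names, continuing the representation of the FirstSets sketch; FIRST and FOLLOW are taken as already computed, per non-terminal).

import java.util.*;

public class LL1Table {
    static Map<String, Map<String, List<String>>> build(
            Map<String, List<List<String>>> grammar,
            Map<String, Set<String>> FIRST,
            Map<String, Set<String>> FOLLOW) {
        Map<String, Map<String, List<String>>> M = new HashMap<>();
        for (var e : grammar.entrySet()) {
            String A = e.getKey();
            M.putIfAbsent(A, new HashMap<>());
            for (List<String> alpha : e.getValue()) {
                Set<String> f = firstOfString(alpha, grammar, FIRST);
                for (String a : f)
                    if (!a.equals("ε")) addEntry(M, A, a, alpha);   // rule 1(a)
                if (f.contains("ε"))                                // rule 1(b):
                    for (String b : FOLLOW.get(A))                  // FOLLOW(A),
                        addEntry(M, A, b, alpha);                   // including $
            }
        }
        return M;   // absent entries mean error
    }

    // FIRST of a symbol string α = Y1...Yk, from the per-symbol FIRST sets.
    static Set<String> firstOfString(List<String> alpha,
            Map<String, List<List<String>>> grammar,
            Map<String, Set<String>> FIRST) {
        Set<String> result = new HashSet<>();
        for (String y : alpha) {
            Set<String> fy = grammar.containsKey(y) ? FIRST.get(y) : Set.of(y);
            result.addAll(fy);
            result.remove("ε");
            if (!fy.contains("ε")) return result;
        }
        result.add("ε");   // every Yi can derive ε
        return result;
    }

    static void addEntry(Map<String, Map<String, List<String>>> M,
                         String A, String a, List<String> alpha) {
        if (M.get(A).put(a, alpha) != null)        // multiple entries in M[A, a]
            throw new IllegalStateException("conflict: grammar is not LL(1)");
    }
}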

Example

Our long-suffering expression grammar:

S  -> E             T  -> F T'
E  -> T E'          T' -> * T | / T | ε
E' -> + E | - E | ε F  -> num | id

       FIRST           FOLLOW
S      {num, id}       {$}
E      {num, id}       {$}
E'     {ε, +, -}       {$}
T      {num, id}       {+, -, $}
T'     {ε, *, /}       {+, -, $}
F      {num, id}       {+, -, *, /, $}

and the resulting parse table (blank entries mean error):

      id       num      +        -        *        /        $
S     S->E     S->E
E     E->TE'   E->TE'
E'                      E'->+E   E'->-E                      E'->ε
T     T->FT'   T->FT'
T'                      T'->ε    T'->ε    T'->*T   T'->/T    T'->ε
F     F->id    F->num

A grammar that is not LL(1)

<stmt> ::= if <expr> then <stmt> else <stmt>
        |  if <expr> then <stmt>
        |  ...

Left-factored:

<stmt>  ::= if <expr> then <stmt> <stmt'> | ...
<stmt'> ::= else <stmt> | ε

Now, FIRST(<stmt'>) = {ε, else}
Also, FOLLOW(<stmt'>) = {else, $}
But, FIRST(<stmt'>) ∩ FOLLOW(<stmt'>) = {else} ≠ φ

On seeing else, there is a conflict between choosing

  <stmt'> ::= else <stmt>   and   <stmt'> ::= ε

so the grammar is not LL(1)!

The fix:
  Put priority on <stmt'> ::= else <stmt> to associate else with the closest previous then.

Error recovery

Key notion:
- For each non-terminal, construct a set of terminals on which the parser can synchronize
- When an error occurs looking for A, scan until an element of SYNCH(A) is found

Building SYNCH:

1. a ∈ FOLLOW(A) implies a ∈ SYNCH(A)
2. place keywords that start statements in SYNCH(A)
3. add symbols in FIRST(A) to SYNCH(A)

If we can't match a terminal on top of stack:

1. pop the terminal
2. print a message saying the terminal was inserted
3. continue the parse

(i.e., SYNCH(a) = Vt - {a})

Chapter 4: LR Parsing


Some definitions

Recall

For a grammar G, with start symbol S, any string α such that S ⇒* α is called a sentential form

- If α ∈ Vt*, then α is called a sentence in L(G)
- Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.

A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.

Bottom-up parsing

Goal:
  Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.

The parser repeatedly matches a right-sentential form from the language against the tree's upper frontier.

At each match, it applies a reduction to build on the frontier:
- each reduction matches an upper frontier of the partially built tree to the RHS of some production
- each reduction adds a node on top of the frontier

The final result is a rightmost derivation, in reverse.

Example

Consider the grammar

1 S -> aABe
2 A -> Abc
3   |  b
4 B -> d

and the input string abbcde

Prod'n.  Sentential Form
-        abbcde
3        a A bcde
2        a A de
4        a A B e
1        S

The trick appears to be scanning the input and finding valid sentential forms.

Handles

What are we trying to find?

A substring α of the tree's upper frontier that matches some production A -> α where reducing α to A is one step in the reverse of a rightmost derivation

We call such a string a handle.

Formally:
  a handle of a right-sentential form γ is a production A -> β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ

  i.e., if S ⇒*rm αAw ⇒rm αβw then A -> β in the position following α is a handle of αβw

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.


Handles

[Figure: a parse tree with root S whose fringe is αβw; the subtree for A sits above β, so the handle A -> β occupies the position immediately after α, with only terminals (w) to its right.]

The handle A -> β in the parse tree for αβw

Handles

Theorem:
  If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)

1. G is unambiguous implies the rightmost derivation is unique
2. hence a unique production A -> β applied to take γi-1 to γi
3. hence a unique position k at which A -> β is applied
4. hence a unique handle A -> β

Example

The left-recursive expression grammar (original form)

1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3           |  <expr> - <term>
4           |  <term>
5 <term>   ::= <term> * <factor>
6           |  <term> / <factor>
7           |  <factor>
8 <factor> ::= num
9           |  id

Prod'n.  Sentential Form
-        goal
1        expr
3        expr - term
5        expr - term * factor
9        expr - term * id
7        expr - factor * id
8        expr - num * id
4        term - num * id
7        factor - num * id
9        id - num * id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning.

To construct a rightmost derivation

  S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn-1 ⇒ γn = w

we set i to n and apply the following simple algorithm

for i = n down to 1:
  1. find the handle Ai -> βi in γi
  2. replace βi with Ai to generate γi-1

This takes 2n steps, where n is the length of the derivation


Stack implementation

One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser.

Shift-reduce parsers use a stack and an input buffer

1. initialize stack with $

2. Repeat until the top of the stack is the goal symbol and the input token is $
   a) find the handle
      if we don't have a handle on top of the stack, shift an input symbol onto the stack
   b) prune the handle
      if we have a handle A -> β on the stack, reduce
      i) pop |β| symbols off the stack
      ii) push A onto the stack

Example: back to x - 2 * y

1 <goal>   ::= <expr>
2 <expr>   ::= <expr> + <term>
3           |  <expr> - <term>
4           |  <term>
5 <term>   ::= <term> * <factor>
6           |  <term> / <factor>
7           |  <factor>
8 <factor> ::= num
9           |  id

Stack                      Input           Action
$                          id - num * id   shift
$ id                       - num * id      reduce 9
$ factor                   - num * id      reduce 7
$ term                     - num * id      reduce 4
$ expr                     - num * id      shift
$ expr -                   num * id        shift
$ expr - num               * id            reduce 8
$ expr - factor            * id            reduce 7
$ expr - term              * id            shift
$ expr - term *            id              shift
$ expr - term * id                         reduce 9
$ expr - term * factor                     reduce 5
$ expr - term                              reduce 3
$ expr                                     reduce 1
$ goal                                     accept

1. Shift until top of stack is the right end of a handle
2. Find the left end of the handle and reduce

5 shifts + 9 reduces + 1 accept
Shift-reduce parsing

Shift-reduce parsers are simple to understand.

A shift-reduce parser has just four canonical actions:

1. shift — next input symbol is shifted onto the top of the stack
2. reduce — right end of handle is on top of stack;
   locate left end of handle within the stack;
   pop handle off stack and push appropriate non-terminal LHS
3. accept — terminate parsing and signal success
4. error — call an error recovery routine

The key problem: to recognize handles (not covered in this course).

LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost
derivation

    S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn = w

we can, for each right-sentential form in the derivation,

1. isolate the handle of each right-sentential form, and
2. determine the production by which to reduce,

by scanning γi from left to right, going at most k symbols beyond the
right end of the handle of γi.
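
The four actions above are exactly the cases in the driver loop of a
table-driven shift-reduce (LR) parser. The following Java sketch shows
such a driver; the action and goto tables, their encoding, and the
Production class are hypothetical stand-ins for what a parser generator
would emit, not anything defined in these notes:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class Production { int lhs; int rhsLength; }   // A ::= beta, |beta| symbols

    abstract class Action {}                       // one of the four canonical actions
    class Shift  extends Action { int nextState; }
    class Reduce extends Action { Production p; }
    class Accept extends Action {}
    class Error  extends Action {}

    class LRDriver {
      Action[][] action;   // action[state][token]   (assumed generator output)
      int[][]    goTo;     // goTo[state][nonterminal]

      boolean parse(int[] tokens) {                // tokens ends with EOF token 0
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(0);                             // start state plays the role of $
        int i = 0;
        while (true) {
          Action a = action[stack.peek()][tokens[i]];
          if (a instanceof Shift) {                // 1. shift next input symbol
            stack.push(((Shift) a).nextState);
            i++;
          } else if (a instanceof Reduce) {        // 2. reduce: handle on top of stack
            Production p = ((Reduce) a).p;
            for (int k = 0; k < p.rhsLength; k++)  // pop |beta| states
              stack.pop();
            stack.push(goTo[stack.peek()][p.lhs]); // push state for the LHS
          } else if (a instanceof Accept) {        // 3. accept
            return true;
          } else {                                 // 4. error (recovery not shown)
            return false;
          }
        }
      }
    }
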

LR(k) grammars

Formally, a grammar G is LR(k) iff.:

1. S ⇒*rm αAw ⇒rm αβw, and
2. S ⇒*rm γBx ⇒rm αβy, and
3. FIRSTk(w) = FIRSTk(y)

imply αAy = γBx.

i.e., assume sentential forms αβw and αβy, with common prefix αβ and
common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that αβw reduces
to αAw and αβy reduces to γBx.

But, the common prefix means αβy also reduces to αAy, for the same
result.

Thus αAy = γBx.

Why study LR grammars?

LR(1) grammars are often used to construct parsers.

We call these parsers LR(1) parsers.

• everyone's favorite parser
• virtually all context-free programming language constructs can be
  expressed in an LR(1) form
• LR grammars are the most general grammars parsable by a
  deterministic, bottom-up parser
• efficient parsers can be implemented for LR(1) grammars
• LR parsers detect an error as soon as possible in a left-to-right
  scan of the input
• LR grammars describe a proper superset of the languages recognized
  by predictive (i.e., LL) parsers
  – LL(k): recognize use of a production A → β seeing first k symbols
    of β
  – LR(k): recognize occurrence of β (the handle) having seen all of
    what is derived from β plus k symbols of lookahead
Left versus right recursion

Right Recursion:

• needed for termination in predictive parsers
• requires more stack space
• right associative operators

Left Recursion:

• works fine in bottom-up parsers
• limits required stack space
• left associative operators

Rule of thumb:

• right recursion for top-down parsers
• left recursion for bottom-up parsers

Parsing review

Recursive descent
A hand coded recursive descent parser directly encodes a grammar
(typically an LL(1) grammar) into a series of mutually recursive
procedures. It has most of the linguistic limitations of LL(1).

LL(k)
An LL(k) parser must be able to recognize the use of a production after
seeing only the first k symbols of its right hand side.

LR(k)
An LR(k) parser must be able to recognize the occurrence of the right
hand side of a production after having seen all that is derived from
that right hand side with k symbols of lookahead.

Chapter 5: JavaCC and JTB

The Java Compiler Compiler

• Can be thought of as "Lex and Yacc for Java."
• It is based on LL(k) rather than LALR(1).
• Grammars are written in EBNF.
• The Java Compiler Compiler transforms an EBNF grammar into an LL(k)
  parser.
• The JavaCC grammar can have embedded action code written in Java,
  just like a Yacc grammar can have embedded action code written in C.
• The lookahead can be changed by writing LOOKAHEAD(...).
• The whole input is given in just one file (not two).
The JavaCC input format

One file:

• header
• token specifications for lexical analysis
• grammar

Example of a token specification:

    TOKEN :
    {
      < INTEGER_LITERAL: ( ["1"-"9"] (["0"-"9"])* | "0" ) >
    }

Example of a production:

    void StatementListReturn() :
    {}
    {
      ( Statement() )* "return" Expression() ";"
    }

Generating a parser with JavaCC

    javacc fortran.jj    // generates a parser with a specified name
    javac Main.java      // Main.java contains a call of the parser
    java Main < prog.f   // parses the program prog.f

The Visitor Pattern

For object-oriented programming, the Visitor pattern enables

• the definition of a new operation on an object structure
• without changing the classes of the objects.

Gamma, Helm, Johnson, Vlissides: Design Patterns, 1995.
Sneak Preview

When using the Visitor pattern,

• the set of classes must be fixed in advance, and
• each class must have an accept method.

First Approach: Instanceof and Type Casts

The running Java example: summing an integer list.

    interface List {}

    class Nil implements List {}

    class Cons implements List {
      int head;
      List tail;
    }

First Approach: Instanceof and Type Casts

    List l; // The List-object
    int sum = 0;
    boolean proceed = true;
    while (proceed) {
      if (l instanceof Nil)
        proceed = false;
      else if (l instanceof Cons) {
        sum = sum + ((Cons) l).head;
        l = ((Cons) l).tail;
      }
    }

Advantage: The code is written without touching the classes Nil and
Cons.

Drawback: The code constantly uses type casts and instanceof to
determine what class of object it is considering.

Second Approach: Dedicated Methods

The first approach is not object-oriented!

To access parts of an object, the classical approach is to use
dedicated methods which both access and act on the subobjects.

    interface List {
      int sum();
    }

We can now compute the sum of all components of a given List-object l
by writing l.sum().
Second Approach: Dedicated Methods

    class Nil implements List {
      public int sum() {
        return 0;
      }
    }

    class Cons implements List {
      int head;
      List tail;
      public int sum() {
        return head + tail.sum();
      }
    }

Advantage: The type casts and instanceof operations have disappeared,
and the code can be written in a systematic way.

Disadvantage: For each new operation on List-objects, new dedicated
methods have to be written, and all classes must be recompiled.

Third Approach: The Visitor Pattern

The Idea:

• Divide the code into an object structure and a Visitor (akin to
  Functional Programming!)
• Insert an accept method in each class. Each accept method takes a
  Visitor as argument.
• A Visitor contains a visit method for each class (overloading!). A
  visit method for a class C takes an argument of type C.

    interface List {
      void accept(Visitor v);
    }

    interface Visitor {
      void visit(Nil x);
      void visit(Cons x);
    }

Third Approach: The Visitor Pattern

The purpose of the accept methods is to invoke the visit method in the
Visitor which can handle the current object.

    class Nil implements List {
      public void accept(Visitor v) {
        v.visit(this);
      }
    }

    class Cons implements List {
      int head;
      List tail;
      public void accept(Visitor v) {
        v.visit(this);
      }
    }

The control flow goes back and forth between the visit methods in the
Visitor and the accept methods in the object structure.

    class SumVisitor implements Visitor {
      int sum = 0;
      public void visit(Nil x) {}
      public void visit(Cons x) {
        sum = sum + x.head;
        x.tail.accept(this);
      }
    }

    ...
    SumVisitor sv = new SumVisitor();
    l.accept(sv);
    System.out.println(sv.sum);

Notice: The visit methods describe both 1) actions, and 2) access of
subobjects.
Comparison

The Visitor pattern combines the advantages of the two other
approaches.

                                Frequent type casts?   Frequent recompilation?
    Instanceof and type casts   Yes                    No
    Dedicated methods           No                     Yes
    The Visitor pattern         No                     No

The advantage of Visitors: New methods without recompilation!

Requirement for using Visitors: All classes must have an accept method.

Tools that use the Visitor pattern: JJTree (from Sun Microsystems) and
the Java Tree Builder (from Purdue University), both frontends for The
Java Compiler Compiler from Sun Microsystems.

Visitors: Summary

• Visitor makes adding new operations easy. Simply write a new
  visitor.
• A visitor gathers related operations. It also separates unrelated
  ones.
• Adding new classes to the object structure is hard. Key
  consideration: are you most likely to change the algorithm applied
  over an object structure, or are you most likely to change the
  classes of objects that make up the structure?
• Visitors can accumulate state.
• Visitor can break encapsulation. Visitor's approach assumes that the
  interface of the data structure classes is powerful enough to let
  visitors do their job. As a result, the pattern often forces you to
  provide public operations that access internal state, which may
  compromise its encapsulation.

The Java Tree Builder

• The Java Tree Builder (JTB) has been developed here at Purdue in my
  group.
• JTB is a frontend for The Java Compiler Compiler.
• JTB supports the building of syntax trees which can be traversed
  using visitors.
• JTB transforms a bare JavaCC grammar into three components:
  – a JavaCC grammar with embedded Java code for building a syntax
    tree;
  – one class for every form of syntax tree node; and
  – a default visitor which can do a depth-first traversal of a syntax
    tree.

The Java Tree Builder

The produced JavaCC grammar can then be processed by the Java Compiler
Compiler to give a parser which produces syntax trees.

The produced syntax trees can now be traversed by a Java program by
writing subclasses of the default visitor.

    [figure: a JavaCC grammar is fed through JTB, producing (1) a
     JavaCC grammar with embedded Java code, which the Java Compiler
     Compiler turns into a parser, (2) syntax-tree-node classes with
     accept methods, and (3) a default visitor; running the parser on a
     program yields a syntax tree]
Example (simplified)

For example, consider the Java 1.1 production:

    void Assignment() : {}
    { PrimaryExpression() AssignmentOperator() Expression() }

JTB produces:

    Assignment Assignment () :
    { PrimaryExpression n0;
      AssignmentOperator n1;
      Expression n2; {} }
    {
      n0=PrimaryExpression()
      n1=AssignmentOperator()
      n2=Expression()
      { return new Assignment(n0,n1,n2); }
    }

Notice that the production returns a syntax tree represented as an
Assignment object.

Using JTB

    jtb fortran.jj       // generates jtb.out.jj
    javacc jtb.out.jj    // generates a parser with a specified name
    javac Main.java      // Main.java contains calls of the parser and visitors
    java Main < prog.f   // builds a syntax tree for prog.f and executes the visitors

Example (simplified)

JTB produces a syntax-tree-node class for Assignment:

    public class Assignment implements Node {
      PrimaryExpression f0;
      AssignmentOperator f1;
      Expression f2;

      public Assignment(PrimaryExpression n0,
                        AssignmentOperator n1,
                        Expression n2) {
        f0 = n0; f1 = n1; f2 = n2;
      }

      public void accept(visitor.Visitor v) {
        v.visit(this);
      }
    }

Notice the accept method; it invokes the visit method for Assignment in
the default visitor.

Example (simplified)

The default visitor looks like this:

    public class DepthFirstVisitor implements Visitor {
      ...
      //
      // f0 -> PrimaryExpression()
      // f1 -> AssignmentOperator()
      // f2 -> Expression()
      //
      public void visit(Assignment n) {
        n.f0.accept(this);
        n.f1.accept(this);
        n.f2.accept(this);
      }
    }

Notice the body of the visit method which visits each of the three
subtrees of the Assignment node.
Example (simplified)

Here is an example of a program which operates on syntax trees for Java
1.1 programs. The program prints the right-hand side of every
assignment. The entire program is six lines:

    public class VprintAssignRHS extends DepthFirstVisitor {
      void visit(Assignment n) {
        VPrettyPrinter v = new VPrettyPrinter();
        n.f2.accept(v); v.out.println();
        n.f2.accept(this);
      }
    }

When this visitor is passed to the root of the syntax tree, the
depth-first traversal will begin, and when Assignment nodes are
reached, the method visit in VprintAssignRHS is executed.

Notice the use of VPrettyPrinter. It is a visitor which pretty prints
Java 1.1 programs.

JTB is bootstrapped.

Chapter 6: Semantic Analysis

Semantic Analysis

The compilation process is driven by the syntactic structure of the
program as discovered by the parser.

Semantic routines:

• interpret meaning of the program based on its syntactic structure
• two purposes:
  – finish analysis by deriving context-sensitive information
  – begin synthesis by generating the IR or target code
• associated with individual productions of a context free grammar or
  subtrees of a syntax tree

Context-sensitive analysis

What context-sensitive questions might the compiler ask?

1. Is x scalar, an array, or a function?
2. Is x declared before it is used?
3. Are any names declared but not used?
4. Which declaration of x does this reference?
5. Is an expression type-consistent?
6. Does the dimension of a reference match the declaration?
7. Where can x be stored? (heap, stack, ...)
8. Does *p reference the result of a malloc()?
9. Is x defined before it is used?
10. Is an array reference in bounds?
11. Does function foo produce a constant value?
12. Can p be implemented as a memo-function?

These cannot be answered with a context-free grammar.
Context-sensitive analysis

Why is context-sensitive analysis hard?

• answers depend on values, not syntax
• questions and answers involve non-local information
• answers may involve computation

Several alternatives:

    abstract syntax tree     specify non-local computations
    (attribute grammars)     automatic evaluators

    symbol tables            central store for facts
                             express checking code

    language design          simplify language
                             avoid problems

Symbol tables

For compile-time efficiency, compilers often use a symbol table:
associates lexical names (symbols) with their attributes.

What items should be entered?

• variable names
• defined constants
• procedure and function names
• literal constants and strings
• source text labels
• compiler-generated temporaries (we'll get there)

Separate table for structure layouts (types) (field offsets and
lengths).

A symbol table is a compile-time structure.

Symbol table information

What kind of information might the compiler need?

• textual name
• data type
• dimension information (for aggregates)
• declaring procedure
• lexical level of declaration
• storage class (base address)
• offset in storage
• if record, pointer to structure table
• if parameter, by-reference or by-value?
• can it be aliased? to what other names?
• number and type of arguments to functions

Nested scopes: block-structured symbol tables

What information is needed?

• when we ask about a name, we want the most recent declaration
• the declaration may be from the current scope or some enclosing
  scope
• innermost scope overrides declarations from outer scopes

Key point: new declarations (usually) occur only in current scope

What operations do we need? (a sketch of one implementation follows
below)

• void put(Symbol key, Object value) – binds key to value
• Object get(Symbol key) – returns value bound to key
• void beginScope() – remembers current state of table
• void endScope() – restores table to state at most recent scope that
  has not been ended

May need to preserve list of locals for the debugger.
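
A minimal sketch of one way to implement these four operations, using a
hash table of binding stacks plus an undo stack (the approach taken in
Appel's Modern Compiler Implementation in Java); the names here are
illustrative, with Strings standing in for interned symbols:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashMap;
    import java.util.Map;

    class SymbolTable {
      // current bindings; shadowed ones sit deeper in each per-name stack
      private final Map<String, Deque<Object>> bindings = new HashMap<>();
      private final Deque<String> undo = new ArrayDeque<>(); // names bound since beginScope
      private static final String MARK = "<scope-mark>";     // sentinel; assumes no
                                                             // identifier has this name

      public void put(String key, Object value) {
        bindings.computeIfAbsent(key, k -> new ArrayDeque<>()).push(value);
        undo.push(key);
      }

      public Object get(String key) {               // most recent declaration wins
        Deque<Object> stack = bindings.get(key);
        return (stack == null || stack.isEmpty()) ? null : stack.peek();
      }

      public void beginScope() { undo.push(MARK); }

      public void endScope() {                      // pop all bindings of this scope
        while (!undo.isEmpty()) {
          String key = undo.pop();
          if (key.equals(MARK)) break;
          bindings.get(key).pop();
        }
      }
    }
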
Attribute information

Attributes are internal representation of declarations.

Symbol table associates names with attributes.

Names may have different attributes depending on their meaning:

• variables: type, procedure level, frame offset
• types: type descriptor, data size/alignment
• constants: type, value
• procedures: formals (names/types), result type, block information
  (local decls.), frame size

Type expressions

Type expressions are a textual representation for types:

1. basic types: boolean, char, integer, real, etc.
2. type names
3. constructed types (constructors applied to type expressions):
   (a) array(I, T) denotes array of elements type T, index type I
       e.g., array(1..10, integer)
   (b) T1 × T2 denotes Cartesian product of type expressions T1 and T2
   (c) records: fields have names
       e.g., record((a × integer), (b × real))
   (d) pointer(T) denotes the type "pointer to object of type T"
   (e) D → R denotes type of function mapping domain D to range R
       e.g., integer × integer → integer

Type descriptors

Type descriptors are compile-time structures representing type
expressions.

e.g., char × char → pointer(integer):

    [figure: expression tree for char × char → pointer(integer), drawn
     once as a tree and once as a DAG with the two char leaves shared]

Type compatibility

Type checking needs to determine type equivalence.

Two approaches:

Name equivalence: each type name is a distinct type.

Structural equivalence: two types are equivalent iff. they have the
same structure (after substituting type expressions for type names):

    s ≡ t                           iff. s and t are the same basic types
    array(s1, s2) ≡ array(t1, t2)   iff. s1 ≡ t1 and s2 ≡ t2
    s1 × s2 ≡ t1 × t2               iff. s1 ≡ t1 and s2 ≡ t2
    pointer(s) ≡ pointer(t)         iff. s ≡ t
    s1 → s2 ≡ t1 → t2               iff. s1 ≡ t1 and s2 ≡ t2
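
These rules translate directly into a recursive comparison over type
descriptors. A minimal Java sketch, assuming one descriptor class per
constructor (the class names are illustrative, and acyclic descriptors
are assumed — the recursive types discussed below would need a cycle
check):

    abstract class Type {}
    class Basic   extends Type { String name;      Basic(String n)         { name = n; } }
    class Array   extends Type { Type index, elem; Array(Type i, Type e)   { index = i; elem = e; } }
    class Product extends Type { Type left, right; Product(Type l, Type r) { left = l; right = r; } }
    class Pointer extends Type { Type to;          Pointer(Type t)         { to = t; } }
    class Arrow   extends Type { Type dom, rng;    Arrow(Type d, Type r)   { dom = d; rng = r; } }

    class StructuralEquiv {
      static boolean equiv(Type s, Type t) {
        if (s instanceof Basic && t instanceof Basic)     // same basic type?
          return ((Basic) s).name.equals(((Basic) t).name);
        if (s instanceof Array && t instanceof Array)
          return equiv(((Array) s).index, ((Array) t).index)
              && equiv(((Array) s).elem,  ((Array) t).elem);
        if (s instanceof Product && t instanceof Product)
          return equiv(((Product) s).left,  ((Product) t).left)
              && equiv(((Product) s).right, ((Product) t).right);
        if (s instanceof Pointer && t instanceof Pointer)
          return equiv(((Pointer) s).to, ((Pointer) t).to);
        if (s instanceof Arrow && t instanceof Arrow)
          return equiv(((Arrow) s).dom, ((Arrow) t).dom)
              && equiv(((Arrow) s).rng, ((Arrow) t).rng);
        return false;                                     // different constructors
      }
    }
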
Type compatibility: example

Consider:

    type link = ^cell;
    var  next : link;
         last : link;
         p    : ^cell;
         q, r : ^cell;

Under name equivalence:

• next and last have the same type
• p, q and r have the same type
• p and next have different type

Under structural equivalence all variables have the same type.

Ada/Pascal/Modula-2/Tiger are somewhat confusing: they treat distinct
type definitions as distinct types, so

• p has different type from q and r

Type compatibility: Pascal-style name equivalence

Build compile-time structure called a type graph:

• each constructor or basic type creates a node
• each name creates a leaf (associated with the type's descriptor)

    [figure: next and last share the leaf for link, which points to a
     pointer(cell) node; p has its own pointer(cell) node; q and r
     share a third pointer(cell) node]

Type expressions are equivalent if they are represented by the same
node in the graph.

Type compatibility: recursive types

Consider:

    type link = ^cell;
         cell = record
                  info : integer;
                  next : link;
                end;

We may want to eliminate the names from the type graph.

Eliminating name link from type graph for record cell:

    [figure: record node for cell with fields info → integer and
     next → pointer, where the pointer node still refers to the name
     cell]

Type compatibility: recursive types

Allowing cycles in the type graph eliminates cell:

    [figure: the next field's pointer node now points back to the
     record node itself, forming a cycle]
Chapter 7: Translation and Simplification

Tiger IR trees: Expressions

    CONST(i)              Integer constant i

    NAME(n)               Symbolic constant n [a code label]

    TEMP(t)               Temporary t [one of any number of
                          "registers"]

    BINOP(o, e1, e2)      Application of binary operator o:
                            PLUS, MINUS, MUL, DIV
                            AND, OR, XOR [bitwise logical]
                            LSHIFT, RSHIFT [logical shifts]
                            ARSHIFT [arithmetic right-shift]
                          to integer operands e1 (evaluated first) and
                          e2 (evaluated second)

    MEM(e)                Contents of a word of memory starting at
                          address e

    CALL(f, e1, ..., en)  Procedure call; expression f is evaluated
                          before arguments e1, ..., en

    ESEQ(s, e)            Expression sequence; evaluate s for
                          side-effects, then e for result

Tiger IR trees: Statements

    MOVE(TEMP t, e)         Evaluate e into temporary t

    MOVE(MEM(e1), e2)       Evaluate e1 yielding address a, e2 into
                            word at a

    EXP(e)                  Evaluate e and discard result

    JUMP(e, l1, ..., ln)    Transfer control to address e; l1, ..., ln
                            are all possible values for e

    CJUMP(o, e1, e2, t, f)  Evaluate e1 then e2, yielding a and b,
                            respectively; compare a with b using
                            relational operator o:
                              EQ, NE [signed and unsigned integers]
                              LT, GT, LE, GE [signed]
                              ULT, ULE, UGT, UGE [unsigned]
                            jump to t if true, f if false

    SEQ(s1, s2)             Statement s1 followed by s2

    LABEL(n)                Define constant value of name n as current
                            code address; NAME(n) can be used as target
                            of jumps, calls, etc.

Kinds of expressions

Expression kinds indicate "how expression might be used":

    Ex(exp)                expressions that compute a value
    Nx(stm)                statements: expressions that compute no
                           value
    Cx                     conditionals (jump to true and false
                           destinations):
                             RelCx(op, left, right)
                             IfThenElseExp — expression/statement
                                             depending on use

Conversion operators allow use of one form in context of another:

    unEx        convert to tree expression that computes value of inner
                tree
    unNx        convert to tree statement that computes inner tree but
                returns no value
    unCx(t, f)  convert to statement that evaluates inner tree and
                branches to true destination if non-zero, false
                destination otherwise
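
In Appel's Java formulation these kinds become an abstract class with
the three conversion methods. The sketch below follows that shape but
is only illustrative: the tree-node classes are simplified stand-ins
for the IR constructors above, and labels are plain Strings for
brevity:

    abstract class TExp {}                        // IR tree expressions (stand-in)
    abstract class TStm {}                        // IR tree statements (stand-in)
    class CONST extends TExp { int v;  CONST(int v)  { this.v = v; } }
    class EXPstm extends TStm { TExp e; EXPstm(TExp e) { this.e = e; } }
    class CJUMPstm extends TStm {
      static final int NE = 0;
      int op; TExp l, r; String t, f;
      CJUMPstm(int op, TExp l, TExp r, String t, String f) {
        this.op = op; this.l = l; this.r = r; this.t = t; this.f = f;
      }
    }

    abstract class Exp {
      abstract TExp unEx();                       // use as a value
      abstract TStm unNx();                       // use for side effects only
      abstract TStm unCx(String t, String f);     // branch to label t or f
    }

    class Ex extends Exp {
      TExp exp;
      Ex(TExp e) { exp = e; }
      TExp unEx() { return exp; }
      TStm unNx() { return new EXPstm(exp); }     // evaluate and discard
      TStm unCx(String t, String f) {             // non-zero means true
        return new CJUMPstm(CJUMPstm.NE, exp, new CONST(0), t, f);
      }
    }

    class Nx extends Exp {
      TStm stm;
      Nx(TStm s) { stm = s; }
      TExp unEx() { throw new Error("Nx used as a value"); }
      TStm unNx() { return stm; }
      TStm unCx(String t, String f) { throw new Error("Nx used as a conditional"); }
    }

A Cx subclass would implement unCx directly (e.g., RelCx emits the
CJUMP shown under Comparisons below) and derive unEx via the
CONST 0 / CONST 1 pattern shown there.
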
Translating Tiger

Simple variables: fetch with a MEM:

    Ex(MEM(+(TEMP fp, CONST k)))

where fp is home frame of variable, found by following static links; k
is offset of variable in that level.

Tiger array variables: Tiger arrays are pointers to array base, so
fetch with a MEM like any other variable:

    Ex(MEM(+(TEMP fp, CONST k)))

Thus, for e[i]:

    Ex(MEM(+(e.unEx, ×(i.unEx, CONST w))))

i is index expression and w is word size – all values are word-sized
(scalar) in Tiger.

Note: must first check array index 0 ≤ i < size(e); runtime will put
size in word preceding array base.

Tiger record variables: Again, records are pointers to record base, so
fetch like other variables. For e.f:

    Ex(MEM(+(e.unEx, CONST o)))

where o is the byte offset of the field f in the record.

Note: must check record pointer is non-nil (i.e., non-zero).

String literals: Statically allocated, so just use the string's label:

    Ex(NAME(label))

where the literal will be emitted as, e.g.:

    label:  .word 11
            .ascii "hello world"

Record creation: T{f1 = e1, f2 = e2, ..., fn = en}: in the (preferably
GC'd) heap, first allocate the space then initialize it:

    Ex(ESEQ(SEQ(MOVE(TEMP r, externalCall("allocRecord", [CONST n])),
            SEQ(MOVE(MEM(TEMP r), e1.unEx),
            SEQ(...,
                MOVE(MEM(+(TEMP r, CONST (n−1)w)), en.unEx)))),
        TEMP r))

where w is the word size.

Array creation: t[e1] of e2:

    Ex(externalCall("initArray", [e1.unEx, e2.unEx]))

Control structures

Basic blocks:

• a sequence of straight-line code
• if one instruction executes then they all execute
• a maximal sequence of instructions without branches
• a label starts a new basic block

Overview of control structure translation:

• control flow links up the basic blocks
• ideas are simple
• implementation requires bookkeeping
• some care is needed for good code

while loops

while c do s:

1. evaluate c
2. if false jump to next statement after loop
3. if true fall into loop body
4. branch to top of loop

e.g.,

    test:
        if not(c) jump done
        s
        jump test
    done:

    Nx(SEQ(SEQ(SEQ(LABEL test, c.unCx(body, done)),
           SEQ(SEQ(LABEL body, s.unNx), JUMP(NAME test))),
       LABEL done))

repeat e1 until e2 ⇒ evaluate/compare/branch at bottom of loop
for loops

for i := e1 to e2 do s

1. evaluate lower bound into index variable
2. evaluate upper bound into limit variable
3. if index > limit jump to next statement after loop
4. fall through to loop body
5. increment index
6. if index ≤ limit jump to top of loop body

        t1 := e1
        t2 := e2
        if t1 > t2 jump done
    body:
        s
        t1 := t1 + 1
        if t1 ≤ t2 jump body
    done:

For break statements (a sketch of this bookkeeping appears after the
next slide):

• when translating a loop push the done label on some stack
• break simply jumps to label on top of stack
• when done translating loop and its body, pop the label

Function calls

f(e1, ..., en):

    Ex(CALL(NAME label_f, [sl, e1, ..., en]))

where sl is the static link for the callee f, found by following n
static links from the caller, n being the difference between the levels
of the caller and the callee.
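
A minimal sketch of the break-label bookkeeping described above; the
class and method names are illustrative, and the translator is reduced
to just the label stack:

    import java.util.ArrayDeque;
    import java.util.Deque;

    class LoopTranslator {
      // "done" labels of the loops being translated, innermost on top
      private final Deque<String> breakLabels = new ArrayDeque<>();

      void translateLoop(Runnable translateBody, String doneLabel) {
        breakLabels.push(doneLabel);   // entering the loop
        translateBody.run();           // body may translate nested loops/breaks
        breakLabels.pop();             // leaving the loop
      }

      String breakTarget() {           // caller emits JUMP(NAME target)
        if (breakLabels.isEmpty()) throw new Error("break outside any loop");
        return breakLabels.peek();
      }
    }
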

Comparisons

Translate a op b as:

    RelCx(op, a.unEx, b.unEx)

When used as a conditional, unCx(t, f) yields:

    CJUMP(op, a.unEx, b.unEx, t, f)

where t and f are labels.

When used as a value, unEx yields:

    ESEQ(SEQ(MOVE(TEMP r, CONST 1),
             SEQ(unCx(t, f),
                 SEQ(LABEL f,
                     SEQ(MOVE(TEMP r, CONST 0), LABEL t)))),
         TEMP r)

Conditionals

The short-circuiting Boolean operators have already been transformed
into if-expressions in Tiger abstract syntax:

e.g., x < 5 & a > b turns into if x < 5 then a > b else 0

Translate if e1 then e2 else e3 into: IfThenElseExp(e1, e2, e3)

When used as a value, unEx yields:

    ESEQ(SEQ(SEQ(e1.unCx(t, f),
                 SEQ(SEQ(LABEL t,
                         SEQ(MOVE(TEMP r, e2.unEx),
                             JUMP join)),
                     SEQ(LABEL f,
                         SEQ(MOVE(TEMP r, e3.unEx),
                             JUMP join)))),
             LABEL join),
         TEMP r)

As a conditional, unCx(t, f) yields:

    SEQ(e1.unCx(tt, ff),
        SEQ(SEQ(LABEL tt, e2.unCx(t, f)),
            SEQ(LABEL ff, e3.unCx(t, f))))
Conditionals: Example

Applying unCx(t, f) to if x < 5 then a > b else 0:

    SEQ(CJUMP(LT, x.unEx, CONST 5, tt, ff),
        SEQ(SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)),
            SEQ(LABEL ff, JUMP f)))

or more optimally:

    SEQ(CJUMP(LT, x.unEx, CONST 5, tt, f),
        SEQ(LABEL tt, CJUMP(GT, a.unEx, b.unEx, t, f)))

One-dimensional fixed arrays

    var A : array [2..5] of integer;
    ...
    A[e]

translates to:

    MEM(+(TEMP fp, +(CONST k − 2w, ×(CONST w, e.unEx))))

where k is offset of static array from fp, w is word size.

In Pascal, multidimensional arrays are treated as arrays of arrays, so
A[i,j] is equivalent to A[i][j], so can translate as above.

Multidimensional arrays

Array allocation:

• constant bounds
  – allocate in static area, stack, or heap
  – no run-time descriptor is needed
• dynamic arrays: bounds fixed at run-time
  – allocate in stack or heap
  – descriptor is needed
• dynamic arrays: bounds can change at run-time
  – allocate in heap
  – descriptor is needed

Multidimensional arrays

Array layout:

• Contiguous:
  1. Row major
     Rightmost subscript varies most quickly:
         A[1,1], A[1,2], ...
         A[2,1], A[2,2], ...
     Used in PL/1, Algol, Pascal, C, Ada, Modula-3
  2. Column major
     Leftmost subscript varies most quickly:
         A[1,1], A[2,1], ...
         A[1,2], A[2,2], ...
     Used in FORTRAN
• By vectors
  Contiguous vector of pointers to (non-contiguous) subarrays
Multi-dimensional arrays: row-major layout

    A : array [L1..U1] of array [L2..U2] of ... array [Ln..Un] of elt

no. of elt's in dimension j:

    Dj = Uj − Lj + 1

position of A[i1, ..., in]:

      (in − Ln)
    + (in−1 − Ln−1)·Dn
    + (in−2 − Ln−2)·Dn·Dn−1
    + ...
    + (i1 − L1)·D2·...·Dn

which can be rewritten as

    variable part:
        i1·D2·...·Dn + i2·D3·...·Dn + ... + in−1·Dn + in
    constant part:
        L1·D2·...·Dn + L2·D3·...·Dn + ... + Ln−1·Dn + Ln

address of A[i1, ..., in]:

    address(A) + ((variable part − constant part) × element size)

case statements

case E of V1: S1 ... Vn: Sn end

1. evaluate the expression
2. find value in case list equal to value of expression
3. execute statement associated with value found
4. jump to next statement after case

Key issue: finding the right case

• sequence of conditional jumps (small case set)
  O(|cases|)
• binary search of an ordered jump table (sparse case set)
  O(log2 |cases|)
• hash table (dense case set)
  O(1)
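
As a quick check of the address formula, take a hypothetical array
A : array [1..2] of array [1..3] of integer with 4-byte elements, so
D2 = 3. For A[2,3]:

    variable part = i1·D2 + i2 = 2·3 + 3 = 9
    constant part = L1·D2 + L2 = 1·3 + 1 = 4
    address(A[2,3]) = address(A) + (9 − 4) × 4 = address(A) + 20

In row-major order A[2,3] is the sixth element, so its offset should
indeed be 5 × 4 = 20.
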

case statements

case E of V1: S1 ... Vn: Sn end

One translation approach:

        t := expr
        jump test
    L1: code for S1
        jump next
    L2: code for S2
        jump next
        ...
    Ln: code for Sn
        jump next
    test:
        if t = V1 jump L1
        if t = V2 jump L2
        ...
        if t = Vn jump Ln
        code to raise run-time exception
    next:

Simplification

• Goal 1: No SEQ or ESEQ.
• Goal 2: CALL can only be subtree of EXP(...) or MOVE(TEMP t, ...).

Transformations:

• lift ESEQs up tree until they can become SEQs
• turn SEQs into linear list

    ESEQ(s1, ESEQ(s2, e))             = ESEQ(SEQ(s1, s2), e)
    BINOP(op, ESEQ(s, e1), e2)        = ESEQ(s, BINOP(op, e1, e2))
    MEM(ESEQ(s, e1))                  = ESEQ(s, MEM(e1))
    JUMP(ESEQ(s, e1))                 = SEQ(s, JUMP(e1))
    CJUMP(op, ESEQ(s, e1), e2, l1, l2)
                                      = SEQ(s, CJUMP(op, e1, e2, l1, l2))
    BINOP(op, e1, ESEQ(s, e2))        = ESEQ(MOVE(TEMP t, e1),
                                             ESEQ(s, BINOP(op, TEMP t, e2)))
    CJUMP(op, e1, ESEQ(s, e2), l1, l2)
                                      = SEQ(MOVE(TEMP t, e1),
                                            SEQ(s, CJUMP(op, TEMP t, e2, l1, l2)))
    MOVE(ESEQ(s, e1), e2)             = SEQ(s, MOVE(e1, e2))
    CALL(f, a)                        = ESEQ(MOVE(TEMP t, CALL(f, a)), TEMP t)
Chapter 8: Liveness Analysis and Register Allocation

Register allocation

    IR → instruction selection → register allocation → machine code
                                        ↓
                                      errors

Register allocation:

• have value in a register when used
• limited resources
• changes instruction choices
• can move loads and stores
• optimal allocation is difficult
  – NP-complete for k ≥ 1 registers

Liveness analysis

Problem:

• IR contains an unbounded number of temporaries
• machine has bounded number of registers

Approach:

• temporaries with disjoint live ranges can map to same register
• if not enough registers then spill some temporaries (i.e., keep them
  in memory)

The compiler must perform liveness analysis for each temporary:

    It is live if it holds a value that may be needed in future

Control flow analysis

Before performing liveness analysis, need to understand the control
flow by building a control flow graph (CFG):

• nodes may be individual program statements or basic blocks
• edges represent potential flow of control

Out-edges from node n lead to successor nodes, succ(n)

In-edges to node n come from predecessor nodes, pred(n)

Example:

        a := 0
    L1: b := a + 1
        c := c + b
        a := b × 2
        if a < N goto L1
        return c
Liveness analysis

Gathering liveness information is a form of data flow analysis
operating over the CFG:

• liveness of variables "flows" around the edges of the graph
• assignments define a variable, v:
  – def(v) = set of graph nodes that define v
  – def(n) = set of variables defined by n
• occurrences of v in expressions use it:
  – use(v) = set of nodes that use v
  – use(n) = set of variables used in n

Liveness: v is live on edge e if there is a directed path from e to a
use of v that does not pass through any def(v)

• v is live-in at node n if live on any of n's in-edges
• v is live-out at n if live on any of n's out-edges
• v ∈ use(n) ⇒ v live-in at n
• v live-in at n ⇒ v live-out at all m ∈ pred(n)
• v live-out at n, v ∉ def(n) ⇒ v live-in at n

Liveness analysis

Define:

    in(n):  variables live-in at n
    out(n): variables live-out at n

Then:

    out(n) = ∪ { in(s) | s ∈ succ(n) }
    succ(n) = ∅ ⇒ out(n) = ∅

Note:

    in(n) ⊇ use(n)
    in(n) ⊇ out(n) − def(n)

use(n) and def(n) are constant (independent of control flow)

Now, v ∈ in(n) iff. v ∈ use(n) or v ∈ out(n) − def(n)

Thus, in(n) = use(n) ∪ (out(n) − def(n))
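
Applying these equations by hand to the six-statement example CFG above
(node 5, the conditional branch, has successors 2 and 6) reaches the
following fixed point; this worked solution is added here purely as an
illustration:

    node              use    def   in     out
    1  a := 0                a     c      a,c
    2  b := a + 1     a      b     a,c    b,c
    3  c := c + b     b,c    c     b,c    b,c
    4  a := b × 2     b      a     b,c    a,c
    5  if a < N       a            a,c    a,c
    6  return c       c            c      ∅

Note that c is live-in at node 1: it is used (at node 3) before it is
ever defined, so it must be live on entry to the whole fragment.
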

Iterative solution for liveness

    foreach n { in(n) := ∅; out(n) := ∅ }
    repeat
      foreach n
        in′(n) := in(n); out′(n) := out(n)
        in(n)  := use(n) ∪ (out(n) − def(n))
        out(n) := ∪ { in(s) | s ∈ succ(n) }
    until in′(n) = in(n) and out′(n) = out(n) for all n

Notes:

• should order computation of inner loop to follow the "flow"
• liveness flows backward along control-flow arcs, from out to in
• nodes can just as easily be basic blocks to reduce CFG size
• could do one variable at a time, from uses back to defs, noting
  liveness along the way

Iterative solution for liveness

Complexity: for input program of size N

• ≤ N nodes in CFG
  ⇒ ≤ N variables
  ⇒ N elements per in/out
  ⇒ O(N) time per set-union
• for loop performs constant number of set operations per node
  ⇒ O(N²) time for for loop
• each iteration of repeat loop can only add to each set
  (sets can contain at most every variable)
  ⇒ sizes of all in and out sets sum to 2N²,
    bounding the number of iterations of the repeat loop
• worst-case complexity of O(N⁴)
• ordering can cut repeat loop down to 2-3 iterations
  ⇒ O(N) or O(N²) in practice
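
A minimal Java sketch of the iterative algorithm, with variables
numbered 0..V−1 and one BitSet per node; the CFG arrays are assumed to
be built elsewhere:

    import java.util.BitSet;

    class Liveness {
      int n;                  // number of CFG nodes
      int[][] succ;           // succ[i] = successor node indices
      BitSet[] use, def;      // per-node use/def sets over variables
      BitSet[] in, out;

      void solve() {
        in = new BitSet[n]; out = new BitSet[n];
        for (int i = 0; i < n; i++) { in[i] = new BitSet(); out[i] = new BitSet(); }
        boolean changed = true;
        while (changed) {                           // repeat until fixed point
          changed = false;
          for (int i = n - 1; i >= 0; i--) {        // reverse order follows the flow
            BitSet newOut = new BitSet();
            for (int s : succ[i]) newOut.or(in[s]); // out(n) = union of in(s)
            BitSet newIn = (BitSet) newOut.clone();
            newIn.andNot(def[i]);                   // out(n) − def(n)
            newIn.or(use[i]);                       // in(n) = use(n) ∪ ...
            if (!newIn.equals(in[i]) || !newOut.equals(out[i])) changed = true;
            in[i] = newIn; out[i] = newOut;
          }
        }
      }
    }
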
Iterative solution for liveness

Least fixed points

There is often more than one solution for a given dataflow problem (see
example).

Any solution to dataflow equations is a conservative approximation:

• v has some later use downstream from n ⇒ v ∈ out(n)
• but not the converse

Conservatively assuming a variable is live does not break the program;
it just means more registers may be needed.

Assuming a variable is dead when it is really live will break things.

May be many possible solutions but want the "smallest": the least
fixpoint.

The iterative liveness computation computes this least fixpoint.

Register allocation by simplification

1. Build interference graph G: for each program point
   (a) compute set of temporaries simultaneously live
   (b) add edge to graph for each pair in set
2. Simplify: color graph using a simple heuristic
   (a) suppose G has node m with degree < K
   (b) if G′ = G − {m} can be colored then so can G, since nodes
       adjacent to m have at most K − 1 colors
   (c) each such simplification will reduce degree of remaining nodes,
       leading to more opportunity for simplification
   (d) leads to recursive coloring algorithm
3. Spill: suppose no node of degree < K
   (a) target some node (temporary) for spilling (optimistically,
       spilling node will allow coloring of remaining nodes)
   (b) remove and continue simplifying

Register allocation by simplification (continued)

4. Select: assign colors to nodes
   (a) start with empty graph
   (b) if adding non-spill node there must be a color for it, as that
       was the basis for its removal
   (c) if adding a spill node and no color available (neighbors already
       K-colored) then mark as an actual spill
   (d) repeat select
5. Start over: if select has no actual spills then finished, otherwise
   (a) rewrite program to fetch actual spills before each use and store
       after each definition
   (b) recalculate liveness and repeat
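
A compact Java sketch of the simplify and select phases over an
adjacency-set interference graph; coalescing and the spill path are
omitted, and the graph representation is illustrative:

    import java.util.*;

    class ColorBySimplification {
      Map<Integer, Set<Integer>> adj;   // adj.get(v) = neighbors of temporary v
      int K;                            // number of machine registers

      Map<Integer, Integer> color() {   // null if a spill would be needed
        Map<Integer, Set<Integer>> g = copy(adj);
        Deque<Integer> stack = new ArrayDeque<>();
        while (!g.isEmpty()) {          // simplify: remove low-degree nodes
          Integer m = null;
          for (Integer v : g.keySet())
            if (g.get(v).size() < K) { m = v; break; }
          if (m == null) return null;   // stuck: would have to pick a spill
          for (Integer w : g.get(m)) g.get(w).remove(m);
          g.remove(m);
          stack.push(m);
        }
        Map<Integer, Integer> color = new HashMap<>();
        while (!stack.isEmpty()) {      // select: rebuild, assigning colors
          int v = stack.pop();
          BitSet used = new BitSet(K);
          for (Integer w : adj.get(v))
            if (color.containsKey(w)) used.set(color.get(w));
          color.put(v, used.nextClearBit(0)); // < K, by the simplify invariant
        }
        return color;
      }

      private static Map<Integer, Set<Integer>> copy(Map<Integer, Set<Integer>> g) {
        Map<Integer, Set<Integer>> c = new HashMap<>();
        for (Map.Entry<Integer, Set<Integer>> e : g.entrySet())
          c.put(e.getKey(), new HashSet<>(e.getValue()));
        return c;
      }
    }
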
Coalescing

• Can delete a move instruction when source s and destination d do not
  interfere:
  – coalesce them into a new node whose edges are the union of those
    of s and d
• In principle, any pair of non-interfering nodes can be coalesced
  – unfortunately, the union is more constrained and the new graph may
    no longer be K-colorable
  – overly aggressive

Simplification with aggressive coalescing

    [flow diagram: build → simplify ⇄ coalesce (each applied to any
     candidate, in any order, until neither applies) → spill (any node,
     then back to simplify) → select → done]

Conservative coalescing

Apply tests for coalescing that preserve colorability.

Suppose a and b are candidates for coalescing into node ab.

Briggs: coalesce only if ab has < K neighbors of significant degree
(i.e., degree ≥ K)

• simplify will first remove all insignificant-degree neighbors
• ab will then be adjacent to < K neighbors
• simplify can then remove ab

George: coalesce only if all significant-degree neighbors of a already
interfere with b

• simplify can remove all insignificant-degree neighbors of a
• remaining significant-degree neighbors of a already interfere with
  b, so coalescing does not increase the degree of any node

(A sketch of the Briggs test follows below.)

Iterated register coalescing

Interleave simplification with coalescing to eliminate most moves
while avoiding extra spills:

1. Build interference graph G; distinguish move-related from
   non-move-related nodes
2. Simplify: remove non-move-related nodes of low degree one at a time
3. Coalesce: conservatively coalesce move-related nodes
   • remove associated move instruction
   • if resulting node is non-move-related it can now be simplified
   • repeat simplify and coalesce until only significant-degree or
     uncoalesced moves remain
4. Freeze: if unable to simplify or coalesce
   (a) look for move-related node of low degree
   (b) freeze its associated moves (give up hope of coalescing them)
   (c) now treat it as non-move-related and resume iteration of
       simplify and coalesce
5. Spill: if no low-degree nodes
   (a) select candidate for spilling
   (b) remove to stack and continue simplifying
6. Select: pop stack assigning colors (including actual spills)
7. Start over: if select has no actual spills then finished, otherwise
   (a) rewrite code to fetch actual spills before each use and store
       after each definition
   (b) recalculate liveness and repeat
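
A sketch of the Briggs test on the same adjacency-set representation
used above; one subtlety worth a comment is that a neighbor of both a
and b loses a degree when the two are merged:

    import java.util.HashSet;
    import java.util.Map;
    import java.util.Set;

    class BriggsTest {
      // Safe to coalesce a and b if the combined node ab would have
      // fewer than K neighbors of significant degree (>= K).
      static boolean canCoalesce(Map<Integer, Set<Integer>> adj,
                                 int a, int b, int K) {
        Set<Integer> neighbors = new HashSet<>(adj.get(a));
        neighbors.addAll(adj.get(b));
        neighbors.remove(a);
        neighbors.remove(b);
        int significant = 0;
        for (Integer w : neighbors) {
          int degree = adj.get(w).size();
          // w adjacent to both a and b: merging lowers w's degree by one
          if (adj.get(w).contains(a) && adj.get(w).contains(b)) degree--;
          if (degree >= K) significant++;
        }
        return significant < K;
      }
    }
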
Spilling

• Spills require repeating build and simplify on the whole program
• To avoid increasing the number of spills in future rounds of build,
  can simply discard coalescences
• Alternatively, preserve coalescences from before the first potential
  spill, discard those after that point
• Move-related spilled temporaries can be aggressively coalesced,
  since (unlike registers) there is no limit on the number of
  stack-frame locations

Iterated register coalescing

    [flow diagram: (optional SSA constant propagation) → build →
     simplify ⇄ conservative coalesce ⇄ freeze → potential spill →
     select → if there are actual spills, insert spill code and start
     over at build; otherwise done]

Precolored nodes

Precolored nodes correspond to machine registers (e.g., stack pointer,
arguments, return address, return value)

• select and coalesce can give an ordinary temporary the same color as
  a precolored register, if they don't interfere
• e.g., argument registers can be reused inside procedures for a
  temporary
• simplify, freeze and spill cannot be performed on them
• also, precolored nodes interfere with other precolored nodes

So, treat precolored nodes as having infinite degree.

This also avoids needing to store large adjacency lists for precolored
nodes; coalescing can use the George criterion.

Temporary copies of machine registers

Since precolored nodes don't spill, their live ranges must be kept
short:

1. use move instructions
2. move callee-save registers to fresh temporaries on procedure entry,
   and back on exit, spilling between as necessary
3. register pressure will spill the fresh temporaries as necessary;
   otherwise they can be coalesced with their precolored counterpart
   and the moves deleted
Caller-save and callee-save registers

Variables whose live ranges span calls should go to callee-save
registers, otherwise to caller-save.

This is easy for graph coloring allocation with spilling:

• calls interfere with caller-save registers
• a cross-call variable interferes with all precolored caller-save
  registers, as well as with the fresh temporaries created for
  callee-save copies, forcing a spill
• choose nodes with high degree but few uses, to spill the fresh
  callee-save temporary instead of the cross-call variable
• this makes the original callee-save register available for coloring
  the cross-call variable

Example

    enter:  c := r3
            a := r1
            b := r2
            d := 0
            e := a
    loop:   d := d + b
            e := e − 1
            if e > 0 goto loop
            r1 := d
            r3 := c
            return          [r1, r3 live out]

Temporaries are a, b, c, d, e.

Assume target machine with K = 3 registers: r1, r2
(caller-save/argument/result), r3 (callee-save).

The code generator has already made arrangements to save r3 explicitly
by copying into temporary c and back again.

Example (cont.)

Interference graph:

    [figure: interference graph over r1, r2, r3 and a, b, c, d, e;
     dashed move edges connect c–r3, a–r1, b–r2, e–a and d–r1]

• No opportunity for simplify or freeze (all non-precolored nodes have
  significant degree ≥ K)
• Any coalesce will produce a new node adjacent to ≥ K
  significant-degree nodes
• Must spill based on priorities
  (priority = (uses+defs outside loop + 10 × uses+defs inside loop)
   / degree):

    Node   uses+defs      uses+defs     degree   priority
           outside loop   inside loop
    a      2              0             4        0.50
    b      1              1             4        2.75
    c      2              0             6        0.33
    d      2              2             4        5.50
    e      1              3             3        10.33

Node c has lowest priority so spill it.

Example (cont.)

Interference graph with c removed:

    [figure: same graph without c]

• Only possibility is to coalesce a and e: ae will have < K
  significant-degree neighbors (after coalescing d will be low-degree,
  though high-degree before)

    [figure: graph with a and e coalesced into ae]
Example (cont.)

• Can now coalesce b with r2 (or coalesce ae and r1):

    [figure: graph after coalescing b and r2]

• Coalescing ae and r1 (could also coalesce b with r2):

    [figure: graph after coalescing ae and r1]

Example (cont.)

• Cannot coalesce r1ae with d because the move is constrained: the
  nodes interfere. Must simplify d:

    [figure: graph with d removed]

• Graph now has only precolored nodes, so pop nodes from stack,
  coloring along the way
  – d gets r3
  – a, b, e have colors by coalescing
  – c must spill since no color can be found for it
• Introduce new temporaries c1 and c2 for each use/def, add loads
  before each use and stores after each def

Example (cont.)

The rewritten program with spill code:

    enter:  c1 := r3
            M[c_loc] := c1
            a := r1
            b := r2
            d := 0
            e := a
    loop:   d := d + b
            e := e − 1
            if e > 0 goto loop
            r1 := d
            c2 := M[c_loc]
            r3 := c2
            return          [r1, r3 live out]

Example (cont.)

New interference graph:

    [figure: interference graph with c replaced by c1 and c2]

Coalesce c1 with r3, then c2 with r3:

    [figure: graph with combined node r3c1c2]

As before, coalesce a with e, then b with r2:

    [figure: graph with r1, r2b, r3c1c2, ae and d]
Example (cont.)

As before, coalesce ae with r1 and simplify d:

    [figure: only r1ae, r2b and r3c1c2 remain]

Pop d from stack: select r3. All other nodes were coalesced or
precolored. So, the coloring is:

    a → r1
    b → r2
    c → r3
    d → r3
    e → r1

Example (cont.)

Rewrite the program with this assignment:

    enter:  r3 := r3
            M[c_loc] := r3
            r1 := r1
            r2 := r2
            r3 := 0
            r1 := r1
    loop:   r3 := r3 + r2
            r1 := r1 − 1
            if r1 > 0 goto loop
            r1 := r3
            r3 := M[c_loc]
            r3 := r3
            return

Example (cont.)

Delete moves with source and destination the same (coalesced):

    enter:  M[c_loc] := r3
            r3 := 0
    loop:   r3 := r3 + r2
            r1 := r1 − 1
            if r1 > 0 goto loop
            r1 := r3
            r3 := M[c_loc]
            return

One uncoalesced move remains (r1 := r3, whose source and destination
interfere).

Chapter 9: Activation Records
The procedure abstraction

Separate compilation:

• allows us to build large programs
• keeps compile times reasonable
• requires independent procedures

The linkage convention:

• a social contract
• machine dependent
• division of responsibility

The linkage convention ensures that procedures inherit a valid run-time
environment and that they restore one for their parents.

• Linkages execute at run time
• Code to make the linkage is generated at compile time

The procedure abstraction

The essentials:

• on entry, establish p's environment
• at a call, preserve p's environment
• on exit, tear down p's environment
• in between, addressability and proper lifetimes

    [figure: procedure P and procedure Q side by side, each with a
     prologue and an epilogue; P's pre-call and post-call sequences
     bracket the call into Q]

Each system has a standard linkage.

Procedure linkages

Assume that each procedure activation has an associated activation
record or frame (at run time).

Assumptions:

• RISC architecture
• can always expand an allocated block
• locals stored in frame

    [figure: frame layout — in the previous (caller's) frame, at higher
     addresses, the incoming arguments (argument n, ..., argument 2,
     argument 1); the frame pointer marks the current frame, which
     holds local variables, return address, temporaries and saved
     registers; below them the outgoing arguments (argument m, ...,
     argument 2, argument 1) end at the stack pointer, where the next
     frame begins; lower addresses toward the bottom]

Procedure linkages

The linkage divides responsibility between caller and callee:

            Caller                         Callee

    Call    pre-call                       prologue
            1. allocate basic frame        1. save registers, state
            2. evaluate & store params.    2. store FP (dynamic link)
            3. store return address        3. set new FP
            4. jump to child               4. store static link
                                           5. extend basic frame
                                              (for local data)
                                           6. initialize locals
                                           7. fall through to code

    Return  post-call                      epilogue
            1. copy return value           1. store return value
            2. deallocate basic frame      2. restore state
            3. restore parameters          3. cut back to basic frame
               (if copy out)               4. restore parent's FP
                                           5. jump to return address

At compile time, generate the code to do this.

At run time, that code manipulates the frame & data areas.
Run-time storage organization

To maintain the illusion of procedures, the compiler can adopt some
conventions to govern memory use.

Code space

• fixed size
• statically allocated (link time)

Data space

• fixed-sized data may be statically allocated
• variable-sized data must be dynamically allocated
• some data is dynamically allocated in code

Control stack

• dynamic slice of activation tree
• return addresses
• may be implemented in hardware

Run-time storage organization

Typical memory layout:

    high address
        stack
          ↓
        free memory
          ↑
        heap
        static data
        code
    low address

The classical scheme:

• allows both stack and heap maximal freedom
• code and static data may be separate or intermingled

Run-time storage organization

Where do local variables go?

When can we allocate them on a stack?

Key issue is lifetime of local names.

Downward exposure:

• called procedures may reference my variables
• dynamic scoping
• lexical scoping

Upward exposure:

• can I return a reference to my variables?
• functions that return functions
• continuation-passing style

With only downward exposure, the compiler can allocate the frames on
the run-time call stack.

Storage classes

Each variable must be assigned a storage class (base address).

Static variables:

• addresses compiled into code (relocatable)
• (usually) allocated at compile-time
• limited to fixed size objects
• control access with naming scheme

Global variables:

• almost identical to static variables
• layout may be important (exposed)
• naming scheme ensures universal access

Link editor must handle duplicate definitions.
Storage classes (cont.)

Procedure local variables

Put them on the stack —

• if sizes are fixed
• if lifetimes are limited
• if values are not preserved

Dynamically allocated variables

Must be treated differently —

• call-by-reference, pointers, lead to non-local lifetimes
• (usually) an explicit allocation
• explicit or implicit deallocation

Access to non-local data

How does the code find non-local data at run-time?

Real globals

• visible everywhere
• naming convention gives an address
• initialization requires cooperation

Lexical nesting

• view variables as (level, offset) pairs (compile-time)
• chain of non-local access links
• more expensive to find (at run-time)

Access to non-local data

Two important problems arise:

• How do we map a name into a (level, offset) pair?
  Use a block-structured symbol table (remember last lecture?)
  – look up a name, want its most recent declaration
  – declaration may be at current level or any lower level
• Given a (level, offset) pair, what's the address?
  Two classic approaches:
  – access links (or static links)
  – displays

Access to non-local data

To find the value specified by (l, o):

• need current procedure level, k
• k = l ⇒ local value
• k > l ⇒ find l's activation record
• k < l cannot occur

Maintaining access links (static links):

• calling level k + 1 procedure:
  1. pass my FP as access link
  2. my backward chain will work for lower levels
• calling procedure at level l ≤ k:
  1. find link to level l − 1 and pass it
  2. its access link will work for lower levels
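
A minimal sketch of the (l, o) address computation by chasing access
links; mem(), the flat memory array, and STATIC_LINK_OFFSET are
hypothetical stand-ins for the target machine's load and the link's
slot in the frame:

    class AccessLinks {
      static int[] memory = new int[1 << 20];     // toy flat memory
      static final int STATIC_LINK_OFFSET = 0;    // assumed slot of the access link

      static int mem(int addr) { return memory[addr]; }

      // Address of the variable at (level l, offset o), starting from the
      // frame pointer fp of the current procedure, which is at level k.
      static int addressOf(int l, int o, int k, int fp) {
        if (k < l) throw new Error("cannot occur: k < l");
        while (k > l) {                           // follow k − l access links
          fp = mem(fp + STATIC_LINK_OFFSET);
          k--;
        }
        return fp + o;                            // k == l: local to that frame
      }
    }
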
The display

To improve run-time access costs, use a display:

• table of access links for lower levels
• lookup is index from known offset
• takes slight amount of time at call
• a single display or one per frame
• for level k procedure, need k − 1 slots

Access with the display: assume a value described by (l, o)

• find slot as display[l]
• add offset to pointer from slot (display[l][o])

"Setting up the basic frame" now includes display manipulation.

Calls: Saving and restoring registers

                      caller's registers   callee's registers   all registers
    callee saves      1                    3                    5
    caller saves      2                    4                    6

1. Call includes bitmap of caller's registers to be saved/restored
   (best with save/restore instructions to interpret bitmap directly)
2. Caller saves and restores its own registers
   Unstructured returns (e.g., non-local gotos, exceptions) create some
   problems, since code to restore must be located and executed
3. Backpatch code to save registers used in callee on entry, restore on
   exit
   e.g., VAX places bitmap in callee's stack frame for use on
   call/return/non-local goto/exception
   Non-local gotos and exceptions must unwind dynamic chain restoring
   callee-saved registers
4. Bitmap in callee's stack frame is used by caller to save/restore
   (best with save/restore instructions to interpret bitmap directly)
   Unwind dynamic chain as for 3
5. Easy
   Non-local gotos and exceptions must restore all registers from
   "outermost callee"
6. Easy (use utility routine to keep calls compact)
   Non-local gotos and exceptions need only restore original registers
   from caller

Top-left is best: saves fewer registers, compact calling sequences.

Call/return

Assuming callee saves:

1. caller pushes space for return value
2. caller pushes SP
3. caller pushes space for: return address, static chain, saved
   registers
4. caller evaluates and pushes actuals onto stack
5. caller sets return address, callee's static chain, performs call
6. callee saves registers in register-save area
7. callee copies by-value arrays/records using addresses passed as
   actuals
8. callee allocates dynamic arrays as needed
9. on return, callee restores saved registers
10. jumps to return address

Caller must allocate much of stack frame, because it computes the
actual parameters.

Alternative is to put actuals below callee's stack frame in caller's:
common when hardware supports stack management (e.g., VAX).

MIPS procedure call convention

Registers:

    Number   Name     Usage
    0        zero     Constant 0
    1        at       Reserved for assembler
    2, 3     v0, v1   Expression evaluation, scalar function results
    4–7      a0–a3    First 4 scalar arguments
    8–15     t0–t7    Temporaries, caller-saved; caller must save to
                      preserve across calls
    16–23    s0–s7    Callee-saved; must be preserved across calls
    24, 25   t8, t9   Temporaries, caller-saved; caller must save to
                      preserve across calls
    26, 27   k0, k1   Reserved for OS kernel
    28       gp       Pointer to global area
    29       sp       Stack pointer
    30       s8 (fp)  Callee-saved; must be preserved across calls
    31       ra       Expression evaluation, pass return address in
                      calls
MIPS procedure call convention

Philosophy:

Use full, general calling sequence only when necessary; omit portions
of it where possible (e.g., avoid using fp register whenever possible).

Classify routines as:

• non-leaf routines: routines that call other routines
• leaf routines: routines that do not themselves call other routines
  – leaf routines that require stack storage for locals
  – leaf routines that do not require stack storage for locals

MIPS procedure call convention

The stack frame:

    high memory
        argument n
        ...
        argument 1
    ← virtual frame pointer ($fp)
        static link              ─┐ frame offset
        locals                    │
        saved $ra                 │ framesize
        temporaries               │
        other saved registers     │
        argument build           ─┘
    ← stack pointer ($sp)
    low memory

MIPS procedure call convention

Pre-call:

1. Pass arguments: use registers a0 ... a3; remaining arguments are
   pushed on the stack along with save space for a0 ... a3
2. Save caller-saved registers if necessary
3. Execute a jal instruction: jumps to target address (callee's first
   instruction), saves return address in register ra

MIPS procedure call convention

Prologue:

1. Leaf procedures that use the stack and non-leaf procedures:
   (a) Allocate all stack space needed by routine:
       • local variables
       • saved registers
       • sufficient space for arguments to routines called by this
         routine

           subu $sp, framesize

   (b) Save registers (ra, etc.):

           sw $31, framesize+frameoffset($sp)
           sw $17, framesize+frameoffset-4($sp)
           sw $16, framesize+frameoffset-8($sp)

       where framesize and frameoffset (usually negative) are
       compile-time constants
2. Emit code for routine
MIPS procedure call convention

Epilogue:

1. Copy return values into result registers (if not already there)
2. Restore saved registers:

       lw $16, framesize+frameoffset-8($sp)
       lw $17, framesize+frameoffset-4($sp)

3. Get return address:

       lw $31, framesize+frameoffset($sp)

4. Clean up stack:

       addu $sp, framesize

5. Return:

       j $31
