Extract
Modern Compiler Implementation in Java
Basic Techniques
ANDREW W. APPEL
Princeton University
© Andrew W. Appel, 1997
A catalog record for this book is available from the British Library
Contents
Preface
1 Introduction
1.1 Modules and interfaces
1.2 Tools and software
1.3 Data structures for tree languages
2 Lexical Analysis
2.1 Lexical tokens
2.2 Regular expressions
2.3 Finite automata
2.4 Nondeterministic finite automata
2.5 JavaLex: a lexical analyzer generator
3 Parsing
3.1 Context-free grammars
3.2 Predictive parsing
3.3 LR parsing
3.4 Using parser generators
4 Abstract Syntax
4.1 Semantic actions
4.2 Abstract parse trees
5 Semantic Analysis
5.1 Symbol tables
5.2 Bindings for the Tiger compiler
5.3 Type-checking expressions
Bibliography
Index
Preface
Over the past decade, there have been several shifts in the way compilers are
built. New kinds of programming languages are being used: object-oriented
languages with dynamic methods, and functional languages with nested scope
and first-class function closures; many of these languages require garbage
collection. New machines have large register sets and a high penalty for
memory access, and can often run much faster with compiler assistance in
scheduling instructions and managing instructions and data for cache locality.
This book is intended as a textbook for a one-semester or two-quarter course
in compilers. Students will see the theory behind different components of a
compiler, the programming techniques used to put the theory into practice,
and the interfaces used to modularize the compiler. To make the interfaces and
programming examples clear and concrete, I have written them in the Java
programming language. Other editions of this book are available that use the
C and ML languages.
The “student project compiler” that I have outlined is reasonably simple,
but is organized to demonstrate some important techniques that are now in
common use: abstract syntax trees to avoid tangling syntax and semantics,
separation of instruction selection from register allocation, sophisticated copy
propagation to allow greater flexibility to earlier phases of the compiler, and
careful containment of target-machine dependencies to one module.
This book, Modern Compiler Implementation in Java: Basic Techniques,
is the preliminary edition of a more complete book to be published in 1998,
entitled Modern Compiler Implementation in Java. That book will have a
more comprehensive set of exercises in each chapter, a “further reading”
discussion at the end of every chapter, and another dozen chapters on advanced
material not in this edition, such as parser error recovery, code-generator
generators, byte-code interpreters, static single-assignment form, instruction
http://www.cs.princeton.edu/~appel/modern/
There are also pencil and paper exercises in each chapter; those marked with a
star * are a bit more challenging, two-star problems are difficult but solvable,
and the occasional three-star exercises are not known to have a solution.
PART ONE
Fundamentals of
Compilation
1
Introduction
This book describes techniques, data structures, and algorithms for translating
programming languages into executable code. A modern compiler is often
organized into many phases, each operating on a different abstract “language.”
The chapters of this book follow the organization of a compiler, each covering
a successive phase.
To illustrate the issues in compiling real programming languages, I show
how to compile Tiger, a simple but nontrivial language of the Algol family,
with nested scope and heap-allocated records. Programming exercises in each
chapter call for the implementation of the corresponding phase; a student
who implements all the phases described in Part I of the book will have a
working compiler. Tiger is easily modified to be functional or object-oriented
(or both), and exercises in Part II show how to do this. Other chapters in Part
II cover advanced techniques in program optimization. Appendix A describes
the Tiger language.
The interfaces between modules of the compiler are almost as important as
the algorithms inside the modules. To describe the interfaces concretely, it is
useful to write them down in a real programming language. This book uses
Java – a simple object-oriented language. Java is safe, in that programs cannot
circumvent the type system to violate abstractions; and it has garbage
collection, which greatly simplifies the management of dynamic storage allocation.
CHAPTER ONE. INTRODUCTION
[Figure: phases of a compiler and the interfaces between them.
Phases: Lex, Parse, Parsing Actions, Semantic Analysis, Translate,
Canonicalize, Instruction Selection, Control Flow Analysis, Data Flow
Analysis, Register Allocation, Code Emission, Assembler, Linker.
Interfaces: Source Program, Tokens, Reductions, Abstract Syntax,
Environments, Tables, Translate, Frame, Frame Layout, IR Trees, Assem,
Flow Graph, Interference Graph, Assembly Language, Machine Language.]
Both of these properties are useful in writing compilers (and almost any kind
of software).
This is not a textbook on Java programming. Students using this book who
do not know Java already should pick it up as they go along, using a Java
programming book as a reference. Java is a small enough language, with
simple enough concepts, that this should not be difficult for students with
good programming skills in other languages.
1.2. TOOLS AND SOFTWARE
produces machine language, it suffices to replace just the Frame Layout and
Instruction Selection modules. To change the source language being compiled,
only the modules up through Translate need to be changed. The compiler
can be attached to a language-oriented syntax editor at the Abstract Syntax
interface.
The learning experience of coming to the right abstraction by several
iterations of think–implement–redesign is one that should not be missed. However,
the student trying to finish a compiler project in one semester does not have
this luxury. Therefore, I present in this book the outline of a project where the
abstractions and interfaces are carefully thought out, and are as elegant and
general as I am able to make them.
Some of the interfaces, such as Abstract Syntax, IR Trees, and Assem, take
the form of data structures: for example, the Parsing Actions phase builds an
Abstract Syntax data structure and passes it to the Semantic Analysis phase.
Other interfaces are abstract data types; the Translate interface is a set of
functions that the Semantic Analysis phase can call, and the Tokens interface
takes the form of a function that the Parser calls to get the next token of the
input program.
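The second kind of interface, a function that one phase calls on another, can be illustrated with a minimal sketch. Everything named here (Token, TokenSource, ListTokenSource, TokenCounter, and the "EOF" kind) is hypothetical and chosen only for illustration; the book's own Tokens interface is not shown in this extract and may look quite different.

```java
import java.util.Iterator;
import java.util.List;

// Hypothetical token type (an assumption, not the book's class).
class Token {
    final String kind, text;
    Token(String k, String t) { kind = k; text = t; }
}

// An interface in the "abstract data type" sense: the parser sees only
// this one function and nothing of how tokens are produced.
interface TokenSource {
    Token nextToken();   // yields a token of kind "EOF" once input is exhausted
}

// Stand-in for a lexer: replays a fixed list of tokens.
class ListTokenSource implements TokenSource {
    private final Iterator<Token> it;
    ListTokenSource(List<Token> tokens) { it = tokens.iterator(); }
    public Token nextToken() {
        return it.hasNext() ? it.next() : new Token("EOF", "");
    }
}

// Stand-in for a parser: pulls tokens one at a time until EOF.
class TokenCounter {
    static int count(TokenSource src) {
        int n = 0;
        while (!src.nextToken().kind.equals("EOF")) n++;
        return n;
    }
}
```

Because the consumer depends only on nextToken(), the list-backed source could be swapped for a generated lexer without touching the consumer.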
Two of the most useful abstractions used in modern compilers are context-free
grammars, for parsing, and regular expressions, for lexical analysis. To make
best use of these abstractions it is helpful to have special tools, such as Yacc
(which converts a grammar into a parsing program) and Lex (which converts
a declarative specification into a lexical analysis program). Fortunately, good
versions of these tools are available for Java, and the project described in this
book makes use of them.
The programming projects in this book can be compiled using Sun’s Java
1.3. DATA STRUCTURES FOR TREE LANGUAGES
http://www.cs.princeton.edu/~appel/modern/
Source code for some modules of the Tiger compiler, support code for some
of the programming exercises, example Tiger programs, and other useful files
are also available from the same Web address.
Skeleton source code for the programming assignments is available from
this Web page; the programming exercises in this book refer to this directory as
$TIGER/ when referring to specific subdirectories and files contained therein.
prints
8 7
80
Grammar    class
Stm        Stm
Exp        Exp
ExpList    ExpList
id         String
num        int
For each grammar rule, there is one constructor that belongs to the class
for its left-hand-side symbol. We simply extend the abstract class with a
“concrete” class for each grammar rule. The constructor (class) names are
indicated on the right-hand side of Grammar 1.3.
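The scheme just described, one abstract class per grammar symbol and one concrete class per rule, might be sketched as follows. The class names and the OpExp.Plus, OpExp.Minus, and OpExp.Times constants are the ones used in this chapter; the field names and the integer encoding of the operators are assumptions made for this sketch.

```java
// One abstract class per grammar symbol.
abstract class Stm {}
abstract class Exp {}
abstract class ExpList {}

// One concrete class per grammar rule; the constructor only stores its
// arguments. Field names are assumptions.
class CompoundStm extends Stm {
    final Stm stm1, stm2;
    CompoundStm(Stm s1, Stm s2) { stm1 = s1; stm2 = s2; }
}
class AssignStm extends Stm {
    final String id; final Exp exp;
    AssignStm(String i, Exp e) { id = i; exp = e; }
}
class PrintStm extends Stm {
    final ExpList exps;
    PrintStm(ExpList e) { exps = e; }
}
class IdExp extends Exp {
    final String id;
    IdExp(String i) { id = i; }
}
class NumExp extends Exp {
    final int num;
    NumExp(int n) { num = n; }
}
class OpExp extends Exp {
    // Integer codes for the operators (the encoding is an assumption).
    static final int Plus = 1, Minus = 2, Times = 3;
    final Exp left, right; final int oper;
    OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
}
class EseqExp extends Exp {
    final Stm stm; final Exp exp;
    EseqExp(Stm s, Exp e) { stm = s; exp = e; }
}
class PairExpList extends ExpList {
    final Exp head; final ExpList tail;
    PairExpList(Exp h, ExpList t) { head = h; tail = t; }
}
class LastExpList extends ExpList {
    final Exp head;
    LastExpList(Exp h) { head = h; }
}
```

With these declarations, the statement a := 5 + 3 would be built as new AssignStm("a", new OpExp(new NumExp(5), OpExp.Plus, new NumExp(3))).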
[Figure: tree representation of the straight-line program below, built from
CompoundStm, AssignStm, PrintStm, OpExp, NumExp, IdExp, EseqExp, PairExpList,
and LastExpList nodes.]
a := 5 + 3 ; b := ( print ( a , a - 1 ) , 10 * a ) ; print ( b )
PROGRAMMING EXERCISE
Stm prog =
  new CompoundStm(new AssignStm("a",
                      new OpExp(new NumExp(5),
                                OpExp.Plus, new NumExp(3))),
    new CompoundStm(new AssignStm("b",
        new EseqExp(new PrintStm(new PairExpList(new IdExp("a"),
                        new LastExpList(new OpExp(new IdExp("a"),
                                                  OpExp.Minus, new NumExp(1))))),
                    new OpExp(new NumExp(10), OpExp.Times,
                              new IdExp("a")))),
      new PrintStm(new LastExpList(new IdExp("b")))));
Files with the data type declarations for the trees, and this sample program,
are available in the directory $TIGER/chap1.
Writing interpreters without side effects (that is, assignment statements that
update variables and data structures) is a good introduction to denotational
semantics and attribute grammars, which are methods for describing what
programming languages do. It’s often a useful technique in writing compilers,
too; compilers are also in the business of saying what programming languages
do.
Therefore, in implementing these programs, never assign a new value to
any variable or object-field except when it is initialized. For local variables,
use the initializing form of declaration (for example, int i=j+3;) and for
each class, make a constructor function (like the CompoundStm constructor
in Program 1.5).
1. Write a Java function int maxargs(Stm s) that tells the maximum number
of arguments of any print statement within any subexpression of a given
statement. For example, maxargs(prog) is 2.
2. Write a Java function void interp(Stm s) that “interprets” a program
in this language. To write in a “functional programming” style – in which
you never use an assignment statement – initialize each local variable as you
declare it.
Your functions that examine each Exp will have to use instanceof to
determine which subclass the expression belongs to and then cast to the proper
subclass. Or you can add methods to the Exp and Stm classes to avoid the use
of instanceof.
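The two styles mentioned here can be compared on a deliberately small example. The sketch below uses trimmed-down copies of NumExp and OpExp, redeclared so that it stands alone, and a hypothetical size function (not part of the exercise) that counts the nodes of an expression tree; field names are assumptions.

```java
abstract class Exp {
    abstract int size();   // style 2: each subclass supplies its own case
}
class NumExp extends Exp {
    final int num;
    NumExp(int n) { num = n; }
    int size() { return 1; }
}
class OpExp extends Exp {
    static final int Plus = 1;
    final Exp left, right; final int oper;
    OpExp(Exp l, int o, Exp r) { left = l; oper = o; right = r; }
    int size() { return 1 + left.size() + right.size(); }
}

class Sizes {
    // Style 1: test the subclass with instanceof, then cast to reach its fields.
    static int size(Exp e) {
        if (e instanceof NumExp) return 1;
        if (e instanceof OpExp) {
            OpExp o = (OpExp) e;
            return 1 + size(o.left) + size(o.right);
        }
        throw new IllegalArgumentException("unknown Exp subclass");
    }
}
```

The instanceof version keeps the whole function in one place; the method version spreads it across the classes, and the compiler then reports any subclass that forgets to implement it.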
For part 1, remember that print statements can contain expressions that
contain other print statements.
For part 2, make two mutually recursive functions interpStm and interpExp.
Represent a “table,” mapping identifiers to the integer values assigned
to them, as a list of id × int pairs.
class Table {
String id; int value; Table tail;
Table(String i, int v, Table t) {id=i; value=v; tail=t;}
}
The function interpStm takes a table t1 as argument and produces the new
table t2 that’s just like t1 except that some identifiers map to different
integers as a result of the statement.
For example, the table t1 that maps a to 3 and maps c to 4, which we write
{a ↦ 3, c ↦ 4} in mathematical notation, could be represented as the linked
list (a, 3) → (c, 4).
Now, let the table t2 be just like t1, except that it maps c to 7 instead of 4.
Mathematically, we could write
t2 = update(t1, c, 7)
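One way to realize update on the linked-list representation, offered as a sketch under stated assumptions rather than as the book's own code, is to put a new cell at the head of the list and let lookup return the first match. The names Tables and lookup and the choice to throw on an unbound identifier are assumptions; the fields are declared final here to enforce the no-reassignment rule.

```java
class Table {
    final String id; final int value; final Table tail;
    Table(String i, int v, Table t) { id = i; value = v; tail = t; }
}

class Tables {
    // The new binding goes at the head; the old table is never modified.
    static Table update(Table t, String id, int value) {
        return new Table(id, value, t);
    }

    // The first match wins, so a newer binding shadows an older one.
    static int lookup(Table t, String id) {
        if (t == null) throw new IllegalArgumentException("unbound identifier: " + id);
        return t.id.equals(id) ? t.value : lookup(t.tail, id);
    }
}
```

With t1 built as new Table("a", 3, new Table("c", 4, null)) and t2 as Tables.update(t1, "c", 7), looking up c gives 7 in t2 while t1 still gives 4.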
without doing any side effects in the interpreter itself. (The print statements
will be accomplished by interpreter side effects, however.) The solution is to
declare interpExp as
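The declaration itself is missing from this extract. One shape that would fit the surrounding text, assuming the expression interpreter must hand back both an integer result and a possibly updated table, is a small pair class. The names IntAndTable, Interp, and lookup are assumptions, and only two expression cases are shown; the rest belong to the exercise.

```java
abstract class Exp {}
class NumExp extends Exp { final int num; NumExp(int n) { num = n; } }
class IdExp extends Exp { final String id; IdExp(String i) { id = i; } }

class Table {
    final String id; final int value; final Table tail;
    Table(String i, int v, Table t) { id = i; value = v; tail = t; }
}

// Hypothetical result pair: the value of an expression plus the table
// in effect after evaluating it.
class IntAndTable {
    final int i; final Table t;
    IntAndTable(int i, Table t) { this.i = i; this.t = t; }
}

class Interp {
    static IntAndTable interpExp(Exp e, Table t) {
        if (e instanceof NumExp) return new IntAndTable(((NumExp) e).num, t);
        if (e instanceof IdExp) return new IntAndTable(lookup(t, ((IdExp) e).id), t);
        throw new UnsupportedOperationException("remaining cases are left to the exercise");
    }

    // First match wins, as in a table where newer bindings sit at the head.
    static int lookup(Table t, String id) {
        if (t == null) throw new IllegalArgumentException("unbound identifier: " + id);
        return t.id.equals(id) ? t.value : lookup(t.tail, id);
    }
}
```

Returning the table alongside the value lets an expression that contains a statement pass its bindings on without any assignment to shared state.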
EXERCISES
1.1 This simple program implements persistent functional binary search trees, so
that if tree2=insert(x,tree1), then tree1 is still available for lookups
even while tree2 can be used.
a. Implement a member function that returns true if the item is found, else
false.
b. Extend the program to include not just membership, but the mapping of
keys to bindings:
Tree insert(String key, Object binding, Tree t);
Object lookup(String key, Tree t);
c. These trees are not balanced; demonstrate the behavior on the following
two sequences of insertions:
(a) t s p i p f b s t
(b) a b c d e f g h i
*d. Research balanced search trees in Sedgewick [1988] and recommend
a balanced-tree data structure for functional symbol tables. (Hint: to
preserve a functional style, the algorithm should be one that rebalances
on insertion but not on lookup.)
e. Rewrite in an object-oriented (but still “functional”) style, so that
insertion is now t.insert(key) instead of insert(key,t). Hint: you’ll
need an EmptyTree subclass.
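The program that Exercise 1.1 starts from is not reproduced in this extract. A minimal sketch of a persistent binary search tree in the same spirit, with field names, the Trees wrapper class, and the handling of duplicate keys all being assumptions, could look like this:

```java
class Tree {
    final Tree left; final String key; final Tree right;
    Tree(Tree l, String k, Tree r) { left = l; key = k; right = r; }
}

class Trees {
    // Returns a new tree containing key. Only the nodes on the path from the
    // root to the insertion point are copied; the rest is shared with t.
    static Tree insert(String key, Tree t) {
        if (t == null) return new Tree(null, key, null);
        int c = key.compareTo(t.key);
        if (c < 0) return new Tree(insert(key, t.left), t.key, t.right);
        if (c > 0) return new Tree(t.left, t.key, insert(key, t.right));
        return t;   // key already present: reuse the existing tree
    }
}
```

Since no existing node is ever modified, a tree1 passed to insert remains valid for lookups after tree2 has been built from it.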