Compiler Design - Theory Tools and Examples PDF
5-1-2017
Recommended Citation
Bergmann, Seth D., "Compiler Design: Theory, Tools, and Examples" (2017). Open Educational Resources.
1.
https://rdw.rowan.edu/oer/1
This Book is brought to you for free and open access by the University Libraries at Rowan Digital Works. It has been
accepted for inclusion in Open Educational Resources by an authorized administrator of Rowan Digital Works. For
more information, please contact [email protected].
Compiler Design: Theory, Tools, and Examples
Seth D. Bergmann
Contents

Preface

1 Introduction
  1.1 What is a Compiler?
    1.1.1 Exercises
  1.2 The Phases of a Compiler
    1.2.1 Lexical Analysis (Scanner) - Finding the Word Boundaries
    1.2.2 Syntax Analysis Phase
    1.2.3 Global Optimization
    1.2.4 Code Generation
    1.2.5 Local Optimization
    1.2.6 Exercises
  1.3 Implementation Techniques
    1.3.1 Bootstrapping
    1.3.2 Cross Compiling
    1.3.3 Compiling To Intermediate Form
    1.3.4 Compiler-Compilers
    1.3.5 Exercises
  1.4 Case Study: Decaf
  1.5 Chapter Summary

2 Lexical Analysis
  2.0 Formal Languages
    2.0.1 Language Elements
    2.0.2 Finite State Machines
    2.0.3 Regular Expressions
    2.0.4 Exercises
  2.1 Lexical Tokens
    2.1.1 Exercises
  2.2 Implementation with Finite State Machines
    2.2.1 Examples of Finite State Machines for Lexical Analysis
    2.2.2 Actions for Finite State Machines
    2.2.3 Exercises
  2.3 Lexical Tables

3 Syntax Analysis
  3.0 Grammars, Languages, and Pushdown Machines
    3.0.1 Grammars
    3.0.2 Classes of Grammars
    3.0.3 Context-Free Grammars
    3.0.4 Pushdown Machines
    3.0.5 Correspondence Between Machines and Classes of Languages
    3.0.6 Exercises
  3.1 Ambiguities in Programming Languages
    3.1.1 Exercises
  3.2 The Parsing Problem
  3.3 Summary

7 Optimization
  7.1 Introduction and View of Optimization
    7.1.1 Exercises
  7.2 Global Optimization
    7.2.1 Basic Blocks and DAGs
    7.2.2 Other Global Optimization Techniques
    7.2.3 Exercises
  7.3 Local Optimization
    7.3.1 Exercises
  7.4 Chapter Summary

Glossary
Bibliography
Index
Preface
Secondary Authors
This book is the result of an attempt to launch a series of open source textbooks.
Source files are available at
http://cs.rowan.edu/~bergmann/books
Contributor List
If you have a suggestion or correction, please send email to [email protected].
If I make a change based on your feedback, I will add you to the contributor
list (unless you ask to be omitted).
If you include at least part of the sentence the error appears in, that makes
it easy for me to search. Page and section numbers are fine, too, but not quite
as easy to work with.
If you wish to rewrite a section or chapter, it would be a good idea to notify
me before starting on it. Major rewrites can qualify for “secondary author”
status.
Chapter 1
Introduction
Recently the phrase user interface has received much attention in the computer
industry. A user interface is the mechanism through which the user of a device
communicates with the device. Since digital computers are programmed using
a complex system of binary codes and memory addresses, we have developed
sophisticated user interfaces, called programming languages, which enable us to
specify computations in ways that seem more natural. This book will describe
the implementation of this kind of interface, the rationale being that even if
you never need to design or implement a programming language, the lessons
learned here will still be valuable to you. You will be a better programmer as a
result of understanding how programming languages are implemented, and you
will have a greater appreciation for programming languages. In addition, the
techniques which are presented here can be used in the construction of other
user interfaces, such as the query language for a database management system.
The output of a compiler is a machine language program which, if loaded into
memory and executed, would carry out the intended computation.
It is important to bear in mind that when processing a statement such as x = x
* 9; the compiler does not perform the multiplication. The compiler generates,
as output, a sequence of instructions, including a "multiply" instruction.
Languages which permit complex operations, such as the ones above, are
called high-level languages, or programming languages. A compiler accepts as
input a program written in a particular high-level language and produces as
output an equivalent program in machine language for a particular machine
called the target machine. We say that two programs are equivalent if they
always produce the same output when given the same input. The input program
is known as the source program, and its language is the source language. The
output program is known as the object program, and its language is the object
language. A compiler translates source language programs into equivalent object
language programs. Some examples of compilers are:
A Java compiler for the Apple Macintosh
A COBOL compiler for the SUN
A C++ compiler for the Apple Macintosh
If a portion of the input to a Java compiler looked like this:
a = b + c ∗ d;
the output corresponding to this input might look something like this (a
sketch in the style of the later sample problems, as the original listing is
missing here):

LOD r1,c
MUL r1,d
STO r1,temp1 // temp1 = c * d
LOD r1,b
ADD r1,temp1
STO r1,temp2 // temp2 = b + c * d
MOV a,temp2 // a = b + c * d
The compiler must be smart enough to know that the multiplication should
be done before the addition even though the addition is read first when scanning
the input. The compiler must also be smart enough to know whether the input
is a correctly formed program (this is called checking for proper syntax), and to
issue helpful error messages if there are syntax errors.
Note the somewhat convoluted logic after the Test instruction in Sample
Problem 1.1.1. Why did it not simply branch to L3 if the condition code
indicated that the first operand (x) was greater than or equal to the second
operand (temp1), thus eliminating an unnecessary branch instruction and
label? Some compilers might actually do this, but the point is that even if
the architecture of the target machine permits it, many compilers will not
generate optimal code.
In designing a compiler, the primary concern is that the object program be
semantically equivalent to the source program (i.e. that they mean the same
thing, or produce the same output for a given input). Object program efficiency
is important, but not as important as correct code generation.
Solution:
L1: LOD r1,a // Load a into reg. 1
ADD r1,b // Add b to reg. 1
STO r1,temp1 // temp1 = a + b
CMP x,temp1 // Test for while condition
BL L2 // Continue with loop if x < temp1
B L3 // Terminate loop
L2: LOD r1,=’2’
MUL r1,x
STO r1,x // x = 2 * x
B L1 // Repeat loop
L3:
Input:
    a = 3;
    b = 4;
    println (a*b);

Interpreter output: 12

Figure 1.1: A Compiler and Interpreter produce very different output for the
same input
An interpreter also accepts as input a program written in a high-level
language, but rather than generating a machine language program, the
interpreter actually carries out the computations specified in the source
program.
In other words, the output of a compiler is a program, whereas the output of
an interpreter is the source program’s output. Figure 1.1 shows that although
the input may be identical, compilers and interpreters produce very different
output. Nevertheless, many of the techniques used in designing compilers are
also applicable to interpreters.
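The distinction can be sketched in a few lines of Java (an illustrative sketch, not code from this text): the "compiler" below builds a list of instructions for a hypothetical machine, with mnemonics modeled on the sample problems, and performs no arithmetic; the "interpreter" carries out the computation of Figure 1.1 directly.

```java
import java.util.List;

public class CompileVsInterpret {

    // "Compiler": emits instructions for a hypothetical target machine,
    // performing none of the arithmetic itself.
    public static List<String> compile() {
        return List.of(
            "LOD r1,a",    // load a into register 1
            "MUL r1,b",    // multiply register 1 by b
            "PUSH r1",     // parameter for println
            "CALL Print"); // print the result
    }

    // "Interpreter": carries out the computation and produces the
    // source program's output directly.
    public static int interpret(int a, int b) {
        return a * b;
    }

    public static void main(String[] args) {
        System.out.println("Compiler output:    " + compile());
        System.out.println("Interpreter output: " + interpret(3, 4)); // 12
    }
}
```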
Show the compiler output and the interpreter output for the following Java
source code:
for (i=1; i<=4; i++) System.out.println (i*3);
Solution:
Compiler Output
LOD r1,=’4’
STO r1,Temp1
MOV i,=’1’
L1: CMP i,Temp1
BH L2 // Terminate loop if i>Temp1
LOD r1,i
MUL r1,=’3’
STO r1,Temp2
PUSH Temp2 // Parameter for println
CALL Print // Print the result
B L1 // Repeat loop
L2:
Interpreter Output
3 6 9 12
Students are often confused about the difference between a compiler and an
interpreter. Many commercial compilers come packaged with a built-in edit-
compile-run front end. In effect, the student is not aware that after compilation
is finished, the object program must be loaded into memory and executed,
because this all happens automatically. As larger programs are needed to solve
more complex problems, programs are divided into manageable source modules,
each of which is compiled separately to an object module. The object modules
can then be linked to form a single, complete, machine language program. In
this mode, it is more clear that there is a distinction between compile time, the
time at which a source program is compiled, and run time, the time at which
the resulting object program is loaded and executed. Syntax errors are
reported by the compiler at compile time and are called compile-time errors.
Other kinds of errors, not generally detected by the compiler, are called
run-time errors.
(a) C(Mac: Java → Mac)   (b) C(Sun: Java → Mac)   (c) C(Ada: PC → Java)

Figure 1.2: Big C notation for compilers, rendered here as C(L: S → T): a
compiler written in language L which translates S to T. (a) A Java compiler
for the Mac. (b) A compiler which translates Java programs to Mac machine
language, and which runs on a Sun machine. (c) A compiler which translates
PC machine language programs to Java, written in Ada.
Using the big C notation of Figure 1.2, show each of the following
compilers:

1. An Ada compiler which runs on the PC and compiles to the PC
machine language.

2. An Ada compiler which compiles to the PC machine language, and which is
written in Ada.

3. An Ada compiler which compiles to the PC machine language, and which
runs on a Sun.
Solution:

(1) C(PC: Ada → PC)   (2) C(Ada: Ada → PC)   (3) C(Sun: Ada → PC)
1.1.1 Exercises
1. Show assembly language for a machine of your choice, corresponding to
each of the following Java statements:
(a) a = b + c;
(b) a = (b+c) * (c-d);
(c) for (i=1; i<=10; i++) a = a+i;
2. Show the difference between compiler output and interpreter output for
each of the following source inputs:
(c) a = 12;
b = 6;
while (b<a)
{ a = a-1;
println (a+b);
}
(a) a = b+c = 3;
(b) if (x<3) a = 2
else a = x;
4. Using the big C notation, show the symbol for each of the following:
(a) A compiler which translates COBOL source programs to PC machine
language and runs on a PC.
(b) A compiler, written in Java, which translates FORTRAN source programs
to Mac machine language.
(c) A compiler, written in Java, which translates Sun machine language
programs to Java.
Show the token classes, or “words”, put out by the lexical analysis
phase corresponding to this Java source input:
sum = sum + unit * /* accumulate sum */ 1.2e-12 ;
Solution:
identifier (sum)
assignment (=)
identifier (sum)
operator (+)
identifier (unit)
operator (*)
numeric constant (1.2e-12)
semicolon (;)
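A scanner producing exactly these token classes can be sketched with Java's regex library (my own sketch, not the implementation developed later in this text); white space and comments are consumed but produce no tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyScanner {
    // Alternatives are tried in order: white space and comments first,
    // then the token classes of the sample problem.
    static final Pattern TOKEN = Pattern.compile(
        "\\s+|/\\*.*?\\*/"                        // white space, comments
        + "|(?<num>\\d+\\.?\\d*([eE][-+]?\\d+)?)" // numeric constant
        + "|(?<id>[A-Za-z_]\\w*)"                 // identifier
        + "|(?<assign>=)"
        + "|(?<op>[-+*/])"
        + "|(?<semi>;)");

    public static List<String> scan(String src) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        while (m.find()) {
            if (m.group("num") != null)         out.add("numeric constant (" + m.group() + ")");
            else if (m.group("id") != null)     out.add("identifier (" + m.group() + ")");
            else if (m.group("assign") != null) out.add("assignment (=)");
            else if (m.group("op") != null)     out.add("operator (" + m.group() + ")");
            else if (m.group("semi") != null)   out.add("semicolon (;)");
        }
        return out;
    }

    public static void main(String[] args) {
        scan("sum = sum + unit * /* accumulate sum */ 1.2e-12 ;")
            .forEach(System.out::println);
    }
}
```

Run on the sample input, this prints the same eight token classes listed in the solution above.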
Solution:
(MULT, c, d, temp1)
(ADD, b, temp1, temp2)
(MOVE, temp2, a)
To implement transfer of control, we could use label atoms, which serve only
to mark a spot in the object program to which we might wish to branch in
implementing a control structure such as if or while. A label atom with the
name L1 would be (LBL,L1). We could use a jump atom for an unconditional
branch, and a test atom for a conditional branch: The atom (JMP, L1) would
be an unconditional branch to the label L1. The atom (TEST, a, <=, b, L2)
would be a conditional branch to the label L2, if a<=b is true.
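One possible Java representation of atoms (an assumed layout; the text does not prescribe one) stores the operation name and up to four operand fields, and prints only the fields an atom actually uses.

```java
import java.util.List;

public class Atom {
    public final String op, left, right, result, target;

    public Atom(String op, String left, String right,
                String result, String target) {
        this.op = op; this.left = left; this.right = right;
        this.result = result; this.target = target;
    }

    // Unused (null) fields are simply omitted from the printed form.
    @Override public String toString() {
        StringBuilder sb = new StringBuilder("(").append(op);
        for (String f : new String[] { left, right, result, target })
            if (f != null) sb.append(", ").append(f);
        return sb.append(")").toString();
    }

    public static void main(String[] args) {
        // the atoms for a = b + c * d, plus two control-flow atoms
        List<Atom> atoms = List.of(
            new Atom("MULT", "c", "d", "temp1", null),
            new Atom("ADD", "b", "temp1", "temp2", null),
            new Atom("MOVE", "temp2", "a", null, null),
            new Atom("LBL", null, null, null, "L1"),
            new Atom("TEST", "a", "<=", "b", "L2"));
        atoms.forEach(System.out::println);
    }
}
```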
            =
           / \
          a   +
             / \
            b   *
               / \
              c   d

Figure 1.3: A syntax tree for a = b + c ∗ d
Solution:
(LBL, L1)
(TEST, a, <=, b, L2)
(JMP, L3)
(LBL, L2)
(ADD, a, 1, a)
(JMP, L1)
(LBL, L3)
Some parsers put out syntax trees as an intermediate data structure, rather
than atom strings. A syntax tree indicates the structure of the source statement,
and object code can be generated directly from the syntax tree. A syntax tree
for the expression a = b + c ∗ d is shown in Figure 1.3.
In syntax trees, each interior node represents an operation or control
structure and each leaf node represents an operand. A statement such as if (Expr)
Stmt1 else Stmt2 could be implemented as a node having three children: one for
the conditional expression, one for the true part (Stmt1), and one for the else
statement (Stmt2). The while control structure would have two children: one
for the loop condition, and one for the statement to be repeated. The
compound statement could be treated in either of two ways: it could have an
unlimited number of children, one for each statement in the compound
statement, or the semicolon could be treated as a statement concatenation
operator, yielding a binary tree.
Once a syntax tree has been created, it is not difficult to generate code from
the syntax tree; a postfix traversal of the tree is all that is needed. In a postfix
traversal, for each node, N, the algorithm visits all the subtrees of N, and visits
the node N last, at which point the instruction(s) corresponding to node N can
be generated.
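The postfix strategy can be sketched in Java (an assumed structure, not the book's own code): the tree built below is Figure 1.3's tree for a = b + c ∗ d, and the traversal emits atoms only after both subtrees of a node have been visited, reproducing the atom sequence shown in the earlier solution.

```java
import java.util.ArrayList;
import java.util.List;

public class PostfixGen {
    public static int tempCount = 0;
    public static final List<String> atoms = new ArrayList<>();

    public static class Node {
        final String label;
        final Node left, right;
        public Node(String label, Node left, Node right) {
            this.label = label; this.left = left; this.right = right;
        }
    }

    public static Node leaf(String name) { return new Node(name, null, null); }

    // Returns the name of the location holding this subtree's value.
    public static String gen(Node n) {
        if (n.left == null) return n.label;        // leaf: an operand
        String l = gen(n.left);                    // visit subtrees first ...
        String r = gen(n.right);
        if (n.label.equals("=")) {                 // assignment: no temporary
            atoms.add("(MOVE, " + r + ", " + l + ")");
            return l;
        }
        String t = "temp" + (++tempCount);         // ... then the node itself
        String op = n.label.equals("+") ? "ADD" : "MULT";
        atoms.add("(" + op + ", " + l + ", " + r + ", " + t + ")");
        return t;
    }

    public static void main(String[] args) {
        // syntax tree for  a = b + c * d  (Figure 1.3)
        Node tree = new Node("=", leaf("a"),
                new Node("+", leaf("b"),
                        new Node("*", leaf("c"), leaf("d"))));
        gen(tree);
        atoms.forEach(System.out::println);
    }
}
```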
Solution:

    if
    ├── <
    │   ├── +
    │   │   ├── a
    │   │   └── 3
    │   └── 400
    ├── =
    │   ├── a
    │   └── 0
    └── =
        ├── b
        └── *
            ├── a
            └── a
Many compilers also include a phase for semantic analysis. In this phase
the data types are checked, and type conversions are performed when necessary.
The compiler may also be able to detect some semantic errors, such as division
by zero, or the use of a null pointer.
stmt1
go to label1
stmt2
stmt3
label1: stmt4
stmt2 and stmt3 can never be executed. They are unreachable and can be
eliminated from the object program. A second example of global optimization
is shown below:
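The example's listing is missing from this copy; the following Java sketch is a hedged reconstruction consistent with the discussion that follows: a call to Math.sqrt(y) inside a loop of 100,000 iterations, which global optimization hoists out of the loop. (The loop body and bound are my assumptions; only the sqrt call and the 99,999 eliminated calls are stated in the text.)

```java
public class LoopInvariant {

    // Before optimization: sqrt is called on every iteration,
    // even though y never changes inside the loop.
    public static double before(double y) {
        double x, sum = 0;
        for (int i = 1; i <= 100000; i++) {
            x = Math.sqrt(y);        // loop invariant
            sum = sum + x + i;
        }
        return sum;
    }

    // After global optimization: the assignment is moved out of the loop,
    // eliminating 99,999 unnecessary calls to sqrt at run time.
    public static double after(double y) {
        double sum = 0;
        double x = Math.sqrt(y);     // hoisted out of the loop
        for (int i = 1; i <= 100000; i++) {
            sum = sum + x + i;
        }
        return sum;
    }

    public static void main(String[] args) {
        // both versions compute the same result
        System.out.println(before(2.0) == after(2.0));
    }
}
```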
In this case, the assignment to x need not be inside the loop since y does
not change as the loop repeats (it is a loop invariant). In the global
optimization phase, the compiler would move the assignment to x out of the
loop in the object program, so that the square root is computed only once.
This would eliminate 99,999 unnecessary calls to the sqrt method at run
time.
The reader is cautioned that global optimization can have a serious impact
on run-time debugging. For example, if the value of y in the above example
was negative, causing a run-time error in the sqrt function, the user would
be unaware of the actual location of that portion of code which called the
sqrt function, because the compiler would have moved the offending statement
(usually without informing the programmer). Most compilers that perform
global optimization also have a switch with which the user can turn optimization
on or off. When debugging the program, the switch would be off. When the
program is correct, the switch would be turned on to generate an optimized
version for the user. One of the most difficult problems for the compiler writer
is making sure that the compiler generates optimized and unoptimized object
modules, from the same source module, which are equivalent.
Machine language programs consist of operation codes and memory addresses,
all encoded as binary values. In the code generation phase,
atoms or syntax trees are translated to machine language (binary) instructions,
or to assembly language, in which case the assembler is invoked to produce
the object program. Symbolic addresses (statement labels) are translated to
relocatable memory addresses at this time.
For target machines with several CPU registers, the code generator is
responsible for register allocation. This means that the compiler must be aware
of which registers are being used for particular purposes in the generated pro-
gram, and which become available as code is generated.
For example, an ADD atom might be translated to three machine language
instructions: (1) load the first operand into a register, (2) add the second
operand to that register, and (3) store the result, as shown for the atom
(ADD, a, b, temp):

LOD r1,a
ADD r1,b
STO r1,temp
In Sample Problem 1.2.5 the destination for the MOV instruction is the
first operand, and the source is the second operand, which is the reverse of the
operand positions in the MOVE atom.
Solution:
LOD r1,a
ADD r1,b
STO r1,temp1 // ADD, a, b, temp1
CMP a,b
BE L1 // TEST, a, ==, b, L1
MOV a,temp1 // MOVE, temp1, a
L1: MOV b,temp1 // MOVE, temp1, b
Note that some of these instructions can be eliminated without changing the
effect of the program, making the object program both smaller and faster.
A diagram showing the phases of compilation and the output of each phase is
shown in Figure 1.4. Note that the optimization phases may be omitted (i.e. the
atoms may be passed directly from the Syntax phase to the Code Generator,
and the instructions may be passed directly from the Code Generator to the
compiler output file.)
A word needs to be said about the flow of control between phases. One way
to handle this is for each phase to run from start to finish separately, writing
output to a disk file. For example, lexical analysis is started and creates a file
of tokens. Then, after the entire source program has been scanned, the syntax
Source Program
      ↓
Lexical Analysis
      ↓  Tokens
Syntax Analysis
      ↓  Atoms
Global Optimization
      ↓  Atoms
Code Generation
      ↓  Instructions
Local Optimization
      ↓
Instructions

Figure 1.4: The Phases of a Compiler
analysis phase is started, reads the entire file of tokens, and creates a file of
atoms. The other phases continue in this manner; this would be a multiple pass
compiler since the input is scanned several times.
Another way for flow of control to proceed would be to start up the syntax
analysis phase first. Each time it needs a token it calls the lexical analysis phase
as a subroutine, which reads enough source characters to produce one token,
and returns it to the parser. Whenever the parser has scanned enough source
code to produce an atom, the atom is converted to object code by calling the
code generator as a subroutine; this would be a single pass compiler.
1.2.6 Exercises
1. Show the lexical tokens corresponding to each of the following Java source
inputs:
2. Show the sequence of atoms put out by the parser, and show the syntax
tree corresponding to each of the following Java source inputs:
(a) a = (b+c) * d;
(b) if (a<b) a = a + 1;
(c) while (x>1)
{ x = x/2;
i = i+1;
}
(d) a = b - c - d/a + d * a;
operand1 + operand2
4. Show how each of the following Java source inputs can be optimized using
global optimization techniques:
(d) if (x>0) x = 2;
else if (x<=0) x = 3;
else x = 4;
(ADD,a,b,temp1)
(SUB,c,d,temp2)
(TEST,temp1,<,temp2,L1)
(JUMP,L2)
(LBL,L1)
(MOVE,a,b)
(JUMP,L3)
(LBL,L2)
(MOVE,b,a)
(LBL,L3)
6. Show a Java source statement which might have produced the atom string
in Problem 5, above.
7. Show how each of the following object code segments could be optimized
using local optimization techniques:
(a) LD r1,a
MULT r1,b
ST r1,temp1
LD r1,temp1
ADD r1,C
ST r1,temp2
(b) LD r1,a
ADD r1,b
ST r1,temp1
MOV c,temp1
C(S: X → Y)  →[ C(M: S → O) ]→  C(O: X → Y)

Figure 1.6: Notation for a compiler being translated to a different language:
the compiler C(S: X → Y), given as input to C(M: S → O), produces C(O: X → Y)
C(Ada: Ada → PC)  →[ C(Sun: Ada → Sun) ]→  ?

Solution:

C(Sun: Ada → PC)
1.3.1 Bootstrapping
The term bootstrapping is derived from the phrase "pull yourself up by your
bootstraps” and generally involves the use of a program as input to itself (the
student may be familiar with bootstrapping loaders which are used to initialize
We want this compiler:  C(Sun: Java → Sun)

We write these two compilers:  C(Sun: Sub → Sun)  and  C(Sub: Java → Sun)

Then, on the Sun:
C(Sub: Java → Sun)  →[ C(Sun: Sub → Sun) ]→  C(Sun: Java → Sun)

Figure 1.7: Bootstrapping a compiler for full Java from a compiler for a
subset of Java
a computer just after it has been switched on, hence the expression "to boot"
a computer).
In this case, we are talking about bootstrapping a compiler, as shown in
Figure 1.7. We wish to implement a Java compiler for the Sun computer. Rather
than writing the whole thing in machine (or assembly) language, we instead
choose to write two easier programs. The first is a compiler for a subset of Java,
written in machine (assembly) language. The second is a compiler for the full
Java language written in the Java subset language. In Figure 1.7 the subset
language of Java is designated 'Sub', and it is simply Java, without several of
the superfluous features, such as enumerated types, unions, switch statements,
etc. The first compiler is loaded into the computer’s memory and the second is
used as input. The output is the compiler we want, i.e. a compiler for the full
Java language, which runs on a Sun and produces object code in Sun machine
language.
In actual practice this is an iterative process, beginning with a small subset
of Java, and producing, as output, a slightly larger subset. This is repeated,
using larger and larger subsets, until we eventually have a compiler for the
complete Java language.
We want this compiler:        C(Mac: Java → Mac)
We write this compiler:       C(Java: Java → Mac)
We already have this compiler: C(Sun: Java → Sun)

Step 1 (on the Sun):
C(Java: Java → Mac)  →[ C(Sun: Java → Sun) ]→  C(Sun: Java → Mac)

Step 2 (on the Sun):
C(Java: Java → Mac)  →[ C(Sun: Java → Mac) ]→  C(Mac: Java → Mac)

Figure 1.8: Cross compiling a Java compiler from a Sun to a Mac
Step one is to compile, on the Sun, the compiler which we wrote in Java,
producing a Java compiler for the Mac which runs on a Sun. Step two is to
load this compiler into the Sun and use the compiler we wrote in Java as
input once again. This time the output is a Java compiler for the Mac which
runs on the Mac, i.e. the compiler we wanted to produce.
Note that this entire process can be completed before a single Mac has been
built. All we need to know is the architecture (the instruction set, instruction
formats, addressing modes, ...) of the Mac.
(a) Each of the three source languages (Java, C++, Ada) is compiled directly
to each of the two target machines (PC, Mac), requiring six compilers.

(b) Each source language is translated to a common intermediate form, and the
intermediate form is translated to each machine, requiring fewer translators.

Figure 1.9: (a) Six compilers needed for three languages on two machines.
(b) Fewer than six compilers using intermediate form needed for the same
languages and machines.
A very popular intermediate form for the PDP-8 and Apple II series of
computers, among others, called p-code, was developed several years ago at the
University of California at San Diego. Today, high-level languages such as C are
commonly used as an intermediate form. The Java Virtual Machine (i.e. Java
byte code) is another intermediate form which has been used extensively on the
Internet.
1.3.4 Compiler-Compilers
Much of compiler design is understood so well at this time that the process can
be automated. It is possible for the compiler writer to write specifications of the
source language and of the target machine so that the compiler can be generated
automatically. This is done by a compiler-compiler. We will introduce this topic
in Chapters 2 and 5 when we study the SableCC public domain software.
1.3.5 Exercises
1. Fill in the missing information in the compilations indicated below:

(a) On the PC:  C(Java: Java → Mac)  →[ C(PC: Java → PC) ]→  ?

(b) On the PC:  C(Java: Java → Mac)  →[ C(PC: Java → Mac) ]→  ?

(c) On the Sun:  ?  →[ C(Sun: Ada → Sun) ]→  C(Sun: Ada → Sun)

(d) On the Mac:  C(Java: Mac → Java)  →[ ? ]→  C(Sun: Mac → Java)
2. How could the compiler generated in part (d) of the previous question be
used?
3. If the only computer you have is a PC (for which you already have a
FORTRAN compiler), show how you can produce a FORTRAN compiler
for the Mac computer, without writing any assembly or machine language.
4. Show how Ada can be bootstrapped in two steps on a Sun. First use a
small subset of Ada, Sub1, to build a compiler for a larger subset, Sub2
(by bootstrapping). Then use Sub2 to implement Ada (again by
bootstrapping). Sub1 is a subset of Sub2.
5. You have 3 computers: a PC, a Mac, and a Sun. Show how to generate
automatically a Java to FORT translator which will run on a Sun if you
also have the four compilers shown below:

C(Mac: Java → FORT)   C(Sun: FORT → Java)   C(Mac: Java → Sun)   C(Java: Java → FORT)
6. In Figure 1.8, suppose we also have C(Java: Java → Sun). When we write
C(Java: Java → Mac), which of the phases of C(Java: Java → Sun) can be
reused as is?
7. Using the big C notation, show the 11 translators which are represented
in Figure 1.9. Use Int to represent the intermediate form.
The example program we will use for the case study is the following Decaf
program, to compute the cosine function:
class Cosine
{ public static void main (String [] args)
{ float cos, x, n, term, eps, alt;
/* compute the cosine of x to within tolerance eps */
/* use an alternating series */
x = 3.14159;
eps = 0.0001;
n = 1;
cos = 1;
term = 1;
alt = -1;
while (term>eps
This program computes the cosine of the value x (in radians) using an al-
ternating series which terminates when a term becomes smaller than a given
tolerance (eps). This series is described in most calculus textbooks and can be
written as:
cos(x) = 1 − x²/2 + x⁴/24 − x⁶/720 + ...
Note that in the statement term = term ∗ x ∗ x/n/(n + 1) the multiplication
and division operations associate to the left, so that n and (n + 1) are both in
the denominator.
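Since the loop body of the Decaf listing above is cut off in this copy, the following Java method is a hedged reconstruction of the computation the text describes: it uses the term update term = term ∗ x ∗ x/n/(n + 1) quoted above and an alternating sign, terminating when a term becomes smaller than the tolerance eps. The exact loop body of the original program may differ.

```java
public class Cosine {

    // Alternating-series approximation of cos(x), stopping once the
    // next term is no larger than eps.
    public static double cos(double x, double eps) {
        double cos = 1, n = 1, term = 1, alt = -1;
        while (term > eps) {
            term = term * x * x / n / (n + 1);  // left-associative: n and
            n = n + 2;                          // (n+1) are both divisors
            cos = cos + alt * term;             // add term with alternating sign
            alt = -alt;
        }
        return cos;
    }

    public static void main(String[] args) {
        System.out.println(cos(3.14159, 0.0001));  // close to cos(pi) = -1
    }
}
```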
A precise specification of Decaf, similar to a BNF description, is given in
Appendix A. The lexical specifications (free format, white space taken as
delimiters, numeric constants, comments, etc.) of Decaf are the same as standard
C.
When we discuss the back end of the compiler (code generation and
optimization) we will need to be concerned with a target machine for which the
compiler generates instructions. Rather than using an actual computer as the
target machine, we have designed a fictitious computer called Mini as the target
machine. This was done for two reasons: (1) We can simplify the architecture of
the machine so that the compiler is not unnecessarily complicated by complex
addressing modes, complex instruction formats, operating system constraints,
etc., and (2) we provide the source code for a simulator for Mini so that the
student can compile and execute Mini programs (as long as he/she has a C
compiler on his/her computer). The student will be able to follow all the steps
in the compilation of the above cosine program, understand its implementation
in Mini machine language, and observe its execution on the Mini machine.
The complete source code for the Decaf compiler and the Mini simulator
is provided in the appendix and is available through the Internet, as described
in the appendix. With this software, the student will be able to make his/her
own modifications to the Decaf language, the compiler, or the Mini machine
architecture. Some of the exercises in later chapters are designed with this
intent.
Chapter 2

Lexical Analysis
The order in which the elements of a set are listed, and any repetition of
elements, do not matter: {animal, boy, girl} may appear to be a different
set of words, but it represents the same set as {girl, boy, animal, girl}.
may contain an infinite number of objects. The set which contains no elements
is still a set, and we call it the empty set and designate it either by { } or by φ.
A string is a list of characters from a given alphabet. The elements of a
string need not be unique, and the order in which they are listed is important.
For example, “abc” and “cba” are different strings, as are “abb” and “ab”.
The string which consists of no characters is still a string (of characters from
the given alphabet), and we call it the null string and designate it by ε. It is
important to remember that if, for example, we are speaking of strings of zeros
and ones (i.e. strings from the alphabet {0,1}), then ε is a string of zeros and
ones.
In this and following chapters, we will be discussing languages. A (formal)
language is a set of strings from a given alphabet. In order to understand this, it
is critical that the student understand the difference between a set and a string
and, in particular, the difference between the empty set and the null string. The
following are examples of languages from the alphabet {0,1}:
1. {0,10,1011}
2. { }
3. {ε,0,00,000,0000,00000,... }
4. The set of all strings of zeroes and ones having an even number of ones.
The first two examples are finite sets while the last two examples are infinite.
The first two examples do not contain the null string, while the last two examples
do. The following are four examples of languages from the alphabet of characters
available on a computer keyboard:
1. {0,10,1011}
2. {ε}
3. Java syntax
4. Italian syntax
The third example is the syntax of a programming language (in which each
string in the language is a Java program without syntax errors), and the fourth
example is a natural language (in which each string in the language is a
grammatically correct Italian sentence). The second example is not the empty set.
Figure 2.1: Example of a finite state machine (state diagram not reproduced;
its table form is given in Figure 2.3 (a))

Figure 2.2: Even parity checker (state diagram not reproduced; its table
form is given in Figure 2.3 (b))
(a)      0  1        (b)      0  1
    A    D  B           *A    A  B
    B    C  B            B    B  A
   *C    C  B
    D    D  D

Figure 2.3: Finite state machines in table form for the machines of (a)
Figure 2.1 and (b) Figure 2.2
Only certain input strings will cause the machine to be in an accepting
state when the entire input string has been read. Another finite state
machine is shown in Figure 2.2. This
machine accepts any string of zeroes and ones which contains an even number of
ones (which includes the null string). Such a machine is called a parity checker.
For both of these machines, the input alphabet is {0,1}.
Notice that both of these machines are completely specified, and there are
no contradictions in the state transitions. This means that for each state there
is exactly one arc leaving that state labeled by each possible input symbol. For
this reason, these machines are called deterministic. We will be working only
with deterministic finite state machines.
Another representation of the finite state machine is the table, in which we
assign names to the states (A, B, C, ...) and these label the rows of the table.
The columns are labeled by the input symbols. Each entry in the table shows
the next state of the machine for a given input and current state. The machines
of Figure 2.1 and Figure 2.2 are shown in table form in Figure 2.3. Accepting
states are designated with an asterisk, and the starting state is the first one
listed in the table.
With the table representation it is easier to ensure that the machine is com-
pletely specified and deterministic (there should be exactly one entry in every
cell of the table). However, many students find it easier to work with the state
diagram representation when designing or analyzing finite state machines.
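The table representation can be executed directly, which is essentially how a table-driven scanner works. A small Java sketch of my own, using the even parity checker of Figure 2.3 (b): rows are states, columns are the input symbols 0 and 1, and state A (row 0) is both the start state and the only accepting state.

```java
public class ParityChecker {
    //                                  input:  0  1
    static final int[][] table = { { 0, 1 },   // state A (accepting)
                                   { 1, 0 } }; // state B

    public static boolean accepts(String input) {
        int state = 0;                          // start in state A
        for (char c : input.toCharArray())
            state = table[state][c - '0'];      // next state from the table
        return state == 0;                      // accept iff we end in A
    }

    public static void main(String[] args) {
        System.out.println(accepts("0110"));    // two ones: accepted
        System.out.println(accepts("111"));     // three ones: rejected
    }
}
```

Note that the null string is accepted, as required: with no input, the machine remains in the accepting start state.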
Show a finite state machine, in either state graph or table form,
for each of the following languages (in each case the input alphabet
is {0,1}):

1. Strings containing an odd number of zeros

2. Strings containing three consecutive ones

3. Strings containing exactly three zeros

4. Strings containing an odd number of zeros and an even number of ones
Solution:
Solution for #1:

     0   1
 A   B   A
*B   A   B
Solution for #2:

     0   1
 A   A   B
 B   A   C
 C   A   D
*D   D   D
Solution for #3:

     0   1
 A   B   A
 B   C   B
 C   D   C
*D   E   D
 E   E   E
Solution for #4:

     0   1
 A   B   C
*B   A   D
 C   D   A
 D   C   B
Note that the union of any language with the empty set is that language:
L + {} = L
For each of the following regular expressions, list six strings which
are in its language.
1. (a(b+c)*)*d
2. (a+b)*(c+d)
3. (a*b*)*
Solution:
1. d ad abd acd aad abbcbd
2. c d ac abd babc bad
3. ε a b ab ba aa
Note that (a*b*)* = (a+b)*
Solution:
1. 1*01*(01*01*)*
2. (0+1)*111(0+1)*
3. 1*01*01*01*
4. (00+11)*(01+10)(1(0(11)*0)*1+0(1(00)*1)*0)*1(0(11)*0)*
+ (00+11)*0
An algorithm for converting a finite state machine to an equiv-
alent regular expression is beyond the scope of this text, but may be
found in Hopcroft & Ullman [1979].
2.0.4 Exercises
1. Suppose L1 represents the set of all strings from the alphabet {0,1} which
contain an even number of ones (even parity). Which of the following
strings belong to L1?
(a) 0101
(b) 110211
(c) 000
(d) 010011
(e) ε
2. Suppose L2 represents the set of all strings from the alphabet {a,b,c} which
contain an equal number of a’s, b’s, and c’s. Which of the following strings
belong to L2?
(a) bca
(b) accbab
(c) ε
(d) aaa
(e) aabbcc
3. (The statement and state diagram for this exercise are not reproduced in this rendering.)
4. Which of the following strings are in the language specified by this finite
state machine?
[State diagram, with input alphabet {a,b}, not reproduced in this rendering.]
(a) abab
(b) bbb
(c) aaab
(d) aaa
(e) ε
5. Show a finite state machine with input alphabet 0,1 which accepts any
string having an odd number of 1’s and an odd number of 0’s.
6. Describe, in your own words, the language specified by each of the follow-
ing finite state machines with alphabet a,b.
(a)      a   b        (b)      a   b
     A   B   A             A   B   A
     B   B   C             B   B   C
     C   B   D             C   B   D
    *D   B   A            *D   D   D

(c)      a   b        (d)      a   b
    *A   A   B             A   B   A
    *B   C   B             B   A   B
     C   C   C            *C   C   B

(e)      a   b
     A   B   B
    *B   B   B
7. Which of the following strings belong to the language specified by this
regular expression: (a+bb)*a
(a) ε
(b) aaa
(c) ba
(d) bba
(e) abba
1. keywords - while, if, else, for, ... These are words which may have a partic-
ular predefined meaning to the compiler, as opposed to identifiers, which
have no particular meaning. Reserved words are keywords which are not
available to the programmer for use as identifiers. In most programming
languages, such as Java and C, all keywords are reserved. PL/1 is an
example of a language which has keywords but no reserved words.
8. white space - Spaces and tabs are generally ignored by the compiler, except
to serve as delimiters in most languages, and are not put out as tokens.
An example of Java source input, showing the word boundaries and types is
given below:
data structure is often organized as a binary search tree, or hash table, for
efficiency in searching.
When compiling block structured languages such as Java, C, or Algol, the
symbol table processing is more involved. Since the same identifier can have
different declarations in different blocks or procedures, both instances of the
identifier must be recorded. This can be done by setting up a separate symbol
table for each block, or by specifying block scopes in a single symbol table. This
would be done during the parse or syntax analysis phase of the compiler; the
scanner could simply store the identifier in a string space array and return a
pointer to its first character.
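As a minimal sketch of the per-block approach (the class and method names here are illustrative, not from the text), a separate hash table can be kept for each block on a stack, with lookups searching from the innermost block outward:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;

// A minimal sketch of block-scoped symbol tables: one HashMap per block,
// kept on a stack. Entering a block pushes a new table; a lookup searches
// from the innermost block outward, so an inner declaration shadows an
// outer declaration of the same identifier.
class ScopedSymbolTable {
    private final Deque<HashMap<String, String>> scopes = new ArrayDeque<>();

    void enterBlock() { scopes.push(new HashMap<>()); }
    void exitBlock()  { scopes.pop(); }
    void declare(String id, String type) { scopes.peek().put(id, type); }

    // Return the declaration in the nearest enclosing block, or null.
    String lookup(String id) {
        for (HashMap<String, String> scope : scopes)
            if (scope.containsKey(id)) return scope.get(id);
        return null;
    }
}
```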
Numeric constants must be converted to an appropriate internal form. For
example, the constant 3.4e+6 should be thought of as a string of six characters
which needs to be translated to floating point (or fixed point integer) format so
that the computer can perform appropriate arithmetic operations with it. As
we will see, this is not a trivial problem, and most compiler writers make use of
library routines to handle this.
The output of this phase is a stream of tokens, one token for each word
encountered in the input program. Each token consists of two parts: (1) a class
indicating which kind of token and (2) a value indicating which member of the
class. The above example might produce the following stream of tokens:
Token Class    Token Value
Note that the comment is not put out. Also, some token classes might not
have a value part. For example, a left parenthesis might be a token class, with
no need to specify a value.
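A two-part token might be sketched as follows; the text does not prescribe a representation, so the class and field names here are illustrative assumptions:

```java
// A minimal sketch of a token with a class part and a value part.
// The numeric class codes are hypothetical (e.g. 2 for identifier).
class Token {
    int tokenClass;   // which kind of token
    String value;     // which member of the class, e.g. "x" or "3.4e+6"

    Token(int tokenClass, String value) {
        this.tokenClass = tokenClass;
        this.value = value;
    }
}
```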
Some variations on this scheme are certainly possible, allowing greater effi-
ciency. For example, when an identifier is followed by an assignment operator,
a single assignment token could be put out. The value part of the token would
be a symbol table pointer for the identifier. Thus the input string x = would be
put out as a single token, rather than two tokens. Also, each keyword could be
a distinct token class, which would increase the number of classes significantly,
but might simplify the syntax analysis phase.
Note that the lexical analysis phase does not check for proper syntax. The
input could be
} while if ( { and the lexical phase would put out five tokens corresponding
to the five words in the input. (Presumably the errors will be detected in the
syntax analysis phase.)
If the source language is not case sensitive, the scanner must accommodate
this feature. For example, the following would all represent the same keyword:
then, tHeN, Then, THEN. A preprocessor could be used to translate all alpha-
betic characters to upper (or lower) case. Java is case sensitive.
2.1.1 Exercises
1. For each of the following Java input strings show the word boundaries and
token classes (for those tokens which are not ignored) selected from the
list in Section 2.1.
2. Since Java is free format, newline characters are ignored during lexical
analysis (except to serve as white space delimiters and to count lines for di-
agnostic purposes). Name at least two high-level programming languages
for which newline characters would not be ignored for syntax analysis.
3. Which of the following will cause an error message from your Java com-
piler?
4. Write a Java method to sum the codes of the characters in a given String:
if (accept[state])
    System.out.println ("Accepted");
else
    System.out.println ("Rejected");
When the loop terminates, the program would simply check to see whether
the state is one of the accepting states to determine whether the input is ac-
cepted. This implementation assumes that all input characters are represented
by small integers, to be used as subscripts of the array of states.
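As a sketch of this table-driven approach, the even parity checker of Figure 2.2 can be implemented with a next-state array and an array of accepting states (the class and array names here are illustrative):

```java
// Table-driven implementation of the even parity checker of Figure 2.2.
// States A and B are represented as 0 and 1; input characters '0' and '1'
// are mapped to array subscripts 0 and 1.
class ParityChecker {
    static final int[][] next = {
        { 0, 1 },   // state A: on '0' stay in A, on '1' go to B
        { 1, 0 }    // state B: on '0' stay in B, on '1' go to A
    };
    static final boolean[] accept = { true, false };   // A is accepting

    static boolean accepts(String input) {
        int state = 0;                       // start in state A
        for (char c : input.toCharArray())
            state = next[state][c - '0'];
        return accept[state];                // check final state
    }
}
```

Note that the null string is accepted, since the machine begins and ends in the accepting state A.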
[State diagram omitted in this rendering.]
Figure 2.4: Finite state machine to accept identifiers (a letter, followed by any
number of letters or digits)
[State diagram omitted in this rendering.]
Figure 2.5: Finite state machine to accept numeric constants (all unspecified transitions
are to the dead state)
An example of a finite state machine which accepts identifiers is shown in
Figure 2.4. The letter ‘L’ represents any letter (a-z), and the letter ‘D’ represents
any numeric digit (0-9). This implies that a preprocessor would be needed to
convert input characters to tokens suitable for input to the finite state machine.
A finite state machine which accepts numeric constants is shown in Fig-
ure 2.5.
Note that these constants must begin with a digit, and numbers such as
.099 are not acceptable. This is the case in some languages, such as Pascal,
whereas Java does permit constants which do not begin with a digit. We could
have included constants which begin with a decimal point, but this would have
required additional states.
A third example of the use of state machines in lexical analysis is shown
in Figure 2.6. This machine accepts the keywords if, int, import, for, float.
This machine is not completely specified, because in order for it to be used
in a compiler it would have to accommodate identifiers as well as keywords.
In particular, identifiers such as i, wh, fo, which are prefixes of keywords, and
identifiers such as fork, which contain keywords as prefixes, would have to be
handled. This problem will be discussed below when we include actions in the
finite state machine.
[State diagram omitted in this rendering.]
Figure 2.6: Finite state machine to accept the keywords if, int, import, for, float
void p()
{ if (parity==0) parity = 1;
else parity = 0;
}
[State diagram omitted in this rendering: a two-state machine in which each
transition on input 1 is labeled 1/p().]
Figure 2.7: Finite state machine with actions, generating a parity bit
At this point, we have seen how finite state machines are capable of specifying
a language and how they can be used in lexical analysis. But lexical analysis
involves more than simply recognizing words. It may involve building a symbol
table, converting numeric constants to the appropriate data type, and putting
out tokens. For this reason, we wish to associate an action, or function to be
invoked, with each state transition in the finite state machine.
This can be implemented with another array of the same dimension as the
state transition array, which would be an array of functions to be called as each
state transition is made. For example, suppose we wish to put out keyword
tokens corresponding to each of the keywords recognized by the machine of
Figure 2.6. We could associate an action with each state transition in the finite
state machine. Moreover, we could recognize identifiers and call a function to
store them in a symbol table.
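One way to sketch this in Java (the array layout and names here are assumptions, not the author's code) is to keep a second array of Runnable actions parallel to the state transition table. Here the action p() complements a parity bit on each input of 1, as in Figure 2.7:

```java
// A sketch of a finite state machine with actions: a second array, parallel
// to the state transition table, holds a function to invoke as each
// transition is made (null means no action for that transition).
class MachineWithActions {
    static int parity = 0;

    static void p() { parity = 1 - parity; }   // complement the parity bit

    // actions[state][input]: invoked as the transition is made
    static final Runnable[][] actions = {
        { null, MachineWithActions::p },   // state A: act on '1' only
        { null, MachineWithActions::p }    // state B: act on '1' only
    };
    static final int[][] next = { { 0, 1 }, { 1, 0 } };

    static int run(String input) {
        int state = 0;
        parity = 0;
        for (char c : input.toCharArray()) {
            int sym = c - '0';
            if (actions[state][sym] != null) actions[state][sym].run();
            state = next[state][sym];
        }
        return parity;        // the generated parity bit
    }
}
```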
In Figure 2.7 we show an example of a finite state machine with actions.
The purpose of the machine is to generate a parity bit so that the input string
and parity bit will always have an even number of ones. The parity bit, parity,
is initialized to 0 and is complemented by the function p().
Solution:
In the state diagram shown below we have included method calls
designated digits(), decimals(), minus(), and expDigits() which are
to be invoked as the corresponding transition occurs. In other words,
a transition marked i/p means that if the input is i, invoke method
p() before changing state and reading the next input symbol.
[State diagram omitted in this rendering. Its transitions are labeled d/digits
before the decimal point, d/decimals after the decimal point, −/minus on a
minus sign in the exponent, and d/expDigits on exponent digits; all unspecified
transitions go to the dead state.]
// instance variables
int d; // A single digit, 0..9
int places=0; // places after the decimal point
int mantissa=0; // all the digits in the number
int exp=0; // exponent value
int signExp=+1; // sign of the exponent
where Math.pow(x, y) = x^y
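The four action methods might be sketched as follows; the method bodies are an assumption based on the comments on the instance variables above, not the author's exact code. For the input 46.73e-21 they yield mantissa = 4673, places = 2, exp = 21, and signExp = -1, so the value is 4673 * 10^(-21-2) = 46.73e-21:

```java
// A sketch of the action methods for the numeric constant machine,
// using the instance variables listed above. Each method receives the
// current digit d where appropriate.
class NumericConstant {
    int places = 0;      // places after the decimal point
    int mantissa = 0;    // all the digits in the number
    int exp = 0;         // exponent value
    int signExp = +1;    // sign of the exponent

    void digits(int d)    { mantissa = mantissa * 10 + d; }            // before '.'
    void decimals(int d)  { mantissa = mantissa * 10 + d; places++; }  // after '.'
    void minus()          { signExp = -1; }                            // '-' in exponent
    void expDigits(int d) { exp = exp * 10 + d; }                      // exponent digits

    double value() {     // where Math.pow(x,y) = x^y
        return mantissa * Math.pow(10, signExp * exp - places);
    }
}
```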
2.2.3 Exercises
1. Show a finite state machine which will recognize the words RENT, RE-
NEW, RED, RAID, RAG, and SENT. Use a different accepting state for
each of these words.
2. Modify the finite state machine of Figure 2.5 to include numeric constants
which begin with a decimal point and have digits after the decimal point,
such as .25, without excluding any constants accepted by that machine.
3. Add actions to your solution to the previous problem so that numeric
constants will be computed as in the Sample Problem above.
4. Show a finite state machine that will accept C-style comments /* as shown
here */. Use the symbol A to represent any character other than * or /;
thus the input alphabet will be {/,*,A}.
2.3. LEXICAL TABLES 47
5. What is the output of the finite state machine, below, for each of the
following inputs (L represents any letter, and D represents any numeric
digit; also, assume that each input is terminated with a period):
[State diagram omitted in this rendering. Its transitions are labeled L/p1,
L/p2, D/p3, and ./p4.]
int sum;

void p1()
{ sum = L; }

void p2()
{ sum += L; }

void p4()
{ System.out.println(hash(sum)); }
(a) ab3.
(b) xyz.
(c) a49.
6. Show the values that will be assigned to the variable mantissa in the
Sample Problem above as the input string 46.73e-21 is read.
[Tree diagrams omitted in this rendering.]
Figure 2.8: (a) A balanced binary search tree and (b) a binary search tree which
is not balanced
added at the end. As we learned in our data structures course, the time required
to build a table of n words is O(n^2). This sequential search technique is easy to
implement but not very efficient, particularly as the number of words becomes
large. This method is generally not used for symbol tables, or tables of line
numbers, but could be used for tables of statement labels, or constants.
hash(bat)  = (3+98)%6  = 5
hash(frog) = (4+102)%6 = 4
hash(bird) = (4+98)%6  = 0
hash(hill) = (4+104)%6 = 0
hash(tree) = (4+116)%6 = 0
hash(cat)  = (3+99)%6  = 0

0. tree, hill, bird, cat
1.
2.
3.
4. frog
5. bat

Figure 2.9: Hash table corresponding to the words entered for Figure 2.8 (a)
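The hash function used in Figure 2.9, the sum of the word's length and the code of its first character, modulo the table size, can be sketched as follows (the class name is illustrative):

```java
// The hash function of Figure 2.9: a word hashes to
// (length + code of first character) mod table size.
class WordHash {
    static final int TABLE_SIZE = 6;

    static int hash(String word) {
        return (word.length() + word.charAt(0)) % TABLE_SIZE;
    }
}
```

For example, hash("frog") = (4+102)%6 = 4, matching the computation shown in the figure.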
2.3.4 Exercises
1. Show the binary search tree which would be constructed to store each of
the following lists of identifiers:
class Node
{ Node left;
  String data;
  Node right;
}
7. Show a sequence of four identifiers which would cause your hash function
defined in the previous problem to generate a collision for each identifier
after the first.
4. Token declarations
5. Ignored tokens
6. Productions
At the present time the student may ignore the Ignored tokens and Produc-
tions sections, whereas the sections on Helper declarations, States declarations,
and Token declarations are relevant to lexical analysis. The required Java classes
will be provided to the student as a standard template. Consequently, the input
file, named language.grammar, will be arranged as shown below:
Package package-name ;
Helpers
[ Helper declarations, if any, go here ]
States
[ State declarations, if any, go here ]
Tokens
[ Token declarations go here ]
There are two important rules to remember when more than one token def-
inition matches input characters at a given point in the input file:
• When two token definitions match the input, the one matching the longer
input string is selected.
• When two token definitions match input strings of the same length, the
token definition listed first is selected. For example, the following would
not work as desired:
Tokens
identifier = ['a'..'z']+ ;
keyword = 'while' | 'for' | 'class' ;

Here every keyword would be matched as an identifier, since the identifier
definition is listed first. The intended behavior is obtained by listing the
keyword definition first:

Tokens
keyword = 'while' | 'for' | 'class' ;
identifier = ['a'..'z']+ ;
Helpers
digit = ['0'..'9'] ;
letter = [['a'..'z'] + ['A'..'Z']] ;
sign = '+' | '-' ;
newline = 10 | 13 ; // ascii codes
Tokens
number = sign? digit+ ; // A number is an optional
// sign, followed by 1 or more digits.
Students who may be familiar with macros in the unix utility lex will see an
important distinction here. Whereas in lex, macros are implemented as textual
substitutions, in SableCC helpers are implemented as semantic substitutions.
For example, the definition of number above, using lex would be obtained by
substituting directly the definition of sign into the definition of number:
Solution:
number 334
space
identifier abc
space
identifier abc334
56 CHAPTER 2. LEXICAL ANALYSIS
In this case the entire comment should be ignored. In other words, we wish
the scanner to go into a different state, or mode of operation, when it sees the
two consecutive slashes. It should remain in this state until it encounters the
end of the line, at which point it would return to the default state. Some other
uses of states would be to indicate that the scanner is processing the characters
in a string; the input character is at the beginning of a line; or some other left
context, such as a ‘$’ when processing a currency value. To use states, simply
identify the names of the states as a list of names separated by commas in the
States section:
States
statename1, statename2, statename3,... ;
The first state listed is the start state; the scanner will start out in this state.
In the Tokens section, any definition may be preceded by a list of state names
and optional state transitions in curly braces. The definition will be applied
only if the scanner is in the specified state:
How is the scanner placed into a particular state? This is done with the
transition operator, ->. A transition operator may follow any state name
inside the braces:
Show the token and state definitions needed to process a text file
containing numbers, currency values, and spaces. Currency values
begin with a dollar sign, such as '$3045' and '$9'. Assume all numbers
and currency values are whole numbers. Your definitions should
be able to distinguish between currency values (money) and ordinary
numbers (number). You may also use helpers.
Solution:
Helpers
num = ['0'..'9']+ ; // 1 or more digits
States
def, currency; // def is start state
Tokens
space = (' ' | 10 | 13 | '\t') ;
{def -> currency} dollar = '$' ; // change to currency state
{currency -> def} money = num; // change to def state
{def} number = num; // remain in def state
It is also possible to specify a right context for tokens. This is done with a
forward slash (’/’). To recognize a particular token only when it is followed by
a certain pattern, include that pattern after the slash. The token, not including
the right context (i.e. the pattern), will be matched only if the right context
is present. For example, if you are scanning a document in which all currency
amounts are followed by DB or CR, you could match any of these with:
In the text:
SableCC would find a currency token as ‘14.50’ (it excludes the ‘ CR’ which
is the right context). The ‘12’ would not be returned as a currency token because
the right context is not present.
Ignored Tokens
space, comment ;
Helpers
num = ['0'..'9']+; // A num is 1 or more decimal digits
letter = ['a'..'z'] | ['A'..'Z'] ;
// A letter is a single upper or
// lowercase character.
Tokens
number = num; // A number token is a whole number
ident = letter (letter | num)* ;
// An ident token is a letter followed by
// 0 or more letters and numbers.
arith_op = [ ['+' + '-'] + ['*' + '/'] ] ;
// Arithmetic operators
rel_op = ['<' + '>'] | '==' | '<=' | '>=' | '!=' ;
// Relational operators
paren = ['(' + ')']; // Parentheses
blank = (' ' | '\t' | 10 | '\n')+ ; // White space
unknown = [0..0xffff] ;
// Any single character which is not part
// of one of the above tokens.
package lexing;
import lexing.lexer.*;
import lexing.node.*;
import java.io.*; // Needed for pushbackreader and
// inputstream
class Lexing
{
static Lexer lexer;
static Object token;
public static void main(String [] args)
{
lexer = new Lexer
(new PushbackReader
(new InputStreamReader (System.in), 1024));
token = null;
try
{
while ( ! (token instanceof EOF))
{ token = lexer.next(); // read next token
if (token instanceof TNumber)
System.out.print ("Number: ");
else if (token instanceof TIdent)
System.out.print ("Identifier: ");
else if (token instanceof TArithOp)
System.out.print ("Arith Op: ");
else if (token instanceof TRelOp)
System.out.print ("Relational Op: ");
else if (token instanceof TParen)
System.out.print ("Parentheses ");
else if (token instanceof TBlank) ;
// Ignore white space
else if (token instanceof TUnknown)
System.out.print ("Unknown ");
if (! (token instanceof TBlank))
System.out.println (token); // print token as a string
}
}
catch (LexerException le)
{ System.out.println ("Lexer Exception " + le); }
catch (IOException ioe)
{ System.out.println ("IO Exception " +ioe); }
}
}
There is now a two-step process to generate your scanner. The first step is
to generate the Java class definitions by running SableCC. This will produce
a sub-directory, with the same name as the language being compiled, containing
the generated Java classes:
sablecc languagename.grammar
sablecc lexing.grammar
The second step required to generate the scanner is to compile these Java
classes. First, copy the Lexing.java file from the web site to your lexing
sub-directory, and make any necessary changes. Then compile the source files
from the top directory:
javac languagename/*.java
javac lexing/*.java
java languagename.Classname
java lexing.Lexing
This will read from the standard input file (keyboard) and should display
tokens as they are recognized. Use the end-of-file character to terminate the
input (ctrl-d for unix, ctrl-z for Windows/DOS). A sample session is shown
below:
java lexing.Lexing
sum = sum + salary ;
Identifier: sum
Unknown =
Identifier: sum
Arith Op: +
Identifier: salary
Unknown ;
2.4.3 Exercises
1. Modify the given SableCC lexing.grammar file and lexing/Lexing.java file
to recognize the following 7 token classes.
Helpers
char = ['a'..'z'] ['0'..'9']? ;
Tokens
token1 = char char ;
token2 = char 'x' ;
token3 = char+ ;
token4 = ['0'..'9']+ ;
space = ' ' ;
Input files:
(a) a1b2c3
(b) abc3 a123
(c) a4x ab r2d2
In this section we describe the first two sections of the SableCC source file for
Decaf, which are used for lexical analysis.
The Helpers section, shown below, defines a few macros which will be useful
in the Tokens section. A letter is defined to be any single letter, upper or
lower case. A digit is any single numeric digit. A digits is a string of one or
more digits. An exp is used for the exponent part of a numeric constant, such
as 1.34e12. A newline is an end-of-line character (for various systems). A
non_star is any unicode character which is not an asterisk. A non_slash is any
unicode character which is not a (forward) slash. A non_star_slash is any unicode
character except for asterisk or slash. The helpers non_star and non_slash are
used in the description of comments. The Helpers section, with an example for
each Helper, is shown below:
Helpers // Examples
letter = ['a'..'z'] | ['A'..'Z'] ; // w
digit = ['0'..'9'] ; // 3
digits = digit+ ; // 2040099
exp = ['e' + 'E'] ['+' + '-']? digits; // E-34
newline = [10 + 13] ; // '\n'
non_star = [[0..0xffff] - '*'] ; // /
non_slash = [[0..0xffff] - '/']; // *
non_star_slash = [[0..0xffff] - ['*' + '/']]; // $
States can be used in the description of comments, but this can also be done
without using states. Hence, we will not have a States section in our source file.
The Tokens section, shown below, defines all tokens that are used in the
definition of Decaf. All tokens must be named and defined here. We begin with
definitions of comments; note that in Decaf, as in Java, there are two kinds of
comments: (1) single line comments, which begin with ‘//’ and terminate at a
newline, and (2) multi-line comments, which begin with ‘/*’ and end with ‘*/’.
These two kinds of comments are called comment1 and comment2, respectively.
The definition of comment2, for multi-line comments, was designed using a finite
state machine model as a guide (see exercise 4 in section 2.2). Comments are
listed with white space as Ignored Tokens, i.e. the parser never even sees these
tokens.
A space is any white space, including tab (9) and newline (10, 13) characters.
Each keyword is defined as itself. The keyword class is an exception; for some
reason SableCC will not permit the use of class as a name, so it is shortened to
clas. A language which is not case-sensitive, such as BASIC or Pascal, would
require a different strategy for keywords. The keyword while could be defined
as
while = ['w' + 'W'] ['h' + 'H'] ['i' + 'I'] ['l' + 'L'] ['e' + 'E'] ;
Our implementation makes use of the Java class Hashtable to implement a symbol table
and a table of numeric constants. This will be discussed further in Chapter 5
when we define the Translation class to be used with SableCC.
2.5.1 Exercises
1. Extend the SableCC source file for Decaf, decaf.grammar, to accommo-
date string constants and character constants (these files can be found at
http://cs.rowan.edu/∼bergmann/books). For purposes of this exercise,
ignore the section on productions. A string is one or more characters in-
side double-quotes, and a character constant is one character inside single-
quotes (do not worry about escape-chars, such as '\n'). Here are some
examples, with a hint showing what your lexical scanner should find:
INPUT HINT
"A long string" One string token
" Another ’c’ string" One string token
"one" ’x’ "three" A string, a char, a string
" // string " A string, no comment
// A "comment" A comment, no string
Syntax Analysis
The second phase of a compiler is called syntax analysis. The input to this
phase consists of a stream of tokens put out by the lexical analysis phase. They
are then checked for proper syntax, i.e. the compiler checks to make sure the
statements and expressions are correctly formed. Some examples of syntax
errors in Java are:
When the compiler encounters such an error, it should put out an informa-
tive message for the user. At this point, it is not necessary for the compiler
to generate an object program. A compiler is not expected to guess the in-
tended purpose of a program with syntax errors. A good compiler, however,
will continue scanning the input for additional syntax errors.
The output of the syntax analysis phase (if there are no syntax errors) could
be a stream of atoms or syntax trees. An atom is a primitive operation which
is found in most computer architectures, or which can be implemented using
only a few machine language instructions. Each atom also includes operands,
which are ultimately converted to memory addresses on the target machine. A
syntax tree is a data structure in which the interior nodes represent operations,
and the leaves represent operands, as discussed in Section 1.2.2. We will see
that the parser can be used not only to check for proper syntax, but to produce
output as well. This process is called syntax directed translation.
Just as we used formal methods to specify and construct the lexical scanner,
we will do the same with syntax analysis. In this case however, the formal
methods are far more sophisticated. Most of the early work in the theory of
compiler design focused on syntax analysis. We will introduce the concept of a
3.0.1 Grammars
Recall our definition of language from Chapter 2 as a set of strings. We have
already seen two ways of formally specifying a language: regular expressions and
finite state machines. We will now define a third way of specifying languages,
i.e. by using a grammar. A grammar is a list of rules which can be used to
produce or generate all the strings of a language, and which does not generate
any strings which are not in the language. More formally a grammar consists
of:
1. A finite set of characters, called the input alphabet, the input symbols, or
terminal symbols.
2. A finite set of symbols, distinct from the terminal symbols, called nonter-
minal symbols, exactly one of which is designated the starting nonterminal
(if no nonterminal is explicitly designated as the starting nonterminal, it
is assumed to be the nonterminal defined in the first rule).
3. A finite list of rewriting rules, also called productions, which define how
strings in the language may be generated. Each of these rewriting rules
is of the form a → b, where a and b are arbitrary strings of terminals and
nonterminals, and a is not null.
The grammar specifies a language in the following way: beginning with the
starting nonterminal, any of the rewriting rules are applied repeatedly to pro-
duce a sentential form, which may contain a mix of terminals and nonterminals.
If at any point, the sentential form contains no nonterminal symbols, then it
is in the language of this grammar. If G is a grammar, then we designate the
language specified by this grammar as L(G).
A derivation is a sequence of rewriting rules, applied to the starting nonter-
minal, ending with a string of terminals. A derivation thus serves to demonstrate
that a particular string is a member of the language. Assuming that the starting
nonterminal is S, we will write derivations in the following form:
S ⇒ a ⇒ b ⇒ g ⇒ ... ⇒ x
1. S → aSA
2. S → BA
3. A → ab
4. B → bA
Solution:
S ⇒ aSA ⇒ aBAA ⇒ aBabA ⇒ aBabab ⇒ abAabab ⇒ abababab
S ⇒ aSA ⇒ aSab ⇒ aBAab ⇒ abAAab ⇒ ababAab ⇒ abababab
S ⇒ BA ⇒ bAA ⇒ babA ⇒ babab
Note that in the solution to this problem we have shown that it
is possible to have more than one derivation for the same string:
abababab.
0. Unrestricted: An unrestricted grammar places no restrictions on the form
of its rewriting rules. An example of an unrestricted rule is:

SaB → cS

1. Context-Sensitive: A context-sensitive grammar is one in which each rule
must be of the form:

αAγ → αβγ
where each of α, β, and γ is any string of terminals and nonterminals (in-
cluding ε), and A represents a single nonterminal. In this type of grammar, it
is the nonterminal on the left side of the rule (A) which is being rewritten, but
only if it appears in a particular context, α on its left and γ on its right. An
example of a context-sensitive rule is shown below:
SaB → caB
which is another way of saying that an S may be rewritten as a c, but only
if the S is followed by aB (i.e. when S appears in that context). In the above
example, the left context is null.
2. Context-Free: A context-free grammar is one in which each rule must be
of the form:
A→α
where A represents a single nonterminal and α is any string of terminals and
nonterminals. Most programming languages are defined by grammars of this
type; consequently, we will focus on context-free grammars. Note that both
grammars G1 and G2, above, are context-free. An example of a context-free
rule is shown below:
A → aABb
3. Right Linear: A right linear grammar is one in which each rule is of the
form:
A → aB
or
A→a
where A and B represent nonterminals, and a represents a terminal. Right
linear grammars can be used to define lexical items such as identifiers, constants,
and keywords.
Note that every context-sensitive grammar is also in the unrestricted class.
Every context-free grammar is also in the context-sensitive and unrestricted
classes. Every right linear grammar is also in the context-free, context-sensitive,
and unrestricted classes. This is represented by the diagram of Figure 3.1, which
depicts the classes of grammars as circles. All points in a circle belong to the
class of that circle.
A context-sensitive language is one for which there exists a context-sensitive
grammar. A context-free language is one for which there exists a context-
free grammar. A right linear language is one for which there exists a right
linear grammar. These classes of languages form the same hierarchy as the
corresponding classes of grammars.
We conclude this section with an example of a context-sensitive grammar
which is not context-free.
[Diagram omitted in this rendering: four nested circles labeled, from innermost
to outermost, Right Linear, Context Free, Context Sensitive, and Unrestricted.]
Figure 3.1: Classes of grammars
G3:
1. S → aSBC
2. S → ε
3. aB → ab
4. bB → bb
5. C → c
6. CB → CX
7. CX → BX
8. BX → BC
Solution:
1. Type 1, Context-Sensitive
2. Type 3, Right Linear
3. Type 0, Unrestricted
4. Type 2, Context-Free
5. Type 1, Context-Sensitive
6. Type 0, Unrestricted
[Derivation tree omitted in this rendering: the rule S → ASB is applied three
times, with A → a, B → b, and the innermost S → ε, yielding aaabbb.]
Figure 3.2: A derivation tree for aaabbb using grammar G2
[Diagrams omitted in this rendering: the first tree groups the expression as
(var + var) * var; the second as var + (var * var).]
Figure 3.3: Two different derivation trees for the string var + var * var
Ambiguous grammars are those which may have more than one interpretation. Thus, the derivation
tree does more than show that a particular string is in the language of the
grammar - it shows the structure of the string, which may affect the meaning or
semantics of the string. For example, consider the following grammar for simple
arithmetic expressions:
G4:
1. Expr → Expr + Expr
2. Expr → Expr * Expr
3. Expr → ( Expr )
4. Expr → var
5. Expr → const
Figure 3.3 shows two different derivation trees for the string var+var*var,
consequently this grammar is ambiguous. It should be clear that the second
derivation tree in Figure 3.3 represents a preferable interpretation because it
correctly shows the structure of the expression as defined in most programming
languages (since multiplication takes precedence over addition). In other words,
all subtrees in the derivation tree correspond to subexpressions in the derived
expression. A nonambiguous grammar for expressions will be given in the next
section.
A left-most derivation is one in which the left-most nonterminal is always
the one to which a rule is applied. An example of a left-most derivation for
grammar G2 above is:
S ⇒ ASB ⇒ aSB ⇒ aASBB ⇒ aaSBB ⇒ aaBB ⇒ aabB ⇒ aabb
We have a similar definition for right-most derivation. A left-most (or right-
most) derivation is a normal form for derivations; i.e., if two different deriva-
tions can be written in the same normal form, they are equivalent in that they
correspond to the same derivation tree. Consequently, there is a one-to-one cor-
respondence between derivation trees and left-most (or right-most) derivations
for a grammar.
1. S → aSbS
2. S → aS
3. S → c
Solution:
(The two derivation trees are omitted; their left-most derivations are:)
S ⇒ aSbS ⇒ aaSbS ⇒ aacbS ⇒ aacbc
S ⇒ aS ⇒ aaSbS ⇒ aacbS ⇒ aacbc
We note that the two derivation trees correspond to two different
left-most derivations, and the grammar is ambiguous.
• An infinite stack and a finite set of stack symbols which may be pushed
on top or removed from the top of the stack in a last-in first-out manner.
The stack symbols need not be distinct from the input symbols. The stack
must be initialized to contain at least one stack symbol before the first
input symbol is read.
• On each state transition the machine may advance to the next input sym-
bol or retain the input pointer (i.e., not advance to the next input symbol).
• On each state transition the machine may perform one of the stack oper-
ations, push(X) or pop, where X is one of the stack symbols.
• A state transition may include an exit from the machine labeled either
Accept or Reject. This determines whether or not the input string is in
the specified language.
Note that without the infinite stack, the pushdown machine is nothing more
than a finite state machine as defined in Chapter 2. Also, the pushdown machine
halts by taking an exit from the machine, whereas the finite state machine halts
when all input symbols have been read.
An example of a pushdown machine is shown in Figure 3.4, in which the rows
are labeled by stack symbols and the columns are labeled by input symbols. The
↵ character is used as an endmarker, indicating the end of the input string,
and the ∇ symbol is a stack symbol which we are using to mark the bottom of
the stack so that we can test for the empty stack condition. The states of the
machine are S1 (in our examples S1 will always be the starting state) and S2,
and there is a separate transition table for each state. Each cell of those tables
shows a stack operation (push() or pop), an input pointer function (advance
or retain), and the next state. Accept and Reject are exits from the machine.
The language of strings accepted by this machine is {aⁿbⁿ} where n ≥ 0; i.e.,
the same language specified by grammar G2, above. To see this, the student
should trace the operation of the machine for a particular input string. A trace
showing the sequence of stack configurations and states of the machine for the
input string aabb is shown in Figure 3.5. Note that while in state S1 the machine
is pushing X’s on the stack as each a is read, and while in state S2 the machine
is popping an X off the stack as each b is read.
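The machine just described can be simulated in a few lines of Java. This is our own sketch, not code from the text; the character '$' stands in for the bottom-of-stack marker ∇, and the input string is given without the endmarker.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class AnBnMachine {
    // Simulation of the two-state pushdown machine which accepts {a^n b^n}, n >= 0.
    public static boolean accepts(String input) {
        Deque<Character> stack = new ArrayDeque<>();
        stack.push('$');                                 // bottom-of-stack marker
        int state = 1;                                   // start in state S1
        for (char c : input.toCharArray()) {
            if (state == 1 && c == 'a') stack.push('X'); // S1, a: push(X), advance
            else if (c == 'b' && stack.peek() == 'X') {  // b with X on top: pop, advance
                stack.pop();
                state = 2;                               // after the first b we stay in S2
            } else return false;                         // every other configuration rejects
        }
        return stack.peek() == '$';                      // endmarker: accept only on empty stack
    }

    public static void main(String[] args) {
        if (!accepts("aabb")) throw new AssertionError();
        if (!accepts("")) throw new AssertionError();
        if (accepts("aab") || accepts("abab")) throw new AssertionError();
    }
}
```

Tracing accepts("aabb") reproduces the stack sequence of Figure 3.5 exactly: two pushes in S1, two pops in S2, then Accept.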
An example of a pushdown machine which accepts any string of correctly
balanced parentheses is shown in Figure 3.6. In this machine, the input symbols
are left and right parentheses, and the stack symbols are X and ∇. Note that this
language could not be accepted by a finite state machine because there could
be an unlimited number of left parentheses read before the first right parenthesis,
and a finite state machine has no way of remembering an unbounded count.
Figure 3.4: A pushdown machine which accepts {aⁿbⁿ}, n ≥ 0

S1     a                     b                  ↵
X      Push(X), Advance, S1  Pop, Advance, S2   Reject
∇      Push(X), Advance, S1  Reject             Accept

S2     a                     b                  ↵
X      Reject                Pop, Advance, S2   Reject
∇      Reject                Reject             Accept

Initial stack: ∇

Figure 3.5: Sequence of stacks as the pushdown machine of Figure 3.4 accepts
the input string aabb (the top of each stack is written at the left):
∇ (S1) → X∇ (S1) → XX∇ (S1) → X∇ (S2) → ∇ (S2) → Accept
Figure 3.6: A pushdown machine which accepts strings of balanced parentheses

S1     (                     )                  ↵
X      Push(X), Advance, S1  Pop, Advance, S1   Reject
∇      Push(X), Advance, S1  Reject             Accept

Initial stack: ∇
Infix          Postfix
2 + 3          2 3 +
2 + 3 * 5      2 3 5 * +
2 * 3 + 5      2 3 * 5 +
(2 + 3) * 5    2 3 + 5 *
Note that parentheses are never used in postfix notation. In Figure 3.8 the
default state transition is to stay in the same state, and the default input pointer
operation is advance. States S2 and S3 show only a few input symbols and stack
symbols in their transition tables, because those are the only configurations
which are possible in those states. The stack symbol E represents an expression,
and the stack symbol L represents a left parenthesis. Similarly, the stack symbols
Ep and Lp represent an expression and a left parenthesis on top of a plus symbol,
respectively.
(Figure 3.8: an extended pushdown machine which translates infix expressions involving + and * into postfix. Its transition tables, for states S1, S2, and S3 with stack symbols E, Ep, L, Lp, Ls, +, *, and ∇ and output operations such as out(a) and out(+), are omitted here.)
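The same infix to postfix translation can also be sketched in Java with a simple operator stack (the classical shunting-yard approach) rather than by simulating the machine of Figure 3.8. The class below and its names are our own; operands are single characters, and * takes precedence over +.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class InfixToPostfix {
    private static int prec(char op) { return op == '*' ? 2 : op == '+' ? 1 : 0; }

    // Translate an infix expression with single-character operands,
    // + and *, and parentheses into postfix.
    public static String toPostfix(String infix) {
        StringBuilder out = new StringBuilder();
        Deque<Character> ops = new ArrayDeque<>();
        for (char c : infix.toCharArray()) {
            if (c == ' ') continue;                       // ignore blanks
            if (c == '(') ops.push(c);
            else if (c == ')') {
                while (ops.peek() != '(') out.append(ops.pop());
                ops.pop();                                // discard the '('
            } else if (c == '+' || c == '*') {
                while (!ops.isEmpty() && prec(ops.peek()) >= prec(c))
                    out.append(ops.pop());                // pop higher/equal precedence first
                ops.push(c);
            } else out.append(c);                         // operand: output immediately
        }
        while (!ops.isEmpty()) out.append(ops.pop());
        return out.toString();
    }

    public static void main(String[] args) {
        if (!toPostfix("2+3").equals("23+")) throw new AssertionError();
        if (!toPostfix("2+3*5").equals("235*+")) throw new AssertionError();
        if (!toPostfix("2*3+5").equals("23*5+")) throw new AssertionError();
        if (!toPostfix("(2+3)*5").equals("23+5*")) throw new AssertionError();
    }
}
```

The four assertions in main reproduce the four rows of the infix/postfix table above.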
Given a finite state machine, we can write a right linear grammar which specifies the same language
accepted by the finite state machine.
There are algorithms which can be used to produce any of these three forms
(finite state machines, right linear grammars, and regular expressions), given
one of the other two (see, for example, Hopcroft and Ullman [12]). However,
here we rely on the student’s ingenuity to solve these problems.
Show the sequence of stacks and states which the pushdown machine
of Figure 3.8 would go through if the input were: a+(a*a)
Solution:
(Sequence of stacks omitted.)
Output: aaa*+
Solution:
(4) Strings over {0,1} which contain an odd number of 0’s and
an even number of 1’s.
1. S → 0A
2. S → 1B
3. S → 0
4. A → 0S
5. A → 1C
6. B → 0C
7. B → 1S
8. C → 0B
9. C → 1A
10. C → 1
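The four nonterminals of this grammar act as the states of a finite state machine tracking the parities of the 0's and 1's read so far (S: even/even, A: odd/even, B: even/odd, C: odd/odd). A small Java sketch of the equivalent acceptance test; the class and names are our own:

```java
public class OddZerosEvenOnes {
    // Accepts strings over {0,1} with an odd number of 0's and
    // an even number of 1's, by tracking the two parities.
    public static boolean accepts(String s) {
        int zeros = 0, ones = 0;                 // parities: 0 = even, 1 = odd
        for (char c : s.toCharArray()) {
            if (c == '0') zeros ^= 1;
            else if (c == '1') ones ^= 1;
            else return false;                   // not a string over {0,1}
        }
        return zeros == 1 && ones == 0;          // odd 0's, even 1's
    }

    public static void main(String[] args) {
        if (!accepts("0") || !accepts("011") || !accepts("101"))
            throw new AssertionError();
        if (accepts("") || accepts("01") || accepts("00"))
            throw new AssertionError();
    }
}
```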
1. S → 0S0
2. S → 1S1
3. S → 0
4. S → 1
5. S → ε
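The second grammar reads directly as a two-pointer matching procedure: rules 1 and 2 strip a matching symbol from each end, and rules 3, 4, and 5 accept when at most one symbol remains. A minimal Java sketch, with names of our own choosing and input assumed to be over {0,1}:

```java
public class PalindromeGrammar {
    // Membership test for the grammar S -> 0S0 | 1S1 | 0 | 1 | epsilon,
    // i.e. palindromes over {0,1}.
    public static boolean inLanguage(String s) {
        int i = 0, j = s.length() - 1;
        while (i < j && s.charAt(i) == s.charAt(j)) { i++; j--; } // rules 1 and 2
        return i >= j;                        // rules 3 and 4 (one symbol) or 5 (epsilon)
    }

    public static void main(String[] args) {
        if (!inLanguage("0110") || !inLanguage("010") || !inLanguage(""))
            throw new AssertionError();
        if (inLanguage("01") || inLanguage("0010")) throw new AssertionError();
    }
}
```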
3.0.6 Exercises
1. Show three different derivations using each of the following grammars,
with starting nonterminal S.
(a)
1. S → aS
2. S → bA
3. A → bS
4. A → c
(b)
1. S → aBc
2. B → AB
3. A → BA
4. A → a
5. B → ε
(c)
1. S → aSBc
2. aSA → aSbb
3. Bc → Ac
4. Sb → b
5. A → a
(d)
1. S → ab
2. a → aAbB
3. AbB → ε
4. For each of the given input strings show a derivation tree using the fol-
lowing grammar.
1. S → SaA
2. S → A
3. A → AbB
4. A → B
5. B → cSd
6. B → e
7. B → f
(a) eae (b) ebe (c) eaebe (d) ceaedbe (e) cebedaceaed
5. Show a left-most derivation for each of the following strings, using gram-
mar G4 of section 3.0.3.
(a) var + const (b) var + var * var (c) (var) (d) ( var + var ) * var
(a)
1. S → aSb
2. S → AA
3. A → c
4. A → S
(b)
1. S → AaA
2. S → AbA
3. A → c
4. A → z
(c)
1. S → a S b S
2. S → a S
3. S → c
(d)
1. S → aSbc
2. S → AB
3. A → a
4. B → b
8. Show a pushdown machine that will accept each of the following languages:
(a) {aⁿbᵐ} where m > n > 0
(b) a*(a+b)c*
(c) {aⁿbⁿcᵐdᵐ} where m, n ≥ 0
(d) {aⁿbᵐcᵐdⁿ} where m, n > 0
(e) {Nᵢ c (Nᵢ₊₁)ʳ}
where Nᵢ is a binary representation of the integer i, and (Nᵢ)ʳ is Nᵢ
written right to left (reversed). Examples:
i A string which should be accepted
19 10011c00101
19 10011c001010
15 1111c00001
15 1111c0000100
Hint: Use the first state to push Ni onto the stack until the c is read. Then
use another state to pop the stack as long as the input is the complement
of the stack symbol, until the top stack symbol and the input symbol are
equal. Then use a third state to ensure that the remaining input symbols
match the symbols on the stack. A fourth state can be used to allow for
leading (actually, trailing) zeros after the c.
9. Show the output and the sequence of stacks for the machine of Figure 3.8
for each of the following input strings:
(a) a + a * a ↵
(b) (a + a) * a ↵
(c) (a) ↵
(d) ((a)) ↵
10. Show a grammar and an extended pushdown machine for the language of
prefix expressions involving addition and multiplication. Use the terminal
symbol a to represent a variable or constant. Example: *+aa*aa
11. Show a pushdown machine to accept palindromes over {0,1} with center-
marker c. This is the language, Pc, referred to in section 3.0.5.
12. Show a grammar for the language of valid regular expressions (as defined in
section 2.0) over the alphabet {0,1}. You may assume that concatenation
is always represented by a raised dot. An example of a string in this
language would be:
(0 + 1·1)*·0
An example of a string not in this language would be:
((0 + +1)
Hint: Think about grammars for arithmetic expressions.
3.1. AMBIGUITIES IN PROGRAMMING LANGUAGES 87
A derivation tree for the input string var + var * var is shown in Figure 3.9.
The student should verify that there is no other derivation tree for
this input string, and that the grammar is not ambiguous. Also note that in
any derivation tree using this grammar, subtrees correspond to subexpressions,
according to the usual precedence rules. The derivation tree in Figure 3.9 in-
dicates that the multiplication takes precedence over the addition. The left
associativity rule would also be observed in a derivation tree for var + var +
var.
Another example of ambiguity in programming languages is the conditional
statement as defined by grammar G6:
G6:
1. Stmt → IfStmt
2. IfStmt → if ( BoolExpr ) Stmt
3. IfStmt → if ( BoolExpr ) Stmt else Stmt
Figure 3.10: Two different derivation trees for the string if ( Expr ) if ( Expr ) Stmt else Stmt
Stmt is completely defined. For the present example we will show derivation
trees in which some of the leaves are left as nonterminals. Two different deriva-
tion trees for the input string if (BoolExpr) if (BoolExpr) Stmt else Stmt are
shown in Figure 3.10. In this grammar, a BoolExpr is any expression which
results in a boolean (true/false) value. A Stmt is any statement, including if
statements. This ambiguity is normally resolved by informing the programmer
that an else is always associated with the closest previous unmatched if. Thus,
the second derivation tree in Figure 3.10 corresponds to the correct interpreta-
tion. The grammar G6 can be rewritten with an equivalent grammar which is
not ambiguous:
G7:
1. Stmt → IfStmt
2. IfStmt → Matched
3. IfStmt → Unmatched
4. Matched → if ( BoolExpr ) Matched else Matched
5. Matched → OtherStmt
6. Unmatched → if ( BoolExpr ) Stmt
7. Unmatched → if ( BoolExpr ) Matched else Unmatched
Figure 3.11: A derivation tree for the string if ( Expr ) if ( Expr ) Stmt else Stmt using grammar G7
3.1.1 Exercises
1. Show derivation trees for each of the following input strings using grammar
G5.
(a) var ∗ var
(b) (var ∗ var) + var
(c) (var)
(d) var ∗ var ∗ var
S → aAb      S → xBy
A → bAa      B → yBx
A → a        B → x
5. How many different derivation trees are there for each of the following if
statements using grammar G6?
(a) if ( BoolExpr ) Stmt
(b) if ( BoolExpr ) Stmt else if ( BoolExpr ) Stmt
(c) if ( BoolExpr ) if ( BoolExpr ) Stmt else Stmt else Stmt
(d) if ( BoolExpr ) if ( BoolExpr ) if ( BoolExpr ) Stmt else Stmt
A parsing algorithm is one which solves the parsing problem for a particular
class of grammars. A good parsing algorithm will be applicable to a large class
of grammars and will accommodate the kinds of rewriting rules normally found
in grammars for programming languages. For context-free grammars, there are
two kinds of parsing algorithms: bottom up and top down. These terms refer
to the sequence in which the derivation tree of a correct input string is built. A
parsing algorithm is needed in the syntax analysis phase of a compiler.
There are parsing algorithms which can be applied to any context-free gram-
mar, employing a complete search strategy to find a parse of the input string.
These algorithms are generally considered unacceptable since they are too slow;
they cannot run in polynomial time (see Aho et al. [1], for example).
3.3 Summary
This chapter on syntax analysis serves as an introduction to the chapters on
parsing (chapters 4 and 5). In order to understand what is meant by parsing and
how to use parsing algorithms, we first introduce some theoretic and linguistic
concepts and definitions.
We define a grammar as a finite list of rewriting rules involving terminal and
nonterminal symbols, and we classify grammars in a hierarchy according to com-
plexity. As we impose more restrictions on the rewriting rules of a grammar, we
arrive at grammars for less complex languages. The four classifications of gram-
mars (and languages) are (0) unrestricted, (1) context-sensitive, (2) context-free,
and (3) right linear. The context-free grammars will be most useful in the syn-
tax analysis phase of the compiler, since they are used to specify programming
languages.
We define derivations and derivation trees for context-free grammars, which
show the structure of a derived string. We also define ambiguous grammars as
those which permit two different derivation trees for the same input string.
Pushdown machines are defined as machines having an infinite stack and are
The parsing problem was defined in section 3.2 as follows: given a grammar and
an input string, determine whether the string is in the language of the grammar,
and, if so, determine its structure. Parsing algorithms are usually classified as
either top down or bottom up, which refers to the sequence in which a derivation
tree is built or traversed; in this chapter we consider only top down algorithms.
In a top down parsing algorithm, grammar rules are applied in a sequence
which corresponds to a general top down direction in the derivation tree. For
example, consider the grammar:
G8:
1. S → aSb
2. S → bAc
3. A → bS
4. A → a
We show a derivation tree for the input string abbbaccb in Figure 4.1. A
parsing algorithm will read one input symbol at a time and try to decide, using
the grammar, whether the input string can be derived. A top down algorithm
will begin with the starting nonterminal and try to decide which rule of the
grammar should be applied. In the example of Figure 4.1, the algorithm is able
to make this decision by examining a single input symbol and comparing it with
the first symbol on the right side of the rules. Figure 4.2 shows the sequence
of events, as input symbols are read, in which the numbers in circles indicate
which grammar rules are being applied, and the underscored symbols are the
ones which have been read by the parser. Careful study of Figures 4.1 and 4.2
reveals that this sequence of events corresponds to a top down construction of
the derivation tree.
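To see what these one-symbol decisions look like in code, here is a small recognizer for G8 written in the recursive descent style developed later in this chapter. It reads from a String rather than the keyboard, '$' stands in for the endmarker, and all names are our own:

```java
public class G8Parser {
    // G8:  1. S -> aSb   2. S -> bAc   3. A -> bS   4. A -> a
    private final String in;
    private int pos;
    private char inp;
    private boolean ok = true;

    private G8Parser(String s) { in = s; next(); }
    private void next() { inp = pos < in.length() ? in.charAt(pos++) : '$'; }

    private void S() {
        if (inp == 'a') { next(); S(); expect('b'); }      // rule 1: first symbol a
        else if (inp == 'b') { next(); A(); expect('c'); } // rule 2: first symbol b
        else ok = false;
    }
    private void A() {
        if (inp == 'b') { next(); S(); }                   // rule 3: first symbol b
        else if (inp == 'a') next();                       // rule 4: first symbol a
        else ok = false;
    }
    private void expect(char c) { if (inp == c) next(); else ok = false; }

    public static boolean parse(String s) {
        G8Parser p = new G8Parser(s);
        p.S();
        return p.ok && p.inp == '$';                       // all input must be consumed
    }

    public static void main(String[] args) {
        if (!parse("abbbaccb")) throw new AssertionError();
        if (!parse("bac")) throw new AssertionError();
        if (parse("ab") || parse("")) throw new AssertionError();
    }
}
```

Running parse("abbbaccb") applies the rules in exactly the top down order suggested by Figures 4.1 and 4.2.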
In this chapter, we describe some top down parsing algorithms and, in addi-
tion, we show how they can be used to generate output in the form of atoms or
syntax trees. This is known as syntax directed translation. However, we need to
begin by describing the subclass of context-free grammars which can be parsed
94 CHAPTER 4. TOP DOWN PARSING
Figure 4.1: A derivation tree for abbbaccb using grammar G8
Figure 4.2: Sequence of events in a top down parse of the string abbbaccb using
grammar G8
top down. In order to do this we begin with some preliminary definitions from
discrete mathematics.
Note that (a,b) and (b,a) are not the same. Sometimes the name of a rela-
tion is used to list the elements of the relation:
4<9
5 < 22
2<3
−3 < 0
Figure 4.3: Transitive (top) and reflexive (bottom) aspects of a relation
Solution:
R1*:
(a,b)
(c,d)
(b,a) (from R1)
(b,c)
(c,c)
(a,c)
(b,d) (transitive)
(a,d)
(a,a)
(b,b) (reflexive)
(d,d)
Note in Sample Problem 4.0.1 that we computed the transitive entries before
the reflexive entries. The pairs can be listed in any order, but reflexive entries
can never be used to derive new transitive pairs, consequently the reflexive pairs
were listed last.
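The closure computation, transitive entries first and reflexive entries last, can be sketched with Warshall's algorithm over a boolean matrix. The encoding of relation elements as array indices, and the class name, are our own:

```java
import java.util.Arrays;

public class Closure {
    // Reflexive transitive closure of a relation on {0, ..., n-1}:
    // transitive entries first (Warshall's algorithm), then reflexive entries.
    public static boolean[][] closure(boolean[][] r) {
        int n = r.length;
        boolean[][] c = new boolean[n][];
        for (int i = 0; i < n; i++) c[i] = Arrays.copyOf(r[i], n);
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    if (c[i][k] && c[k][j]) c[i][j] = true;   // transitive entries
        for (int i = 0; i < n; i++) c[i][i] = true;           // reflexive entries
        return c;
    }

    public static void main(String[] args) {
        // Hypothetical relation {(a,b),(b,c),(c,d)} with a=0, b=1, c=2, d=3.
        boolean[][] r = new boolean[4][4];
        r[0][1] = r[1][2] = r[2][3] = true;
        boolean[][] c = closure(r);
        if (!c[0][3]) throw new AssertionError();  // (a,d) added transitively
        if (!c[0][0]) throw new AssertionError();  // (a,a) added reflexively
        if (c[1][0]) throw new AssertionError();   // (b,a) is not in the closure
    }
}
```

Note that, as the text observes, adding the reflexive pairs last cannot create any new transitive pairs, so the two phases may safely be done in this order.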
4.0.1 Exercises
1. Show the reflexive transitive closure of each of the following relations:
2. The mathematical relation less than is denoted by the symbol <. Some of
the elements of this relation are: (4,5) (0,16) (-4,1) (1.001,1.002). What
do we normally call the relation which is the reflexive transitive closure of
less than?
Figure 4.4 illustrates the construction of a derivation tree for the input string
abbaddd, using grammar G12. The parser must decide which of the three rules
to apply as input symbols are read. In Figure 4.4 the underscored input symbol
is the one which determines which of the three rules is to be applied, and is thus
used to guide the parser as it attempts to build a derivation tree. The input
symbols which direct the parser to use a particular rule are the members of the
selection set for that rule. In the case of simple grammars, there is exactly one
symbol in the selection set for each rule, but for other context-free grammars,
there could be several input symbols in the selection set.
Figure 4.4: Using the input symbol to guide the parsing of the string abbaddd
G13:
1. S → aSB
2. S → b
3. B → a
4. B → bBa
       a                  b                  ↵
S      Rep(BSa), Retain   Rep(b), Retain     Reject
B      Rep(a), Retain     Rep(aBb), Retain   Reject
a      Pop, Advance       Reject             Reject
b      Reject             Pop, Advance       Reject
∇      Reject             Reject             Accept

Initial stack: S ∇ (S on top)

Figure 4.5: A pushdown machine for grammar G13
Figure 4.6: Sequence of stacks for machine of Figure 4.5 for input aba (the top
of each stack is written at the left):
S∇ → aSB∇ → SB∇ → bB∇ → B∇ → a∇ → ∇ → Accept
void parse ()
{ inp = getInp();
  S ();
  if (inp=='↵') accept(); else reject();
}
4.1. SIMPLE GRAMMARS 101
void S ()
{ if (inp==’a’) // apply rule 1
{ inp = getInp();
S ();
B ();
} // end rule 1
else if (inp==’b’) inp = getInp(); // apply rule 2
else reject();
}
void B ()
{ if (inp==’a’) inp = getInp(); // rule 3
else if (inp==’b’) // apply rule 4
{ inp = getInp();
B();
if (inp==’a’) inp = getInp();
else reject();
} // end rule 4
else reject();
}
char getInp()
{ try
{ return (char) System.in.read(); }
catch (IOException ioe)
{ System.out.println ("IO error " + ioe); }
return ’#’; // must return a char
}
}
Note that the main method (parse) reads the first input character before
calling the method for nonterminal S (the starting nonterminal). Each method
assumes that the initial input symbol in the example it is looking for has been
read before that method has been called. It then uses the selection set to deter-
mine which of the grammar rules defining that nonterminal should be applied.
The method S calls itself (because the nonterminal S is defined in terms of it-
self), hence the name recursive descent. When control is returned to the parse
method, it must ensure that the entire input string has been read before accept-
ing it. The methods accept() and reject() simply indicate whether or not the
input string is in the language of the grammar. The method getInp() is used
to obtain one character from the standard input file (keyboard). In subsequent
examples, we omit the main, accept, reject, and getInp methods to focus atten-
tion on the concepts of recursive descent parsing. The student should perform
careful hand simulations of this program for various input strings, to understand
it fully.
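For hand experiments it is convenient to restate the parser above so that it reads from a String instead of the keyboard. This adaptation and its names are ours; '$' stands in for the endmarker, and parse returns true exactly when the whole input is accepted:

```java
public class G13Parser {
    // G13:  1. S -> aSB   2. S -> b   3. B -> a   4. B -> bBa
    private final String in;
    private int pos;
    private char inp;
    private boolean ok = true;

    private G13Parser(String s) { in = s; inp = getInp(); }
    private char getInp() { return pos < in.length() ? in.charAt(pos++) : '$'; }

    private void S() {
        if (inp == 'a') { inp = getInp(); S(); B(); }     // apply rule 1
        else if (inp == 'b') inp = getInp();              // apply rule 2
        else ok = false;
    }
    private void B() {
        if (inp == 'a') inp = getInp();                   // apply rule 3
        else if (inp == 'b') {                            // apply rule 4
            inp = getInp(); B();
            if (inp == 'a') inp = getInp(); else ok = false;
        } else ok = false;
    }

    public static boolean parse(String s) {
        G13Parser p = new G13Parser(s);
        p.S();
        return p.ok && p.inp == '$';                      // all input must be consumed
    }

    public static void main(String[] args) {
        if (!parse("b") || !parse("aba") || !parse("abbaa")) throw new AssertionError();
        if (parse("ab") || parse("")) throw new AssertionError();
    }
}
```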
Solution:
void A()
{ if (inp==’0’) // apply rule 3
{ getInp();
S();
if (inp==’0’) getInp();
else reject();
} // end rule 3
else if (inp==’1’) getInp() // apply rule 4
else reject();
}
4.1.3 Exercises
1. Determine which of the following grammars are simple. For those which
are simple, show an extended one-state pushdown machine to accept the
language of that grammar.
(a)
1. S → a S b
2. S → b
(b)
1. Expr → Expr + Term
2. Expr → Term
3. Term → var
4. Term → ( Expr )
(c)
1. S → aAbB
2. A → bA
3. A → a
4. B → bA
(d)
1. S → aAbB
2. A → bA
3. A → b
4. B → bA
(e)
1. S → aAbB
2. A → bA
3. A → ε
4. B → bA
2. Show the sequence of stacks for the pushdown machine of Figure 4.5 for
each of the following input strings:
(a) aba↵
(b) abbaa↵
(c) aababaa↵
(where N represents any nonterminal) as long as all rules defining the same
nonterminal have disjoint selection sets. For example, the following is a quasi-
simple grammar:
G14:
1. S → a A S
2. S → b
3. A → c A S
4. A → ε
In order to do a top down parse for this grammar, we will again have to find
the selection set for each rule. In order to find the selection set for ǫ rules (such
as rule 4) we first need to define some terms. The follow set of a nonterminal
A, designated Fol(A), is the set of all terminals (or the endmarker ↵) which can
immediately follow an A in an intermediate form derived from S↵, where S is
the starting nonterminal. For grammar G14, above, the follow set of S is {a,b,↵}
and the follow set of A is {a,b}, as shown by the following derivations:
S↵ ⇒ aAS↵ ⇒ acASS↵ ⇒ acASaAS↵
⇒ acASb↵
Fol(S) = {a,b,↵}
Figure 4.7: Construction of a Parse Tree for acbb using selection sets
4.2. QUASI-SIMPLE GRAMMARS 107
Figure 4.8: A pushdown machine for grammar G14

       a                  b               c                  ↵
S      Rep(SAa), Retain   Rep(b), Retain  Reject             Reject
A      Pop, Retain        Pop, Retain     Rep(SAc), Retain   Reject
a      Pop, Advance       Reject          Reject             Reject
b      Reject             Pop, Advance    Reject             Reject
c      Reject             Reject          Pop, Advance       Reject
∇      Reject             Reject          Reject             Accept

Initial stack: S ∇ (S on top)
char inp;
void parse ()
{ inp = getInp();
S ();
if (inp==’N’) accept();
else reject();
}
void S ()
{ if (inp==’a’) // apply rule 1
{ inp = getInp();
A();
S();
}
// end rule 1
else if (inp==’b’) inp = getInp(); // apply rule 2
else reject();
}
void A ()
{ if (inp==’c’) // apply rule 3
{ inp = getInp();
A ();
S ();
}
// end rule 3
else if (inp==’a’ || inp==’b’) ; // apply rule 4
else reject();
}
Note that rule 4 is applied in method A() when the input symbol is a or
b. Rule 4 is applied by returning to the calling method without reading any
input characters. This is done by making use of the fact that Java permits null
statements (at the comment // apply rule 4). It is not surprising that a null
statement is used to process the null string.
For example, in the pushdown machine of Figure 4.8, for the row labeled
A, we have filled in Pop, Retain under the input symbols a and b, but Reject
under the input symbol ←֓; the reason is that the selection set for the ǫ rule is
{a,b}. If we had not computed the selection set, we could have filled in all three
of these cells with Pop, Retain, and the machine would have produced the same
end result for any input.
Find the selection sets for the following grammar. Is the gram-
mar quasi-simple? If so, show a pushdown machine and a recursive
descent parser (show methods S() and A() only) corresponding to
this grammar.
1. S → bAb
2. S → a
3. A → ǫ
4. A → aSa
Solution:
In order to find the selection set for rule 3, we need to find the fol-
low set of the nonterminal A. This is the set of terminals (including
↵) which could follow an A in a derivation from S↵.
S↵ ⇒ bAb↵
The terminal b immediately follows an A in the derivation shown
above. We cannot find any other terminals that can follow an A in
a derivation from S↵. Therefore, Fol(A) = {b}. The selection
sets can now be listed:
Sel(1) = {b}
Sel(2) = {a}
Sel(3) = Fol(A) = {b}
Sel(4) = {a}
       a                  b                  ↵
S      Rep(a), Retain     Rep(bAb), Retain   Reject
A      Rep(aSa), Retain   Pop, Retain        Reject
a      Pop, Advance       Reject             Reject
b      Reject             Pop, Advance       Reject
∇      Reject             Reject             Accept

Initial stack: S ∇ (S on top)
void S()
{ if (inp==’b’) // apply rule 1
{ getInp();
A();
if (inp==’b’) getInp();
else reject();
} // end rule 1
else if (inp==’a’) getInp(); // apply rule 2
else reject();
}
void A()
{ if (inp==’b’) ; // apply rule 3
else if (inp==’a’) getInp() // apply rule 4
{ getInp();
S();
if (inp==’a’) getInp();
else reject();
} // end rule 4
else reject();
}
4.2.4 Exercises
1. Show the sequence of stacks for the pushdown machine of Figure 4.8 for
each of the following input strings:
(a) ab↵
(b) acbb↵
(c) aab↵
2. Show a derivation tree for each of the input strings in Problem 1, using
grammar G14. Number the nodes of the tree to indicate the sequence in
which they were applied by the pushdown machine.
1. S → aAbS
2. S → ε
3. A → aSb
4. A → ε
In this discussion, the phrase any string always includes the null string,
unless otherwise stated. As each step is explained, we present the result of that
step when applied to the example, grammar G15.
G15:
1. S → ABc
2. A → bA
3. A → ε
4. B → c
S BW A
S BW B (from BDW)
A BW b
B BW c
S BW b (transitive)
S BW c
4.3. LL(1) GRAMMARS 113
S BW S
A BW A
B BW B (reflexive)
b BW b
c BW c
First(S) = {b,c}
First(A) = {b}
First(B) = {c}
First(b) = {b}
First(c) = {c}
In other words, find the union of the First(x) sets for each symbol on the
right side of a rule, but stop when reaching a non-nullable symbol.
For G15:
If the grammar contains no nullable rules, you may skip to step 12 at this
point.
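The rule of step 5, take the union of First(x) for each right-side symbol and stop at the first non-nullable one, can be sketched as an iterative fixpoint computation. The encoding below (uppercase letters as nonterminals, ε written as the empty string) and all names are our own:

```java
import java.util.*;

public class FirstSets {
    // Computes First sets: the First set of a right side is the union of
    // First(x) for each symbol x, stopping at the first non-nullable symbol.
    public static Map<Character, Set<Character>> first(
            Map<Character, List<String>> rules, Set<Character> nullable) {
        Map<Character, Set<Character>> first = new HashMap<>();
        for (char nt : rules.keySet()) first.put(nt, new HashSet<>());
        boolean changed = true;
        while (changed) {                                    // iterate to a fixpoint
            changed = false;
            for (Map.Entry<Character, List<String>> e : rules.entrySet()) {
                for (String rhs : e.getValue()) {
                    for (char x : rhs.toCharArray()) {
                        Set<Character> f = first.get(e.getKey());
                        if (Character.isUpperCase(x)) {      // nonterminal
                            if (f.addAll(first.get(x))) changed = true;
                            if (!nullable.contains(x)) break; // stop: not nullable
                        } else {                             // terminal
                            if (f.add(x)) changed = true;
                            break;
                        }
                    }
                }
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // Grammar G15: S -> ABc, A -> bA | epsilon, B -> c  (A is nullable)
        Map<Character, List<String>> g15 = Map.of(
                'S', List.of("ABc"), 'A', List.of("bA", ""), 'B', List.of("c"));
        Map<Character, Set<Character>> f = first(g15, Set.of('A'));
        if (!f.get('S').equals(Set.of('b', 'c'))) throw new AssertionError();
        if (!f.get('A').equals(Set.of('b'))) throw new AssertionError();
        if (!f.get('B').equals(Set.of('c'))) throw new AssertionError();
    }
}
```

The assertions in main agree with the First sets computed for G15 in step 4 above.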
Step 6. Compute the relation Is Followed Directly By :
B FDB X if there is a rule of the form
A → αBβXγ
where β is a string of nullable nonterminals, α, γ are strings of symbols, X
is any symbol, and A and B are nonterminals.
For G15:
c EO S
A EO A (from DEO)
b EO A
c EO B
(no transitive entries)
c EO c
S EO S (reflexive)
b EO b
B EO B
W EO X
X FDB Y
Y BW Z
then W FB Z
For G15:
A EO A A FDB B B BW B A FB B
B BW c A FB c
b EO A B BW B b FB B
B BW c b FB c
B EO B B FDB c c BW c B FB c
c EO B c BW c c FB c
Notice that since we are looking for the follow set of a nullable nonterminal
in step 12, we have actually done much more than necessary in step 9. In step
9 we need produce only those pairs of the form A FB t, where A is a nullable
nonterminal and t is a terminal.
The algorithm is summarized in Figure 4.9. A context-free grammar is
LL(1) if rules defining the same nonterminal always have disjoint selection sets.
Grammar G15 is LL(1) because rules 2 and 3 (defining the nonterminal A) have
disjoint selection sets (the selection sets for those rules have no terminal symbols
in common). Note that if there are no nullable rules in the grammar, we can
get the selection sets directly from step 5 – i.e., we can skip steps 6-11. A graph
showing the dependence of any step in this algorithm on the results of other
steps is shown in Figure 4.10. For example, this graph shows that the results of
steps 3,6, and 8 are needed for step 9.
Figure 4.10: Dependency graph for the steps in the algorithm for finding selection sets
Figure 4.11: A pushdown machine for grammar G15

       b                  c                  ↵
S      Rep(cBA), Retain   Rep(cBA), Retain   Reject
A      Rep(Ab), Retain    Pop, Retain        Reject
B      Reject             Rep(c), Retain     Reject
b      Pop, Advance       Reject             Reject
c      Reject             Pop, Advance       Reject
∇      Reject             Reject             Accept

Initial stack: S ∇ (S on top)
selection set. For each terminal symbol, enter Pop, Advance in the cell in the
row and column labeled with that terminal. The cell in the row labeled ∇ and
the column labeled ↵ should contain Accept. All other cells are Reject. The
pushdown machine for grammar G15 is shown in Figure 4.11.
void parse ()
{ getInp();
  S ();
  if (inp=='↵') accept(); else reject();
}
void S ()
{ if (inp==’b’ || inp==’c’) // apply rule 1
{ A ();
B ();
if (inp==’c’) getInp();
else reject();
} // end rule 1
else reject();
}
void A ()
{ if (inp==’b’) // apply rule 2
{ getInp();
A ();
} // end rule 2
else if (inp==’c’) ; // apply rule 3
else reject();
}
void B ()
{ if (inp==’c’) getInp(); // apply rule 4
else reject();
}
Note that when processing rule 1, an input symbol is not read until a terminal
is encountered in the grammar rule (after checking for b or c, an input symbol
should not be read before calling procedure A).
Show the sequence of stacks that occurs when the pushdown machine
of Figure 4.11 parses the string bcc↵
Solution:
S∇ → ABc∇ → bABc∇ → ABc∇ → Bc∇ → cc∇ → c∇ → ∇ → Accept
(the top of each stack is written at the left)
4.3.3 Exercises
1. Given the following information, find the Followed By relation (FB) as
described in step 9 of the algorithm for finding selection sets:
A EO A A FDB D D BW b
A EO B B FDB a b BW b
B EO B a BW a
2. Find the selection sets of the following grammar and determine whether
it is LL(1).
1. S → ABD
2. A → aA
3. A → ε
4. B → bB
5. B → ε
6. D → dD
7. D → ε
5. Step 3 of the algorithm for finding selection sets is to find the Begins With
relation by forming the reflexive transitive closure of the Begins Directly
With relation. Then add ’pairs of the form a BW a for each terminal a in
the grammar’; i.e., there could be terminals in the grammar which do not
4.4. PARSING ARITHMETIC EXPRESSIONS TOP DOWN 121
In order to determine whether this grammar is LL(1), we must first find the
selection set for each rule in the grammar. We do this by using the twelve step
algorithm given in Section 4.3.
3. Expr BW Expr
Expr BW Term
Term BW Term
Term BW Factor
Factor BW (
Factor BW var
Factor BW Factor
( BW (
var BW var
Expr BW Factor
Expr BW (
Expr BW var
Term BW (
Term BW var
* BW *
+ BW +
) BW )
4. First(Expr) = {(,var}
First(Term) = {(,var}
First(Factor) = {(,var}
Since there are no nullable rules in the grammar, we can obtain the selection
sets directly from step 5. This grammar is not LL(1) because rules 1 and 2
define the same nonterminal, Expr, and their selection sets intersect. This is
also true for rules 3 and 4.
Incidentally, the fact that grammar G5 is not suitable for top down parsing
can be determined much more easily by inspection of the grammar. Rules 1 and
3 both have a property known as left recursion:
1. Expr → Expr + Term
3. Term → Term * Factor
They are in the form:
A → Aα
Note that any rule in this form cannot be parsed top down. To see this,
consider the method for the nonterminal A in a recursive descent parser. The
first thing it would do would be to call itself, thus producing infinite recursion
with no base case (or ’escape hatch’ from the recursion). Any grammar with
left recursion cannot be LL(1).
The left recursion can be eliminated by rewriting the grammar with an equiv-
alent grammar that does not have left recursion. In general, the offending rule
might be in the form:
A → Aα
A → β
in which case it can be replaced with:
A → β A'
A' → α A'
A' → ε
where A' is a new nonterminal. Applying this transformation to grammar G5
produces grammar G16, in which the new nonterminals are called Elist and
Tlist:
G16:
1. Expr → Term Elist
2. Elist → + Term Elist
3. Elist → ε
4. Term → Factor Tlist
5. Tlist → * Factor Tlist
6. Tlist → ε
7. Factor → ( Expr )
8. Factor → var
Note in grammar G16 that an Expr is still the sum of one or more Terms
and a Term is still the product of one or more Factors, but the left recursion has
been eliminated from the grammar. We will see later that this grammar also
defines the precedence of operators as desired. The student should construct
several derivation trees using grammar G16 in order to be convinced that it is
not ambiguous.
We now wish to determine whether this grammar is LL(1), using the algo-
rithm to find selection sets:
3. Expr BW Term
Elist BW +
Term BW Factor (from BDW)
Tlist BW *
Factor BW (
Factor BW var
Expr BW Factor
Term BW (
Term BW var (transitive)
Expr BW (
Expr BW var
Expr BW Expr
Term BW Term
Factor BW Factor
Elist BW Elist
Tlist BW Tlist (reflexive)
+ BW +
* BW *
( BW (
var BW var
) BW )
8. Elist EO Expr
Term EO Expr
Elist EO Elist
Term EO Elist
Tlist EO Term
Factor EO Term (from DEO)
Tlist EO Tlist
Factor EO Tlist
) EO Factor
var EO Factor
Tlist EO Expr
Tlist EO Elist
Factor EO Expr
Factor EO Elist
) EO Term
) EO Tlist
) EO Expr (transitive)
) EO Elist
var EO Term
var EO Tlist
var EO Expr
var EO Elist
Expr EO Expr
Term EO Term
Factor EO Factor
) EO )
var EO var (reflexive)
+ EO +
* EO *
( EO (
Elist EO Elist
Tlist EO Tlist
10.
Elist FB ↵
Term FB ↵
Expr FB ↵
Tlist FB ↵
Factor FB ↵
11.
Since all rules defining the same nonterminal (rules 2 and 3, rules 5 and 6,
rules 7 and 8) have disjoint selection sets, the grammar G16 is LL(1).
In step 9 we could have listed several more entries in the FB relation. For
example, we could have listed pairs such as var FB + and Tlist FB Elist.
These were not necessary, however; this is clear if one looks ahead to step 11,
where we construct the follow sets for nullable nonterminals. This means we
need to use only those pairs from step 9 which have a nullable nonterminal on
the left and a terminal on the right. Thus, we will not need var FB + because
the left member is not a nullable nonterminal, and we will not need Tlist FB
Elist because the right member is not a terminal.
Solution:

        +                  *                    (                  )        var                ↵
Expr    Reject             Reject               Rep(Elist Term)   Reject    Rep(Elist Term)   Reject
                                                Retain                      Retain
Elist   Rep(Elist Term +)  Reject               Reject            Pop       Reject            Pop
        Retain                                                    Retain                      Retain
Term    Reject             Reject               Rep(Tlist Factor) Reject    Rep(Tlist Factor) Reject
                                                Retain                      Retain
Tlist   Pop                Rep(Tlist Factor *)  Reject            Pop       Reject            Pop
        Retain             Retain                                 Retain                      Retain
Factor  Reject             Reject               Rep( )Expr( )     Reject    Rep(var)          Reject
                                                Retain                      Retain
int inp;
final int var = 256;
final int endmarker = 257;
void Expr()
{ if (inp=='(' || inp==var) // apply rule 1
{ Term();
Elist();
} // end rule 1
else reject();
}
void Elist()
{ if (inp=='+') // apply rule 2
{ getInp();
Term();
Elist();
} // end rule 2
else if (inp==')' || inp==endmarker)
; // apply rule 3, null statement
else reject();
}
void Term()
{ if (inp=='(' || inp==var) // apply rule 4
{ Factor();
Tlist();
} // end rule 4
else reject();
}
void Tlist()
{ if (inp=='*') // apply rule 5
{ getInp();
Factor();
Tlist();
} // end rule 5
else if (inp=='+' || inp==')'
|| inp==endmarker)
; // apply rule 6, null statement
else reject();
}
void Factor()
{ if (inp=='(') // apply rule 7
{ getInp();
Expr();
if (inp==')') getInp();
else reject();
} // end rule 7
else if (inp==var) getInp(); // apply rule 8
else reject();
}
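The parser above depends on helper methods getInp() and reject() that are defined elsewhere in the text. The following self-contained sketch is an illustration only (the class name, the String input, and the character 'v' standing in for the var token are assumptions made here); it packages the same five methods so the G16 parser can be run directly:

```java
// A runnable sketch of the G16 recursive descent parser.
// 'v' stands in for the var token and '$' is the endmarker.
public class G16Parser {
    private final String in;
    private int pos = 0;
    private boolean ok = true;

    public G16Parser(String s) { in = s + '$'; }
    private char inp() { return in.charAt(pos); }
    private void getInp() { if (pos < in.length() - 1) pos++; }
    private void reject() { ok = false; }

    // Accepts iff the whole input derives from Expr in grammar G16.
    public boolean parse() { expr(); return ok && inp() == '$'; }

    private void expr() {                        // rule 1: Expr -> Term Elist
        if (inp() == '(' || inp() == 'v') { term(); elist(); }
        else reject();
    }
    private void elist() {                       // rules 2,3: Elist -> + Term Elist | epsilon
        if (inp() == '+') { getInp(); term(); elist(); }
        else if (inp() == ')' || inp() == '$') ; // epsilon
        else reject();
    }
    private void term() {                        // rule 4: Term -> Factor Tlist
        if (inp() == '(' || inp() == 'v') { factor(); tlist(); }
        else reject();
    }
    private void tlist() {                       // rules 5,6: Tlist -> * Factor Tlist | epsilon
        if (inp() == '*') { getInp(); factor(); tlist(); }
        else if (inp() == '+' || inp() == ')' || inp() == '$') ;
        else reject();
    }
    private void factor() {                      // rules 7,8: Factor -> ( Expr ) | var
        if (inp() == '(') {
            getInp(); expr();
            if (inp() == ')') getInp(); else reject();
        }
        else if (inp() == 'v') getInp();
        else reject();
    }
}
```

For example, new G16Parser("(v+v)*v").parse() accepts, while new G16Parser("v+*v").parse() rejects.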
4.4.1 Exercises
1. Show derivation trees for each of the following input strings, using gram-
mar G16.
Write the procedure for the nonterminal Stmt. Assume the selection set
for rule 4 is {(, identifier, number}.
6. (a) Show an LL(1) grammar for the language of regular expressions over
the alphabet {0,1}. Assume that concatenation is always designated
by a raised dot.
(b) Show a derivation tree for the regular expression 1 + 0·1* (in the tree
it should be clear that 1* is a subexpression)
(c) Show the selection set for each rule in your grammar.
(d) Show a recursive descent parser corresponding to the grammar.
(e) Show a one state pushdown machine corresponding to the grammar.
7. Show how to eliminate the left recursion from each of the grammars shown
below:
(a) 1. A → Abc
2. A → ab
Figure 4.13: A derivation tree for the expression var+var*var using grammar
G17 (tree omitted)
        var                +                      *                      (                  )         ↵
Expr    Rep(Elist Term)                                                  Rep(Elist Term)
        Retain                                                           Retain
Elist                      Rep(Elist {+} Term +)                                            Pop       Pop
                           Retain                                                           Retain    Retain
Term    Rep(Tlist Factor)                                                Rep(Tlist Factor)
        Retain                                                           Retain
Tlist                      Pop                    Rep(Tlist {*} Factor *)                   Pop       Pop
                           Retain                 Retain                                    Retain    Retain
Factor  Rep({var} var)                                                   Rep( )Expr( )
        Retain                                                           Retain
var     Pop, Advance
+                          Pop, Advance
*                                                 Pop, Advance
(                                                                        Pop, Advance
)                                                                                           Pop, Advance
{var}   Pop, Retain, Out(var)   (same action in every column)
{+}     Pop, Retain, Out(+)     (same action in every column)
{*}     Pop, Retain, Out(*)     (same action in every column)
,                                                                                                     Accept

(The initial stack holds the bottom marker , with Expr on top; blank cells are Reject.)
The method S for the recursive descent translator will print the action symbol
print only if the input is an a. It is important always to check for the input
symbols in the selection set before processing action symbols. Also, rule 3 is
really an epsilon rule in the underlying grammar, since there are no terminals
or nonterminals. Using the algorithm for selection sets, we find that:
Sel(1) = {a}
Sel(2) = {b}
Sel(3) = {↵}
The recursive descent translator for grammar G18 is shown below:
void S ()
{ if (inp=='a')
{ getInp(); // apply rule 1
System.out.println ("print");
S();
} // end rule 1
else if (inp=='b')
{ getInp(); // apply rule 2
B();
} // end rule 2
else Reject ();
}
void B ()
{ if (inp==Endmarker) System.out.println ("print");
// apply rule 3
else Reject ();
}
With these concepts in mind, we can now write a recursive descent translator
to translate infix expressions to postfix according to grammar G17:
void Expr ()
{ if (inp=='(' || inp==var)
{ Term (); // apply rule 1
Elist ();
} // end rule 1
else Reject ();
}
void Elist ()
{ if (inp=='+')
{ getInp(); // apply rule 2
Term ();
System.out.println ('+');
Elist ();
} // end rule 2
else if (inp==Endmarker || inp==')') ; // apply rule 3
else Reject ();
}
void Term ()
{ if (inp=='(' || inp==var)
{ Factor (); // apply rule 4
Tlist ();
} // end rule 4
else Reject ();
}
void Tlist ()
{ if (inp=='*')
{ getInp(); // apply rule 5
Factor ();
System.out.println ('*');
Tlist ();
} // end rule 5
else if (inp=='+' || inp==')' || inp==Endmarker) ;
// apply rule 6
else Reject ();
}
void Factor ()
{ if (inp=='(')
{ getInp(); // apply rule 7
Expr ();
if (inp==')') getInp();
else Reject ();
} // end rule 7
else if (inp==var)
{ getInp(); // apply rule 8
System.out.println ("var");
} // end rule 8
else Reject ();
}
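As with the parser for G16, the translator above depends on helpers that appear elsewhere. The following self-contained sketch is an illustration (the class name, the String input, and 'v' standing in for the var token are assumptions); it collects the postfix output in a StringBuilder instead of printing each symbol:

```java
// A runnable sketch of the G17 infix-to-postfix translator.
// 'v' stands in for the var token; '$' is the endmarker.
public class PostfixTranslator {
    private final String in;
    private int pos = 0;
    private boolean ok = true;
    private final StringBuilder out = new StringBuilder();

    public PostfixTranslator(String s) { in = s + '$'; }
    private char inp() { return in.charAt(pos); }
    private void getInp() { if (pos < in.length() - 1) pos++; }
    private void reject() { ok = false; }
    private void put(String s) {
        if (out.length() > 0) out.append(' ');
        out.append(s);
    }

    // Returns the postfix string, or null if the input is rejected.
    public String translate() {
        expr();
        return (ok && inp() == '$') ? out.toString() : null;
    }
    private void expr() {                       // rule 1
        if (inp() == '(' || inp() == 'v') { term(); elist(); } else reject();
    }
    private void elist() {                      // rules 2 and 3
        if (inp() == '+') { getInp(); term(); put("+"); elist(); }
        else if (inp() == '$' || inp() == ')') ;  // epsilon
        else reject();
    }
    private void term() {                       // rule 4
        if (inp() == '(' || inp() == 'v') { factor(); tlist(); } else reject();
    }
    private void tlist() {                      // rules 5 and 6
        if (inp() == '*') { getInp(); factor(); put("*"); tlist(); }
        else if (inp() == '+' || inp() == ')' || inp() == '$') ;
        else reject();
    }
    private void factor() {                     // rules 7 and 8
        if (inp() == '(') {
            getInp(); expr();
            if (inp() == ')') getInp(); else reject();
        }
        else if (inp() == 'v') { getInp(); put("var"); }
        else reject();
    }
}
```

For example, new PostfixTranslator("v+v*v").translate() yields "var var var * +", the postfix form with the expected operator precedence.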
Solution:

          a                  b                  ↵
S         Rep(S {print} a)   Rep(B b)           Reject
          Retain             Retain
B         Reject             Reject             Rep({print})
                                                Retain
a         Pop, Advance       Reject             Reject
b         Reject             Pop, Advance       Reject
{print}   Pop, Retain        Pop, Retain        Pop, Retain
          Out(print)         Out(print)         Out(print)
,         Reject             Reject             Accept

(The initial stack holds the bottom marker , with S on top.)
4.5.3 Exercises
1. Consider the following translation grammar with starting nonterminal S,
in which action symbols are put out:
1. S → AbB
2. A → {w} a c
3. A → b A {x}
4. B → {y}
Show a derivation tree and the output string for the input bacb.
4. Write the Java statement which would appear in a recursive descent parser
for each of the following translation grammar rules:
(a) A → {w}a{x}BC
(b) A → a{w}{x}BC
(c) A → a{w}B{x}C
Expr23
+ Expr12 Expr11
is applied during the parse. One way for the student to understand attributed
grammars is to build derivation trees for attributed grammars. This is done
by first eliminating all attributes from the grammar and building a derivation
tree. Then, attribute values are entered in the tree according to the attribute
computation rules. Some attributes take their values from attributes of higher
nodes in the tree, and some attributes take their values from attributes of lower
nodes in the tree. For this reason, the process of filling in attribute values is
not straightforward.
As an example, grammar G19 is an attributed (simple) grammar for prefix
expressions involving addition and multiplication. The attributes, shown as
subscripts, are intended to evaluate the arithmetic expression. The attribute on
the terminal const is the value of the constant as supplied by the lexical phase.
Note that this kind of expression evaluation is typical of what is done in an
interpreter, but not in a compiler.
G19:
Dcl
  typeint   Varlistint
    vara   Listint
      ,   varb   Listint
        ,   varc   Listint
          ε

Figure 4.16: An attributed derivation tree for int a,b,c; using grammar G20
An attributed derivation tree for the input string int a,b,c is shown in
Figure 4.16. Note that the values of the attributes move either horizontally on
one level (rule 1) or down to a lower level (rules 2 and 3). It is important to
remember that the number and kind of attributes of a symbol must be consis-
tent throughout the entire grammar. For example, if the symbol Ai,s has two
attributes, the first being inherited and the second synthesized, then this must
be true everywhere the symbol A appears in the grammar.
// Wrapper class for ints which lets you change the value.
// This class is needed to implement attributed grammars
// with recursive descent
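The rest of the class is not reproduced in this excerpt. A minimal version consistent with its later use (the get and set methods appear in the Eval code below; everything else here is an illustrative assumption) might be:

```java
// Wrapper class for an int whose value can be changed. It simulates
// reference parameters, so that a called method can return a
// synthesized attribute through its parameter.
public class MutableInt {
    private int value;
    public MutableInt(int v) { value = v; }
    public int get() { return value; }
    public void set(int v) { value = v; }
    public String toString() { return Integer.toString(value); }
}
```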
In this example, the methods S, A, and B (A and B are not shown) all return
values in their parameters (they are synthesized attributes, implemented with
references), and there is no contradiction. However, assuming that the attribute
of the B is synthesized, and the attribute of the A is inherited, the following
rule could not be implemented:
S → aAp Bq p←q
In the method S, q will not have a value until method B has been called
and terminated. Therefore, it will not be possible to assign a value to p before
calling method A. This assumes, as always, that input is read from left to right.
Solution:
class RecDescent
{
// int codes for token classes
final int Num=0; // numeric constant
final int Op=1; // operator
final int End=2; // endmarker
Token token;
void Eval ()
{ MutableInt p = new MutableInt(0);
token = new Token();
token.getToken(); // Read a token from stdin
Expr(p);
// Show final result
if (token.getClass() == End)
System.out.println (p);
else reject();
}
} // end rule 1
else // should be a *, apply rule 2
{ token.getToken(); // read next token
Expr(q);
Expr(r);
p.set (q.get() * r.get());
} // end rule 2
else if (token.getClass() == Num) // is it a numeric constant?
{ p.set (token.getVal()); // apply rule 3
token.getToken(); // read next token
} // end rule 3
else reject();
}
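The opening of the Expr method is missing from the excerpt above. As an illustration of the same idea, the following self-contained sketch (an assumption, not the book's code) evaluates prefix expressions over single-digit constants with + and *, which is what rules 1, 2 and 3 describe:

```java
// Evaluates prefix expressions such as "+3*24" (= 3 + 2*4 = 11).
// Single-digit constants and the operators + and * only; this is a
// simplification of the book's Token-based version.
public class PrefixEval {
    private final String in;
    private int pos = 0;
    public PrefixEval(String s) { in = s; }
    public int eval() { return expr(); }
    private int expr() {
        char t = in.charAt(pos++);
        if (t == '+') return expr() + expr();  // rule 1: Expr -> + Expr Expr
        if (t == '*') return expr() * expr();  // rule 2: Expr -> * Expr Expr
        return t - '0';                        // rule 3: Expr -> const
    }
}
```

For example, new PrefixEval("+3*24").eval() returns 11, and new PrefixEval("*+123").eval() returns 9.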
4.6.2 Exercises
1. Consider the following attributed translation grammar with starting non-
terminal S, in which action symbols are output:
1. Sp → Aq br At p←r+t
2. Ap → ap {w}p c
3. Ap → bq Ar {x}p p←q+r
Show an attributed derivation tree for the input string a1 cb2 b3 a4 c, and
show the output symbols with attributes corresponding to this input.
3. Show an attributed derivation tree for each input string using the following
attributed grammar:
1. Sp → Aq,r Bt p←q∗t
r ←q+t
2. Ap,q → br At,u c u←r
p←r+t+u
3. Ap,q → ǫ p←0
4. Bp → ap
(a) a2
(b) b1 ca3
(c) b2 b3 cca4
MULT B C T1
ADD A T1 T2
ADD T2 D T3
Note that our translator will have to find temporary storage locations (or use
a stack) to store intermediate results at run time. It would indicate that the final
result of the expression is in T3. In the attributed translation grammar G21,
shown below, all nonterminal attributes are synthesized, with the exception of
the first attribute on Elist and Tlist, which are inherited:
G21:
The intent of the action symbol {ADD}p,r,s is to put out an ADD atom with
operands p and r and result s. In many rules, several symbols share an attribute;
this means that the attribute is to have the same value on those symbols. For
example, in rule 1 the attribute of Term is supposed to have the same value as
the first attribute of Elist. Consequently, those attributes are given the same
name. This also could have been done in rules 3 and 6, but we chose not to
do so in order to clarify the recursive descent parser. For this reason, only
four attribute computation rules were needed, two of which involved a call to
alloc(). alloc() is a method which allocates space for a temporary result and
returns a reference to it (in our examples, we will call these temporary results
T1, T2, T3, etc). The attribute of an ident token is the value part of that
token, indicating the run-time location for the variable.
Solution:
ExprT1
Terma  Elista,T1
identb  ε
class Expressions
{
Token token;
void eval ()
{ MutableInt p = new MutableInt(0);
token = new Token();
token.getToken();
if (token.getClass()==Token.Rpar)
token.getToken();
else reject();
} // end rule 7
else if (token.getClass()==Token.Ident
|| token.getClass()==Token.Num)
{ p.set(alloc()); // apply rule 8
token.getToken();
} // end rule 8
else reject();
}
4.7.2 Exercises
1. Show an attributed derivation tree for each of the following expressions,
using grammar G21. Assume that the alloc method returns a new tem-
porary location each time it is called (T1, T2, T3, ...).
(a) a + b * c
(b) (a + b) * c
(c) (a)
(d) a * b * c
Boolean expressions are expressions such as x > y, or y-2 == 33. For those stu-
dents familiar with C and C++, the comparison operators return ints (0=false,
1=true), but Java makes a distinction: the comparison operators return boolean
results (true or false). If you have ever spent hours debugging a C or C++ pro-
gram which contained if (x=3)... when you really intended if (x==3) ...,
then you understand the reason for this change. The Java compiler will catch
this error for you (assuming that x is not a boolean).
Assignment is also slightly different in Java. In C/C++ assignment is an
operator which produces a result in addition to the side effect of assigning a
value to a variable. A statement may be formed by following any expression
with a semicolon. This is not the case in Java. The expression statement must
be an assignment or a method call. Since there are no methods in Decaf, we're
left with an assignment.
if
(x==3)
[TST] // Branch to the Label only if
Stmt // x==3 is false
[Label]
/////////////////////////////////////////////////////
while
[Label1]
(x>2)
[TST] // Branch to Label2 only if
Stmt // x>2 is false
[JMP] // Unconditional branch to Label1
[Label2]
Recall our six comparison codes (== is 1, < is 2, > is 3, <= is 4, >= is 5,
!= is 6); to get the logical complement of any comparison, we simply subtract
the code from 7.
The TST atom represents a conditional branch in the object program. {TST}a,b,,c,x
will compare the values stored at a and b, using the comparison whose code is
c, and branch to a label designated x if the comparison is true. In the grammar
rule above the attribute of BoolExpr, Lbl, is synthesized and represents the tar-
get label for the TST atom to branch in the object program. The attribute of
the token compare, c, is an integer from 1-6 representing the comparison code.
The use of comparison code 7-c inverts the sense of the comparison as desired.
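The 7-c complement trick can be checked directly; this small sketch (the class and the lookup array are illustrative assumptions) maps each comparison code to its complement:

```java
// The six comparison codes and the effect of subtracting from 7:
// code 1 (==) complements to 6 (!=), 2 (<) to 5 (>=), 3 (>) to 4 (<=).
public class Complement {
    static final String[] OP = {"", "==", "<", ">", "<=", ">=", "!="};
    public static String complement(int c) { return OP[7 - c]; }
    public static void main(String[] args) {
        for (int c = 1; c <= 6; c++)
            System.out.println(OP[c] + " complements to " + complement(c));
    }
}
```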
4.8.3 Assignment
Next we handle assignment; an assignment is an operator which returns a result
that can be used as part of a larger expression. For example:
x = (y = 2) + (z = 3); // y is 2, z is 3, x is 5
(MOV, 3,,c)
(ADD, b,c,T1)
(MOV, T1,,a)
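The claim that an assignment yields a usable result can be confirmed by running the earlier statement as ordinary Java:

```java
// The inner assignments store 2 and 3 and also yield those values,
// so x receives their sum.
public class AssignDemo {
    // Returns {y, z, x} after executing x = (y = 2) + (z = 3);
    public static int[] run() {
        int x, y, z;
        x = (y = 2) + (z = 3);
        return new int[] { y, z, x };
    }
    public static void main(String[] args) {
        int[] r = run();
        System.out.println(r[0] + " " + r[1] + " " + r[2]);  // prints "2 3 5"
    }
}
```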
Note that the selection set for rule 2 is {ident} and the selection set for
rule 3 is {ident, (, num}. Since rules 2 and 3 define the same nonterminal, this
grammar is not LL(1). We can work around this problem by noticing that an
assignment expression must have an assignment operator after the identifier.
Thus, if we peek ahead one input token, we can determine whether to apply
rule 2 or 3. If the next input token is an assignment operator, apply rule 2;
if not, apply rule 3. This can be done with a slight modification to our token
class - a peek() method which returns the next input token, but which has no
effect on the next call to getInput(). The grammar shown above is said to be
LL(2) because we can parse it top down by looking at no more than two input
symbols at a time.
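One way to provide such a peek() method (the book's Token class is not shown here, so the names and the use of a Supplier below are assumptions) is to buffer a single token:

```java
import java.util.function.Supplier;

// A one-token lookahead buffer: peek() returns the next token without
// consuming it, so a later getToken() still sees the same token.
public class Lookahead<T> {
    private final Supplier<T> source;  // produces the raw token stream
    private T buffered = null;

    public Lookahead(Supplier<T> source) { this.source = source; }

    public T peek() {
        if (buffered == null) buffered = source.get();
        return buffered;
    }
    public T getToken() {
        if (buffered != null) { T t = buffered; buffered = null; return t; }
        return source.get();
    }
}
```

With this in place, the parser can test peek() for an assignment operator before deciding whether to apply rule 2 or rule 3.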
Solution:
BoolExprL1
Rvaluea  RvalueT1
AssignExprb  identc  ε
Rvalue3
Term3  Elist3,3
Factor3  Tlist3,3  ε
num3  ε
4.8.4 Exercises
1. Show an attributed derivation tree using the grammar for Decaf expres-
sions given in this section for each of the following expressions or boolean
expressions (in part (a) start with Expr; in parts (b,c,d,e) start with Bool-
Expr):
(a) a = b = c
(b) a == b + c
(c) (a=3) <= (b=2)
(d) a == - (c = 3)
(e) a * (b=3) + c != 9
2. Show the recursive descent parser for the nonterminals BoolExpr, Rvalue,
and Elist given in the grammar for Decaf expressions. Hint: the selection
sets for the first eight grammar rules are:
== is 1 <= is 4
< is 2 >= is 5
> is 3 != is 6
The LBL atom is used as a tag or marker so that JMP and TST atoms
can refer to a branch destination without knowing target machine addresses for
these destinations.
In addition, there is one more atom which is needed for assignment state-
ments; it is a Move atom, which simply assigns its source operand to its target
operand:
Figure 4.18 shows the sequence in which atoms are put out for the control
structures for while and for statements (from top to bottom). Figure 4.19
shows the same information for the if statement. These figures show the input
tokens, as well, so that you understand when the atoms are put out. The arrows
indicate flow of control at run time; the arrows do not indicate the sequence in
which the compiler puts out atoms. In Figures 4.18 and 4.19, the atoms are all
enclosed in a boundary: ADD and JMP atoms are enclosed in rectangles, LBL
atoms are enclosed in ovals, and TST atoms are enclosed in diamond shaped
parallelograms.
The control structures in Figure 4.18 correspond to the following statement
definitions:
1. Stmt → while ( BoolExpr ) Stmt
2. Stmt → for ( Expr ; BoolExpr ; Expr ) Stmt
For the most part, Figures 4.18 and 4.19 are self-explanatory. In Figure 4.19
we also show that a boolean expression always puts out a TST atom which
branches if the comparison is false. The boolean expression has an attribute
which is the target label for the branch. Note that for if statements , we must
jump around the statement which is not to be executed. This provides for a
relatively simple translation.
Unfortunately, the grammar shown above is not LL(1). This can be seen by
finding the follow set of ElsePart, and noting that it contains the keyword else.
Consequently, rules 4 and 5 do not have disjoint selection sets. However, it is
still possible to write a recursive descent parser. This is due to the fact that all
elses are matched with the closest preceding unmatched if. When our parser
for the nonterminal ElsePart encounters an else, it is never wrong to apply
rule 4 because the closest preceding unmatched if must be the one on top of
the recursive call stack. Aho et al. [1] claim that there is no LL(1) grammar
for this language. This is apparently one of those rare instances where theory
fails, but our practical knowledge of how things work comes to the rescue.
[Figure 4.18: the sequence of atoms put out for the while statement and the
for statement, from top to bottom (diagrams omitted). Figure 4.19: the
sequence of atoms put out for the if statement, with an optional else part
(diagram omitted). In both figures, the atom put out for a boolean expression
Exprp comparec Exprq is TST p,q,,7-c,Lbl, which branches when the
comparison is false.]
The for statement described in Figure 4.18 requires some additional expla-
nation. The following for statement and while statement are equivalent:

for (E1; E2; E3) Stmt          E1;
                               while (E2) { Stmt E3; }

This means that after the atoms for the stmt are put out, we must put out
a jump to the atoms corresponding to expression E3. In addition, there must
be a jump after the atoms of expression E3 back to the beginning of the loop,
where the boolean expression E2 is tested. The LL(2) grammar for Decaf shown
in the next section makes direct use of Figures 4.18 and 4.19 for the control
structures.
use of Figures 4.18 and 4.19 for the control structures.
Show the atom string which would be put out corresponding to
the following Java statement:
while (x > 0) Stmt
Solution:
(LBL, L1)
(TST,x,0,,4,L2) // Branch to L2 if x<=0
[atoms for Stmt]
(JMP,L1)
(LBL,L2)
4.9.1 Exercises
1. Show the sequence of atoms which would be put out according to Figures
4.18 and 4.19 for each of the following input strings:
(a) if (a==b)
while (x<y)
Stmt
(b) for (i = 1; i<=100; i = i+1)
for (j = 1; j<=i; j = j+1)
Stmt
(c) if (a==b)
for (i=1; i<=20; i=i+1)
Stmt1
else
while (i>0)
Stmt2
(d) if (a==b)
if (b>0)
Stmt1
else
while (i>0)
Stmt2
2. Show an attributed translation grammar rule for each of the control struc-
tures given in Figures 4.18 and 4.19. Assume if statements always have
an else part and that there is a method, newlab, which allocates a new
statement label.
1. WhileStmt → while ( BoolExpr ) Stmt
2. ForStmt → for ( AssignExpr ; BoolExpr ; AssignExpr ) Stmt
3. IfStmt → if ( BoolExpr ) Stmt else Stmt
What we really want to do is use a loop to scan for a list of identifiers separated
by commas. This can be done as follows:
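The text's own loop does not appear in this excerpt. The following sketch (an illustrative assumption, with single characters standing in for tokens) shows the idea of scanning a comma-separated identifier list with a loop rather than recursion:

```java
// A sketch (not the book's code) of scanning "ident , ident , ... ident"
// with a loop instead of right recursion. 'i' stands in for an
// identifier token and '$' for the endmarker.
public class IdentList {
    private final String in;
    private int pos = 0;
    public IdentList(String s) { in = s + '$'; }
    private char inp() { return in.charAt(pos); }
    private void getInp() { if (pos < in.length() - 1) pos++; }

    // Returns true if the input is a comma-separated identifier list.
    public boolean identList() {
        if (inp() != 'i') return false;
        getInp();
        while (inp() == ',') {       // loop instead of recursive calls
            getInp();
            if (inp() != 'i') return false;
            getInp();
        }
        return inp() == '$';
    }
}
```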
We use this methodology also for the methods ArgList() and StmtList().
Note that the fact that we have assigned the same attribute to certain sym-
bols in the grammar, saves some effort in the parser. For example, the definition
of Factor uses a subscript of p on the Factor as well as on the Expr, identifier,
and number on the right side of the arrow. This simply means that the value
of the Factor is the same as the item on the right side, and the parser is simply
(ignoring unary operations):
4.10.1 Exercises
1. Show the atoms put out as a result of the following Decaf statement:
if (a==3)
{ a = 4;
for (i = 2; i<5; i=0 )
i = i + 1;
}
else
while (a>5)
i = i * 8;
2. Explain the purpose of each atom put out in our Decaf attributed trans-
lation grammar for the for statement:
ForStmt → for ( OptExprp ; {LBL}Lbl1 OptBoolExprLbl3 ;
{JMP}Lbl2 {LBL}Lbl4 OptExprr ) {JMP}Lbl1
{LBL}Lbl2 Stmt {JMP}Lbl4 {LBL}Lbl3
Lbl1 ← newlab()   Lbl2 ← newlab()
Lbl3 ← newlab()   Lbl4 ← newlab()
3. The Java language has a switch statement.
(a) Include a definition of the switch statement in the attributed trans-
lation grammar for Decaf.
(b) Check your grammar by building an attributed derivation tree for a
sample switch statement of your own design.
(c) Include code for the switch statement in the recursive descent parser,
decaf.java and parse.java .
4. Using the grammar of Figure 4.20, show an attributed derivation tree for
the statement given in problem 1, above.
5. Implement a do-while statement in decaf, following the guidelines in
problem 3.
Bottom Up Parsing
1. S → S a B
2. S → c
3. B → a b
A derivation tree for the string caabaab is shown in Figure 5.1. The shift reduce
parser will proceed as follows: each step will be either a shift (an input symbol
is moved from the input string onto the stack) or a reduce (a handle on top of
the stack is replaced by the nonterminal on the left side of the corresponding
grammar rule).
S a B
S a B a b
c a b
Figure 5.1: Derivation tree for the string caabaab using grammar G22
When parsing the input string aaab, we reach a point where it appears that
we have a handle on top of the stack (the terminal a), but reducing that handle,
as shown in Figure 5.3, does not lead to a correct bottom up parse. This is called
a shift/reduce conflict because the parser does not know whether to shift an
input symbol or reduce the handle on the stack. This means that the grammar
is not LR, and we must either rewrite the grammar or use a different parsing
, caabaab
shift
,c aabaab
reduce using rule 2
,S aabaab
shift
,Sa abaab
shift
,Saa baab
shift
,Saab aab
reduce using rule 3
,SaB aab
reduce using rule 1
,S aab
shift
,Sa ab
shift
,Saa b
shift
,Saab
reduce using rule 3
,SaB
reduce using rule 1
,S
Accept
Figure 5.2: Sequence of stack frames parsing caabaab using grammar G22
∇ aaab ↵
shift
∇ a aab ↵
reduce using rule 2
∇ S aab ↵
shift
∇ Sa ab ↵
shift/reduce conflict
reduce using rule 2 (incorrect)
∇ SS ab ↵
shift
∇ SSa b ↵
shift
∇ SSab ↵
reduce using rule 3
∇ SSB ↵
Syntax error (incorrect)
algorithm.
Another problem in shift reduce parsing occurs when it is clear that a reduce
operation should be performed, but there is more than one grammar rule whose
right hand side matches the top of the stack, and it is not clear which rule
should be used. This is called a reduce/reduce conflict. Grammar G24 is an
example of a grammar with a reduce/reduce conflict.
G24:
1. S → SA
2. S → a
3. A → a
Figure 5.4 shows an attempt to parse the input string aa with the shift reduce
algorithm, using grammar G24. Note that we encounter a reduce/reduce conflict
when the handle a is on the stack because we don’t know whether to reduce
using rule 2 or rule 3. If we reduce using rule 2, we will get a correct parse, but
if we reduce using rule 3 we will get an incorrect parse.
It is often possible to resolve these conflicts simply by making an assumption.
∇ aa ↵
shift
∇ a a ↵
reduce/reduce conflict (rules 2 and 3)
reduce using rule 3 (incorrect)
∇ A a ↵
shift
∇ Aa ↵
reduce/reduce conflict (rules 2 and 3)
reduce using rule 2 (rule 3 will also yield a syntax error)
∇ AS ↵
Syntax error
For example, all shift/reduce conflicts could be resolved by shifting rather than
reducing. If this assumption always yields a correct parse, there is no need to
rewrite the grammar.
In examples like the two just presented, it is possible that the conflict can be
resolved by looking ahead at additional input characters. An LR algorithm that
looks ahead k input symbols is called LR(k). When implementing programming
languages bottom up, we generally try to define the language with an LR(1)
grammar, in which case the algorithm will not need to look ahead beyond the
current input symbol. An ambiguous grammar is not LR(k) for any value of k;
i.e., an ambiguous grammar will always produce conflicts when parsing bottom
up with the shift reduce algorithm. For example, the following grammar for if
statements is ambiguous:
1. Stmt → if (BoolExpr) Stmt else Stmt
2. Stmt → if (BoolExpr) Stmt
Figure 5.6: Parser configuration before reading the else part of an if statement
(stack and input diagram omitted)
If the parser resolves this shift/reduce conflict by shifting the else onto the
stack, it will always find the correct interpretation. Alternatively, the ambiguity
may be removed by rewriting the grammar, as shown in section 3.1.
Solution:
∇ caab ↵
shift
∇ c aab ↵
reduce using rule 2
∇ S aab ↵
shift
∇ Sa ab ↵
shift
∇ Saa b ↵
shift
∇ Saab ↵
reduce using rule 3
∇ SaB ↵
reduce using rule 1
∇ S ↵
Accept
5.1.1 Exercises
1. For each of the following stack configurations, identify the handle using
the grammar shown below:
1. S → SAb
2. S → acb
3. A → bBc
4. A → bc
5. B → ba
6. B → Ac
(a) ▽ SSAb
(b) ▽ SSbbc
(c) ▽ SbBc
(d) ▽ Sbbc
2. Using the grammar of Problem 1, show the sequence of stack and input
configurations as each of the following strings is parsed with shift reduce
parsing:
(a) acb
(b) acbbcb
(c) acbbbacb
(d) acbbbcccb
(e) acbbcbbcb
1. S → S ab
2. S → b A
3. A → b b
4. A → b A
5. A → b bc
6. A → c
(a) b c
(b) b b c a b
(c) b a c b
4. Assume that a shift/reduce parser always chooses the lower numbered rule
(i.e., the one listed first in the grammar) whenever a reduce/reduce con-
flict occurs during parsing, and it chooses a shift whenever a shift/reduce
conflict occurs. Show a derivation tree corresponding to the parse for the
sentential form if (BoolExpr) if (BoolExpr) Stmt else Stmt, using
the following ambiguous grammar. Since the grammar is not complete,
you may have nonterminal symbols at the leaves of the derivation tree.
1. Stmt → if (BoolExpr) Stmt else Stmt
2. Stmt → if (BoolExpr) Stmt
Stack Input
▽S ab←֓
in which the bottom of the stack is to the left. The action shift will result in
the following configuration:
Stack Input
▽ Sa b←֓
The a has been shifted from the input to the stack. Suppose, then, that in
the grammar, rule 7 is:
7. B → Sa
Select the row of the goto table labeled ▽ and the column labeled B. If the
entry in this cell is push X, then the action reduce 7 would result in the following
configuration:
Stack Input
▽X b←֓
Figure 5.7 shows the LR parsing tables for grammar G5 for arithmetic
expressions involving only addition and multiplication (see section 3.1). As in
previous pushdown machines, the stack symbols label the rows, and the input
symbols label the columns of the action table. The columns of the goto table
are labeled by the nonterminal being reduced. The stack is initialized with a
▽ symbol to mark the bottom of the stack, and blank cells in the action table
indicate syntax errors in the input string. Figure 5.8 shows the sequence of
configurations which would result when these tables are used to parse the input
string (var+var)*var.
[Figure 5.7: Action and Goto tables to parse simple arithmetic expressions
(tables abbreviated). The rows of both tables are labeled by the stack symbols
▽, Expr1, Term1, Factor3, (, Expr5, ), +, Term2, *, Factor4, and var; the
Action table columns are labeled by the input symbols +, *, (, ), var, and ↵,
and the Goto table columns by the nonterminals Expr, Term, and Factor.
Entries preserved from the figure: in the Action table, the ▽ row contains
shift ( and shift var; in the Goto table, the ▽ row contains push Expr1,
push Term2, and push Factor4; the ( row contains push Expr5, push Term2,
and push Factor4; the + row contains push Term1 and push Factor4; and the
* row contains push Factor3. The stack is initialized with ▽, and blank cells
in the Action table indicate syntax errors.]
, (var+var)*var N
shift (
,( var+var)*var N
shift var
,(var +var)*var N
reduce 6 push Factor4
,(Factor4 +var)*var N
reduce 4 push Term2
,(Term2 +var)*var N
reduce 2 push Expr5
,(Expr5 +var)*var N
shift +
,(Expr5+ var)*var N
shift var
,(Expr5+var )*var N
reduce 6 push Factor4
,(Expr5+Factor4 )*var N
reduce 4 push Term1
,(Expr5+Term1 )*var N
reduce 1 push Expr5
,(Expr5 )*var N
shift )
,(Expr5) *var N
reduce 5 push Factor4
,Factor4 *var N
reduce 4 push Term2
,Term2 *var N
shift *
,Term2* var N
shift var
,Term2*var N
reduce 6 push Factor3
,Term2*Factor3 N
reduce 3 push Term2
,Term2 N
reduce 2 push Expr1
,Expr1 N
Accept
G5:
1. Expr → Expr + Term
2. Expr → Term
3. Term → Term * Factor
4. Term → Factor
5. Factor → ( Expr )
6. Factor → var
1. Find the action corresponding to the current input and the top stack symbol.
2. If that action is a shift action:
a. Push the input symbol onto the stack.
b. Advance the input pointer.
3. If that action is a reduce action:
a. Find the grammar rule specified by the reduce action.
b. The symbols on the right side of the rule should also be on the top of the
stack -- pop them all off the stack.
c. Use the nonterminal on the left side of the grammar rule to indicate a
column of the goto table, and use the top stack symbol to indicate a row
of the goto table. Push the indicated stack symbol onto the stack.
d. Retain the input pointer.
4. If that action is blank, a syntax error has been detected.
5. If that action is Accept, terminate.
6. Repeat from step 1.
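The six steps above can be sketched as a driver loop. The tables below are hard-coded as conditionals for a toy grammar (1. S → S a, 2. S → b), not for G5; everything in this sketch is an illustration of the algorithm, not the book's implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal sketch of the table-driven shift-reduce loop for the toy
// grammar  1. S -> S a   2. S -> b.  The action and goto tables are
// encoded as conditionals; '#' marks the bottom of the stack and '$'
// is the endmarker.
public class LrDriver {
    public static boolean parse(String input) {
        String in = input + "$";
        Deque<Character> stack = new ArrayDeque<>();
        stack.push('#');
        int pos = 0;
        while (true) {
            char top = stack.peek(), cur = in.charAt(pos);
            // step 1: consult the action for (top stack symbol, input)
            if (top == '#' && cur == 'b') { stack.push('b'); pos++; }      // shift
            else if (top == 'S' && cur == 'a') { stack.push('a'); pos++; } // shift
            else if (top == 'b') {            // reduce 2: S -> b
                stack.pop(); stack.push('S');      // pop rhs, goto pushes S
            } else if (top == 'a') {          // reduce 1: S -> S a
                stack.pop(); stack.pop(); stack.push('S');
            } else if (top == 'S' && cur == '$') return true;  // accept
            else return false;                // blank entry: syntax error
        }
    }
    public static void main(String[] args) {
        System.out.println(parse("baa"));  // true
        System.out.println(parse("ab"));   // false
    }
}
```

For example, parse("baa") accepts (b reduces to S, then each a extends S a), while parse("ab") is rejected at the first symbol.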
Solution:
, var*var N
shift var
,var *var N
reduce 6 push Factor4
,Factor4 *var N
reduce 4 push Term2
,Term2 *var N
shift *
,Term2* var N
shift var
,Term2*var N
reduce 6 push Factor3
,Term2*Factor3 N
reduce 3 push Term2
,Term2 N
reduce 2 push Expr1
,Expr1 N
Accept
There are three principal techniques for constructing the LR parsing tables.
In order from simplest to most complex or general, they are called: Simple LR
(SLR), Look Ahead LR (LALR), and Canonical LR (LR). SLR is the easiest
technique to implement, but works for a small class of grammars. LALR is
more difficult and works on a slightly larger class of grammars. LR is the most
general, but still does not work for all unambiguous context free grammars. In
all cases, they find a rightmost derivation when scanning from the left (hence
LR). These techniques are beyond the scope of this text, but are described in
Parsons [17] and Aho et al. [1].
5.2.1 Exercises
1. Show the sequence of stack and input configurations and the reduce and
goto operations for each of the following expressions, using the action and
goto tables of Figure 5.7.
(a) var
(b) (var)
(c) var + var * var
(d) (var*var) + var
(e) (var * var
5.3 SableCC
For many grammars, the LR parsing tables can be generated automatically from
the grammar. There are several software systems designed to generate a parser
automatically from specifications (as mentioned in section 2.4). In this chapter
we will be using software developed at McGill University, called SableCC.
1. Package
2. Helpers
3. States
4. Tokens
5. Ignored Tokens
6. Productions
The first four sections were described in section 2.4. The Ignored Tokens
section gives you an opportunity to specify tokens that should be ignored by the
parser (typically white space and comments). The Productions section contains
the grammar rules for the language being defined. This is where syntactic
structures such as statements, expressions, etc. are defined. Each definition
consists of the name of the syntactic type being defined (i.e. a nonterminal), an
equal sign, an EBNF definition, and a semicolon to terminate the production.
As mentioned in section 2.4, all names in this grammar file must be lower case.
An example of a production defining a while statement is shown below (l par
and r par are left parenthesis and right parenthesis tokens, respectively):
stmt = while l par bool expr r par stmt ;
(Figure: language.grammar is processed by sablecc; Translation.java and
Compiler.java are then compiled by javac into Translation.class and
Compiler.class.)
Note that the semicolon at the end is not the token for a semicolon, but a
terminator for the stmt rule. Productions may use EBNF-like constructs. If x
is any grammar symbol, then:
x? // An optional x (0 or 1 occurrences of x)
x* // 0 or more occurrences of x
x+ // 1 or more occurrences of x
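For example, a compound statement consisting of a pair of braces enclosing zero or more statements could be defined with the * construct (the token names l_brace and r_brace are assumptions here, not taken from the grammar above):

```
compound_stmt = l_brace stmt* r_brace ;
```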
The names single and multiple enable the user to refer to one of these al-
ternatives when applying actions in the Translation class. Labels must also be
used when two identical names appear in a grammar rule. Each item label must
be enclosed in brackets, and followed by a colon:
Since there are two occurrences of assign_expr in the above definition of a
for statement, they must be labeled. The first is labeled init, and the second
is labeled incr.
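The labeled production this passage describes is not reproduced in the text above; it might look like the following sketch (the token names for and semi, and the nonterminal assign_expr, are assumptions here):

```
stmt = {for} for l_par [init]:assign_expr semi bool_expr semi
       [incr]:assign_expr r_par stmt ;
```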
Infix Postfix
2 + 3 * 4 2 3 4 * +
2 * 3 + 4 2 3 * 4 +
( 2 + 3 ) * 4 2 3 + 4 *
2 + 3 * ( 8 - 4 ) - 2 2 3 8 4 - * + 2 -
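The table can be spot-checked by evaluating the postfix forms directly. The little evaluator below is not part of the SableCC program being built; it simply pushes operands on a stack and applies each operator to the top two entries:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PostfixDemo {
    // Evaluate a space-separated postfix expression of ints and + - * /
    public static int eval(String postfix) {
        Deque<Integer> stack = new ArrayDeque<>();
        for (String tok : postfix.split("\\s+")) {
            switch (tok) {
                case "+": { int b = stack.pop(), a = stack.pop(); stack.push(a + b); break; }
                case "-": { int b = stack.pop(), a = stack.pop(); stack.push(a - b); break; }
                case "*": { int b = stack.pop(), a = stack.pop(); stack.push(a * b); break; }
                case "/": { int b = stack.pop(), a = stack.pop(); stack.push(a / b); break; }
                default:  stack.push(Integer.parseInt(tok));     // an operand
            }
        }
        return stack.pop();
    }
    public static void main(String[] args) {
        System.out.println(eval("2 3 4 * +"));        // 14, same as 2 + 3 * 4
        System.out.println(eval("2 3 8 4 - * + 2 -")); // 12, same as 2 + 3*(8-4) - 2
    }
}
```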
There are four sections in the grammar file for this program. The first section
specifies that the package name is ’postfix’. All java software for this program
will be part of this package. The second section defines the tokens to be used.
No Helpers are needed, since the numbers are simple whole numbers, specified
as one or more digits. The third section specifies that blank (white space)
tokens are to be ignored; this includes tab characters and newline characters.
Thus the user may input infix expressions in free format. The fourth section,
called Productions, defines the syntax of infix expressions. It is similar to the
grammar given in section 3.1, but includes subtraction and division operations.
Note that each alternative definition for a syntactic type must have a label in
braces. The grammar file is shown below:
Package postfix;
Tokens
number = [’0’..’9’]+;
plus = ’+’;
minus = ’-’;
mult = ’*’;
div = ’/’;
l_par = ’(’;
r_par = ’)’;
blank = (’ ’ | 10 | 13 | 9)+ ;
semi = ’;’ ;
Ignored Tokens
blank;
Productions
expr =
{term} term |
{plus} expr plus term |
{minus} expr minus term
;
term =
{factor} factor |
{mult} term mult factor |
{div} term div factor
;
factor =
{number} number |
{paren} l_par expr r_par
;
Now we wish to include actions which will put out postfix expressions.
SableCC will produce parser software which will create an abstract syntax tree
for a particular infix expression, using the given grammar. SableCC will also
package postfix;
import postfix.analysis.*; // needed for DepthFirstAdapter
import postfix.node.*; // needed for syntax tree nodes.
There are other methods in the DepthFirstAdapter class which may also be
overridden in the Translation class, but which were not needed for this example.
They include the following:
• There is an ’in’ method for each alternative, which is invoked when a node
is about to be visited. In our example, this would include the method
public void inAMultTerm (AMultTerm node)
• There is a ’case’ method for each alternative. This is the method that visits
all the descendants of a node, and it is not normally necessary to over-
ride this method. An example would be public void caseAMultTerm
(AMultTerm node)
• There is also a ’case’ method for each token; the token name is prefixed
with a ’T’ as shown below:
The other way to solve this problem would be to leave the grammar as is
and override the case method for this alternative. The case methods have not
been explained in full detail, but all the user needs to do is to copy the case
method from DepthFirstAdapter, and add the action at the appropriate place.
In this example it would be:
The student may have noticed that SableCC tends to alter names that were
included in the grammar. This is done to prevent ambiguities. For example,
l_par becomes LPar, and bool_expr becomes BoolExpr.
In addition to a Translation class, we also need a Compiler class. This is the
class which contains the main method, which invokes the parser. The Compiler
class is shown below:
package postfix;
import postfix.parser.*;
import postfix.lexer.*;
import postfix.node.*;
import java.io.*;

public class Compiler
{
   public static void main(String[] arguments)
   {  try
      {  // Create a parser which reads from standard input
         Parser p = new Parser (new Lexer (new PushbackReader
               (new InputStreamReader (System.in), 1024)));
         Start tree = p.parse();          // build the syntax tree
         tree.apply (new Translation());  // apply the translation actions
         System.out.println();
      }
      catch(Exception e)
      { System.out.println(e.getMessage()); }
   }
}
MUL T2 T3 T4
ADD T1 T4 T5
SUB T5 T6 T7
Solution:
package exprs;
import exprs.analysis.*;
import exprs.node.*;
import java.util.*; // for Hashtable
import java.io.*;
int alloc()
{ return ++avail; }
5.3.4 Exercises
1. Which of the following input strings would cause this SableCC program
to produce a syntax error message?
Tokens
a = ’a’;
b = ’b’;
c = ’c’;
newline = [10 + 13];
Productions
line = s newline ;
s = {a1} a s b
| {a2} b w c
;
w = {a1} b w b
| {a2} a c
;
package ex5_3;
import ex5_3.analysis.*;
import ex5_3.node.*;
import java.util.*;
import java.io.*;
RETRIEVE employee_file
PRINT
Input Expected Output
2+3.2e-2 2.032
2+3*5/2 9.5
(2+3)*5/2 12.5
16/(2*3 - 6*1.0) infinity
Unfortunately, the grammar and Java code shown below are incorrect.
There are four mistakes, some of which are syntactic errors in the gram-
mar; some of which are syntactic Java errors; some of which cause run-time
errors; and some of which don’t produce any error messages, but do pro-
duce incorrect output. Find and correct all four mistakes. If possible, use
a computer to help debug these programs.
The grammar, exprs.grammar is shown below:
Package exprs;
Helpers
digits = [’0’..’9’]+ ;
exp = [’e’ + ’E’] [’+’ + ’-’]? digits ;
Tokens
number = digits ’.’? digits? exp? ;
plus = ’+’;
minus = ’-’;
mult = ’*’;
div = ’/’;
l_par = ’(’;
r_par = ’)’;
newline = [10 + 13] ;
blank = (’ ’ | ’t’)+;
semi = ’;’ ;
Ignored Tokens
blank;
Productions
exprs = expr newline
| exprs embed
;
embed = expr newline;
expr =
{term} term |
{plus} expr plus term |
{minus} expr minus term
;
term =
{factor} factor |
{mult} term mult factor |
{div} term div factor |
;
factor =
{number} number |
{paren} l_par expr r_par
;
package exprs;
import exprs.analysis.*;
import exprs.node.*;
import java.util.*;
6. Show the SableCC grammar which will check for proper syntax of regular
expressions over the alphabet {0,1}. Observe the precedence rules for the
three operations. Some examples are shown:
Valid Invalid
(0+1)*.1.1 *0
0.1.0* (0+1)+1)
((0)) 0+
5.4 Arrays
Although arrays are not included in our definition of Decaf, they are of such
great importance to programming languages and computing in general, that we
would be remiss not to mention them at all in a compiler text. We will give a
brief description of how multi-dimensional array references can be implemented
and converted to atoms, but for a more complete and efficient implementation
the student is referred to Parsons [17] or Aho et al. [1].
The main problem that we need to solve when referencing an array element
is that we need to compute an offset from the first element of the array. Though
the programmer may be thinking of multi-dimensional arrays (actually arrays
of arrays) as existing in two, three, or more dimensions, they must be physically
mapped to the computer’s memory, which has one dimension. For example, an
array declared as int n[][][] = new int [2][3][4]; might be envisioned by
the programmer as a structure having three rows and four columns in each of
two planes as shown in Figure 5.10 (a). In reality, this array is mapped into
a sequence of twenty-four (2*3*4) contiguous memory locations as shown in
Figure 5.10 (b). The problem which the compiler must solve is to convert an
array reference such as n[1][1][0] to an offset from the beginning of the storage
area allocated for n. For this example, the offset would be sixteen memory cells
(assuming that each element of the array occupies one memory cell).
To see how this is done, we will begin with a simple one-dimensional array
and then proceed to two and three dimensions. For a vector, or one-dimensional
array, the offset is simply the subscripting value, since subscripts begin at 0 in
Java. For example, if v were declared to contain twenty elements, char v[] =
new char[20];, then the offset for the fifth element, v[4], would be 4, and in
general the offset for a reference v[i] would be i. The simplicity of this formula
results from the fact that array indexing begins with 0 rather than 1. A vector
maps directly to the computer’s memory.
Now we introduce arrays of arrays, which, for the purposes of this discussion,
we call multi-dimensional arrays; suppose m is declared as a matrix, or two-
dimensional array, char m[][] = new char [10][15];. We are thinking of
this as an array of 10 rows, with 15 elements in each row. A reference to an
element of this array will compute an offset of fifteen elements for each row after
Figure 5.10: (a) The array n as envisioned by the programmer: two planes,
each containing three rows and four columns. (b) The same array mapped into
twenty-four contiguous memory locations, beginning with n[0][0][0] and ending
with n[1][2][3].
the first. Also, we must add to this offset the position of the desired element
within the selected row. For example, a reference to m[4][7] would require an offset of 4*15 + 7 =
67. The reference m[r][c] would require an offset of r*15 + c. In general, for
a matrix declared as char m[][] = new char [ROWS][ COLS], the formula
for the offset of m[r][c] is r*COLS + c.
For a three-dimensional array, char a[][][] = new char [5][6][7];, we must sum
an offset for each plane (6*7 elements), an offset for each row (7 elements), and
an offset for the elements in the selected row. For example, the offset for the ref-
erence a[2][3][4] is found by the formula 2*6*7 + 3*7 + 4. The reference a[p][r][c]
would result in an offset computed by the formula p*6*7 + r*7 + c. In general,
for a three-dimensional array, new char [PLANES][ROWS][COLS], the reference
a[p][r][c] would require an offset computed by the formula p*ROWS*COLS +
r*COLS + c.
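The formulas above can be checked with a few lines of Java; offset3 is a name introduced here for illustration, and the method assumes one memory cell per element:

```java
public class ArrayOffset {
    // Offset of a[p][r][c] in an array declared as new char[PLANES][ROWS][COLS],
    // assuming each element occupies one memory cell.
    public static int offset3(int p, int r, int c, int rows, int cols) {
        return p * rows * cols + r * cols + c;
    }
    public static void main(String[] args) {
        // a[2][3][4] in new char[5][6][7]:  2*6*7 + 3*7 + 4
        System.out.println(offset3(2, 3, 4, 6, 7)); // 109
        // n[1][1][0] in new int[2][3][4]:  1*3*4 + 1*4 + 0
        System.out.println(offset3(1, 1, 0, 3, 4)); // 16, as stated earlier
    }
}
```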
We now generalize what we have done to an array that has any number of
dimensions. Each subscript is multiplied by the total number of elements in all
higher dimensions. If an n dimensional array is declared as char a[][]...[]
= new char[D1][D2 ][D3 ]...[Dn], then a reference to a[S1 ][S2 ][S3 ]...[Sn]
will require an offset computed by the following formula:
S1 *D2 *D3 *D4 *...*Dn + S2 *D3 *D4 *...*Dn + S3 *D4 *...*Dn + ... + Sn−1 *Dn
+ Sn .
In this formula, Di represents the number of elements in the ith dimension
and Si represents the ith subscript in a reference to the array. Note that in some
languages, such as Java and C, all the subscripts are not required. For example,
the array of three dimensions a[2][3][4], may be referenced with two, one, or
even zero subscripts. a[1] refers to the address of the first element in the second plane.
This extension merely states that a variable may be followed by a list of sub-
scripting expressions, each in square brackets (the nonterminal Subs represents
a list of subscripts).
Grammar G23 shows rules 7-9 of grammar G22, with attributes and action
symbols. Our goal is to come up with a correct offset for a subscripted variable
in grammar rule 8, and provide its address for the attribute of the Subs defined
in that rule.
Grammar G23:
Assume the array m has been declared to have two planes, four
rows, and five columns: m = new char[2][4][5];. Show the attributed
derivation tree generated by grammar G23 for the array reference
m[x][y][z]. Use Factor as the starting nonterminal, and
show the subscripting expressions as Expr, as done in Figure 4.12.
Also show the sequence of atoms which would be put out as a result
of this array reference.
Figure 5.11: A derivation tree for the array reference a[p][r][c], which is declared
as int a[3][5][7], using grammar G23. (The tree itself, rooted at Factor with
attribute a[T1] and ending with a (check)4,a atom, is not reproduced here.)
Solution:
(The solution tree, rooted at Factor with attribute m[T1] and ending with a
(check)4,m atom, is not reproduced here.)
The atoms put out are:
(MOV, 0,, T1)
(MUL, x, =20, T2)
(ADD, T1, T2, T1)
(MUL, y, =5, T3)
(ADD, T1, T3, T1)
(MUL, z, =1, T4)
(ADD, T1, T4, T1)
(check, 4, m)
5.4.1 Exercises
1. Assume the following array declarations:
Show the attributed derivation tree resulting from grammar G23 for each
of the following array references. Use Factor as the starting nonterminal,
and show each subscript expression as Expr, as done in Figure 5.11. Also
show the sequence of atoms that would be put out.
(a) v[7]
(b) m[q][2]
(c) a3[11][b][4]
(d) z[2][c][d][2]
(e) m[1][1]
2. The discussion in this section assumed that each array element occupied
one addressable memory cell. If each array element occupies SIZE memory
cells, what changes would have to be made to the general formula given
in this section for the offset? How would this affect grammar G23?
3. You are given two vectors: the first, d, contains the dimensions of a
declared array, and the second, s, contains the subscripting values in a
reference to that array.
(a) Write a Java method :
int offSet (int d[], int s[]);
that computes the offset for an array reference a[s0][s1]...[s(max-1)], where
the array has been declared as char a[d0][d1]...[d(max-1)].
(b) Improve your Java method, if possible, to minimize the number of
run-time multiplications.
5.5 Case Study: Syntax Analysis for Decaf

will not be necessary for the Decaf compiler, syntax analysis and semantic anal-
ysis have been combined into one program.
The complete SableCC grammar file and Translation source code is shown
in Appendix B and is explained here. The input to SableCC is the file
decaf.grammar, which generates classes for the parser, nodes, lexer, and analysis.
In the Tokens section, we define the two types of comments; comment1 is a
single-line comment, beginning with // and ending with a newline character.
comment2 is a multi-line comment, beginning with /* and ending with */. Nei-
ther of these tokens requires the use of states, which is why there is no States
section in our grammar. Next each keyword is defined as a separate token taking
care to include these before the definition of identifiers. These are followed by
special characters ’+’, ’-’, ;, .... Note that relational operators are defined collec-
tively as a compare token. Finally we define identifiers and numeric constants
as tokens. The Ignored Tokens are space and the two comment tokens.
The Productions section is really the Decaf grammar with some modifica-
tions to allow for bottom-up parsing. The major departure from what has been
given previously and in Appendix A, is the definition of the if statement. We
need to be sure to handle the dangling else appropriately; this is the ambiguity
problem discussed in section 3.1 caused by the fact that an if statement has
an optional else part. This problem was relatively easy to solve when parsing
top-down, because the ambiguity was always resolved in the correct way simply
by checking for an else token in the input stream. When parsing bottom-up,
however, we get a shift-reduce conflict from this construct. If we rewrite the
grammar to eliminate the ambiguity, as in section 3.1 (Grammar G7), we still
get a shift-reduce conflict. Unfortunately, in SableCC there is no way to resolve
this conflict always in favor of a shift (this is possible with yacc). Therefore, we
will need to rewrite the grammar once again; we use a grammar adapted from
Appel [3]. In this grammar a no short if statement is one which does not contain
an if statement without a matching else. The EBNF capabilities of SableCC
are used, for example, in the definition of compound_stmt, which consists of a
pair of braces enclosing 0 or more statements. The complete grammar is shown
in appendix B. An array of Doubles named ’memory’ is used to store the values
of numeric constants.
The Translation class, also shown in appendix B, is written to produce atoms
for the arithmetic operations and control structures. The structure of an atom is
shown in Figure 5.12. The Translation class uses a few Java maps: the first map,
implemented as a HashMap and called ’hash’, stores the temporary memory
location associated with each sub-expression (i.e. with each node in the syntax
tree). It also stores label numbers for the implementation of control structures.
Hence, the keys for this map are nodes, and the values are the integer run-time
memory locations, or label numbers, associated with them. The second map,
called ’nums’, stores the values of numeric constants, hence if a number occurs
several times in a Decaf program, it need be stored only once in this map. The
third map is called ’identifiers’. This is our Decaf symbol table. Each identifier is
stored once, when it is declared. The Translation class checks that an identifier is
not declared more than once (local scope is not permitted), and it checks that an
op     Operation of Atom
left   Left operand location
right  Right operand location
result Result operand location
cmp    Comparison code, for TST atoms
dest   Destination, for JMP, LBL, and TST atoms

Figure 5.12: The structure of an atom
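The table above could be represented by a class along these lines. The field names follow the table, but the class itself is a sketch, not the book's actual code:

```java
public class Atom {
    public String op;   // operation of atom, e.g. "ADD", "JMP", "TST"
    public int left;    // left operand location
    public int right;   // right operand location
    public int result;  // result operand location
    public int cmp;     // comparison code, for TST atoms
    public int dest;    // destination label number, for JMP, LBL, and TST atoms

    public Atom(String op, int left, int right, int result, int cmp, int dest) {
        this.op = op; this.left = left; this.right = right;
        this.result = result; this.cmp = cmp; this.dest = dest;
    }
    public String toString() {
        return "(" + op + ", " + left + ", " + right + ", " + result + ")";
    }
}
```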
identifier has been declared before it is used. For both numbers and identifiers,
the value part of each entry stores the run-time memory location associated with
it. The implementation of control structures for if, while, and for statements
follows that which was presented in section 4.9. A boolean expression always
results in a TST atom which branches if the comparison operation result is
false. Whenever a new temporary location is needed, the method alloc provides
the next available location (a better compiler would re-use previously allocated
locations when possible). Whenever a new label number is needed, it is provided
by the lalloc method. Note that when an integer value is stored in a map, it
must be an object, not a primitive. Therefore, we use the wrapper class for
integers provided by Java, Integer. The complete Translation class is shown in
appendix B and is available at http://www.rowan.edu/~bergmann/books.
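The behavior of the 'identifiers' map described above can be sketched as follows. This is a simplified stand-in for the book's Translation class, not its actual code; declare, lookup, and avail are names introduced here for illustration:

```java
import java.util.HashMap;

public class SymbolTableDemo {
    // 'identifiers' maps each declared name to a run-time memory location.
    static HashMap<String, Integer> identifiers = new HashMap<>();
    static int avail = 0;                    // last allocated location

    public static int alloc() { return ++avail; }

    public static void declare(String name) {
        if (identifiers.containsKey(name))   // no local scope in Decaf
            throw new RuntimeException(name + " is declared more than once");
        identifiers.put(name, alloc());      // stored as an Integer object
    }
    public static int lookup(String name) {
        Integer loc = identifiers.get(name);
        if (loc == null)
            throw new RuntimeException(name + " is used before being declared");
        return loc;
    }
    public static void main(String[] args) {
        declare("x");
        declare("y");
        System.out.println(lookup("y")); // 2
    }
}
```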
For more documentation on SableCC, visit http://www.sablecc.org.
5.5.1 Exercises
1. Extend the Decaf language to include a do statement defined as:
DoStmt → do Stmt while ( BoolExpr ) ;
Modify the files decaf.grammar and Translation.java, shown in Appendix
B so that the compiler puts out the correct atom sequence implementing
this control structure, in which the test for termination is made after the
body of the loop is executed. The nonterminals Stmt and BoolExpr are
already defined. For purposes of this assignment you may alter the atom
method so that it prints out its arguments to stdout rather than building
a file of atoms.
ing this control structure. The nonterminals Expr and Stmt are already
defined, as are the tokens number and end. The token switch needs to
be defined. Also define a break statement which will be used to transfer
control out of the switch statement. For purposes of this assignment, you
may alter the atom() function so that it prints out its arguments to std-
out rather than building a file of atoms, and remove the call to the code
generator.
6 Code Generation

6.1 Introduction to Code Generation
If we are implementing a compiler for a new machine, and we already have compilers for
our old machine, all we need to do is write the back end, since the front end is
not machine dependent. For example, if we have a Pascal compiler for an IBM
PS/2, and we wish to implement Pascal on a new RISC (Reduced Instruction Set
Computer) machine, we can use the front end of the existing Pascal compiler
(it would have to be recompiled to run on the RISC machine). This means that
we need to write only the back end of the new compiler (refer to Figure 1.9).
Our life is also simplified when constructing a compiler for a new program-
ming language on an existing computer. In this case, we can make use of the
back end already written for our existing compiler. All we need to do is rewrite
the front end for the new language, compile it, and link it together with the
existing back end to form a complete compiler. Alternatively, we could use an
editor to combine the source code of our new front end with the source code of
the back end of the existing compiler, and compile it all at once.
For example, suppose we have a Pascal compiler for the Macintosh, and
we wish to construct an Ada compiler for the Macintosh. First, we under-
stand that the front end of each compiler translates source code to a string of
atoms (call this language Atoms), and the back end translates Atoms to Mac
machine language (Motorola 680x0 instructions). The compilers we have are
C_Pas^{Pas→Mac} and C_Mac^{Pas→Mac}; the compiler we want is
C_Mac^{Ada→Mac}; and each is composed of two parts, as shown in Figure 6.1.
We write C_Pas^{Ada→Atoms}, which is the front end of an Ada compiler and
is also shown in Figure 6.1.
We then compile the front end of our Ada compiler as shown in Figure 6.2
and link it with the back end of our Pascal compiler to form a complete Ada
compiler for the Mac, as shown in Figure 6.3.
The back end of the compiler consists of the code generation phase, which we
will discuss in this chapter, and the optimization phases, which will be discussed
in Chapter 7. Code generation is probably the least intensively studied phase
of the compiler. Much of it is straightforward and simple; there is no need for
extensive research in this area. In the past most of the research that has been
done is concerned with methods for specifying target machine architectures, so
that this phase of the compiler can be produced automatically, as in a compiler-
compiler. In more recent years, research has centered on generating code for
embedded systems, special-purpose computers, and multi-core systems.
C_Pas^{Pas→Mac} = C_Pas^{Pas→Atoms} + C_Pas^{Atoms→Mac}

C_Mac^{Pas→Mac} = C_Mac^{Pas→Atoms} + C_Mac^{Atoms→Mac}

C_Mac^{Ada→Mac} = C_Mac^{Ada→Atoms} + C_Mac^{Atoms→Mac}

C_Pas^{Ada→Atoms}

Figure 6.1: Each compiler composed of a front end and a back end

C_Pas^{Ada→Atoms} --compiled by C_Mac^{Pas→Mac}--> C_Mac^{Ada→Atoms}
Figure 6.2: Compile the front end of the Ada compiler on the Mac
C_Mac^{Ada→Atoms} + C_Mac^{Atoms→Mac} = C_Mac^{Ada→Mac}
Figure 6.3: Link the front end of the Ada compiler with the back end of the
Pascal compiler to produce a complete Ada compiler.
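The bookkeeping in these figures can be mimicked mechanically. The sketch below (with hypothetical names, not from the book) represents each compiler C_impl^{source→target} as a record: link joins a front end to a back end whose intermediate languages match, and compile runs a compiler's own source text through another compiler, changing its implementation language:

```java
public class BigC {
    // C_impl^{source -> target}
    public record Compiler(String impl, String source, String target) {}

    // Link a front end (source -> Atoms) with a back end (Atoms -> target).
    public static Compiler link(Compiler front, Compiler back) {
        if (!front.target().equals(back.source()) || !front.impl().equals(back.impl()))
            throw new IllegalArgumentException("parts do not fit");
        return new Compiler(front.impl(), front.source(), back.target());
    }
    // Compile a compiler's source text with another compiler.
    public static Compiler compile(Compiler prog, Compiler with) {
        if (!prog.impl().equals(with.source()))
            throw new IllegalArgumentException("wrong source language");
        return new Compiler(with.target(), prog.source(), prog.target());
    }
    public static void main(String[] args) {
        Compiler adaFront = new Compiler("Pas", "Ada", "Atoms"); // C_Pas^{Ada->Atoms}
        Compiler pasOnMac = new Compiler("Mac", "Pas", "Mac");   // C_Mac^{Pas->Mac}
        Compiler backEnd  = new Compiler("Mac", "Atoms", "Mac"); // C_Mac^{Atoms->Mac}
        // Figure 6.2 then Figure 6.3:
        Compiler adaOnMac = link(compile(adaFront, pasOnMac), backEnd);
        System.out.println(adaOnMac); // Compiler[impl=Mac, source=Ada, target=Mac]
    }
}
```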
code.
Solution:
We want C_RISC^{Ada→RISC}.

Write (in Pascal) the back end of a compiler for the RISC machine:

C_Pas^{Atoms→RISC}

C_Pas^{Pas→RISC} --compiled by C_Mac^{Pas→Mac}--> C_Mac^{Pas→RISC}

But this is still not what we want, so we load the output into the
Mac's memory and compile again:

C_Pas^{Pas→RISC} --compiled by C_Mac^{Pas→RISC}--> C_RISC^{Pas→RISC}
6.1.1 Exercises
1. Show the big C notation for each of the following compilers (assume that
each uses an intermediate form called Atoms):
(a) The back end of a compiler for the Sun computer.
(b) The source code, in Pascal, for a COBOL compiler whose target ma-
chine is the PC.
(c) The source code, in Pascal, for the back end of a FORTRAN compiler
for the Sun.
(a) C_Pas^{Lisp→PC}, C_PC^{Pas→PC}

(b) C_Pas^{Lisp→Atoms}, C_Pas^{Pas→Atoms}, C_Pas^{Atoms→PC}, C_PC^{Pas→PC}

(c) C_PC^{Lisp→Atoms}, C_PC^{Atoms→PC}
3. Given a Sparc computer and the following compilers, show how to gen-
erate a Pascal (Pas) compiler for the MIPS machine without doing any
more programming. (Unfortunately, you cannot afford to buy a MIPS
computer.)
C_Pas^{Pas→Sparc} = C_Pas^{Pas→Atoms} + C_Pas^{Atoms→Sparc}

C_Sparc^{Pas→Sparc} = C_Sparc^{Pas→Atoms} + C_Sparc^{Atoms→Sparc}

C_Pas^{Atoms→MIPS}
Solution:
6.2.1 Exercises
1. For each of the following Java statements we show the atom string pro-
duced by the parser. Translate each atom string to instructions, as in
the sample problem for this section. You may assume that variables and
labels are represented by symbolic addresses.
(a) { a = b + c * (d - e) ;
b = a;
}
(SUB, d, e, T1)
(MUL, c, T1, T2)
(ADD, b, T2, T3)
(MOV, T3,, a)
(MOV, a,, b)
(b)
for (i=1; i<=10; i++) j = j/3 ;
(MOV, 1,, i)
(LBL, L1)
(TST, i, 10,, 3, L4) // Branch if i>10
(JMP, L3)
(LBL, L5)
(ADD, 1, i, i) // i++
(c)
if (a!=b+3) a = 0; else b = b+3;
(ADD, b, 3, T1)
(TST, a, T1,, 1, L1) // Branch if a==b+3
(MOV, 0,, a) // a = 0
(JMP, L2)
(LBL, L1)
(ADD, b, 3, T2) // T2 = b + 3
(MOV, T2,, b) // b = T2
(LBL, L2)
3. Why is it important for the code generator to know how many instructions
correspond to each atom class?
4. How many machine language instructions would correspond to an ADD
atom on each of the following architectures?
(a) Zero address architecture (a stack machine)
(b) One address architecture
(c) Two address architecture
(d) Three address architecture
6.3 Single Pass vs. Multiple Passes

A code generator which scans the file of atoms once is called a single pass
code generator, and a code generator which scans it more than once is called a
multiple pass code generator.
The most significant problem relevant to deciding whether to use a single
or multiple pass code generator has to do with forward jumps. As atoms are
encountered, instructions can be generated, and the code generator maintains
a memory address counter, or program counter. When a Label atom is encoun-
tered, a memory address value can be assigned to that Label atom (a table of
labels is maintained, with a memory address assigned to each label as it is
defined). If a Jump atom is encountered with a destination that is a higher
memory address than the Jump instruction (i.e. a forward jump), the label to
which it is jumping has not yet been encountered, and it will not be possible
to generate the Jump instruction completely at this time. An example of this
situation is shown in Figure 6.5 in which the jump to Label L1 cannot be gener-
ated because at the time the JMP atom is encountered the code generator has
not encountered the definition of the Label L1, which will have the value 9.
A JMP atom results in a CMP (Compare instruction) followed by a JMP
(Jump instruction), to be consistent with the sample architecture presented in
section 6.5, below. There are two fundamental ways to resolve the problem of
forward jumps. Single pass compilers resolve it by keeping a table of Jump
instructions which have forward destinations. Each Jump instruction with a
forward reference is generated incompletely (i.e., without a destination address)
when encountered, and each is also entered into a fixup table, along with the
Label to which it is jumping. As each Label definition is encountered, it is
entered into a table of Labels, along with its address value. When all of the
atoms have been read, all of the Label atoms will have been defined, and, at
this time, the code generator can revisit all of the Jump instructions in the
Fixup table and fill in their destination addresses. This is shown in Figure 6.6
for the same atom sequence shown in Figure 6.5. Note that when the (JMP,
L1) atom is encountered, the Label L1 has not yet been defined, so the location
of the Jump (8) is entered into the Fixup table. When the (LBL, L1) atom
is encountered, it is entered into the Label table, because the target machine
address corresponding to this Label (9) is now known. When the end of file
(EOF) is encountered, the destination of the Jump instruction at location 8 is
filled in from the Label table with the value 9.
Figure 6.6: Use of the Fixup Table and Label Table in a single pass code generator
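The single pass fixup scheme can be sketched as follows. This is a simplified model in which each instruction occupies one location and atoms carry only an operation and a label; the names are illustrative, not the book's code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class FixupDemo {
    // Generate jump instructions from (JMP, label) and (LBL, label) atoms,
    // patching forward jumps from a fixup table after all atoms are read.
    public static List<String> generate(String[][] atoms) {
        List<String> code = new ArrayList<>();          // generated instructions
        Map<String, Integer> labels = new HashMap<>();  // Label table
        Map<Integer, String> fixups = new HashMap<>();  // Fixup table
        for (String[] atom : atoms) {
            if (atom[0].equals("LBL")) {
                labels.put(atom[1], code.size());       // address is now known
            } else {                                    // "JMP"
                Integer dest = labels.get(atom[1]);
                if (dest == null) {                     // forward jump
                    fixups.put(code.size(), atom[1]);   // enter into Fixup table
                    code.add("JMP ???");                // generated incompletely
                } else {
                    code.add("JMP " + dest);            // backward jump: complete
                }
            }
        }
        // At EOF, revisit the Fixup table and fill in destination addresses.
        for (Map.Entry<Integer, String> f : fixups.entrySet())
            code.set(f.getKey(), "JMP " + labels.get(f.getValue()));
        return code;
    }
    public static void main(String[] args) {
        String[][] atoms = { {"JMP", "L1"}, {"LBL", "L1"}, {"JMP", "L1"} };
        System.out.println(generate(atoms)); // [JMP 1, JMP 1]
    }
}
```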
(LBL, L1)
(TST, i, x,, 3, L2) // Branch if T1 is false
(ADD, x, 2, T1)
(MOV, T1, , x)
(MUL, i, 3, T2)
(MOV, T2, , i)
(JMP, L1) // Repeat the loop
(LBL, L2) // End of loop
Solution:
(1) Single Pass
...
1 JMP 14
(MOV, T1,, x) 5
(MUL, i, 3, T2) 7
(MOV, T2,, i) 10
(JMP, L1) 12
(LBL, L2) 14 L2 14
...
6.3.1 Exercises
1. The following atom string resulted from the Java statement:
for (i=a; i<b+c; i++) b = b/2;
Translate the atoms to instructions as in the sample problem for this
section using two methods: (1) a single pass method with a Fixup table
for forward Jumps and (2) a multiple pass method. Refer to the variables
a,b,c symbolically.
(MOV, a,, i)
(LBL, L1)
(ADD, b, c, T1) // T1 = b+c
(TST, i, T1,, 4, L2) // If i>=b+c, exit loop
(JMP, L3) // Exit loop
(LBL, L4)
(ADD, i, 1, i) // Increment i
(JMP, L1) // Repeat loop
(LBL, L3)
(DIV, b, =’2’, T3) // Loop Body
(MOV, T3,, b)
(JMP, L4) // Jump to increment
(LBL, L2)
2. Repeat Problem 1 for the atom string resulting from the Java statement:
if (a==(b-33)*2) a = (b-33)*2;
else a = x+y;
3. (a) What are the advantages of a single pass method of code generation
over a multiple pass method?
(b) What are the advantages of a multiple pass method of code generation
over a single pass method?
6.4 Register Allocation

they are not general purpose registers; i.e., each one has a limited range of uses
or functions. In these cases the allocation of registers is not a problem.
However, most modern architectures have many CPU registers; the DEC
Vax, IBM mainframe, MIPS, and Motorola 680x0 architectures each has 16-
32 general purpose registers, for example, and the RISC (Reduced Instruction
Set Computer) architectures, such as the SUN SPARC, generally have about
500 CPU registers (though only 32 are used at a time). In this case, register
allocation becomes an important problem. Register allocation is the process
of assigning a purpose to a particular register, or binding a register to a pro-
grammer variable or compiler variable, so that for a certain range or scope of
instructions that register has the specified purpose or binding and is used for no
other purposes. The code generator must maintain information on which reg-
isters are used for which purposes, and which registers are available for reuse.
The main objective in register allocation is to maximize utilization of the CPU
registers, and to minimize references to memory locations.
It might seem that register allocation is more properly a topic in the area of
code optimization, since code generation could be done with the assumption that
there is only one CPU register (resulting in rather inefficient code). Nevertheless,
register allocation is always handled (though perhaps not in an optimal way)
in the code generation phase. A well chosen register allocation scheme can not
only reduce the number of instructions required, but it can also reduce the
number of memory references. Since operands which are used repeatedly can
be kept in registers, the operands do not need to be recomputed, nor do they
need to be loaded from memory. It is especially important to minimize memory
references in compilers for RISC machines, in which the objective is to execute
one instruction per machine cycle, as described in Tanenbaum [21].
An example, showing the importance of smart register allocation, is shown
in Figure 6.8 for the two statement program segment:
a = b + c * d ;
a = a - c * d ;
The smart register allocation scheme takes advantage of the fact that C*D
is a common subexpression, and that the variable A is bound, temporarily, to
register R2. If no attention is paid to register allocation, the two statements
in Figure 6.8 are translated into twelve instructions, involving a total of twelve
memory references. With smart register allocation, however, the two statements
are translated into seven instructions, with only five memory references. (Some
computers, such as the VAX, permit arithmetic on memory operands, in which
case register allocation takes on lesser importance.)
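Figure 6.8 itself is not reproduced here, but a smart allocation achieving seven instructions and five memory references might look like the following sketch. The instruction names are in the style of the Mini architecture used later in this chapter; register r1 holds the common subexpression c*d, and r2 is temporarily bound to a:

```
LOD r1,c
MUL r1,d       r1 = c * d    (common subexpression)
LOD r2,b
ADD r2,r1      r2 = b + c*d
STO r2,a       a = b + c*d   (a remains bound to r2)
SUB r2,r1      r2 = a - c*d
STO r2,a       a = a - c*d
```

The five memory references are c, d, b, and the two stores to a; all other operands stay in registers.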
An algorithm which takes advantage of repeated subexpressions will be dis-
cussed in Section 7.2. Here, we will discuss an algorithm which determines how
many registers will be needed to evaluate an expression without storing subex-
pressions to temporary memory locations. This algorithm will also determine
the sequence in which subexpressions should be evaluated to minimize register
usage.
This register allocation algorithm will require a syntax tree for an expression
to be evaluated. Each node of the syntax tree will have a weight associated
Figure 6.8: Register allocation, simple and smart, for a two statement program:
a = b+c*d; a = a-c*d;
                -(2)
              /      \
          *(1)        *(2)
         /    \      /    \
      a(1)   b(0)  +(1)   +(1)
                  /   \   /   \
               c(1) d(0) e(1) f(0)

Figure 6.9: A weighted syntax tree for a*b-(c+d)*(e+f) with weights shown
in parentheses
with it which tells us how many registers will be needed to evaluate each subex-
pression without storing to temporary memory locations. Each leaf node which
is a left operand will have a weight of one, and each leaf node which is a right
operand will have a weight of zero. The weight of each interior node will be
computed from the weights of its two children as follows: If the two children
have different weights, the parent’s weight is the maximum of the two children.
If the two children have the same weight, w, then the parent’s weight is w+1.
As an example, the weighted syntax tree for the expression a*b - (c+d) * (e+f)
is shown in Figure 6.9 from which we can see that the entire expression should
require two registers.
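The weight rule just described (a form of Sethi-Ullman numbering) can be sketched in Java. The Node class and all names below are ours, not part of the book's software; a leaf simply records whether it is used as a right operand.

```java
// Sketch of the register-weight rule: weight = number of registers
// needed to evaluate the subtree without storing temporaries to memory.
class WeightDemo {
    static class Node {
        Node left, right;      // both null for a leaf node
        boolean isRightLeaf;   // true if this leaf is a right operand
        Node(Node l, Node r) { left = l; right = r; }
        Node(boolean rightLeaf) { isRightLeaf = rightLeaf; }
    }

    static int weight(Node n) {
        if (n.left == null)                    // leaf node
            return n.isRightLeaf ? 0 : 1;
        int wl = weight(n.left), wr = weight(n.right);
        return (wl == wr) ? wl + 1 : Math.max(wl, wr);
    }

    public static void main(String[] args) {
        // the tree of Figure 6.9: a*b - (c+d)*(e+f)
        Node ab = new Node(new Node(false), new Node(true));  // weight 1
        Node cd = new Node(new Node(false), new Node(true));  // weight 1
        Node ef = new Node(new Node(false), new Node(true));  // weight 1
        Node root = new Node(ab, new Node(cd, ef));
        System.out.println(weight(root));                     // prints 2
    }
}
```

Running main prints 2, agreeing with Figure 6.9: the subtree (c+d)*(e+f) has two children of equal weight 1 and therefore weight 2, which dominates the whole expression.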
Intuitively, if two expressions representing the two children of a node, N, in
a syntax tree require the same number of registers, we will need an additional
register to store the result for node N, regardless of which subexpression is
evaluated first. In the other case, if the two subexpressions do not require the
same number of registers, we can evaluate the one requiring more registers first, and then evaluate the other subexpression without needing an extra register to hold the result of the first.
LOD r1,c
ADD r1,d r1 = c + d
LOD r2,e
ADD r2,f r2 = e + f
MUL r1,r2 r1 = (c+d) * (e+f)
LOD r2,a
MUL r2,b r2 = a * b
SUB r2,r1 r2 = a*b - (c+d)*(e+f)
Figure 6.10: Code generated for a*b - (c+d)*(e+f) using Figure 6.9
Solution:
LOD r1,a
LOD r2,b
DIV r2,c b/c
6.4.1 Exercises
1. Use the register allocation algorithm given in this section to construct a
weighted syntax tree and generate code for each of the given expressions,
as done in Sample Problem ??. Do not attempt to optimize for common
subexpressions.
(a) a + b * c - d
(b) a + (b + (c + (d + e)))
(c) (a + b) * (c + d) - (a + b) * (c + d)
(d) a / (b + c) - (d + (e - f)) + (g - h * i) * (j * (k / m))
2. Show an expression different in structure from those in Problem 1 which
requires:
(a) two registers (b) three registers
As in Problem 1, assume that common subexpressions are not detected
and that Loads and Stores are minimized.
3. Show how the code generated in Problem 1 (c) can be improved by making
use of common subexpressions.
6.5 Case Study: A Code Generator for the Mini Architecture
In this section we will work with an example of a code generator, and it now becomes necessary
to specify a target machine architecture. It is tempting to choose a popular
machine such as a RISC, Intel, Motorola, IBM, or Sparc CPU. If we did so, the
student who had access to that processor could conceivably generate executable
code for that machine. But what about those who do not have access to the
chosen processor? Also, there would be details, such as Object formats (the
input to the linker), and supervisor or system calls for certain operations, which
we have not explained.
For these reasons, we choose our own simulated machine. This is an archi-
tecture which we will specify for the student. We also provide a simulator for
this machine, written in the C language. Thus, anyone who has a C compiler
has access to our simulated machine, regardless of the actual platform on which
it is running. Another advantage of a simulated architecture is that we can
make it as simple as necessary to illustrate the concepts of code generation. We
need not be concerned with efficiency or completeness. The architecture will be
relatively simple and not cluttered with unnecessary features.
Absolute Mode

    bits:   4    1    3    4    20
    field:  op   0    cmp  r1   s2

Register-displacement Mode

    bits:   4    1    3    4    4    16
    field:  op   1    cmp  r1   r2   d2
The Compare field in either instruction format (cmp) is used only by the
Compare instruction to indicate the kind of comparison to be done on arithmetic
data. In addition to a code of 0, which always sets the flag to True, there are
six valid comparison codes as shown below:
1 == 4 <=
2 < 5 >=
3 > 6 !=
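The fields of an instruction can be packed into a 32-bit word by shifting each field into place. The class and method names in this sketch are ours; the opcode 6 for CMP and the expected word 64100000 come from the example program that follows.

```java
// Sketch: pack an absolute-mode Mini instruction (4-bit op, 1-bit mode,
// 3-bit cmp, 4-bit r1, 20-bit s2) into one 32-bit word.
class MiniEncode {
    static int absolute(int op, int cmp, int r1, int s2) {
        return (op  & 0xF) << 28
             | (0 << 27)            // mode bit 0 selects absolute mode
             | (cmp & 0x7) << 24
             | (r1  & 0xF) << 20
             | (s2  & 0xFFFFF);
    }

    public static void main(String[] args) {
        // CMP r1,Data,4 from the example program below
        System.out.printf("%08X%n", absolute(6, 4, 1, 0)); // prints 64100000
    }
}
```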
The following example of a Mini program will replace the memory word at loca-
tion 0 with its absolute value. The memory contents are shown in hexadecimal,
and program execution is assumed to begin at memory location 1.
Loc Contents
0 00000000 Data 0
1 00100000 CLR r1 Put 0 into register r1.
2 64100000 CMP r1,Data,4 Is 0 <= Data?
3 50000006 JMP Stop If so, finished.
4 20100000 SUB r1,Data If not, find 0-Data.
5 80100000 STO r1,Data
6 90000000 Stop HLT Halt processor
Each atom class is specified with an integer code, and each record may have
up to six fields specifying the atom class, the location of the left operand, the
location of the right operand, the location of the result, a comparison code (for
TST atoms only), and a destination (for JMP, LBL, and TST atoms only). Note
that a JMP atom is an unconditional branch, whereas a JMP instruction is a
inp is used to hold the atom which is being processed. The dest field of an atom
is the destination label for jump instructions, and the actual address is obtained
from the labels table by a function called lookup(). The code generator sends
all instructions to the standard output file as hex characters, so that the user
has the option of discarding them, storing them in a file, or piping them directly
into the Mini simulator. The generated instructions are shown to the right in
Figure 6.11.
The student is encouraged to use, abuse, modify and/or distribute (but not
for profit) the software shown in the Appendix to gain a better understanding
of the operation of the code generator.
6.5.4 Exercises
1. How is the compiler’s task simplified by the fact that floating-point is the
only numeric data type in the Mini architecture?
70100021
10300022
18370002
3. Show the code, in hex, generated by the code generator for each of the
following atom strings. Assume that A and B are stored at locations 0
and 1, respectively. Allocate space for the temporary value T1 at the end
of the program.
(a)
class left right result cmp dest
MUL A B T1 - -
LBL - - - - L1
TST A T1 - 2 L1
JMP - - - - L2
MOV T1 - B - -
LBL - - - - L2
(b)
class left right result cmp dest
NEG A - T1 - -
LBL - - - 0 L1
MOV T1 - B - -
TST B T1 - 4 L1
(c)
class left right result cmp dest
TST A B - 6 L2
JMP - - - - L1
LBL - - - - L2
TST A T1 - 0 L2
LBL - - - - L1
three arguments specifying the operation code and two operands. The code
generator, shown in Appendix B.3, uses a two pass method to handle forward
references.
Chapter 7
Optimization
[Figure 7.1: the back end phases of the compiler (global optimization, code generation, and local optimization), ending with improved object code (instructions)]
The output of the parser is the input to the global optimization phase, the output of the global optimization
phase is the input to the code generator, the output of the code generator is the
input to the local optimization phase, and the output of the local optimization
phase is the final output of the compiler. The three compiler phases shown in
Figure 7.1 make up the back end of the compiler, discussed in Section 6.1.
In this discussion on improving performance, we stress the single most im-
portant property of a compiler - that it preserve the semantics of the source
program. In other words, the purpose and behavior of the object program
should be exactly as specified by the source program for all possible inputs.
There are no conceivable improvements in efficiency which can justify violating
this promise.
Having made this point, there are frequently situations in which the compu-
tation specified by the source program is ambiguous or unclear for a particular
computer architecture. For example, in the expression (a + b) ∗ (c + d) the
compiler will have to decide which addition is to be performed first (assuming
that the target machine has only one Arithmetic and Logic Unit). Most pro-
gramming languages leave this unspecified, and it is entirely up to the compiler to decide which addition is performed first.
7.1.1 Exercises
1. Using a Java compiler,
(a) what would be printed as a result of running the following:
{
int a, b;
b = (a = 2) + (a = 3);
System.out.println ("a is " + a);
}
a = 3 * f(x) ;
(b) What can you conclude about the associativity of addition with
computer arithmetic?
7.2 Global Optimization
As an example of global optimization, consider the Java statement:
a = (b + c) * (b + c) ;
(ADD, b, c, T1)
(ADD, b, c, T2)
(MUL, T1, T2, T3)
(MOV, T3,, a)

Figure 7.2: Atoms put out by the parser for the statement a = (b + c) * (b + c);

(ADD, b, c, T1)
(MUL, T1, T1, a)

Figure 7.3: Optimized atoms for the statement a = (b + c) * (b + c);
The sequence of atoms put out by the parser could conceivably be as shown
in Figure 7.2.
Every time the parser finds a correctly formed addition operation with two
operands it blindly puts out an ADD atom, whether or not this is necessary. In
the above example, it is clearly not necessary to evaluate the sum b + c twice.
In addition, the MOV atom is not necessary because the MUL atom could store
its result directly into the variable a. The atom sequence shown in Figure 7.3 is
equivalent to the one given in Figure 7.2, but requires only two atoms because
it makes use of common subexpressions and it stores the result in the variable
a, rather than a temporary location.
In this section, we will demonstrate some techniques for implementing these
optimization improvements to the atoms put out by the parser. These im-
provements will result in programs which are both smaller and faster, i.e., they
optimize in both space and time.
It is important to recognize that these optimizations would not have been
possible if there had been intervening Label or Jump atoms in the parser output.
For example, if the atom sequence had been as shown in Figure 7.4, we could
not have optimized to the sequence of Figure 7.3, because there could be atoms
which jump into this code at Label L1, thus altering our assumptions about
the values of the variables and temporary locations. (The atoms in Figure 7.4
do not result from the given Java statement, and the example is, admittedly,
artificially contrived to make the point that Label atoms will affect our ability
to optimize.)
By the same reasoning, Jump or Branch atoms will interfere with our ability
to make these optimizing transformations to the atom sequence. In Figure 7.4
the MUL atom cannot store its result into the variable a, because the compiler
does not know whether the conditional branch will be taken.
The optimization techniques which we will demonstrate can be effected only
in certain subsequences of the atom string, which we call basic blocks. A basic
block is a section of atoms which contains no Label or branch atoms (i.e., LBL,
TST, JMP). In Figure 7.5, we show that the atom sequence of Figure 7.4 is
divided into three basic blocks.
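The division into basic blocks can be sketched as a single scan in which LBL, TST, and JMP atoms act as separators. The class and method names are ours, and atoms are represented as plain strings for brevity.

```java
import java.util.*;

// Sketch: split an atom string into basic blocks. Label and branch
// atoms (LBL, TST, JMP) end the current block and are not placed in
// any block, since a basic block contains no label or branch atoms.
class BasicBlocks {
    static List<List<String>> split(List<String> atoms) {
        List<List<String>> blocks = new ArrayList<>();
        List<String> current = new ArrayList<>();
        for (String atom : atoms) {
            if (atom.startsWith("(LBL") || atom.startsWith("(TST")
                    || atom.startsWith("(JMP")) {
                if (!current.isEmpty()) { blocks.add(current); current = new ArrayList<>(); }
            } else {
                current.add(atom);
            }
        }
        if (!current.isEmpty()) blocks.add(current);
        return blocks;
    }

    public static void main(String[] args) {
        List<String> atoms = List.of("(ADD, b, c, T1)", "(LBL, L1)",
            "(ADD, b, c, T2)", "(MUL, T1, T2, T3)",
            "(TST, b, c,, 1, L3)", "(MOV, T3,, a)");
        System.out.println(split(atoms).size());  // prints 3, as in Figure 7.5
    }
}
```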
(ADD, b, c, T1)
(LBL, L1)
(ADD, b, c, T2)
(MUL, T1, T2, T3)
(TST, b, c,, 1, L3)
(MOV, T3,, a)
Each basic block is optimized as a separate entity. There are more advanced
techniques which permit optimization across basic blocks, but they are beyond
the scope of this text. We use a Directed Acyclic Graph, or DAG, to implement
this optimization. The DAG is directed because the arcs have arrows indicating
the direction of the arcs, and it is acyclic because there is no path leading from
a node back to itself (i.e., it has no cycles). The DAG is similar to a syntax tree,
but it is not truly a tree because some nodes may have more than one parent
and also because the children of a node need not be distinct. An example of a
DAG, in which interior nodes are labeled with operations, and leaf nodes are
labeled with operands is shown in Figure 7.6.
Each of the operations in Figure 7.6 is a binary operation (i.e., each operation
has two operands), consequently each interior node has two arcs pointing to the
two operands. Note that in general we will distinguish between the left and
right arc because we need to distinguish between the left and right operands
of an operation (this is certainly true for subtraction and division, which are
not commutative operations). We will be careful to draw the DAGs so that it
is always clear which arc represents the left operand and which arc represents
the right operand. For example, in Figure 7.6 the left operand of the addition
labeled T3 is T2, and the right operand is T1. Our plan is to show how to build
a DAG from an atom sequence, from which we can then optimize the atom
sequence.
We will begin by building DAGs for simple arithmetic expressions. DAGs
can also be used to optimize complete assignment statements and blocks of
statements, but we will not take the time to do that here. To build a DAG,
given a sequence of atoms representing an arithmetic expression with binary
operations, we use the following algorithm:
[diagram: a DAG in which the interior nodes, labeled T3 (+), T2 (+), and T1 (*), share the leaf operands a and b; the left operand of the addition labeled T3 is T2 and its right operand is T1]
Figure 7.6: An example of a DAG
1. Read an atom.
2. If the operation and operands match part of the existing DAG (i.e., if they
form a sub DAG), then add the result Label to the list of Labels on the
parent and repeat from Step 1. Otherwise, allocate a new node for each
operand that is not already in the DAG, and a node for the operation.
Label the operation node with the name of the result of the operation.
3. Connect the operation node to the two operands with directed arcs, so
that it is clear which operand is the left and which is the right.
4. Repeat from Step 1.
(MUL, a, b, T1)
(MUL, a, b, T2)
(ADD, T1, T2, T3)
(MUL, a, b, T4)
(ADD, T3, T4, T5)
We follow the algorithm to build the DAG, as shown in Figure 7.7, in which
we show how the DAG is constructed as each atom is processed.
The DAG is a graphical representation of the computation needed to evaluate
the original expression in which we have identified common subexpressions. For
example, the expression a * b occurs three times in the original expression a
* b + a * b + a * b. The three atoms corresponding to these subexpressions
store results into T1, T2, and T4. Since the computation need be done only
once, these three atoms are combined into one node in the DAG labeled T1.2.4.
[diagram: the DAG as it is constructed from each atom in turn; the three MUL atoms for a * b are merged into a single node labeled T1.2.4, and the two ADD atoms become the nodes labeled T3 and T5]
Figure 7.7: Building the DAG for a * b + a * b + a * b
After that point, any atom which uses T1, T2, or T4 as an operand will point
to T1.2.4.
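The DAG construction amounts to a form of value numbering: an atom whose operation and (possibly renamed) operands match an existing node is merged with that node, and later uses of its result are redirected. The sketch below is ours; for simplicity a merged node keeps its first temporary name (T1) rather than a combined label such as T1.2.4, and it applies only within a single basic block.

```java
import java.util.*;

// Sketch: eliminate common subexpressions in a basic block by merging
// atoms that correspond to the same DAG node.
class DagCse {
    record Atom(String op, String left, String right, String result) {}

    static List<Atom> optimize(List<Atom> block) {
        Map<String, String> node  = new HashMap<>(); // "op,l,r" -> result label
        Map<String, String> alias = new HashMap<>(); // merged temp -> survivor
        List<Atom> out = new ArrayList<>();
        for (Atom a : block) {
            String l = alias.getOrDefault(a.left(), a.left());
            String r = alias.getOrDefault(a.right(), a.right());
            String key = a.op() + "," + l + "," + r;
            String survivor = node.get(key);
            if (survivor != null)                 // matches part of the DAG
                alias.put(a.result(), survivor);  // redirect later uses
            else {
                node.put(key, a.result());        // a new node in the DAG
                out.add(new Atom(a.op(), l, r, a.result()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Atom> atoms = List.of(               // a * b + a * b + a * b
            new Atom("MUL", "a", "b", "T1"),
            new Atom("MUL", "a", "b", "T2"),
            new Atom("ADD", "T1", "T2", "T3"),
            new Atom("MUL", "a", "b", "T4"),
            new Atom("ADD", "T3", "T4", "T5"));
        optimize(atoms).forEach(System.out::println);
    }
}
```

The five input atoms reduce to three, computing a * b only once, which matches the structure of the atoms derived from the DAG in the text (with different temporary names).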
We are now ready to convert the DAG to a basic block of atoms. The algo-
rithm given below will generate atoms (in reverse order) in which all common
subexpressions are evaluated only once:
1. Choose any node having no incoming arcs (initially there should be only
one such node, representing the value of the entire expression).
2. Put out an atom for this node's operation, using the labels of its children as operands and the node's label as the result.
3. Delete this node and its outgoing arcs from the DAG.
4. Repeat from Step 1 as long as there are still operation nodes remaining in
the DAG.
Construct the DAG and show the optimized sequence of atoms for
the Java expression (a - b) * c + d * (a - b) * c. The atoms
produced by the parser are shown below:
(SUB, a, b, T1)
(MUL, T1, c, T2)
(SUB, a, b, T3)
(MUL, d, T3, T4)
(MUL, T4, c, T5)
(ADD, T2, T5, T6)
[diagram: the DAG as each node with no incoming arcs is removed, generating atoms in reverse order, beginning with (ADD, T3, T1.2.4, T5)]
Figure 7.8: Generating atoms from the DAG for a * b + a * b + a * b
Solution:
(SUB, a, b, T1.3)
(MUL, d, T1.3, T4)
(MUL, T4, c, T5)
(MUL, T1.3, c, T2)
(ADD, T2, T5, T6)
[diagram: the DAG for (a - b) * c + d * (a - b) * c, in which the two occurrences of the subexpression a - b are merged into a single node labeled T1.3]
(JMP, L1)
(MUL, a, b, T1)
(SUB, T1, c, T2)
(ADD, T2, d, T3)
(LBL, L2)
Thus, the three atoms following the JMP and preceding the LBL can all be
removed from the program without changing the purpose of the program:
{
a = b + c * d; // This statement has no effect and can be removed.
b = c * d / 3;
c = b - 3;
a = b - c;
System.out.println (a + b + c);
}
(JMP, L1)
(LBL, L2)
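This removal can be sketched as a single pass which stops emitting atoms after an unconditional JMP and resumes at the next LBL. The names are ours; a fuller implementation would also remove labels that are never the target of any jump.

```java
import java.util.*;

// Sketch: drop unreachable atoms (those after an unconditional JMP and
// before the next LBL, which could be jumped to from elsewhere).
class Unreachable {
    static List<String> removeUnreachable(List<String> atoms) {
        List<String> out = new ArrayList<>();
        boolean reachable = true;
        for (String atom : atoms) {
            if (atom.startsWith("(LBL")) reachable = true; // possible jump target
            if (reachable) out.add(atom);
            if (atom.startsWith("(JMP")) reachable = false;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> atoms = List.of("(JMP, L1)", "(MUL, a, b, T1)",
            "(SUB, T1, c, T2)", "(ADD, T2, d, T3)", "(LBL, L2)");
        removeUnreachable(atoms).forEach(System.out::println);
    }
}
```

For the atom sequence above, only the JMP and the LBL survive, as in the text.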
{
for (int i=0; i<1000; i++)
{ a = sqrt (x); // loop invariant
vector[i] = i * a;
}
}
{
a = sqrt (x); // loop invariant
for (int i=0; i<1000; i++)
{
vector[i] = i * a;
}
}
{
a = 2 * 3; // a must be 6
b = c + a * a; // a*a must be 36
}
{
a = 6; // a must be 6
b = c + 36; // a*a must be 36
}
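Constant folding itself can be sketched as a check on each arithmetic atom: if both operands are numeric literals, the operation is performed at compile time and the atom is replaced by a MOV of the computed value. FoldDemo and fold are hypothetical names; doubles are used since the case-study architecture has only floating-point data.

```java
// Sketch: fold an arithmetic atom whose operands are both constants
// into a MOV atom; leave any other atom unchanged.
class FoldDemo {
    static String fold(String op, String left, String right, String result) {
        try {
            double l = Double.parseDouble(left);
            double r = Double.parseDouble(right);
            double v = switch (op) {
                case "ADD" -> l + r;
                case "SUB" -> l - r;
                case "MUL" -> l * r;
                case "DIV" -> l / r;
                default -> throw new NumberFormatException();
            };
            return "MOV," + v + ",," + result;   // computed at compile time
        } catch (NumberFormatException e) {
            return op + "," + left + "," + right + "," + result;
        }
    }

    public static void main(String[] args) {
        System.out.println(fold("ADD", "2", "3", "z")); // foldable: MOV,5.0,,z
        System.out.println(fold("ADD", "x", "y", "z")); // not foldable
    }
}
```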
a + b == b + a                Addition is commutative
(a + b) + c == a + (b + c)    Addition is associative
a * (b + c) == a * b + a * c  Multiplication distributes over addition
(LBL, L1)
(TST, a, b,, 1, L2)
(SUB, a, 1, a)
(MUL, x, 2, b)
(ADD, x, y, z)
(ADD, 2, 3, z)
(JMP, L1)
(SUB, a, b, a)
(MUL, x, 2, z)
(LBL, L2)
Solution:
(LBL, L1)
(TST, a, b,, 1, L2)
(SUB, a, 1, a)
(MUL, x, 2, b) Reduction in strength
(ADD, x, y, z) Elimination of dead code
(ADD, 2, 3, z) Constant folding, loop invariant
(JMP, L1)
(SUB, a, b, a) Unreachable code
(MUL, x, 2, z) Unreachable code
(LBL, L2)
(MOV, 5,, z)
(LBL, L1)
(TST, a, b,, 1, L2)
(SUB, a, 1, a)
(ADD, x, x, b)
(JMP, L1)
(LBL, L2)
7.2.3 Exercises
1. Eliminate common subexpressions from each of the following strings of
atoms, using DAGs as shown in Sample Problem ?? (we also give the
Java expressions from which the atom strings were generated):
(a) (b + c) * d * (b + c)
(ADD, b, c, T1)
(MUL, T1, d, T2)
(ADD, b, c, T3)
(MUL, T2, T3, T4)
(b) (a + b) * c / ((a + b) * c - d)
(ADD, a, b, T1)
(MUL, T1, c, T2)
(ADD, a, b, T3)
(MUL, T3, c, T4)
(SUB, T4, d, T5)
(DIV, T2, T5, T6)
(c) (a + b) * (a + b) - (a + b) * (a + b)
(ADD, a, b, T1)
(ADD, a, b, T2)
(MUL, T1, T2, T3)
(ADD, a, b, T4)
(ADD, a, b, T5)
(MUL, T4, T5, T6)
(SUB, T3, T6, T7)
(d) ((a + b) + c) / (a + b + c) - (a + b + c)
(ADD, a, b, T1)
(ADD, T1, c, T2)
(ADD, a, b, T3)
(ADD, T3, c, T4)
(DIV, T2, T4, T5)
(ADD, a, b, T6)
(ADD, T6, c, T7)
(SUB, T5, T7, T8)
(e) a / b - c / d - e / f
(DIV, a, b, T1)
(DIV, c, d, T2)
(SUB, T1, T2, T3)
(DIV, e, f, T4)
(SUB, T3, T4, T5)
2. How many different atom sequences can be generated from the DAG given
in your response to Problem 1 (e), above?
3. In each of the following sequences of atoms, eliminate the unreachable
atoms: (a)
(ADD, a, b, T1)
(LBL, L1)
(SUB, b, a, b)
(TST, a, b,, 1, L1)
(ADD, a, b, T3)
(JMP, L1)
(b)
(ADD, a, b, T1)
(LBL, L1)
(SUB, b, a, b)
(JMP, L1)
(ADD, a, b, T3)
(LBL, L2)
(c)
(JMP, L2)
(ADD, a, b, T1)
(TST, a, b,, 3, L2)
(SUB, b, b, T3)
(LBL, L2)
(MUL, a, b, T4)
4. Eliminate dead code from each of the following functions:
(a)
int f (int d)
{ int a,b,c;
a = 3;
b = 4;
d = a * b + d;
return d;
}
(b)
int f (int d)
{ int a,b,c;
a = 3;
b = 4;
c = a + b;
d = a + b;
a = b + c * d;
b = a + c;
return d;
}
(b)
(ADD, a, b, T1)
(SUB, T1, c, T2)
The code generator would then produce the following instructions from the
atoms:
LOD R1,a
ADD R1,b
STO R1,T1
LOD R1,T1
SUB R1,c
STO R1,T2
Notice that the third and fourth instructions in this sequence are entirely
unnecessary since the value being stored and loaded is already at its destina-
tion. The above sequence of six instructions can be optimized to the following
sequence of four instructions by eliminating the intermediate Load and Store
instructions as shown below:
LOD R1,a
ADD R1,b
SUB R1,c
STO R1,T2
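This Load/Store elimination can be sketched as a peephole scan over adjacent instruction pairs. The sketch is ours, not the book's code generator; it assumes, as in the example above, that the stored temporary is not referenced again later.

```java
import java.util.*;

// Sketch: remove a STO to a temporary that is immediately reloaded
// into the same register by the next instruction.
class Peephole {
    static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < code.size(); i++) {
            if (i + 1 < code.size()
                    && code.get(i).startsWith("STO ")
                    && code.get(i + 1).startsWith("LOD ")
                    && operands(code.get(i)).equals(operands(code.get(i + 1)))) {
                i++;          // drop both the STO and the matching LOD
                continue;
            }
            out.add(code.get(i));
        }
        return out;
    }

    static String operands(String instr) { return instr.substring(4); }

    public static void main(String[] args) {
        List<String> code = List.of("LOD R1,a", "ADD R1,b", "STO R1,T1",
                                    "LOD R1,T1", "SUB R1,c", "STO R1,T2");
        optimize(code).forEach(System.out::println); // the four instructions above
    }
}
```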
A reading of this atom stream is "Test for a greater than b, and if true, jump
to the assignment. Otherwise, jump around the assignment." The reason for
this somewhat convoluted logic is that the TST atom uses the same comparison
code found in the expression. The instructions generated by the code generator
from this atom stream would be:
LOD R1,a
CMP R1,b,3 //Is R1 > b?
JMP L1
CMP 0,0,0 // Unconditional Jump
JMP L2
L1:
LOD R1,b
STO R1,a
L2:
This jump over jump can be eliminated by reversing the sense of the comparison, so that a single conditional jump bypasses the assignment:
LOD R1,a
CMP R1,b,4 // Is R1 <= b?
JMP L1
LOD R1,b
STO R1,a
L1:
Another local optimization, called simple algebraic optimization, eliminates instructions which add 0 to, or multiply 1 by, a number, such as:
MUL R1, 1
ADD R1, 0
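These simple algebraic optimizations can be sketched as a filter over the generated instructions. The textual instruction format and names are ours; a real code generator would work on a structured representation rather than strings.

```java
import java.util.*;

// Sketch: remove instructions which multiply a register by 1 or add 0
// to it, since they cannot change the register's value.
class Algebraic {
    static List<String> optimize(List<String> code) {
        List<String> out = new ArrayList<>();
        for (String instr : code)
            if (!(instr.startsWith("MUL") && instr.endsWith(", 1"))
                    && !(instr.startsWith("ADD") && instr.endsWith(", 0")))
                out.add(instr);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(optimize(List.of("MUL R1, 1", "ADD R1, 0", "SUB R1,c")));
    }
}
```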
7.3.1 Exercises
1. Optimize each of the following code segments for unnecessary Load/Store
instructions:
(a) (b)
LOD R1,a LOD R1,a
ADD R1,b LOD R2,c
STO R1,T1 ADD R1,b
LOD R1,T1 ADD R2,b
SUB R1,c STO R2,T1
STO R1,T2 ADD R1,c
LOD R1,T2 LOD R2,T1
STO R1,d STO R1,T2
STO R2,c
2. Optimize each of the following code segments for unnecessary jump over
jump instructions:
(a) (b)
CMP R1,a,1 CMP R1,a,5
JMP L1 JMP L1
CMP 0,0,0 CMP 0,0,0
JMP L2 JMP L2
L1: L1:
ADD R1,R2 SUB R1,a
L2: L2:
(c)
L1:
ADD R1,R2
CMP R1,R2,3
JMP L2
CMP 0,0,0
JMP L1
L2:
3. Use any of the local optimization methods of this section to optimize the
following code segment:
Glossary
balanced binary search tree - A binary search tree in which the difference
in the heights of both subtrees of each node does not exceed a given constant.
basic block - A group of atoms or intermediate code which contains no
label or branch code.
binary search tree - A connected data structure in which each node has,
at most, two links and there are no cycles; it must also have the property that
the nodes are ordered, with all of the nodes in the left subtree preceding the
node, and all of the nodes in the right subtree following the node.
bison - A public domain version of yacc.
bootstrapping - The process of using a program as input to itself, as in
compiler development, through a series of increasingly larger subsets of the
source language.
bottom up parsing - Finding the structure of a string in a way that
produces or traverses the derivation tree from bottom to top.
byte code - The intermediate form put out by a Java compiler.
closure - Another term for the Kleene * operation.
code generation - The phase of the compiler which produces machine
language object code from syntax trees or atoms.
comment - Text in a source program which is ignored by the compiler, and
is for the programmer’s reference only.
compile time - The time at which a program is compiled, as opposed to
run time. Also, the time required for compilation of a program.
compiler - A software translator which accepts, as input, a program writ-
ten in a particular high-level language and produces, as output, an equivalent
program in machine language for a particular machine.
compiler-compiler - A program which accepts, as input, the specifications
of a programming language and the specifications of a target machine, and
produces, as output, a compiler for the specified language and machine.
conflict - In bottom up parsing, the failure of the algorithm to find an
appropriate shift or reduce operation.
constant folding - An optimization technique which involves detecting
operations on constants, which could be done at compile time rather than at
run time.
context-free grammar - A grammar in which the left side of each rule
consists of a nonterminal being rewritten (type 2).
context-free language - A language which can be specified by a context-
free grammar.
context-sensitive grammar - A grammar in which the left side of each
rule consists of a nonterminal being rewritten, along with left and right context,
which may be null (type 1).
context-sensitive language - A language which can be specified by a
context-sensitive grammar.
conventional machine language - The language in which a computer
architecture can be programmed, as distinguished from a microcode language.
cross compiling - The process of generating a compiler for a new computer
architecture, automatically.
DAG - Directed acyclic graph.
data flow analysis - A formal method for tracing the way information
about data objects flows through a program, used in optimization.
dead code - Code, usually in an intermediate code string, which can be
removed because it has no effect on the output or final results of a program.
derivation - A sequence of applications of rewriting rules of a grammar,
beginning with the starting nonterminal and ending with a string of terminal
symbols.
derivation tree - A tree showing a derivation for a context-free grammar, in
which the interior nodes represent nonterminal symbols and the leaves represent
terminal symbols.
deterministic - Having the property that every operation can be completely
and uniquely determined, given the inputs (as applied to a machine).
deterministic context-free language - A context-free language which
can be accepted by a deterministic pushdown machine.
directed acyclic graph (DAG) - A graph consisting of nodes connected
with one-directional arcs, in which there is no path from any node back to itself.
disjoint - Not intersecting.
embedded actions - In a yacc grammar rule, an action which is not at the
end of the rule.
empty set - The set containing no elements.
endmarker - A symbol, N, used to mark the end of an input string (used
here with pushdown machines).
equivalent grammars - Grammars which specify the same language.
equivalent programs - Programs which have the same input/output rela-
tion.
example (of a nonterminal) - A string of input symbols which may be
derived from a particular nonterminal.
expression - A language construct consisting of an operation and zero, one,
or two operands, each of which may be an object or expression.
shift operation - The operation of pushing an input symbol onto the pars-
ing stack, and advancing to the next input symbol, in a bottom up parsing
algorithm.
shift reduce parser - A bottom up parsing algorithm which uses a sequence
of shift and reduce operations to transform an acceptable input string to the
starting nonterminal of a given grammar.
shift/reduce conflict - In bottom up parsing, the failure of the algorithm to
determine whether a shift or reduce operation is to be performed in a particular
stack and input configuration.
simple algebraic optimization - The elimination of instructions which
add 0 to or multiply 1 by a number.
simple grammar - A grammar in which the right side of every rule begins
with a terminal symbol, and all rules defining the same nonterminal begin with
a different terminal.
simple language - A language which can be described with a simple gram-
mar.
single pass code generator - A code generator which keeps a fixup table
for forward references, and thus needs to read the intermediate code string only
once.
single pass compiler - A compiler which scans the source program only
once.
source language - The language in which programs may be written and
used as input to a compiler.
source program - A program in the source language, intended as input to
a compiler.
state - A machine’s status, or memory/register values. Also, in SableCC,
the present status of the scanner.
States - The section of a SableCC specification in which lexical states may
be defined.
starting nonterminal - The nonterminal in a grammar from which all
derivations begin.
stdin - In Unix or MSDOS, the standard input file, normally directed to the
keyboard.
stdout - In Unix or MSDOS, the standard output file, normally directed to
the user.
string - A list or sequence of characters from a given alphabet.
string space - A memory buffer used to store string constants and possibly
identifier names or key words.
symbol table - A data structure used to store identifiers and possibly other
lexical entities during compilation.
syntax - The specification of correctly formed strings in a language, or the
correctly formed programs of a programming language.
syntax analysis - The phase of the compiler which checks for syntax errors
in the source program, using, as input, tokens put out by the lexical phase and
producing, as output, a stream of atoms or syntax trees.
syntax directed translation - A translation in which a parser or syntax
specification is used to specify output as well as syntax.
syntax tree - A tree data structure showing the structure of a source pro-
gram or statement, in which the leaves represent operands, and the internal
nodes represent operations or control structures.
synthesized attributes - Those attributes in an attributed grammar which
receive values from lower nodes in the derivation tree.
target machine - The machine for which the output of a compiler is in-
tended.
terminal symbol - A symbol in the input alphabet of a language specified
by a grammar.
token - The output of the lexical analyzer representing a single word in the
source program.
Tokens - The section of a SableCC specification in which tokens are defined.
top down parsing - Finding the structure of a string in a way that produces
or traverses the derivation tree from top to bottom.
Translation class - An extension of the DepthFirstAdapter class generated
by SableCC. It is used to implement actions in a translation grammar.
translation grammar - A grammar which specifies output for some or all
input strings.
underlying grammar - The grammar resulting when all action symbols
are removed from a translation grammar.
unreachable code - Code, usually in an intermediate code string, which
can never be executed.
unrestricted grammar - A grammar in which there are no restrictions on
the form of the rewriting rules (type 0).
unrestricted language - A language which can be specified by an unre-
stricted grammar.
white space - Blank, tab, or newline characters which appear as nothing
on an output device.
Appendix A - Decaf Grammar
class AClass {
public static void main (String[] args)
{float cos, x, n, term, eps, alt;
// compute the cosine of x to within tolerance eps
// use an alternating series
x = 3.14159;
eps = 0.1;
n = 1;
cos = 1;
term = 1;
alt = -1;
while (term>eps)
{
term = term * x * x / n / (n+1);
cos = cos + alt * term;
alt = -alt;
n = n + 2;
}
}
}
Figure A.1: Decaf program to compute the cosine function using a Taylor series
Appendix B - Decaf Compiler
These are all plain text files, so you should be able to simply choose File |
Save As from your browser window. Create a subdirectory named decaf, and
download the files *.java to the decaf subdirectory. Download all other files into
your current directory (i.e. the parent of decaf).
To build the Decaf compiler, there are two steps (from the directory con-
taining decaf.grammar). First generate the parser, lexer, analysis, and node
classes (the exact form of this command could depend on how SableCC has
been installed on your system):
$ sablecc decaf.grammar
The second step is to compile the java classes that were not generated by
SableCC:
$ javac decaf/*.java
You now have a Decaf compiler; the main method is in decaf/Compiler.class.
To compile a decaf program, say cos.decaf, invoke the compiler, and redirect
stdin to the decaf source file:
$ java decaf.Compiler < cos.decaf
This will create a file named atoms, which is the result of translating
cos.decaf into atoms. To create machine code for the mini architecture, and
execute it with a simulator, you will need to compile the code generator, and
the simulator, both of which are written in standard C:
$ cc gen.c -o gen
$ cc mini.c -o mini
Now you can generate mini code, and execute it. Simply invoke the code
generator, which reads from the file atoms, and writes mini instructions to
stdout. You can pipe these instructions into the mini machine simulator:
$ gen | mini
The above can be simplified by using the scripts provided. To compile the
file cos.decaf, without executing, use the compile script:
$ compile cos
To compile and execute, use the compileAndGo script:
$ compileAndGo cos
This software was developed and tested using a Sun V480 running Solaris
10. The reader is welcome to adapt it for use on Windows and other systems.
A flow graph indicating the relationships of these files is shown in Figure B.2
in which input and output file names are shown in rectangles. The source
files are shown in Appendix B.2, below, with updated versions available at
http://cs.rowan.edu/~bergmann/books.
[Figure B.2: flow graph showing language.grammar processed by sablecc; Compiler.java and Translation.java compiled by javac to Compiler.class; java running Compiler.class with aProgram.decaf on stdin; gen producing mini code on stdout; and mini producing the simulation display]
// decaf.grammar
// SableCC grammar for decaf, a subset of Java.
// March 2003, sdb
Package decaf;
Helpers // Examples
letter = ['a'..'z'] | ['A'..'Z'] ; // w
digit = ['0'..'9'] ; // 3
digits = digit+ ; // 2040099
exp = ['e' + 'E'] ['+' + '-']? digits; // E-34
newline = [10 + 13] ;
non_star = [[0..0xffff] - '*'];
non_slash = [[0..0xffff] - '/'];
non_star_slash = [[0..0xffff] - ['*' + '/']];
Tokens
comment1 = '//' [[0..0xffff]-newline]* newline ;
comment2 = '/*' non_star* '*'
(non_star_slash non_star* '*'+)* '/' ;
r_bracket = ']' ;
comma = ',' ;
semi = ';' ;
identifier = letter (letter | digit | '_')* ;
number = (digits '.'? digits? | '.'digits) exp? ;
// Example: 2.043e+5
misc = [0..0xffff] ;
Ignored Tokens
comment1, comment2, space;
Productions
assign_expr? r_par
stmt_no_short_if
;
The file Translation.java is the Java class which visits every node in the
syntax tree and produces a file of atoms and a file of constants. It is described
in section 5.5.
// Translation.java
// Translation class for decaf, a subset of Java.
// Output atoms from syntax tree
// sdb March 2003
// sdb updated May 2007
// to use generic maps instead of hashtables.
package decaf;
import decaf.analysis.*;
import decaf.node.*;
import java.util.*;
import java.io.*;
// All stored values are doubles, key=node, value is memory loc or label number
Map <Node, Integer> hash = new HashMap <Node, Integer> (); // May 2007
AtomFile out;
//////////////////////////////////////////////////
// Definition of Program
//////////////////////////////////////////////////
// Definitions of declaration and identlist
//////////////////////////////////////////////////
// Definition of for_stmt
//////////////////////////////////////////////////
// Definition of while_stmt
/////////////////////////////////////////////
// Definition of if_stmt
outAIfElseStmt (node);
}
///////////////////////////////////////////////////
// Definition of bool_expr
////////////////////////////////////////////////
// Definition of expr
////////////////////////////////////////////////
// Definition of assign_expr
////////////////////////////////////////////////
// Definition of rvalue
////////////////////////////////////////////////
// Definition of term
////////////////////////////////////////////////
// Definition of factor
///////////////////////////////////////////////////////////////////
// Send the run-time memory constants to a file for use by the code generator.
void outConstants()
{ FileOutputStream fos = null;
DataOutputStream ds = null;
int i;
try
{ fos = new FileOutputStream ("constants");
ds = new DataOutputStream (fos);
}
catch (IOException ioe)
{ System.err.println ("IO error opening constants file for output: "
+ ioe);
}
try
{ for (i=0; i<=memHigh ; i++)
if (memory[i]==null) ds.writeDouble (0.0); // a variable is bound here
else
ds.writeDouble (memory[i].doubleValue());
}
catch (IOException ioe)
{ System.err.println ("IO error writing to constants file: "
+ ioe);
}
try { fos.close(); }
catch (IOException ioe)
{ System.err.println ("IO error closing constants file: "
+ ioe);
}
}
//////////////////////////////////////////////////////////
// Put out atoms for conversion to machine code.
// These methods display to stdout, and also write to a
// binary file of atoms suitable as input to the code generator.
void atom (String atomClass, Integer left, Integer right, Integer result)
{ System.out.println (atomClass + " T" + left + " T" + right + " T" +
result);
Atom atom = new Atom (atomClass, left, right, result);
atom.write(out);
}
void atom (String atomClass, Integer left, Integer right, Integer result,
Integer cmp, Integer lbl)
{ System.out.println (atomClass + " T" + left + " T" + right + " T" +
result + " C" + cmp + " L" + lbl);
Atom atom = new Atom (atomClass, left, right, result, cmp, lbl);
atom.write(out);
}
Integer alloc()
{ return new Integer (++avail); }
Integer lalloc()
{ return new Integer (++lavail); }
The file Compiler.java defines the Java class whose main method invokes
the parser to get things started. It is described in section 5.5.
// Compiler.java
// main method which invokes the parser and reads from
// stdin
// March 2003 sdb
package decaf;
import decaf.parser.*;
import decaf.lexer.*;
import decaf.node.*;
import java.io.*;
System.out.println();
}
catch(Exception e)
{ System.out.println(e.getMessage()); }
}
}
The file Atom.java defines the Java class Atom, which describes an Atom
and permits it to be written to an output file. It is described in section 5.5.
// Atom.java
// Define an atom for output by the Translation class
package decaf;
import java.io.*;
class Atom
// Put out atoms to a binary file in the format expected by
// the old code generator.
{
static final int ADD = 1;
static final int SUB = 2;
int cls;
int left;
int right;
int result;
int cmp;
int lbl;
The file AtomFile.java defines the Java class AtomFile, which permits output
of an Atom to a file.
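One portability detail worth noting: DataOutputStream writes its values in big-endian byte order, which happens to match the native order of the SPARC-based Sun V480 mentioned above, so the C code generator can presumably fread the atom file directly. A C reader on a little-endian machine would instead have to reassemble each 32-bit value explicitly; a sketch:

```c
/* Reassemble a 32-bit big-endian value (the byte order DataOutputStream
   uses) from a byte buffer, independent of the host's own endianness. */
unsigned long be32(const unsigned char b[4])
{
    return ((unsigned long)b[0] << 24) | ((unsigned long)b[1] << 16)
         | ((unsigned long)b[2] <<  8) |  (unsigned long)b[3];
}
```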
// AtomFile.java
// Create the binary output file for atoms
// March 2003 sdb
package decaf;
import java.io.*;
class AtomFile
{
FileOutputStream fos;
DataOutputStream ds;
String fileName;
void close()
{ try
{ ds.close(); }
catch (IOException ioe)
{ System.err.println ("IO error closing atom file (" + fileName + "): " + ioe);
}
}
}
Program variables and temporary storage are located beginning at memory lo-
cation 0; consequently, the Mini machine simulator needs to know where the first
instruction is located. The function out_mem() sends the constants which have
been stored in the target machine memory to stdout. The function dump_atom()
is included for debugging purposes only; the student may use it to examine the
atoms produced by the parser.
The code generator solves the problem of forward jump references by making
two passes over the input atoms. The first pass is implemented with a function
named build_labels(), which builds a table of labels (a one-dimensional array)
associating a machine address with each label.
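The idea behind this first pass can be sketched as follows. The atom layout here is simplified for illustration (the real structure is declared in miniC.h): each non-label atom is assumed to emit exactly one machine instruction, and each LBL atom records the current machine address under its label number.

```c
#define MAXL 100

/* Simplified atom: an operation code and a destination label number. */
enum { LBL, JMP, OTHER };
struct atom { int op; int dest; };

int labels[MAXL];   /* label number -> machine address */

/* Pass 1 of the code generator: walk the atoms, counting one machine
   address per non-label atom, and record the address of each label so
   that pass 2 can emit jumps to labels defined later in the input. */
void build_labels(struct atom a[], int n)
{
    int pc = 0, i;
    for (i = 0; i < n; i++)
        if (a[i].op == LBL)
            labels[a[i].dest] = pc;   /* label marks the next instruction */
        else
            pc++;                     /* this atom will emit one instruction */
}
```

With the table filled in, a forward jump such as JMP to label 1 can be emitted on the second pass even though label 1 appears later in the atom stream.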
The file of atoms is closed and reopened for the second pass, which is
implemented with a switch statement on the input atom class. The impor-
tant function involved here is called gen(), and it actually generates a Mini
machine instruction, given the operation code (atom class codes and corre-
sponding machine operation codes are the same whenever possible), register
number, memory operand address (all addressing is absolute), and a com-
parison code for compare instructions. Register allocation is kept as simple
as possible by always using floating-point register 1, and storing all results
in temporary locations. The source code for the code generator, from the
file gen.c, is shown below. For an updated version of this source code, see
http://cs.rowan.edu/~bergmann/books.
/* gen.c
Code generator for mini architecture.
Input should be a file of atoms, named "atoms"
********************************************************
Modified March 2003 to work with decaf as well as miniC.
sdb
*/
#define NULL 0
#include "mini.h"
#include "miniC.h"
/* code_gen () */
void main() /* March 2003, for Java. sdb */
{ int r;
break;
case MOV: gen (LOD, r=regalloc(), inp.left);
gen (STO, r, inp.result);
break;
}
get_atom();
}
gen (HLT);
}
get_atom()
/* read an atom from the file of atoms into inp */
/* ok indicates that an atom was actually read */
{ int n;
dump_atom()
{ printf ("op: %d left: %04x right: %04x result: %04x cmp: %d dest: %d\n",
inp.op, inp.left, inp.right, inp.result, inp.cmp, inp.dest); }
int regalloc ()
/* allocate a register for use in an instruction */
{ return 1; }
build_labels()
/* Build a table of label values on the first pass */
{
get_atom();
while (ok)
{
if (inp.op==LBL)
labels[inp.dest] = pc;
out_mem()
/* send target machine memory contents to stdout. this is the beginning of the object file, to b
{
ADDRESS i;
data_file_ptr = fopen ("constants","rb");
void get_data()
/* March 2003 sdb
read a number from the file of constants into inp_const
and store into memory.
*/
{ int i,status=1;
double inp_const;
for (i=0; status; i++)
{ status = fread (&inp_const, sizeof (double), 1, data_file_ptr);
memory[i].data = inp_const ;
}
end_data = i-1;
}
The Mini machine simulator is simply a C program stored in the file mini.c.
It reads instructions and data in the form of hex characters from the standard
input file, stdin. The instruction format is as specified in Section 6.5.1, and is
specified with a structure called fmt in the header file, mini.h.
The simulator begins by calling the function boot(), which loads the Mini
machine memory from the values in the standard input file, stdin, one memory
location per line. These values are numeric constants, and zeroes for program
variables and temporary locations. The boot() function also initializes the
program counter, PC (register 1), to the starting instruction address.
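The loading step that boot() performs can be sketched as follows (simplified: the real boot() in mini.c also initializes the program counter, as described above):

```c
#include <stdio.h>

/* Read one hexadecimal memory word per line from a stream into mem[],
   as boot() does from stdin.  Returns the number of words loaded. */
int load_memory(FILE *in, unsigned long mem[], int max)
{
    int n = 0;
    char line[80];
    while (n < max && fgets(line, sizeof line, in) != NULL)
        if (sscanf(line, "%lx", &mem[n]) == 1)
            n++;
    return n;
}
```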
The simulator fetches one instruction at a time into the instruction register,
ir, decodes the instruction, and performs a switch operation on the operation
code to execute the appropriate instruction. The user interface is designed
to allow as many instruction cycles as the user wishes, before displaying the
machine registers and memory locations. The display is accomplished with the
dump() function, which sends the Mini CPU registers, and the first sixteen
memory locations to stdout so that the user can observe the operation of the
simulated machine. The memory locations displayed can be changed easily by
setting the two arguments to the dumpmem() function. The displays include
both hex and decimal displays of the indicated locations.
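The fetch-decode-execute cycle just described can be shown in miniature. The encoding below (opcode in the high byte, operand address in the low byte) is invented for this sketch and is not the actual Mini format of Section 6.5.1:

```c
/* A toy fetch-decode-execute loop in the style of the Mini simulator. */
enum { HLT, LOD, ADD, STO };

unsigned imem[32];   /* instruction memory                */
double   dmem[32];   /* data memory                       */
double   acc;        /* single accumulator "register"     */

void run(unsigned pc)
{
    for (;;) {
        unsigned ir   = imem[pc++];   /* fetch into instruction register */
        unsigned op   = ir >> 8;      /* decode: operation code          */
        unsigned addr = ir & 0xff;    /* decode: memory operand address  */
        switch (op) {                 /* execute                         */
        case LOD: acc = dmem[addr];   break;
        case ADD: acc += dmem[addr];  break;
        case STO: dmem[addr] = acc;   break;
        case HLT: return;
        }
    }
}
```

The real simulator follows the same shape, but decodes the word through the struct fmt overlay in mini.h and pauses every few cycles to dump registers and memory.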
As the user observes a program executing on the simulated machine, it is
probably helpful to watch the memory locations associated with program vari-
ables in order to trace the behavior of the original MiniC program. Though
the compiler produces no memory location map for variables, their locations
can be determined easily because they are stored in the order in which they
are declared, beginning at memory location 3. For example, the program that
computes the cosine function begins as shown here:
In this case, the variables cos, x, n, term, eps, and alt will be stored in that
order.
instruction format:

bits  function
0-3   opcode:  1   r1 = r1 + s2     (ADD)
               2   r1 = r1 - s2     (SUB)
               4   r1 = r1 * s2     (MUL)
               5   r1 = r1 / s2     (DIV)
               7   pc = s2 if flag  (JMP)
               8   flag = r1 cmp s2 (CMP)
               9   r1 = s2          (Load)
               10  s2 = r1          (Store)
               11  r1 = 0           (Clear)
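Field extraction from such an instruction word is ordinarily done with shifts and masks (mini.h instead overlays a struct fmt on the word through a union). In the sketch below only the 4-bit opcode field comes from the table above; the cmp and address field positions are assumptions made for illustration, and the authoritative layout is the one in Section 6.5.1:

```c
/* Pack and unpack a 32-bit instruction word with explicit shifts and
   masks.  Assumed layout (hypothetical except for the 4-bit opcode):
   opcode in the top 4 bits, cmp in bits 16-19, address in the low 16. */
typedef unsigned int word32;   /* assumed to be 32 bits wide */

word32 encode(unsigned op, unsigned cmp, unsigned addr)
{
    return ((word32)(op  & 0xf)    << 28)
         | ((word32)(cmp & 0xf)    << 16)
         |  (word32)(addr & 0xffff);
}

unsigned op_field  (word32 w) { return  w >> 28; }
unsigned cmp_field (word32 w) { return (w >> 16) & 0xf; }
unsigned addr_field(word32 w) { return  w & 0xffff; }
```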
#include <stdio.h>
#include "mini.h"
#define PC reg[1]
main ()
{
int n = 1, count;
switch (ir.instr.op)
{ case ADD: fpreg[ir.instr.r1].data = fpreg[ir.instr.r1].data +
memory[addr].data;
break;
case SUB: fpreg[ir.instr.r1].data = fpreg[ir.instr.r1].data -
memory[addr].data;
break;
case MUL: fpreg[ir.instr.r1].data = fpreg[ir.instr.r1].data *
memory[addr].data;
break;
case DIV: fpreg[ir.instr.r1].data = fpreg[ir.instr.r1].data /
memory[addr].data;
break;
case JMP: if (flag) PC = addr; /* conditional jump */
break;
case CMP: switch (ir.instr.cmp)
{case 0: flag = TRUE; /* unconditional */
break;
case 1: flag = fpreg[ir.instr.r1].data == memory[addr].data;
break;
case 2: flag = fpreg[ir.instr.r1].data < memory[addr].data;
break;
case 3: flag = fpreg[ir.instr.r1].data > memory[addr].data;
break;
case 4: flag = fpreg[ir.instr.r1].data <= memory[addr].data;
break;
case 5: flag = fpreg[ir.instr.r1].data >= memory[addr].data;
break;
case 6: flag = fpreg[ir.instr.r1].data != memory[addr].data;
}
break;
case LOD: fpreg[ir.instr.r1].data = memory[addr].data;
break;
case STO: memory[addr].data = fpreg[ir.instr.r1].data;
break;
case CLR: fpreg[ir.instr.r1].data = 0.0;
break;
case HLT: n = -1;
}
}
dump ();
printf ("Enter number of instruction cycles, 0 for no change, or -1 to quit\n");
/* read from keyboard if stdin is redirected */
fscanf (tty,"%d", &count);
if (count!=0 && n>0) n = count;
}
}
void dump ()
{ dumpregs();
dumpmem(0,15);
}
void dumpregs ()
{int i;
char * pstr;
low = low/4*4;
high = (high+4)/4*4 - 1;
if (flag) f = "TRUE"; else f = "FALSE";
printf ("memory\t\t\t\t\tflag = %s\naddress\t\tcontents\n",f);
for (i=low; i<=high; i+=4)
printf ("%08x\t%08x %08x %08x %08x\n\t\t%8e %8e %8e %8e\n",
i,memory[i].instr,memory[i+1].instr,memory[i+2].instr,
memory[i+3].instr, memory[i].data, memory[i+1].data,
memory[i+2].data,
memory[i+3].data);
}
void boot()
/* load memory from stdin */
{ int i = 0;
The only source files that have not been displayed are the header files. The
file miniC.h contains declarations, macros, and includes which are needed by the
compiler but not by the simulator. The file mini.h contains information needed
by the simulator.
The header file miniC.h is shown below:
/* Symbol table */
struct Ident * HashTable[HashMax];
/* ADD, SUB, MUL, DIV, and JMP are also atom classes */
/* The following atom classes are not op codes */
#define NEG 10
#define LBL 11
#define TST 12
#define MOV 13
FILE * atom_file_ptr;
ADDRESS avail = 0, end_data = 0;
int err_flag = FALSE; /* has an error been detected? */
#define FALSE 0
/* Memory word on the simulated machine may be treated as numeric data or as an instruction */
union { float data;
unsigned long instr;
} memory [MaxMem];
union {
struct fmt instr;
unsigned long full32;
} ir;
Index

∗
   reflexive transitive closure of a relation, 93
   regular expression, 33
+
   restricted closure in sablecc, 52
union
   of sets in sablecc, 51
   union of regular expressions, 32
->
   state transition in sablecc, 55
▽
   bottom of stack marker, 75
ε
   null string, 28
ε rule
   selection set, 103
↵
   pushdown machine end marker, 75
φ
   empty set, 28
absolute address modes
   Mini architecture, 214
action symbol
   in translation grammars, 128
action table
   LR parser, 167
actions
   for finite state machines, 42–45
ADD
   Mini instruction, 215
ADD atom, 201
address modes
   Mini, 214
algebraic optimization, 241
algebraic transformation
   optimization, 233
alphabet, 67
ambiguity
   dangling else, 85
   in arithmetic expressions, 85
   in programming languages, 85–87
ambiguous expression, 221
ambiguous grammar, 72
architecture, 202
arithmetic expression
   parsing top down, 127
arithmetic expressions
   attributed translation grammar, 139–143
   LL(1) grammar, 120
   LR parsing tables, 168
   parsing bottom up, 168
   parsing top down, 117
   translation grammar, 128
arrays, 188–192
assembly language, 1
assignment, 144, 146
atom, 9
   JMP, 144
   label, 10
   LBL, 144
   TST, 144
Atom.java, 274
AtomFile.java, 276
attribute computation rule, 134
attributed derivation tree, 135
attributed grammar, 134–138
   inherited attributes, 135
   recursive descent, 136–138
   synthesized attributes, 135
attributed translation grammar
target machine, 2
   mini, 283–289
terminal symbol
   grammar, 67
token, 8, 37–39
token declarations
   sablecc, 51–53
top-down parsing, 91–158
transitive
   relation property, 93
translating arithmetic expressions with recursive descent, 141
Translation, 194
translation
   atoms to instructions, 201–202
   control structures, 149–153
translation grammar, 128–133
   arithmetic expressions, 128
   Decaf expressions, 147
   epsilon rule, 128
   pushdown machine, 129