Laboratory Manual For Compiler Design: Robb T. Koether
Contents
I Preliminaries
1 Getting Started
1.1 Introduction
1.2 The Cygwin Window
1.3 The HOME Environment Variable
1.4 The Java Development Kit
1.5 Running Java Programs
1.6 The Java API
1.7 The JLex Java-based Lexical Analyzer Generator
1.8 The CUP Java-based Parser Generator
1.9 Zipping Files
1.10 Assignment
II Lexical Analysis
4 A Recursive-Descent Parser
4.1 Introduction
4.2 Modify the Lexical Analyzer
4.3 The Recursive-Descent Parser
4.4 Derivations and Parse Trees
4.5 Improving the Parser
4.6 Assignment
IV A Simple Compiler
VI Functions
Preliminaries
Laboratory 1
Getting Started
Key Concepts
• Environment variables
Preliminary
1.1 Introduction
In this lab we will download and install a number of programs. The purpose of this
is twofold. You will be more aware of the setup that we will be using and you will be
able to set up the same software on your own computer.
Type the command dir (directory) to see the names of all the files in the current
directory.
Choose the name of one of the subdirectories in the current directory and type
the command
$ cd directory-name
where directory-name is the name of the subdirectory that you chose. Now repeat
the commands pwd and dir. Type the command
$ cd ..
to return to the previous directory. A single period (.) refers to the current directory
and two periods (..) refer to the parent directory. For example, to move up two
levels and then down to a directory called programs, you could type
$ cd ../../programs
$ cd ~
To make this the home directory, bring up the System control panel. Click on
the Advanced tab and then click on Environment Variables. In the top section,
named User Variables, click New. Enter HOME as the Variable name and type
the exact pathname of your Coms 480 directory as the Variable value. If you open
a window to that directory, then you should be able to copy and paste the pathname.
Be sure to use the backslash as a separator between directories. Then click OK (on
all three windows) to save the settings.
Now close the Cygwin window and open a new one. (This is necessary in order
to reinitialize the environment variables.) This window should have opened to your
folder. You can confirm that by typing pwd. From now on, Cygwin will begin in this
folder.
http://java.sun.com/javase/downloads/index.jsp
Next to the title “Java SE Development Kit (JDK) 6 Update 11,” click on Download.
Follow the instructions through the next two web pages. (Do not download the Sun
Download Manager.) When you are finished there should be a Java folder in the
Program Files folder of the C drive. Inside one of the subfolders is the javac.exe
compiler.
$ javac Hello.java
You probably will get an error message, because the computer could not find the Java
compiler javac.exe.
Open the System program on the Control Panel and go to the Environment Variables
window again. This time we must add a PATH variable. The PATH variable tells the
computer where to find executable files, including the Java compiler.
You may have to search for the Java compiler. Indeed, there may be more
than one on the computer. We will use the one in the folder named C:\Program
Files\Java\jdk1.6.0_11\bin. Once you know where it is, create the PATH
variable with this pathname as its value, as you did for the HOME variable earlier.
Close the Cygwin window and open a new one. Now try again to compile the
program. This time the program should compile. Type dir to see that the file
Hello.class is in the directory. This is the compiled Java program. Now run the
program by typing
$ java Hello
The program should print "Hello, world!" This confirms that you are set up to run
Java programs in the Cygwin window.
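The source of Hello.java is not reproduced in this section. A minimal sketch that would behave as described, assuming only that the file defines a class named Hello, might look like this (the greeting() helper is our own addition, not part of the lab's file):

```java
// Hello.java (a sketch; the lab's actual file may differ)
class Hello {
    // Returns the greeting so that it can be checked separately from printing.
    static String greeting() {
        return "Hello, world!";
    }

    public static void main(String[] args) {
        System.out.println(greeting());
    }
}
```

Compiling this file with javac produces Hello.class, and java Hello then runs its main function.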
http://java.sun.com/j2se/1.4.2/docs/api/
Save this as a bookmark. You will want to return here many times later in the course.
This web site contains the documentation for all Java classes. For example, in
the upper left frame, scroll down and click on java.lang. In the window below, the
names of all the classes in the java.lang package appear. Click on Integer. To the
right you see the information about the Integer class.
Another difference between Java and C++ is that Java is weak on operators. Operators
are defined only for the primitive types: int, float, double, char, etc.
• How do you add two Integers and represent the result as an Integer?
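One possible answer to this question, sketched under the assumption that both Integer objects are non-null: unwrap each value with intValue(), add the resulting ints, and wrap the sum again. The class name IntegerSum and the method add() are illustrative only. The sketch uses Integer.valueOf(), which is available from Java 5 on; with the 1.4.2 API referenced here you would write new Integer(sum) instead.

```java
class IntegerSum {
    // Adds two Integer objects and returns the result as an Integer.
    static Integer add(Integer a, Integer b) {
        int sum = a.intValue() + b.intValue();   // + works on the primitive ints
        return Integer.valueOf(sum);             // wrap the sum in an Integer again
    }

    public static void main(String[] args) {
        Integer two = Integer.valueOf(2);
        Integer three = Integer.valueOf(3);
        System.out.println(add(two, three));
    }
}
```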
http://www.cs.princeton.edu/~appel/modern/java/JLex/
This is the web site for JLex, the Java lexical analyzer generator. Click on Installa-
tion Instructions. Read these instructions carefully. The directory “J” will be your
Coms 480 folder. Create a subfolder named JLex. You can either do this in Windows
or you can type the command
$ mkdir JLex
When you are ready, click on Source Code and save it in your JLex folder. The
Java code for JLex (Main.java) should now be in the folder JLex. Move to the JLex
directory and compile JLex by typing
$ javac Main.java
Now we will test our installation by using a test program available from the JLex web
page. On the web page, click on Sample Input. Copy and paste the contents of the
page into a new file in CodeWarrior and save it in Lab 01 as sample.lex.
This file must be edited slightly in order to work on our PCs. In DOS files, each
line ends with a return character \r followed by a newline character \n. In UNIX
files, each line ends with only a newline character. Go to line 116 in sample.lex:
<YYINITIAL,COMMENT> \n { }
Immediately below this line, add a similar line, replacing \n with \r.
Now type
$ java JLex.Main sample.lex
It probably did not work. The notation JLex.Main means the Main.class file in the
JLex folder. The program java could not find the file Main.class. To solve this
problem, we must define the CLASSPATH variable. The CLASSPATH variable tells
Java where to look for class files. Set this variable as before,
setting the value of CLASSPATH to
.;\\hams-acad-fs\Students\your-name\Coms 480
and restart the Cygwin window. Be sure to replace your-name with the name of
your workspace in the Students directory. Note the initial dot (.). This refers to the
current directory. Thus, Cygwin will search the current directory first. The semicolon
is a separator. Cygwin will next search
\\hams-acad-fs\Students\your-name\Coms 480
Now type the same command again. It should work. This creates the file sample.lex.java, which contains Java
source code for the classes Sample, Utility, Yytoken, and Yylex. Then type
$ javac sample.lex.java
$ java Sample
Enter various lines of C code and see what the output is. Enter a line that contains an
embedded /*-style comment. When you are satisfied, type CTRL-Z (as many times
as necessary) to indicate end of file. The program should terminate.
http://www2.cs.tum.edu/projects/cup/
This is the web site for CUP, the Java parser generator. CUP stands for Constructor
of Useful Parsers. It also is a play on the java theme. Get it? CUP? Java? Cup of
java?
We will save the CUP files in a CUP directory. Create a CUP directory now as a
subdirectory of your Coms 480 directory. On the web page, in the section “Archived
versions,” click on CUP 10k sourcecode release. This is the most recent stable
version. Then click on Save. Save it in your CUP folder. This will download a zip
file to be unzipped. All the files needed for CUP will be extracted. Double-click on
the file java_cup_v10k.tar.gz. Follow the Zip instructions. You should save the
extracted files in your CUP folder.
The Java classes in CUP are already compiled. We will now test CUP using
a slightly modified version of the sample program that appears in the CUP User’s
Manual. This example creates a program that evaluates integer expressions involving
+, -, *, /, %, and parentheses.
In the Lab 01 folder, there are the files scanner.java, grammar.cup, and
Evaluator.java. Type the command
Again, this probably did not work. The name java_cup.Main refers to the file
Main.class in the subdirectory java_cup of the CUP directory. Java does not know
to look in the CUP directory for Java class files. Therefore, we must add the pathname
of the CUP directory to the paths to be searched in the CLASSPATH environment
variable. Make this change, close the Cygwin window, and open a new one.
Try again to execute the command
It should work. This will create the Java source files parser.java and sym.java
from the grammar file grammar.cup. Compile these two files and then compile the
files scanner.java and Evaluator.java.
Run the evaluator program by typing
$ java Evaluator
2 + 3 * 4;
Be sure to end the expression with a semicolon. The program will print the value of
the expression. When you are finished, type CTRL-Z.
1.10 Assignment
Turn in the file Lab 01.zip.
Part II
Lexical Analysis
Laboratory 2
Writing a Lexical Analyzer
Key Concepts
• Lexical Analyzers
• Makefiles
Preliminary
Copy the folder Lab 02 from the Compiler Design CD to your Coms 480 folder.
2.1 Introduction
In this lab we will create a lexical analyzer that will return the tokens that occur in
statements of the form
id = expr;
2.2 Makefiles
Open the file Makefile. A makefile consists mainly of a list of dependencies and
actions. A dependency is written in the form
target: sources
	action
where target is a file name and sources is a list of file names. (The tab before the
action part is mandatory.) This means that the target file depends on the source files.
Whenever any of the source files is updated, then the target file will be updated by
performing the action. This is all governed by the timestamps on the files.
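The timestamp rule can be sketched in Java as follows. This is only an illustration of the decision that make takes for a single dependency, not a description of how make is implemented; the class UpToDateCheck and its methods are hypothetical names. A timestamp of 0 stands for a missing file, which is what File.lastModified() returns in that case.

```java
import java.io.File;

class UpToDateCheck {
    // The core rule: a target is out of date if it is missing (time 0)
    // or if any source has a later timestamp than the target.
    static boolean isOutOfDate(long targetTime, long... sourceTimes) {
        if (targetTime == 0L) {
            return true;
        }
        for (long sourceTime : sourceTimes) {
            if (sourceTime > targetTime) {
                return true;
            }
        }
        return false;
    }

    // Applies the rule to real files by reading their timestamps.
    static boolean needsRebuild(File target, File... sources) {
        long[] times = new long[sources.length];
        for (int i = 0; i < sources.length; i++) {
            times[i] = sources[i].lastModified();
        }
        return isOutOfDate(target.lastModified(), times);
    }

    public static void main(String[] args) {
        File target = new File("MyLexer.class");
        File source = new File("MyLexer.java");
        System.out.println(needsRebuild(target, source)
                ? "the action would be performed"
                : "the target is up to date");
    }
}
```

Run in a folder that contains MyLexer.java, the main function reports whether the action for MyLexer.class would be performed.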
The makefile for Lab 2 contains the dependencies among the files used by this
program. In this case, it is very simple: there are only four files and two depen-
dencies. The file MyLexer.class depends on the file MyLexer.java and the file
Token.class depends (in the same way) on the file Token.java. That means that
whenever MyLexer.java is updated, i.e., modified, then MyLexer.class should also
be updated and whenever Token.java is updated, then Token.class should also be
updated. In the makefile, the lines
MyLexer.class: MyLexer.java
	javac MyLexer.java
$ make
Try this now. You should see that the Java compiler is invoked and MyLexer and
Token are compiled. Type the command again and you will see that it says that
MyLexer is up to date, so it does not recompile it.
Now we will delete the file MyLexer.class and rebuild it. Type the command
$ rm MyLexer.class
$ touch MyLexer.java
$ java MyLexer
The program should find the five tokens in this expression. Then type CTRL-Z and
press return again. CTRL-Z is interpreted as end-of-file.
This program does not recognize any grammar rules; those will come later. There-
fore, any string of legal tokens will be processed correctly. For example, try
$ 123***456++++
$ 1 23 * * * 45 6 +++ +
Run MyLexer again, using this input to see what the output is.
to convert the contents of the buffer to a String. Go to the Java web page, access
the String page, and look at the String constructors. The URL is
http://java.sun.com/j2se/1.4.2/docs/api/
Find the constructor that is used here. You should develop the habit of referring to
the Java API pages as often as necessary, i.e., often. You will find the answers to
many of your Java questions there.
The heart of the MyLexer class is the next_token() function. By looking at the
current character cVal, next_token() is able to decide which type of token is being
read. It processes the token and returns the token type.
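The following is a minimal sketch of this idea, not the lab's MyLexer class. It assumes the input is held in a String rather than read from the keyboard, and the token codes, the field text, and the helper advance() are hypothetical. Only the names next_token() and cVal are taken from the description above.

```java
class LexerSketch {
    // Hypothetical token codes; the lab's Token class defines the real ones.
    static final int EOF = 0, NUM = 1, PLUS = 2, TIMES = 3, ERROR = 4;

    private final String input;
    private int pos = 0;
    private int cVal;        // current character, or -1 at end of input
    String text = "";        // characters of the most recent token

    LexerSketch(String input) {
        this.input = input;
        advance();           // load the first character into cVal
    }

    private void advance() {
        cVal = (pos < input.length()) ? input.charAt(pos++) : -1;
    }

    int next_token() {
        while (cVal == ' ' || cVal == '\t') {
            advance();                        // skip white space
        }
        if (cVal == -1) {
            text = "";
            return EOF;
        }
        if (cVal >= '0' && cVal <= '9') {     // a digit starts a number
            StringBuilder buffer = new StringBuilder();
            while (cVal >= '0' && cVal <= '9') {
                buffer.append((char) cVal);
                advance();
            }
            text = buffer.toString();         // convert the buffer to a String
            return NUM;
        }
        // Any other character is a one-character token.
        int type = (cVal == '+') ? PLUS : (cVal == '*') ? TIMES : ERROR;
        text = String.valueOf((char) cVal);
        advance();
        return type;
    }

    public static void main(String[] args) {
        LexerSketch lexer = new LexerSketch("123***456++++");
        int type;
        while ((type = lexer.next_token()) != EOF) {
            System.out.println(type + " " + lexer.text);
        }
    }
}
```

The decision is made entirely from the current character: a digit begins a number, and any other character is handled as a single-character token.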
2.5 Assignment
For the “toy” lexer that we are building in this lab, the complete set of tokens is
shown in Table 2.1.
Complete the MyLexer class by adding code that will recognize each of these types
of token.
After you have finished the lexer, test it on the file testfile. The lexer uses
standard input (keyboard) and standard output (monitor), but you may redirect
them to files. To read input from the file testfile, type
Type this command and then open outfile.txt in CodeWarrior or Notepad and
inspect it. When the output is complicated, this method allows you to inspect it at
your leisure. Or you can print it and inspect it later.
Zip the files MyLexer.java, Token.java, and Makefile in a folder named Lab 02
and drop it in the dropbox.
Laboratory 3
A Lexical Analyzer Using JLex
Key Concepts
• JLex
• Regular expressions
Read Sections 3.5 - 3.9 of Compilers: Principles, Techniques, and Tools. Also read
the JLex User’s Manual.
Preliminary
Copy the folder Lab 03 from the Compiler Design CD to your folder.
3.1 Introduction
In this lab we will create the same lexical analyzer as in Lab 2, but by using JLex
and a file tokens.lex. The file tokens.lex contains the rules for recognizing tokens
and the actions to take for each kind of token. JLex will use these rules to build a
Java program that will be a lexical analyzer. The rules in the file tokens.lex are
regular expressions and the lexical analyzer that JLex builds is a DFA. By inspecting
the Java code that JLex produces, one could in principle figure out how it works,
although it is rather complicated.
3.2 JLex
JLex is a program that generates Java source code for a lexer. The original such
program was lex, which works on UNIX systems and creates C source code. The
GNU version of lex is called flex and it is included as one of the development tools in
Cygwin. It also produces C source code.
Open the file tokens.lex in CodeWarrior. A JLex file is divided into three parts,
which are separated by %%. The first part contains code that is to be copied directly
into the Java class Yylex, which is the name of the lexer. In this example, this section
contains no code. The yy prefix is a carry-over from lex, which uses the yy prefix
on all of its variables and file names in order to avoid conflicts with any
programmer-defined objects (provided the programmer avoids the yy prefix). See Section 2.1 of
the JLex User’s Manual for more information.
The second part contains various JLex directives and definitions. Read the JLex
User’s Manual for more information. This example contains one directive and three
definitions. The first directive is
%integer
It tells JLex that the tokens that are returned will be of type int. It also causes JLex
to define the constant YYEOF = -1 in the Yylex class. This constant will be returned
automatically by Yylex upon encountering end of file and then will be used in the
main function to terminate the program. See Section 2.2 of the JLex User’s Manual
for more information.
The three definitions
digit= [0-9]
num= {digit}+
ws= [\ \t]+
define a number and white space using regular expressions. A digit is a single character
in the range ’0’ through ’9’. A number is one or more digits. White space (ws) is
one or more blanks or tabs. See Section 2.3 of the JLex User’s Manual for more information.
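To experiment with these three definitions outside of JLex, you can write them as java.util.regex patterns. This is only a stand-in: JLex has its own regular expression syntax and engine, and the class DefinitionCheck and its methods are hypothetical names.

```java
import java.util.regex.Pattern;

class DefinitionCheck {
    // The three definitions, rewritten as java.util.regex patterns.
    static final Pattern DIGIT = Pattern.compile("[0-9]");
    static final Pattern NUM   = Pattern.compile("[0-9]+");
    static final Pattern WS    = Pattern.compile("[ \\t]+");

    // Each method tests whether the whole string fits the definition.
    static boolean isDigit(String s) { return DIGIT.matcher(s).matches(); }
    static boolean isNum(String s)   { return NUM.matcher(s).matches(); }
    static boolean isWs(String s)    { return WS.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(isDigit("7"));    // true
        System.out.println(isNum("123"));    // true
        System.out.println(isNum("12a"));    // false
        System.out.println(isWs(" \t "));    // true
    }
}
```

Note that the + in num and ws means "one or more", so the empty string is neither a number nor white space.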
The third section lists the regular expressions to be matched and the actions to be
taken for each. In the example, the regular expressions are numbers num, a plus sign
"+", and a times sign "*". In each case, the action is to print a message telling the
type of token found and to return the corresponding constant from the Token class
(to be discussed shortly). In the case of a number, the message will also include the
string that constitutes the value of the token.
The double quotes around single characters are used to guard against the character
being interpreted as a metacharacter.
Since white space should be ignored, there is no action associated with it.
Finally, note the dot (.). This stands for any character except a newline. It
should be used and it should appear last. If a string matches more than one pattern,
JLex will choose the pattern that is listed earlier in the list. Thus, no pattern in the
list after the dot, except a newline, will be matched. In this example, the dot is used
to match any invalid character. Notice that the error message is handled by the Err
class.
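The way a rule is chosen can be imitated with a short Java sketch. It assumes the usual lex convention, which JLex also follows: the rule that matches the longest string wins, and among rules that match strings of the same length, the one listed first wins. The patterns use java.util.regex rather than JLex syntax, and the class RuleOrderSketch, its arrays, and classify() are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RuleOrderSketch {
    static final String[] NAMES = {"NUM", "PLUS", "TIMES", "WS", "ERROR"};
    static final Pattern[] RULES = {
        Pattern.compile("[0-9]+"),
        Pattern.compile("\\+"),
        Pattern.compile("\\*"),
        Pattern.compile("[ \\t]+"),
        Pattern.compile(".")          // the catch-all dot, listed last
    };

    // Returns the name of the rule chosen at the start of the input.
    static String classify(String input) {
        int best = -1;
        int bestLength = 0;
        for (int i = 0; i < RULES.length; i++) {
            Matcher m = RULES[i].matcher(input);
            // Strictly longer matches replace the current choice, so on a
            // tie the earlier rule is kept.
            if (m.lookingAt() && m.end() > bestLength) {
                best = i;
                bestLength = m.end();
            }
        }
        return (best < 0) ? "NONE" : NAMES[best];
    }

    public static void main(String[] args) {
        System.out.println(classify("+ 3"));   // PLUS
        System.out.println(classify("& 3"));   // ERROR
    }
}
```

Both "\\+" and the dot match the single character +, but PLUS is listed first and so it is chosen. The dot is selected only when no earlier rule matches.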
token = lex.yylex();
The variable token is of type int because we told our lexer to return ints. You
should appreciate how simple the interface with the Lexer class is; all the work is
being done in the Yylex class.
This will create as output the file tokens.lex.java. Since we are creating the Yylex
class, we will rename this file Yylex.java. Then we compile Token.java, Yylex.java
and Lexer.java in the usual way.
Execute the above command now. Note the screen output from JLex. It describes
the number of states in the NFA that it builds. Then it converts the NFA to a DFA.
Finally, it minimizes the DFA and reports the number of states.
Open the file Yylex.java in CodeWarrior. Look at the Yylex class. It is com-
plicated, but not impossible to figure out. Go to the function yytext(). It builds a
string that contains the value of the current token. Note that it uses the same String
constructor that we used in our program MyLexer. Look in the file tokens.lex and
find the place where yytext() is used. Be aware that yytext() is not a public member
function of the Yylex class. We can use it in the JLex file because the action
code is copied into the Yylex class. Therefore, when it is used, it is being used
within the class.
Now look at the function yylength(). This function returns the length of the
current token.
The function unpackFromString() is used to create the DFA transition tables
that drive the lexer. It is invoked three times just before the yylex() function (not
to be confused with the Yylex constructors).
In the yylex() function you will find the code that appeared in the tokens.lex
file as the action part of each token. It appears now in a switch structure.
Now compile the files Token.java, Yylex.java, and Lexer.java. Then test the
lexer by typing
$ java Lexer
The program will wait for input from the keyboard. Enter strings as you did in Lab
2. For example, try the strings
In the last example, did you get any error messages? Why?
With this rule in place, the makefile will automatically update any .class files
as necessary. Thus we do not need to tell it explicitly to make Lexer.class,
Token.class, and Yylex.class.
In the next part of the makefile, there is the line
This says that the “file” all depends on the other three files. In fact, there is no
file all. (Notice that there is no action part telling how to update all.) This is a
common makefile technique used to force the makefile to update a list of targets.
The final part contains the two instructions that are needed to update the Yylex
class whenever a change is made to tokens.lex.
Read through the makefile and be sure that you understand it. For more infor-
mation on makefiles, there are many good tutorials on the web. For example, see
http://www.gnu.org/software/make/manual/make.html
3.8 Assignment
Add to the lexical analyzer the set of tokens shown in Table 3.1.
Your lexer must also detect the patterns described in Table 3.2, although they
require no action and no return value. These are the patterns that the compiler will
skip over.
Take actions similar to those already taken in the file tokens.lex. In the case of
identifiers, be sure to print the value of the identifier as well as the token type. Test
your work thoroughly. The //-style comments extend only to the end of the line. The
/*-style comments extend past line breaks to the next occurrence of */. Character
strings do not extend past line breaks.
This lab will serve as Project 1. Zip the files tokens.lex, Token.java, Lexer.java,
and Makefile in a folder named Project 1 and drop it in the dropbox.
Pattern       Description
//-comment    //, followed by zero or more characters, followed by end of line
/*-comment    /*, followed by zero or more characters, followed by */
white space   a blank or a tab
newline       \n
return        \r
Syntactic Analysis
Laboratory 4
A Recursive-Descent Parser
Key Concepts
• Recursive-descent parsers
• Leftmost derivations
• Parse trees
Preliminary
Copy the folder Lab 04 from the Compiler Design CD to your folder. Also, copy
the files MyLexer.java and Token.java from your Lab 02 folder to this folder. These
two files should have been modified to accept identifiers, subtraction, division,
assignments, parentheses, and semicolons.
4.1 Introduction
We will create a recursive-descent parser that parses a series of statements of the form
id = expr;
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → num
Note that when there is a choice of rules to apply, the decision can be made by looking
at the current token only. If the current token does not fit any of the production rules,
then an error message is printed. This is a necessary characteristic of the grammar
in order for a recursive-descent parser to be feasible.
Notice that for each nonterminal in the grammar, there is a function in
RDParser.java that decides which rule to apply to that nonterminal. The main
function gets things started by initializing the lexer (MyLexer), getting the first token,
and calling on E(). When execution eventually returns, the class variable error tells
whether an error was encountered, which determines whether the input is accepted.
The function E() applies the rule E → T E'. It prints an informative message
and then calls on T() followed by Eprime(). The function Eprime() must make a
choice since there are two possible productions to apply: E' → + T E' and E' → ε. This
choice is made by checking whether the current token is PLUS. The other functions
are similar.
Finally, the match() function verifies that the token is what it is supposed to be
and then it gets the next token.
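The structure just described can be sketched as follows. This is not the lab's RDParser.java: the token codes, the stand-in scanner scan(), the trace list (used here in place of printed messages), and Tprime() are assumptions made so that the example is self-contained. Like the lab's starting version, it reports success whenever no error was flagged, without checking that the whole input was consumed.

```java
import java.util.ArrayList;
import java.util.List;

class RDSketch {
    // Hypothetical token codes standing in for the lab's Token class.
    static final int EOF = 0, NUM = 1, PLUS = 2, TIMES = 3, ERROR = 4;

    private final List<Integer> tokens;   // token stream, ending with EOF
    private int pos = 0;
    private int token;                    // the current token
    private boolean error = false;
    final List<String> trace = new ArrayList<String>();  // productions applied

    RDSketch(List<Integer> tokens) {
        this.tokens = tokens;
        token = tokens.get(0);
    }

    // Stand-in lexer: recognizes numbers, '+', '*', blanks, and tabs.
    static List<Integer> scan(String s) {
        List<Integer> out = new ArrayList<Integer>();
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == ' ' || c == '\t') {
                i++;
            } else if (c >= '0' && c <= '9') {
                while (i < s.length() && s.charAt(i) >= '0' && s.charAt(i) <= '9') i++;
                out.add(NUM);
            } else {
                out.add(c == '+' ? PLUS : c == '*' ? TIMES : ERROR);
                i++;
            }
        }
        out.add(EOF);
        return out;
    }

    // Verifies the current token and then gets the next one.
    void match(int expected) {
        if (token != expected) {
            error = true;
        } else if (token != EOF) {
            token = tokens.get(++pos);
        }
    }

    void E() { trace.add("E -> T E'"); T(); Eprime(); }

    void Eprime() {
        if (token == PLUS) { trace.add("E' -> + T E'"); match(PLUS); T(); Eprime(); }
        else               { trace.add("E' -> e"); }
    }

    void T() { trace.add("T -> F T'"); F(); Tprime(); }

    void Tprime() {
        if (token == TIMES) { trace.add("T' -> * F T'"); match(TIMES); F(); Tprime(); }
        else                { trace.add("T' -> e"); }
    }

    void F() { trace.add("F -> num"); match(NUM); }

    // Accepts if E() returned without flagging an error.
    static boolean parse(String input) {
        RDSketch parser = new RDSketch(scan(input));
        parser.E();
        return !parser.error;
    }

    public static void main(String[] args) {
        System.out.println(parse("123 + 456 * 789") ? "Accepted" : "Rejected");
        System.out.println(parse("123 ++ 456") ? "Accepted" : "Rejected");
    }
}
```

Each production is recorded as soon as it is chosen, before its right-hand side has been matched, so the trace lists the productions in the order in which the parser selects them.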
Build the program by running the makefile. Then run the program by typing
$ java RDParser
Press CTRL-Z for EOF. The input should be accepted. We are using the “verbose”
versions of the lexer and the parser. Each program prints informative messages to
apprise the user of its progress. These messages can be extremely helpful in debugging.
You may turn them off by commenting them out. It is better not to delete them since
you may need them later for further debugging.
You should have gotten the output
E’ -> e
Accepted
Notice that some productions were applied after you indicated EOF. Why is that?
Draw the parse tree. How does it compare to the parse tree of the previous expression?
Test the program, sending the output to an output file outfile.txt. The output
will be rather long, but the input should be accepted. Print the output file.
Notice that RDParser.java prints the production used as soon as it decides which
one to use, even before the production has been satisfied. One effect of this is that
the productions are listed in the order of a leftmost derivation. As an experiment, in
each case when a production is matched, move the output statement to the end of
that block. For example, in the function E(), rewrite
as
T();
Eprime();
System.out.println("E -> T E’");
Build and run the parser again, sending the output to the file outfile2.txt. Print
the output file. Compare this output to the previous output. Read the output from
bottom to top. In what order did the productions appear? What kind of derivation
does this indicate? Can you explain this?
Notice how the product 3 * 4 * 5 * 6 was handled. What does this indicate
about the depth of the recursion? In the same vein, notice the point at which the
production E' → + T E' was matched. Again, this tells us something about the depth
of recursion.
Notice how much output appeared after the EOF token was received. This output,
when read from top to bottom, appears to indicate that the parser is a bottom-up
parser following a rightmost derivation. That is misleading; it is a top-down parser
that follows a leftmost derivation. However, this indicates that the difference is subtle.
Put the output statements back where they were. You may do this by pressing
CTRL-Z repeatedly to undo the earlier changes.
$ 123 ++ 456
What error was detected? Can you see why? Now try
$ 123 456
Was an error detected? Why not? You might notice that you did not need to indicate
EOF. Why not?
Trace through the program and find out exactly how execution terminated when
the next token after 123 was a number. Did this number token cause error to be set
to true?
The problem is that when the program quits, the last token received from the
lexer should be EOF. Otherwise, the lexer was not at the end of the expression.
However, the program does not check that this was the case. It simply quits after the
production for E has been satisfied. Correct this by requiring that after satisfying
the production for E, the token be EOF in order for the parser to report that the
input was accepted.
4.6 Assignment
Expand the grammar as follows:
S → S' S | ε
S' → id = E ;
E → T E'
E' → + T E' | - T E' | ε
T → F T'
T' → * F T' | / F T' | ε
F → ( E ) | id | num
Be sure to test your program with incorrect input as well as correct input.
Place the files Token.java, MyLexer.java, RDParser.java, and Makefile in a
folder named Lab 04, zip it, and drop it in the dropbox.
Laboratory 5
Using JLex with a Predictive Parser
Key Concepts
• Predictive parsing
Preliminary
Copy the folder Lab 05 from the Compiler Design CD to your folder. Also copy the
files Err.java and Warning.java from your Lab 03 folder.
5.1 Introduction
In this lab we will use a lexer generated by JLex together with a table-driven predictive
parser. We will see how to set up the interface so that the parser will communicate
properly with the lexer. That will be quite simple. The same interface would be used
with RDParser or any other hand-written parser. Then we will look at how the parser
works. It is the model of elegance. Since the parser uses a stack, we will also have an
opportunity to become familiar with the Java Stack class and some issues related to
its use.
%type int
This prevents JLex from returning -1. Second, since the return value is no longer
automatic, we must tell JLex what it is. To do this, add the following in the directive
section of the JLex file.
%eofval{
return Token.EOF;
%eofval}
This provides the action that JLex will take on end-of-file. See Sections 2.2.12 and
2.2.17 in the JLex manual.
Currently the parser in PredParser.java makes no mention of a lexer. We must
add some lines to the parser so that it can accept tokens from a Yylex object. Open
the file PredParser.java.
First, we must declare lex to be a Yylex object. Add the class variable
token = lex.yylex();
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → ( E ) | id | num
Look at the beginning of the PredParser class. You will see the nonterminals defined
as the negative integer constants −1, −2, . . . , −5. The tokens, or terminals, will be
assigned non-negative integer values. Thus, the sign of the grammar symbol will be
our way of distinguishing between terminals and nonterminals. (Note that we avoid
the use of −0 for a nonterminal since 0 is used for tokens and +0 = −0.)
Next, you see the productions themselves, as strings. This array is included so
that we can print informative messages about which production is being matched.
Then there is a two-dimensional array containing the right-hand side of each
production, as a list of grammar symbols. It is here that we must be able to distinguish
between terminals and nonterminals, since, in general, they are mixed together in
productions.
      +             *             (           )         ID          NUM         $         ERROR
E     −1            −1            E → T E'    −1        E → T E'    E → T E'    −1        −1
E'    E' → + T E'   −1            −1          E' → ε    −1          −1          E' → ε    −1
T     −1            −1            T → F T'    −1        T → F T'    T → F T'    −1        −1
T'    T' → ε        T' → * F T'   −1          T' → ε    −1          −1          T' → ε    −1
F     −1            −1            F → ( E )   −1        F → id      F → num     −1        −1
Note that the grammar symbols in each production are listed in reverse order. That is because
the predictive parsing algorithm requires that they be pushed onto the stack in reverse
order. If we store them in reverse order, then it will be simpler to push them later.
Next is an array of sizes. This array stores the number of grammar symbols on the
right-hand side of the productions. That is to facilitate the pushing of the symbols
onto the stack.
Notice how well organized all of this is. It should be fairly simple to add some
productions to the grammar and update the objects in the program.
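To see how these pieces fit together, here is a compact sketch of a table-driven predictive parser for the grammar above. It is not the lab's PredParser.java. The constant names, the numbering of the productions, and the parse() function are assumptions; the table entries follow the parse table shown above, with a production number in place of each production. The sketch also relies on generics and autoboxing (Java 5 or later) and uses the array lengths in place of a separate array of sizes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Stack;

class PredSketch {
    // Terminals are non-negative, in the column order of the parse table.
    static final int PLUS = 0, TIMES = 1, LPAREN = 2, RPAREN = 3,
                     ID = 4, NUM = 5, EOF = 6, ERROR = 7;
    // Nonterminals are negative.
    static final int E = -1, EP = -2, T = -3, TP = -4, F = -5;

    // The productions as strings, for informative messages.
    static final String[] PRODUCTIONS = {
        "E -> T E'", "E' -> + T E'", "E' -> e", "T -> F T'",
        "T' -> * F T'", "T' -> e", "F -> ( E )", "F -> id", "F -> num"
    };

    // Right-hand sides, stored in reverse order, ready to be pushed.
    static final int[][] RHS = {
        {EP, T}, {EP, T, PLUS}, {}, {TP, F},
        {TP, F, TIMES}, {}, {RPAREN, E, LPAREN}, {ID}, {NUM}
    };

    // TABLE[row][token] is a production number, or -1 for an error.
    // Rows are E, E', T, T', F; the row index is -nonterminal - 1.
    static final int[][] TABLE = {
        {-1, -1,  0, -1,  0,  0, -1, -1},
        { 1, -1, -1,  2, -1, -1,  2, -1},
        {-1, -1,  3, -1,  3,  3, -1, -1},
        { 5,  4, -1,  5, -1, -1,  5, -1},
        {-1, -1,  6, -1,  7,  8, -1, -1}
    };

    // Parses a token array that ends with EOF; records productions in trace.
    static boolean parse(int[] input, List<String> trace) {
        Stack<Integer> stack = new Stack<Integer>();
        stack.push(EOF);
        stack.push(E);
        int pos = 0;
        while (!stack.empty()) {
            int top = stack.peek();
            int token = (pos < input.length) ? input[pos] : EOF;
            if (top >= 0) {                  // terminal on top: it must match
                if (top != token) return false;
                stack.pop();
                pos++;
            } else {                         // nonterminal on top: use the table
                int p = TABLE[-top - 1][token];
                if (p < 0) return false;
                trace.add(PRODUCTIONS[p]);
                stack.pop();
                for (int symbol : RHS[p]) stack.push(symbol);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[] tokens = {ID, PLUS, NUM, TIMES, ID, EOF};   // a + 3 * b
        List<String> trace = new ArrayList<String>();
        System.out.println(parse(tokens, trace) ? "Accepted" : "Rejected");
        for (String production : trace) System.out.println(production);
    }
}
```

Because each right-hand side is stored in reverse, pushing its symbols in array order leaves the leftmost symbol of the production on top of the stack.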
http://java.sun.com/j2se/1.4.2/docs/api/
and look up the Stack class in the package java.util. Read the Method Summary.
Now we will create and initialize a Stack. First, declare stack to be a Stack
object:
In Java, Stacks can hold any kind of object whatsoever. That is why the return type
of peek() and pop() is Object. Every non-primitive class is a subclass of the Object
class. However, the primitive types, such as int, are not subclasses of the Object
class. Therefore, we cannot push ints onto our stack. That’s too bad, because the
grammar symbols in this program are all ints. We will have to convert our int
primitives to the wrapper-class Integer objects. In general, to create an Integer
from an int, say n, we write the expression
new Integer(n)
Use this together with the push() function to initialize the stack with the grammar
symbols $ and E. Recall that $ is represented by Token.EOF.
Note that in the while statement we are using the Stack function empty() to
see if the stack is empty. Once we enter the while loop, the first thing we need to
do is to see what symbol is on top of the stack. We don’t want to pop it, but just
look at it. So we should use the peek() function. The peek() function will return a
reference to the Object on top of the stack. To use the object, we must cast it as an
Integer. Furthermore, to use it as an int, we must convert it to int. That is done
with the Integer function intValue(). Put this all together to write a statement,
or statements, that
• Converts it to Integer.
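One way to carry out the wrapping and unwrapping described above is sketched below. The values chosen for EOF and E and the names StackSketch, initialStack(), and topSymbol() are hypothetical. The stack is declared as Stack<Object> so that peek() returns an Object that must be cast, as in the 1.4.2 API; Integer.valueOf(n) is used where the text writes new Integer(n), because that constructor is deprecated in recent versions of Java.

```java
import java.util.Stack;

class StackSketch {
    // Hypothetical codes: tokens are non-negative, nonterminals negative.
    static final int EOF = 6;    // stands in for Token.EOF
    static final int E = -1;

    // Creates the stack and pushes $ and then E, each wrapped in an Integer.
    static Stack<Object> initialStack() {
        Stack<Object> stack = new Stack<Object>();
        stack.push(Integer.valueOf(EOF));
        stack.push(Integer.valueOf(E));
        return stack;
    }

    // Looks at the top symbol without removing it.
    static int topSymbol(Stack<Object> stack) {
        Object top = stack.peek();            // a reference to the Object on top
        Integer wrapped = (Integer) top;      // cast it as an Integer
        return wrapped.intValue();            // convert it to int
    }

    public static void main(String[] args) {
        Stack<Object> stack = initialStack();
        System.out.println(topSymbol(stack));   // prints -1, the code for E
    }
}
```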
a + 3
a*b + 3*c
123*(dog + 456*(cat + bear)) + 789
a + 2b
b + + c
a & b
E' → - T E'
T' → / F T'
You will have to recompute FIRST and FOLLOW for the nonterminals and update
the parse table. There are no new nonterminals, but there are two new terminals and
two new productions, so you will have to
Everything else should work as before. I suggest that you add these things to the
ends of the arrays so that you do not disturb what is already there. The one exception
might be the list of tokens and the parse table. It is customary, but certainly not
necessary, that $ be the last token. That will be your call. However, it is necessary
that the ERROR token be at the end of the list of tokens since it is not a terminal.
Test your program with input such as
5.8 Assignment
Zip the files tokens.lex, Token.java, PredParser.java, Err.java, Warning.java,
and Makefile in a folder named Lab 05 and drop it in the dropbox.
Laboratory 6
Key Concepts
• Context-free grammars
• Semantic actions
• Precedence
• Associativity
• Shift/reduce conflicts
Read Sections 4.5 and 4.7 of Compilers: Principles, Techniques, and Tools. Also read
the CUP User’s Manual.
Preliminary
Copy the folder Lab 06 from the Compiler Design CD to your folder. Also copy the
file tokens.lex from your Lab 03 folder and the files Err.java and Warning.java
from your Lab 05 folder.
56 LABORATORY 6. USING JLEX AND CUP
6.1 Introduction
The program CUP is a parser generator. The programmer specifies a context-free
grammar in a .cup file. For each production in the grammar, he specifies an action
to be taken. Whenever a rule is matched, the specified action is taken.
CUP creates an LALR parser, which is a variation of an LR(1) parser. It creates
the action and goto tables and uses the LALR algorithm to parse the input. In this
lab we will learn how to write a CUP file and how to make CUP and JLex work
together to create a complete program.
import java_cup.runtime.Symbol;
This import statement will be copied into the file Yylex.java where it will make the
lexer aware of the Symbol class. The lexer will use the Symbol class, which is found
in the directory CUP/java_cup/runtime, to return tokens to the parser. Later in this
lab, we will take a closer look at the Symbol class.
To tell JLex that the tokens will be sent to a program generated by CUP, we must
add the %cup directive
%cup
%integer
The return value will be of type Symbol, not int. We must use the %eofval directive
to specify the value returned at end of file. In the directive section, add the lines
Now we must instruct the lexer to return Symbol objects consisting of a token and
possibly a value.
These are the two constructors that we will use. If the token does not have an
associated value, then we use the constructor that takes only the token number. For
example, the PLUS token has no associated value, so we would return
new Symbol(sym.PLUS)
If the token has an associated value, then we will use the second constructor, which
takes the token number and its value. For example, the ID token has an associated
value, the name of the identifier. In this case, we would return
In the JLex file, replace each Token return object with a Symbol object. For the ID
and NUM tokens, we must also return the associated value. This will come from the
yytext() function. Create a new String from this and use it as the second parameter
in the Symbol constructor for these two token types, as shown in the example above.
Also, be sure that the JLex file returns the token sym.error when an invalid token
is found. Make these changes throughout the JLex file. Now the JLex file is ready to
be used by CUP.
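To make the two constructors concrete, here is a small self-contained sketch. The nested Symbol class is only a stand-in for java_cup.runtime.Symbol, reduced to the two constructors discussed above, and the token numbers PLUS and ID are hypothetical (CUP generates the real ones in sym.java). The parameter text plays the role of the string returned by yytext().

```java
public class SymbolSketch {
    // Stand-in for java_cup.runtime.Symbol with only the two constructors we use.
    public static class Symbol {
        public final int sym;       // token number
        public final Object value;  // associated value, or null if there is none

        public Symbol(int sym) {
            this(sym, null);
        }

        public Symbol(int sym, Object value) {
            this.sym = sym;
            this.value = value;
        }
    }

    // Hypothetical token numbers; the real ones are assigned by CUP in sym.java.
    public static final int PLUS = 2;
    public static final int ID = 3;

    // What a lexer action for "+" would return: a token with no value.
    public static Symbol plusToken() {
        return new Symbol(PLUS);
    }

    // What a lexer action for an identifier would return: a token plus its name.
    public static Symbol idToken(String text) {
        return new Symbol(ID, new String(text));
    }

    public static void main(String[] args) {
        Symbol s = idToken("count");
        System.out.println(s.sym + " " + s.value);
    }
}
```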
import java_cup.runtime.*;
Next, we have some “parser code.” We could write a separate program to contain
a main function, as we did in Lab 3. Instead, for now we will use the parser code
directive to create a main function in the parser class. Read the parser code in the
CUP file. Later we will look at the parser class to see what this code does.
Next we list the terminals. Generally, these are the same as the tokens returned
by the lexer, except for EOF, which we do not list. CUP will copy these names to
the sym class and assign them numerical values. This file currently defines all of the
tokens that were used in Lab 3, even though a number of them are not yet used in the
grammar. If any are missing, you will get an error message. In that case, just
go back and add them in. In any case, you will get some warnings about the unused
terminals. You can ignore those warnings for now.
Next we list the nonterminals. Each of these should appear on the left side of a
production later in the file. If any nonterminals are listed, but not used, then CUP
will give us a warning. Indeed, if one is listed, but is not accessible from the start
symbol, CUP will give us a warning.
To resolve shift/reduce conflicts arising from the operators, we may specify
precedence levels. The two lines
say that PLUS has a lower precedence than TIMES and that both operators are left
associative. (Terminals are listed in order of increasing precedence.)
where expr is the only nonterminal. This grammar by itself is ambiguous because
it is not clear whether addition or multiplication comes first, but the ambiguity has
been removed by the precedence rules. The expression E+E will not be reduced to E
if the next token is *, but the expression E*E will be reduced to E if the next token
is +. Also, in an expression such as b + c + d, left associativity will cause the rule
E → E+E to be applied first to b + c rather than to c + d.
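The way precedence and associativity settle a shift/reduce conflict can be sketched in a few lines. This is an illustration of the general rule, not CUP's actual code: the class, the enum, and the numeric precedence levels are all hypothetical, and both operators are assumed to be left associative as in the text.

```java
import java.util.HashMap;
import java.util.Map;

public class PrecedenceSketch {
    public enum Action { SHIFT, REDUCE }

    // Higher number = higher precedence, mirroring "listed in increasing precedence".
    static final Map<String, Integer> PREC = new HashMap<String, Integer>();
    static {
        PREC.put("+", 1);  // PLUS, left associative
        PREC.put("*", 2);  // TIMES, left associative
    }

    // The parser has E op E on the stack (op = ruleOp) and sees the token lookahead.
    public static Action resolve(String ruleOp, String lookahead) {
        int rulePrec = PREC.get(ruleOp);
        int tokenPrec = PREC.get(lookahead);
        if (rulePrec > tokenPrec) {
            return Action.REDUCE;  // e.g. E*E followed by +
        }
        if (rulePrec < tokenPrec) {
            return Action.SHIFT;   // e.g. E+E followed by *
        }
        return Action.REDUCE;      // equal precedence, left associative
    }

    public static void main(String[] args) {
        System.out.println(resolve("+", "*"));
        System.out.println(resolve("*", "+"));
        System.out.println(resolve("+", "+"));
    }
}
```

For a right-associative operator the last case would return SHIFT instead, which is why associativity must be declared along with precedence.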
Build the parser from the file grammar.cup, by typing the command
This creates the files parser.java and sym.java. Next, build the lexer by typing
the command
Now compile the files tokens.lex.java, parser.java, and sym.java. To run the
parser, type
$ java parser
$ 123 + 456
Use CTRL-Z (4 times!) to indicate EOF. Now run the program on the input
and note that the multiplication is done first. Run it again on the input
we may add a command-line argument that gives the expected number of conflicts.
In the output from the build attempt, count the number of conflicts (or read the
message that tells you the number). Then write
-expect n
in the above command before the <, where n is the number of conflicts that you
counted (the expected number). (Make this change in the makefile, if you are using
the makefile.) Build the program again and test it. It should work, although you
should test to see exactly what the precedence rules are. Use input such as
and
Next, note that after the parser object is created, we call on the function parse().
You can search throughout the file parser.java and you will not find the member
function parse(). But look at the first line of the parser class definition:
public class parser extends java_cup.runtime.lr_parser
This tells us that the parser class is a subclass of the lr_parser class. Open the
file lr_parser.java, which is found in the directory CUP/java_cup/runtime. This
is an abstract class. Down around line 500 you will find the parse() function. Take
a few minutes to look it over. It is not hard to see that it is implementing an LR
algorithm. First, it initializes the stack to the start state (0). Then it goes into a
while loop where it gets the value of act from the action table. If act is greater
than 0, it represents a shift operation, so it pushes a symbol onto the stack. If act
is less than 0, it represents a reduce operation, so it pops several symbols from the
stack and pushes a new symbol. If act equals 0, it indicates an error. We used ideas
very similar to that in the predictive parser in Lab 5.
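The shape of that loop can be seen in the following toy parser for the grammar S → ( S ) | x. It is only a sketch under stated assumptions: the tables were built by hand for this grammar, the stack holds state numbers only (CUP pushes Symbol objects), and the encoding (act > 0 shifts to state act - 1, act < 0 reduces by production -act - 1, production 0 means accept) is chosen for this example rather than copied from lr_parser.java.

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class LrLoopSketch {
    // Terminals of the toy grammar  S -> ( S ) | x
    public static final int LP = 0, RP = 1, X = 2, EOF = 3;

    // ACTION[state][terminal]: >0 shift to state act-1, <0 reduce by production -act-1, 0 error.
    static final int[][] ACTION = {
        { 3,  0,  4,  0},   // state 0: start
        { 0,  0,  0, -1},   // state 1: S seen at top level; reduce production 0 (accept)
        { 3,  0,  4,  0},   // state 2: after (
        { 0, -3,  0, -3},   // state 3: after x; reduce S -> x
        { 0,  6,  0,  0},   // state 4: after ( S
        { 0, -2,  0, -2},   // state 5: after ( S ); reduce S -> ( S )
    };
    // GOTO on the nonterminal S, indexed by state (-1 = no entry).
    static final int[] GOTO_S = {1, -1, 4, -1, -1, -1};
    // Right-side lengths of productions 0: S' -> S, 1: S -> ( S ), 2: S -> x.
    static final int[] RHS_LEN = {1, 3, 1};

    // The token array must end with EOF.
    public static boolean parse(int[] tokens) {
        Deque<Integer> stack = new ArrayDeque<Integer>();
        stack.push(0);                       // start state
        int pos = 0;
        while (true) {
            int act = ACTION[stack.peek()][tokens[pos]];
            if (act > 0) {                   // shift
                stack.push(act - 1);
                pos++;
            } else if (act < 0) {            // reduce
                int prod = -act - 1;
                if (prod == 0) {
                    return true;             // accept
                }
                for (int i = 0; i < RHS_LEN[prod]; i++) {
                    stack.pop();
                }
                stack.push(GOTO_S[stack.peek()]);
            } else {
                return false;                // error
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(parse(new int[] {LP, X, RP, EOF}));
        System.out.println(parse(new int[] {LP, X, EOF}));
    }
}
```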
Go back to the main function in the parser class. Notice that we use a try/catch
construct. If an error occurs in the try clause, execution jumps to the catch clause.
Otherwise, when the try clause is finished, execution skips over the catch clause.
Notice how we have used this to report whether the input was accepted or rejected.
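The same accept/reject pattern can be tried in isolation. In this sketch the parse() function is a hypothetical stand-in that simply throws an exception when it sees the character #; the real parse() is inherited from the CUP runtime.

```java
public class TryCatchSketch {
    // Hypothetical stand-in for the parser's parse(): throws on invalid input.
    static void parse(String input) throws Exception {
        if (input.contains("#")) {
            throw new Exception("Syntax error");
        }
    }

    // Report whether the input was accepted or rejected.
    public static String run(String input) {
        try {
            parse(input);
            return "Accepted";                        // reached only if parse() did not throw
        } catch (Exception e) {
            return "Rejected: " + e.getMessage();     // execution jumps here on an error
        }
    }

    public static void main(String[] args) {
        System.out.println(run("a + b"));
        System.out.println(run("a # b"));
    }
}
```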
6.11 Assignment
Expand the grammar to include all of the following. This will be the full grammar
for our C compiler. In subsequent labs, we will implement more and more parts of
this grammar.
In the JLex file, add the keywords int, double, if, else, return, read, and
print as separate tokens. In Lab 7 we will learn a better way to handle keywords,
but this will suffice for now.
The terminals are given in Table 6.1. The nonterminals are listed in Table 6.2.
The precedence and associativity of the operators are shown in Table 6.3. The table is
arranged from highest to lowest precedence. Finally, the productions of the grammar
Operator                 Associativity
UNARY, NOT               right
TIMES, DIVIDE, MOD       left
PLUS, MINUS              left
LT, LE, GT, GE           left
EQ, NE                   left
AND                      left
OR                       left
ASSIGN                   right
The Program
prog → glbls
Globals
glbls → glbls glbl
| ε
glbl → dcl
| func
Markers
m → ε
n → ε
Declarations
dcls → dcls dcl
| ε
dcl → type ID SEMI
type → INT
| DOUBLE
Function Definitions
func → fbeg stmts m RBRACE
fbeg → fname fargs LBRACE dcls
fname → type ID
| ID
Function Arguments
fargs → LPAREN args RPAREN
| LPAREN RPAREN
args → args COMMA arg
| arg
arg → type ID
Statements
stmts → stmts m stmt
| ε
stmt → expr SEMI
| RETURN SEMI
| RETURN expr SEMI
| LBRACE stmts RBRACE
| IF LPAREN cexpr RPAREN m stmt
| IF LPAREN cexpr RPAREN m stmt n ELSE m stmt
| SEMI
| READ lval SEMI
| PRINT lval SEMI
Conditional Expr.
cexpr → expr EQ expr
| expr NE expr
| expr LT expr
| expr LE expr
| expr GT expr
| expr GE expr
| NOT cexpr
| cexpr AND m cexpr
| cexpr OR m cexpr
| LPAREN cexpr RPAREN
| expr
Numerical Expr.
exprs → exprs COMMA expr
| expr
expr → lval ASSIGN expr
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| expr MOD expr
| MINUS expr
| LPAREN expr RPAREN
| lval
| ID LPAREN RPAREN
| ID LPAREN exprs RPAREN
| NUM
| STR
lval → ID
For each production, provide an action that prints the production. For example,
the action for the production
expr → ID LPAREN exprs RPAREN
should be
add the phrase %prec UNARY after the action part. This will give the negation
operator the same precedence as UNARY.
This lab will serve as Project 2. Zip the files tokens.lex, grammar.cup, and
Makefile in a folder named Project 2 and drop it in the dropbox.
Part IV
A Simple Compiler
Laboratory 7
Key Concepts
• Symbol tables
• Reserved words
• Hash tables
• Linked lists
• Block level
• Semantic actions
• try/catch blocks
• Syntax errors
Read Sections 2.7, 5.1, 7.4, and 7.6 of Compilers: Principles, Techniques, and Tools.
Section 2.7 gives an easy introduction to the idea of a symbol table. Section 5.1
discusses inherited and synthesized attributes of grammar symbols. Some of Section
7.4 discusses function calls, which we are not concerned with yet. Try to understand
it now, and then we will re-read it later when we introduce function calls into our
compiler.
For information on hash tables, read the chapter or section on hash tables in the
textbook you used in Coms 262.
72 LABORATORY 7. THE SYMBOL TABLE
Preliminaries
Copy the folder Lab 07 from the Compiler Design CD to your folder. Also, copy
the files tokens.lex, grammar.cup, Err.java, and Warning.java from your Lab 06
folder to your Lab 07 folder.
7.1 Introduction
The main purpose of this lab is to introduce the symbol table, the table in which all
the identifiers are stored along with information about them. Since the symbol table
is rather complicated, I have written nearly all the code. Your main job in this lab
will be to understand how the code works.
The compiler will interact with the symbol table in two fundamental ways. When
a variable is declared, the compiler will enter it as a new entry in the symbol table.
When a variable is referenced in an expression, the compiler will look it up in the
symbol table to retrieve necessary information about it, such as its data type.
This lab will also introduce (non-trivial) semantic actions. Up to this point, the
only action taken by the parser in response to matching a grammatical pattern has
been to announce that the pattern was matched. Now it will have to respond in a
more substantial way to declarations of variables and uses of variables in expressions.
within blocks within functions. At the present time, all of our identifiers are global.
We will not have local variables until we introduce functions in Lab 11.
When the program enters a new block (by encountering a left brace), a new level
is created in the symbol table. Any variables declared at that level are placed in this
level of the symbol table. When the program exits a block (by encountering a right
brace), the corresponding level of the symbol table is disposed of; those variables no
longer exist.
Since levels 3 and above of the symbol table will be created and destroyed as we enter
and exit functions, we need to be able to create and destroy the various hash tables.
To do this, we will implement the symbol table as a linked list of hash tables. The
highest level, which will always be the current level, will be at the head of the list.
When we move back down to the next lower level, we will destroy this table. The
previous table will then move to the head of the list and become the current table.
This is as it should be. For suppose a variable count is declared as a global
variable and then later declared as a local variable to a function. If we encounter a
use of count within the function, it should refer to the local variable, not the global
variable. Our compiler will work this way if it searches the hash table that is at the
head of the list first. If it fails to find the identifier there, it should search the next
hash table, and so on.
To implement this linked list of hash tables, we use Java’s LinkedList class. Go
back to the Java web page and read about the LinkedList class. Note the functions
addLast(), removeLast(), and get(). These functions will allow us to add and remove
hash tables from the list and to get a particular hash table from the list.
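A linked list of hash tables can be sketched as follows. This is a simplified illustration, not the lab's SymbolTable class: the class and method names are hypothetical, each entry is just a descriptive String, and the current level is kept at the end of the list so that addLast() and removeLast() create and destroy it.

```java
import java.util.Hashtable;
import java.util.LinkedList;

public class ScopeListSketch {
    // One hash table per block level; the current level is the last element.
    private final LinkedList<Hashtable<String, String>> levels =
            new LinkedList<Hashtable<String, String>>();

    // Entering a block creates a new, empty hash table.
    public void enterBlock() {
        levels.addLast(new Hashtable<String, String>());
    }

    // Leaving a block discards the current table and everything declared in it.
    public void leaveBlock() {
        levels.removeLast();
    }

    // Declare a name at the current level.
    public void declare(String name, String info) {
        levels.getLast().put(name, info);
    }

    // Search from the current level back toward the oldest level.
    public String lookup(String name) {
        for (int i = levels.size() - 1; i >= 0; i--) {
            String info = levels.get(i).get(name);
            if (info != null) {
                return info;
            }
        }
        return null;  // not declared at any level
    }

    public static void main(String[] args) {
        ScopeListSketch table = new ScopeListSketch();
        table.enterBlock();
        table.declare("count", "global int");
        table.enterBlock();
        table.declare("count", "local int");
        System.out.println(table.lookup("count"));
        table.leaveBlock();
        System.out.println(table.lookup("count"));
    }
}
```

Because the search starts at the current level, a local count hides the global count until its block is left.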
• init()
• initResWords()
• installResWord()
• enterBlock()
• leaveBlock()
• install()
• idLookup()
• idDump()
The function init() initializes the symbol table by creating the linked list and putting
a null entry at level 0. The function initResWords() installs the reserved words by
calling on installResWord() once for each reserved word. Right now we have only
one keyword: int. Later we will install other reserved words as they are needed.
Take a minute to look at the functions initResWords() and installResWord().
They are very simple.
Why install reserved words in the symbol table, as though they were identifiers?
The sole purpose for doing this is to simplify the lexer, which speeds up the process
of lexical analysis. If we had the lexer detect each reserved word on its own, the
DFA transition table would become quite large. Instead, we let the lexer first find
the reserved word as an identifier. Then it is looked up in the symbol table. If it is
found, then the block level is checked. If the block level is 1 (reserved word), then the
identifier must be a reserved word. Clever! This also prevents the use of a keyword
as an identifier.
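The reserved-word trick can be sketched as follows. The token numbers, the Entry class, and the function classify() are hypothetical names chosen for this illustration; only the idea (match as an identifier, then check the block level of the table entry) comes from the text.

```java
import java.util.Hashtable;

public class ResWordSketch {
    // Hypothetical token numbers and the block level reserved for keywords.
    public static final int ID = 1;
    public static final int INT = 2;
    static final int RES_WORD_LEVEL = 1;

    // A minimal table entry: the block level and the token to return.
    static class Entry {
        final int level;
        final int token;

        Entry(int level, int token) {
            this.level = level;
            this.token = token;
        }
    }

    static final Hashtable<String, Entry> TABLE = new Hashtable<String, Entry>();
    static {
        // Install the only reserved word we have so far.
        TABLE.put("int", new Entry(RES_WORD_LEVEL, INT));
    }

    // The lexer matched text with the identifier pattern; decide which token it is.
    public static int classify(String text) {
        Entry e = TABLE.get(text);
        if (e != null && e.level == RES_WORD_LEVEL) {
            return e.token;  // found at the reserved-word level: a keyword
        }
        return ID;           // anything else is an ordinary identifier
    }

    public static void main(String[] args) {
        System.out.println(classify("int"));
        System.out.println(classify("integer"));
    }
}
```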
When the compiler enters a new block, the block level should be increased by 1. When
it leaves a block, the block level should be decreased by 1. In addition to adjusting
the block level, the function enterBlock() must create a new hash table in the linked
list of hash tables. The function leaveBlock() must destroy the most recent hash
table in the list.
Read the code for enterBlock(). See that it increments level, creates a new
hash table, and adds it to the end of the linked list. Now look at the function
leaveBlock(). This is just as simple as enterBlock() since all we have to do is
to destroy the top-level hash table and decrement level. Write the body of this
function now.
The functions install() and idLookup() are a little more complicated.
This function has two parameters, which are the identifier as a string and the block
level at which it should be installed. If the block level parameter is 0, then the
identifier is installed at the current level, level, whatever that may be. Otherwise, it
is installed at the specified level.
The second part of the function gets the hash table at the specified level from the
linked list.
The third part creates a new IdEntry object, assigns to it the currently known
information, and puts it in the hash table. As more information about the object
becomes available, we will look it up in the table and add the information. This
function uses the Hashtable functions get() and put(). Be sure to read about these
functions on the Java website. What does get() return if the object is not found?
Obviously, s is the name of the identifier. The parameter blkLev is the block level at
which the identifier is defined. If a positive value is passed, then idLookup() will look
for the identifier at that block level only. If the value 0 is passed, then idLookup()
will search through all block levels, beginning with the current (highest) level. In
either case, if idLookup() cannot find the identifier, then it returns null.
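The meaning of the block-level parameter in both functions can be sketched as follows. This is a simplified stand-in, not the code in SymbolTable.java: entries are plain strings of the form name@level, the list index is used as the block level with a null placeholder at level 0, and no check is made that blkLev is in range.

```java
import java.util.Hashtable;
import java.util.LinkedList;

public class IdLookupSketch {
    static LinkedList<Hashtable<String, String>> tables =
            new LinkedList<Hashtable<String, String>>();
    static int level = 0;

    // Start over with a null placeholder at level 0.
    public static void init() {
        tables.clear();
        tables.addLast(null);
        level = 0;
    }

    public static void enterBlock() {
        level++;
        tables.addLast(new Hashtable<String, String>());
    }

    // blkLev == 0 means "install at the current level".
    public static void install(String s, int blkLev) {
        int lev = (blkLev == 0) ? level : blkLev;
        tables.get(lev).put(s, s + "@" + lev);
    }

    // blkLev > 0 searches that level only; blkLev == 0 searches all levels,
    // beginning with the current (highest) one. Returns null if not found.
    public static String idLookup(String s, int blkLev) {
        if (blkLev > 0) {
            return tables.get(blkLev).get(s);
        }
        for (int lev = level; lev >= 1; lev--) {
            String found = tables.get(lev).get(s);
            if (found != null) {
                return found;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        init();
        enterBlock();
        install("a", 0);
        enterBlock();
        install("a", 0);
        System.out.println(idLookup("a", 0));  // found at the current level
        System.out.println(idLookup("a", 1));  // found at level 1 only
    }
}
```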
lineNum++;
as the action to take when the newline character is matched. Now read the code for
the function lookup() and see that it looks up an identifier and takes the appropriate
action if it is found in the reserved-word part of the symbol table. This function is
called in the lexer action associated with identifiers later in the file. Change the action
for the identifier pattern so that it simply calls lookup() and returns to the parser
the value returned by lookup().
At this time, remove the patterns int, double, if, else, return, read, and print
from the lexer and have them installed in the symbol table by the initResWords()
function of the SymbolTable class.
In what follows, modify the actions for only these productions. The nonterminal dcl
represents a declaration. The nonterminal type is a data type (so far, only int). And
the nonterminal lval represents an l-value, i.e., an expression that is permitted to
appear on the left side of an assignment. In our compiler, that will probably be only
an identifier, but in C l-values include array members (e.g., a[i]) and dereferenced
pointers (e.g., *p), as well as a number of other expressions. In C++ they also include
pre-incremented objects (e.g., ++n) and function calls if the function returns a value
by reference.
As we add to our compiler later, we will expand these productions. For example,
we will add
In that case, the identifier is installed in the symbol table at the current level and
some of its attributes are recorded.
Now look at the id() function. This function will be called when an identifier is
encountered outside of a declaration. In this case, the variable should already have
been declared and entered in the symbol table. If it hasn’t, then an error message
is printed and the identifier is installed at the current level, but with no attributes
other than its name and current block level. (Nothing else is known about it.)
$ java TableBuilder
The first thing that appears is a series of messages informing us what the block level is
and that keywords have been added to the symbol table.
Now, one by one, type the following lines and note the output.
int a;
int b;
int main()
{
a = b + 2;
b = a*(a - 1);
}
These lines are all correct, so there should be no error messages. Now enter some
lines containing undeclared variables and redeclared variables.
int a;
int b;
int a;
int main()
{
c = a + b;
}
While you are at it, in the same manner declare the nonterminal type to be of
Integer type. We will need that soon.
In the productions listed above in which ID or type appear on the right side,
introduce a variable associated with them. For IDs, this variable will automatically
take on the value that was placed in the Symbol object by the lexer. For example,
for the ID terminal, you should replace ID with ID:i. Now i is a String variable
whose value is the value of the ID token. Do a similar thing with type. We will
see shortly that nonterminals, such as type, get their values from their productions.
These values may now be used in the semantic actions.
With the production
SemanticAction.dcl(i, t);
Do not remove the output statement that is already there. Note that i and t, the
values of ID and type, are passed as parameters.
The actions of these two productions differ in two important ways. First, one
calls a semantic action function and the other does not. Second, one returns a value
through RESULT and the other does not. We will see numerous examples of each
type. Generally, we will write a special semantic action function if the action is at all
complicated. Otherwise, our CUP file would be too hard to read.
We see that the RESULT mechanism allows the attribute Ops.INT to be inherited
by the terminal ID from the nonterminal type. (The actual transfer takes place in
the dcl() function.) This is how CUP is able to handle inherited and synthesized
attributes.
Finally, with the production
SemanticAction.id(i);
Also, in the output statement in that action part, have it print the name of the
identifier.
Rebuild the program and run TableBuilder again, using the same input as above,
including the errors:
int a;
int b;
int a;
int main()
{
a = b + 2;
b = a*(a - 1);
c = a + b;
}
int a;
int b;
int main()
{
a = b # 2;
}
Second, let’s enter a line with correct tokens, but with invalid syntax. Run the
program and enter
int a;
int b;
int main()
{
(a + b) = 2;
}
Syntax error
Couldn’t repair and continue parse
appears. This appears nowhere in our code, so it must have been generated by CUP.
(It was.) The second error message
which we should recognize as the one we wrote in the catch block in the main function
of the TableBuilder class. Finally, note that the function idDump() was called, even
though there was a syntax error. This indicates that the try/catch blocks work as
expected, preventing the program from crashing.
7.12 Assignment
The SymbolTable class function idDump() currently does nothing except print a line.
Its purpose is to display all the contents of the symbol table. There is a Hashtable
member function elements() that can be used to display the elements of the hash
table. It returns an Enumeration object. Look up the Hashtable class and the
Enumeration interface on the Java website, read about them, and figure out how to
use them to display the contents of all levels of the symbol table. In the section on
the Enumeration interface, there is an excellent example that shows how to print the
values contained in a vector.
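The basic Enumeration idiom looks like this for a single hash table. It is only a starting point, not the finished idDump(): the class and function names are hypothetical, the entries are plain strings rather than IdEntry objects, and the order in which a Hashtable enumerates its elements is not specified.

```java
import java.util.Enumeration;
import java.util.Hashtable;

public class DumpSketch {
    // Collect every value stored in one hash table, one per line.
    public static String dump(Hashtable<String, String> table) {
        StringBuilder out = new StringBuilder();
        Enumeration<String> e = table.elements();  // enumerates the values, not the keys
        while (e.hasMoreElements()) {
            out.append(e.nextElement()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Hashtable<String, String> table = new Hashtable<String, String>();
        table.put("a", "int a");
        table.put("b", "int b");
        System.out.print(dump(table));
    }
}
```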
Zip the files tokens.lex, grammar.cup, IdEntry.java, SymbolTable.java,
Ops.java, SemanticAction.java, TableBuilder.java, and Makefile in a folder
named Lab 07 and drop it in the dropbox.
Laboratory 8
Key Concepts
• Dereferencing l-values
Read Sections 5.2 and 5.6 of Compilers: Principles, Techniques, and Tools. Section
5.2 introduces the basics of abstract syntax trees. Section 5.6 may be tough going,
but that material will be explained further in later labs. Try to get what you can
from it now.
Preliminaries
Copy the folder Lab 08 from the Compiler Design CD to your folder. Copy the files
IdEntry.java, Ops.java, SymbolTable.java, tokens.lex, grammar.cup, Err.java,
and Warning.java from your Lab 07 folder to this folder.
86 LABORATORY 8. THE ABSTRACT SYNTAX TREE
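[Figure 8.1: a syntax tree with ASSIGN at the root. Its left subtree is the ID node tri_area. Its right subtree is a DIVIDE node whose left subtree is a TIMES node (over DEREF of ID base and DEREF of ID height) and whose right subtree is the NUM node 2.]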
8.1 Introduction
The purpose of this lab is to create and print an abstract syntax tree for a C program.
The C program will use only a small subset of the grammar we introduced in Project
2.
As an example of a syntax tree, consider the statement
The root node is an assignment operation. Its left subtree is a pointer to tri_area.
Its right subtree represents the expression (base * height)/2. The tree looks like
the tree in Figure 8.1.
The program TreeBuilder.java in this lab will display it in the form
ASSIGN INT
    ID PTR|INT value = "tri_area"
    DIVIDE INT
        TIMES INT
            DEREF INT
                ID PTR|INT value = "base"
            DEREF INT
                ID PTR|INT value = "height"
        NUM INT value = 2
In this display, each node is followed by its left subtree and then its right subtree,
indented one tab stop. Notice that base and height are dereferenced, but tri_area
isn’t. That will be explained next.
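A display of this kind comes from a pre-order traversal that indents by depth. The sketch below is not the lab's printTree(): the Node class is a hypothetical stand-in for TreeNode whose label already contains the text to print, and the output is collected in a string so that it can be inspected.

```java
public class PrintTreeSketch {
    // Hypothetical stand-in for TreeNode: a printable label and two subtrees.
    static class Node {
        final String label;
        final Node left;
        final Node right;

        Node(String label, Node left, Node right) {
            this.label = label;
            this.left = left;
            this.right = right;
        }
    }

    // Print the node, then its left subtree, then its right subtree,
    // each one tab stop deeper than its parent.
    static void render(Node n, int depth, StringBuilder out) {
        if (n == null) {
            return;
        }
        for (int i = 0; i < depth; i++) {
            out.append('\t');
        }
        out.append(n.label).append('\n');
        render(n.left, depth + 1, out);
        render(n.right, depth + 1, out);
    }

    public static String render(Node root) {
        StringBuilder out = new StringBuilder();
        render(root, 0, out);
        return out.toString();
    }

    // The tree for the tri_area assignment discussed above.
    public static Node sample() {
        Node base = new Node("DEREF INT",
                new Node("ID PTR|INT value = \"base\"", null, null), null);
        Node height = new Node("DEREF INT",
                new Node("ID PTR|INT value = \"height\"", null, null), null);
        Node times = new Node("TIMES INT", base, height);
        Node divide = new Node("DIVIDE INT", times,
                new Node("NUM INT value = 2", null, null));
        return new Node("ASSIGN INT",
                new Node("ID PTR|INT value = \"tri_area\"", null, null), divide);
    }

    public static void main(String[] args) {
        System.out.print(render(sample()));
    }
}
```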
object, which will be an identifier or a number (and later a string). The mode of
an exterior node will be the kind of object stored in that node. For example, if the
object is an integer variable (l-value), then the mode will be a pointer to an INT. If
the object is an integer constant, then the mode will be INT.
Open the file TreeNode.java. This file defines the TreeNode class whose objects
have the following attributes: the operation (oper) represented by the node, the
mode (mode) of the operation, a reference to the left subtree (left), a reference to
the right subtree (right), the identifier (id) represented by the node, the number
(num) represented by the node, and the string (str) represented by the node.
If the node is a binary interior node, then left and right will be non-null, and
id, num, and str will be undefined. On the other hand, if the node is an exterior
node, then left and right will be null, while exactly one of id, num, and str will
be defined, depending on the kind of exterior node. From time to time, we will have
unary interior nodes. They will always use the left subtree rather than the right
subtree.
Note the types of the data members oper, mode, left, right, id, num, and str.
Also, one constructor
public TreeNode(IdEntry i)
and the toString() function have been defined. You will define three additional
constructors. First, define the default constructor:
public TreeNode()
It should set oper, mode, and num to 0 and left, right, id, and str to null.
Next, define the following constructor.
public TreeNode(int op, int m, TreeNode l, TreeNode r)
The purpose of this constructor is to join together two existing trees, with root nodes
l and r, as the left and right subtrees of a new tree with this node as its root node.
In the root node, the value of oper should be op and the value of mode should be m.
Finally, define the constructor
public TreeNode(int n)
It will create a node that represents a number. The member oper should be Ops.NUM,
mode should be Ops.INT, and num should be the value of n.
Write these constructors. We will use these constructors later in this lab.
int opcode;
String opstring;
int kind;
The integer opcode is a value from the list defined in Ops.java. The string opstring
is a readable string version of the opcode. It will be used when the opcode needs to be
printed in a readable form. The integer kind will store the kind of node: leaf, unary,
or binary. This class creates the data type, but does not instantiate any objects.
That will be done next in the Utility class in the array opInfo[].
The TreeOps class has a constructor and two functions that return information
about an identifier. The function baseType() returns the fundamental data type of
an identifier, e.g., INT. Later it will also return the type DOUBLE.
The function baseSize() returns the size, in bytes, of the fundamental data type.
The size of an INT is 4. Later, it will also return 8 for the type DOUBLE.
Take a minute to look at these two functions to see how they work. Note especially
how the binary “and” operator & is used to mask out the pointer bit, leaving only
the bit for the base type.
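The masking idea can be tried with a few lines of Java. The bit values below are assumptions made for this sketch (the real ones are in Ops.java), and the functions only imitate what baseType() and baseSize() are described as doing.

```java
public class ModeBitsSketch {
    // Hypothetical bit assignments; the lab's real values are defined in Ops.java.
    public static final int INT = 0x01;
    public static final int DOUBLE = 0x02;
    public static final int PTR = 0x10;

    // Clear the pointer bit, leaving only the bit for the base type.
    public static int baseType(int mode) {
        return mode & ~PTR;
    }

    // Test whether the pointer bit is set.
    public static boolean isPointer(int mode) {
        return (mode & PTR) != 0;
    }

    // Size in bytes of the base type: 4 for INT, 8 for DOUBLE.
    public static int baseSize(int mode) {
        return baseType(mode) == DOUBLE ? 8 : 4;
    }

    public static void main(String[] args) {
        int mode = INT | PTR;  // "pointer to int"
        System.out.println(Integer.toHexString(mode));
        System.out.println(baseType(mode) == INT);
        System.out.println(baseSize(DOUBLE | PTR));
    }
}
```

With these values, "or"-ing INT and PTR sets two separate bits, so the base type and the pointer flag can each be recovered independently.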
The constant ERROR is a special case, since it does not represent an actual tree node;
the constant ALLOC is more typical.
Add entries for the remaining node types that were listed in the Ops class.
printTree(t, 0);
ID PTR|INT
There are two special cases concerning leaf nodes. Depending on whether the node is
an identifier node or a number node, we should also print the value of the identifier
or the number. For example, nodes for the integer variable count and the number
123 would be displayed as
respectively.
I have included a function printType() that will print the data mode of a node,
given the type data member of the TreeNode. Notice how it uses bitwise operators
to test the bits of the integer type.
arith()
assign()
deref()
mod()
negate()
print()
read()
The function arith() covers addition, subtraction, multiplication, and division. Since
the tree has exactly the same form in each of these operations, only one function is
needed for them. The function mod() handles the mod operator %. It is different from
the other arithmetic operators since it applies only to ints. The function assign()
handles assignments. The function deref() will dereference a variable. The function
negate() handles the operation of negating an expression.
[Figure: a NUM leaf node with mode INT.]
We will provide two other functions, read() and print(), that will read and
print, respectively, an integer variable. When we generate code, these two functions
will allow us to input and output values in order to test our programs better.
The purpose of each function is to build its part of the tree, printing the tree as
necessary. We will discuss how to write each of the eight functions listed above. We
will begin with a case so simple that it doesn’t require a function.
to be
Now the TreeNode that is returned by RESULT will be assigned to the expr nonter-
minal.
To test this, let’s make a modification. We can restore it to the above form once we
are sure that it is working. Replace the above statement with the following sequence
of statements:
Clearly, we are just inserting a call to printTree() in order to see the result.
Now test this by building TreeBuilder and running it with the test file testfile.c.
This file contains a simple (and pointless) C program that uses each of the syntactic
structures covered in this lab. At this point, TreeBuilder should print a NUM tree for
each integer constant that it encounters in the program.
As you add the other functions to SemanticAction.java, pause after each one,
build TreeBuilder, and test it on testfile.c. When this lab is done, be sure to
remove all the extraneous calls to Utility.printTree().
This will create a TreeNode from the symbol table entry p (an IdEntry object)
that was returned by idLookup(). Look in the file TreeNode.java to see what this
TreeNode constructor does. In particular, note that the mode is set to
i.type | Ops.PTR
For the time being, i.type is Ops.INT. Look in the file Ops.java to see the values
of Ops.INT and Ops.PTR. What value do we get when we “or” them together? This
is our method of recording the fact that the mode of the node is “pointer to int.”
Now continue by doing the following.
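[Figure: an ALLOC node with mode type. Its left subtree is an ID leaf with mode type|PTR and its right subtree is a NUM leaf with mode INT.]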
• In the production
RESULT = SemanticAction.id(i);
[Figure: a DEREF node with mode mode. Its single (left) subtree is the node e, with operation e.oper and mode e.mode.]
The first tree t1 represents the identifier. The second tree t2 represents the number
of bytes of memory required by the identifier. Note that its num data member is set
to the size of the base type of the identifier. Then an allocation tree is created with
t1 and t2 as its subtrees. The IdEntry object should be returned by dcl().
The return type of dcl() should now be IdEntry. Make these changes. In the
CUP file, make the nonterminal dcl of the IdEntry type. Also, change the action
associated with the production
Utility.printTree(t);
Add this statement to dcl() and then test it using the testfile.
our grammar.) The mode of the DEREF node is the base type of the identifier. If the
identifier’s mode is PTR|INT, then the mode of the DEREF node is INT.
Write the deref() function. The deref() function should return a TreeNode.
Make the appropriate changes in the grammar file. This is accomplished by intro-
ducing the variable v to represent lval in the production
RESULT = deref(v);
to the production. Add a printTree() statement and test the function with the
testfile.
is trivial. The tree representing the expression tree on the right is simply passed on
to the expression on the left. Write the action for this production that will do that.
No special semantic action function is necessary.
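[Figure: an ASSIGN node with mode v.mode. Its left subtree is the ID node v (mode v.mode) and its right subtree is the node e (operation e.oper, mode e.mode).]
[Figure: a binary node op with mode mode. Its left subtree is e1 (operation e1.oper, mode e1.mode) and its right subtree is e2 (operation e2.oper, mode e2.mode).]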
Right now, all objects are ints, so e is an int and v is a pointer to an int. Later,
when we introduce doubles, it will be possible that e will be a double and v will be
a pointer to an int, or e will be an int and v will be a pointer to a double.
Make the corresponding changes in the CUP file to the production
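[Figure 8.7: a MOD node with mode INT. Its left subtree is e1 (operation e1.oper, mode e1.mode) and its right subtree is e2 (operation e2.oper, mode e2.mode).]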
8.16 Assignment
Complete the set of functions by writing mod(), negate(), print(), and read().
Remove the calls to printTree() from all functions except dcl(), print(), and
read(). In the action associated with the production
Utility.printTree(e);
because this marks the final stage in the development of that tree.
The functions negate() and mod() are similar to arith() and deref(). They
are arithmetic, but negate() has only a left subtree. You should be able to write
them without much trouble. The function mod() should create the tree shown in
Figure 8.7 and the function negate() should create the tree in Figure 8.8.
The functions print() and read() each have an argument that represents an
identifier. They should build the trees appearing in Figure 8.9.
In both cases, the mode of the root node is the base type of the identifier node.
Each of these two functions should print the tree.
Throughout all of these functions, use the base type wherever appropriate. Also,
be sure to dereference l-values whenever necessary.
This lab will serve as Project 3. Zip the files tokens.lex, grammar.cup, Err.java,
Warning.java, Id.java, Ops.java, SemanticAction.java, SymbolTable.java, TreeBuilder.java,
100 LABORATORY 8. THE ABSTRACT SYNTAX TREE
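[Figure 8.8: NEGATE tree. Root NEGATE (mode e.mode) with a single left subtree e.oper (mode e.mode).]
[Figure 8.9: PRINT and READ trees. Root PRINT (mode v.mode) with subtree DEREF (mode v.mode) over ID (mode v.mode); root READ (mode v.mode) with subtree ID (mode v.mode).]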
Code Generation
Key Concepts
• Command-line arguments
• Post-order traversals
• Stack operations
• Addressing modes
Read Sections 9.1 and 9.2 in Compilers: Principles, Techniques, and Tools. Read
Chapter 3 in the Intel Developer’s Manual, Vol. 1.
Preliminaries
Copy all of the files in your Lab 08 folder into a folder named Lab 09. Copy the files
in the folder Lab 09 from the Compiler Design CD to your Lab 09 folder. This will
replace the makefile and will add the files CodeGenerator.java, compiler_v1.java,
and NewPrintTreeFunc.java.
104 LABORATORY 9. CODE GENERATION
9.1 Introduction
In this lab we will finally create a working compiler. Our program, compiler_v1.java,
will compile simple C programs into assembly code. If we invoke the gnu assembler,
we will obtain an executable program.
The subset of C that is handled by this compiler is very limited. All variables
must be global and they should be declared before main(). The program has exactly
one function, main(). We will implement our special read and print statements so
that our program can read and print integers. Therefore, a program that will print
the average of two integers should be written like this:
int a;
int b;
int average;
int main()
{
read a;
read b;
average = (a + b)/2;
print average;
}
Enter a: 8
Enter b: 16
average = 12
or if we wanted to compile the program and print the tree, we would type
The order of -t and -c does not matter, provided they appear after the name of the
compiler and before the name of the source file.
How does compiler_v1 know if there are command-line arguments? Open the
file compiler_v1.java. At the beginning of the main() function there is a for loop
that uses args.length. The object args is an array of Strings and it appears as the
parameter of the main() function. When a command is typed, the operating system
passes all command-line arguments as members of the String array args. This idea
was taken from C, where the prototype of main() is
In Java, arrays automatically come with a public member length. Thus, args.length
is the number of Strings in the array.
The for loop compares each argument to "-t" and to "-c". If there is a match,
then the corresponding boolean variable tree or code is set to true, for future
reference. Later, when printTree() is called, the Utility class checks the val-
ues of tree and code to decide what to output. Open Utility.java, look at the
printTree() function, and note that it uses compiler_v1.code to decide whether
to call CodeGenerator.generateTreeCode() and that it uses compiler_v1.tree to
decide whether to print the abstract syntax tree.
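The flag handling described above can be sketched in a few lines of Java. This is only an illustration: the class FlagDemo and the method parseFlags() are hypothetical names, not the contents of compiler_v1.java, and only the flags -t and -c come from the lab.

```java
// Hypothetical stand-in for the flag handling in compiler_v1.java.
class FlagDemo {
    static boolean tree = false;  // set by -t: print the syntax tree
    static boolean code = false;  // set by -c: generate assembly code

    // Scan every argument and record which flags were present.
    static void parseFlags(String[] args) {
        tree = false;
        code = false;
        for (int i = 0; i < args.length; i++) {
            if (args[i].equals("-t")) {
                tree = true;
            } else if (args[i].equals("-c")) {
                code = true;
            }
        }
    }

    public static void main(String[] args) {
        parseFlags(args);
        System.out.println("tree = " + tree + ", code = " + code);
    }
}
```

Because every argument is compared to both flags, the order in which -t and -c are typed makes no difference to the result.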
right subtree, then it generates code for the node. In other words, it performs a post-
order traversal of the tree, writing the code for each node. That is very important to
keep in mind as we consider how to write the assembly code.
Look at the function generateCode() and see that it distinguishes one special
case: ALLOC. This special case is not recursive, so we do not call traverseTree(). In
the future there will be a few more special cases. When a case is handled recursively,
at each level of recursion, the result of that operation is left on the stack to be used
at the next level up. That means that at the root node, the value will also be left on
the stack. This could be a problem in a for loop such as
since the 1000000 values assigned to a will be left on the stack, causing stack overflow.
Therefore, the last action in traverseTree() is to pop the final value of the tree to
avoid this problem.
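To see the order in which nodes are visited, consider the following sketch of a post-order traversal. The classes Node and PostOrderDemo are hypothetical miniatures, not the TreeNode and CodeGenerator classes of the lab, and the strings stand in for the assembly code that would be written for each node.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature tree node; the lab's TreeNode class has more fields.
class Node {
    final String label;
    final Node left, right;

    Node(String label, Node left, Node right) {
        this.label = label;
        this.left = left;
        this.right = right;
    }
}

class PostOrderDemo {
    // Visit the left subtree, then the right subtree, then the node itself.
    static void traverse(Node n, List<String> out) {
        if (n == null) return;
        traverse(n.left, out);
        traverse(n.right, out);
        out.add(n.label);
    }

    // Order in which node code would be written for a whole tree.
    static List<String> order(Node root) {
        List<String> out = new ArrayList<>();
        traverse(root, out);
        out.add("pop final value");  // discard the result left by the root
        return out;
    }
}
```

For the tree of a = b + 2, the order is ID a, ID b, DEREF, NUM 2, PLUS, ASSIGN, followed by the final pop.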
Now look at the function generateNodeCode(). This is the function that writes
the assembly code for a single node. At this point, we have thirteen different types
of node (plus the error node). Let us look at each of these thirteen. I have arranged
them in alphabetical order in the program in order to facilitate locating them. I suggest
that you maintain them in alphabetical order as we add more cases later.
9.4 Allocation
To allocate memory for a global, we write the assembly directive .comm, for common,
followed by the name of the global, followed by the number of bytes to be allocated.
The necessary information has all been stored in the tree, which makes it very simple
to write the assembly code. Pay careful attention to the way in which the name was
retrieved from the ID node in the syntax tree and the way in which the number was
retrieved from the NUM node. We will often have occasion to do that sort of thing in
the future.
For example, declaring the global a to be an int would be written in assembly
code as
.comm a,4
This assembler directive will allocate 4 bytes of memory and associate the name a
with the address of this memory.
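As a small illustration, the directive can be assembled from the two pieces of information stored in the tree. The class AllocDemo and the method commDirective() are hypothetical; the ALLOC case in CodeGenerator.java retrieves the name and the size from the ID and NUM nodes rather than receiving them as parameters.

```java
class AllocDemo {
    // Build the directive that reserves `bytes` bytes under the label `name`.
    static String commDirective(String name, int bytes) {
        return ".comm " + name + "," + bytes;
    }

    public static void main(String[] args) {
        System.out.println(commDirective("a", 4));
    }
}
```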
Notice that we print the line
where name is the name of the global identifier. Notice that we use the mnemonic
lea (load effective address) instead of mov (move). Had we written
the effect would have been to move the value stored at name into register eax. That
is not what we want. However, if the value of name is needed, then the lea operation
will be followed by a dereference operation.
This example and the previous one make it clear that we must always be aware
of the exact effect of the assembly instruction. Are we operating on the value in the
register or on the value pointed to by the address in the register? Are we operating
on the address itself or on the value stored at the address? These distinctions are
very important.
Notice that the parentheses are used to indicate indirect addressing. That is, (%eax)
is interpreted as “the value stored at the address in eax,” whereas %eax means “the
value stored in eax.”
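The distinction can be modeled in Java by treating memory as an array and the register as an index into it. This is only a model, with hypothetical names; it is not assembly code and not part of the compiler.

```java
class AddressingDemo {
    // Hypothetical memory, indexed by address.
    static final int[] memory = new int[16];

    // Models %eax: the value held in the register itself.
    static int direct(int eax) {
        return eax;
    }

    // Models (%eax): the value stored at the address held in the register.
    static int indirect(int eax) {
        return memory[eax];
    }
}
```

If memory[5] holds 42 and the register holds 5, then direct addressing yields 5 and indirect addressing yields 42.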
Write the code for the DEREF case.
int a;
int main()
{
read a;
print a;
}
Now build the compiler using the makefile. Then type the command
This will read the C program from test1.c and write the assembly program to
test1.s. Open the file test1.s to see what is there. (The extension .s is used for
assembly-language files.) If a lot of trace information was output to the file, then go
back into those files and comment those statements out. Do not remove them! They
will be very useful later on when we have to debug.
Now we will assemble the file test1.s. Type
The command gcc invokes the gnu C compiler. (The UNIX command is cc.) The
-o option means “output file,” and it designates test1.exe as the name of the out-
put file. If there were no error messages, then we now have an executable program
test1.exe. Let’s test it. The command to run test1.exe is
$ test1
However, the system does not know to look in this folder for executables. Therefore,
we must modify the PATH environment variable. Open the System control panel and
add the path “.” to the PATH variable. Recall that the dot means “the current
directory.” Also, it might be good to add the “.” at the beginning of the list of paths
rather than at the end so that if there is another executable out there with the same
name, it will find the one in this folder first. Close the Cygwin window and open a
new one so that this change will take effect.
Now run the program. The program should prompt you to enter a value for a.
Then it prints the value that you entered. We will follow this procedure at various
points in the lab, whenever we wish to test our work so far. But we must be sure
that our test program contains only operators that we have implemented.
Let us pause for a moment to fully take in what just happened. You just executed
a program whose code was produced by your compiler. That calls for a moment of
silent reflection. Reflect on the major milestones in your life: birth, marriage, death,
and the day you wrote a compiler that produced executable code. When you feel
ready, continue with the next section.
9.8 Numbers
We are now empowered. Let us push on. A number is an immediate value that should
be pushed onto the stack. The code to do this is
where number is a specific value. Note the use of $ to indicate an immediate value.
The value of number is stored in the node and must be written into this statement.
You will have to use the num field of the appropriate TreeNode object to get the value
of number. Write the code in the NUM case.
Before we can test this, we will need to encode assignment statements.
• Pop the address of the variable from the stack into a register.
Enter this code into the ASSIGN case of the generateNodeCode() function. This
example should set the pattern for many of the following operations.
Notice that I have included in-line comments for each assembly instruction. It is
worth your time to type these comments since they will be an enormous help when
you are trying to understand or debug the compiler-generated assembly code.
Now we can test a program with simple assignment statements that assign num-
bers and variables to variables. Be sure to rebuild your compiler before trying a test
program.
Create a file test2.c that contains the following C program.
int a;
int b;
int main()
{
a = 123;
b = a;
print a;
print b;
}
Compile this program (using compiler_v1, not gcc!) and then assemble test2.s
(using gcc) and run test2.exe.
and write appropriate in-line comments for each line. For example, you might com-
ment “Pop right operand,” “Pop left operand,” “Add values,” and “Push result.” Be
brief, but informative.
Be sure to rebuild your compiler.
Create a file test3.c containing the C program:
int a;
int b;
int sum;
int main()
{
read a;
read b;
sum = a + b;
print sum;
}
Test this program. If it works, then test other addition statements, such as
sum = a + b + 2;
and
sum = a + (b + 2);
diff = a - b - 2;
and
diff = a - (b - 2);
operator. Write a test program test5.c to test this operator. Test the precedence
of negation by combining it with other operators. For example, you might try
a = -b + 2;
imul register
is designed to multiply the accumulator eax by the value in the specified register.
The product is stored in the 64-bit register pair edx:eax, with the high-order 4 bytes
of the product stored in edx and the low-order 4 bytes stored in eax. In our compiler,
we will assume that the product of any two 4-byte integers is another 4-byte integer,
or else the result will not be mathematically correct. (The value will “wrap around”
from 2^31 − 1 to −2^31.) Therefore, the assembly code is
Add this code to the TIMES case. Then create, compile, and run a program test6.c
that will test multiplication. Be sure to test the associativity and precedence of
multiplication over addition and subtraction.
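The split of the product into two halves, and the effect of keeping only the low-order half, can be examined in Java, whose int type is also a 32-bit two's-complement integer. The class MulDemo is a hypothetical illustration and is not part of the compiler.

```java
class MulDemo {
    // High-order 32 bits of the 64-bit product (the half that edx would hold).
    static int high(int a, int b) {
        return (int) (((long) a * b) >> 32);
    }

    // Low-order 32 bits of the 64-bit product (the half that eax would hold).
    static int low(int a, int b) {
        return (int) ((long) a * b);
    }

    public static void main(String[] args) {
        System.out.println(low(3, 4) + " " + high(3, 4));
        System.out.println(low(65536, 65536) + " " + high(65536, 65536));
    }
}
```

For 3 * 4 the low half is 12 and the high half is 0, so nothing is lost. For 65536 * 65536 the low half is 0 and the high half is 1, so a compiler that keeps only the low half reports a product of 0.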
idiv register
then the divisor is assumed to be in the specified register and the dividend is in the
64-bit register pair edx:eax. The division produces both a quotient and a remainder.
Read the manual to see exactly where they are stored.
You must be a little careful with the register pair edx:eax. Our compiler assumes
that the dividend is only 32 bits, even though edx:eax is 64 bits. Therefore, you
should load the dividend into register eax. However, it is not enough simply to clear
register edx. If the value in eax is negative, then the sign must be extended through
edx. This will require an instruction that converts a doubleword to a quadword.
Find the convert-doubleword-to-quadword instruction in the Intel Developer’s Man-
ual. When you write the assembly code, after loading the divisor and the dividend,
convert the doubleword eax to the quadword edx:eax, then divide and push the
quotient.
Write a test program test7.c. Be sure to check that the division was done in the
right order. That is, the expression a/b should divide a by b, not b by a. Also, test
the associativity and precedence of division.
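To see why clearing edx is not enough, we can model the register pair in Java with a 64-bit long. The class DivDemo and its methods are hypothetical and only imitate the arithmetic; they do not name or replace the instruction that you are asked to find in the manual.

```java
class DivDemo {
    // Combine edx (high half) and eax (low half) into the 64-bit dividend.
    static long pair(int edx, int eax) {
        return ((long) edx << 32) | (eax & 0xFFFFFFFFL);
    }

    // Dividend formed by clearing edx: wrong when eax is negative.
    static long zeroExtended(int eax) {
        return pair(0, eax);
    }

    // Dividend formed by copying the sign bit of eax through all of edx.
    static long signExtended(int eax) {
        return pair(eax >> 31, eax);
    }

    public static void main(String[] args) {
        System.out.println(zeroExtended(-7) / 2);
        System.out.println(signExtended(-7) / 2);
    }
}
```

With eax equal to -7 and edx cleared, the pair represents 4294967289, and dividing by 2 gives 2147483644. With the sign extended, the pair represents -7, and dividing by 2 gives -3, as expected.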
read a;
and
print a;
will generate the assembly code necessary to make function calls to scanf() and
printf().
The form of the scanf() function call is
where var1, ..., varn is a list of variable names. The string format contains %d to read
an int, %c to read a char, %s to read a string, and %f to read a float. For example,
if you wanted to read two ints and a float (in that order), then format would be
"%d%d%f". We will use only %d. Furthermore, since our read statements read only one
variable, the format string parameter will be followed by just one variable parameter.
A read tree has READ at the root and an ID node as the left subtree. When the
left subtree is evaluated, the address of the variable is pushed onto the stack.
The format string and the address of the variable must be passed to the function
as parameters. Parameter passing on the x86 is done by pushing the parameters onto
the stack, in order from right to left, so that they will be popped by the function in
the correct order, from left to right. Therefore, the address of the variable must be
pushed first. But this was already done when the left subtree was evaluated. So we
need only push the format string onto the stack. This is done by creating the string
in memory and pushing its address.
The assembly code for the call
scanf("%d", &a);
Again, the variable has already been dereferenced and its value pushed onto the stack
when the left subtree was evaluated, so that does not need to be done here.
Soon we will incorporate function calls into our compiler (version 3), so you should
try to understand what is going on here with scanf() and printf().
9.17 Assignment
Finish implementing all of the operations discussed above. The finished product will
be turned in as Project 4. Put all of your source files in a folder named Project 4,
zip it, and drop it in the dropbox. Congratulations! You have built a compiler!
Part V
Laboratory 10
Key Concept
• Casting types
Preliminaries
Copy all the files from your Lab 09 folder into a new folder named Lab 10.
10.1 Introduction
We have been using only one data type in our compiler so far in order to keep it simple.
Now we would like to introduce a second data type, double. We could introduce other
types, such as char, float, and short, but that would only take more time, with little
additional benefit. We will learn how different types are handled by working with just
the two types int and double. This will create the potential for mixed expressions,
including mixed assignment statements. In any situation where a particular type is
expected, we will have to check the actual type. If it is not the expected type, then a
type conversion will have to be performed or an error message will have to be printed.
120 LABORATORY 10. FLOATING-POINT NUMBERS AND THE AST
In this lab, we will build only the syntax tree. Before we can write the assembly
code, we must learn how the floating-point unit (FPU) works. All of that will be
done in the next lab.
Use compiler_v1 for pattern and the wildcard * for files. Then make the change
in the files listed.
type → DOUBLE
is in the file grammar.cup. Include a semantic action similar to the semantic action
for the production
type → INT.
This defines Ops.DOUBLE to be the integer with bit 1 set (binary 00000010). The con-
stant Ops.INT is already defined to be the integer with bit 0 set (binary 00000001) and
Ops.PTR is the integer with bit 5 set (binary 00100000). In SemanticAction.java,
in the dcl() function, you will see that when a node for a variable is created, its type
is set to
id.type | Ops.PTR
where id.type is Ops.INT or Ops.DOUBLE. Thus, for int variables this value is binary
00100001 and for double variables it is binary 00100010. Later, in the baseType()
function, we wish to recover the value Ops.INT or Ops.DOUBLE from this value. We
will do this by “and”-ing the type with the value Ops.INT | Ops.DOUBLE (binary
00000011), which produces 00000001 for pointers to ints and 00000010 for pointers
to doubles, which are the values of Ops.INT and Ops.DOUBLE.
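The bit arithmetic described above can be checked with a short sketch. The class TypeBits is hypothetical; it copies the three bit values given in this section rather than using the lab's Ops and TreeOps classes.

```java
class TypeBits {
    static final int INT = 1 << 0;     // binary 00000001
    static final int DOUBLE = 1 << 1;  // binary 00000010
    static final int PTR = 1 << 5;     // binary 00100000

    // Keep only the base-type bits, discarding PTR.
    static int baseType(int type) {
        return type & (INT | DOUBLE);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toBinaryString(PTR | DOUBLE));
        System.out.println(baseType(PTR | DOUBLE) == DOUBLE);
    }
}
```

Because each constant occupies its own bit, "or" combines them without interference and "and" with a mask recovers exactly the bits of interest.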
Therefore, the baseType() function in TreeOps.java should return the value
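[Figure: the tree e (root e.oper, mode e.mode) before casting, and the tree after casting: root CAST (mode t) with single subtree e.oper (mode e.mode).]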
Next, we need a function that will convert an object to a specified type. For this,
we introduce the cast() function.
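[Figure: binary operator tree with casts. Root oper (mode mode) with two CAST subtrees (each of mode mode), the left over e1.oper (mode e1.mode) and the right over e2.oper (mode e2.mode).]
[Figure: ASSIGN tree with a cast. Root ASSIGN (mode v.mode) with left subtree ID (mode v.mode) and right subtree CAST (mode v.mode) over e.oper (mode e.mode).]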
double x;
x = 1; // Converts 1 to 1.0 and stores it
x = x/2; // Performs floating-point division, creating 0.5
x = 1/2; // Performs integer division, creating 0
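Java applies the same rules to these three statements, so the results can be confirmed with the following sketch. It is written in Java rather than C, and the class MixedDemo is hypothetical; it shows only where each conversion takes place.

```java
class MixedDemo {
    // The int 1 is converted to 1.0 when it is assigned.
    static double assignInt() {
        double x = 1;
        return x;
    }

    // The int 2 is converted to 2.0, so the division is floating-point.
    static double floatDivide() {
        double x = 1;
        return x / 2;
    }

    // Both operands are ints, so the division is done first, giving 0,
    // and only the result is converted.
    static double intDivide() {
        double x = 1 / 2;
        return x;
    }
}
```

In terms of the syntax tree, the first two cases need a CAST node on the int operand, while the third needs a CAST node above the DIVIDE node.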
10.13 Assignment
Write test programs that thoroughly test your program. Make sure that the syntax
tree contains CAST nodes only when necessary. Then put all of your source files and
the makefile in a folder named Lab 10, zip it, and drop it in the dropbox.
Laboratory 11
Key Concepts
• Floating-point arithmetic
Preliminaries
Copy all the files from your Lab 10 folder to a new folder named Lab 11. Copy the file
PrintRead.java from the Lab 11 folder on the Compiler Design CD to your Lab 11
folder.
11.1 Introduction
We begin with the most basic operations of load and store. Then we will consider
arithmetic instructions. The term load refers to moving data from somewhere else
128 LABORATORY 11. FLOATING-POINT NUMBERS AND THE FPU
(usually memory) into a register. The term store refers to moving data from a register
to somewhere else. Therefore, to dereference a floating-point variable is to load its
value into the FPU, and to assign a floating-point value to a variable is to store the
value in memory. We need to consider every case in CodeGenerator.java that is
affected by the existence of floating-point numbers.
which will push the dereferenced integer onto the stack. Notice that the two blocks
of code have the first instruction in common. Therefore, we should modify the DEREF
case so that it looks like
Modify the DEREF case in this way. Other cases will be modified in a similar way.
fstpl (%eax)
As with the DEREF case, we must use an if statement that distinguishes whether we
are storing an integer or a floating-point value. Using the DEREF case as a model,
modify the ASSIGN case so that it stores floating-point numbers.
faddp
This instruction will add st(1) and st(0), store the sum in st(1), and then pop
st(0) off the FPU stack, causing st(1) to move up to st(0). The other four oper-
ations are similarly performed by the instructions
fsubrp
fmulp
fdivrp
fchs
Note the “r” in fsubrp and fdivrp. It stands for “reverse.” Those instructions
reverse the order of the operands so that they compute a - b and a / b instead of
b - a and b / a.
The gnu assembler’s notation is different from many other assemblers. It adopted
the AT&T notation, which is different from the Intel notation. See the web page at
http://www.delorie.com/djgpp/doc/brennan/brennan_att_inline_djgpp.html
That is, the first operand is normally the destination and the second operand is the
source. The gnu assembler reverses this. So we write
Apparently for the same reason, it reverses the meanings of fsubp and fsubrp. The
manual states that fsubp subtracts st(0) from st(1), storing the result in st(1).
The machine code for fsubp is DEE9. It also states that fsubrp subtracts st(1) from
st(0), storing the result in st(1). The machine code for fsubrp is DEE1. However,
the gnu assembler assembles fsubp as DEE1 and fsubrp as DEE9, just the reverse of
what the Intel manual says.
Modify the cases PLUS, MINUS, TIMES, DIVIDE, and NEGATE to perform floating-
point operations.
Now it remains only to handle mixed-mode expressions.
add $4,%esp
To convert a double to an int, the process is similar, but in reverse. The double is
initially on the FPU stack. The command fistpl converts the contents of st(0) to
an integer and stores it at the specified location, then it pops the value from the FPU
stack. (The letter l on the end means “long.” That is the gnu way of indicating a
doubleword, which in this instance is necessary. In general, the choices are b for byte,
w for word, and l for long. It is probably a good idea to use l on all instructions, but
I dropped it to keep things looking simple.) We would like to store the integer on the
stack, so the location should be (%esp). However, since this is not a push operation,
we must provide space on the stack. That is, we must subtract 4 from esp before
moving the value.
Write the code for the CAST case, using the ideas discussed above.
11.9 Assignment
This lab will also serve as Project 5. Place all of the source files and the makefile in
a folder named Lab 11, zip it, and drop it in the dropbox.
Part VI
Functions
Laboratory 12
Key Concepts
• Function definitions
• Formal arguments
• Local variables
• Return values
• Stack frames
Read Sections 6.1 - 6.3 of the Intel Developer’s Manual, Vol. 1. Also read Sections
7.1 - 7.5 in Compilers: Principles, Techniques, and Tools.
Preliminaries
Copy all of your source files from your Lab 11 folder to a new folder named Lab 12.
Copy the file SemanticActionFunctions.java from the Lab 12 folder on the Com-
piler Design CD. It contains a number of functions and skeletons of functions that
you will need to add to SemanticAction.java.
136 LABORATORY 12. FUNCTION DEFINITIONS AND THE AST
12.1 Introduction
Compiling function calls is fairly complicated. Therefore, we will break it up into
three labs. The first lab will build the syntax tree for the function definitions. The
next lab will build the syntax tree for the function calls. The third lab will generate
the code for both the function definitions and the function calls.
The main tasks we face in handling function definitions are
• Allocating memory (the activation record) on the runtime stack for local vari-
ables.
When you put them all together, the form of a function definition is
The syntax tree for the beginning of a function definition (fbeg) has the form shown
in Figure 12.1.
The ID node contains the name of the function and other information stored in
an IdEntry object. The NUM node is an integer representing the number of bytes
required by the local variables.
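[Figure 12.1: FUNC tree. Root FUNC (mode retType) with left subtree ID (mode f.mode) and right subtree NUM (mode INT).]
[Figure 12.2: RET tree. Root RET (mode retType) with single subtree e.oper (mode e.mode).]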
As the formal parameters are processed, we must keep a running total of the
number of bytes used. This running total is used to assign an offset from the base
pointer ebp for each parameter. These will all be positive offsets as the parameters
are pushed onto the stack before the function call. This will be handled by the arg()
function.
As the local variable declarations are processed, we must keep another running
total of bytes used. This running total is also used to assign an offset from the base
pointer ebp for each local variable. These offsets will be negative since space for the
local variables is allocated on the stack after the function call has been made. This
will be handled by the dcl() function.
The syntax tree for a return statement has the form shown in Figure 12.2.
If e.mode is different from retType (the return type of the function), then the
expression e must be cast to the type retType.
Recall that a return statement does not necessarily occur at the end of a function
and that there may be more than one return statement in a function. Therefore, we
need a separate tree to handle the details at the physical end of the function. The
syntax tree for the end of a function definition is almost trivial. See Figure 12.3.
[Figure 12.3: FEND tree. Root FEND (mode retType) with single subtree ID (mode f.mode).]
The action to be taken is to reset the base pointer ebp and the stack pointer esp
to their previous values, the values that they had before the function was called.
As an example, the trees for the function definition
should be output as
FUNC PTR|PROC|INT
ID PTR|PROC|INT value = "sum"
NUM INT value = 4
ASSIGN INT
ID PTR|INT value = "c"
PLUS INT
DEREF INT
ID PTR|INT value = "a"
DEREF INT
ID PTR|INT value = "b"
RET INT
DEREF INT
ID PTR|INT value = "c"
FEND PTR|PROC|INT
ID PTR|PROC|INT value = "sum"
In the first tree, the integer 4 represents the number of bytes needed for the local
variable c.
In the file SymbolTable.java, install the keyword return, if you haven’t done that
already.
Change the name compiler_v2 to compiler_v3 throughout your files, as described
in Lab 10.
When we store a function name in the symbol table, we must store its type
as “pointer to a procedure that returns type,” where type is either int or double.
We have the values Ops.PTR and Ops.INT already defined, but we must add a new
symbolic constant Ops.PROC. Let its value be 1 << 3. Now we will be able to write
expressions like
make a call to
Be sure to pass the function value on to the nonterminal by using RESULT. The
fname() function registers the function name in the symbol table and returns an
IdEntry for the function. Therefore, the nonterminal fname must be declared to be
of IdEntry type. The second of those two productions indicates that if the return
type of a function is not declared, then it will be assumed to be int.
Next, consider the production
calls the function SemanticAction.fbeg(f). This function returns the same IdEntry
object that is passed to it, so the type of the nonterminal fbeg must be IdEntry.
The production
Both of these call the function SemanticAction.ret(), but the first one passes a NUM
tree containing the integer 0 while the second one passes the parameter e.
Let’s consider the semantic actions in the order in which they will be applied
by the compiler. Open the file SemanticActionFunctions.java. It contains a
new dcl() function and several skeleton functions. As each function is discussed
in the next section, copy and paste the function skeleton from this file to the file
SemanticAction.java, maintaining the functions in alphabetical order.
fname → type ID | ID
The second form is matched if the function is declared without specifying a return
type. The old C rule was that the default return type is int. In the second case, the
parameter
new Integer(Ops.INT)
Then we initialize some symbol table variables concerning the argument list and the
local variables. Add the statements
SymbolTable.fArgSize = 8;
SymbolTable.dclSize = 0;
SymbolTable.retType = type.intValue();
SymbolTable.enterBlock();
arg → type ID
Before installing the argument in the symbol table, we should look up the name to
see if it is already in the table at the local level. If it is, then we should print an error
message and not install it again. If it isn’t there, then we should go ahead and install
it.
Each argument must be installed in the symbol table at the local level along with
its data type, scope, and offset. The data type is given by the parameter type, the
scope is Ops.PARAM, and the offset is the current value of fArgSize.
After assigning fArgSize to id.offset, we should increment fArgSize by the
base size of this parameter in preparation for the next parameter.
Write the function farg().
id.scope = Ops.LOCAL;
SymbolTable.dclSize += TreeOps.baseSize(id.type);
id.offset = -SymbolTable.dclSize;
sets the scope to local, increments the size of the local variable block, and assigns the
negative of the current size as the offset for this variable. The reason we assign the
offset after incrementing the size rather than before, as we did in arg(), is because
we are building down from the base pointer now, whereas we were building up before.
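The two running totals can be traced with a short sketch. The class OffsetDemo and its methods are hypothetical and keep the offsets in a map instead of in IdEntry objects; the starting value 8 and the order of the updates are taken from the statements shown in this lab. The value 8 presumably allows for the return address and the saved base pointer, 4 bytes each, which lie between ebp and the first parameter.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class OffsetDemo {
    static int fArgSize = 8;  // offset of the first parameter
    static int dclSize = 0;   // bytes of local variables so far
    static final Map<String, Integer> offsets = new LinkedHashMap<>();

    // Parameter: assign the current total, then advance it.
    static void param(String name, int size) {
        offsets.put(name, fArgSize);
        fArgSize += size;
    }

    // Local variable: advance the total first, then assign its negative.
    static void local(String name, int size) {
        dclSize += size;
        offsets.put(name, -dclSize);
    }

    public static void main(String[] args) {
        param("a", 4);
        param("b", 4);
        local("c", 4);
        System.out.println(offsets);
    }
}
```

For a function with int parameters a and b and one int local variable c, the offsets are 8, 12, and -4, and dclSize ends at 4, which is the number that appears in the NUM node of the FUNC tree.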
Copy dcl() and paste it in SemanticAction.java in place of the old dcl()
function.
which means that the parameters (fargs) and the local variables (dcls) have been
processed. Thus, we are ready to print the FUNC syntax tree shown in Figure 12.1.
This is a very straightforward exercise using TreeNode constructors. The tree should
be constructed and printed and the parameter id should be returned.
Write the fbeg() function.
parameter e is a reference to the returned object. Call on the cast() function, which
will create a CAST node, if necessary.
Now build the RET tree, as shown in the diagram earlier in the lab, and print it.
Since the function fbeg() returned the IdEntry for the function name, we have that
parameter available now.
This function should build the FEND tree, as in Figure 12.3, display the tree, then
set the return type retType to 0 and call the leaveBlock() function.
Write the func() function.
12.11 Assignment
Put the source files and the makefile in a folder named Lab 12, zip it, and drop it in
the dropbox.
Laboratory 13
Key Concepts
• Function calls
• Actual parameters
• Strings
Preliminaries
Copy all of your source files from your Lab 12 folder to a new folder named Lab 13.
13.1 Introduction
In this lab, we will continue to build the syntax tree for functions, but this time we
will build the tree for the function call. Together with Lab 12, this will complete the
tree-building part. In the next lab, we will do the code generation for function calls,
which will result in Version 3 of our compiler.
146 LABORATORY 13. FUNCTION CALLS AND THE AST
[Figure 13.1: CALL tree for copy(b). Root CALL (mode INT) with left subtree ID copy and right subtree DEREF (mode INT) over ID b.]
int copy(int a)
{
return a;
}
The tree for the function call copy(b) would be as shown in Figure 13.1 and the
compiler would display it as
CALL INT
then the tree for the function call sum(10, 20, 30) would be the tree in Figure 13.2,
and the compiler would print
CALL INT
ID PTR|PROC|INT value = "sum"
LIST
LIST
NUM INT value = 10
NUM INT value = 20
NUM INT value = 30
Note that the arguments appear in order from bottom to top. That will be important
when the code generator pushes them onto the runtime stack, since they must be
pushed in order from right to left.
In our compiler we will pass arguments by value only. Thus, each expression in
the actual argument list must be evaluated before its value can be passed. Also,
our compiler will not check that the type of the argument matches the type of the
parameter. If they do not match, then the compiled program will not run correctly
and that will not be the fault of the compiler.
[Figure 13.2: CALL tree for sum(10, 20, 30). Root CALL (mode INT) with left subtree ID sum and right subtree LIST; that LIST has left subtree LIST (over NUM 10 and NUM 20) and right subtree NUM 30.]
The function exprs() receives two parameters e1 and e2. The parameter e1
represents the list of expressions so far and e2 represents the latest expression to be
added to the list. A new LIST tree is to be created with the form shown in Figure
13.3.
Note that e1 is on the left. Thus, the rightmost argument e2 will appear highest
in the tree.
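The shape of the LIST tree, and one way in which a traversal could recover the arguments from right to left, can be sketched as follows. The classes ArgNode and ArgListDemo are hypothetical miniatures; the lab's TreeNode constructors and the traversal used by the code generator may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature node; not the lab's TreeNode class.
class ArgNode {
    final String label;
    final ArgNode left, right;

    ArgNode(String label, ArgNode left, ArgNode right) {
        this.label = label;
        this.left = left;
        this.right = right;
    }
}

class ArgListDemo {
    static ArgNode num(int value) {
        return new ArgNode("NUM " + value, null, null);
    }

    // Same shape as the LIST tree: list so far on the left, newest on the right.
    static ArgNode exprs(ArgNode e1, ArgNode e2) {
        return new ArgNode("LIST", e1, e2);
    }

    // Visit the right subtree before the left to obtain right-to-left order.
    static List<String> pushOrder(ArgNode n) {
        List<String> out = new ArrayList<>();
        collect(n, out);
        return out;
    }

    private static void collect(ArgNode n, List<String> out) {
        if (n.label.equals("LIST")) {
            collect(n.right, out);
            collect(n.left, out);
        } else {
            out.add(n.label);
        }
    }
}
```

For the call sum(10, 20, 30), the tree is exprs(exprs(num(10), num(20)), num(30)), and visiting the right subtree first yields NUM 30, NUM 20, NUM 10.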
Write the code for the exprs() function.
The action for the production
exprs → expr
[Figure 13.3: LIST tree. Root LIST with left subtree e1 and right subtree e2.]
The call() function receives two parameters. The first parameter s is the name of
the function. The second parameter e is a LIST TreeNode at the root of a tree of
expressions or it is a single expression (when there is only one parameter), or it is
null (when there is no parameter).
The call() function should first look up the name of the function in the symbol
table at the global level. If the returned IdEntry id is null, then call() should
create an entry with type
Ops.PTR | Ops.PROC | Ops.INT
and scope Ops.GLOBAL. That is, the default return type is int.
Then it should create and return a tree of the form in Figure 13.4.
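The lookup-then-build logic of call() might be sketched as follows. Everything here is a stand-in: the map globals replaces the real symbol table, the nested classes IdEntry and Node replace the lab's IdEntry and TreeNode, and the numeric values of the constants are assumed. The choice to give the CALL node the return type (the identifier's type with PTR and PROC stripped) is also an assumption, made to agree with the "CALL INT" printout shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the lab's Ops, IdEntry, TreeNode, and symbol table.
class CallTreeSketch {
    // Assumed bit values; the text only says the flags are OR-ed together.
    static final int INT = 1 << 0, PROC = 1 << 4, PTR = 1 << 5;
    static final int GLOBAL = 0;            // assumed scope code
    static final int CALL = 100, ID = 101;  // assumed operator codes

    static class IdEntry {
        final String name; final int type; final int scope;
        IdEntry(String name, int type, int scope) {
            this.name = name; this.type = type; this.scope = scope;
        }
    }

    static class Node {
        final int oper; final int mode; final IdEntry id; final Node left, right;
        Node(int oper, int mode, IdEntry id, Node left, Node right) {
            this.oper = oper; this.mode = mode; this.id = id;
            this.left = left; this.right = right;
        }
    }

    // Global level of the symbol table, reduced to a map for this sketch.
    static final Map<String, IdEntry> globals = new HashMap<>();

    static Node call(String s, Node e) {
        IdEntry id = globals.get(s);
        if (id == null) {
            // Undeclared function: default to "function returning int".
            id = new IdEntry(s, PTR | PROC | INT, GLOBAL);
            globals.put(s, id);
        }
        Node idNode = new Node(ID, id.type, id, null, null);
        // Assumption: the CALL node carries the return type only.
        int returnType = id.type & ~(PTR | PROC);
        return new Node(CALL, returnType, null, idNode, e);
    }
}
```

A second call to the same undeclared function finds the entry created by the first, so the default type is installed only once.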
Now that we can call functions, we can call functions in the C library, such as
sqrt(), cos(), and printf(). This creates a new need: the need for strings, since
the first parameter of the printf() function is a format string. We have postponed
strings for as long as possible. Now we will deal with them.
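[Figure 13.4: a CALL tree with the identifier node id as its left subtree and e as its right subtree, annotated with id.mode]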
returns a string, the quotation marks are part of the string. We should keep that
in mind.) Thus, we must instruct the parser to take the appropriate action in the
production
expr → STR
It will call on the str() function in the SemanticAction class. The str() function
should create and return a string TreeNode. To keep our compiler well organized, we
should create a TreeNode constructor for strings that is analogous to the TreeNode
constructors for identifiers and numbers that we created in earlier labs.
The body of this TreeNode constructor should be
oper = Ops.STR;
mode = Ops.PTR | Ops.CHAR;
str = s;
Note the appearance of two new constants, Ops.CHAR and Ops.STR. Add Ops.CHAR
to the Ops class as a data type, in the group with Ops.INT. You might give it the
value 1 << 2. Add Ops.STR as a new type of tree node, in the group with Ops.CALL.
Also, add a corresponding line in the opInfo[] array in Utility.java. We will not
actually have char objects in our programs, but the string type in C is officially a
pointer to char, so we need them for that purpose.
While you are in Utility.java, add a case to the function printType() that
handles Ops.CHAR. Note also that Ops.STR is a type of tree node, while the data
type is a pointer to a character. We must be careful to distinguish between the data
type and the tree type. Also, in the function printNode(), we have cases that print
identifier and number leaf nodes. Add a case that will print a string leaf node.
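To see how the data type Ops.PTR | Ops.CHAR is assembled from bit flags and then decoded for printing, consider the sketch below. Only the value 1 << 2 for CHAR is suggested by the text; the other values, the class name TypeFlagsSketch, and the method typeName() are assumptions made for illustration and may not match your Ops and Utility classes.

```java
// Hypothetical illustration of type flags; not the lab's Ops or Utility class.
class TypeFlagsSketch {
    // Assumed values, except CHAR = 1 << 2, which the text suggests.
    static final int INT = 1 << 0, DOUBLE = 1 << 1, CHAR = 1 << 2;
    static final int PROC = 1 << 4, PTR = 1 << 5;

    // Decode a mode into text such as "PTR|CHAR" by testing each flag.
    static String typeName(int mode) {
        StringBuilder sb = new StringBuilder();
        if ((mode & PTR) != 0) sb.append("PTR|");
        if ((mode & PROC) != 0) sb.append("PROC|");
        if ((mode & INT) != 0) sb.append("INT");
        else if ((mode & DOUBLE) != 0) sb.append("DOUBLE");
        else if ((mode & CHAR) != 0) sb.append("CHAR");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(typeName(PTR | CHAR));        // the data type of a string
        System.out.println(typeName(PTR | PROC | INT));  // a function returning int
    }
}
```

Note that the node type (Ops.STR) plays no part here; this sketch deals only with the data type.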
Create the TreeNode constructor and then have str() call on it, passing it the
string.
y = sqr(sqr(x));
13.7 Assignment
Put the source files and the makefile in a folder named Lab 13, zip it, and drop it in
the dropbox.
Laboratory 14
Functions and Code Generation
Key Concepts
• Pushing parameters
• Calling functions
• Clearing parameters
• String objects
Preliminaries
Copy all of your source files from your Lab 13 folder to a new folder named Lab 14.
14.1 Introduction
To implement function calls, we will need to create six new cases in CodeGenerator.java
to deal with new types of nodes in the syntax tree. The new cases are
Ops.CALL
Ops.FEND
Ops.FUNC
Ops.LIST
Ops.RET
Ops.STR
We will start with the FUNC case, which is also the simplest case.
FUNC
Write the Java code that will output the assembly code for a FUNC tree.
• Move the base pointer to the stack pointer, thereby restoring the old stack
pointer. (Recall that we saved the old stack pointer as the new base pointer.)
• Pop the value that is on the stack to the base pointer, thereby restoring the old
base pointer. (Recall that we pushed the old base pointer onto the stack.)
In that last step, we really should return 0 since the function has a return type and is
expecting a value. However, for simplicity we will not do that (unless you want to).
That means that it is the programmer’s responsibility to return a value whenever the
program expects a value, or else the program may crash. A value will not be returned
automatically.
That is all there is in the FEND case. This code appeared in the CodeGenerator
function finish(), which you should now remove. It was there to finish the main()
function. Also, remove the call to finish() found in compiler_v3.java.
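The FEND steps listed above might be emitted along the following lines. This is a sketch under assumptions: the class FendSketch and the method fend() are hypothetical, the code is returned as a string rather than written to the compiler's output, and the closing ret instruction is our reading of "that last step", which the list above does not show.

```java
// Hypothetical emitter for the function epilogue; not the lab's CodeGenerator.
class FendSketch {
    static String fend() {
        StringBuilder out = new StringBuilder();
        out.append("    mov  %ebp,%esp    # Restore the old stack pointer\n");
        out.append("    pop  %ebp         # Restore the old base pointer\n");
        out.append("    ret               # Assumed final step: return to caller\n");
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.print(fend());
    }
}
```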
fills in the value 0. We must cast the return value to the return type. Then we must
check whether it is an int value or a double value to see how to return it. If it
represents a double value, then there is nothing to do. The double value is already
in register st(0) of the FPU, which is where it is expected to be upon return from
the function. On the other hand, if it represents an int value, then we need to pop
that value from the stack and place it in register eax, which is where it is expected
to be.
Write the code that will do this.
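The int-versus-double decision might look like the sketch below. The flag values and the names RetSketch and returnValue() are assumptions, and only the part described above is shown (the cast to the return type is omitted).

```java
// Hypothetical sketch of how the RET case places the return value.
class RetSketch {
    static final int INT = 1 << 0, DOUBLE = 1 << 1;  // assumed flag values

    static String returnValue(int mode) {
        if ((mode & DOUBLE) != 0) {
            return "";  // Nothing to do: the value is already in st(0)
        }
        return "    pop  %eax    # Move the int return value into eax\n";
    }
}
```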
Why not return an integer value by pushing it onto the stack and letting the
calling function pop it off the stack? There is a good reason why we should not do
that. What is it?
As we process each parameter, we need to keep track of the size of the parameter
block. That is so that we can generate the instruction that will clear the parameter
block upon returning from the called function. The complication is that the parameter
list may itself contain calls to other functions (which may contain calls to functions,
etc.). Thus, we may have to interrupt our count of the parameter block size for this
function while we initiate a similar count for another function.
This calls for a stack. As a static object in the CodeGenerator class, create a
Stack named paramStack. Use the Java Stack class. You may look up the details at
http://java.sun.com/j2se/1.4.2/docs/api/
The Stack class is in the package java.util. Therefore, you must include the statement
import java.util.*;
see if the result is floating-point. If it is, then move the value from st(0) onto the
runtime stack. Then add the parameter size to paramBlockSize.
After doing the right subtree, consider the left subtree. If it is itself a LIST tree,
then it can be handled recursively; just call on generateNodeCode() and consider it
done. On the other hand, if you are at the bottom of the tree, then the left subtree
is also an expression tree, so it must be evaluated and possibly moved from st(0) to
the runtime stack. The pattern here should be exactly the same as the pattern for
the right subtree. Be sure to add the parameter size to paramBlockSize.
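The right-then-left traversal can be sketched as follows. To keep it short, the sketch assumes that every argument is an int constant occupying 4 bytes and pushes it directly; in the compiler each subtree would instead be passed to generateNodeCode(), and floating-point results would be moved from st(0) to the stack. The names ListOrderSketch, Node, pushArg(), and genList() are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the LIST case; handles int constants only.
class ListOrderSketch {
    static final int LIST = 1, NUM = 2;  // assumed operator codes

    static class Node {
        final int oper; final int num; final Node left, right;
        Node(int oper, int num, Node left, Node right) {
            this.oper = oper; this.num = num; this.left = left; this.right = right;
        }
    }

    static int paramBlockSize = 0;
    static final List<String> code = new ArrayList<>();

    // Stand-in for evaluating one argument expression and counting its size.
    static void pushArg(Node expr) {
        code.add("push $" + expr.num);
        paramBlockSize += 4;  // assumed size of an int
    }

    static void genList(Node t) {
        pushArg(t.right);                   // right subtree first
        if (t.left.oper == LIST) {
            genList(t.left);                // still inside the list: recurse
        } else {
            pushArg(t.left);                // bottom of the tree: an expression
        }
    }
}
```

For the argument list of sum(10, 20, 30), this pushes 30, then 20, then 10, which is the right-to-left order required.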
call _fname
where fname is the name of the function, as stored in the symbol table. It can be
retrieved from the left subtree, which consists of a single identifier node.
Once the call is made and execution returns, there are two more things to be done.
We must clear the parameters off the stack and then, if the return value is an int,
we must push it onto the stack, where the next instruction expects to find it.
To clear the parameters, we need a statement of the form
add $n,%esp
where n is the size of the parameter block. Output this statement, using the value of
paramBlockSize. Then you must restore paramBlockSize by popping the previous
value off paramStack and assigning it to paramBlockSize.
Finally, check the type of the returned value. If it is int, then the value is currently
in eax. If so, then we need to push it onto the runtime stack. On the other hand, if
it is double, then it is already in st(0), which is where it should be.
Once you write all of that, the CALL case should be done.
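The interplay between paramStack and paramBlockSize is easiest to see with a nested call such as f(g(1), 2). The sketch below models a call as a name plus arguments, each of which is an int constant or another call; the classes CallCodeSketch and Call are hypothetical, all values are assumed to be 4-byte ints, and the code is collected in a StringBuilder rather than written out.

```java
import java.util.Stack;

// Hypothetical model of the CALL case; every value is assumed to be an int.
class CallCodeSketch {
    static class Call {
        final String name;
        final Object[] args;   // each argument is an Integer or a nested Call
        Call(String name, Object... args) { this.name = name; this.args = args; }
    }

    static final Stack<Integer> paramStack = new Stack<>();
    static int paramBlockSize = 0;
    static final StringBuilder out = new StringBuilder();

    static void genCall(Call c) {
        paramStack.push(paramBlockSize);   // interrupt the enclosing count
        paramBlockSize = 0;
        for (int i = c.args.length - 1; i >= 0; i--) {   // right to left
            if (c.args[i] instanceof Call) {
                genCall((Call) c.args[i]); // leaves its int result on the stack
            } else {
                out.append("    push $").append(c.args[i]).append("\n");
            }
            paramBlockSize += 4;
        }
        out.append("    call _").append(c.name).append("\n");
        out.append("    add  $").append(paramBlockSize).append(",%esp\n");
        paramBlockSize = paramStack.pop(); // resume the enclosing count
        out.append("    push %eax\n");     // int return value back on the stack
    }

    public static void main(String[] args) {
        genCall(new Call("f", new Call("g", 1), 2));
        System.out.print(out);
    }
}
```

While g's single parameter is being counted, f's partial count of 4 waits on paramStack; after the call to g it is restored and grows to 8.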
There are two more details to deal with concerning the FUNC, FEND, LIST, and CALL
cases. The FUNC, FEND, and CALL trees all contain ID nodes on the left side. However,
in none of these cases should the ID tree be processed in the usual way. Since the
syntax tree is normally traversed post-order, the ID node would be processed before
we knew that it was part of the FUNC, FEND, or CALL tree. We faced a similar problem
with ALLOC trees earlier. We will solve this problem in a similar way. At the beginning
of the generateCode() function, make FUNC and FEND special cases along with ALLOC.
The action should be the same. In the traverseTree() function, make CALL a special
case. If the tree is a CALL tree, then we should skip processing the left and right
subtrees; they will be handled correctly in the CALL case of generateNodeCode().
The reason that we handle CALL differently from FUNC and FEND is that a call statement
with its attendant parameter list may appear within a statement, while FUNC and FEND
trees cannot occur within statements.
The LIST case must be handled differently because we traverse the LIST subtrees
from right to left. (Otherwise, the parameters will be pushed in the reverse order.)
The traverseTree() function processes them from left to right. Thus, the LIST case
must be handled similarly to the CALL case in traverseTree(). That is, skip over
the recursive calls. In the LIST case of generateNodeCode(), we traverse
the right subtree first, then the left subtree.
n(%ebp)
where n is the offset for that identifier. Write the code for the ID case.
.data
L0n: .asciz "string"
.text
lea L0n,%eax # Load addr of string
push %eax # Push addr of string
where n is the current value of jmpLabel and "string" is the string. The variable
jmpLabel is a static member of the CodeGenerator class that is used to create unique
labels. Whenever it is used, it should be incremented before use to give the next label
number. The leading 0 is used to distinguish it from other labels that we will use
later. If you look at the code in the READ and PRINT cases, you will see how jmpLabel
was used there. The assembler treats L0n as a symbolic name whose value is the
address at which it occurs in the program. The string "string" is stored by the
assembler in the “data” area of the assembled program. That is what the directive
.data does. The .text directive sends us back to the code area.
Write the code for the STR case.
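One possible shape for the STR case is sketched below. The class StrSketch and the method str() are hypothetical, and the code is returned as a string rather than written to the output file. The string is assumed to arrive with its quotation marks already included, as noted in the previous lab.

```java
// Hypothetical emitter for a string literal; not the lab's CodeGenerator.
class StrSketch {
    static int jmpLabel = 0;

    // quoted is the string as returned by the lexer, quotation marks included.
    static String str(String quoted) {
        jmpLabel++;                       // increment before use
        String label = "L0" + jmpLabel;   // leading 0 marks this family of labels
        return "    .data\n"
             + label + ": .asciz " + quoted + "\n"
             + "    .text\n"
             + "    lea  " + label + ",%eax    # Load addr of string\n"
             + "    push %eax          # Push addr of string\n";
    }

    public static void main(String[] args) {
        System.out.print(str("\"hello\""));
    }
}
```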
14.11 Assignment
This lab will serve as Project 6. Put all of your source files and the makefile in a
folder named Project 6, zip it, and drop it in the dropbox.
Part VII
Laboratory 15
Control Flow and the AST
Key Concepts
• if statements
• Backpatching
• Conditional expressions
• Linked lists
Read Sections 8.4 and 8.6 in Compilers: Principles, Techniques, and Tools.
Preliminaries
Copy all the source files from your Lab 14 folder to a new folder named Lab 15.
15.1 Introduction
At long last, we will incorporate decision structures into our compiler. The two basic
decision structures that we will implement are one-way if statements
if (condition)
    stmt
and two-way if statements
if (condition)
    stmt1
else
    stmt2
It turns out that the technique, called backpatching, that we use to do this will also
allow us to implement while loops and for loops very easily.
15.2 Version 4
This will be version 4 of our compiler, so change the name to compiler_v4 throughout
your files. As before, use grep to find out where the name compiler_v3 occurs and
then change it to compiler_v4. Also, be sure that the keywords if and else have
been installed in the symbol table.
LoopBegin:
:
jmp LoopBegin
LoopEnd:
to create a loop.
With most statements, there is a single destination, which we will call the “next”
destination, to which execution goes once that statement has been executed. However,
in the case of conditional expressions, there are two destinations: a “true” destination
and a “false” destination.
The nonterminal cexpr is a conditional expression. For the time being, it will be a
numerical expression with the rule that zero is interpreted as false and nonzero is
interpreted as true. That is the standard rule in C.
In the first production,
the label produced by m serves as a destination for the preceding statement. In the
second production,
the label produced by m serves as the “true” destination for cexpr, i.e., the destination
when cexpr evaluates to true. In the third production,
the label produced by m1 serves as the “true” destination for cexpr and the label
produced by m2 serves as the “false” destination for cexpr. The jump produced by
n will jump over stmt2 .
15.5 Backpatching
Backpatching is a technique of creating a temporary label (a backpatch label) as
the label of a destination that has yet to be determined. Once the destination is
determined, the backpatch label is resolved with an actual label. (Backpatch labels
are not used as actual labels.) Backpatch labels are named B1, B2, B3, ... and actual
labels are named L1, L2, L3, ..., except that, to avoid confusion, we will not use the
same number for both a backpatch label and an actual label.
In the case of statements matching the production
the “next” destination of stmts1 will be the label produced by m. But in the case of
if statements, we will need to use a backpatch node to store a pair of destinations
for the conditional expression.
The destination of the jump statement produced by n must be the same as the “next”
destination of stmt1 , which is also the “next” destination of stmt2 . Thus, all three
backpatch labels will be collected into a list and later resolved to the same destination.
Create the two BackpatchNode constructors now. The first should be the default
constructor. It should set trueList and falseList to null. The second should have
two parameters, tList and fList, that are LinkedLists. The parameter tList
should be assigned to trueList and the parameter fList should be assigned to
falseList. We will use these constructors shortly.
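A minimal sketch of the class with its two constructors follows. It uses generic type parameters, which are newer than the 1.4.2 API cited earlier in this manual; with an older compiler, drop the <Integer> parameters. Whether the fields should be private is left to you.

```java
import java.util.LinkedList;

// Sketch of BackpatchNode: a pair of backpatch-label lists.
class BackpatchNode {
    LinkedList<Integer> trueList;    // labels to resolve to the "true" destination
    LinkedList<Integer> falseList;   // labels to resolve to the "false" destination

    // Default constructor: no destinations yet.
    BackpatchNode() {
        trueList = null;
        falseList = null;
    }

    // Constructor from two existing lists.
    BackpatchNode(LinkedList<Integer> tList, LinkedList<Integer> fList) {
        trueList = tList;
        falseList = fList;
    }
}
```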
LABEL label=n
and
BLABEL blabel=n
We should write these two first since other functions will use them. In the TreeNode
constructor, the parameter op is either Ops.LABEL or Ops.BLABEL. Add the constants
Ops.LABEL and Ops.BLABEL to the files Ops.java and Utility.java in the usual
way. The sole purpose of this constructor is to create and return a LABEL or BLABEL
node. The first parameter should be assigned to oper and the second should be
assigned to num.
The function newLabel() returns a new integer to be used for the next label.
It should increment a SemanticAction class variable labelNum and return its new
value. Write these two functions.
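The label constructor and newLabel() might be sketched like this. The class LabelSketch, the nested Node class, and the operator codes are stand-ins for SemanticAction, TreeNode, and the Ops constants; the toString() method is added only so that the result can be compared with the printouts shown above.

```java
// Hypothetical stand-ins for the label TreeNode constructor and newLabel().
class LabelSketch {
    static final int LABEL = 200, BLABEL = 201;  // assumed operator codes
    static int labelNum = 0;                     // class variable, as in the text

    static class Node {
        final int oper;
        final int num;
        Node(int op, int n) {   // op is LABEL or BLABEL; n is the label number
            oper = op;
            num = n;
        }
        @Override
        public String toString() {
            return (oper == LABEL ? "LABEL label=" : "BLABEL blabel=") + num;
        }
    }

    // Increment labelNum and return its new value.
    static int newLabel() {
        return ++labelNum;
    }
}
```

Since labels and backpatch labels draw their numbers from the same counter, no number is ever used for both kinds.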
The function n() must first create a BLABEL tree. That is done in the same way
as you created a LABEL tree. Then create a JUMP tree, attach the BLABEL tree as its
left subtree, and print the tree. Finally, create a LinkedList containing the integer
that was used in the backpatch label and return that LinkedList. Therefore, the
prototype of n() is
The function merge() will take two linked lists b1 and b2 and merge them into
one list. The merged list will replace the old b1 and it will be returned.
The function backpatch() will resolve all the backpatch labels in the list b with
the destination label labl. It will construct and print an EQU tree (equate tree) for
each backpatch label in the list. (An EQU tree equates a backpatch label with an
actual label.)
Now let us write each of the three backpatching functions, beginning with
makeList(). You should go to the Java web site for the LinkedList class to see what
member functions are available.
The function makeList() should begin by creating a new LinkedList object.
Then it should add the label labl to the list and return the list. See the Java
LinkedList web page for details on the member functions.
The function merge() should merge the lists b1 and b2 and return the merged
list. Look at the LinkedList web page and figure out which LinkedList function(s)
will do this.
The function backpatch() is more substantial than the others, but still pretty
simple, thanks to the LinkedList class. First, it must create a LABEL TreeNode for
the label labl. Call it labTree. Then, for each Integer stored in the list b, it
must create a BLABEL TreeNode and then attach it and the LABEL TreeNode already
created as the left and right subtrees of a new EQU tree. (You will need to define
the symbol Ops.EQU in Ops.java and update the opInfo[] array in Utility.java.)
For example, if b is {3, 4, 6} and labl is 8, then the EQU trees in Figure 15.1 will be
created.
To create each BLABEL TreeNode, you will have to get the next backpatch label
out of the linked list and use intValue() to get its int value. Call it blabl. Then
the statement
will join them together in an EQU tree. The backpatch() function should print each
EQU tree as it is produced. Then it is finished.
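The three list functions might be sketched as follows. To stay self-contained, this version of backpatch() does not build or print EQU trees; it returns one line of text per pair in the form Bn=Lm, which is an assumption borrowed from the equate statements generated in the next lab. The class name BackpatchSketch is hypothetical, and addAll() is one LinkedList function that will do the merging.

```java
import java.util.LinkedList;

// Hypothetical sketch of makeList(), merge(), and backpatch().
class BackpatchSketch {
    // Create a one-element list holding the backpatch label labl.
    static LinkedList<Integer> makeList(int labl) {
        LinkedList<Integer> list = new LinkedList<>();
        list.add(labl);
        return list;
    }

    // Append b2 to b1; the merged list replaces b1 and is returned.
    static LinkedList<Integer> merge(LinkedList<Integer> b1, LinkedList<Integer> b2) {
        b1.addAll(b2);
        return b1;
    }

    // Equate every backpatch label in b with the actual label labl.
    static LinkedList<String> backpatch(LinkedList<Integer> b, int labl) {
        LinkedList<String> equs = new LinkedList<>();
        for (Integer next : b) {
            int blabl = next.intValue();
            equs.add("B" + blabl + "=L" + labl);
        }
        return equs;
    }

    public static void main(String[] args) {
        LinkedList<Integer> b = merge(merge(makeList(3), makeList(4)), makeList(6));
        System.out.println(backpatch(b, 8));
    }
}
```

With b = {3, 4, 6} and labl = 8, the three pairs correspond to the three EQU trees of Figure 15.1.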
[Figure 15.1: three EQU trees, each with a BLABEL node (3, 4, and 6) as its left subtree and the LABEL node 8 as its right subtree]
import java.util.*;
at the beginning of the CUP file. This statement must also be included in the
SemanticAction.java file.
Similarly, the nonterminal m will be an Integer and the nonterminal cexpr will
be a BackpatchNode.
First, add semantic actions to the productions
m → ε
n → ε
The actions are simply calls to the SemanticAction functions m() and n().
Also, in several productions we need to add variable names to some of the symbols.
These productions should now be
The variables s, s1, s2, and n1 will be LinkedLists and the variables m1 and m2 will
be Integers.
Some of these productions do not currently have semantic actions associated with
them. The details of these actions will be relegated to functions in the SemanticAction
class, so right now all we need to do is to add the semantic action function calls.
In the production
we have already been using the action func(f), but now we must add two more
parameters. The new action is
SemanticAction.func(f, s, m1);
The production
The production
stmts ::=
That is because an empty statement should have an empty list of backpatch labels.
While we are on the subject of empty linked lists, certain other stmt productions
that previously returned null now must return an empty LinkedList, as in the above
example. Make that change wherever necessary. In some cases that change will show
up in the SemanticAction function rather than in the CUP file. In general it will
show up at any point where we previously returned null, or nothing, for one of the
nonterminals stmts, stmt, or n.
For the productions
and
stmt ::= IF LPAREN cexpr:c RPAREN m:m1 stmt:s1 n:n1 ELSE m:m2 stmt:s2
and
respectively.
For the production
RESULT = s;
RESULT = SemanticAction.exprToCExpr(e);
We will write the semantic action functions one by one as we consider the different
types of statements.
[Figure 15.2: branching diagram for a sequence of statements, showing the “next” destination of stmts going to the marker m]
Draw a diagram that shows how the “branching” goes. See Figure 15.2.
This diagram indicates that the linked list from stmts1 should be backpatched
to the label m, and that the production should return the linked list from stmt (as
RESULT), to be resolved at some higher level in the parse tree. Thus, the stmts()
function should be
For all the hoopla, that was awfully simple. Once we finish with conditional expressions,
it will be just about as simple to deal with if statements. That is the power
of organization.
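As a hedged sketch of what the diagram asks for: the list from stmts1 is backpatched to m and the list from stmt is returned. The class StmtsSketch, the helper list(), and the recording version of backpatch() below are hypothetical stand-ins, so compare this with your own SemanticAction code rather than copying it.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of the semantic action for a statement sequence.
class StmtsSketch {
    static final List<String> equs = new ArrayList<>();

    // Helper for building small lists in examples.
    static LinkedList<Integer> list(int... labels) {
        LinkedList<Integer> l = new LinkedList<>();
        for (int x : labels) l.add(x);
        return l;
    }

    // Stand-in for backpatch(): records the equates instead of printing trees.
    static void backpatch(LinkedList<Integer> b, int labl) {
        for (Integer blabl : b) equs.add("B" + blabl + "=L" + labl);
    }

    // s1: list from stmts1; m: label of the marker; s: list from stmt.
    static LinkedList<Integer> stmts(LinkedList<Integer> s1, Integer m, LinkedList<Integer> s) {
        backpatch(s1, m.intValue());  // stmts1 continues at m
        return s;                     // stmt's list is resolved higher up
    }
}
```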
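[Figure 15.3: general form of a comparison tree: a CMPNE node of type mode whose subtrees are e1 (of type e1.mode) and e2 (of type e2.mode)]
[Figure: the CMPNE tree built for a numerical expression e: a CMPNE node of type e.mode whose subtrees are NUM 0 and e (of type e.mode)]
IF (cexpr) m stmt
is logically equivalent to
IF (expr != 0) m stmt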
The “true” destination is m and the false destination is whatever label follows the if
statement.
Altogether we will have six different kinds of comparison nodes: CMPEQ, CMPNE,
CMPLT, CMPGT, CMPLE, and CMPGE. They correspond to the C operators ==, !=, <, >,
<=, and >=. Right now we need only the CMPNE, because the expression counts as true
if it does not equal 0. The general form of a comparison tree is shown in Figure 15.3.
Our grammar contains the production
cexpr → expr
in its left subtree. If the boolean value is true, it takes the jump. Otherwise, it
continues on to the next instruction. The next instruction should be a JUMP tree that
unconditionally jumps to the “false” destination. (Ops.JUMPT and Ops.JUMP must
also be added to Ops.java and Utility.java.)
The phrase
if (a)
would create the following combined JUMPT and CMPNE tree and the JUMP tree below
it.
JUMPT INT
BLABEL blabel=3
CMPNE INT
NUM INT value=0
DEREF INT
ID PTR|INT value="a"
JUMP INT
BLABEL blabel=4
[Figure: backpatching diagram for IF ( cexpr ) m stmt, showing the “true” and “next” destinations]
Next, generate a new label labl2, use it to create a BLABEL node, and then
construct a JUMP tree with the BLABEL node attached on its left. This is the “false”
destination. Print this tree.
The final step is to create and return a BackpatchNode containing the “true” and
“false” destinations of cexpr. Use a BackpatchNode constructor, where labl1 is the
“true” destination and labl2 is the “false” destination.
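The label bookkeeping in exprToCExpr() might be sketched as follows. The expression tree is reduced to a string and the JUMPT and JUMP trees are reduced to one line of text each, so the class CExprSketch, its nested BackpatchNode, and the text format are all assumptions; only the order of the steps and the roles of labl1 and labl2 come from the description above. Because the BackpatchNode constructor takes lists, each label is first wrapped with makeList().

```java
import java.util.LinkedList;

// Hypothetical sketch of exprToCExpr(); trees are reduced to lines of text.
class CExprSketch {
    static int labelNum = 0;
    static final StringBuilder printed = new StringBuilder();

    static class BackpatchNode {
        final LinkedList<Integer> trueList, falseList;
        BackpatchNode(LinkedList<Integer> tList, LinkedList<Integer> fList) {
            trueList = tList;
            falseList = fList;
        }
    }

    static int newLabel() { return ++labelNum; }

    static LinkedList<Integer> makeList(int labl) {
        LinkedList<Integer> list = new LinkedList<>();
        list.add(labl);
        return list;
    }

    // exprText stands in for the expression tree e.
    static BackpatchNode exprToCExpr(String exprText) {
        int labl1 = newLabel();   // "true" destination
        printed.append("JUMPT B" + labl1 + " on CMPNE(0, " + exprText + ")\n");
        int labl2 = newLabel();   // "false" destination
        printed.append("JUMP B" + labl2 + "\n");
        return new BackpatchNode(makeList(labl1), makeList(labl2));
    }
}
```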
The hard work is over. Now we can write the functions that handle the one-way
and two-way if statements.
[Figure: backpatching diagrams showing the “true”, “false”, and “next” destinations, including one for fbeg stmts m }]
if (a)
d = 100;
else if (b)
d = 200;
else if (c)
d = 300;
else
d = 400;
Now that we have if statements, we can write test programs with recursive function
calls. You might try a recursive version of the gcd() function.
The function ought to return void, but we have no void type, so we will make the
return type int and return a 0. The parameter num is the number of disks to be
moved. As long as it is more than 1, then we will make a recursive call. If it is 1,
then we will just move the disk. The parameters src and dst are the source and
destination posts. (The posts are numbered 1, 2, and 3, with the disks originally on
post 1 with the goal of moving them to post 3.) The parameter extra is the remaining
post. You will find this function in the file Hanoi.c.
Have the main function read the number of disks from the user, using our special
read statement. Call that number n. Then the initial function call to Hanoi() from
main() should be
Hanoi(n, 1, 3, 2);
15.20 Assignment
Write the production for a while loop. The form is similar to the form of the one-way
if statement, except that at the bottom of the loop, there is an unconditional branch
back to the conditional expression. Test your program with if statements and while
loops nested in various ways.
Place all of the source files, including a makefile, in a folder named Lab 15, zip it,
and drop it in the dropbox.
Laboratory 16
Control Flow Code Generation
Key Concepts
• Labels
• Equate statements
• Unconditional jumps
• Conditional jumps
Read Sections 3.4.3 and 8.1.2 in Intel’s Developer’s Manual, Vol. 1. These sections
discuss flags and condition codes. In Vol. 2, read about the FUCOMPP instruction
and the Jcc and JMP instructions. These instructions are used in conditional and
unconditional branches.
Preliminaries
Copy your source files from your Lab 15 folder to a new folder called Lab 16.
16.1 Introduction
Most of the code generated in this lab is straightforward. The one feature that is
new to us is testing the condition codes for equality or inequality. You should read
the Intel manuals to become familiar with how conditional jumps are handled. The
single most important instruction is Jcc, where cc stands for a condition code.
We will generate code for the following types of tree node:
Ops.LABEL
Ops.EQU
Ops.JUMP
Ops.JUMPT
Ops.CMPNE
Ln:
L4:
LABEL label=n
Bn1=Ln2
where n1 is the number of the backpatch label and n2 is the number of the actual
label. For example, if the tree is
EQU
BLABEL blabel=6
LABEL label=8
B6=L8
An EQU tree has a LABEL tree as a subtree. As a subtree of an EQU tree, the LABEL tree
should not be processed as described above. Therefore, we must make a special case of
an EQU tree. Write the code for the EQU case as a special case in the generateCode()
function. Recall that ALLOC, FEND, and FUNC were also special cases.
JUMP INT
BLABEL blabel=n
where n is the number of the backpatch label. In some cases, it is possible that the
destination will be an actual label (LABEL) rather than a backpatch label (BLABEL).
This will happen with backward jumps since the destination is known. All forward
jumps will be jumps to backpatch labels.
An unconditional jump statement will have the form
jmp Bn
or
jmp Ln
where n is the number of the label. The JUMP case must also be handled as a special
case for the same reason that EQU was special: the LABEL or BLABEL subtrees should
not be handled as an ordinary LABEL or BLABEL case.
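The output formats for the LABEL, EQU, and JUMP cases are simple enough to sketch as three string-building functions. The class LabelCodeSketch and its methods are hypothetical, the leading spaces before jmp are an assumed layout, and in the compiler the numbers would be read from the subtrees rather than passed as arguments.

```java
// Hypothetical sketch of the text produced by the LABEL, EQU, and JUMP cases.
class LabelCodeSketch {
    // An actual label definition, such as "L4:".
    static String label(int n) {
        return "L" + n + ":";
    }

    // An equate of backpatch label n1 with actual label n2, such as "B6=L8".
    static String equ(int n1, int n2) {
        return "B" + n1 + "=L" + n2;
    }

    // An unconditional jump to a backpatch label (forward) or actual label (backward).
    static String jump(boolean toBackpatchLabel, int n) {
        return "    jmp  " + (toBackpatchLabel ? "B" : "L") + n;
    }
}
```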
CMPNE INT
NUM INT value=0
DEREF INT
ID PTR|INT value="a"
Note that the left subtree is a NUM node containing the number 0. Later when we
consider general boolean expressions, this may not be 0. The right subtree is the
numerical expression that appeared as the conditional expression in the if statement.
The form of the generated code for an integer comparison is
Notice that we first put 1 (true) in register ecx. Then if eax does not equal edx,
execution jumps to L02, leaving 1 in ecx. The mnemonic jne means “jump on not
equal.” However, if eax equals edx, then execution drops through to the statement
that decrements ecx, making it 0 (false). In either case, the value of ecx is pushed
onto the stack.
The code is a little different if the comparison is between floating point quantities.
Look up the mnemonics fucompp, fnstsw, and sahf in Intel’s Developer’s Manual,
Vol. 2A. The instruction fucompp will compare the two operands on top of the
floating-point stack and pop them both. It also sets certain bits (C0, C2, and C3) in the
floating-point status word, depending on how the comparison turns out. The next
instruction fnstsw will store the 16-bit FPU status word in register ax. Then the
instruction sahf stores ah in the eflags register. See Intel’s Developer’s Manual,
Vol. 1, Section 8.1.2, x87 FPU Status Register, and Section 8.1.3, Branching and
Conditional Moves on Condition Codes. You will see that sahf moves C0 to CF (carry
flag), C2 to PF (parity flag), and C3 to ZF (zero flag). These flags are automatically
tested when a conditional jump such as jne is executed.
The number of the label L02 was obtained from the jmpLabel class variable in the
CodeGenerator class. Be sure to include the leading 0 to distinguish L02 from L2,
which occurs elsewhere.
When you write the code for the CMPNE case, note that the integer and floating-point
cases begin and end with the same code. Only the middle parts are different.
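The shared beginning and end, with only the middle differing, can be sketched as below. The instruction sequence is reconstructed from the prose description, so treat it as an assumption: in particular, the way the two int operands reach eax and edx (two pops here) is a guess. The class CmpneSketch and the method cmpne() are hypothetical.

```java
// Hypothetical emitter for the CMPNE case, for int or double operands.
class CmpneSketch {
    static int jmpLabel = 0;

    static String cmpne(boolean isDouble) {
        String label = "L0" + (++jmpLabel);   // e.g. L02; leading 0 as in the text
        StringBuilder out = new StringBuilder();
        out.append("    mov  $1,%ecx      # Assume true\n");
        if (isDouble) {
            out.append("    fucompp           # Compare st(0), st(1); pop both\n");
            out.append("    fnstsw %ax        # FPU status word to ax\n");
            out.append("    sahf              # C0, C2, C3 to CF, PF, ZF\n");
        } else {
            out.append("    pop  %eax         # Assumed: operands come off the stack\n");
            out.append("    pop  %edx\n");
            out.append("    cmp  %edx,%eax    # Compare the operands\n");
        }
        out.append("    jne  ").append(label).append("          # Not equal: keep 1\n");
        out.append("    dec  %ecx         # Equal: make it 0 (false)\n");
        out.append(label).append(":\n");
        out.append("    push %ecx         # Push the boolean result\n");
        return out.toString();
    }

    public static void main(String[] args) {
        jmpLabel = 1;
        System.out.print(cmpne(false));
    }
}
```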
JUMPT INT
BLABEL blabel=3
CMPNE INT
NUM INT value=0
DEREF INT
ID PTR|INT value="a"
The JUMPT tree should also be treated as a special case, since the “true” destination
label may be an actual label and we don’t want the label printed here as a label.
However, we should pass the right subtree to generateTreeCode() for processing.
The code generated by the CMPNE tree will leave the boolean value on the stack, which
the JUMPT tree will test.
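The text does not spell out the instructions for the JUMPT case, so the following is only one plausible sketch: pop the boolean that the comparison left on the stack, compare it with 0, and jump if it is nonzero. The class JumptSketch, the method jumpt(), and the choice of instructions are all assumptions.

```java
// Hypothetical emitter for the JUMPT case; the instruction choice is assumed.
class JumptSketch {
    // The destination is a backpatch label (Bn) or an actual label (Ln).
    static String jumpt(boolean toBackpatchLabel, int n) {
        String dest = (toBackpatchLabel ? "B" : "L") + n;
        return "    pop  %eax         # Boolean left by the comparison\n"
             + "    cmp  $0,%eax      # Is it false?\n"
             + "    jne  " + dest + "          # No: go to the true destination\n";
    }
}
```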
This should complete the code generation for if statements and while statements.
16.8 Assignment
This lab will serve as Project 7. Place all of your source files in a folder named
Project 7, then zip the folder and drop it in the dropbox.
Laboratory 17
Boolean Expressions
Key Concepts
• Relational operators
• Boolean operators
Read Sections 8.4 and 8.6 in Compilers: Principles, Techniques, and Tools.
Preliminaries
Copy your source files from the Lab 16 folder to a new folder named Lab 17.
17.1 Introduction
The purpose of this lab is to implement the boolean operators &&, ||, and !, and the
relational operators ==, !=, <, >, <=, and >=. All of these operators appear in the
productions for conditional expressions.
[Figure: general form of a relational comparison tree: a CMPcc node of type mode whose subtrees are e1 (of type e1.mode) and e2 (of type e2.mode)]
Compiling these expressions is easy enough, and we are now experienced enough, that
we will do both the tree building and the code generation in a single lab.
17.2 Version 5
This is now Version 5 of our compiler. Make the necessary change in Utility.java,
compiler_v4.java, and the makefile.
Name the semantic action function relOp(). The header of the function should
be
where op is the relational operator (e.g., Ops.CMPNE) and e1 and e2 are the expressions
that are being compared. The only issue other than building the tree is to determine
the mode of the operations. That will follow the same rule used in the function
arith(). If both operands are int, then the operation is int. If either operand is
double, then the operation is double, with possible casting.
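The mode rule can be stated in a few lines. The constants and the names RelOpModeSketch and resultMode() are hypothetical; in the compiler the casting of an int operand to double would be done at the same point.

```java
// Hypothetical sketch of the mode rule shared by arith() and relOp().
class RelOpModeSketch {
    static final int INT = 1 << 0, DOUBLE = 1 << 1;  // assumed flag values

    // double if either operand is double; otherwise int.
    static int resultMode(int mode1, int mode2) {
        return (mode1 == DOUBLE || mode2 == DOUBLE) ? DOUBLE : INT;
    }
}
```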
Once the comparison tree is built, it must be attached to a JUMPT tree. To do this,
we will have relOp() call a function relOpToCExpr() which will be nearly identical
to the exprToCExpr() function that we wrote earlier. In fact, it is so similar that
you might want to copy, paste, and edit exprToCExpr() to create relOpToCExpr().
The only difference is that relOp() passes to relOpToCExpr() the comparison tree
already built. In exprToCExpr(), it was necessary to build the CMPNE tree first.
Therefore, the code in relOpToCExpr() should be identical to the subsequent code
in exprToCExpr(). Just as in exprToCExpr(), the function relOpToCExpr() should
return to relOp() a BackpatchNode containing the “true” and “false” destinations
of the JUMPT tree and the JUMP tree that were printed.
Write the relOp() and relOpToCExpr() functions in SemanticAction.java and
add the function calls in the CUP file.
&& q to be true, both p and q must be true. Therefore, if we find p to be false, then
there is no need to evaluate q. In fact, this method of evaluation is required by the
ANSI C standard. To see how to implement this, we need a diagram. See Figure
17.2.
This indicates that the “true” list from cexpr1 should be backpatched to m and
the “false” list from cexpr1 should be merged with the “false” list from cexpr2 . (Note
that m must be one of the parameters of andOp().) Then a BackpatchNode should
be constructed and returned that has the “true” and “false” lists indicated in the
diagram. Write the andOp() function.
The “or” operator || is very similar to the “and” operator. In this case, short-circuit
evaluation of p || q means that if p is true, then there is no need to evaluate
q. Write the orOp() function.
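The list manipulations for the two operators might be sketched as follows. The andOp() body follows the description above; the orOp() body is the mirror image and is our assumption, since the text leaves it as an exercise. The class BoolOpSketch, the helper list(), and the recording version of backpatch() are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Hypothetical sketch of andOp() and orOp() with short-circuit evaluation.
class BoolOpSketch {
    static class BackpatchNode {
        final LinkedList<Integer> trueList, falseList;
        BackpatchNode(LinkedList<Integer> tList, LinkedList<Integer> fList) {
            trueList = tList;
            falseList = fList;
        }
    }

    static final List<String> equs = new ArrayList<>();

    static LinkedList<Integer> list(int... labels) {
        LinkedList<Integer> l = new LinkedList<>();
        for (int x : labels) l.add(x);
        return l;
    }

    // Stand-in for backpatch(): records the equates instead of printing trees.
    static void backpatch(LinkedList<Integer> b, int labl) {
        for (Integer blabl : b) equs.add("B" + blabl + "=L" + labl);
    }

    static LinkedList<Integer> merge(LinkedList<Integer> b1, LinkedList<Integer> b2) {
        b1.addAll(b2);
        return b1;
    }

    // p && q: if p is true, go on to q at m; if p is false, the whole thing is false.
    static BackpatchNode andOp(BackpatchNode c1, Integer m, BackpatchNode c2) {
        backpatch(c1.trueList, m);
        return new BackpatchNode(c2.trueList, merge(c1.falseList, c2.falseList));
    }

    // p || q (assumed mirror image): if p is false, go on to q at m.
    static BackpatchNode orOp(BackpatchNode c1, Integer m, BackpatchNode c2) {
        backpatch(c1.falseList, m);
        return new BackpatchNode(merge(c1.trueList, c2.trueList), c2.falseList);
    }
}
```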
Finally, there is the production
17.8 Assignment
Implement do loops and for loops. The productions for these statements are
I recommend that you begin with the do loops, since they are easier. In a do loop,
stmt1 is executed unconditionally on the first pass. Then cexpr is evaluated. If it is
false, then execution exits the loop. If it is true, then stmt1 is executed again and
then cexpr is evaluated again, and so on.
When the for loop is executed, expr1 is evaluated first and unconditionally. Next,
cexpr is evaluated. If it evaluates to false, then execution exits the for loop. If it
is true, then stmt is executed. After stmt is executed, expr2 is evaluated and then
cexpr is evaluated again, with the same consequences as previously described.
You will have to figure out where to use the markers m and n. Then draw the
backpatching diagram. Be sure that all the parts are connected and that the final
exit from each kind of loop is the false destination of cexpr. You should find that
the abstract syntax tree and the assembly code are created automatically by existing
functions. You may use the program pi.c as a test program.
This lab will serve as Project 8. Copy your source files and the makefile to a folder
named Project 8, zip it, and drop it in the dropbox.