E-Book (Parser Programming) - Crenshaw - Let's Build A Compiler (Nice Edition)
24 July 1988.
INTRODUCTION
useful if you have a tool like YACC, and also don’t care how much
memory space the final product uses.
I also take a page from the work of Ron Cain, the author of the
original Small C. Whereas almost all other compiler authors have
historically used an intermediate language like P-code and divided
the compiler into two parts (a front end that produces P-code, and a
back end that processes P-code to produce executable object code),
Ron showed us that it is a straightforward matter to make a compiler
directly produce executable object code, in the form of assembler
language statements. The code will NOT be the world’s tightest
code ... producing optimized code is a much more difficult job. But
it will work, and work reasonably well. Just so that I don’t leave
you with the impression that our end product will be worthless, I
DO intend to show you how to “soup up” the compiler with some
optimization.
Finally, I’ll be using some tricks that I’ve found to be most helpful
in letting me understand what’s going on without wading through a
lot of boiler plate. Chief among these is the use of single-character
tokens, with no embedded spaces, for the early design work. I figure
that if I can get a parser to recognize and deal with I-T-L, I can get
it to do the same with IF-THEN-ELSE. And I can. In the second
“lesson,” I’ll show you just how easy it is to extend a simple parser
to handle tokens of arbitrary length. As another trick, I completely
ignore file I/O, figuring that if I can read source from the keyboard
and output object to the screen, I can also do it from/to disk files.
Experience has proven that once a translator is working correctly,
it’s a straightforward matter to redirect the I/O to files. The last
trick is that I make no attempt to do error correction/recovery. The
programs we’ll be building will RECOGNIZE errors, and will not
CRASH, but they will simply stop on the first error ... just like good
ol’ Turbo does. There will be other tricks that you’ll see as you go.
Most of them can’t be found in any compiler textbook, but they
work.
A word about style and efficiency. As you will see, I tend to write
programs in VERY small, easily understood pieces. None of the
procedures we’ll be working with will be more than about 15-20 lines
long. I’m a fervent devotee of the KISS (Keep It Simple, Sidney)
school of software development. I try to never do something tricky
or complex, when something simple will do. Inefficient? Perhaps, but
you’ll like the results. As Brian Kernighan has said, FIRST make it
run, THEN make it run fast. If, later on, you want to go back and
tighten up the code in one of our products, you’ll be able to do so,
since the code will be quite understandable. If you do so, however, I
urge you to wait until the program is doing everything you want it
to.
I also have a tendency to delay building a module until I discover
that I need it. Trying to anticipate every possible future contingency
can drive you crazy, and you’ll generally guess wrong anyway. In this
modern day of screen editors and fast compilers, I don’t hesitate to
change a module when I feel I need a more powerful one. Until then,
I’ll write only what I need.
One final caveat: One of the principles we’ll be sticking to here is
that we don’t fool around with P-code or imaginary CPUs, but that
we will start out on day one producing working, executable object
code, at least in the form of assembler language source. However,
you may not like my choice of assembler language ... it’s 68000 code,
which is what works on my system (under SK*DOS). I think you’ll
find, though, that the translation to any other CPU such as the 80x86
will be quite obvious, so I don't see a problem here. In fact, I
hope someone out there who knows the ’86 language better than I do
will offer us the equivalent object code fragments as we need them.
THE CRADLE
Every program needs some boiler plate ... I/O routines, error message
routines, etc. The programs we develop here will be no exceptions.
I’ve tried to hold this stuff to an absolute minimum, however, so that
we can concentrate on the important stuff without losing it among
the trees. The code given below represents about the minimum that
we need to get anything done. It consists of some I/O routines, an
error-handling routine and a skeleton, null main program. I call it our
cradle. As we develop other routines, we’ll add them to the cradle,
and add the calls to them as we need to. Make a copy of the cradle
and save it, because we’ll be using it more than once.
There are many different ways to organize the scanning activities
of a parser. In Unix systems, authors tend to use getc and ungetc.
I’ve had very good luck with the approach shown here, which is to
use a single, global, lookahead character. Part of the initialization
procedure (the only part, so far!) serves to “prime the pump” by
reading the first character from the input stream. No other special
techniques are required with Turbo 4.0 ... each successive call to
GetChar will read the next character in the stream.
{--------------------------------------------------------------}
program Cradle;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char; { Lookahead Character }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected(’’’’ + x + ’’’’);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := upcase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
if not IsAlpha(Look) then Expected(’Name’);
GetName := UpCase(Look);
GetChar;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNum := Look;
GetChar;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
end.
{--------------------------------------------------------------}
That’s it for this introduction. Copy the code above into TP and
compile it. Make sure that it compiles and runs correctly. Then
proceed to the first lesson, which is on expression parsing.
Part II
24 July 1988.
EXPRESSION PARSING
GETTING STARTED
If you’ve read the introduction document to this series, you will al-
ready know what we’re about. You will also have copied the cradle
software into your Turbo Pascal system, and have compiled it. So
you should be ready to go.
The purpose of this article is for us to learn how to parse and
translate mathematical expressions. What we would like to see as
output is a series of assembler-language statements that perform the
desired actions. For purposes of definition, an expression is the right-
hand side of an equation, as in
x = 2*y + 3/(4*z)
SINGLE DIGITS
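The whole translator at this stage is just the cradle plus a one-line
Expression procedure; a minimal sketch (the same line reappears later
as the body of Factor) is:
{---------------------------------------------------------------}
{ Parse and Translate a Math Expression }
procedure Expression;
begin
EmitLn('MOVE #' + GetNum + ',D0')
end;
{---------------------------------------------------------------}
with the main program calling Init and then Expression.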
Now run the program. Try any single-digit number as input. You
should get a single line of assembler-language output. Now try any
other character as input, and you’ll see that the parser properly re-
ports an error.
CONGRATULATIONS! You have just written a working transla-
tor!
OK, I grant you that it’s pretty limited. But don’t brush it off too
lightly. This little “compiler” does, on a very limited scale, exactly
what any larger compiler does: it correctly recognizes legal state-
ments in the input “language” that we have defined for it, and it
produces correct, executable assembler code, suitable for assembling
into object format. Just as importantly, it correctly recognizes state-
ments that are NOT legal, and gives a meaningful error message.
Who could ask for more? As we expand our parser, we’d better
make sure those two characteristics always hold true.
There are some other features of this tiny program worth men-
tioning. First, you can see that we don’t separate code generation
from parsing ... as soon as the parser knows what we want done,
it generates the object code directly. In a real compiler, of course,
the reads in GetChar would be from a disk file, and the writes to
another disk file, but this way is much easier to deal with while we’re
experimenting.
Also note that an expression must leave a result somewhere. I’ve
chosen the 68000 register D0. I could have made some other choices,
but this one makes sense.
BINARY EXPRESSIONS
Now that we have that under our belt, let’s branch out a bit. Admit-
tedly, an “expression” consisting of only one character is not going
to meet our needs for long, so let’s see what we can do to extend it.
Suppose we want to handle expressions of the form:
1+2
or 4-3
or, in general, <term> +/- <term>
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
Term;
EmitLn(’MOVE D0,D1’);
case Look of
’+’: Add;
’-’: Subtract;
else Expected(’Addop’);
end;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD D1,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB D1,D0’);
end;
{-------------------------------------------------------------}
When you're finished with that, the order of the routines should
be: Term (the old version of Expression), then Add, Subtract, and
finally the new Expression.
Now run the program. Try any combination you can think of of
two single digits, separated by a ’+’ or a ’-’. You should get a series of
four assembler-language instructions out of each run. Now try some
expressions with deliberate errors in them. Does the parser catch the
errors?
Take a look at the object code generated. There are two observa-
tions we can make. First, the code generated is NOT what we would
write ourselves. The sequence
MOVE #n,D0
MOVE D0,D1
Now our code is even less efficient, but at least it gives the right
answer! Unfortunately, the rules that give the meaning of math ex-
pressions require that the terms in an expression come out in an
inconvenient order for us. Again, this is just one of those facts of life
you learn to live with. This one will come back to haunt us when we
get to division.
OK, at this point we have a parser that can recognize the sum or
difference of two digits. Earlier, we could only recognize a single digit.
But real expressions can have either form (or an infinity of others).
For kicks, go back and run the program with the single input line ’1’.
Didn’t work, did it? And why should it? We just finished telling
our parser that the only kinds of expressions that are legal are those
with two terms. We must rewrite procedure Expression to be a lot
more broadminded, and this is where things start to take the shape
of a real parser.
GENERAL EXPRESSIONS
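The while loop in the code below corresponds to a BNF production of
roughly this form, where [ ... ]* means zero or more repetitions:
<expression> ::= <term> [ <addop> <term> ]*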
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
Term;
while Look in [’+’, ’-’] do begin
EmitLn(’MOVE D0,D1’);
case Look of
’+’: Add;
’-’: Subtract;
else Expected(’Addop’);
end;
end;
end;
{--------------------------------------------------------------}
1+(2-(3+(4-5)))
If we put the ’1’ in D1, where do we put the ’2’ ? Since a general
expression can have any degree of complexity, we’re going to run out
of registers fast!
Fortunately, there’s a simple solution. Like every modern micro-
processor, the 68000 has a stack, which is the perfect place to save a
variable number of items. So instead of moving the term in D0 to D1,
let’s just push it onto the stack. For the benefit of those unfamiliar
with 68000 assembler language, a push is written
-(SP)
and a pop is written (SP)+ . So if we change the EmitLn call in
Expression to read
EmitLn(’MOVE D0,-(SP)’);
Now let’s get down to some REALLY serious business. As you all
know, there are other math operators than “addops” ... expressions
can also have multiply and divide operations. You also know that
there is an implied operator PRECEDENCE, or hierarchy, associated
with expressions, so that in an expression like
2 + 3 * 4,
we know that we're supposed to multiply FIRST, then add. The way to
handle this is to split the grammar into levels: an expression is made
up of terms, and a term is made up of factors.
What is a factor? For now, it’s what a term used to be ... a single
digit.
Notice the symmetry: a term has the same form as an expression.
As a matter of fact, we can add to our parser with a little judicious
copying and renaming. But to avoid confusion, the listing below is
the complete set of parsing routines. (Note the way we handle the
reversal of operands in Divide.)
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Factor;
begin
EmitLn(’MOVE #’ + GetNum + ’,D0’)
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
EmitLn(’MULS (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
EmitLn(’MOVE (SP)+,D1’);
EmitLn(’DIVS D1,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
Factor;
while Look in [’*’, ’/’] do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’*’: Multiply;
’/’: Divide;
else Expected(’Mulop’);
end;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
Term;
while Look in [’+’, ’-’] do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
else Expected(’Addop’);
end;
end;
end;
{--------------------------------------------------------------}
PARENTHESES
We can wrap up this part of the parser with the addition of paren-
theses with math expressions. As you know, parentheses are a mech-
anism to force a desired operator precedence. So, for example, in the
expression
2*(3+4) ,
the parentheses force the addition before the multiply. Much more
importantly, though, parentheses give us a mechanism for defining
expressions of any degree of complexity, as in
(1+2)/((3+4)+(5-6))
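The BNF for a factor becomes <factor> ::= <number> | (<expression>),
and the only change needed is in Factor, which now calls Expression
recursively when it sees a left paren. A sketch of the new Factor (the
same shape, with variables added, appears in the next installment) is:
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = '(' then begin
Match('(');
Expression;
Match(')');
end
else
EmitLn('MOVE #' + GetNum + ',D0');
end;
{--------------------------------------------------------------}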
Note again how easily we can extend the parser, and how well the
Pascal code matches the BNF syntax.
As usual, compile the new version and make sure that it correctly
parses legal sentences, and flags illegal ones with an error message.
UNARY MINUS
At this point, we have a parser that can handle just about any ex-
pression, right? OK, try this input sentence:
-1
Whoops! The parser doesn't know what to do with a leading minus
sign, and it has the same trouble with expressions like
-(3-2) .
There are a couple of ways to fix the problem. The easiest (al-
though not necessarily the best) way is to stick an imaginary leading
zero in front of expressions of this type, so that -3 becomes 0-3. We
can easily patch this into our existing version of Expression:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
if IsAddop(Look) then
EmitLn(’CLR D0’)
else
Term;
while IsAddop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
else Expected(’Addop’);
end;
end;
end;
{--------------------------------------------------------------}
I TOLD you that making changes was easy! This time it cost us
only three new lines of Pascal. Note the new reference to function
IsAddop. Since the test for an addop appeared twice, I chose to em-
bed it in the new function. The form of IsAddop should be apparent
from that for IsAlpha. Here it is:
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
A WORD ABOUT OPTIMIZATION
than the stack? It worked, because with only those two operations,
the “stack” never needs more than two entries.
Well, the 68000 has eight data registers. Why not use them as a
privately managed stack? The key is to recognize that, at any point
in its processing, the parser KNOWS how many items are on the
stack, so it can indeed manage it properly. We can define a private
“stack pointer” that keeps track of which stack level we’re at, and
addresses the corresponding register. Procedure Factor, for example,
would not cause data to be loaded into register D0, but into whatever
the current “top-of-stack” register happened to be.
What we’re doing in effect is to replace the CPU’s RAM stack with
a locally managed stack made up of registers. For most expressions,
the stack level will never exceed eight, so we’ll get pretty good code
out. Of course, we also have to deal with those odd cases where the
stack level DOES exceed eight, but that’s no problem either. We
simply let the stack spill over into the CPU stack. For levels beyond
eight, the code is no worse than what we’re generating now, and for
levels less than eight, it’s considerably better.
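A rough sketch of the bookkeeping involved (the names here are
hypothetical, not code from this series): keep a private depth counter,
map each depth to a data register, and fall back to the real stack only
past D7.
{---------------------------------------------------------------}
{ Sketch: a privately managed register stack }
var Depth: integer; { how many saved values are live }
function StackReg(d: integer): string;
begin
StackReg := 'D' + Chr(Ord('0') + d); { depth 1..7 -> D1..D7 }
end;
procedure PushPrimary; { used where we now push D0 }
begin
Inc(Depth);
if Depth <= 7 then
EmitLn('MOVE D0,' + StackReg(Depth))
else
EmitLn('MOVE D0,-(SP)'); { out of registers: spill to RAM }
end;
{---------------------------------------------------------------}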
For the record, I have implemented this concept, just to make sure
it works before I mentioned it to you. It does. In practice, it turns
out that you can’t really use all eight levels ... you need at least one
register free to reverse the operand order for division (sure wish the
68000 had an XTHL, like the 8080!). For expressions that include
function calls, we would also need a register reserved for them. Still,
there is a nice improvement in code size for most expressions.
So, you see, getting better code isn’t that difficult, but it does add
complexity to our translator ... complexity we can do without
at this point. For that reason, I STRONGLY suggest that we con-
tinue to ignore efficiency issues for the rest of this series, secure in
the knowledge that we can indeed improve the code quality without
throwing away what we’ve done.
Next lesson, I’ll show you how to deal with variable factors and
function calls. I’ll also show you just how easy it is to handle multi-
character tokens and embedded white space.
Part III
4 Aug 1988.
MORE EXPRESSIONS
INTRODUCTION
VARIABLES
b * b + 4 * a * c
The ’|’ stands for “or”, meaning of course that either form is a legal
form for a factor. Remember, too, that we had no trouble knowing
which was which ... the lookahead character is a left paren ’(’ in one
case, and a digit in the other.
It probably won’t come as too much of a surprise that a variable
is just another kind of factor. So we extend the BNF above to read:
<factor> ::= <number> | (<expression>) | <identifier>
The 68000 code to fetch a variable into D0 is a single PC-relative move:
MOVE X(PC),D0
where X is, of course, the variable name. Armed with that, let’s
modify the current version of Factor to read:
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
EmitLn(’MOVE ’ + GetName + ’(PC),D0’)
else
EmitLn(’MOVE #’ + GetNum + ’,D0’);
end;
{--------------------------------------------------------------}
FUNCTIONS
Since we’re not dealing with parameter lists yet, there is nothing
to do but to call the function, so we need only to issue a BSR (call)
instead of a MOVE.
Now that there are two possibilities for the “If IsAlpha” branch of
the test in Factor, let’s treat them in a separate procedure. Modify
Factor to read:
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
Ident
else
EmitLn(’MOVE #’ + GetNum + ’,D0’);
end;
{--------------------------------------------------------------}
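Ident itself tucks the name away and peeks at the next character to
decide between a function call and a variable reference. The
single-character version (the multi-character one appears in the full
listing at the end of this installment) is:
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
Name := GetName;
if Look = '(' then begin
Match('(');
Match(')');
EmitLn('BSR ' + Name);
end
else
EmitLn('MOVE ' + Name + '(PC),D0');
end;
{---------------------------------------------------------------}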
OK, compile and test this version. Does it parse all legal expres-
sions? Does it correctly flag badly formed ones?
The important thing to notice is that even though we no longer
have a predictive parser, there is little or no complication added
with the recursive descent approach that we’re using. At the point
where Factor finds an identifier (letter), it doesn’t know whether it’s
a variable name or a function name, nor does it really care. It simply
passes it on to Ident and leaves it up to that procedure to figure it
out. Ident, in turn, simply tucks away the identifier and then reads
one more character to decide which kind of identifier it’s dealing with.
Keep this approach in mind. It’s a very powerful concept, and
it should be used whenever you encounter an ambiguous situation
requiring further lookahead. Even if you had to look several tokens
ahead, the principle would still work.
See how the space was treated as a terminator? Now, to make the
compiler properly flag this, add the line
CR = ^M;
As usual, recompile the program and verify that it does what it’s
supposed to.
ASSIGNMENT STATEMENTS
OK, at this point we have a parser that works very nicely. I’d like
to point out that we got it using only 88 lines of executable code,
not counting what was in the cradle. The compiled object file is a
whopping 4752 bytes. Not bad, considering we weren’t trying very
hard to save either source code or object size. We just stuck to the
KISS principle.
Of course, parsing an expression is not much good without having
something to do with it afterwards. Expressions USUALLY (but not
always) appear in assignment statements, in the form
<Ident> = <Expression>
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
Expression;
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’)
end;
{--------------------------------------------------------------}
Note again that the code exactly parallels the BNF. And notice
further that the error checking was painless, handled by GetName
and Match.
The reason for the two lines of assembler has to do with a peculiar-
ity in the 68000, which requires this kind of construct for PC-relative
code.
Now change the call to Expression, in the main program, to one
to Assignment. That’s all there is to it.
Son of a gun! We are actually compiling assignment statements.
If those were the only kind of statements in a language, all we’d have
to do is put this in a loop and we’d have a full-fledged compiler!
Well, of course they’re not the only kind. There are also little items
like control statements (IFs and loops), procedures, declarations, etc.
But cheer up. The arithmetic expressions that we’ve been dealing
with are among the most challenging in a language. Compared to
what we’ve already done, control statements will be easy. I’ll be
covering them in the fifth installment. And the other statements will
all fall in line, as long as we remember to KISS.
MULTI-CHARACTER TOKENS
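The function in question is a recognizer for an alphanumeric
character; it also appears in the full listing at the end of this
installment:
{--------------------------------------------------------------}
{ Recognize an Alphanumeric }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}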
Add this function to your parser. I put mine just after IsDigit.
While you’re at it, might as well include it as a permanent member
of Cradle, too.
Now, we need to modify function GetName to return a string
instead of a character:
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
Token := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
Token := Token + UpCase(Look);
GetChar;
end;
GetName := Token;
end;
{--------------------------------------------------------------}
WHITE SPACE
Before we leave this parser for awhile, let’s address the issue of white
space. As it stands now, the parser will barf (or simply terminate)
on a single space character embedded anywhere in the input stream.
That’s pretty unfriendly behavior. So let’s “productionize” the thing
a bit by eliminating this last restriction.
The key to easy handling of white space is to come up with a
simple rule for how the parser should treat the input stream, and
to enforce that rule everywhere. Up till now, because white space
wasn’t permitted, we’ve been able to assume that after each parsing
action, the lookahead character Look contains the next meaningful
character, so we could test it immediately. Our design was based
upon this principle.
It still sounds like a good rule to me, so that’s the one we’ll use.
This means that every routine that advances the input stream must
skip over white space, and leave the next non-white character in Look.
Fortunately, because we’ve been careful to use GetName, GetNum,
and Match for most of our input processing, it is only those three
routines (plus Init) that we need to modify.
Not surprisingly, we start with yet another new recognizer routine:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look <> x then Expected(’’’’ + x + ’’’’)
else begin
GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
Token := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
Token := Token + UpCase(Look);
GetChar;
end;
GetName := Token;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var Value: string;
begin
Value := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
GetNum := Value;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
Make these changes and recompile the program. You will find
that you will have to move Match below SkipWhite, to avoid an
error message from the Pascal compiler. Test the program as always
to make sure it works properly.
Since we’ve made quite a few changes during this session, I’m
reproducing the entire parser below:
{--------------------------------------------------------------}
program parse;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char; { Lookahead Character }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look <> x then Expected(’’’’ + x + ’’’’)
else begin
GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var Token: string;
begin
Token := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
Token := Token + UpCase(Look);
GetChar;
end;
GetName := Token;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var Value: string;
begin
Value := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
GetNum := Value;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: string[8];
begin
Name:= GetName;
if Look = ’(’ then begin
Match(’(’);
Match(’)’);
EmitLn(’BSR ’ + Name);
end
else
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
Ident
else
EmitLn(’MOVE #’ + GetNum + ’,D0’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
EmitLn(’MULS (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
EmitLn(’MOVE (SP)+,D1’);
EmitLn(’EXT.L D0’);
EmitLn(’DIVS D1,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
Factor;
while Look in [’*’, ’/’] do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’*’: Multiply;
’/’: Divide;
end;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
if IsAddop(Look) then
EmitLn(’CLR D0’)
else
Term;
while IsAddop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
end;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string[8];
begin
Name := GetName;
Match(’=’);
Expression;
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’)
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Assignment;
If Look <> CR then Expected(’NewLine’);
end.
{--------------------------------------------------------------}
Now the parser is complete. It’s got every feature we can put in
a one-line “compiler.” Tuck it away in a safe place. Next time we’ll
move on to a new subject, but we’ll still be talking about expressions
for quite awhile. Next installment, I plan to talk a bit about inter-
preters as opposed to compilers, and show you how the structure of
the parser changes a bit as we change what sort of action has to be
taken. The information we pick up there will serve us in good stead
later on, even if you have no interest in interpreters. See you next
time.
Part IV
24 July 1988.
INTERPRETERS
INTRODUCTION
x = 2 * y + 3
What I’d like you to see here is that the layout ... the structure ...
of the parser doesn’t change. It’s only the actions that change. So if
you can write an interpreter for a given language, you can also write a
compiler, and vice versa. Yet, as you will see, there ARE differences,
and significant ones. Because the actions are different, the procedures
that do the recognizing end up being written differently. Specifically,
in the interpreter the recognizing procedures end up being coded as
FUNCTIONS that return numeric values to their callers. None of
the parsing routines for our compiler did that.
Our compiler, in fact, is what we might call a “pure” compiler.
Each time a construct is recognized, the object code is emitted IM-
MEDIATELY. (That’s one reason the code is not very efficient.) The
interpreter we’ll be building here is a pure interpreter, in the sense
that there is no translation, such as “tokenizing,” performed on the
source code. These represent the two extremes of translation. In the
real world, translators are rarely so pure, but tend to have bits of
each technique.
I can think of several examples. I’ve already mentioned one: most
interpreters, such as Microsoft BASIC, for example, translate the
source code (tokenize it) into an intermediate form so that it’ll be
easier to parse real time.
Another example is an assembler. The purpose of an assembler,
of course, is to produce object code, and it normally does that on
a one-to-one basis: one object instruction per line of source code.
But almost every assembler also permits expressions as arguments.
In this case, the expressions are always constant expressions, and so
the assembler isn’t supposed to issue object code for them. Rather,
it “interprets” the expressions and computes the corresponding con-
stant result, which is what it actually emits as object code.
As a matter of fact, we could use a bit of that ourselves. The
translator we built in the previous installment will dutifully spit out
object code for complicated expressions, even though every term in
the expression is a constant. In that case it would be far better if the
translator behaved a bit more like an interpreter, and just computed
the equivalent constant result.
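As a purely hypothetical sketch (not code from this series), folding a
constant addition at parse time and emitting a single load could look
like this:
{--------------------------------------------------------------}
{ Sketch: emit one constant load instead of run-time code for a+b }
procedure FoldedAdd(a, b: integer);
var S: string;
begin
Str(a + b, S); { the arithmetic happens while parsing }
EmitLn('MOVE #' + S + ',D0'); { a single load at run time }
end;
{--------------------------------------------------------------}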
There is a concept in compiler theory called “lazy” translation.
The idea is that you typically don’t just emit code at every action.
In fact, at the extreme you don’t emit anything at all, until you ab-
solutely have to. To accomplish this, the actions associated with the
parsing routines typically don’t just emit code. Sometimes they do,
but often they simply return information back to the caller. Armed
with such information, the caller can then make a better choice of
what to do.
For example, given the statement
x = x + 3 - 2 - (5 - 4) ,
our compiler will dutifully emit code for every operation. A lazier
evaluation would notice that the arithmetic involving constants can be
done at compile time, reducing the statement to
x = x + 0 .
An even lazier evaluation would recognize that this is equivalent to
x = x ,
which calls for no code at all.
THE INTERPRETER
OK, now that you know WHY we’re going into all this, let’s do it.
Just to give you practice, we’re going to start over with a bare cradle
and build up the translator all over again. This time, of course, we
can go a bit faster.
Since we’re now going to do arithmetic, the first thing we need
to do is to change function GetNum, which up till now has always
returned a character (or string). Now, it’s better for it to return an
integer. MAKE A COPY of the cradle (for goodness’s sake, don’t
change the version in Cradle itself!!) and modify GetNum as follows:
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: integer;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNum := Ord(Look) - Ord(’0’);
GetChar;
end;
{--------------------------------------------------------------}
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: integer;
begin
Expression := GetNum;
end;
{--------------------------------------------------------------}
Then, in the main program, print the value that Expression returns:
Writeln(Expression);
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: integer;
var Value: integer;
begin
if IsAddop(Look) then
Value := 0
else
Value := GetNum;
while IsAddop(Look) do begin
case Look of
’+’: begin
Match(’+’);
Value := Value + GetNum;
end;
’-’: begin
Match(’-’);
Value := Value - GetNum;
end;
end;
end;
Expression := Value;
end;
{--------------------------------------------------------------}
Now, try it out. Don’t forget two things: first, we’re dealing
with integer division, so, for example, 1/3 should come out zero.
Second, even though we can output multi-digit results, our input is
still restricted to single digits.
That seems like a silly restriction at this point, since we have
already seen how easily function GetNum can be extended. So let’s
go ahead and fix it right now. The new version is
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: integer;
var Value: integer;
begin
Value := 0;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
Value := 10 * Value + Ord(Look) - Ord(’0’);
GetChar;
end;
GetNum := Value;
end;
{--------------------------------------------------------------}
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
function Expression: integer; Forward;
function Factor: integer;
begin
if Look = ’(’ then begin
Match(’(’);
Factor := Expression;
Match(’)’);
end
else
Factor := GetNum;
end;
{---------------------------------------------------------------}
A LITTLE PHILOSOPHY
Before going any further, there’s something I’d like to call to your
attention. It’s a concept that we’ve been making use of in all these
sessions, but I haven't explicitly mentioned it up till now.
{---------------------------------------------------------------}
{ Initialize the Variable Area }
procedure InitTable;
var i: char;
begin
for i := ’A’ to ’Z’ do
Table[i] := 0;
end;
{---------------------------------------------------------------}
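InitTable assumes a global variable area indexed by the
single-character names; a one-line sketch of the declaration, placed
with the other variable declarations, is:
var Table: array['A'..'Z'] of integer;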
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
function Expression: integer; Forward;
function Factor: integer;
begin
if Look = ’(’ then begin
Match(’(’);
Factor := Expression;
Match(’)’);
end
else if IsAlpha(Look) then
Factor := Table[GetName]
else
Factor := GetNum;
end;
{---------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
Table[Name] := Expression;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Recognize and Skip Over a Newline }
procedure NewLine;
begin
if Look = CR then begin
GetChar;
if Look = LF then
GetChar;
end;
end;
{--------------------------------------------------------------}
Insert this procedure at any convenient spot ... I put mine just
after Match. Now, rewrite the main program to look like this:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Assignment;
NewLine;
until Look = ’.’;
end.
{--------------------------------------------------------------}
Note that the test for a CR is now gone, and that there are also no
error tests within NewLine itself. That’s OK, though ... whatever is
left over in terms of bogus characters will be caught at the beginning
of the next assignment statement.
Well, we now have a functioning interpreter. It doesn’t do us a lot
of good, however, since we have no way to read data in or write it
out. Sure would help to have some I/O!
Let’s wrap this session up, then, by adding the I/O routines. Since
we're sticking to single-character tokens, I'll use '?' to stand for a
read statement, and '!' for a write, with the character immediately
following them to be used as a one-token “parameter list.” Here are
the routines:
{--------------------------------------------------------------}
{ Input Routine }
procedure Input;
begin
Match(’?’);
Read(Table[GetName]);
end;
{--------------------------------------------------------------}
{ Output Routine }
procedure Output;
begin
Match(’!’);
WriteLn(Table[GetName]);
end;
{--------------------------------------------------------------}
Part V
19 August 1988.
CONTROL CONSTRUCTS
INTRODUCTION
THE PLAN
In what follows, we’ll be starting over again with a bare cradle, and
as we’ve done twice before now, we’ll build things up one at a time.
We’ll also be retaining the concept of single-character tokens that
has served us so well to date. This means that the “code” will look
a little funny, with ’i’ for IF, ’w’ for WHILE, etc. But it helps us get
the concepts down pat without fussing over lexical scanning. Fear
not ... eventually we’ll see something looking like “real” code.
I also don’t want to have us get bogged down in dealing with
statements other than branches, such as the assignment statements
we’ve been working on. We’ve already demonstrated that we can
handle them, so there’s no point carrying them around as excess
baggage during this exercise. So what I’ll do instead is to use an
anonymous statement, “other”, to take the place of the non-control
statements and serve as a place-holder for them. We have to generate
some kind of object code for them (we’re back into compiling, not
interpretation), so for want of anything else I’ll just echo the character
input.
OK, then, starting with yet another copy of the cradle, let’s define
the procedure:
{--------------------------------------------------------------}
{ Recognize and Translate an "Other" }
procedure Other;
begin
EmitLn(GetName);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Other;
end.
{--------------------------------------------------------------}
Run the program and see what you get. Not very exciting, is it?
But hang in there, it’s a start, and things will get better.
The first thing we need is the ability to deal with more than one
statement, since a single-line branch is pretty limited. We did that
in the last session on interpreting, but this time let’s get a little more
formal. Consider the following BNF:
<program> ::= <block> END
<block> ::= [ <statement> ]*
In Pascal, the production for <program> becomes procedure DoProgram:
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
Block;
if Look <> ’e’ then Expected(’End’);
EmitLn(’END’)
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
while not(Look in [’e’]) do begin
Other;
end;
end;
{--------------------------------------------------------------}
(From the form of the procedure, you just KNOW we’re going to
be adding to it in a bit!)
OK, enter these routines into your program. Replace the call to
Block in the main program, by a call to DoProgram. Now try it and
see how it works. Well, it’s still not much, but we’re getting closer.
SOME GROUNDWORK
IF ( <condition> ) <statement>
It’s clear, then, that we’re going to need some more procedures
to help us deal with these branches. I’ve defined two of them below.
Procedure NewLabel generates unique labels. This is done via the
simple expedient of calling every label ’Lnn’, where nn is a label
number starting from zero. Procedure PostLabel just outputs the
labels at the proper place.
Here are the two routines:
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := ’L’ + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ’:’);
end;
{--------------------------------------------------------------}
NewLabel uses a global label counter, so declare LCount: integer
alongside Look (it appears that way in the final listing) and
initialize it in Init:
LCount := 0;
IF: First, get the condition and issue the code for it.
Then, create a unique label and emit a branch if false.
IF
<condition> { Condition;
L = NewLabel;
Emit(Branch False to L); }
<block>
ENDIF { PostLabel(L) }
By convention, 0000 represents “false,” and anything else (some use
FFFF, some 0001) represents “true.”
On the 68000 the condition flags are set whenever any data is
moved or calculated. If the data is a 0000 (corresponding to a false
condition, remember), the zero flag will be set. The code for “Branch
on zero” is BEQ. So for our purposes here,
BEQ <=> Branch if false
BNE <=> Branch if true
It’s the nature of the beast that most of the branches we see will
be BEQ’s ... we’ll be branching AROUND the code that’s supposed
to be executed when the condition is true.
THE IF STATEMENT
With that bit of explanation out of the way, we’re finally ready to
begin coding the IF-statement parser. In fact, we’ve almost already
done it! As usual, I’ll be using our single-character approach, with
the character ’i’ for IF, and ’e’ for ENDIF (as well as END ... that
dual nature causes no confusion). I’ll also, for now, skip completely
the character for the branch condition, which we still have to define.
The code for DoIf is:
{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L: string;
begin
Match(’i’);
L := NewLabel;
Condition;
EmitLn(’BEQ ’ + L);
Block;
Match(’e’);
PostLabel(L);
end;
{--------------------------------------------------------------}
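DoIf calls Condition, which we haven't written yet. For now a dummy
version, in the same spirit as the dummy Expression used later for the
FOR loop, will do:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
procedure Condition;
begin
EmitLn('<condition>');
end;
{--------------------------------------------------------------}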
Insert this procedure in your program just before DoIf. Now run
the program. Try a string like
aibece
As you can see, the parser seems to recognize the construct and
inserts the object code at the right places. Now try a set of nested
IF’s, like
aibicedefe
<condition>
BEQ L1
<block>
BRA L2
L1: <block>
L2: ...
IF
<condition> { L1 = NewLabel;
L2 = NewLabel;
Emit(BEQ L1) }
<block>
ELSE { Emit(BRA L2);
PostLabel(L1) }
<block>
ENDIF { PostLabel(L2) }
{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure DoIf;
var L1, L2: string;
begin
Match(’i’);
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn(’BEQ ’ + L1);
Block;
if Look = ’l’ then begin
Match(’l’);
L2 := NewLabel;
EmitLn(’BRA ’ + L2);
PostLabel(L1);
Block;
end;
Match(’e’);
PostLabel(L2);
end;
{--------------------------------------------------------------}
aiblcede
aibece
Now try some nested IF’s. Try anything you like, including some
badly formed statements. Just remember that ’e’ is not a legal
“other” statement.
THE WHILE STATEMENT
The next type of statement should be easy, since we already have the
process down pat. The syntax I’ve chosen for the WHILE statement
is
L1: <condition>
BEQ L2
<block>
BRA L1
L2:
WHILE { L1 = NewLabel;
PostLabel(L1) }
<condition> { Emit(BEQ L2) }
<block>
ENDWHILE { Emit(BRA L1);
PostLabel(L2) }
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
Match(’w’);
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
Condition;
EmitLn(’BEQ ’ + L2);
Block;
Match(’e’);
EmitLn(’BRA ’ + L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
We could stop right here, and have a language that works. It’s been
shown many times that a high-order language with only two con-
structs, the IF and the WHILE, is sufficient to write structured code.
But we’re on a roll now, so let’s richen up the repertoire a bit.
THE LOOP STATEMENT
This construct is even easier, since it has no condition test at all ...
it’s an infinite loop. What’s the point of such a loop? Not much, by
itself, but later on we’re going to add a BREAK command, that will
give us a way out. This makes the language considerably richer than
Pascal, which has no break, and also avoids the funny WHILE(1) or
WHILE TRUE of C and Pascal.
The syntax is simply
LOOP { L = NewLabel;
PostLabel(L) }
<block>
ENDLOOP { Emit(BRA L) }
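The parse routine follows the same pattern as DoWhile, minus the
condition test. A sketch (the two-label version needed later for BREAK
is shown further on):
{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }
procedure DoLoop;
var L: string;
begin
Match('p');
L := NewLabel;
PostLabel(L);
Block;
Match('e');
EmitLn('BRA ' + L);
end;
{--------------------------------------------------------------}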
When you insert this routine, don’t forget to add a line in Block
to call it.
REPEAT-UNTIL
Here’s one construct that I lifted right from Pascal. The syntax is
REPEAT { L = NewLabel;
PostLabel(L) }
<block>
UNTIL
<condition> { Emit(BEQ L) }
{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }
procedure DoRepeat;
var L: string;
begin
Match(’r’);
L := NewLabel;
PostLabel(L);
Block;
Match(’u’);
Condition;
EmitLn(’BEQ ’ + L);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
while not(Look in [’e’, ’l’, ’u’]) do begin
case Look of
’i’: DoIf;
’w’: DoWhile;
’p’: DoLoop;
’r’: DoRepeat;
else Other;
end;
end;
end;
{--------------------------------------------------------------}
THE FOR LOOP
The FOR loop is a very handy one to have around, but it's a bear to
translate. That’s not so much because the construct itself is hard ...
it’s only a loop after all ... but simply because it’s hard to implement
in assembler language. Once the code is figured out, the translation
is straightforward enough.
C fans love the FOR-loop of that language (and, in fact, it’s easier
to code), but I’ve chosen instead a syntax very much like the one
from good ol’ BASIC:
FOR <ident> = <expr1> TO <expr2>
<block>
ENDFOR
Notice that with this definition of the loop, <block> will not be
executed at all if <expr1> is initially larger than <expr2>.
The 68000 code needed to do this is trickier than anything we’ve
done so far. I had a couple of tries at it, putting both the counter and
the upper limit on the stack, both in registers, etc. I finally arrived
at a hybrid arrangement, in which the loop counter is in memory (so
that it can be accessed within the loop), and the upper limit is on
the stack. The translated code came out like this:
<ident> get name of loop counter
<expr1> get initial value
LEA <ident>(PC),A0 address the loop counter
SUBQ #1,D0 predecrement it
MOVE D0,(A0) save it
<expr2> get upper limit
MOVE D0,-(SP) save it on stack
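The rest of the translation, which is exactly what procedure DoFor
below emits for the loop itself, continues along these lines:
L1: LEA <ident>(PC),A0 address the loop counter
MOVE (A0),D0 fetch it
ADDQ #1,D0 bump the counter
MOVE D0,(A0) save the new value
CMP (SP),D0 compare against the upper limit
BGT L2 skip out if counter > limit
<block>
BRA L1 loop for the next pass
L2: ADDQ #2,SP clean up the stack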
Wow! That seems like a lot of code ... the line containing <block>
seems to almost get lost. But that’s the best I could do with it. I
guess it helps to keep in mind that it’s really only sixteen words, after
all. If anyone else can optimize this better, please let me know.
Still, the parser routine is pretty easy now that we have the code:
{--------------------------------------------------------------}
{ Parse and Translate a FOR Statement }
procedure DoFor;
var L1, L2: string;
Name: char;
begin
Match(’f’);
L1 := NewLabel;
L2 := NewLabel;
Name := GetName;
Match(’=’);
Expression;
EmitLn(’SUBQ #1,D0’);
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’);
Expression;
EmitLn(’MOVE D0,-(SP)’);
PostLabel(L1);
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE (A0),D0’);
EmitLn(’ADDQ #1,D0’);
EmitLn(’MOVE D0,(A0)’);
EmitLn(’CMP (SP),D0’);
EmitLn(’BGT ’ + L2);
Block;
Match(’e’);
EmitLn(’BRA ’ + L1);
PostLabel(L2);
EmitLn(’ADDQ #2,SP’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
{ This version is a dummy }
Procedure Expression;
begin
EmitLn(’<expr>’);
end;
{--------------------------------------------------------------}
Give it a try. Once again, don’t forget to add the call in Block.
Since we don’t have any input for the dummy version of Expression,
a typical input line would look something like
afi=bece
Well, it DOES generate a lot of code, doesn’t it? But at least it’s
the RIGHT code.
THE DO STATEMENT
All this made me wish for a simpler version of the FOR loop. The
reason for all the code above is the need to have the loop counter
accessible as a variable within the loop. If all we need is a counting
loop that executes a block some number of times, without access to the
counter itself, the 68000's DBRA (decrement and branch) instruction
lets us translate it much more simply. The syntax I'll use is:
DO
<expr> { Emit(SUBQ #1,D0);
L = NewLabel;
PostLabel(L);
Emit(MOVE D0,-(SP)) }
<block>
ENDDO { Emit(MOVE (SP)+,D0);
Emit(DBRA D0,L) }
That’s quite a bit simpler! The loop will execute <expr> times.
Here’s the code:
{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }
procedure Dodo;
var L: string;
begin
Match(’d’);
L := NewLabel;
Expression;
EmitLn(’SUBQ #1,D0’);
PostLabel(L);
EmitLn(’MOVE D0,-(SP)’);
Block;
EmitLn(’MOVE (SP)+,D0’);
EmitLn(’DBRA D0,’ + L);
end;
{--------------------------------------------------------------}
I think you’ll have to agree, that’s a whole lot simpler than the
classical FOR. Still, each construct has its place.
THE BREAK STATEMENT
To give BREAK something to branch to, each loop now posts a second
label at its exit. The new version of DoLoop is:
{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }
procedure DoLoop;
var L1, L2: string;
begin
Match(’p’);
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
Block(L2);
Match(’e’);
EmitLn(’BRA ’ + L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
Notice that DoLoop now has TWO labels, not just one. The second is
to give the BREAK instruction a target to jump to. If there is no
BREAK inside the loop, the extra label simply never gets used, and
no harm is done.
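A sketch of the label-passing Block, consistent with the calls used
below (the full set of statement characters is the one built up in this
installment):
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block(L: string);
begin
while not(Look in ['e', 'l', 'u']) do begin
case Look of
'i': DoIf(L);
'w': DoWhile;
'p': DoLoop;
'r': DoRepeat;
'f': DoFor;
'd': Dodo;
'b': DoBreak(L);
else Other;
end;
end;
end;
{--------------------------------------------------------------}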
Again, notice that all Block does with the label is to pass it into
DoIf and DoBreak. The loop constructs don’t need it, because they
are going to pass their own label anyway.
The new version of DoIf is:
{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block(L: string); Forward;
procedure DoIf(L: string);
var L1, L2: string;
begin
Match(’i’);
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn(’BEQ ’ + L1);
Block(L);
if Look = ’l’ then begin
Match(’l’);
L2 := NewLabel;
EmitLn(’BRA ’ + L2);
PostLabel(L1);
Block(L);
end;
Match(’e’);
PostLabel(L2);
end;
{--------------------------------------------------------------}
Here, the only thing that changes is the addition of the parameter
to procedure Block. An IF statement doesn’t change the loop nesting
level, so DoIf just passes the label along. No matter how many levels
of IF nesting we have, the same label will be used.
Now, remember that DoProgram also calls Block, so it now needs
to pass it a label. An attempt to exit the outermost block is an error,
so DoProgram passes a null label which is caught by DoBreak:
{--------------------------------------------------------------}
{ Recognize and Translate a BREAK }
procedure DoBreak(L: string);
begin
Match(’b’);
if L <> ’’ then
EmitLn(’BRA ’ + L)
else Abort(’No loop to break from’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
Block(’’);
if Look <> ’e’ then Expected(’End’);
EmitLn(’END’)
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a DO Statement }
procedure Dodo;
var L1, L2: string;
begin
Match(’d’);
L1 := NewLabel;
L2 := NewLabel;
Expression;
EmitLn(’SUBQ #1,D0’);
PostLabel(L1);
EmitLn(’MOVE D0,-(SP)’);
Block(L2);
EmitLn(’MOVE (SP)+,D0’);
EmitLn(’DBRA D0,’ + L1);
EmitLn(’SUBQ #2,SP’);
PostLabel(L2);
EmitLn(’ADDQ #2,SP’);
end;
{--------------------------------------------------------------}
The two extra instructions, the SUBQ and ADDQ, take care of
leaving the stack in the right shape: a BREAK branches to L2 with the
loop count still on the stack, so the ADDQ pops it off; on a normal
exit the DBRA has already popped the count, so the SUBQ first pushes
the stack back down to keep the ADDQ balanced.
CONCLUSION
{--------------------------------------------------------------}
program Branch;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Lcount: integer; { Label Counter }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected(’’’’ + x + ’’’’);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
if not IsAlpha(Look) then Expected(’Name’);
GetName := UpCase(Look);
GetChar;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNum := Look;
GetChar;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := ’L’ + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ’:’);
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
procedure Condition;
begin
EmitLn(’<condition>’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Math Expression }
procedure Expression;
begin
EmitLn(’<expr>’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block(L: string); Forward;
procedure DoIf(L: string);
var L1, L2: string;
begin
Match(’i’);
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn(’BEQ ’ + L1);
Block(L);
if Look = ’l’ then begin
Match(’l’);
L2 := NewLabel;
EmitLn(’BRA ’ + L2);
PostLabel(L1);
Block(L);
end;
Match(’e’);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
Match(’w’);
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
Condition;
EmitLn(’BEQ ’ + L2);
Block(L2);
Match(’e’);
EmitLn(’BRA ’ + L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Parse and Translate a LOOP Statement }
procedure DoLoop;
var L1, L2: string;
begin
Match(’p’);
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
Block(L2);
Match(’e’);
EmitLn(’BRA ’ + L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Parse and Translate a REPEAT Statement }
procedure DoRepeat;
var L1, L2: string;
begin
Match(’r’);
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
Block(L2);
Match(’u’);
Condition;
EmitLn(’BEQ ’ + L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
Block(’’);
if Look <> ’e’ then Expected(’End’);
EmitLn(’END’)
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
LCount := 0;
GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DoProgram;
end.
{--------------------------------------------------------------}
Part VI
31 August 1988.
BOOLEAN EXPRESSIONS
INTRODUCTION
THE PLAN
THE GRAMMAR
For some time now, we’ve been implementing BNF syntax equations
for arithmetic expressions, without ever actually writing them down
all in one place. It’s time that we did so. They are:
Actually, while we’re on the subject, I’d like to amend this gram-
mar a bit right now. The way we’ve handled the unary minus is a bit
awkward. I’ve found that it’s better to write the grammar this way:
<expression> ::= <term> [<addop> <term>]*
<term> ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor> ::= <integer> | <variable> | (<expression>)
This puts the job of handling the unary minus onto Factor, which
is where it really belongs.
This doesn’t mean that you have to go back and recode the pro-
grams you’ve already written, although you’re free to do so if you
like. But I will be using the new syntax from now on.
Now, it probably won’t come as a shock to you to learn that we
can define an analogous grammar for Boolean algebra. A typical set
of rules is:
<b-expression>::= <b-term> [<orop> <b-term>]*
<b-term> ::= <not-factor> [AND <not-factor>]*
<not-factor> ::= [NOT] <b-factor>
<b-factor> ::= <b-literal> | <b-variable> | (<b-expression>)
or worse yet,
a - -b
makes perfect sense, and the syntax shown allows for that.
RELOPS
OK, assuming that you’re willing to accept the grammar I’ve shown
here, we now have syntax rules for both arithmetic and Boolean al-
gebra. The sticky part comes in when we have to combine the two.
Why do we have to do that? Well, the whole subject came up be-
cause of the need to process the “predicates” (conditions) associated
with control statements such as the IF. The predicate is required to
have a Boolean value; that is, it must evaluate to either TRUE or
FALSE. The branch is then taken or not taken, depending on that
value. What we expect to see going on in procedure Condition, then,
is the evaluation of a Boolean expression.
But there’s more to it than that. A pure Boolean expression can
indeed be the predicate of a control statement ... things like
IF a AND NOT b THEN ....
But more often, we see expressions like
IF (x >= 0) and (x <= 100) THEN ...
Here, the two terms in parens are Boolean expressions, but the
individual terms being compared: x, 0, and 100, are NUMERIC in
nature. The RELATIONAL OPERATORS >= and <= are the cata-
lysts by which the Boolean and the arithmetic ingredients get merged
together.
Now, in the example above, the terms being compared are just
that: terms. However, in general each side can be a math expression.
So we can define a RELATION to be:
<relation> ::= <expression> <relop> <expression>
where the expressions we’re talking about here are the old numeric
type, and the relops are any of the usual symbols
If you think about it a bit, you’ll agree that, since this kind of
predicate has a single Boolean value, TRUE or FALSE, as its result,
it is really just another kind of factor. So we can expand the definition
of a Boolean factor above to read:
<b-factor> ::= <b-literal> | <b-variable> | <relation>
THAT’s the connection! The relops and the relation they define
serve to wed the two kinds of algebra. It is worth noting that this
implies a hierarchy where the arithmetic expression has a HIGHER
precedence than a Boolean factor, and therefore than all the Boolean
operators. If you write out the precedence levels for all the operators,
you arrive at the following list:
When the parser is parsing this code, it knows after it sees the IF
token that a Boolean expression is supposed to be next. So it can set
up to begin evaluating such an expression. But the first expression
in the example is an ARITHMETIC expression, A + B + C. What’s
worse, at the point that the parser has read this much of the input
line:
IF ((((((A ,
it still has no way of knowing which kind of expression it’s dealing
with. That won’t do, because we must have different recognizers for
the two cases. The situation can be handled without changing any
of our definitions, but only if we’re willing to accept an arbitrary
amount of backtracking to work our way out of bad guesses. No
compiler writer in his right mind would agree to that.
What’s going on here is that the beauty and elegance of BNF
grammar has met face to face with the realities of compiler technol-
ogy.
To deal with this situation, compiler writers have had to make
compromises so that a single parser can handle the grammar without
backtracking.
Notice that there is only ONE set of syntax rules, applying to both
kinds of operators. According to this grammar, then, expressions like
are perfectly legal. And, in fact, they ARE ... as far as the parser is
concerned. Pascal doesn’t allow the mixing of arithmetic and Boolean
variables, and things like this are caught at the SEMANTIC level,
when it comes time to generate code for them, rather than at the
syntax level.
The authors of C took a diametrically opposite approach: they
treat the operators as different, and have something much more akin
to our seven levels of precedence. In fact, in C there are no fewer
than 17 levels! That’s because C also has the operators ‘=’, ’+=’
and its kin, ’<<’, ’>>’, ’++’, ’--’, etc. Ironically, although in C the
arithmetic and Boolean operators are treated separately, the variables
are NOT ... there are no Boolean or logical variables in C, so a
Boolean test can be made on any integer value.
We’ll do something that’s sort of in-between. I’m tempted to stick
mostly with the Pascal approach, since that seems the simplest from
an implementation point of view, but it results in some funnies that
I never liked very much, such as the fact that, in the expression
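The amended grammar that the next paragraph discusses also fell outside this excerpt. Reconstructed from the rules given earlier in this chapter and from the description below (so treat the exact layout as a sketch, not the original listing), it reads roughly:
<b-expression> ::= <b-term> [<orop> <b-term>]*
<b-term> ::= <not-factor> [AND <not-factor>]*
<not-factor> ::= [NOT] <b-factor>
<b-factor> ::= <b-literal> | <b-variable> | <relation>
<relation> ::= <expression> [<relop> <expression>]
<expression> ::= <term> [<addop> <term>]*
<term> ::= <signed factor> [<mulop> <factor>]*
<signed factor> ::= [<addop>] <factor>
<factor> ::= <integer> | <variable> | (<b-expression>)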
This grammar results in the same set of seven levels that I showed
earlier. Really, it’s almost the same grammar ... I just removed
the option of parenthesized b-expressions as a possible b-factor, and
added the relation as a legal form of b-factor.
There is one subtle but crucial difference, which is what makes the
whole thing work. Notice the square brackets in the definition of a
relation. This means that the relop and the second expression are
OPTIONAL.
A strange consequence of this grammar (and one shared by C)
is that EVERY expression is potentially a Boolean expression. The
parser will always be looking for a Boolean expression, but will “set-
tle” for an arithmetic one. To be honest, that’s going to slow down
the parser, because it has to wade through more layers of procedure
calls. That’s one reason why Pascal compilers tend to compile faster
than C compilers. If it’s raw speed you want, stick with the Pascal
syntax.
THE PARSER
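The routines referred to next recognize and read a Boolean literal, and they fell outside this excerpt. A minimal sketch, assuming the literals are the single characters 'T' and 'F' (that encoding is an assumption; any single-character scheme would do):
{--------------------------------------------------------------}
{ Recognize a Boolean Literal }
function IsBoolean(c: char): boolean;
begin
   IsBoolean := UpCase(c) in ['T', 'F'];
end;
{--------------------------------------------------------------}
{ Get a Boolean Literal }
function GetBoolean: boolean;
begin
   if not IsBoolean(Look) then Expected('Boolean Literal');
   GetBoolean := UpCase(Look) = 'T';
   GetChar;
end;
{--------------------------------------------------------------}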
Type these routines into your program. You can test them by
adding into the main program the print statement
WriteLn(GetBoolean);
OK, compile the program and test it. As usual, it’s not very
impressive so far, but it soon will be.
Now, when we were dealing with numeric data we had to arrange
to generate code to load the values into D0. We need to do the same
for Boolean data. The usual way to encode Boolean variables is to let
0 stand for FALSE, and some other value for TRUE. Many languages,
such as C, use an integer 1 to represent it. But I prefer FFFF hex (or
-1), because a bitwise NOT also becomes a Boolean NOT. So now we
need to emit the right assembler code to load those values. The first
cut at the Boolean expression parser (BoolExpression, of course) is:
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
if not IsBoolean(Look) then Expected(’Boolean Literal’);
if GetBoolean then
EmitLn(’MOVE #-1,D0’)
else
EmitLn(’CLR D0’);
end;
{---------------------------------------------------------------}
Add this procedure to your parser, and call it from the main pro-
gram (replacing the print statement you had just put there). As you
can see, we still don’t have much of a parser, but the output code is
starting to look more realistic.
Next, of course, we have to expand the definition of a Boolean
expression. We already have the BNF rule:
<b-expression> ::= <b-term> [<orop> <b-term>]*
{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }
procedure BoolOr;
begin
Match(’|’);
BoolTerm;
EmitLn(’OR (SP)+,D0’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }
procedure BoolXor;
begin
Match(’~’);
BoolTerm;
EmitLn(’EOR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’|’: BoolOr;
’~’: BoolXor;
end;
end;
end;
{---------------------------------------------------------------}
Note the new recognizer IsOrOp, which is also a copy, this time of
IsAddOp:
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): Boolean;
begin
IsOrop := c in [’|’, ’~’];
end;
{--------------------------------------------------------------}
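BoolOr and BoolXor call BoolTerm, which isn't reproduced in this excerpt. A sketch of that missing level, assuming '&' stands for AND by analogy with the '|' and '~' used above (in its final form it calls NotFactor, which appears next):
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Term }
procedure BoolTerm;
begin
   NotFactor;
   while Look = '&' do begin
      EmitLn('MOVE D0,-(SP)');
      Match('&');
      NotFactor;
      EmitLn('AND (SP)+,D0');
   end;
end;
{--------------------------------------------------------------}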
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with NOT }
procedure NotFactor;
begin
if Look = ’!’ then begin
Match(’!’);
BoolFactor;
EmitLn(’EOR #-1,D0’);
end
else
BoolFactor;
end;
{--------------------------------------------------------------}
If you’ve been following what we did in the parser for math expres-
sions, you know that what we did next was to expand the definition
of a factor to include variables and parens. We don’t have to do that
for the Boolean factor, because those little items get taken care of by
the next step. It takes just a one line addition to BoolFactor to take
care of relations:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Factor }
procedure BoolFactor;
begin
if IsBoolean(Look) then
if GetBoolean then
EmitLn(’MOVE #-1,D0’)
else
EmitLn(’CLR D0’)
else Relation;
end;
{--------------------------------------------------------------}
Of course, it would help to have some code for Relation. I don’t feel
comfortable, though, adding any more code without first checking out
what we already have. So for now let’s just write a dummy version
of Relation that does nothing except eat the current character, and
write a little message:
{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
WriteLn(’<Relation>’);
GetChar;
end;
{--------------------------------------------------------------}
OK, key in this code and give it a try. All the old things should
still work ... you should be able to generate the code for ANDs,
ORs, and NOTs. In addition, if you type any alphabetic character
you should get a little <Relation> place-holder, where a Boolean
factor should be. Did you get that? Fine, then let’s move on to the
full-blown version of Relation.
To get that, though, there is a bit of groundwork that we must lay
first. Recall that a relation has the form
<relation> ::= <expression> [<relop> <expression>]
so the first thing we’ll need is a recognizer for the relops themselves:
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): Boolean;
begin
IsRelop := c in [’=’, ’#’, ’<’, ’>’];
end;
{--------------------------------------------------------------}
Comparing numeric data is easy enough ... the 68000 has an op-
eration for that ... but it sets the flags, not a value. What’s more,
the flags will always be set the same (zero if equal, etc.), while we
need the zero flag set differently for each of the different relops.
The solution is found in the 68000 instruction Scc, which sets a
byte value to 0000 or FFFF (funny how that works!) depending upon
the result of the specified condition. If we make the destination byte
to be D0, we get the Boolean value needed.
Unfortunately, there’s one final complication: unlike almost every
other instruction in the 68000 set, Scc does NOT reset the condition
flags to match the data being stored. So we have to do one last step,
which is to test D0 and set the flags to match it. It must seem to be a
trip around the moon to get what we want: we first perform the test,
then test the flags to set data into D0, then test D0 to set the flags
again. It is sort of roundabout, but it’s the most straightforward way
to get the flags right, and after all it’s only a couple of instructions.
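To make that round trip concrete, here is roughly what the routines below emit for a lone relation such as x=y (variable names for illustration only): the first expression lands in D0 and gets pushed, the second expression lands in D0, the CMP performs the test, the Scc turns the flags into a Boolean in D0, and the TST sets the flags to match it:
MOVE X(PC),D0
MOVE D0,-(SP)
MOVE Y(PC),D0
CMP (SP)+,D0
SEQ D0
TST D0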
I might mention here that this area is, in my opinion, the one that
represents the biggest difference between the efficiency of hand-coded
assembler language and compiler-generated code. We have seen al-
ready that we lose efficiency in arithmetic operations, although later
I plan to show you how to improve that a bit. We’ve also seen that
the control constructs themselves can be done quite efficiently ... it’s
usually very difficult to improve on the code generated for an IF or
a WHILE. But virtually every compiler I’ve ever seen generates ter-
rible code, compared to assembler, for the computation of a Boolean
function, and particularly for relations. The reason is just what I’ve
hinted at above. When I’m writing code in assembler, I go ahead and
perform the test the most convenient way I can, and then set up the
branch so that it goes the way it should. In effect, I “tailor” every
branch to the situation. The compiler can’t do that (practically),
and it also can’t know that we don’t want to store the result of the
test as a Boolean variable. So it must generate the code in a very
strict order, and it often ends up loading the result as a Boolean that
never gets used for anything.
In any case, we’re now ready to look at the code for Relation. It’s
shown below with its companion procedures:
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }
procedure Equals;
begin
Match(’=’);
Expression;
EmitLn(’CMP (SP)+,D0’);
EmitLn(’SEQ D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
procedure NotEquals;
begin
Match(’#’);
Expression;
EmitLn(’CMP (SP)+,D0’);
EmitLn(’SNE D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }
procedure Less;
begin
Match(’<’);
Expression;
EmitLn(’CMP (SP)+,D0’);
EmitLn(’SGE D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }
procedure Greater;
begin
Match(’>’);
Expression;
EmitLn(’CMP (SP)+,D0’);
EmitLn(’SLE D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
Expression;
if IsRelop(Look) then begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’=’: Equals;
’#’: NotEquals;
’<’: Less;
’>’: Greater;
end;
EmitLn(’TST D0’);
end;
end;
{---------------------------------------------------------------}
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
Name:= GetName;
if Look = ’(’ then begin
Match(’(’);
Match(’)’);
EmitLn(’BSR ’ + Name);
end
else
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
Ident
else
EmitLn(’MOVE #’ + GetNum + ’,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }
procedure SignedFactor;
begin
if Look = ’+’ then
GetChar;
if Look = ’-’ then begin
GetChar;
if IsDigit(Look) then
EmitLn(’MOVE #-’ + GetNum + ’,D0’)
else begin
Factor;
EmitLn(’NEG D0’);
end;
end
else Factor;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
EmitLn(’MULS (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
EmitLn(’MOVE (SP)+,D1’);
EmitLn(’EXT.L D0’);
EmitLn(’DIVS D1,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
SignedFactor;
while Look in [’*’, ’/’] do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’*’: Multiply;
’/’: Divide;
end;
end;
end;
{---------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
Term;
while IsAddop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
end;
end;
end;
{---------------------------------------------------------------}
There you have it ... a parser that can handle both arithmetic
AND Boolean algebra, and things that combine the two through the
use of relops. I suggest you file away a copy of this parser in a safe
place for future reference, because in our next step we’re going to be
chopping it up.
MERGING WITH CONTROL CONSTRUCTS
At this point, let’s go back to the file we had previously built that
parses control constructs. Remember those little dummy procedures
called Condition and Expression? Now you know what goes in their
places!
I warn you, you’re going to have to do some creative editing here,
so take your time and get it right. What you need to do is to copy
all of the procedures from the logic parser, from Ident through Bool-
Expression, into the parser for control constructs. Insert them at the
current location of Condition. Then delete that procedure, as well
as the dummy Expression. Next, change every call to Condition to
refer to BoolExpression instead. Finally, copy the procedures IsMu-
lop, IsOrOp, IsRelop, IsBoolean, and GetBoolean into place. That
should do it.
Compile the resulting program and give it a try. Since we haven’t
used this program in a while, don’t forget that we used single-character
tokens for IF, WHILE, etc. Also don’t forget that any letter not a
keyword just gets echoed as a block.
Try
ia=bxlye
ADDING ASSIGNMENTS
As long as we’re this far, and we already have the routines for ex-
pressions in place, we might as well replace the “blocks” with real
assignment statements. We’ve already done that before, so it won’t
be too hard. Before taking that step, though, we need to fix some-
thing else.
We’re soon going to find that the one-line “programs” that we’re
having to write here will really cramp our style. At the moment we
have no cure for that, because our parser doesn’t recognize the end-
of-line characters, the carriage return (CR) and the line feed (LF).
So before going any further let’s plug that hole.
There are a couple of ways to deal with the CR/LFs. One (the
C/Unix approach) is just to treat them as additional white space
characters and ignore them. That’s actually not such a bad ap-
proach, but it does sort of produce funny results for our parser as
it stands now. If it were reading its input from a source file as any
self-respecting REAL compiler does, there would be no problem. But
we’re reading input from the keyboard, and we’re sort of conditioned
to expect something to happen when we hit the return key. It won’t,
if we just skip over the CR and LF (try it). So I’m going to use a
different method here, which is NOT necessarily the best approach
in the long run. Consider it a temporary kludge until we’re further
along.
Instead of skipping the CR/LF, we’ll let the parser go ahead and
catch them, then introduce a special procedure, analogous to Skip-
White, that skips them only in specified “legal” spots.
Here’s the procedure:
{--------------------------------------------------------------}
{ Skip a CRLF }
procedure Fin;
begin
if Look = CR then GetChar;
if Look = LF then GetChar;
end;
{--------------------------------------------------------------}
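One "legal spot" is between statements. The reduced parser near the end of this document slots Fin into its Block in exactly that way, and the same pattern applies to the fuller Block used here; roughly:
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
   while not(Look in ['e', 'l']) do begin
      case Look of
       'i': DoIf;
       CR: while Look = CR do
              Fin;            { swallow line breaks between statements }
      else Assignment;
      end;
   end;
end;
{--------------------------------------------------------------}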
Now, you’ll find that you can use multiple-line “programs.” The
only restriction is that you can’t separate an IF or WHILE token
from its predicate.
Now we’re ready to include the assignment statements. Simply
change that call to Other in procedure Block to a call to Assignment,
and add the following procedure, copied from one of our earlier pro-
grams. Note that Assignment now calls BoolExpression, so that we
can assign Boolean variables.
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
BoolExpression;
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’);
end;
{--------------------------------------------------------------}
time, when you’ve had a little more time to digest what we’ve done
and are ready to start fresh.
In the next installment, then, we’ll build a lexical scanner and
eliminate the single-character barrier once and for all. We’ll also
write our first complete compiler, based on what we’ve done in this
session. See you then.
Part VII
7 November 1988.
LEXICAL SCANNING
INTRODUCTION
In the last installment, I left you with a compiler that would AL-
MOST work, except that we were still limited to single-character
tokens. The purpose of this session is to get rid of that restriction,
once and for all. This means that we must deal with the concept of
the lexical scanner.
Maybe I should mention why we need a lexical scanner at all ...
after all, we’ve been able to manage all right without one, up till now,
even when we provided for multi-character tokens.
The ONLY reason, really, has to do with keywords. It’s a fact of
computer life that the syntax for a keyword has the same form as that
for any other identifier. We can’t tell until we get the complete word
whether or not it IS a keyword. For example, the variable IFILE and
the keyword IF look just alike, until you get to the third character. In
the examples to date, we were always able to make a decision based
upon the first character of the token, but that’s no longer possible
when keywords are present. We need to know that a given string is
a keyword BEFORE we begin to process it. And that’s why we need
a scanner.
In the last session, I also promised that we would be able to provide
for normal tokens without making wholesale changes to what we have
already done. I didn’t lie ... we can, as you will see later. But every
time I set out to install these elements of the software into the parser
we have already built, I had bad feelings about it. The whole thing
felt entirely too much like a band-aid. I finally figured out what
was causing the problem: I was installing lexical scanning software
without first explaining to you what scanning is all about, and what
the alternatives are. Up till now, I have studiously avoided giving
you a lot of theory, and certainly not alternatives. I generally don’t
respond well to the textbooks that give you twenty-five different ways
to do something, but no clue as to which way best fits your needs.
I’ve tried to avoid that pitfall by just showing you ONE method, that
WORKS.
But this is an important area. While the lexical scanner is hardly
the most exciting part of a compiler, it often has the most profound
effect on the general “look & feel” of the language, since after all it’s
the part closest to the user. I have a particular structure in mind
for the scanner to be used with KISS. It fits the look & feel that I
want for that language. But it may not work at all for the language
YOU’RE cooking up, so in this one case I feel that it’s important for
you to know your options.
So I’m going to depart, again, from my usual format. In this ses-
sion we’ll be getting much deeper than usual into the basic theory
of languages and grammars. I’ll also be talking about areas OTHER
than compilers in which lexical scanning plays an important role. Fi-
nally, I will show you some alternatives for the structure of the lexical
scanner. Then, and only then, will we get back to our parser from
the last installment. Bear with me ... I think you’ll find it’s worth
the wait. In fact, since scanners have many applications outside of
compilers, you may well find this to be the most useful session for
you.
LEXICAL SCANNING
• Type 1: Context-Sensitive
• Type 2: Context-Free
• Type 3: Regular
Using this let’s write the following two routines, which are very
similar to those we’ve used before:
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var x: string[8];
begin
x := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
x := x + UpCase(Look);
GetChar;
end;
GetName := x;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: string;
var x: string[16];
begin
x := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
x := x + Look;
GetChar;
end;
GetNum := x;
end;
{--------------------------------------------------------------}
This program will print any legal name typed in (maximum eight
characters, since that’s what we told GetName). It will reject any-
thing else.
Test the other routine similarly.
WHITE SPACE
We also have dealt with embedded white space before, using the two
routines IsWhite and SkipWhite. Make sure that these routines are
in your current version of the cradle, and add the line
SkipWhite;
(You will have to add the declaration of the string Token at the
beginning of the program. Make it any convenient length, say 16
characters.)
Now, run the program. Note how the input string is, indeed,
separated into distinct tokens.
STATE MACHINES
For the record, a parse routine like GetName does indeed implement
a state machine. The state is implicit in the current position in the
code. A very useful trick for visualizing what’s going on is the syntax
diagram, or “railroad-track” diagram. It’s a little difficult to draw
one in this medium, so I’ll use them very sparingly, but the figure
below should give you the idea:
As you can see, this diagram shows how the logic flows as char-
acters are read. Things begin, of course, in the start state, and end
when a character other than an alphanumeric is found. If the first
character is not alpha, an error occurs. Otherwise the machine will
continue looping until the terminating delimiter is found.
Note that at any point in the flow, our position is entirely depen-
dent on the past history of the input characters. At that point, the
action to be taken depends only on the current state, plus the current
input character. That’s what makes this a state machine.
Because of the difficulty of drawing railroad-track diagrams in this
medium, I’ll continue to stick to syntax equations from now on. But
I highly recommend the diagrams to you for anything you do that
involves parsing. After a little practice you can begin to see how to
write a parser directly from the diagrams. Parallel paths get coded
into guarded actions (guarded by IF’s or CASE statements), serial
paths into sequential calls. It’s almost like working from a schematic.
We didn’t even discuss SkipWhite, which was introduced earlier,
but it also is a simple state machine, as is GetNum. So is their parent
procedure, Scan. Little machines make big machines.
The neat thing that I’d like you to note is how painlessly this
implicit approach creates these state machines. I personally prefer it
a lot over the table-driven approach. It also results in a small, tight,
and fast scanner.
NEWLINES
Moving right along, let’s modify our scanner to handle more than
one line. As I mentioned last time, the most straightforward way to
do this is to simply treat the newline characters, carriage return and
line feed, as white space. This is, in fact, the way the C standard
library routine isspace works. We didn’t actually try this before.
I’d like to do it now, so you can get a feel for the results.
To do this, simply modify the single executable line of IsWhite to
read:
IsWhite := c in [' ', TAB, CR, LF];
OK, compile this program and run it. Try a couple of lines, ter-
minated by the period. I used:
Hey, what happened? When I tried it, I didn’t get the last token,
the period. The program didn’t halt. What’s more, when I pressed
the ’enter’ key a few times, I still didn’t get the period.
If you’re still stuck in your program, you’ll find that typing a
period on a new line will terminate it.
What’s going on here? The answer is that we’re hanging up in
SkipWhite. A quick look at that routine will show that as long as
we’re typing null lines, we’re going to just continue to loop. After
SkipWhite encounters an LF, it tries to execute a GetChar. But since
the input buffer is now empty, GetChar’s read statement insists on
having another line. Procedure Scan gets the terminating period, all
right, but it calls SkipWhite to clean up, and SkipWhite won’t return
until it gets a non-null line.
This kind of behavior is not quite as bad as it seems. In a real
compiler, we’d be reading from an input file instead of the console,
Note the “guard” test preceding the call to Fin. That’s what
makes the whole thing work, and ensures that we don’t try to read
a line ahead.
Try the code now. I think you’ll like it better.
If you refer to the code we did in the last installment, you’ll find
that I quietly sprinkled calls to Fin throughout the code, wherever
a line break was appropriate. This is one of those areas that really
affects the look & feel that I mentioned. At this point I would urge
you to experiment with different arrangements and see how you like
them. If you want your language to be truly free-field, then newlines
should be transparent. In this case, the best approach is to put the
following lines at the BEGINNING of Scan:
while Look = CR do
Fin;
If, on the other hand, you want a line-oriented language like As-
sembler, BASIC, or FORTRAN (or even Ada... note that it has
comments terminated by newlines), then you’ll need for Scan to re-
turn CR’s as tokens. It must also eat the trailing LF. The best way
to do that is to use this line, again at the beginning of Scan:
if Look = LF then Fin;
OPERATORS
We could stop now and have a pretty useful scanner for our purposes.
In the fragments of KISS that we’ve built so far, the only tokens
that have multiple characters are the identifiers and numbers. All
operators were single characters. The only exception I can think of is
the relops <=, >=, and <>, but they could be dealt with as special
cases.
Still, other languages have multi-character operators, such as the
’:=’ of Pascal or the ’++’ and ’>>’ of C. So while we may not
need multi-character operators, it’s nice to know how to get them if
necessary.
Needless to say, we can handle operators very much the same way
as the other tokens. Let’s start with a recognizer:
{--------------------------------------------------------------}
{ Recognize Any Operator }
function IsOp(c: char): boolean;
begin
IsOp := c in [’+’, ’-’, ’*’, ’/’, ’<’, ’>’, ’:’, ’=’];
end;
{--------------------------------------------------------------}
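The Scan routine below also calls GetOp, which fell outside this excerpt; a sketch built along the same lines as GetName and GetNum above:
{--------------------------------------------------------------}
{ Get an Operator }
function GetOp: string;
var x: string[8];
begin
   x := '';
   if not IsOp(Look) then Expected('Operator');
   while IsOp(Look) do begin
      x := x + Look;
      GetChar;
   end;
   GetOp := x;
end;
{--------------------------------------------------------------}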
{--------------------------------------------------------------}
{ Lexical Scanner }
Function Scan: string;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then
Scan := GetName
else if IsDigit(Look) then
Scan := GetNum
else if IsOp(Look) then
Scan := GetOp
else begin
Scan := Look;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
Try the program now. You will find that any code fragments you
care to throw at it will be neatly broken up into individual tokens.
LISTS, COMMAS AND COMMAND LINES
Before getting back to the main thrust of our study, I’d like to get
on my soapbox for a moment.
How many times have you worked with a program or operating
system that had rigid rules about how you must separate items in a
list? (Think back to the last time you used MSDOS!) Some programs require
spaces as delimiters, and some require commas. Worst of all, some
require both, in different places. Most are pretty unforgiving about
violations of their rules.
I think this is inexcusable. It’s too easy to write a parser that
will handle both spaces and commas in a flexible way. Consider the
following procedure:
{--------------------------------------------------------------}
{ Skip Over a Comma }
procedure SkipComma;
begin
SkipWhite;
if Look = ’,’ then begin
GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}
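As a hypothetical usage example (GetItem below is a made-up placeholder, not a routine from the text): wherever you would otherwise insist on Match(','), call SkipComma instead, and the user can separate items with a comma, blanks, or both:
{--------------------------------------------------------------}
{ Read a List of Items Separated by Commas and/or Blanks }
procedure GetList;
begin
   repeat
      GetItem;        { hypothetical routine to read one item }
      SkipComma;      { eat an optional comma and any blanks }
   until Look = CR;   { assume the list ends at the end of the line }
end;
{--------------------------------------------------------------}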
GETTING FANCY
OK, at this point we have a pretty nice lexical scanner that will
break an input stream up into tokens. We could use it as it stands
and have a serviceable compiler. But there are some other aspects of
lexical scanning that we need to cover.
The main consideration is <shudder> efficiency. Remember when
we were dealing with single-character tokens, every test was a com-
parison of a single character, Look, with a byte constant. We also
used the Case statement heavily.
With the multi-character tokens being returned by Scan, all those
tests now become string comparisons. Much slower. And not only
slower, but more awkward, since there is no string equivalent of the
Case statement in Pascal. It seems especially wasteful to test for what
used to be single characters ... the ’=’, ’+’, and other operators ...
using string comparisons.
Using string comparison is not impossible ... Ron Cain used just
that approach in writing Small C. Since we’re sticking to the KISS
For this reason, most compiler writers ask the lexical scanner to
do a little more work, by “tokenizing” the input stream. The idea
is to match every token against a list of acceptable keywords and
operators, and return unique codes for each one recognized. In the
case of ordinary variable names or numbers, we just return a code
that says what kind of token they are, and save the actual string
somewhere else.
Table[1] := ’IF’;
Table[2] := ’ELSE’;
.
.
Table[n] := ’END’;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
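The Lookup function that the test below exercises also fell outside this excerpt. It is just a linear search of the keyword table; a sketch, with the parameter order taken from the calls that appear later in this chapter:
{--------------------------------------------------------------}
{ Table Lookup }
{ If the input string matches a table entry, return the entry }
{ index.  If not, return a zero. }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
    found: boolean;
begin
   found := false;
   i := n;
   while (i > 0) and not found do
      if s = T^[i] then
         found := true
      else
         Dec(i);
   Lookup := i;
end;
{--------------------------------------------------------------}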
To test it, you can temporarily change the main program as fol-
lows:
{--------------------------------------------------------------}
{ Main Program }
begin
ReadLn(Token);
WriteLn(Lookup(Addr(KWList), Token, 4));
end.
{--------------------------------------------------------------}
What we’ve done here is to replace the string Token used earlier
with an enumerated type. Scan returns the type in variable Token,
and returns the string itself in the new variable Value.
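The declarations behind this are not reproduced in the excerpt. Judging from how Token and Value are used in the procedures below, they look something like the following (the exact spelling of the keyword symbols is an assumption):
{--------------------------------------------------------------}
{ Type and Variable Declarations for the Tokenized Scanner }
type SymType = (IfSym, ElseSym, EndifSym, EndSym,
                Ident, Number, Operator);
var Token: SymType;        { encoded token type }
    Value: string[16];     { text of the token }
{--------------------------------------------------------------}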
OK, compile this and give it a whirl. If everything goes right, you
should see that we are now recognizing keywords.
What we have now is working right, and it was easy to generate
from what we had earlier. However, it still seems a little “busy”
to me. We can simplify things a bit by letting GetName, GetNum,
GetOp, and Scan be procedures working with the global variables
Token and Value, thereby eliminating the local copies. It also seems
a little cleaner to move the table lookup into GetName. The new
form for the four procedures is, then:
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
var k: integer;
begin
Value := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
k := Lookup(Addr(KWlist), Value, 4);
if k = 0 then
Token := Ident
else
Token := SymType(k-1);
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
Value := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := Number;
end;
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Value := ’’;
if not IsOp(Look) then Expected(’Operator’);
while IsOp(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := Operator;
end;
{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then
GetName
else if IsDigit(Look) then
GetNum
else if IsOp(Look) then
GetOp
else begin
Value := Look;
Token := Operator;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
RETURNING A CHARACTER
Essentially every scanner I’ve ever seen that was written in Pascal
used the mechanism of an enumerated type that I’ve just described.
It is certainly a workable mechanism, but it doesn’t seem the simplest
approach to me.
For one thing, the list of possible symbol types can get pretty long.
Here, I’ve used just one symbol, “Operator,” to stand for all of the
operators, but I’ve seen other designs that actually return different
codes for each one.
There is, of course, another simple type that can be returned as
a code: the character. Instead of returning the enumeration value
’Operator’ for a ’+’ sign, what’s wrong with just returning the char-
acter itself? A character is just as good a variable for encoding the
different token types, it can be used in case statements easily, and
it’s sure a lot easier to type. What could be simpler?
Besides, we’ve already had experience with the idea of encoding
keywords as single characters. Our previous programs are already
written that way, so using this approach will minimize the changes
to what we’ve already done.
Some of you may feel that this idea of returning character codes
is too mickey-mouse. I must admit it gets a little awkward for multi-
character operators like ’<=’. If you choose to stay with the enu-
merated type, fine. For the rest, I’d like to show you how to change
what we’ve done above to support that approach.
First, you can delete the SymType declaration now ... we won’t
be needing that. And you can change the type of Token to char.
Next, to replace SymType, add the following constant string:
const KWcode: string[5] = 'xilee';
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
Value := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
Value := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := ’#’;
end;
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Value := ’’;
if not IsOp(Look) then Expected(’Operator’);
while IsOp(Look) do begin
Value := Value + Look;
GetChar;
end;
if Length(Value) = 1 then
Token := Value[1]
else
Token := ’?’;
end;
{--------------------------------------------------------------}
{ Lexical Scanner }
procedure Scan;
var k: integer;
begin
while Look = CR do
Fin;
if IsAlpha(Look) then
GetName
else if IsDigit(Look) then
GetNum
else if IsOp(Look) then
GetOp
else begin
Value := Look;
Token := ’?’;
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Scan;
case Token of
’x’: write(’Ident ’);
’#’: Write(’Number ’);
’i’, ’l’, ’e’: Write(’Keyword ’);
else Write(’Operator ’);
end;
Writeln(Value);
This program should work the same as the previous version. A mi-
nor difference in structure, maybe, but it seems more straightforward
to me.
DISTRIBUTED VS. CENTRALIZED SCANNERS
The structure for the lexical scanner that I’ve just shown you is very
conventional, and about 99% of all compilers use something very
close to it. This is not, however, the only possible structure, or even
always the best one.
The problem with the conventional approach is that the scanner
has no knowledge of context. For example, it can’t distinguish be-
tween the assignment operator ’=’ and the relational operator ‘=’
(perhaps that’s why both C and Pascal use different strings for the
two). All the scanner can do is to pass the operator along to the
parser, which can hopefully tell from the context which operator is
meant. Similarly, a keyword like ’IF’ has no place in the middle of
a math expression, but if one happens to appear there, the scanner
will see no problem with it, and will return it to the parser, properly
encoded as an ’IF’.
With this kind of approach, we are not really using all the infor-
mation at our disposal. In the middle of an expression, for example,
the parser “knows” that there is no need to look for keywords, but it
has no way of telling the scanner that. So the scanner continues to
do so. This, of course, slows down the compilation.
In real-world compilers, the designers often arrange for more in-
formation to be passed between parser and scanner, just to avoid this
kind of problem. But that can get awkward, and certainly destroys
a lot of the modularity of the structure.
The alternative is to seek some way to use the contextual infor-
mation that comes from knowing where we are in the parser. This
leads us back to the notion of a distributed scanner, in which various
portions of the scanner are called depending upon the context.
In KISS, as in most languages, keywords ONLY appear at the
beginning of a statement. In places like expressions, they are not
allowed.
MERGING SCANNER AND PARSER
Now that we’ve covered all of the theory and general aspects of lex-
ical scanning that we’ll be needing, I’m FINALLY ready to back up
my claim that we can accommodate multi-character tokens with min-
imal change to our previous work. To keep things short and simple
I will restrict myself here to a subset of what we’ve done before; I’m
allowing only one control construct (the IF) and no Boolean expres-
sions. That’s enough to demonstrate the parsing of both keywords
and expressions. The extension to the full set of constructs should
be pretty apparent from what we’ve already done.
All the elements of the program to parse this subset, using single-
character tokens, exist already in our previous programs. I built it
by judicious copying of these files, but I wouldn’t dare try to lead
you through that process. Instead, to avoid any confusion, the whole
program is shown below:
{--------------------------------------------------------------}
program KISS;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
LF = ^J;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Lcount: integer; { Label Counter }
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’, ’/’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look <> x then Expected(’’’’ + x + ’’’’);
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Skip a CRLF }
procedure Fin;
begin
if Look = CR then GetChar;
if Look = LF then GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
while Look = CR do
Fin;
if not IsAlpha(Look) then Expected(’Name’);
Getname := UpCase(Look);
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNum := Look;
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := ’L’ + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ’:’);
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
var Name: char;
begin
Name := GetName;
if Look = ’(’ then begin
Match(’(’);
Match(’)’);
EmitLn(’BSR ’ + Name);
end
else
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
Ident
else
EmitLn(’MOVE #’ + GetNum + ’,D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate the First Math Factor }
procedure SignedFactor;
var s: boolean;
begin
s := Look = ’-’;
if IsAddop(Look) then begin
GetChar;
SkipWhite;
end;
Factor;
if s then
EmitLn(’NEG D0’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
EmitLn(’MULS (SP)+,D0’);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
EmitLn(’MOVE (SP)+,D1’);
EmitLn(’EXT.L D0’);
EmitLn(’DIVS D1,D0’);
end;
{---------------------------------------------------------------}
{ Completion of Term Processing (called by Term and FirstTerm) }
procedure Term1;
begin
while IsMulop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’*’: Multiply;
’/’: Divide;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
Factor;
Term1;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term with Possible Leading Sign }
procedure FirstTerm;
begin
SignedFactor;
Term1;
end;
{---------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
FirstTerm;
while IsAddop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
Procedure Condition;
begin
EmitLn(’Condition’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
Match(’i’);
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn(’BEQ ’ + L1);
Block;
if Look = ’l’ then begin
Match(’l’);
L2 := NewLabel;
EmitLn(’BRA ’ + L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
Match(’e’);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
Expression;
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
while not(Look in [’e’, ’l’]) do begin
case Look of
’i’: DoIf;
CR: while Look = CR do
Fin;
else Assignment;
end;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure DoProgram;
begin
Block;
if Look <> ’e’ then Expected(’END’);
EmitLn(’END’)
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
LCount := 0;
GetChar;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DoProgram;
end.
{--------------------------------------------------------------}
A couple of comments:
Before we proceed to adding the scanner, first copy this file and
verify that it does indeed parse things correctly. Don’t forget the
“codes”: ’i’ for IF, ’l’ for ELSE, and ’e’ for END or ENDIF.
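As a quick sanity check (my example, not from the listing; it assumes a freshly
initialized label counter), the input ia=1ee — an IF containing one assignment,
then ENDIF and END — should produce output along these lines:
	Condition
	BEQ L0
	MOVE #1,D0
	LEA A(PC),A0
	MOVE D0,(A0)
L0:
	END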
If the program works, then let’s press on. In adding the scanner
modules to the program, it helps to have a systematic plan. In all the
parsers we’ve written to date, we’ve stuck to a convention that the
current lookahead character should always be a non-blank character.
We preload the lookahead character in Init, and keep the “pump
primed” after that. To keep the thing working right at newlines, we
had to modify this a bit and treat the newline as a legal token.
In the multi-character version, the rule is similar: The current
lookahead character should always be left at the BEGINNING of the
next token, or at a newline.
The multi-character version is shown next. To get it, I’ve made
the following changes:
• Added the variables Token and Value, and the type definitions
needed by Lookup.
{--------------------------------------------------------------}
program KISS;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
LF = ^J;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Token : char; { Encoded Token }
Value : string[16]; { Unencoded Token }
Lcount: integer; { Label Counter }
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const KWlist: array [1..4] of Symbol =
(’IF’, ’ELSE’, ’ENDIF’, ’END’);
const KWcode: string[5] = ’xilee’;
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’, ’/’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look <> x then Expected(’’’’ + x + ’’’’);
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Skip a CRLF }
procedure Fin;
begin
if Look = CR then GetChar;
if Look = LF then GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
found: boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
while Look = CR do
Fin;
if not IsAlpha(Look) then Expected(’Name’);
Value := ’’;
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
if not IsDigit(Look) then Expected(’Integer’);
Value := ’’;
while IsDigit(Look) do begin
Value := Value + Look;
GetChar;
end;
Token := ’#’;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }
procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, 4) + 1];
end;
{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected(’’’’ + x + ’’’’);
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := ’L’ + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ’:’);
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Identifier }
procedure Ident;
begin
GetName;
if Look = ’(’ then begin
Match(’(’);
Match(’)’);
EmitLn(’BSR ’ + Value);
end
else
EmitLn(’MOVE ’ + Value + ’(PC),D0’);
end;
{---------------------------------------------------------------}
procedure FirstTerm;
begin
SignedFactor;
Term1;
end;
{---------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match(’+’);
Term;
EmitLn(’ADD (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match(’-’);
Term;
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
FirstTerm;
while IsAddop(Look) do begin
EmitLn(’MOVE D0,-(SP)’);
case Look of
’+’: Add;
’-’: Subtract;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Condition }
{ This version is a dummy }
Procedure Condition;
begin
EmitLn(’Condition’);
end;
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
Condition;
L1 := NewLabel;
L2 := L1;
EmitLn(’BEQ ’ + L1);
Block;
if Token = ’l’ then begin
L2 := NewLabel;
EmitLn(’BRA ’ + L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString(’ENDIF’);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := Value;
Match(’=’);
Expression;
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’);
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Statement Block }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}
CONCLUSION
At this point, you have learned how to parse and generate code for
expressions, Boolean expressions, and control structures. You have
now learned how to develop lexical scanners, and how to incorporate
their elements into a translator. You have still not seen ALL the
elements combined into one program, but on the basis of what we’ve
done before you should find it a straightforward matter to extend our
earlier programs to include scanners.
We are very close to having all the elements that we need to build a
real, functional compiler. There are still a few things missing, notably
procedure calls and type definitions. We will deal with those in the
next few sessions. Before doing so, however, I thought it would be
fun to turn the translator above into a true compiler. That’s what
we’ll be doing in the next installment.
Up till now, we’ve taken a rather bottom-up approach to parsing,
beginning with low-level constructs and working our way up. In the
next installment, I’ll also be taking a look from the top down, and
we’ll discuss how the structure of the translator is altered by changes
in the language definition.
See you then.
Part VIII
2 April 1989.
A LITTLE PHILOSOPHY
INTRODUCTION
I’ve also been toying for years with the idea of a HOL-like assem-
bler, with structured control constructs and HOL-like assignment
statements. That, in fact, was the impetus behind my original foray
into the jungles of compiler theory. This one may never be built,
simply because I’ve learned that it’s actually easier to implement a
language like KISS, that only uses a subset of the CPU instructions.
As you know, assembly language can be bizarre and irregular in the
extreme, and a language that maps one-for-one onto it can be a real
challenge. Still, I’ve always felt that the syntax used in conventional
assemblers is dumb ... why is
MOVE.L A,B
any better than
B=A ?
WHY IS IT SO SIMPLE?
Yet the things we have done here have usually turned out to be quite
simple, sometimes even trivial.
For a while, I thought it was simply because I hadn’t yet gotten
into the meat of the subject. I had only covered the simple parts. I
will freely admit to you that, even when I began the series, I wasn’t
sure how far we would be able to go before things got too complex to
deal with in the ways we have so far. But at this point I’ve already
been down the road far enough to see the end of it. Guess what?
Since the series began I’ve received some comments from you.
Most of them echo my own thoughts: “This is easy! Why do the
textbooks make it seem so hard?” Good question.
Recently, I’ve gone back and looked at some of those texts again,
and even bought and read some new ones. Each time, I come away
with the same feeling: These guys have made it seem too hard.
What’s going on here? Why does the whole thing seem difficult
in the texts, but easy to us? Are we that much smarter than Aho,
Ullman, Brinch Hansen, and all the rest?
Hardly. But we are doing some things differently, and more and
more I’m starting to appreciate the value of our approach, and the
way that it simplifies things. Aside from the obvious shortcuts that
I outlined in Part I, like single-character tokens and console I/O, we
have made some implicit assumptions and done some things differ-
ently from those who have designed compilers in the past. As it turns
out, our approach makes life a lot easier.
So why didn’t all those other guys use it?
You have to remember the context of some of the earlier compiler
development. These people were working with very small computers
of limited capacity. Memory was very limited, the CPU instruction
set was minimal, and programs ran in batch mode rather than inter-
actively. As it turns out, these caused some key design decisions that
have really complicated the designs. Until recently, I hadn’t realized
how much of classical compiler design was driven by the available
hardware.
Even in cases where these limitations no longer apply, people have
tended to structure their programs in the same way, since that is the
way they were taught to do it.
In our case, we have started with a blank sheet of paper. There is
a danger there, of course, that you will end up falling into traps that
other people have long since learned to avoid. But it also has allowed
us to take different approaches that, partly by design and partly by
pure dumb luck, have allowed us to gain simplicity.
Here are the areas that I think have led to complexity in the past:
• Batch Processing
In the early days, batch processing was the only choice ... there
was no interactive computing. Even today, compilers run in
essentially batch mode.
In a mainframe compiler as well as many micro compilers, con-
siderable effort is expended on error recovery ... it can consume
as much as 30-40% of the compiler and completely drive the de-
sign. The idea is to avoid halting on the first error, but rather
to keep going at all costs, so that you can tell the programmer
about as many errors in the whole program as possible.
All of that harks back to the days of the early mainframes, where
turnaround time was measured in hours or days, and it was im-
portant to squeeze every last ounce of information out of each
run.
In this series, I’ve been very careful to avoid the issue of er-
ror recovery, and instead our compiler simply halts with an error
message on the first error. I will frankly admit that it was mostly
because I wanted to take the easy way out and keep things sim-
ple. But this approach, pioneered by Borland in Turbo Pascal,
also has a lot going for it anyway. Aside from keeping the com-
piler simple, it also fits very well with the idea of an interactive
system. When compilation is fast, and especially when you have
an editor such as Borland’s that will take you right to the point
of the error, then it makes a lot of sense to stop there, and just
restart the compilation after the error is fixed.
• Large Programs
Early compilers were designed to handle large programs ... es-
sentially infinite ones. In those days there was little choice; the
idea of subroutine libraries and separate compilation were still
in the future. Again, this assumption led to multi-pass designs
and intermediate files to hold the results of partial processing.
Brinch Hansen’s stated goal was that the compiler should be
able to compile itself. Again, because of his limited RAM, this
drove him to a multi-pass design. He needed as little resident
compiler code as possible, so that the necessary tables and other
data structures would fit into RAM.
I haven’t stated this one yet, because there hasn’t been a need ...
we’ve always just read and written the data as streams, anyway.
But for the record, my plan has always been that, in a production
compiler, the source and object data should all coexist in RAM
with the compiler, a la the early Turbo Pascals. That’s why I’ve
been careful to keep routines like GetChar and Emit as separate
routines, in spite of their small size. It will be easy to change
them to read from and write to memory.
• Emphasis on Efficiency
John Backus has stated that, when he and his colleagues devel-
oped the original FORTRAN compiler, they KNEW that they
had to make it produce tight code. In those days, there was
a strong sentiment against HOLs and in favor of assembly lan-
guage, and efficiency was the reason. If FORTRAN didn’t pro-
duce very good code by assembly standards, the users would
simply refuse to use it. For the record, that FORTRAN com-
piler turned out to be one of the most efficient ever built, in
terms of code quality. But it WAS complex!
Today, we have CPU power and RAM size to spare, so code
efficiency is not so much of an issue. By studiously ignoring this
issue, we have indeed been able to Keep It Simple. Ironically,
though, as I have said, I have found some optimizations that we
can add to the basic compiler structure, without having to add
a lot of complexity. So in this case we get to have our cake and
eat it too: we will end up with reasonable code quality, anyway.
• Limited Instruction Sets
CONCLUSION
Part IX
16 April 1989.
A TOP VIEW
INTRODUCTION
begin
solve the problem
end
OK, I grant you that this doesn’t give much of a hint as to what
the next level is, but I like to write it down anyway, just to give me
that warm feeling that I am indeed starting at the top.
For our problem, the overall function of a compiler is to compile a
complete program. Any definition of the language, written in BNF,
begins here. What does the top level BNF look like? Well, that
depends quite a bit on the language to be translated. Let’s take a
look at Pascal.
To translate this, we’ll start with a fresh copy of the Cradle. Since
we’re back to single-character names, we’ll just use a ’p’ to stand for
’PROGRAM.’
To a fresh copy of the cradle, add the following code, and insert a
call to it from the main program:
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
var Name: char;
begin
Match(’p’); { Handles program header part }
Name := GetName;
Prolog;
Match(’.’);
Epilog(Name);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
EmitLn(’WARMST EQU $A01E’);
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog(Name: char);
begin
EmitLn(’DC WARMST’);
EmitLn(’END ’ + Name);
end;
{--------------------------------------------------------------}
As usual, add this code and try out the “compiler.” At this point,
there is only one legal input:
px.
(where x is any single letter, standing for the program name).
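If everything is wired up, that one-liner should compile into something like
this (my transcript, assuming the Prolog and Epilog shown above and the
program name X):
	WARMST EQU $A01E
	DC WARMST
	END X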
FLESHING IT OUT
To flesh out the compiler, we only have to deal with language features
one by one. I like to start with a stub procedure that does nothing,
then add detail in incremental fashion. Let’s begin by processing a
block, in accordance with its PDL above. We can do this in two
stages. First, add the null procedure:
{--------------------------------------------------------------}
{ Parse and Translate a Pascal Block }
procedure DoBlock(Name: char);
begin
end;
{--------------------------------------------------------------}
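The second stage is to call DoBlock from Prog. The modified Prog is not
reproduced here; a minimal sketch of the change (my reconstruction, which
assumes the call slots in between the prolog and the final period):
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
var Name: char;
begin
   Match('p');           { Handles program header part }
   Name := GetName;
   Prolog;
   DoBlock(Name);
   Match('.');
   Epilog(Name);
end;
{--------------------------------------------------------------}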
DECLARATIONS
(Note that I’m using the more liberal definition used by Turbo
Pascal. In the standard Pascal definition, each of these parts must
be in a specific order relative to the rest.)
As usual, let’s let a single character represent each of these decla-
ration types. The new form of Declarations is:
{--------------------------------------------------------------}
{ Parse and Translate the Declaration Part }
procedure Declarations;
begin
while Look in [’l’, ’c’, ’t’, ’v’, ’p’, ’f’] do
case Look of
’l’: Labels;
’c’: Constants;
’t’: Types;
’v’: Variables;
’p’: DoProcedure;
’f’: DoFunction;
end;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Process Label Statement }
procedure Labels;
begin
Match(’l’);
end;
{--------------------------------------------------------------}
{ Process Const Statement }
procedure Constants;
begin
Match(’c’);
end;
{--------------------------------------------------------------}
{ Process Type Statement }
procedure Types;
begin
Match(’t’);
end;
{--------------------------------------------------------------}
{ Process Var Statement }
procedure Variables;
begin
Match(’v’);
end;
{--------------------------------------------------------------}
{ Process Procedure Definition }
procedure DoProcedure;
begin
Match(’p’);
end;
{--------------------------------------------------------------}
{ Process Function Definition }
procedure DoFunction;
begin
Match(’f’);
end;
{--------------------------------------------------------------}
Now try out the compiler with a few representative inputs. You
can mix the declarations any way you like, as long as the last char-
acter in the program is ’.’ to indicate the end of the program. Of
course, none of the declarations actually declare anything, so you
don’t need (and can’t use) any characters other than those standing
for the keywords.
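For instance (my examples; they assume Prog now routes through DoBlock as
sketched earlier, and that DoBlock calls Declarations), inputs such as pxlcv.
or pxtfpv. should be accepted, with the output still just the prolog and
epilog, while something like pxq. should stop with an error complaining that
a ’.’ was expected.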
We can flesh out the statement part in a similar way. The BNF
for it is:
Note that statements can begin with any identifier except END.
So the first stub form of procedure Statements is:
{--------------------------------------------------------------}
{ Parse and Translate the Statement Part }
procedure Statements;
begin
Match(’b’);
while Look <> ’e’ do
GetChar;
Match(’e’);
end;
{--------------------------------------------------------------}
A minimal legal input is now
’pxbe.’
Try it. Also try some combinations of this. Make some deliberate
errors and see what happens.
At this point you should be beginning to see the drill. We begin
with a stub translator to process a program, then we flesh out each
procedure in turn, based upon its BNF definition. Just as the lower-
level BNF definitions add detail and elaborate upon the higher-level
ones, the lower-level recognizers will parse more detail of the input
program. When the last stub has been expanded, the compiler will
be complete. That’s top-down design/implementation in its purest
form.
You might note that even though we’ve been adding procedures,
the output of the program hasn’t changed. That’s as it should be.
At these top levels there is no emitted code required. The recognizers
are functioning as just that: recognizers. They are accepting input
sentences, catching bad ones, and channeling good input to the right
places, so they are doing their job. If we were to pursue this a bit
longer, code would start to appear.
The next step in our expansion should probably be procedure
Statements. The Pascal definition is:
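In standard Pascal terms (my paraphrase), the statement part is just a
compound statement:
<statement-part> ::= <compound statement>
<compound statement> ::= BEGIN <statement> ( ’;’ <statement> )* END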
THE STRUCTURE OF C
One reason I’m showing you these structures now is so that I can
impress upon you these two facts:
1. The definition of the language drives the structure of the com-
piler. What works for one language may be a disaster for an-
other. It’s a very bad idea to try to force a given structure
upon the compiler. Rather, you should let the BNF drive the
structure, as we have done here.
2. A language that is hard to write BNF for will probably be hard
to write a compiler for, as well. C is a popular language, and
it has a reputation for letting you do virtually anything that is
possible to do. Despite the success of Small C, C is NOT an
easy language to parse.
A C program has less structure than its Pascal counterpart. At
the top level, everything in C is a static declaration, either of data
or of a function. We can capture this thought like this:
<program> ::= ( <global declaration> )*
In Small C, functions can only have the default type int, which
is not declared. This makes the input easy to parse: the first token
is either “int,” “char,” or the name of a function. In Small C, the
preprocessor commands are also processed by the compiler proper,
so the syntax becomes:
<global declaration> ::= ’#’ <preprocessor command> |
’int’ <data list> |
’char’ <data list> |
<ident> <function body>
Although we’re really more interested in full C here, I’ll show you
the code corresponding to this top-level structure for Small C.
{--------------------------------------------------------------}
{ Parse and Translate A Program }
procedure Prog;
begin
while Look <> ^Z do begin
case Look of
’#’: PreProc;
’i’: IntDecl;
’c’: CharDecl;
else DoFunction(Int);
end;
end;
end;
{--------------------------------------------------------------}
Note that I’ve had to use a ^Z to indicate the end of the source. C
has no keyword such as END or the ’.’ to otherwise indicate the end.
With full C, things aren’t even this easy. The problem comes
about because in full C, functions can also have types. So when the
compiler sees a keyword like “int,” it still doesn’t know whether to
expect a data declaration or a function definition. Things get more
complicated since the next token may not be a name ... it may start
with an ’*’ or ’(’, or combinations of the two.
More specifically, the BNF for full C begins with:
You can now see the problem: The first two parts of the dec-
larations for data and functions can be the same. Because of the
ambiguity in the grammar as written above, it’s not a suitable gram-
mar for a recursive-descent parser. Can we transform it into one that
is suitable? Yes, with a little work. Suppose we write it this way:
We can build a parsing routine for the class and type definitions,
and have them store away their findings and go on, without their
ever having to “know” whether a function or a data declaration is
being processed.
To begin, key in the following version of the main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
while Look <> ^Z do begin
GetClass;
GetType;
TopDecl;
end;
end.
{--------------------------------------------------------------}
For the first round, just make the three procedures stubs that do
nothing BUT call GetChar.
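Something along these lines (my sketch of the stubs the text describes):
{--------------------------------------------------------------}
{ Get a Storage Class Specifier (stub) }
procedure GetClass;
begin
   GetChar;
end;
{--------------------------------------------------------------}
{ Get a Type Specifier (stub) }
procedure GetType;
begin
   GetChar;
end;
{--------------------------------------------------------------}
{ Process a Top-Level Declaration (stub) }
procedure TopDecl;
begin
   GetChar;
end;
{--------------------------------------------------------------}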
Does this program work? Well, it would be hard put NOT to,
since we’re not really asking it to do anything. It’s been said that
a C compiler will accept virtually any input without choking. It’s
certainly true of THIS compiler, since in effect all it does is to eat
input characters until it finds a ^Z.
Next, let’s make GetClass do something worthwhile. Declare the
global variable
var Class: char;
and change GetClass to the following:
{--------------------------------------------------------------}
{ Get a Storage Class Specifier }
Procedure GetClass;
begin
if Look in [’a’, ’x’, ’s’] then begin
Class := Look;
GetChar;
end
else Class := ’a’;
end;
{--------------------------------------------------------------}
Here, I’ve used three single characters to represent the three stor-
age classes “auto,” “extern,” and “static.” These are not the only
three possible classes ... there are also “register” and “typedef,” but
this should give you the picture. Note that the default class is “auto.”
We can do a similar thing for types. Enter the following procedure
next:
{--------------------------------------------------------------}
{ Get a Type Specifier }
procedure GetType;
begin
Typ := ’ ’;
if Look = ’u’ then begin
Sign := ’u’;
Typ := ’i’;
GetChar;
end
else Sign := ’s’;
if Look in [’i’, ’l’, ’c’] then begin
Typ := Look;
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that you must add two more global variables, Sign and Typ.
With these two procedures in place, the compiler will process the
class and type definitions and store away their findings. We can now
process the rest of the declaration.
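As a worked example (mine, not from the listing): for an input that begins
with the characters ’sui’, GetClass consumes the ’s’ and sets Class to ’s’,
then GetType consumes the ’u’ and the ’i’, leaving Sign = ’u’ and Typ = ’i’.
For an input that begins directly with a name such as ’g’, nothing is
consumed and the defaults Class = ’a’, Sign = ’s’, Typ = ’ ’ are left in
place.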
We are by no means out of the woods yet, because there are still
many complexities just in the definition of the type, before we even
get to the actual data or function names. Let’s pretend for the mo-
ment that we have passed all those gates, and that the next thing in
the input stream is a name. If the name is followed by a left paren,
we have a function declaration. If not, we have at least one data item,
and possibly a list, each element of which can have an initializer.
Insert the following version of TopDecl:
{--------------------------------------------------------------}
{ Process a Top-Level Declaration }
procedure TopDecl;
var Name: char;
begin
Name := Getname;
if Look = ’(’ then
DoFunc(Name)
else
DoData(Name);
end;
{--------------------------------------------------------------}
(Note that, since we have already read the name, we must pass it
along to the appropriate routine.)
Finally, add the two procedures DoFunc and DoData:
{--------------------------------------------------------------}
{ Process a Function Definition }
procedure DoFunc(n: char);
begin
Match(’(’);
Match(’)’);
Match(’{’);
Match(’}’);
if Typ = ’ ’ then Typ := ’i’;
Writeln(Class, Sign, Typ, ’ function ’, n);
end;
{--------------------------------------------------------------}
{ Process a Data Declaration }
procedure DoData(n: char);
begin
if Typ = ’ ’ then Expected(’Type declaration’);
Writeln(Class, Sign, Typ, ’ data ’, n);
while Look = ’,’ do begin
Match(’,’);
n := GetName;
WriteLn(Class, Sign, Typ, ’ data ’, n);
end;
Match(’;’);
end;
{--------------------------------------------------------------}
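To see the skeleton in action (my examples): the input ’iq(){}’ should print
asi function Q — default class ’a’, default sign ’s’, type ’i’ — while ’ub;’
should print aui data B. A data item with no type, such as ’b;’ by itself,
should abort with a Type declaration Expected error.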
Part X
21 May 1989.
INTRODUCING “TINY”
INTRODUCTION
In the last installment, I showed you the general idea for the top-
down development of a compiler. I gave you the first few steps of
the process for compilers for Pascal and C, but I stopped far short
of pushing it through to completion. The reason was simple: if we’re
going to produce a real, functional compiler for any language, I’d
rather do it for KISS, the language that I’ve been defining in this
tutorial series.
In this installment, we’re going to do just that, for a subset of
KISS which I’ve chosen to call TINY.
The process will be essentially that outlined in Installment IX,
except for one notable difference. In that installment, I suggested
that you begin with a full BNF description of the language. That’s
fine for something like Pascal or C, for which the language definition
is firm. In the case of TINY, however, we don’t yet have a full
description ... we seem to be defining the language as we go. That’s
OK. In fact, it’s preferable, since we can tailor the language slightly
as we go, to keep the parsing easy.
So in the development that follows, we’ll actually be doing a top-
down development of BOTH the language and its compiler. The
BNF description will grow along with the compiler.
In this process, there will be a number of decisions to be made,
each of which will influence the BNF and therefore the nature of the
language. At each decision point I’ll try to remember to explain the
decision and the rationale behind my choice. That way, if you happen
to hold a different opinion and would prefer a different option, you
can choose it instead. You now have the background to do that. I
guess the important thing to note is that nothing we do here is cast
in concrete. When YOU’RE designing YOUR language, you should
feel free to do it YOUR way.
Many of you may be asking at this point: Why bother starting
over from scratch? We had a working subset of KISS as the outcome
of Installment VII (lexical scanning). Why not just extend it as
needed? The answer is threefold. First of all, I have been making
a number of changes to further simplify the program ... changes
like encapsulating the code generation procedures, so that we can
convert to a different target machine more easily. Second, I want you
to see how the development can indeed be done from the top down, as I
outlined in the last installment.
GETTING STARTED
Many years ago there were languages called Tiny BASIC, Tiny Pas-
cal, and Tiny C, each of which was a subset of its parent full language.
Tiny BASIC, for example, had only single-character variable names
and global variables. It supported only a single data type. Sound
familiar? At this point we have almost all the tools we need to build
a compiler like that.
Yet a language called Tiny-anything still carries some baggage
inherited from its parent language. I’ve often wondered if this is a
good idea. Granted, a language based upon some parent language
will have the advantage of familiarity, but there may also be some
peculiar syntax carried over from the parent that may tend to add
unnecessary complexity to the compiler. (Nowhere is this more true
than in Small C.)
I’ve wondered just how small and simple a compiler could be made
and still be useful, if it were designed from the outset to be both easy
to use and to parse. Let’s find out. This language will just be called
“TINY,” period. It’s a subset of KISS, which I also haven’t fully
defined, so that at least makes us consistent (!). I suppose you could
call it TINY KISS. But that opens up a whole can of worms involving
cuter and cuter (and perhaps more risque) names, so let’s just stick
with TINY.
The main limitations of TINY will be because of the things we
haven’t yet covered, such as data types. Like its cousins Tiny C and
Tiny BASIC, TINY will have only one data type, the 16-bit integer.
The first version we develop will also have no procedure calls and will
use single-character variable names, although as you will see we can
remove these restrictions without much effort.
The language I have in mind will share some of the good features
of Pascal, C, and Ada. Taking a lesson from the comparison of the
Pascal and C compilers in the previous installment, though, TINY
will have a decided Pascal flavor. Wherever feasible, a language struc-
ture will be bracketed by keywords or symbols, so that the parser will
know where it’s going without having to guess.
One other ground rule: As we go, I’d like to keep the compiler
producing real, executable code. Even though it may not DO much
at the beginning, it will at least do it correctly.
Finally, I’ll use a couple of Pascal restrictions that make sense: All
data and procedures must be declared before they are used. That
makes good sense, even though for now the only data type we’ll use
is a word. This rule in turn means that the only reasonable place to
put the executable code for the main program is at the end of the
listing.
The top-level definition will be similar to Pascal:
The procedure Header just emits the startup code required by the
assembler:
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn(’WARMST’, TAB, ’EQU $A01E’);
end;
{--------------------------------------------------------------}
The procedures Prolog and Epilog emit the code for identifying
the main program, and for returning to the OS:
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
PostLabel(’MAIN’);
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog;
begin
EmitLn(’DC WARMST’);
EmitLn(’END MAIN’);
end;
{--------------------------------------------------------------}
The main program just calls Prog, and then looks for a clean
ending:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort(’Unexpected data after ’’.’’’);
end.
{--------------------------------------------------------------}
At this point, TINY will accept only one input “program,” the
null program:
p.
Note, though, that the compiler DOES generate correct code for
this program. It will run, and do what you’d expect the null program
to do, that is, nothing but return gracefully to the OS.
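For the record (my transcript, assuming a first-cut Prog that simply calls
Header and Prolog after the ’p’ and Epilog after the ’.’), the null program
compiles to:
WARMST EQU $A01E
MAIN:
	DC WARMST
	END MAIN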
As a matter of interest, one of my favorite compiler benchmarks is
to compile, link, and execute the null program in whatever language is
involved. You can learn a lot about the implementation by measuring
the overhead in time required to compile what should be a trivial case.
DECLARATIONS
Note that since there is only one variable type, there is no need to
declare the type. Later on, for full KISS, we can easily add a type
description.
The procedure Prog becomes:
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure Prog;
begin
Match(’p’);
Header;
TopDecls;
Main;
Match(’.’);
end;
{--------------------------------------------------------------}
That looks pretty good, but we’re still only generating the null pro-
gram for output. A real compiler would issue assembler directives
to allocate storage for the variables. It’s about time we actually
produced some code.
With a little extra code, that’s an easy thing to do from procedure
Decl. Modify it as follows:
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
Match(’v’);
Alloc(GetName);
end;
{--------------------------------------------------------------}
Give this one a whirl. Try an input that declares some variables,
such as:
pvxvyvzbe.
See how the storage is allocated? Simple, huh? Note also that the
entry point, “MAIN,” comes out in the right place.
For the record, a “real” compiler would also have a symbol table
to record the variables being used. Normally, the symbol table is
necessary to record the type of each variable. But since in this case
all variables have the same type, we don’t need a symbol table for
that reason. As it turns out, we’re going to find a symbol table
even without different types, but let’s postpone that need until it
arises.
Of course, we haven’t really parsed the correct syntax for a data
declaration, since it involves a variable list. Our version only permits
a single variable. That’s easy to fix, too.
The BNF for <var-list> is
<var-list> ::= <ident> (, <ident>)*
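Handling the list is a small change to Decl; a minimal sketch (mine, assuming
the Alloc already in place):
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
begin
   Match('v');
   Alloc(GetName);
   while Look = ',' do begin
      Match(',');
      Alloc(GetName);
   end;
end;
{--------------------------------------------------------------}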
OK, now compile this code and give it a try. Try a number of lines
of VAR declarations, try a list of several variables on one line, and
try combinations of the two. Does it work?
INITIALIZERS
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
if InTable(N) then Abort(’Duplicate Variable Name ’ + N);
ST[N] := ’v’;
Write(N, ’:’, TAB, ’DC ’);
if Look = ’=’ then begin
Match(’=’);
If Look = ’-’ then begin
Write(Look);
Match(’-’);
end;
WriteLn(GetNum);
end
else
WriteLn(’0’);
end;
{--------------------------------------------------------------}
pvavavabe.
Here we’ve declared the variable A three times. As you can see,
the compiler will cheerfully accept that, and generate three identical
labels. Not good.
Later on, when we start referencing variables, the compiler will
also let us reference variables that don’t exist. The assembler will
catch both of these error conditions, but it doesn’t seem friendly at
all to pass such errors along to the assembler. The compiler should
catch such things at the source language level.
So even though we don’t need a symbol table to record data types,
we ought to install one just to check for these two conditions. Since
at this point we are still restricted to single-character variable names,
the symbol table can be trivial. To provide for it, first add the
following declaration at the beginning of your program:
var ST: Array[’A’..’Z’] of char;
(initialized to all blanks in Init), and then add the following function:
{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: char): Boolean;
begin
InTable := ST[n] <> ’ ’;
end;
{--------------------------------------------------------------}
That should do it. The compiler will now catch duplicate declara-
tions. Later, we can also use InTable when generating references to
the variables.
EXECUTABLE STATEMENTS
At this point, we can generate a null program that has some data
variables declared and possibly initialized. But so far we haven’t
arranged to generate the first line of executable code.
Believe it or not, though, we almost have a usable language! What’s
missing is the executable code that must go into the main program.
But that code is just assignment statements and control statements
... all stuff we have done before. So it shouldn’t take us long to
provide for them, as well.
The BNF definition given earlier for the main program included a
statement block, which we have so far ignored:
Let’s start things off by adding a parser for the block. We’ll begin
with a stub for the assignment statement:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
begin
GetChar;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
while Look <> ’e’ do
Assignment;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure Main;
begin
Match(’b’);
Prolog;
Block;
Match(’e’);
Epilog;
end;
{--------------------------------------------------------------}
This version still won’t generate any code for the “assignment
statements” ... all it does is to eat characters until it sees the ’e’ for
’END.’ But it sets the stage for what is to follow.
The next step, of course, is to flesh out the code for an assign-
ment statement. This is something we’ve done many times before,
so I won’t belabor it. This time, though, I’d like to deal with the
code generation a little differently. Up till now, we’ve always just
inserted the Emits that generate output code in line with the pars-
ing routines. A little unstructured, perhaps, but it seemed the most
straightforward approach, and made it easy to see what kind of code
would be emitted for each construct.
However, I realize that most of you are using an 80x86 computer,
so the 68000 code generated is of little use to you. Several of you
have asked me if the CPU-dependent code couldn’t be collected into
one place, to make it easier to retarget the compiler to a different CPU.
The answer is yes: from here on, the parsing routines will call small,
machine-dependent code generation procedures instead of emitting 68000
instructions in line.
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
Abort(’Undefined Identifier ’ + n);
end;
{--------------------------------------------------------------}
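The machine-dependent routines referenced below (LoadConst, LoadVar, Negate,
Push, PopMul, PopDiv, Store, and friends) simply wrap the instructions we have
been emitting in line all along. A minimal sketch of a few of them follows; it
is my reconstruction based on that earlier in-line code, so the parameter
types and exact instruction sequences in the full listing may differ:
{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }
procedure LoadConst(n: integer);
begin
   Emit('MOVE #');
   WriteLn(n, ',D0');
end;
{---------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure LoadVar(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('MOVE ' + Name + '(PC),D0');
end;
{---------------------------------------------------------------}
{ Negate the Primary Register }
procedure Negate;
begin
   EmitLn('NEG D0');
end;
{---------------------------------------------------------------}
{ Push Primary onto the Stack }
procedure Push;
begin
   EmitLn('MOVE D0,-(SP)');
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }
procedure PopMul;
begin
   EmitLn('MULS (SP)+,D0');
end;
{---------------------------------------------------------------}
{ Divide (same instruction sequence as the earlier in-line Divide) }
procedure PopDiv;
begin
   EmitLn('MOVE (SP)+,D1');
   EmitLn('EXT.L D0');
   EmitLn('DIVS D1,D0');
end;
{---------------------------------------------------------------}
{ Store Primary to a Variable }
procedure Store(Name: char);
begin
   if not InTable(Name) then Undefined(Name);
   EmitLn('LEA ' + Name + '(PC),A0');
   EmitLn('MOVE D0,(A0)');
end;
{---------------------------------------------------------------}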
This version of the BNF is also a bit different than we’ve used
before ... yet another “variation on the theme of an expression.” This
particular version has what I consider to be the best treatment of the
unary minus. As you’ll see later, it lets us handle negative constant
values efficiently. It’s worth mentioning here that we have often seen
the advantages of “tweaking” the BNF as we go, to help make the
language easy to parse. What you’re looking at here is a bit different:
we’ve tweaked the BNF to make the CODE GENERATION more
efficient! That’s a first for this series.
Anyhow, the following code implements the BNF:
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure Expression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsAlpha(Look) then
LoadVar(GetName)
else
LoadConst(GetNum);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }
procedure NegFactor;
begin
Match(’-’);
if IsDigit(Look) then
LoadConst(-GetNum)
else begin
Factor;
Negate;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }
procedure FirstFactor;
begin
case Look of
’+’: begin
Match(’+’);
Factor;
end;
’-’: NegFactor;
else Factor;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
PopMul;
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
PopDiv;
end;
{---------------------------------------------------------------}
OK, if you’ve got all this code inserted, then compile it and check
it out. You should be seeing reasonable-looking code, representing a
complete program that will assemble and execute. We have a com-
piler!
BOOLEANS
The next step should also be familiar to you. We must add Boolean
expressions and relational operations. Again, since we’ve already
dealt with them more than once, I won’t elaborate much on them, ex-
cept where they are different from what we’ve done before. Again, we
won’t just copy from other files because I’ve changed a few things just
a bit. Most of the changes just involve encapsulating the machine-
dependent parts as we did for the arithmetic operations. I’ve also
modified procedure NotFactor somewhat, to parallel the structure of
FirstFactor. Finally, I corrected an error in the object code for the
relational operators: The Scc instruction I used only sets the low 8
bits of D0. We want all 16 bits set for a logical true, so I’ve added
an instruction to sign-extend the low byte.
To begin, we’re going to need some more recognizers:
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
IsOrop := c in [’|’, ’~’];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
IsRelop := c in [’=’, ’#’, ’<’, ’>’];
end;
{--------------------------------------------------------------}
{---------------------------------------------------------------}
{ Complement the Primary Register }
procedure NotIt;
begin
EmitLn(’NOT D0’);
end;
{---------------------------------------------------------------}
.
.
.
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }
procedure PopAnd;
begin
EmitLn(’AND (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ OR Top of Stack with Primary }
procedure PopOr;
begin
EmitLn(’OR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }
procedure PopXor;
begin
EmitLn(’EOR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
EmitLn(’CMP (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was = }
procedure SetEqual;
begin
EmitLn(’SEQ D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
EmitLn(’SNE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
EmitLn(’SLT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was < }
procedure SetLess;
begin
EmitLn(’SGT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
All of this gives us the tools we need. The BNF for the Boolean
expressions is:
Sharp-eyed readers might note that this syntax does not include
the non-terminal “bool-factor” used in earlier versions. It was needed
then because I also allowed for the Boolean constants TRUE and
FALSE. But remember that in TINY there is no distinction made
between Boolean and arithmetic types ... they can be freely inter-
mixed. So there is really no need for these predefined values ... we
can just use -1 and 0, respectively.
In C terminology, we could always use the defines:
#define TRUE -1
#define FALSE 0
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }
procedure Equals;
begin
Match(’=’);
Expression;
PopCompare;
SetEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
procedure NotEquals;
begin
Match(’#’);
Expression;
PopCompare;
SetNEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }
procedure Less;
begin
Match(’<’);
Expression;
PopCompare;
SetLess;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }
procedure Greater;
begin
Match(’>’);
Expression;
PopCompare;
SetGreater;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
Expression;
if IsRelop(Look) then begin
Push;
case Look of
’=’: Equals;
’#’: NotEquals;
’<’: Less;
’>’: Greater;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }
procedure NotFactor;
begin
if Look = ’!’ then begin
Match(’!’);
Relation;
NotIt;
end
else
Relation;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }
procedure BoolTerm;
begin
NotFactor;
while Look = ’&’ do begin
Push;
Match(’&’);
NotFactor;
PopAnd;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }
procedure BoolOr;
begin
Match(’|’);
BoolTerm;
PopOr;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }
procedure BoolXor;
begin
Match(’~’);
BoolTerm;
PopXor;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Look) do begin
Push;
case Look of
’|’: BoolOr;
’~’: BoolXor;
end;
end;
end;
{--------------------------------------------------------------}
For example, the single-character input
pvx,y,zbx=z>ye.
stands for:
PROGRAM
VAR X,Y,Z
BEGIN
X = Z > Y
END.
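For what it’s worth, here is the sort of code this should produce (my
transcript; it assumes the Alloc and relational routines shown above, plus a
Store that emits the LEA/MOVE pair we have used before, so the exact lines
may differ slightly):
WARMST  EQU $A01E
X:      DC 0
Y:      DC 0
Z:      DC 0
MAIN:
        MOVE Z(PC),D0
        MOVE D0,-(SP)
        MOVE Y(PC),D0
        CMP (SP)+,D0
        SLT D0
        EXT D0
        LEA X(PC),A0
        MOVE D0,(A0)
        DC WARMST
        END MAIN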
CONTROL STRUCTURES
Once again, let me spell out the decisions implicit in this syntax,
which departs strongly from that of C or Pascal. In both of those
languages, the “body” of an IF or WHILE is regarded as a single
statement. If you intend to use a block of more than one statement,
you have to build a compound statement using BEGIN-END (in Pas-
cal) or ’{}’ (in C). In TINY (and KISS) there is no such thing as a
compound statement ... single or multiple they’re all just blocks to
these languages.
In KISS, all the control structures will have explicit and unique
keywords bracketing the statement block, so there can be no confu-
sion as to where things begin and end. This is the modern approach,
used in such respected languages as Ada and Modula 2, and it com-
pletely eliminates the problem of the “dangling else.”
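In rough BNF, the two constructs look like this (my sketch; in the
single-character version IF, ELSE, ENDIF map to ’i’, ’l’, ’e’, and WHILE,
ENDWHILE to ’w’, ’e’):
<if>    ::= IF <bool-expression> <block> [ ELSE <block> ] ENDIF
<while> ::= WHILE <bool-expression> <block> ENDWHILE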
Note that I could have chosen to use the same keyword END to end
all the constructs, as is done in Pascal. (The closing ’}’ in C serves
the same purpose.) But this has always led to confusion, which is
why Pascal programmers tend to write things like
end { loop }
or end { if }
As I explained in Part V, using unique terminal keywords does
increase the size of the keyword list and therefore slows down the
scanning, but in this case it seems a small price to pay for the added
insurance. Better to find the errors at compile time rather than run
time.
One last thought: The two constructs above each have the non-
terminals
{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
EmitLn(’BRA ’ + L);
end;
{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
EmitLn(’TST D0’);
EmitLn(’BEQ ’ + L);
end;
{--------------------------------------------------------------}
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
Match(’i’);
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Look = ’l’ then begin
Match(’l’);
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
Match(’e’);
end;
{--------------------------------------------------------------}
where
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
while not(Look in [’e’, ’l’]) do begin
case Look of
’i’: DoIf;
’w’: DoWhile;
else Assignment;
end;
end;
end;
{--------------------------------------------------------------}
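Block also dispatches on ’w’; the single-character DoWhile follows the same
pattern. A sketch of mine, mirroring the multi-character version given in the
next section:
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
   Match('w');
   L1 := NewLabel;
   L2 := NewLabel;
   PostLabel(L1);
   BoolExpression;
   BranchFalse(L2);
   Block;
   Match('e');
   Branch(L1);
   PostLabel(L2);
end;
{--------------------------------------------------------------}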
OK, add the routines I’ve given, compile and test them. You
should be able to parse the single-character versions of any of the
control constructs. It’s looking pretty good!
As a matter of fact, except for the single-character limitation we’ve
got a virtually complete version of TINY. I call it, with tongue
planted firmly in cheek, TINY Version 0.1.
LEXICAL SCANNING
Note that we have seen this procedure before in the form of Pro-
cedure Fin. I’ve changed the name since this new one seems more
descriptive of the actual function. I’ve also changed the code to allow
for multiple newlines and lines with nothing but white space.
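A sketch of what that description amounts to (mine; essentially Fin wrapped in
a loop):
{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure NewLine;
begin
   while Look = CR do begin
      GetChar;
      if Look = LF then GetChar;
      SkipWhite;
   end;
end;
{--------------------------------------------------------------}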
The next step is to insert calls to NewLine wherever we decide a
newline is permissible. As I’ve pointed out before, this can be very
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }
ST: Array[’A’..’Z’] of char;
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const NKW = 9;
NKW1 = 10;
const KWlist: array[1..NKW] of Symbol =
(’IF’, ’ELSE’, ’ENDIF’, ’WHILE’, ’ENDWHILE’,
’VAR’, ’BEGIN’, ’END’, ’PROGRAM’);
const KWcode: string[NKW1] = ’xilewevbep’;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Get an Identifier and Scan it for Keywords }
procedure Scan;
begin
GetName;
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected(’’’’ + x + ’’’’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
NewLine;
if not IsAlpha(Look) then Expected(’Name’);
Value := ’’;
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
Note that this procedure leaves its result in the global string Value.
Next, we have to change every reference to GetName to reflect its
new form. These occur in Factor, Assignment, and Decl:
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
BoolExpression;
Match(’)’);
end
else if IsAlpha(Look) then begin
GetName;
LoadVar(Value[1]);
end
else
LoadConst(GetNum);
end;
{--------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := Value[1];
Match(’=’);
BoolExpression;
Store(Name);
end;
{---------------------------------------------------------------}
.
.
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
begin
GetName;
Alloc(Value[1]);
while Look = ’,’ do begin
Match(’,’);
GetName;
Alloc(Value[1]);
end;
end;
{--------------------------------------------------------------}
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = ’l’ then begin
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString(’ENDIF’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString(’ENDWHILE’);
Branch(L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
’w’: DoWhile;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}
That should do it. If all the changes got in correctly, you should
now be parsing programs that look like programs. (If you didn’t
make it through all the changes, don’t despair. A complete listing of
the final form is given later.)
Did it work? If so, then we’re just about home. In fact, with a few
minor exceptions we’ve already got a compiler that’s usable. There
are still a few areas that need improvement.
MULTI-CHARACTER VARIABLE NAMES
One of those is the restriction that we still have, requiring single-
character variable names. Now that we can handle multi-character
keywords, this one begins to look very much like an arbitrary and
unnecessary limitation. And indeed it is. Basically, its only virtue is
that it permits a trivially simple implementation of the symbol table.
But that’s just a convenience to the compiler writers, and needs to
be eliminated.
We’ve done this step before. This time, as usual, I’m doing it a
little differently. I think the approach used here keeps things just
about as simple as possible.
The natural way to implement a symbol table in Pascal is by
declaring a record type, and making the symbol table an array of
such records. Here, though, we don’t really need a type field yet
(there is only one kind of entry allowed so far), so we only need an
array of symbols. This has the advantage that we can use the existing
procedure Lookup to search the symbol table as well as the keyword
list. As it turns out, even when we need more fields we can still
use the same approach, simply by storing the other fields in separate
arrays.
OK, here are the changes that need to be made. First, add the
new typed constant:
NEntry: integer = 0;
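These routines also rely on the symbol-table declarations that appear
in the complete listing later on; for reference, they look like this:
const MaxEntry = 100;
var ST   : array[1..MaxEntry] of Symbol;
    SType: array[1..MaxEntry] of char;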
{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: Symbol): Boolean;
begin
InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }
procedure AddEntry(N: Symbol; T: char);
begin
if InTable(N) then Abort(’Duplicate Identifier ’ + N);
if NEntry = MaxEntry then Abort(’Symbol Table Full’);
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: Symbol);
begin
if InTable(N) then Abort(’Duplicate Variable Name ’ + N);
AddEntry(N, ’v’);
.
.
.
{--------------------------------------------------------------}
Finally, we must change all the routines that currently treat the
variable name as a single character. These include LoadVar and Store
(just change the type from char to string), and Factor, Assignment,
and Decl (just change Value[1] to Value).
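For example, after these changes Assignment comes out as shown below;
this is the same version that appears in the complete listing later
on:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := Value;
Match(’=’);
BoolExpression;
Store(Name);
end;
{--------------------------------------------------------------}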
One last thing: change procedure Init to clear the array as shown:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := ’’;
SType[i] := ’ ’;
end;
GetChar;
Scan;
end;
{--------------------------------------------------------------}
That should do it. Try it out and verify that you can, indeed, use
multi-character variable names.
MORE RELOPS
If you’ll recall, in Part VII I pointed out that the conventional way
to deal with relops is to include them in the list of keywords, and let
the lexical scanner find them. But, again, this requires scanning
throughout the expression parsing process, whereas so far we’ve been
able to limit the use of the scanner to the beginning of a statement.
I mentioned then that we can still get away with this, since the
multi-character relops are so few and so limited in their usage. It’s
easy to just treat them as special cases and handle them in an ad
hoc manner.
The changes required affect only the code generation routines and
procedures Relation and friends. First, we’re going to need two more
code generation routines:
{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn(’SGE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
EmitLn(’SLE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }
procedure LessOrEqual;
begin
Match(’=’);
Expression;
PopCompare;
SetLessOrEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
procedure NotEqual;
begin
Match(’>’);
Expression;
PopCompare;
SetNEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }
procedure Less;
begin
Match(’<’);
case Look of
’=’: LessOrEqual;
’>’: NotEqual;
else begin
Expression;
PopCompare;
SetLess;
end;
end;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }
procedure Greater;
begin
Match(’>’);
if Look = ’=’ then begin
Match(’=’);
Expression;
PopCompare;
SetGreaterOrEqual;
end
else begin
Expression;
PopCompare;
SetGreater;
end;
end;
{---------------------------------------------------------------}
That’s all it takes. Now you can process all the relops. Try it.
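To convince yourself that the relops are working, it helps to look at
the generated code. For a fragment like x<=10 (with X declared as a
variable), the output should look something like this:
   MOVE X(PC),D0
   MOVE D0,-(SP)
   MOVE #10,D0
   CMP (SP)+,D0
   SGE D0
   EXT D0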
INPUT/OUTPUT
{---------------------------------------------------------------}
{ Read Variable to Primary Register }
procedure ReadVar;
begin
EmitLn(’BSR READ’);
Store(Value);
end;
{---------------------------------------------------------------}
{ Write Variable from Primary Register }
procedure WriteVar;
begin
EmitLn(’BSR WRITE’);
end;
{--------------------------------------------------------------}
The idea is that READ loads the value from input to the D0, and
WRITE outputs it from there.
These two procedures represent our first encounter with a need for
library procedures ... the components of a Run Time Library (RTL).
Of course, someone (namely us) has to write these routines, but
they’re not part of the compiler itself. I won’t even bother showing
the routines here, since these are obviously very much OS-dependent.
I WILL simply say that for SK*DOS, they are particularly simple
... almost trivial. One reason I won’t show them here is that you can
add all kinds of fanciness to the things, for example by prompting in
READ for the inputs, and by giving the user a chance to reenter a
bad input.
But that is really separate from compiler design, so for now I’ll
just assume that a library called TINYLIB.LIB exists. Since we now
need it loaded, we need to add a statement to include it in procedure
Header:
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn(’WARMST’, TAB, ’EQU $A01E’);
EmitLn(’LIB TINYLIB’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const NKW = 11;
NKW1 = 12;
const KWlist: array[1..NKW] of Symbol =
(’IF’, ’ELSE’, ’ENDIF’, ’WHILE’, ’ENDWHILE’,
’READ’, ’WRITE’, ’VAR’, ’BEGIN’, ’END’,
’PROGRAM’);
const KWcode: string[NKW1] = ’xileweRWvbep’;
{--------------------------------------------------------------}
(Note how I’m using upper case codes here to avoid conflict with
the ’w’ of WHILE.)
Next, we need procedures for processing the read/write statement
and its argument list:
{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
Match(’(’);
GetName;
ReadVar;
while Look = ’,’ do begin
Match(’,’);
GetName;
ReadVar;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
{ Process a Write Statement }
procedure DoWrite;
begin
Match(’(’);
Expression;
WriteVar;
while Look = ’,’ do begin
Match(’,’);
Expression;
WriteVar;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
CONCLUSION
At this point we have TINY completely defined. It’s not much ...
actually a toy compiler. TINY has only one data type and no sub-
routines ... but it’s a complete, usable language. While you’re not
likely to be able to write another compiler in it, or do any other
serious programming, you could write programs to read some input,
perform calculations, and spit out the results. Not bad for a toy.
For reference, the complete listing of TINY Version 1.0 is shown
below:
{--------------------------------------------------------------}
program Tiny10;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
LF = ^J;
LCount: integer = 0;
NEntry: integer = 0;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }
const MaxEntry = 100;
var ST : array[1..MaxEntry] of Symbol;
SType: array[1..MaxEntry] of char;
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const NKW = 11;
NKW1 = 12;
const KWlist: array[1..NKW] of Symbol =
(’IF’, ’ELSE’, ’ENDIF’, ’WHILE’, ’ENDWHILE’,
’READ’, ’WRITE’, ’VAR’, ’BEGIN’, ’END’,
’PROGRAM’);
const KWcode: string[NKW1] = ’xileweRWvbep’;
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
Abort(’Undefined Identifier ’ + n);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’, ’/’];
end;
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
IsOrop := c in [’|’, ’~’];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
IsRelop := c in [’=’, ’#’, ’<’, ’>’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure NewLine;
begin
while Look = CR do begin
GetChar;
if Look = LF then GetChar;
SkipWhite;
end;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
NewLine;
if Look = x then GetChar
else Expected(’’’’ + x + ’’’’);
SkipWhite;
end;
{--------------------------------------------------------------}
{ Table Lookup }
function Lookup(T: TabPtr; s: string; n: integer): integer;
var i: integer;
found: Boolean;
begin
found := false;
i := n;
while (i > 0) and not found do
if s = T^[i] then
found := true
else
dec(i);
Lookup := i;
end;
{--------------------------------------------------------------}
{ Locate a Symbol in Table }
{ Returns the index of the entry. Zero if not present. }
function Locate(N: Symbol): integer;
begin
Locate := Lookup(@ST, n, MaxEntry);
end;
{--------------------------------------------------------------}
{ Look for Symbol in Table }
function InTable(n: Symbol): Boolean;
begin
InTable := Lookup(@ST, n, MaxEntry) <> 0;
end;
{--------------------------------------------------------------}
{ Add a New Entry to Symbol Table }
procedure AddEntry(N: Symbol; T: char);
begin
if InTable(N) then Abort(’Duplicate Identifier ’ + N);
if NEntry = MaxEntry then Abort(’Symbol Table Full’);
Inc(NEntry);
ST[NEntry] := N;
SType[NEntry] := T;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
NewLine;
if not IsAlpha(Look) then Expected(’Name’);
Value := ’’;
while IsAlNum(Look) do begin
Value := Value + UpCase(Look);
GetChar;
end;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: integer;
var Val: integer;
begin
NewLine;
if not IsDigit(Look) then Expected(’Integer’);
Val := 0;
while IsDigit(Look) do begin
Val := 10 * Val + Ord(Look) - Ord(’0’);
GetChar;
end;
GetNum := Val;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name: string);
begin
if not InTable(Name) then Undefined(Name);
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{---------------------------------------------------------------}
{ Push Primary onto Stack }
procedure Push;
begin
EmitLn(’MOVE D0,-(SP)’);
end;
{---------------------------------------------------------------}
{ Add Top of Stack to Primary }
procedure PopAdd;
begin
EmitLn(’ADD (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }
procedure PopMul;
begin
EmitLn(’MULS (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }
procedure PopDiv;
begin
EmitLn(’MOVE (SP)+,D7’);
EmitLn(’EXT.L D7’);
EmitLn(’DIVS D0,D7’);
EmitLn(’MOVE D7,D0’);
end;
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }
procedure PopAnd;
begin
EmitLn(’AND (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ OR Top of Stack with Primary }
procedure PopOr;
begin
EmitLn(’OR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }
procedure PopXor;
begin
EmitLn(’EOR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
EmitLn(’CMP (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was = }
procedure SetEqual;
begin
EmitLn(’SEQ D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
EmitLn(’SNE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
EmitLn(’SLT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was < }
procedure SetLess;
begin
EmitLn(’SGT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn(’SGE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
EmitLn(’SLE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure Store(Name: string);
begin
if not InTable(Name) then Undefined(Name);
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’)
end;
{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
EmitLn(’BRA ’ + L);
end;
{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
EmitLn(’TST D0’);
EmitLn(’BEQ ’ + L);
end;
{---------------------------------------------------------------}
{ Read Variable to Primary Register }
procedure ReadVar;
begin
EmitLn(’BSR READ’);
Store(Value[1]);
end;
{ Write Variable from Primary Register }
procedure WriteVar;
begin
EmitLn(’BSR WRITE’);
end;
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn(’WARMST’, TAB, ’EQU $A01E’);
end;
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
PostLabel(’MAIN’);
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog;
begin
EmitLn(’DC WARMST’);
EmitLn(’END MAIN’);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
BoolExpression;
Match(’)’);
end
else if IsAlpha(Look) then begin
GetName;
LoadVar(Value);
end
else
LoadConst(GetNum);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Negative Factor }
procedure NegFactor;
begin
Match(’-’);
if IsDigit(Look) then
LoadConst(-GetNum)
else begin
Factor;
Negate;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Leading Factor }
procedure FirstFactor;
begin
case Look of
’+’: begin
Match(’+’);
Factor;
end;
’-’: NegFactor;
else Factor;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match(’*’);
Factor;
PopMul;
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match(’/’);
Factor;
PopDiv;
end;
{---------------------------------------------------------------}
{ Common Code Used by Term and FirstTerm }
procedure Term1;
begin
while IsMulop(Look) do begin
Push;
case Look of
’*’: Multiply;
’/’: Divide;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
Factor;
Term1;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrop(Look) do begin
Push;
case Look of
’|’: BoolOr;
’~’: BoolXor;
end;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := Value;
Match(’=’);
BoolExpression;
Store(Name);
end;
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = ’l’ then begin
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString(’ENDIF’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString(’ENDWHILE’);
Branch(L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
Match(’(’);
GetName;
ReadVar;
while Look = ’,’ do begin
Match(’,’);
GetName;
ReadVar;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
{ Process a Write Statement }
procedure DoWrite;
begin
Match(’(’);
Expression;
WriteVar;
while Look = ’,’ do begin
Match(’,’);
Expression;
WriteVar;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
’w’: DoWhile;
’R’: DoRead;
’W’: DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: Symbol);
begin
if InTable(N) then Abort(’Duplicate Variable Name ’ + N);
AddEntry(N, ’v’);
Write(N, ’:’, TAB, ’DC ’);
if Look = ’=’ then begin
Match(’=’);
If Look = ’-’ then begin
Write(Look);
Match(’-’);
end;
WriteLn(GetNum);
end
else
WriteLn(’0’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
begin
GetName;
Alloc(Value);
while Look = ’,’ do begin
Match(’,’);
GetName;
Alloc(Value);
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
Scan;
while Token <> ’b’ do begin
case Token of
’v’: Decl;
else Abort(’Unrecognized Keyword ’ + Value);
end;
Scan;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure Main;
begin
MatchString(’BEGIN’);
Prolog;
Block;
MatchString(’END’);
Epilog;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Program }
procedure Prog;
begin
MatchString(’PROGRAM’);
Header;
TopDecls;
Main;
Match(’.’);
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: integer;
begin
for i := 1 to MaxEntry do begin
ST[i] := ’’;
SType[i] := ’ ’;
end;
GetChar;
Scan;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
Prog;
if Look <> CR then Abort(’Unexpected data after ’’.’’’);
end.
{--------------------------------------------------------------}
Part XI
3 June 1989.
LEXICAL SCAN REVISITED
INTRODUCTION
I’ve got some good news and some bad news. The bad news is that
this installment is not the one I promised last time. What’s more,
the one after this one won’t be, either.
The good news is the reason for this installment: I’ve found a way
to simplify and improve the lexical scanning part of the compiler.
Let me explain.
BACKGROUND
out to be quite fragile ... one addition or deletion here or there and
things tended to go to pot. Looking back on it, I realize that there
was a message in this that I just wasn’t paying attention to.
When I tried to add semicolons on top of the newlines, that was
the last straw. I ended up with much too complex a solution. I began
to realize that something fundamental had to change.
So, in a way this installment will cause us to backtrack a bit and
revisit the issue of scanning all over again. Sorry about that. That’s
the price you pay for watching me do this in real time. But the new
version is definitely an improvement, and will serve us well for what
is to come.
As I said, the scanner we used in Part X was about as simple as
one can get. But anything can be improved. The new scanner is
more like the classical scanner, and not as simple as before. But the
overall compiler structure is even simpler than before. It’s also more
robust, and easier to add to and/or modify. I think that’s worth the
time spent in this digression. So in this installment, I’ll be showing
you the new structure. No doubt you’ll be happy to know that, while
the changes affect many procedures, they aren’t very profound and
so we lose very little of what’s been done so far.
Ironically, the new scanner is much more conventional than the
old one, and is very much like the more generic scanner I showed you
earlier in Part VII. Then I started trying to get clever, and I almost
clevered myself clean out of business. You’d think one day I’d learn:
K-I-S-S!
THE PROBLEM
done for characters. It seems so obvious once you think about it that
way.
Interestingly enough, if we do things this way the problem that
we’ve had with newline characters goes away. We can just lump
them in as whitespace characters, which means that the handling of
newlines becomes very trivial, and MUCH less prone to error than
we’ve had to deal with in the past.
THE SOLUTION
{--------------------------------------------------------------}
{ Get an Identifier }
procedure GetName;
begin
SkipWhite;
if Not IsAlpha(Look) then Expected(’Identifier’);
Token := ’x’;
Value := ’’;
repeat
Value := Value + UpCase(Look);
GetChar;
until not IsAlNum(Look);
end;
{--------------------------------------------------------------}
{ Get a Number }
procedure GetNum;
begin
SkipWhite;
if not IsDigit(Look) then Expected(’Number’);
Token := ’#’;
Value := ’’;
repeat
Value := Value + Look;
GetChar;
until not IsDigit(Look);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
Token := Look;
Value := ’’;
repeat
Value := Value + Look;
GetChar;
until IsAlpha(Look) or IsDigit(Look) or IsWhite(Look);
end;
{--------------------------------------------------------------}
Note that GetOp returns, as its encoded token, the FIRST char-
acter of the operator. This is important, because it means that we
can now use that single character to drive the parser, instead of the
lookahead character.
We need to tie these procedures together into a single procedure
that can handle all three cases. The following procedure will read any
one of the token types and always leave the input stream advanced
beyond it:
{--------------------------------------------------------------}
{ Get the Next Input Token }
procedure Next;
begin
SkipWhite;
if IsAlpha(Look) then GetName
else if IsDigit(Look) then GetNum
else GetOp;
end;
{--------------------------------------------------------------}
Since newlines now count as just more white space, IsWhite has to
know about CR and LF as well:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB, CR, LF];
end;
{--------------------------------------------------------------}
We’ve already tried similar routines in Part VII, but you might as
well try these new ones out. Add them to a copy of the Cradle and
call Next with the following main program:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
repeat
Next;
WriteLn(Token, ’ ’, Value);
until Token = ’.’;
end.
{--------------------------------------------------------------}
Compile it and verify that you can separate a program into a series
of tokens, and that you get the right encoding for each token.
This ALMOST works, but not quite. There are two potential
problems: First, in KISS/TINY almost all of our operators are single-
character operators. The only exceptions are the relops >=, <=, and
<>. It seems a shame to treat all operators as strings and do a string
compare, when only a single character compare will almost always
suffice. Second, and much more important, the thing doesn’t WORK
when two operators appear together, as in (a+b)*(c+d). Here the
string following ’b’ would be interpreted as a single operator “)*(.”
It’s possible to fix that problem. For example, we could just give
GetOp a list of legal characters, and we could treat the parentheses
as different operator types than the others. But this begins to get
messy.
Fortunately, there’s a better way that solves all the problems.
Since almost all the operators are single characters, let’s just treat
them that way, and let GetOp get only one character at a time. This
not only simplifies GetOp, but also speeds things up quite a bit. We
still have the problem of the relops, but we were treating them as
special cases anyway.
So here’s the final version of GetOp:
{--------------------------------------------------------------}
{ Get an Operator }
procedure GetOp;
begin
SkipWhite;
Token := Look;
Value := Look;
GetChar;
end;
{--------------------------------------------------------------}
Note that I still give the string Value a value. If you’re truly
concerned about efficiency, you could leave this out. When we’re
expecting an operator, we will only be testing Token anyhow, so the
value of the string won’t matter. But to me it seems to be good
practice to give the thing a value just in case.
Try this new version with some realistic-looking code. You should
be able to separate any program into its individual tokens, with the
caveat that the two-character relops will scan into two separate to-
kens. That’s OK ... we’ll parse them that way.
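For instance, if you type the one-line test case x=x+1. at the
keyboard, the little main program above should print something like:
   x X
   = =
   x X
   + +
   # 1
   . .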
Now, in Part VII the function of Next was combined with proce-
dure Scan, which also checked every identifier against a list of key-
words and encoded each one that was found. As I mentioned at the
time, the last thing we would want to do is to use such a procedure
in places where keywords should not appear, such as in expressions.
If we did that, the keyword list would be scanned for every identifier
appearing in the code. Not good.
The right way to deal with that is to simply separate the functions
of fetching tokens and looking for keywords. The version of Scan
shown below does NOTHING but check for keywords. Notice that
it operates on the current token and does NOT advance the input
stream.
{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }
procedure Scan;
begin
if Token = ’x’ then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
There is one last detail. In the compiler there are a few places
that we must actually check the string value of the token. Mainly,
this is done to distinguish between the different END’s, but there
are a couple of other places. (I should note in passing that we could
always eliminate the need for matching END characters by encoding
each one to a different character. Right now we are definitely taking
the lazy man’s route.)
The routine that handles those cases is MatchString, which tests the
string value of the current token and then advances the input stream:
{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected(’’’’ + x + ’’’’);
Next;
end;
{--------------------------------------------------------------}
FIXING UP THE COMPILER
Armed with these new scanner procedures, we can now begin to fix
the compiler to use them properly. The changes are all quite minor,
but there are quite a few places where changes are necessary. Rather
than showing you each place, I will give you the general idea and
then just give the finished product.
First of all, the code for procedure Block doesn’t change, though
its function does:
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
’w’: DoWhile;
’R’: DoRead;
’W’: DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}
Remember that the new version of Scan doesn’t advance the in-
put stream, it only scans for keywords. The input stream must be
advanced by each procedure that Block calls.
In general, we have to replace every test on Look with a similar
test on Token. For example:
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Token) do begin
Push;
case Token of
’|’: BoolOr;
’~’: BoolXor;
end;
end;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Next;
Term;
PopAdd;
end;
{-------------------------------------------------------------}
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
Next;
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = ’l’ then begin
Next;
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString(’ENDIF’);
end;
{--------------------------------------------------------------}
Besides the changes required by the new scanner, I’ve also made a
few other improvements along the way:
1. I’ve deleted the two procedures Prog and Main, and combined
their functions into the main program. They didn’t seem to add
to program clarity ... in fact they seemed to just muddy things
up a little.
2. I’ve deleted the keywords PROGRAM and BEGIN from the key-
word list. Each one only occurs in one place, so it’s not necessary
to search for it.
3. Having been bitten by an overdose of cleverness, I’ve reminded
myself that TINY is supposed to be a minimalist program. There-
fore I’ve replaced the fancy handling of unary minus with the
dumbest one I could think of. A giant step backwards in code
quality, but a great simplification of the compiler. KISS is the
right place to use the other version.
4. I’ve added some error-checking routines such as CheckTable and
CheckDup, and replaced in-line code by calls to them. (A sketch
of these two routines appears right after this list.) This cleans
up a number of routines.
5. I’ve taken the error checking out of code generation routines like
Store, and put it in the parser where it belongs. See Assignment,
for example.
6. There was an error in InTable and Locate that caused them to
search all locations instead of only those with valid data in them.
They now search only valid cells. This allows us to eliminate the
initialization of the symbol table, which was done in Init.
7. Procedure AddEntry now has two arguments, which helps to
make things a bit more modular.
8. I’ve cleaned up the code for the relational operators by the ad-
dition of the new procedures CompareExpression and NextEx-
pression.
9. I fixed an error in the Read routine ... the earlier version did
not check for a valid variable name.
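Here is the sketch of CheckTable and CheckDup promised in item 4;
they are built entirely from routines you’ve already seen (InTable,
Undefined, and Duplicate):
{--------------------------------------------------------------}
{ Report an Undefined Identifier if Symbol is not in the Table }
procedure CheckTable(N: Symbol);
begin
if not InTable(N) then Undefined(N);
end;
{--------------------------------------------------------------}
{ Report a Duplicate Identifier if Symbol is already in the Table }
procedure CheckDup(N: Symbol);
begin
if InTable(N) then Duplicate(N);
end;
{--------------------------------------------------------------}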
CONCLUSION
The resulting compiler for TINY is given below. Other than the
removal of the keyword PROGRAM, it parses the same language as
before. It’s just a bit cleaner, and more importantly it’s considerably
more robust. I feel good about it.
The next installment will be another digression: the discussion of
semicolons and such that got me into this mess in the first place.
THEN we’ll press on into procedures and types. Hang in there with
me. The addition of those features will go a long way towards re-
moving KISS from the “toy language” category. We’re getting very
close to being able to write a serious compiler.
TINY VERSION 1.1
{--------------------------------------------------------------}
program Tiny11;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
LF = ^J;
LCount: integer = 0;
NEntry: integer = 0;
{--------------------------------------------------------------}
{ Type Declarations }
type Symbol = string[8];
SymTab = array[1..1000] of Symbol;
TabPtr = ^SymTab;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look : char; { Lookahead Character }
Token: char; { Encoded Token }
Value: string[16]; { Unencoded Token }
const MaxEntry = 100;
var ST : array[1..MaxEntry] of Symbol;
SType: array[1..MaxEntry] of char;
{--------------------------------------------------------------}
{ Definition of Keywords and Token Types }
const NKW = 9;
NKW1 = 10;
const KWlist: array[1..NKW] of Symbol =
(’IF’, ’ELSE’, ’ENDIF’, ’WHILE’, ’ENDWHILE’,
’READ’, ’WRITE’, ’VAR’, ’END’);
const KWcode: string[NKW1] = ’xileweRWve’;
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Report an Undefined Identifier }
procedure Undefined(n: string);
begin
Abort(’Undefined Identifier ’ + n);
end;
{--------------------------------------------------------------}
{ Report a Duplicate Identifier }
procedure Duplicate(n: string);
begin
Abort(’Duplicate Identifier ’ + n);
end;
{--------------------------------------------------------------}
{ Check to Make Sure the Current Token is an Identifier }
procedure CheckIdent;
begin
if Token <> ’x’ then Expected(’Identifier’);
end;
{--------------------------------------------------------------}
{ Scan the Current Identifier for Keywords }
procedure Scan;
begin
if Token = ’x’ then
Token := KWcode[Lookup(Addr(KWlist), Value, NKW) + 1];
end;
{--------------------------------------------------------------}
{ Match a Specific Input String }
procedure MatchString(x: string);
begin
if Value <> x then Expected(’’’’ + x + ’’’’);
Next;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{--------------------------------------------------------------}
{ Generate a Unique Label }
function NewLabel: string;
var S: string;
begin
Str(LCount, S);
NewLabel := ’L’ + S;
Inc(LCount);
end;
{--------------------------------------------------------------}
{ Post a Label To Output }
procedure PostLabel(L: string);
begin
WriteLn(L, ’:’);
end;
{---------------------------------------------------------------}
{ Clear the Primary Register }
procedure Clear;
begin
EmitLn(’CLR D0’);
end;
{---------------------------------------------------------------}
{ Negate the Primary Register }
procedure Negate;
begin
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Complement the Primary Register }
procedure NotIt;
begin
EmitLn(’NOT D0’);
end;
{---------------------------------------------------------------}
{ Load a Constant Value to Primary Register }
procedure LoadConst(n: string);
begin
Emit(’MOVE #’);
WriteLn(n, ’,D0’);
end;
{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name: string);
begin
if not InTable(Name) then Undefined(Name);
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{---------------------------------------------------------------}
{ Push Primary onto Stack }
procedure Push;
begin
EmitLn(’MOVE D0,-(SP)’);
end;
{---------------------------------------------------------------}
{ Add Top of Stack to Primary }
procedure PopAdd;
begin
EmitLn(’ADD (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }
procedure PopSub;
begin
EmitLn(’SUB (SP)+,D0’);
EmitLn(’NEG D0’);
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary }
procedure PopMul;
begin
EmitLn(’MULS (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary }
procedure PopDiv;
begin
EmitLn(’MOVE (SP)+,D7’);
EmitLn(’EXT.L D7’);
EmitLn(’DIVS D0,D7’);
EmitLn(’MOVE D7,D0’);
end;
{---------------------------------------------------------------}
{ AND Top of Stack with Primary }
procedure PopAnd;
begin
EmitLn(’AND (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ OR Top of Stack with Primary }
procedure PopOr;
begin
EmitLn(’OR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ XOR Top of Stack with Primary }
procedure PopXor;
begin
EmitLn(’EOR (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Compare Top of Stack with Primary }
procedure PopCompare;
begin
EmitLn(’CMP (SP)+,D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was = }
procedure SetEqual;
begin
EmitLn(’SEQ D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was != }
procedure SetNEqual;
begin
EmitLn(’SNE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was > }
procedure SetGreater;
begin
EmitLn(’SLT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was < }
procedure SetLess;
begin
EmitLn(’SGT D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was <= }
procedure SetLessOrEqual;
begin
EmitLn(’SGE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Set D0 If Compare was >= }
procedure SetGreaterOrEqual;
begin
EmitLn(’SLE D0’);
EmitLn(’EXT D0’);
end;
{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure Store(Name: string);
begin
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’)
end;
{---------------------------------------------------------------}
{ Branch Unconditional }
procedure Branch(L: string);
begin
EmitLn(’BRA ’ + L);
end;
{---------------------------------------------------------------}
{ Branch False }
procedure BranchFalse(L: string);
begin
EmitLn(’TST D0’);
EmitLn(’BEQ ’ + L);
end;
{---------------------------------------------------------------}
{ Read Variable to Primary Register }
procedure ReadIt(Name: string);
begin
EmitLn(’BSR READ’);
Store(Name);
end;
{ Write from Primary Register }
procedure WriteIt;
begin
EmitLn(’BSR WRITE’);
end;
{--------------------------------------------------------------}
{ Write Header Info }
procedure Header;
begin
WriteLn(’WARMST’, TAB, ’EQU $A01E’);
end;
{--------------------------------------------------------------}
{ Write the Prolog }
procedure Prolog;
begin
PostLabel(’MAIN’);
end;
{--------------------------------------------------------------}
{ Write the Epilog }
procedure Epilog;
begin
EmitLn(’DC WARMST’);
EmitLn(’END MAIN’);
end;
{---------------------------------------------------------------}
{ Allocate Storage for a Static Variable }
procedure Allocate(Name, Val: string);
begin
WriteLn(Name, ’:’, TAB, ’DC ’, Val);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Factor }
procedure BoolExpression; Forward;
procedure Factor;
begin
if Token = ’(’ then begin
Next;
BoolExpression;
MatchString(’)’);
end
else begin
if Token = ’x’ then
LoadVar(Value)
else if Token = ’#’ then
LoadConst(Value)
else Expected(’Math Factor’);
Next;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Next;
Factor;
PopMul;
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Next;
Factor;
PopDiv;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Math Term }
procedure Term;
begin
Factor;
while IsMulop(Token) do begin
Push;
case Token of
’*’: Multiply;
’/’: Divide;
end;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Next;
Term;
PopAdd;
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Next;
Term;
PopSub;
end;
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
if IsAddop(Token) then
Clear
else
Term;
while IsAddop(Token) do begin
Push;
case Token of
’+’: Add;
’-’: Subtract;
end;
end;
end;
{---------------------------------------------------------------}
{ Get Another Expression and Compare }
procedure CompareExpression;
begin
Expression;
PopCompare;
end;
{---------------------------------------------------------------}
{ Get The Next Expression and Compare }
procedure NextExpression;
begin
Next;
CompareExpression;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Equals" }
procedure Equal;
begin
NextExpression;
SetEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than or Equal" }
procedure LessOrEqual;
begin
NextExpression;
SetLessOrEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Not Equals" }
procedure NotEqual;
begin
NextExpression;
SetNEqual;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Less Than" }
procedure Less;
begin
Next;
case Token of
’=’: LessOrEqual;
’>’: NotEqual;
else begin
CompareExpression;
SetLess;
end;
end;
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Relational "Greater Than" }
procedure Greater;
begin
Next;
if Token = ’=’ then begin
NextExpression;
SetGreaterOrEqual;
end
else begin
CompareExpression;
SetGreater;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Relation }
procedure Relation;
begin
Expression;
if IsRelop(Token) then begin
Push;
case Token of
’=’: Equal;
’<’: Less;
’>’: Greater;
end;
end;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Factor with Leading NOT }
procedure NotFactor;
begin
if Token = ’!’ then begin
Next;
Relation;
NotIt;
end
else
Relation;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Term }
procedure BoolTerm;
begin
NotFactor;
while Token = ’&’ do begin
Push;
Next;
NotFactor;
PopAnd;
end;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Boolean OR }
procedure BoolOr;
begin
Next;
BoolTerm;
PopOr;
end;
{--------------------------------------------------------------}
{ Recognize and Translate an Exclusive Or }
procedure BoolXor;
begin
Next;
BoolTerm;
PopXor;
end;
{---------------------------------------------------------------}
{ Parse and Translate a Boolean Expression }
procedure BoolExpression;
begin
BoolTerm;
while IsOrOp(Token) do begin
Push;
case Token of
’|’: BoolOr;
’~’: BoolXor;
end;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
CheckTable(Value);
Name := Value;
Next;
MatchString(’=’);
BoolExpression;
Store(Name);
end;
{---------------------------------------------------------------}
{ Recognize and Translate an IF Construct }
procedure Block; Forward;
procedure DoIf;
var L1, L2: string;
begin
Next;
BoolExpression;
L1 := NewLabel;
L2 := L1;
BranchFalse(L1);
Block;
if Token = ’l’ then begin
Next;
L2 := NewLabel;
Branch(L2);
PostLabel(L1);
Block;
end;
PostLabel(L2);
MatchString(’ENDIF’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a WHILE Statement }
procedure DoWhile;
var L1, L2: string;
begin
Next;
L1 := NewLabel;
L2 := NewLabel;
PostLabel(L1);
BoolExpression;
BranchFalse(L2);
Block;
MatchString(’ENDWHILE’);
Branch(L1);
PostLabel(L2);
end;
{--------------------------------------------------------------}
{ Read a Single Variable }
procedure ReadVar;
begin
CheckIdent;
CheckTable(Value);
ReadIt(Value);
Next;
end;
{--------------------------------------------------------------}
{ Process a Read Statement }
procedure DoRead;
begin
Next;
MatchString(’(’);
ReadVar;
while Token = ’,’ do begin
Next;
ReadVar;
end;
MatchString(’)’);
end;
{--------------------------------------------------------------}
{ Process a Write Statement }
procedure DoWrite;
begin
Next;
MatchString(’(’);
Expression;
WriteIt;
while Token = ’,’ do begin
Next;
Expression;
WriteIt;
end;
MatchString(’)’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
’w’: DoWhile;
’R’: DoRead;
’W’: DoWrite;
else Assignment;
end;
Scan;
end;
end;
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc;
begin
Next;
if Token <> ’x’ then Expected(’Variable Name’);
CheckDup(Value);
AddEntry(Value, ’v’);
Allocate(Value, ’0’);
Next;
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
Scan;
while Token = ’v’ do
Alloc;
while Token = ’,’ do
Alloc;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
begin
GetChar;
Next;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
MatchString(’PROGRAM’);
Header;
TopDecls;
MatchString(’BEGIN’);
Prolog;
Block;
MatchString(’END’);
Epilog;
end.
{--------------------------------------------------------------}
Part XII
5 June 1989.
MISCELLANY
INTRODUCTION
SEMICOLONS
a=b c= d e=e+1
I suspect that this is the major ... perhaps ONLY ... reason for
semicolons: to keep programs from looking funny.
But the idea of stringing multiple statements together on a single
line is a dubious one at best. It’s not very good programming style,
and harks back to the days when it was considered important to
conserve cards. In these days of CRT’s and indented code, the clarity
of programs is far better served by keeping statements separate. It’s
still nice to have the OPTION of multiple statements, but it seems a
shame to keep programmers in slavery to the semicolon, just to keep
that one rare case from “looking funny.”
When I started in with KISS, I tried to keep an open mind. I
decided that I would use semicolons when it became necessary for
the parser, but not until then. I figured this would happen just
about the time I added the ability to spread statements over multiple
lines. But, as you can see, that never happened. The TINY compiler
is perfectly happy to parse the most complicated statement, spread
over any number of lines, without semicolons.
Still, there are people who have used semicolons for so long, they
feel naked without them. I’m one of them. Once I had KISS de-
fined sufficiently well, I began to write a few sample programs in the
language. I discovered, somewhat to my horror, that I kept putting
semicolons in anyway. So now I’m facing the prospect of a NEW rash
of compiler errors, caused by UNWANTED semicolons. Phooey!
Perhaps more to the point, there are readers out there who are
designing their own languages, which may include semicolons, or who
want to use the techniques of these tutorials to compile conventional
languages like C. In either case, we need to be able to deal with
semicolons.
SYNTACTIC SUGAR
This whole discussion brings up the issue of “syntactic sugar” ... con-
structs that are added to a language, not because they are needed,
but because they help make the programs look right to the program-
mer. After all, it’s nice to have a small, simple compiler, but it would
be of little use if the resulting language were cryptic and hard to pro-
gram. The language FORTH comes to mind (a premature OUCH!
for the barrage I know that one’s going to fetch me). If we can add
features to the language that make the programs easier to read and
understand, and if those features help keep the programmer from
making errors, then we should do so. Particularly if the constructs
don’t add much to the complexity of the language or its compiler.
The semicolon could be considered an example, but there are
plenty of others, such as the ’THEN’ in an IF-statement, the ’DO’ in
a WHILE-statement, and even the ’PROGRAM’ statement, which I
came within a gnat’s eyelash of leaving out of TINY. None of these
tokens add much to the syntax of the language ... the compiler can
figure out what’s going on without them. But some folks feel that
they DO add to the readability of programs, and that can be very
important.
There are two schools of thought on this subject, which are well
represented by two of our most popular languages, C and Pascal.
To the minimalists, all such sugar should be left out. They argue
that it clutters up the language and adds to the number of keystrokes
programmers must type. Perhaps more importantly, every extra to-
ken or keyword represents a trap lying in wait for the inattentive
programmer. If you leave out a token, misplace it, or misspell it, the
compiler will get you. So these people argue that the best approach
is to get rid of such things. These folks tend to like C, which has a
minimum of unnecessary keywords and punctuation.
Those from the other school tend to like Pascal. They argue that
having to type a few extra characters is a small price to pay for
legibility. After all, humans have to read the programs, too. Their
best argument is that each such construct is an opportunity to tell
the compiler that you really mean for it to do what you said to. The
sugary tokens serve as useful landmarks to help you find your way.
The differences are well represented by the two languages. The
most oft-heard complaint about C is that it is too forgiving. When
you make a mistake in C, the erroneous code is too often another
legal C construct. So the compiler just happily continues to compile,
and leaves you to find the error during debug. I guess that’s why
debuggers are so popular with C programmers.
On the other hand, if a Pascal program compiles, you can be pretty
sure that the program will do what you told it. If there is an error
at run time, it’s probably a design error.
The best example of useful sugar is the semicolon itself. Consider
the code fragment:
a=1+(2*b+c) b...
Since there is no operator connecting the token ’b’ with the rest
of the statement, the compiler will conclude that the expression ends
with the ’)’, and the ’b’ is the beginning of a new statement. But
suppose I have simply left out the intended operator, and I really
want to say:
a=1+(2*b+c)*b...
In this case the compiler will get an error, all right, but it won’t
be very meaningful since it will be expecting an ’=’ sign after the ’b’
that really shouldn’t be there.
If, on the other hand, I include a semicolon after the ’b’, THEN
there can be no doubt where I intend the statement to end. Syntac-
tic sugar, then, can serve a very useful purpose by providing some
additional insurance that we remain on track.
I find myself somewhere in the middle of all this. I tend to favor
the Pascal-ers’ view ... I’d much rather find my bugs at compile time
rather than run time. But I also hate to just throw verbosity in for no
apparent reason, as in COBOL. So far I’ve consistently left most of
the Pascal sugar out of KISS/TINY. But I certainly have no strong
feelings either way, and I also can see the value of sprinkling a little
sugar around just for the extra insurance that it brings. If you like
this latter approach, things like that are easy to add. Just remember
that, like the semicolon, each item of sugar is something that can
potentially cause a compile error by its omission.
There are two distinct ways in which semicolons are used in popular languages. In Pascal, the semicolon is regarded as a statement SEPARATOR: it goes between statements, and no semicolon is required after the last statement in a block. In C and Ada, on the other hand, the semicolon is a statement TERMINATOR, and follows every statement.
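In rough BNF (my own paraphrase, not either language's official grammar), the two conventions come out as:
<block> ::= <statement> ( ';' <statement> )*      { separator: Pascal }
<block> ::= ( <statement> ';' )*                  { terminator: C and Ada }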
Of the two syntaxes, the Pascal one seems on the face of it more
rational, but experience has shown that it leads to some strange dif-
ficulties. People get so used to typing a semicolon after every state-
ment that they tend to type one after the last statement in a block,
also. That usually doesn’t cause any harm ... it just gets treated as
a null statement. Many Pascal programmers, including yours truly,
do just that. But there is one place you absolutely CANNOT type
a semicolon, and that’s right before an ELSE. This little gotcha has
cost me many an extra compilation, particularly when the ELSE is
added to existing code. So the C/Ada choice turns out to be bet-
ter. Apparently Niklaus Wirth thinks so, too: in his Modula-2, he
abandoned the Pascal approach.
Given either of these two syntaxes, it’s an easy matter (now that
we’ve reorganized the parser!) to add these features to our parser.
Let’s take the last case first, since it’s simpler.
To begin, I’ve made things easy by introducing a new recognizer:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
MatchString(’;’);
end;
{--------------------------------------------------------------}
This procedure works very much like our old Match. It insists on
finding a semicolon as the next token. Having found it, it skips to
the next one.
Since a semicolon follows a statement, procedure Block is almost
the only one we need to change:
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Scan;
while not(Token in [’e’, ’l’]) do begin
case Token of
’i’: DoIf;
’w’: DoWhile;
’R’: DoRead;
’W’: DoWrite;
’x’: Assignment;
end;
Semi;
Scan;
end;
end;
{--------------------------------------------------------------}
Note carefully the subtle change in the case statement. The call to
Assignment is now guarded by a test on Token. This is to avoid call-
ing Assignment when the token is a semicolon (which could happen
if the statement is null).
Since declarations are also statements, we also need to add a call
to Semi within procedure TopDecls:
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
Scan;
while Token = ’v’ do begin
Alloc;
while Token = ’,’ do
Alloc;
Semi;
end;
end;
{--------------------------------------------------------------}
It’s as easy as that. Try it with a copy of TINY and see how you
like it.
The Pascal version is a little trickier, but it still only requires
minor changes, and those only to procedure Block. To keep things
as simple as possible, let’s split the procedure into two parts. The
following procedure handles just one statement:
{--------------------------------------------------------------}
{ Parse and Translate a Single Statement }
procedure Statement;
begin
Scan;
case Token of
’i’: DoIf;
’w’: DoWhile;
’R’: DoRead;
’W’: DoWrite;
’x’: Assignment;
end;
end;
{--------------------------------------------------------------}
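With Statement in hand, Block itself carries the semicolon logic, treating the semicolon as a separator in true Pascal fashion. A sketch of the obvious way to write it (using the same Next routine that advances to the following token):
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
Statement;
while Token = ';' do begin
Next;
Statement;
end;
end;
{--------------------------------------------------------------}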
That sure didn’t hurt, did it? We can now parse semicolons in
Pascal-like fashion.
A COMPROMISE
Now that we know how to deal with semicolons, does that mean that
I’m going to put them in KISS/TINY? Well, yes and no. I like the
extra sugar and the security that comes with knowing for sure where
the ends of statements are. But I haven’t changed my dislike for the
compilation errors associated with semicolons.
So I have what I think is a nice compromise: Make them OP-
TIONAL!
Consider the following version of Semi:
{--------------------------------------------------------------}
{ Match a Semicolon }
procedure Semi;
begin
if Token = ’;’ then Next;
end;
{--------------------------------------------------------------}
COMMENTS
SINGLE-CHARACTER DELIMITERS
The simplest kind of comment to handle is the Turbo Pascal kind, delimited on each side by a single character: the curly braces. And the simplest place to deal with them is at the lowest level, by stripping them out as the characters are read. To do that, rename the existing GetChar to GetCharX, and use the two routines below:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
while Look <> ’}’ do
GetCharX;
GetCharX;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Get Character from Input Stream }
{ Skip Any Comments }
procedure GetChar;
begin
GetCharX;
if Look = ’{’ then SkipComment;
end;
{--------------------------------------------------------------}
Code this up and give it a try. You’ll find that you can, indeed,
bury comments anywhere you like. The comments never even get into
the parser proper ... every call to GetChar just returns any character
that’s NOT part of a comment.
As a matter of fact, while this approach gets the job done, and may even be perfectly satisfactory for you, it does its job a little TOO well. Since the comments vanish before the parser ever sees them, they can't serve as delimiters between tokens, and most languages say a comment should behave like a space. A better approach is to treat the '{' as just one more kind of whitespace, and strip the comments out in SkipWhite instead:
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB, CR, LF, ’{’];
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do begin
if Look = ’{’ then
SkipComment
else
GetChar;
end;
end;
{--------------------------------------------------------------}
Note that SkipWhite is written so that we will skip over any com-
bination of whitespace characters and comments, in one call.
OK, give this one a try, too. You’ll find that it will let a comment
serve to delimit tokens. It’s worth mentioning that this approach
also gives us the ability to handle curly braces within quoted strings,
since within such strings we will not be testing for or skipping over
whitespace.
There’s one last item to deal with: Nested comments. Some pro-
grammers like the idea of nesting comments, since it allows you to
comment out code during debugging. The code I’ve given here won’t
allow that and, again, neither will Turbo Pascal.
But the fix is incredibly easy. All we need to do is to make Skip-
Comment recursive:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
while Look <> ’}’ do begin
GetChar;
if Look = ’{’ then SkipComment;
end;
GetChar;
end;
{--------------------------------------------------------------}
MULTI-CHARACTER DELIMITERS
That’s all well and good for cases where a comment is delimited by
single characters, but what about the cases such as C or standard
Pascal, where two characters are required? Well, the principles are
still the same, but we have to change our approach quite a bit. I’m
sure it won’t surprise you to learn that things get harder in this case.
For the multi-character situation, the easiest thing to do is to inter-
cept the left delimiter back at the GetChar stage. We can “tokenize”
it right there, replacing it by a single character.
Let’s assume we’re using the C delimiters ’/*’ and ’*/’. First, we
need to go back to the “GetCharX” approach. In yet another copy
of your compiler, rename GetChar to GetCharX and then enter the
following new procedure GetChar:
{--------------------------------------------------------------}
{ Read New Character. Intercept ’/*’ }
procedure GetChar;
begin
if TempChar <> ’ ’ then begin
Look := TempChar;
TempChar := ’ ’;
end
else begin
GetCharX;
if Look = ’/’ then begin
Read(TempChar);
if TempChar = ’*’ then begin
Look := ’{’;
TempChar := ’ ’;
end;
end;
end;
end;
{--------------------------------------------------------------}
As you can see, what this procedure does is to intercept every oc-
currence of ’/’. It then examines the NEXT character in the stream.
If the character is a ’*’, then we have found the beginning of a com-
ment, and GetChar will return a single character replacement for it.
(For simplicity, I’m using the same ’{’ character as I did for Pascal.
If you were writing a C compiler, you’d no doubt want to pick some
other character that’s not used elsewhere in C. Pick anything you
like ... even $FF, anything that’s unique.)
If the character following the ’/’ is NOT a ’*’, then GetChar tucks
it away in the new global TempChar, and returns the ’/’.
Note that you need to declare this new variable and initialize it
to ’ ’. I like to do things like that using the Turbo “typed constant”
construct:
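const TempChar: char = ' ';     { presumably the declaration that belongs here }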
ONE-SIDED COMMENTS
So far I’ve shown you how to deal with any kind of comment delimited
on the left and the right. That only leaves the one-sided comments
like those in assembler language or in Ada, that are terminated by
the end of the line. In a way, that case is easier. The only procedure
that would need to be changed is SkipComment, which must now
terminate at the newline characters:
{--------------------------------------------------------------}
{ Skip A Comment Field }
procedure SkipComment;
begin
repeat
GetCharX;
until Look = CR;
GetChar;
end;
{--------------------------------------------------------------}
CONCLUSION
At this point we now have the ability to deal with both comments
and semicolons, as well as other kinds of syntactic sugar. I’ve shown
you several ways to deal with each, depending upon the convention
desired. The only issue left is: which of these conventions should we
use in KISS/TINY?
For the reasons that I've given as we went along, I'm choosing the following:
• Semicolons are TERMINATORS, not separators
• Semicolons are OPTIONAL
• Comments are delimited by curly braces
• Comments MAY be nested
Put the code corresponding to these cases into your copy of TINY.
You now have TINY Version 1.2.
Now that we have disposed of these sideline issues, we can finally
get back into the mainstream. In the next installment, we’ll talk
about procedures and parameter passing, and we’ll add these impor-
tant features to TINY. See you then.
Part XIII
27 August 1989.
PROCEDURES
INTRODUCTION
When I first began this series, I told you that we would use sev-
eral “tricks” to make things easy, and to let us learn the concepts
without getting too bogged down in the details. Among these tricks
was the idea of looking at individual pieces of a compiler at a time,
i.e. performing experiments using the Cradle as a base. When we
studied expressions, for example, we dealt with only that part of
compiler theory. When we studied control structures, we wrote a dif-
ferent program, still based on the Cradle, to do that part. We only
incorporated these concepts into a complete language fairly recently.
These techniques have served us very well indeed, and led us to the
development of a compiler for TINY version 1.3.
When I first began this session, I tried to build upon what we had
already done, and just add the new features to the existing compiler.
That turned out to be a little awkward and tricky ... much too much
to suit me.
I finally figured out why. In this series of experiments, I had aban-
doned the very useful techniques that had allowed us to get here, and
without meaning to I had switched over into a new method of work-
ing, that involved incremental changes to the full TINY compiler.
You need to understand that what we are doing here is a little
unique. There have been a number of articles, such as the Small C
articles by Cain and Hendrix, that presented finished compilers for
one language or another. This is different. In this series of tutorials,
you are watching me design and implement both a language and a
compiler, in real time.
In the experiments that I’ve been doing in preparation for this
article, I was trying to inject the changes into the TINY compiler in
such a way that, at every step, we still had a real, working compiler.
In other words, I was attempting an incremental enhancement of the
language and its compiler, while at the same time explaining to you
what I was doing.
That’s a tough act to pull off! I finally realized that it was dumb to
try. Having gotten this far using the idea of small experiments based
on single-character tokens and simple, special-purpose programs, I
had abandoned them in favor of working with the full compiler. It
wasn’t working.
So we’re going to go back to our roots, so to speak. In this install-
ment and the next, I'll be using single-character tokens again as we study the concepts of procedures and parameter passing.
After all this time, you don’t need more buildup than that, so let’s
waste no more time and dive right in.
THE BASICS
All modern CPU’s provide direct support for procedure calls, and
the 68000 is no exception. For the 68000, the call is a BSR (PC-
relative version) or JSR, and the return is RTS. All we have to do is
to arrange for the compiler to issue these commands at the proper
place.
<ident> = <ident>
When you have the experimental program running, try it out with an input like:
va (for VAR A)
vb (for VAR B)
vc (for VAR C)
b (for BEGIN)
a=b
b=c
e. (for END.)
As usual, you should also make some deliberate errors, and verify
that the program catches them correctly.
DECLARING A PROCEDURE
If you’re satisfied that our little program works, then it’s time to deal
with the procedures. Since we haven’t talked about
parameters yet, we’ll begin by considering only procedures that
have no parameter lists.
As a start, let’s consider a simple program with a procedure, and
think about the code we’d like to see generated for it:
PROGRAM FOO;
.
.
PROCEDURE BAR;              BAR:
BEGIN                       .
.                           .
.                           .
END;                        RTS
Note that I’ve added a new code generation routine, Return, which
merely emits an RTS instruction. The creation of that routine is “left
as an exercise for the student.”
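If you'd rather skip the exercise, Return can hardly be anything but:
{--------------------------------------------------------------}
{ Write the RTS Instruction }
procedure Return;
begin
EmitLn('RTS');
end;
{--------------------------------------------------------------}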
To finish this version, add the following line within the Case state-
ment in DoBlock:
’p’: DoProc;
I should mention that this structure for declarations, and the BNF
that drives it, differs from standard Pascal. In the Jensen & Wirth
definition of Pascal, variable declarations, in fact ALL kinds of dec-
larations, must appear in a specific sequence, i.e. labels, constants,
types, variables, procedures, and main program. To follow such a
scheme, we should separate the two declarations, and have code in
the main program something like
DoVars;
DoProcs;
DoMain;
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure DoMain;
begin
Match(’b’);
Fin;
Prolog;
DoBlock;
Epilog;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
TopDecls;
DoMain;
end.
{--------------------------------------------------------------}
Note that DoProc and DoMain are not quite symmetrical. DoProc
uses a call to BeginBlock, whereas DoMain cannot. That’s because
a procedure is signaled by the keyword PROCEDURE (abbreviated
by a ’p’ here), while the main program gets no keyword other than
the BEGIN itself.
And THAT brings up an interesting question: WHY?
If we look at the structure of C programs, we find that all functions
are treated just alike, except that the main program happens to be
identified by its name, “main.” Since C functions can appear in any
order, the main program can also be anywhere in the compilation
unit.
In Pascal, on the other hand, all variables and procedures must
be declared before they’re used, which means that there is no point
putting anything after the main program ... it could never be ac-
cessed. The “main program” is not identified at all, other than being
that part of the code that comes after the global BEGIN. In other
words, if it ain’t anything else, it must be the main program.
This causes no small amount of confusion for beginning program-
mers, and for big Pascal programs sometimes it’s difficult to find the
beginning of the main program at all. This leads to conventions such
as identifying it in comments:
BEGIN { of MAIN }
An easy cure is to give the main program a keyword of its own, PROGRAM (abbreviated here by a 'P'), and treat it as just one more kind of declaration. The code also looks much better, at least in the sense that DoMain and DoProc look more alike:
{--------------------------------------------------------------}
{ Parse and Translate a Main Program }
procedure DoMain;
var N: char;
begin
Match(’P’);
N := GetName;
Fin;
if InTable(N) then Duplicate(N);
Prolog;
BeginBlock;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
while Look <> ’.’ do begin
case Look of
’v’: Decl;
’p’: DoProc;
’P’: DoMain;
else Abort(’Unrecognized Keyword ’ + Look);
end;
Fin;
end;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
TopDecls;
Epilog;
end.
{--------------------------------------------------------------}
Since the declaration of the main program is now within the loop
of TopDecls, that does present some difficulties. How do we ensure
that it’s the last thing in the file? And how do we ever exit from
the loop? My answer for the second question, as you can see, was
to bring back our old friend the period. Once the parser sees that,
we’re done.
To answer the first question: it depends on how far we’re willing
to go to protect the programmer from dumb mistakes. In the code
that I’ve shown, there’s nothing to keep the programmer from adding
code after the main program ... even another main program. The
code will just not be accessible. However, we COULD access it via a
FORWARD statement, which we’ll be providing later. As a matter
of fact, many assembler language programmers like to use the area
just after the program to declare large, uninitialized data blocks, so
there may indeed be some value in not requiring the main program
to be last. We’ll leave it as it is.
If we decide that we should give the programmer a little more help
than that, it’s pretty easy to add some logic to kick us out of the loop
once the main program has been processed. Or we could at least flag
an error if someone tries to include two mains.
If you’re satisfied that things are working, let’s address the second
half of the equation ... the call.
Consider the BNF for a procedure call: it's nothing more than an identifier, which is exactly the way an assignment statement begins. So the parser can't tell a call from an assignment until it has read the name and looked it up in the symbol table. That means Assignment has to take the name as an argument, and a new routine gets to make the decision:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment(Name: char);
begin
Match(’=’);
Expression;
StoreVar(Name);
end;
{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }
procedure AssignOrProc;
var Name: char;
begin
Name := GetName;
case TypeOf(Name) of
’ ’: Undefined(Name);
’v’: Assignment(Name);
’p’: CallProc(Name);
else Abort(’Identifier ’ + Name +
’ Cannot Be Used Here’);
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure DoBlock;
begin
while not(Look in [’e’]) do begin
AssignOrProc;
Fin;
end;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Call a Procedure }
procedure CallProc(N: char);
begin
EmitLn(’BSR ’ + N);
end;
{--------------------------------------------------------------}
Well, at this point we have a compiler that can deal with proce-
dures. It’s worth noting that procedures can call procedures to any
depth. So even though we don’t allow nested DECLARATIONS,
there is certainly nothing to keep us from nesting CALLS, just as
we would expect to do in any language. We’re getting there, and it
wasn’t too hard, was it?
PASSING PARAMETERS
Again, we all know the basic idea of passed parameters, but let’s
review them just to be safe.
In general the procedure is given a parameter list, for example
PROCEDURE FOO(X, Y, Z)
Note that there is already an implicit decision built into this syn-
tax. Some languages, such as Pascal and Ada, permit parameter lists
to be optional. If there are no parameters, you simply leave off the
parens completely. Other languages, like C and Modula 2, require
the parens even if the list is empty. Clearly, the example we just fin-
ished corresponds to the former point of view. But to tell the truth
I prefer the latter. For procedures alone, the decision would seem to
favor the "listless" approach. The statement
Initialize;
standing alone, can only mean a procedure call. For the experiments here, though, I'll go ahead and require the parens on both the declaration and the call. Here's the code to parse the formal parameter list of a declaration:
{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }
procedure FormalList;
begin
Match(’(’);
if Look <> ’)’ then begin
FormalParam;
while Look = ’,’ do begin
Match(’,’);
FormalParam;
end;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
begin
Match(’p’);
N := GetName;
FormalList;
Fin;
if InTable(N) then Duplicate(N);
ST[N] := ’p’;
PostLabel(N);
BeginBlock;
Return;
end;
{--------------------------------------------------------------}
For now, the code for FormalParam is just a dummy one that
simply skips the parameter name:
{--------------------------------------------------------------}
{ Process a Formal Parameter }
procedure FormalParam;
var Name: char;
begin
Name := GetName;
end;
{--------------------------------------------------------------}
For the actual procedure call, there must be similar code to process
the actual parameter list:
{--------------------------------------------------------------}
{ Process an Actual Parameter }
procedure Param;
var Name: char;
begin
Name := GetName;
end;
{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }
procedure ParamList;
begin
Match(’(’);
if Look <> ’)’ then begin
Param;
while Look = ’,’ do begin
Match(’,’);
Param;
end;
end;
Match(’)’);
end;
{--------------------------------------------------------------}
{ Process a Procedure Call }
procedure CallProc(Name: char);
begin
ParamList;
Call(Name);
end;
{--------------------------------------------------------------}
OK, if you’ll add all this code to your translator and try it out,
you’ll find that you can indeed parse the syntax properly. I’ll note
in passing that there is NO checking to make sure that the number
(and, later, types) of formal and actual parameters match up. In a
production compiler, we must of course do this. We’ll ignore the issue
now if for no other reason than that the structure of our symbol table
doesn’t currently give us a place to store the necessary information.
Later on, we’ll have a place for that data and we can deal with the
issue then.
THE SEMANTICS OF PARAMETERS
So far we've dealt with the SYNTAX of parameter passing, and we've
got the parsing mechanisms in place to handle it. Next, we have
to look at the SEMANTICS, i.e., the actions to be taken when we
encounter parameters. This brings us square up against the issue of
the different ways parameters can be passed.
There is more than one way to pass a parameter, and the way we
do it can have a profound effect on the character of the language.
So this is another of those areas where I can’t just give you my
solution. Rather, it’s important that we spend some time looking at
the alternatives so that you can go another route if you choose to.
There are two main ways parameters are passed:
• By value
• By reference (address)
The old FORTRAN way was pass-by-reference: the only thing actually handed over was the address of each parameter, which made for a very tight coupling between the subroutine and its caller. In effect, it gave the subroutine complete access to all variables that appeared in the parameter list.
Many times, we didn’t want to actually change a parameter, but
only use it as an input. For example, we might pass an element count
to a subroutine, and wish we could then use that count within a DO-
loop. To avoid changing the value in the calling program, we had to
make a local copy of the input parameter, and operate only on the
copy. Some FORTRAN programmers, in fact, made it a practice to
copy ALL parameters except those that were to be used as return
values. Needless to say, all this copying defeated a good bit of the
efficiency associated with the approach.
There was, however, an even more insidious problem, which was
not really just the fault of the “pass by reference” convention, but a
bad convergence of several implementation decisions.
Suppose we have a subroutine:
SUBROUTINE FOO(X, Y, N)
and we'd like to pass an expression, not a variable, for N:
CALL FOO(A, B, J + 1)
An expression has no address of its own, so the earliest FORTRANs simply didn't allow it. The programmer had to write
K = J + 1
CALL FOO(A, B, K)
Here again, there was copying required, and the burden was on the programmer to do it. Not good.
Later FORTRAN implementations got rid of this by allowing ex-
pressions as parameters. What they did was to assign a compiler-
generated variable, store the value of the expression in the variable,
and then pass the address of the expression.
So far, so good. Even if the subroutine mistakenly altered the
anonymous variable, who was to know or care? On the next call, it
would be recalculated anyway.
A call that passes a bare literal, such as
CALL FOO(A, B, 4)
could be handled the same way. But there is an alternative to all this address-passing: pass-by-value. This means that the value of the actual parameter is COPIED into a separate value used only for the call. Since the value passed is a copy, the called procedure can use it as a local variable and modify it any way it likes. The value in the caller will not be changed.
It may seem at first that this is a bit inefficient, because of the
need to copy the parameter. But remember that we’re going to have
to fetch SOME value to pass anyway, whether it be the parameter
itself or an address for it. Inside the subroutine, using pass-by-value
is definitely more efficient, since we eliminate one level of indirection.
Finally, we saw earlier that with FORTRAN, it was often necessary to
make copies within the subroutine anyway, so pass-by-value reduces
the number of local variables. All in all, pass-by-value is better.
Except for one small little detail: if all parameters are passed by
value, there is no way for a called procedure to return a result to
its caller! The parameter passed is NOT altered in the caller, only
in the called procedure. Clearly, that won’t get the job done.
There have been two answers to this problem, which are equiv-
alent. In Pascal, Wirth provides for VAR parameters, which are
passed-by-reference. What a VAR parameter is, in fact, is none other
than our old friend the FORTRAN parameter, with a new name and
paint job for disguise. Wirth neatly gets around the “changing a
literal” problem as well as the “address of an expression” problem,
by the simple expedient of allowing only a variable to be the actual
parameter. In other words, it’s the same restriction that the earliest
FORTRANs imposed.
C does the same thing, but explicitly. In C, ALL parameters are
passed by value. One kind of variable that C supports, however, is
the pointer. So by passing a pointer by value, you in effect pass what
it points to by reference. In some ways this works even better yet,
because even though you can change the variable pointed to all you
like, you still CAN’T change the pointer itself. In a function such as
strcpy, for example, where the pointers are incremented as the string
is copied, we are really only incrementing copies of the pointers, so
the values of those pointers in the calling procedure still remain as
they were. To modify a pointer, you must pass a pointer to the
pointer.
Since we are simply performing experiments here, we’ll look at
BOTH pass-by-value and pass-by-reference. That way, we’ll be able
to use either one as we need to. It's worth mentioning, though, that supporting both in the same program will have to wait until we've dealt with types.
PASS-BY-VALUE
Let’s just try some simple-minded things and see where they lead us.
Let’s begin with the pass-by-value case. Consider the procedure call:
FOO(X, Y)
Almost the only reasonable way to pass the data is through the
CPU stack. So the code we’d like to see generated might look some-
thing like this:
.
.
Value of X (2 bytes)
Value of Y (2 bytes)
SP --> Return Address (4 bytes)
So the values of the parameters have addresses that are fixed offsets
from the stack pointer. In this example, the addresses are:
X: 6(SP)
Y: 4(SP)
Now suppose the called procedure is:
PROCEDURE FOO(A, B)
BEGIN
A = B
END
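Given those offsets, the code generated for that body ought to come out something like:
FOO:    MOVE 4(SP),D0        ; fetch B
        MOVE D0,6(SP)        ; store into A
        RTS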
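To generate references like those, the compiler has to know which names are formal parameters and which position each one occupies. The routines that follow keep a little symbol table just for that purpose; they rely on two new globals, which I'm assuming were declared along these lines:
var Params: Array['A'..'Z'] of integer;    { parameter number for each name; 0 means not a parameter }
NumParams: integer;                        { how many parameters have been declared }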
{--------------------------------------------------------------}
{ Initialize Parameter Table to Null }
procedure ClearParams;
var i: char;
begin
for i := ’A’ to ’Z’ do
Params[i] := 0;
NumParams := 0;
end;
{--------------------------------------------------------------}
We’ll put a call to this procedure in Init, and also at the end of
DoProc:
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
GetChar;
SkipWhite;
for i := ’A’ to ’Z’ do
ST[i] := ’ ’;
ClearParams;
end;
{--------------------------------------------------------------}
.
.
.
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
begin
Match(’p’);
N := GetName;
FormalList;
Fin;
if InTable(N) then Duplicate(N);
ST[N] := ’p’;
PostLabel(N);
BeginBlock;
Return;
ClearParams;
end;
{--------------------------------------------------------------}
Note that the call within DoProc ensures that the table will be
clear when we’re in the main program.
OK, now we need a few procedures to work with the table. The
next few functions are essentially copies of InTable, TypeOf, etc.:
{--------------------------------------------------------------}
{ Find the Parameter Number }
function ParamNumber(N: char): integer;
begin
ParamNumber := Params[N];
end;
{--------------------------------------------------------------}
{ See if an Identifier is a Parameter }
function IsParam(N: char): boolean;
begin
IsParam := Params[N] <> 0;
end;
{--------------------------------------------------------------}
{ Add a New Parameter to Table }
procedure AddParam(Name: char);
begin
if IsParam(Name) then Duplicate(Name);
Inc(NumParams);
Params[Name] := NumParams;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }
procedure LoadParam(N: integer);
var Offset: integer;
begin
Offset := 4 + 2 * (NumParams - N);
Emit(’MOVE ’);
WriteLn(Offset, ’(SP),D0’);
end;
{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }
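{ (The bodies below are a reconstruction: StoreParam mirrors LoadParam
  above, and Push is the same stack-push routine used in earlier
  installments of this series.) }
procedure StoreParam(N: integer);
var Offset: integer;
begin
Offset := 4 + 2 * (NumParams - N);
Emit('MOVE D0,');
WriteLn(Offset, '(SP)');
end;
{--------------------------------------------------------------}
{ Push the Primary Register onto the Stack }
procedure Push;
begin
EmitLn('MOVE D0,-(SP)');
end;
{--------------------------------------------------------------}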
(The last routine is one we've seen before, but it wasn't in this vestigial version of the program.)
With those preliminaries in place, we’re ready to deal with the
semantics of procedures with calling lists (remember, the code to
deal with the syntax is already in place).
Let’s begin by processing a formal parameter. All we have to do
is to add each parameter to the parameter symbol table:
{--------------------------------------------------------------}
{ Process a Formal Parameter }
procedure FormalParam;
begin
AddParam(GetName);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Get Type of Symbol }
function TypeOf(n: char): char;
begin
if IsParam(n) then
TypeOf := ’f’
else
TypeOf := ST[n];
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Decide if a Statement is an Assignment or Procedure Call }
procedure AssignOrProc;
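{ (What follows is a reconstruction of the routines that belong here:
  the idea, as the text below explains, is that every name is treated
  as either a formal parameter or a global, depending on the parameter
  table. LoadVar is assumed to be the global-variable loader of this
  vestigial program, matching the StoreVar used earlier.) }
var Name: char;
begin
Name := GetName;
case TypeOf(Name) of
' ': Undefined(Name);
'v', 'f': Assignment(Name);
'p': CallProc(Name);
else Abort('Identifier ' + Name + ' Cannot Be Used Here');
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Expression (Vestigial Version) }
procedure Expression;
var Name: char;
begin
Name := GetName;
if IsParam(Name) then
LoadParam(ParamNumber(Name))
else
LoadVar(Name);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment(Name: char);
begin
Match('=');
Expression;
if IsParam(Name) then
StoreParam(ParamNumber(Name))
else
StoreVar(Name);
end;
{--------------------------------------------------------------}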
As you can see, these procedures will treat every variable name
encountered as either a formal parameter or a global variable, de-
pending on whether or not it appears in the parameter symbol table.
Remember that we are using only a vestigial form of Expression. In
the final program, the change shown here will have to be added to
Factor, not Expression.
The rest is easy. We need only add the semantics to the actual
procedure call, which we can do with one new line of code:
{--------------------------------------------------------------}
{ Process an Actual Parameter }
procedure Param;
begin
Expression;
Push;
end;
{--------------------------------------------------------------}
That’s it. Add these changes to your program and give it a try.
Try declaring one or two procedures, each with a formal parameter list, then write some calls to them and look over the code that comes out.
WHAT’S WRONG?
At this point, you might be thinking: Surely there’s more to this than
a few pushes and pops. There must be more to passing parameters
than this.
You’d be right. As a matter of fact, the code that we’re generating
here leaves a lot to be desired in several respects.
The most glaring oversight is that it’s wrong! If you’ll look back
at the code for a procedure call, you’ll see that the caller pushes each
actual parameter onto the stack before it calls the procedure. The
procedure USES that information, but it doesn’t change the stack
pointer. That means that the stuff is still there when we return.
SOMEBODY needs to clean up the stack, or we’ll soon be in very
hot water!
Fortunately, that’s easily fixed. All we have to do is to increment
the stack pointer when we’re finished.
Should we do that in the calling program, or the called procedure?
Some folks let the called procedure clean up the stack, since that
requires less code to be generated per call, and since the procedure,
after all, knows how many parameters it’s got. But that means that
it must do something with the return address so as not to lose it.
I prefer letting the caller clean up, so that the callee need only
execute a return. Also, it seems a bit more balanced, since the caller
is the one who “messed up” the stack in the first place. But THAT
means that the caller must remember how many items it pushed.
To make things easy, I’ve modified the procedure ParamList to be
a function instead of a procedure, returning the number of bytes
pushed:
{--------------------------------------------------------------}
{ Process the Parameter List for a Procedure Call }
function ParamList: integer;
var N: integer;
begin
N := 0;
Match(’(’);
if Look <> ’)’ then begin
Param;
inc(N);
while Look = ’,’ do begin
Match(’,’);
Param;
inc(N);
end;
end;
Match(’)’);
ParamList := 2 * N;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Process a Procedure Call }
procedure CallProc(Name: char);
var N: integer;
begin
N := ParamList;
Call(Name);
CleanStack(N);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Adjust the Stack Pointer Upwards by N Bytes }
procedure CleanStack(N: integer);
begin
if N > 0 then begin
Emit(’ADD #’);
WriteLn(N, ’,SP’);
end;
end;
{--------------------------------------------------------------}
OK, if you’ll add this code to your compiler, I think you’ll find
that the stack is now under control.
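For a call such as FOO(X, Y), the whole sequence should now come out roughly like this (assuming the vestigial Expression simply loads each name PC-relative):
        MOVE X(PC),D0        ; evaluate X
        MOVE D0,-(SP)        ; push it
        MOVE Y(PC),D0        ; evaluate Y
        MOVE D0,-(SP)        ; push it
        BSR FOO              ; call the procedure
        ADD #4,SP            ; caller cleans up two words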
The next problem has to do with our way of addressing relative to
the stack pointer. That works fine in our simple examples, since with
our rudimentary form of expressions nobody else is messing with the
stack. But consider a different example as simple as:
PROCEDURE FOO(A, B)
BEGIN
A = A + B
END
This would be wrong. When we push the first argument onto the
stack, the offsets for the two formal parameters are no longer 4 and
6, but are 6 and 8. So the second fetch would fetch A again, not B.
This is not the end of the world. I think you can see that all
we really have to do is to alter the offset every time we do a push,
and that in fact is what’s done if the CPU has no support for other
methods.
Fortunately, though, the 68000 does have such support. Recog-
nizing that this CPU would be used a lot with high-order language
compilers, Motorola decided to add direct support for this kind of
thing.
The problem, as you can see, is that as the procedure executes, the
stack pointer bounces up and down, and so it becomes an awkward
thing to use as a reference to access the formal parameters. The
solution is to define some OTHER register, and use it instead. This
register is typically set equal to the original stack pointer, and is
called the frame pointer.
The 68000 instruction set LINK lets you declare such a frame
pointer, and sets it equal to the stack pointer, all in one instruction.
As a matter of fact, it does even more than that. Since this register
may have been in use for something else in the calling procedure,
LINK also pushes the current value of that register onto the stack.
It can also add a value to the stack pointer, to make room for local
variables.
The complement of LINK is UNLK, which simply restores the
stack pointer and pops the old value back into the register.
Using these two instructions, the code for the previous example
becomes:
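For the same FOO(A, B) example with the body A = B, the output should now look something like this (using the 8(A6)-based offsets computed by LoadParam below):
FOO:    LINK A6,#0
        MOVE 8(A6),D0        ; fetch B
        MOVE D0,10(A6)       ; store into A
        UNLK A6
        RTS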
{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }
procedure ProcProlog(N: char);
begin
PostLabel(N);
EmitLn(’LINK A6,#0’);
end;
{--------------------------------------------------------------}
{ Write the Epilog for a Procedure }
procedure ProcEpilog;
begin
EmitLn(’UNLK A6’);
EmitLn(’RTS’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
begin
Match(’p’);
N := GetName;
FormalList;
Fin;
if InTable(N) then Duplicate(N);
ST[N] := ’p’;
ProcProlog(N);
BeginBlock;
ProcEpilog;
ClearParams;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }
procedure LoadParam(N: integer);
var Offset: integer;
begin
Offset := 8 + 2 * (NumParams - N);
Emit(’MOVE ’);
WriteLn(Offset, ’(A6),D0’);
end;
{--------------------------------------------------------------}
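StoreParam presumably gets the matching change:
{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }
procedure StoreParam(N: integer);
var Offset: integer;
begin
Offset := 8 + 2 * (NumParams - N);
Emit('MOVE D0,');
WriteLn(Offset, '(A6)');
end;
{--------------------------------------------------------------}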
(Note that the Offset computation changes to allow for the extra
push of A6.)
That’s all it takes. Try this out and see how you like it.
At this point we are generating some relatively nice code for pro-
cedures and procedure calls. Within the limitation that there are no
local variables (yet) and that no procedure nesting is allowed, this
code is just what we need.
There is still just one little small problem remaining:
CALL-BY-REFERENCE
This one is easy, now that we have the mechanisms already in place.
We only have to make a few changes to the code generation. Instead
of pushing a value onto the stack, we must push an address. As it
turns out, the 68000 has an instruction, PEA, that does just that.
We’ll be making a new version of the test program for this. Before
we do anything else,
>>>> MAKE A COPY <<<<
of the program as it now stands, because we’ll be needing it again
later.
Let’s begin by looking at the code we’d like to see generated for
the new case. Using the same example as before, we need the call
FOO(X, Y)
to be translated to:
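        PEA X(PC)        ; push the ADDRESS of X
        PEA Y(PC)        ; push the ADDRESS of Y
        BSR FOO          ; (a sketch of the intended output, using the PEA instruction described above)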
To get the count right, we must also change one line in ParamList:
ParamList := 4 * N;
That should do it. Give it a try and see if it’s generating reasonable-
looking code. As you will see, the code is hardly optimal, since we
reload the address register every time a parameter is needed. But
that’s consistent with our KISS approach here, of just being sure to
generate code that works. We’ll just make a little note here, that
here’s yet another candidate for optimization, and press on.
Now we’ve learned to process parameters using pass-by-value and
pass-by-reference. In the real world, of course, we’d like to be able
to deal with BOTH methods. We can’t do that yet, though, because
we have not yet had a session on types, and that has to come first.
If we can only have ONE method, then of course it has to be the
good ol’ FORTRAN method of pass-by-reference, since that’s the
only way procedures can ever return values to their caller.
This, in fact, will be one of the differences between TINY and
KISS. In the next version of TINY, we’ll use pass-by-reference for all
parameters. KISS will support both methods.
LOCAL VARIABLES
So far, we’ve said nothing about local variables, and our definition of
procedures doesn’t allow for them. Needless to say, that’s a big gap
in our language, and one that needs to be corrected.
Here again we are faced with a choice: Static or dynamic storage?
In those old FORTRAN programs, local variables were given static
storage just like global ones. That is, each local variable got a name
and allocated address, like any other variable, and was referenced by
that name.
That’s easy for us to do, using the allocation mechanisms already
in place. Remember, though, that local variables can have the same
names as global ones. We need to somehow deal with that by assign-
ing unique names for these variables.
The characteristic of static storage, of course, is that the data
survives a procedure call and return. When the procedure is called again, the data will still be there, which among other things rules out recursion. The alternative is dynamic storage: allocate the locals on the stack, right along with the passed parameters, and reach them through the frame pointer, with code like
MOVE 8(A6),D0
instead of the PC-relative reference used for a statically allocated variable,
MOVE X(PC),D0
Since the frame-pointer machinery is already in place, dynamic allocation is the approach we'll take. LoadParam and StoreParam only need to know where the parameters stop and the locals begin.
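They get that from a new global, Base; I'm assuming it was declared along with the other variables, something like:
var Base: integer;      { number of formal parameters; set at the end of FormalList }
With that in place, here are the frame-pointer versions: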
{--------------------------------------------------------------}
{ Load a Parameter to the Primary Register }
procedure LoadParam(N: integer);
var Offset: integer;
begin
Offset := 8 + 2 * (Base - N);
Emit(’MOVE ’);
WriteLn(Offset, ’(A6),D0’);
end;
{--------------------------------------------------------------}
{ Store a Parameter from the Primary Register }
procedure StoreParam(N: integer);
var Offset: integer;
begin
Offset := 8 + 2 * (Base - N);
Emit(’MOVE D0,’);
WriteLn(Offset, ’(A6)’);
end;
{--------------------------------------------------------------}
The idea is that the value of Base will be frozen after we have
processed the formal parameters, and won’t increase further as the
new local variables are inserted in the symbol table. This is taken
care of at the end of FormalList:
{--------------------------------------------------------------}
{ Process the Formal Parameter List of a Procedure }
procedure FormalList;
begin
Match(’(’);
if Look <> ’)’ then begin
FormalParam;
while Look = ’,’ do begin
Match(’,’);
FormalParam;
end;
end;
Match(’)’);
Fin;
Base := NumParams;
NumParams := NumParams + 4;
end;
{--------------------------------------------------------------}
(We add four words to make allowances for the return address and
old frame pointer, which end up between the formal parameters and
the locals.)
About all we need to do next is to install the semantics for declar-
ing local variables into the parser. The routines are very similar to
Decl and TopDecls:
{--------------------------------------------------------------}
{ Parse and Translate a Local Data Declaration }
procedure LocDecl;
var Name: char;
begin
Match(’v’);
AddParam(GetName);
Fin;
end;
{--------------------------------------------------------------}
{ Parse and Translate Local Declarations }
function LocDecls: integer;
var n: integer;
begin
n := 0;
while Look = ’v’ do begin
LocDecl;
inc(n);
end;
LocDecls := n;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Procedure Declaration }
procedure DoProc;
var N: char;
k: integer;
begin
Match(’p’);
N := GetName;
if InTable(N) then Duplicate(N);
ST[N] := ’p’;
FormalList;
k := LocDecls;
ProcProlog(N, k);
BeginBlock;
ProcEpilog;
ClearParams;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Write the Prolog for a Procedure }
procedure ProcProlog(N: char; k: integer);
begin
PostLabel(N);
Emit(’LINK A6,#’);
WriteLn(-2 * k)
end;
{--------------------------------------------------------------}
That should do it. Add these changes and see how they work.
CONCLUSION
Part XIV
26 May 1990.
TYPES
INTRODUCTION
Before diving into the tutorial, I think you’d like to know where we
are going from here ... especially since it’s been so long since the last
installment.
I have not been idle in the meantime. What I’ve been doing is
reorganizing the compiler itself into Turbo Units. One of the prob-
lems I’ve encountered is that as we’ve covered new areas and thereby
added features to the TINY compiler, it’s been getting longer and
longer. I realized a couple of installments back that this was caus-
ing trouble, and that’s why I’ve gone back to using only compiler
fragments for the last installment and this one. The problem is that
it just seems dumb to have to reproduce the code for, say, process-
ing boolean exclusive OR’s, when the subject of the discussion is
parameter passing.
The obvious way to have our cake and eat it, too, is to break up
the compiler into separately compilable modules, and of course the
Turbo Unit is an ideal vehicle for doing this. This allows us to hide
some fairly complex code (such as the full arithmetic and boolean
expression parsing) into a single unit, and just pull it in whenever
it’s needed. In that way, the only code I’ll have to reproduce in these
installments will be the code that actually relates to the issue under
discussion.
I’ve also been toying with Turbo 5.5, which of course includes
the Borland object-oriented extensions to Pascal. I haven’t decided
whether to make use of these features, for two reasons. First of all,
many of you who have been following this series may still not have
5.5, and I certainly don’t want to force anyone to have to go out
and buy a new compiler just to complete the series. Secondly, I’m
not convinced that the O-O extensions have all that much value for
this application. We’ve been having some discussions about that in
CompuServe’s CLM forum, and so far we’ve not found any compelling
reason to use O-O constructs. This is another of those areas where I
could use some feedback from you readers. Anyone want to vote for
Turbo 5.5 and O-O?
In any case, after the next few installments in the series, the plan
is to upload to you a complete set of Units, and complete functioning
compilers as well. The plan, in fact, is to have THREE compilers:
One for a single-character version of TINY (to use for our experi-
ments), one for TINY and one for KISS. I’ve pretty much isolated
the differences between TINY and KISS, which are these:
• TINY will support only two data types: The character and the
16-bit integer. I may also try to do something with strings, since
without them a compiler would be pretty useless. KISS will
support all the usual simple types, including arrays and even
floating point.
• TINY will only have two control constructs, the IF and the
WHILE. KISS will support a very rich set of constructs, includ-
ing one we haven’t discussed here before ... the CASE.
• KISS will support separately compilable modules.
One caveat: Since I still don’t know much about 80x86 assembler
language, all these compiler modules will still be written to support
68000 code. However, for the programs I plan to upload, all the code
generation has been carefully encapsulated into a single unit, so that
any enterprising student should be able to easily retarget to any other
processor. This task is “left as an exercise for the student.” I’ll make
an offer right here and now: For the person who provides us the first
robust retarget to 80x86, I will be happy to discuss shared copyrights
and royalties from the book that’s upcoming.
But enough talk. Let’s get on with the study of types. As I
said earlier, we’ll do this one as we did in the last installment: by
performing experiments using single-character tokens.
THE SYMBOL TABLE
To do anything with types, we first need a place to record what type each variable is. For single-character names an array indexed by the name will do nicely, so add the line marked below to the Cradle's declarations:
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char; { Lookahead Character }
ST: Array[’A’..’Z’] of char; { *** ADD THIS LINE ***}
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
for i := ’A’ to ’Z’ do
ST[i] := ’?’;
GetChar;
end;
{--------------------------------------------------------------}
We don’t really need the next procedure, but it will be helpful for
debugging. All it does is to dump the contents of the symbol table:
{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
for i := ’A’ to ’Z’ do
WriteLn(i, ’ ’, ST[i]);
end;
{--------------------------------------------------------------}
It really doesn’t matter much where you put this procedure ... I
plan to cluster all the symbol table routines together, so I put mine
just after the error reporting procedures.
If you’re the cautious type (as I am), you might want to begin
with a test program that does nothing but initializes, then dumps
the table. Just to be sure that we’re all on the same wavelength
here, I’m reproducing the entire program below, complete with the
new procedures. Note that this version includes support for white
space:
{--------------------------------------------------------------}
program Types;
{--------------------------------------------------------------}
{ Constant Declarations }
const TAB = ^I;
CR = ^M;
LF = ^J;
{--------------------------------------------------------------}
{ Variable Declarations }
var Look: char; { Lookahead Character }
ST: Array[’A’..’Z’] of char;
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Report an Error }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
end;
{--------------------------------------------------------------}
{ Report Error and Halt }
procedure Abort(s: string);
begin
Error(s);
Halt;
end;
{--------------------------------------------------------------}
{ Report What Was Expected }
procedure Expected(s: string);
begin
Abort(s + ’ Expected’);
end;
{--------------------------------------------------------------}
{ Dump the Symbol Table }
procedure DumpTable;
var i: char;
begin
for i := ’A’ to ’Z’ do
WriteLn(i, ’ ’, ST[i]);
end;
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Decimal Digit }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an AlphaNumeric Character }
function IsAlNum(c: char): boolean;
begin
IsAlNum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addop }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’, ’-’];
end;
{--------------------------------------------------------------}
{ Recognize a Mulop }
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’, ’/’];
end;
{--------------------------------------------------------------}
{ Recognize a Boolean Orop }
function IsOrop(c: char): boolean;
begin
IsOrop := c in [’|’, ’~’];
end;
{--------------------------------------------------------------}
{ Recognize a Relop }
function IsRelop(c: char): boolean;
begin
IsRelop := c in [’=’, ’#’, ’<’, ’>’];
end;
{--------------------------------------------------------------}
{ Recognize White Space }
function IsWhite(c: char): boolean;
begin
IsWhite := c in [’ ’, TAB];
end;
{--------------------------------------------------------------}
{ Skip Over Leading White Space }
procedure SkipWhite;
begin
while IsWhite(Look) do
GetChar;
end;
{--------------------------------------------------------------}
{ Skip Over an End-of-Line }
procedure Fin;
begin
if Look = CR then begin
GetChar;
if Look = LF then
GetChar;
end;
end;
{--------------------------------------------------------------}
{ Match a Specific Input Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected(’’’’ + x + ’’’’);
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: char;
begin
if not IsAlpha(Look) then Expected(’Name’);
GetName := UpCase(Look);
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNum: char;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNum := Look;
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Output a String with Tab }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Output a String with Tab and CRLF }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
{--------------------------------------------------------------}
{ Initialize }
procedure Init;
var i: char;
begin
for i := ’A’ to ’Z’ do
ST[i] := ’?’;
GetChar;
SkipWhite;
end;
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
DumpTable;
end.
{--------------------------------------------------------------}
OK, run this program. You should get a (very fast) printout of all
the letters of the alphabet (potential identifiers), each followed by a question mark, since that's what Init stored there. Not very exciting, so let's give a few of the entries real values. Add the following lines to the main program, just after the call to Init:
ST[’A’] := ’a’;
ST[’P’] := ’b’;
ST[’X’] := ’c’;
This time, when you run the program, you should get an output
showing that the symbol table is working right.
ADDING ENTRIES
Of course, writing into the table directly like that is only good for a quick test. What we really need are routines to read and write the table, checking for duplicate names as entries are added:
{--------------------------------------------------------------}
{ Report Type of a Variable }
function TypeOf(N: char): char;
begin
TypeOf := ST[N];
end;
{--------------------------------------------------------------}
{ Report if a Variable is in the Table }
function InTable(N: char): boolean;
begin
InTable := TypeOf(N) <> ’?’;
end;
{--------------------------------------------------------------}
{ Check for a Duplicate Variable Name }
procedure CheckDup(N: char);
begin
if InTable(N) then Abort(’Duplicate Name ’ + N);
end;
{--------------------------------------------------------------}
{ Add Entry to Table }
procedure AddEntry(N, T: char);
begin
CheckDup(N);
ST[N] := T;
end;
{--------------------------------------------------------------}
Now replace those three direct assignments with calls to AddEntry:
AddEntry('A', 'a');
AddEntry('P', 'b');
AddEntry('X', 'c');
and run the program again. Did it work? Then we have the
symbol table routines needed to support our work on types. In the
next section, we’ll actually begin to use them.
ALLOCATING STORAGE
In other programs like this one, including the TINY compiler itself,
we have already addressed the issue of declaring global variables,
and the code generated for them. Let’s build a vestigial version of a
“compiler” here, whose only function is to allow us declare variables.
Remember, the syntax for a declaration is just:
<data decl> ::= VAR <ident>
Again, we can lift a lot of the code from previous programs. The
following are stripped-down versions of those procedures. They are
greatly simplified since I have eliminated niceties like variable lists
and initializers. In procedure Alloc, note that the new call to Ad-
dEntry will also take care of checking for duplicate declarations:
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N: char);
begin
AddEntry(N, ’v’);
WriteLn(N, ’:’, TAB, ’DC 0’);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Name: char;
begin
Match(’v’);
Alloc(GetName);
end;
{--------------------------------------------------------------}
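Decl is driven by the usual TopDecls loop, terminated by a period. Its listing presumably looked much like the earlier ones; a sketch:
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
while Look <> '.' do begin
case Look of
'v': Decl;
else Abort('Unrecognized Keyword ' + Look);
end;
Fin;
end;
end;
{--------------------------------------------------------------}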
Now, in the main program, add a call to TopDecls and run the
program. Try allocating a few variables, and note the resulting code
generated. This is old stuff for you, so the results should look familiar.
Note from the code for TopDecls that the program is ended by a
terminating period.
While you’re at it, try declaring two variables with the same name,
and verify that the parser catches the error.
DECLARING TYPES
The next step is to allow declarations of different sizes. The syntax
for a declaration becomes:
<data decl> ::= <typename> <identifier>
where:
<typename> ::= BYTE | WORD | LONG
{--------------------------------------------------------------}
{ Generate Code for Allocation of a Variable }
procedure AllocVar(N, T: char);
begin
WriteLn(N, ’:’, TAB, ’DC.’, T, ’ 0’);
end;
{--------------------------------------------------------------}
{ Allocate Storage for a Variable }
procedure Alloc(N, T: char);
begin
AddEntry(N, T);
AllocVar(N, T);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Data Declaration }
procedure Decl;
var Typ: char;
begin
Typ := GetName;
Alloc(GetName, Typ);
end;
{--------------------------------------------------------------}
{ Parse and Translate Global Declarations }
procedure TopDecls;
begin
while Look <> ’.’ do begin
case Look of
’b’, ’w’, ’l’: Decl;
else Abort(’Unrecognized Keyword ’ + Look);
end;
Fin;
end;
end;
{--------------------------------------------------------------}
Make the changes shown to these procedures, and give the thing a
try. Use the single characters ’b’, ’w’, and ’l’ for the keywords (they
must be lower case, for now). You will see that in each case, we are
allocating the proper storage size. Note from the dumped symbol
table that the sizes are also recorded for later use. What later use?
Well, that’s the subject of the rest of this installment.
ASSIGNMENTS
{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
Move(Typ, Name + ’(PC)’, ’D0’);
end;
{---------------------------------------------------------------}
{---------------------------------------------------------------}
{ Generate a Move Instruction }
procedure Move(Size: char; Source, Dest: String);
begin
EmitLn(’MOVE.’ + Size + ’ ’ + Source + ’,’ + Dest);
end;
{---------------------------------------------------------------}
Note that these two routines are strictly code generators; they
have no error-checking or other logic. To complete the picture, we
need one more layer of software that provides these functions.
First of all, we need to make sure that the type we are dealing with
is a loadable type. This sounds like a job for another recognizer:
{--------------------------------------------------------------}
{ Recognize a Legal Variable Type }
function IsVarType(c: char): boolean;
begin
IsVarType := c in [’B’, ’W’, ’L’];
end;
{--------------------------------------------------------------}
Next, it would be nice to have a routine that will fetch the type
of a variable from the symbol table, while checking it to make sure
it’s valid:
{--------------------------------------------------------------}
{ Get a Variable Type from the Symbol Table }
function VarType(Name: char): char;
var Typ: char;
begin
Typ := TypeOf(Name);
if not IsVarType(Typ) then Abort(’Identifier ’ + Name +
’ is not a variable’);
VarType := Typ;
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
procedure Load(Name: char);
begin
LoadVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}
Load(’A’);
Load(’B’);
Load(’C’);
Load(’X’);
{---------------------------------------------------------------}
{ Store Primary to Variable }
procedure StoreVar(Name, Typ: char);
begin
EmitLn(’LEA ’ + Name + ’(PC),A0’);
Move(Typ, ’D0’, ’(A0)’);
end;
{--------------------------------------------------------------}
{ Store a Variable from the Primary Register }
procedure Store(Name: char);
begin
StoreVar(Name, VarType(Name));
end;
{--------------------------------------------------------------}
You can test this one the same way as the loads.
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
var Name: char;
begin
Load(GetName);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
Expression;
Store(Name);
end;
{--------------------------------------------------------------}
{ Parse and Translate a Block of Statements }
procedure Block;
begin
while Look <> ’.’ do begin
Assignment;
Fin;
end;
end;
{--------------------------------------------------------------}
(It’s worth noting that the new procedures that permit us to
manipulate types are, if anything, even simpler and cleaner than
what we’ve seen before. This is mostly thanks to our efforts to
encapsulate the code generator procedures.)
There is one small, nagging problem. Before, we used the Pascal
terminating period to get us out of procedure TopDecls. This is now
the wrong character ... it’s used to terminate Block. In previous
programs, we’ve used the BEGIN symbol (abbreviated ’b’) to get us
out. But that is now used as a type symbol.
The solution, while somewhat of a kludge, is easy enough. We’ll
use an UPPER CASE ’B’ to stand for the BEGIN. So change the
character in the WHILE loop within TopDecls, from ’.’ to ’B’, and
everything will be fine.
Now, we can complete the task by changing the main program to
read:
{--------------------------------------------------------------}
{ Main Program }
begin
Init;
TopDecls;
Match(’B’);
Fin;
Block;
DumpTable;
end.
{--------------------------------------------------------------}
(Note that I’ve had to sprinkle a few calls to Fin around to get us
out of Newline troubles.)
OK, run this program. Try an input that declares one variable of each
size and then, after the ’B’, assigns them to one another in various
combinations, ending with a period.
For each declaration, you should get code generated that allocates
storage. For each assignment, you should get code that loads a vari-
able of the correct size, and stores one, also of the correct size.
There’s only one small problem: The generated code is WRONG!
Look at the code generated for an assignment like a=c, where a has
been declared as a byte and c as a long:
MOVE.L C(PC),D0
LEA A(PC),A0
MOVE.B D0,(A0)
That one is correct: the full longword is loaded, and only its low byte
is stored into a. But now look at the code for the reverse assignment,
c=a:
MOVE.B A(PC),D0
LEA C(PC),A0
MOVE.L D0,(A0)
Here only a single byte is loaded into D0, yet a full longword, whose
upper three bytes hold whatever garbage was left in the register, gets
stored into c.
The simplest fix is to clear the whole register before loading anything
smaller than a longword:
{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
if Typ <> ’L’ then
EmitLn(’CLR.L D0’);
Move(Typ, Name + ’(PC)’, ’D0’);
end;
{---------------------------------------------------------------}
Now the code generated for a word-to-byte assignment such as a=b
reads:
CLR.L D0
MOVE.W B(PC),D0
LEA A(PC),A0
MOVE.B D0,(A0)
In this case, the CLR turns out not to be necessary, since the
result is going into a byte-sized variable. With a little bit of work,
we can do better. Still, this is not bad, and it is typical of the kinds of
inefficiencies that we’ve seen before in simple-minded compilers.
I should point out that, by setting the high bits to zero, we are in
effect treating the numbers as UNSIGNED integers. If we want to
treat them as signed ones instead (the more likely case) we should do
a sign extension after the load, instead of a clear before it. Just to
tie this part of the discussion up with a nice, red ribbon, let’s change
LoadVar as shown below:
{---------------------------------------------------------------}
{ Load a Variable to Primary Register }
procedure LoadVar(Name, Typ: char);
begin
if Typ = ’B’ then
EmitLn(’CLR.L D0’);
Move(Typ, Name + ’(PC)’, ’D0’);
if Typ = ’W’ then
EmitLn(’EXT.L D0’);
end;
{---------------------------------------------------------------}
Next, let’s add a new procedure that will convert from one type
to another:
{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }
procedure Convert(Source, Dest: char);
begin
if Source <> Dest then begin
if Source = ’B’ then
EmitLn(’AND.W #$FF,D0’);
if Dest = ’L’ then
EmitLn(’EXT.L D0’);
end;
end;
{--------------------------------------------------------------}
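The function forms referred to below aren’t reproduced here; as a
sketch, assuming Expression has likewise become a function that
returns the type of its result, they might read:
{--------------------------------------------------------------}
{ Load a Variable to the Primary Register }
function Load(Name: char): char;
var Typ: char;
begin
Typ := VarType(Name);
LoadVar(Name, Typ);
Load := Typ;
end;
{--------------------------------------------------------------}
{ Store the Primary Register into a Variable }
procedure Store(Name: char; T1: char);
var T2: char;
begin
T2 := VarType(Name);
Convert(T1, T2);
StoreVar(Name, T2);
end;
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: char;
begin
Name := GetName;
Match(’=’);
Store(Name, Expression);
end;
{--------------------------------------------------------------}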
Note that Load is a function, which not only emits the code for a
load, but also returns the variable type. In this way, we always know
what type of data we are dealing with.
Again, note how incredibly simple these two routines are. We’ve
encapsulated all the type logic into Load and Store, and the trick of
passing the type around makes the rest of the work extremely easy.
Of course, all of this is for our special, trivial case of Expression.
Naturally, for the general case it will have to get more complex. But
you’re looking now at the FINAL version of procedure Assignment!
All this seems like a very simple and clean solution, and it is
indeed. Compile this program and run the same test cases as before.
You will see that all types of data are converted properly, and there
are few if any wasted instructions. Only the byte-to-long conversion
uses two instructions where one would do, and we could easily modify
Convert to handle this case, too.
Although we haven’t considered unsigned variables in this case,
I think you can see that we could easily fix up procedure Convert
to deal with these types as well. This is “left as an exercise for the
student.”
LITERAL ARGUMENTS
Now, when dealing with literal data, we have one small prob-
lem. With variables, we know what type things should be because
they’ve been declared to be that type. We have no such type infor-
mation for literals. When the programmer says, “-1,” does that mean
a byte, word, or longword version? We have no clue. The obvious
thing to do would be to use the largest type possible, i.e. a longword.
But that’s a bad idea, because when we get to more complex expres-
sions, we’ll find that it will cause every expression involving literals
to be promoted to long, as well.
A better approach is to select a type based upon the value of the
literal, as shown next:
{--------------------------------------------------------------}
{ Load a Constant to the Primary Register }
function LoadNum(N: LongInt): char;
var Typ : char;
begin
if abs(N) <= 127 then
Typ := ’B’
else if abs(N) <= 32767 then
Typ := ’W’
else Typ := ’L’;
LoadConst(N, Typ);
LoadNum := Typ;
end;
{---------------------------------------------------------------}
(I know, I know, the ranges aren’t really symmetric. You can
store -128 in a single byte, and -32768 in a word. But that’s easily
fixed, and not worth the time or the added complexity to fool with
it here. It’s the thought that counts.)
Note that LoadNum calls a new version of the code generator
routine LoadConst, which has an added argument to define the type:
{---------------------------------------------------------------}
{ Load a Constant to the Primary Register }
procedure LoadConst(N: LongInt; Typ: char);
var temp:string;
begin
Str(N, temp);
Move(Typ, ’#’ + temp, ’D0’);
end;
{--------------------------------------------------------------}
With LoadNum in hand, Expression can accept either kind of factor:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: char;
begin
if IsAlpha(Look) then
Expression := Load(GetName)
else
Expression := LoadNum(GetNum);
end;
{--------------------------------------------------------------}
(Wow, that sure didn’t hurt too bad! Just a few extra lines do the
job.)
OK, compile this code into your program and give it a try. You’ll
see that it now works for either variables or constants as valid ex-
pressions.
ADDITIVE EXPRESSIONS
If you’ve been following this series from the beginning, I’m sure you
know what’s coming next: We’ll expand the form for an expression
to handle first additive expressions, then multiplicative, then general
expressions with parentheses.
The nice part is that we already have a pattern for dealing with
these more complex expressions. All we have to do is to make sure
that all the procedures called by Expression (Term, Factor, etc.)
always return a type identifier. If we do that, the program structure
gets changed hardly at all.
The first step is easy: We can rename our existing function Ex-
pression to Term, as we’ve done so many times before, and create the
new version of Expression:
{---------------------------------------------------------------}
{ Parse and Translate an Expression }
function Expression: char;
var Typ: char;
begin
if IsAddop(Look) then
Typ := Unop
else
Typ := Term;
while IsAddop(Look) do begin
Push(Typ);
case Look of
’+’: Typ := Add(Typ);
’-’: Typ := Subtract(Typ);
end;
end;
Expression := Typ;
end;
{--------------------------------------------------------------}
Note in this routine how each procedure call has become a function
call, and how the local variable Typ gets updated at each pass.
Note also the new call to a function Unop, which lets us deal with
a leading unary minus. This change is not necessary ... we could
still use a form more like what we’ve done before. I’ve chosen to
introduce UnOp as a separate routine because it will make it easier,
later, to produce somewhat better code than we’ve been doing. In
other words, I’m looking ahead to optimization issues.
For this version, though, we’ll retain the same dumb old code,
which makes the new routine trivial:
{---------------------------------------------------------------}
{ Process a Term with Leading Unary Operator }
function Unop: char;
begin
Clear;
Unop := ’W’;
end;
{---------------------------------------------------------------}
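Unop calls a code generator routine Clear that isn’t shown in this
listing; presumably it simply clears the primary register:
{---------------------------------------------------------------}
{ Clear the Primary Register }
procedure Clear;
begin
EmitLn(’CLR D0’);
end;
{---------------------------------------------------------------}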
{---------------------------------------------------------------}
{ Push Primary onto Stack }
procedure Push(Size: char);
begin
Move(Size, ’D0’, ’-(SP)’);
end;
{---------------------------------------------------------------}
Now, let’s take a look at functions Add and Subtract. In the older
versions of these routines, we let them call code generator routines
PopAdd and PopSub. We’ll continue to do that, which makes the
functions themselves extremely simple:
{---------------------------------------------------------------}
{ Recognize and Translate an Add }
function Add(T1: char): char;
begin
Match(’+’);
Add := PopAdd(T1, Term);
end;
{-------------------------------------------------------------}
{ Recognize and Translate a Subtract }
function Subtract(T1: char): char;
begin
Match(’-’);
Subtract := PopSub(T1, Term);
end;
{---------------------------------------------------------------}
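PopAdd and PopSub will need to pop the stacked operand into a
second register before they can work on it. The Pop routine isn’t
shown in this listing; a sketch, popping into D7 to match the Gen
routines that follow, would be:
{---------------------------------------------------------------}
{ Pop Stack into Secondary Register }
procedure Pop(Size: char);
begin
Move(Size, ’(SP)+’, ’D7’);
end;
{---------------------------------------------------------------}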
The general idea is that all the “Pop-Op” routines can call this
one. When this is done, we will then have both operands in registers,
so we can promote whichever one we need to. To deal with this,
procedure Convert needs another argument, the register name:
{---------------------------------------------------------------}
{ Convert a Data Item from One Type to Another }
procedure Convert(Source, Dest: char; Reg: String);
begin
if Source <> Dest then begin
if Source = ’B’ then
EmitLn(’AND.W #$FF,’ + Reg);
if Dest = ’L’ then
EmitLn(’EXT.L ’ + Reg);
end;
end;
{---------------------------------------------------------------}
The next function does a conversion, but only if the current type
T1 is smaller in size than the desired type T2. It is a function,
returning the final type to let us know what it decided to do:
{---------------------------------------------------------------}
{ Promote the Size of a Register Value }
function Promote(T1, T2: char; Reg: string): char;
var Typ: char;
begin
Typ := T1;
if T1 <> T2 then
if (T1 = ’B’) or ((T1 = ’W’) and (T2 = ’L’)) then begin
Convert(T1, T2, Reg);
Typ := T2;
end;
Promote := Typ;
end;
{---------------------------------------------------------------}
After all the buildup, the final results are almost anticlimactic.
Once again, you can see that the logic is quite simple. All the two
routines do is to pop the top-of-stack into D7, force the two operands
to be the same size, and then generate the code.
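The two routines in question, PopAdd and PopSub, and the little
helper SameType that levels the operand types, aren’t reproduced
here; a sketch consistent with the description just given might read:
{---------------------------------------------------------------}
{ Force both Operands to the Same Type }
function SameType(T1, T2: char): char;
begin
T1 := Promote(T1, T2, ’D7’);
SameType := Promote(T2, T1, ’D0’);
end;
{---------------------------------------------------------------}
{ Generate Code to Add Stack to Primary }
function PopAdd(T1, T2: char): char;
begin
Pop(T1);
T2 := SameType(T1, T2);
GenAdd(T2);
PopAdd := T2;
end;
{---------------------------------------------------------------}
{ Generate Code to Subtract Primary from Stack }
function PopSub(T1, T2: char): char;
begin
Pop(T1);
T2 := SameType(T1, T2);
GenSub(T2);
PopSub := T2;
end;
{---------------------------------------------------------------}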
Note the new code generator routines GenAdd and GenSub. These
are vestigial forms of the ORIGINAL PopAdd and PopSub. That is,
they are pure code generators, producing a register-to-register add
or subtract:
{---------------------------------------------------------------}
{ Add Top of Stack to Primary }
procedure GenAdd(Size: char);
begin
EmitLn(’ADD.’ + Size + ’ D7,D0’);
end;
{---------------------------------------------------------------}
{ Subtract Primary from Top of Stack }
procedure GenSub(Size: char);
begin
EmitLn(’SUB.’ + Size + ’ D7,D0’);
EmitLn(’NEG.’ + Size + ’ D0’);
end;
{---------------------------------------------------------------}
WHY SO MANY PROCEDURES?
At this point, you may think I’ve pretty much gone off the deep end
in terms of deeply nested procedures. There is admittedly a lot of
overhead here. But there’s a method in my madness. As in the case
of UnOp, I’m looking ahead to the time when we’re going to want
better code generation. The way the code is organized, we can achieve
this without major modifications to the program. For example, in
cases where the value pushed onto the stack does NOT have to be
converted, it’s still better to use the “pop and add” instruction. If
we choose to test for such cases, we can embed the extra tests into
PopAdd and PopSub without changing anything else much.
MULTIPLICATIVE EXPRESSIONS
The multiplicative operators are handled through new functions
Multiply and Divide, which in turn call code generators PopMul and
PopDiv. We haven’t written those code generators yet, so if you’d
like to test the program before we get into that, you can build
dummy versions of them, similar to PopAdd and PopSub. Again,
the code won’t be correct at this point, but the parser should handle
expressions of arbitrary complexity.
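A sketch of the multiplicative-level parsing routines, patterned after
Expression, Add, and Subtract above (and assuming the old single-item
routine has been renamed Factor), might look like this:
{---------------------------------------------------------------}
{ Recognize and Translate a Multiply }
function Multiply(T1: char): char;
begin
Match(’*’);
Multiply := PopMul(T1, Factor);
end;
{---------------------------------------------------------------}
{ Recognize and Translate a Divide }
function Divide(T1: char): char;
begin
Match(’/’);
Divide := PopDiv(T1, Factor);
end;
{---------------------------------------------------------------}
{ Parse and Translate a Term }
function Term: char;
var Typ: char;
begin
Typ := Factor;
while IsMulop(Look) do begin
Push(Typ);
case Look of
’*’: Typ := Multiply(Typ);
’/’: Typ := Divide(Typ);
end;
end;
Term := Typ;
end;
{---------------------------------------------------------------}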
MULTIPLICATION
Once you’ve convinced yourself that the parser itself is working prop-
erly, we need to figure out what it will take to generate the right code.
This is where things begin to get a little sticky, because the rules are
more complex.
Let’s take the case of multiplication first. This operation is similar
to the “addops” in that both operands should be of the same size. It
differs in three important respects:
• The type of the product is typically not the same as that of the
two operands. For the product of two words, we get a longword
result.
• The 68000 does not support a 32 x 32 multiply, so a call to a
software routine is needed. This routine will become part of the
run-time library.
• It also does not support an 8 x 8 multiply, so all byte operands
must be promoted to words.
The actions that we have to take are best shown in the following
table:
 T1 --> |        B        |        W        |        L        |
 T2 v   |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
        | Convert D7 to W |                 |                 |
        | MULS            | MULS            | JSR MUL32       |
        | Result = W      | Result = L      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   W    | Convert D7 to W |                 | Convert D0 to L |
        | MULS            | MULS            | JSR MUL32       |
        | Result = L      | Result = L      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   L    | Convert D7 to L | Convert D7 to L |                 |
        | JSR MUL32       | JSR MUL32       | JSR MUL32       |
        | Result = L      | Result = L      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Word) }
procedure GenMult;
begin
EmitLn(’MULS D7,D0’)
end;
{---------------------------------------------------------------}
{ Multiply Top of Stack by Primary (Long) }
procedure GenLongMult;
begin
EmitLn(’JSR MUL32’);
end;
{---------------------------------------------------------------}
Here is the code for PopMul, which embodies the table above:
{---------------------------------------------------------------}
{ Generate Code to Multiply Primary by Stack }
function PopMul(T1, T2: char): char;
var T: char;
begin
Pop(T1);
T := SameType(T1, T2);
Convert(T, ’W’, ’D7’);
Convert(T, ’W’, ’D0’);
if T = ’L’ then
GenLongMult
else
GenMult;
if T = ’B’ then
PopMul := ’W’
else
PopMul:= ’L’;
end;
{---------------------------------------------------------------}
As you can see, the routine starts off just like PopAdd. The two
arguments are forced to the same type. The two calls to Convert take
care of the case where both operands are bytes. The data themselves
are promoted to words, but the routine remembers the type so as to
assign the correct type to the result. Finally, we call one of the two
code generator routines, and then assign the result type. Not too
complicated, really.
At this point, I suggest that you go ahead and test the program.
Try all combinations of operand sizes.
DIVISION
The case of division is not nearly so symmetric. I also have some bad
news for you:
All modern 16-bit CPU’s support integer divide. The manufac-
turer’s data sheet will describe this operation as a 32 x 16-bit divide,
meaning that you can divide a 32-bit dividend by a 16-bit divisor.
Here’s the bad news: that 32 x 16 divide is not as general as it sounds.
If you don’t believe it, try dividing any large 32-bit number (mean-
ing that it has non-zero bits in the upper 16 bits) by the integer 1.
You are guaranteed to get an overflow exception.
The problem is that the instruction really requires that the result-
ing quotient fit into a 16-bit result. This won’t happen UNLESS the
divisor is sufficiently large. When any number is divided by unity,
the quotient will of course be the same as the dividend, which had
better fit into a 16-bit word.
Since the beginning of time (well, computers, anyway), CPU
architects have provided this little gotcha in the division circuitry.
It means that the hardware divide needs a little help from us. The
rules we have to follow are these:
• The type of the quotient must always be the same as that of the
dividend. It is independent of the divisor.
• In spite of the fact that the CPU supports a longword dividend,
the hardware-provided instruction can only be trusted for byte
and word dividends. For longword dividends, we need another
library routine that can return a long result.
This looks like a job for another table, to summarize the required
actions:
 T1 --> |        B        |        W        |        L        |
 T2 v   |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   B    | Convert D0 to W | Convert D0 to W | Convert D0 to L |
        | Convert D7 to L | Convert D7 to L |                 |
        | DIVS            | DIVS            | JSR DIV32       |
        | Result = B      | Result = W      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   W    | Convert D7 to L | Convert D7 to L | Convert D0 to L |
        | DIVS            | DIVS            | JSR DIV32       |
        | Result = B      | Result = W      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
        |                 |                 |                 |
   L    | Convert D7 to L | Convert D7 to L |                 |
        | JSR DIV32       | JSR DIV32       | JSR DIV32       |
        | Result = B      | Result = W      | Result = L      |
        |                 |                 |                 |
 -----------------------------------------------------------------
(You may wonder why we bother with a full 32-bit divide whenever
either operand is a longword, even when the dividend is only a byte.
The reason is that, if the divisor is a longword and there are any high
bits set in it, the result of the division must be zero. We might not
get that if we only use the lower word of the divisor.)
The following code provides the correct function for PopDiv:
{---------------------------------------------------------------}
{ Generate Code to Divide Stack by the Primary }
function PopDiv(T1, T2: char): char;
begin
Pop(T1);
Convert(T1, ’L’, ’D7’);
if (T1 = ’L’) or (T2 = ’L’) then begin
Convert(T2, ’L’, ’D0’);
GenLongDiv;
PopDiv := ’L’;
end
else begin
Convert(T2, ’W’, ’D0’);
GenDiv;
PopDiv := T1;
end;
end;
{---------------------------------------------------------------}
The two code generator routines it calls are:
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Word) }
procedure GenDiv;
begin
EmitLn(’DIVS D0,D7’);
Move(’W’, ’D7’, ’D0’);
end;
{---------------------------------------------------------------}
{ Divide Top of Stack by Primary (Long) }
procedure GenLongDiv;
begin
EmitLn(’JSR DIV32’);
end;
{---------------------------------------------------------------}
The main concept that made things easy was that of converting
procedures such as Expression into functions that return the type
of the result. Once this was done, we were able to retain the same
general structure of the compiler.
I won’t pretend that we’ve covered every single aspect of the issue.
I conveniently ignored unsigned arithmetic. From what we’ve done,
I think you can see that to include them adds no new challenges, just
extra possibilities to test for.
I’ve also ignored the logical operators And, Or, etc. It turns out
that these are pretty easy to handle. All the logical operators are
bitwise operations, so they are symmetric and therefore work in the
same fashion as PopAdd. There is one difference, however: if it is
necessary to extend the word length for a logical variable, the ex-
tension should be done as an UNSIGNED number. Floating point
numbers, again, are straightforward to handle ... just a few more pro-
cedures to be added to the run-time library, or perhaps instructions
for a math chip.
What we’ve done here is to collapse what could have been a large
jump table, indexed by the two operand types, into far fewer
procedures, simply by making use of symmetry and other simplifying
rules.
In case you haven’t gotten this message yet, it sure appears that
TINY and KISS will probably NOT be strongly typed languages,
since I’ve allowed for automatic mixing and conversion of just about
any type. Which brings up the next issue:
TO COERCE OR NOT TO COERCE
The answer depends on what kind of language you want, and the
way you’d like it to behave. What we have not addressed is the issue
of when to allow and when to deny the use of operations involving
different data types. In other words, what should be the SEMAN-
TICS of our compiler? Do we want automatic type conversion for all
cases, for some cases, or not at all?
Let’s pause here to think about this a bit more. To do so, it will
help to look at a bit of history.
FORTRAN II supported only two simple data types: Integer and
Real. It allowed implicit type conversion between real and integer
types during assignment, but not within expressions. All data items
(including literal constants) on the right-hand side of an assignment
statement had to be of the same type. That made things pretty easy
... much simpler than what we’ve had to do here.
This was changed in FORTRAN IV to support “mixed-mode”
arithmetic. If an expression had any real data items in it, they were
all converted to reals and the expression itself was real. To round out
the picture, functions were provided to explicitly convert from one
type to the other, so that you could force an expression to end up as
either type.
This led to two things: code that was easier to write, and code
that was less efficient. That’s because sloppy programmers would
write expressions with simple constants like 0 and 1 in them, which
the compiler would dutifully compile to convert at execution time.
Still, the system worked pretty well, which would tend to indicate
that implicit type conversion is a Good Thing.
In the spirit of strong typing, Pascal will not allow you to mix
Char and Integer variables, without applying the explicit coercion
functions Chr and Ord.
Turbo Pascal also includes the types Byte, Word, and LongInt.
The first two are basically the same as unsigned integers. In Turbo,
these can be freely intermixed with variables of type Integer, and
Turbo will automatically handle the conversion. There are run-time
checks, though, to keep you from overflowing or otherwise getting the
wrong answer. Note that you still can’t mix Byte and Char types,
even though they are stored internally in the same representation.
The ultimate in a strongly-typed language is Ada, which allows
NO implicit type conversions at all, and also will not allow mixed-
mode arithmetic. Jean Ichbiah’s position is that conversions cost
execution time, and you shouldn’t be allowed to build in such cost in
a hidden manner. By forcing the programmer to explicitly request
a type conversion, you make it more apparent that there could be a
cost involved.
I have been using another strongly-typed language, a delightful
little language called Whimsical, by John Spray. Although Whimsi-
cal is intended as a systems programming language, it also requires
explicit conversion EVERY time. There are NEVER any automatic
conversions, even the ones supported by Pascal.
This approach does have certain advantages: The compiler never
has to guess what to do: the programmer always tells it precisely
what he wants. As a result, there tends to be a more nearly one-
to-one correspondence between source code and compiled code, and
John’s compiler produces VERY tight code.
On the other hand, I sometimes find the explicit conversions to be
a pain. If I want, for example, to add one to a character, or AND
it with a mask, there are a lot of conversions to make. If I get it
wrong, the only error message is “Types are not compatible.” As
it happens, John’s particular implementation of the language in his
compiler doesn’t tell you exactly WHICH types are not compatible
... it only tells you which LINE the error is in.
I must admit that most of my errors with this compiler tend to be
errors of this type, and I’ve spent a lot of time with the Whimsical
compiler, trying to figure out just WHERE in the line I’ve offended
it. The only real way to fix the error is to keep trying things until
something works.
So what should we do in TINY and KISS? For the first one, I have
the answer: TINY will support only the types Char and Integer, and
we’ll use the C trick of promoting Chars to Integers internally. That
means that the TINY compiler will be MUCH simpler than what
we’ve already done. Type conversion in expressions is sort of moot,
since none will be required! Since longwords will not be supported,
we also won’t need the MUL32 and DIV32 run-time routines, nor the
logic to figure out when to call them. I LIKE it!
KISS, on the other hand, will support the type Long.
Should it support both signed and unsigned arithmetic? For the
sake of simplicity I’d rather not. It does add quite a bit to the
complexity of type conversions. Even Niklaus Wirth has eliminated
unsigned (Cardinal) numbers from his new language Oberon, with the
argument that 32-bit integers should be long enough for anybody, in
either case.
But KISS is supposed to be a systems programming language,
which means that we should be able to do whatever operations that
can be done in assembler. Since the 68000 supports both flavors of
integers, I guess KISS should, also. We’ve seen that logical operations
need to be able to extend integers in an unsigned fashion, so the
unsigned conversion procedures are required in any case.
CONCLUSION
Part XV
5 March 1994.
BACK TO THE FUTURE
INTRODUCTION
Can it really have been four years since I wrote installment fourteen
of this series? Is it really possible that six long years have passed
since I began it? Funny how time flies when you’re having fun, isn’t
it?
I won’t spend a lot of time making excuses; only point out that
things happen, and priorities change. In the four years since install-
ment fourteen, I’ve managed to get laid off, get divorced, have a
nervous breakdown, begin a new career as a writer, begin another
one as a consultant, move, work on two real-time systems, and raise
fourteen baby birds, three pigeons, six possums, and a duck. For
awhile there, the parsing of source code was not high on my list of
priorities. Neither was writing stuff for free, instead of writing stuff
for pay. But I do try to be faithful, and I do recognize and feel my
responsibility to you, the reader, to finish what I’ve started. As the
tortoise said in one of my son’s old stories, I may be slow, but I’m
sure. I’m sure that there are people out there anxious to see the last
reel of this film, and I intend to give it to them. So, if you’re one
of those who’s been waiting, more or less patiently, to see how this
thing comes out, thanks for your patience. I apologize for the delay.
Let’s move on.
NEW STARTS, OLD DIRECTIONS
The language for this series remains Pascal, and one big reason is
readability: a Pascal program is easy to follow even if you don’t
know the language. What you see is almost always what
you get, and we can concentrate on concepts rather than implemen-
tation details. I’ve said from the beginning that the purpose of this
tutorial series was not to generate the world’s fastest compiler, but
to teach the fundamentals of compiler technology, while spending the
least amount of time wrestling with language syntax or other aspects
of software implementation. Finally, since a lot of what we do in this
course amounts to software experimentation, it’s important to have a
compiler and associated environment that compiles quickly and with
no fuss. In my opinion, by far the most significant time measure
in software development is the speed of the edit/compile/test cycle.
In this department, Turbo Pascal is king. The compilation speed
is blazing fast, and continues to get faster in every release (how do
they keep doing that?). Despite vast improvements in C compilation
speed over the years, even Borland’s fastest C/C++ compiler is still
no match for Turbo Pascal. Further, the editor built into their IDE,
the make facility, and even their superb smart linker, all complement
each other to produce a wonderful environment for quick turnaround.
For all of these reasons, I intend to stick with Pascal for the duration
of this series. We’ll be using Turbo Pascal for Windows, one of the
compilers provided with Borland Pascal with Objects, version 7.0. If you
don’t have this compiler, don’t worry ... nothing we do here is go-
ing to count on your having the latest version. Using the Windows
version helps me a lot, by allowing me to use the Clipboard to copy
code from the compiler’s editor into these documents. It should also
help you at least as much, copying the code in the other direction.
I’ve thought long and hard about whether or not to introduce ob-
jects to our discussion. I’m a big advocate of object-oriented methods
for all uses, and such methods definitely have their place in compiler
technology. In fact, I’ve written papers on just this subject (Refs.
1-3). But the architecture of a compiler which is based on object-
oriented approaches is vastly different than that of the more classical
compiler we’ve been building. Again, it would seem to be entirely
too much to change these horses in mid-stream. As I said, program-
ming styles change. Who knows, it may be another six years before
we finish this thing, and if we keep changing the code every time
programming style changes, we may NEVER finish.
So for now, at least, I’ve determined to continue the classical style
in Pascal, though we might indeed discuss objects and object orienta-
tion as we go. Likewise, the target machine will remain the Motorola
68000 family. Of all the decisions to be made here, this one has been
the easiest. Though I know that many of you would like to see code
for the 80x86, the 68000 has become, if anything, even more popu-
lar as a platform for embedded systems, and it’s with that application
in mind that this whole effort began in the first place. Were we
compiling for the PC/MSDOS platform, we’d have to deal with all the issues of DOS sys-
tem calls, DOS linker formats, the PC file system and hardware, and
all those other complications of a DOS environment. An embedded
system, on the other hand, must run standalone, and it’s for this
kind of application, as an alternative to assembly language, that I’ve
always imagined that a language like KISS would thrive. Anyway,
who wants to deal with the 80x86 architecture if they don’t have to?
The one feature of Turbo Pascal that I’m going to be making
heavy use of is units. In the past, we’ve had to make compromises
between code size and complexity, and program functionality. A lot
of our work has been in the nature of computer experimentation,
looking at only one aspect of compiler technology at a time. We did
this to avoid having to carry around large programs, just
to investigate simple concepts. In the process, we’ve re-invented the
wheel and re-programmed the same functions more times than I’d like
to count. Turbo units provide a wonderful way to get functionality
and simplicity at the same time: You write reusable code, and invoke
it with a single line. Your test program stays small, but it can do
powerful things.
One feature of Turbo Pascal units is their initialization block. As
with an Ada package, any code in the main begin-end block of a unit
gets executed as the program is initialized. As you’ll see later, this
sometimes gives us neat simplifications in the code. Our procedure
Init, which has been with us since Installment 1, goes away entirely
when we use units. The various routines in the Cradle, another key
feature of our approach, will get distributed among the units.
The concept of units, of course, is no different than that of C
modules. However, in C (and C++), the interface between modules
comes via preprocessor include statements and header files. As some-
one who’s had to read a lot of other people’s C programs, I’ve always
found this rather bewildering. It always seems that whatever data
structure you’d like to know about is in some other file. Turbo units
are simpler for the very reason that they’re criticized by some: The
function interfaces and their implementation are included in the same
file. While this organization may create problems with code security,
it also reduces the number of files by half, which isn’t half bad. Link-
ing of the object files is also easy, because the Turbo compiler takes
care of it without the need for make files or other mechanisms.
STARTING OVER?
Four years ago, in Installment 14, I promised you that our days of
re-inventing the wheel, and recoding the same software over and over
for each lesson, were over, and that from now on we’d stick to more
complete programs that we would simply add new features to. I
still intend to keep that promise; that’s one of the main purposes for
using units. However, because of the long time since Installment 14,
it’s natural to want to at least do some review, and anyhow, we’re
going to have to make rather sweeping changes in the code to make
the transition to units. Besides, frankly, after all this time I can’t
remember all the neat ideas I had in my head four years ago. The
best way for me to recall them is to retrace some of the steps we
took to arrive at Installment 14. So I hope you’ll be understanding
and bear with me as we go back to our roots, in a sense, and rebuild
the core of the software, distributing the routines among the various
units, and bootstrapping ourselves back up to the point we were at lo,
those many moons ago. As has always been the case, you’re going to
get to see me make all the mistakes and execute changes of direction,
in real time. Please bear with me ... we’ll start getting to the new
stuff before you know it.
Since we’re going to be using multiple modules in our new ap-
proach, we have to address the issue of file management. If you’ve
followed all the other sections of this tutorial, you know that, as
our programs evolve, we’re going to be replacing older, more simple-
minded units with more capable ones. This brings us to an issue
of version control. There will almost certainly be times when we
will overlay a simple file (unit), but later wish we had the simple
one again. A case in point is embodied in our predilection for using
single-character variable names, keywords, etc., to test concepts with-
out getting bogged down in the details of a lexical scanner. Thanks
to the use of units, we will be doing much less of this in the future.
Still, I not only suspect, but am certain that we will need to save
some older versions of files, for special purposes, even though they’ve
been replaced by newer, more capable ones.
To deal with this problem, I suggest that you create different di-
rectories, with different versions of the units as needed. If we do this
311
The Input unit BACK TO THE FUTURE
A key concept that we’ve used since Day 1 has been the idea of an
input stream with one lookahead character. All the parsing routines
examine this character, without changing it, to decide what they
should do next. (Compare this approach with the C/Unix approach
using getchar and unget, and I think you’ll agree that our approach
is simpler). We’ll begin our hike into the future by translating this
concept into our new, unit-based organization. The first unit, appro-
priately called Input, is shown below:
{--------------------------------------------------------------}
unit Input;
{--------------------------------------------------------------}
interface
var Look: char; { Lookahead character }
procedure GetChar; { Read new character }
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Read New Character From Input Stream }
procedure GetChar;
begin
Read(Look);
end;
{--------------------------------------------------------------}
{ Unit Initialization }
begin
GetChar;
end.
{--------------------------------------------------------------}
As you can see, there’s nothing very profound, and certainly noth-
ing complicated, about this unit, since it consists of only a single
procedure. But already, we can see how the use of units gives us ad-
vantages. Note the executable code in the initialization block. This
code “primes the pump” of the input stream for us, something we’ve
always had to do before, by inserting the call to GetChar in line, or in
procedure Init. This time, the call happens without any special ref-
erence to it on our part, except within the unit itself. As I predicted
earlier, this mechanism is going to make our lives much simpler as we
proceed. I consider it to be one of the most useful features of Turbo
Pascal, and I lean on it heavily.
Copy this unit into your compiler’s IDE, and compile it. To test
the software, of course, we always need a main program. I used the
following, really complex test program, which we’ll later evolve into
the Main for our compiler:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input;
begin
WriteLn(Look);
end.
{--------------------------------------------------------------}
THE OUTPUT UNIT
Next, the output routines Emit and EmitLn get a unit of their own:
{--------------------------------------------------------------}
unit Output;
{--------------------------------------------------------------}
interface
procedure Emit(s: string); { Emit an instruction }
procedure EmitLn(s: string); { Emit an instruction line }
{--------------------------------------------------------------}
implementation
const TAB = ^I;
{--------------------------------------------------------------}
{ Emit an Instruction }
procedure Emit(s: string);
begin
Write(TAB, s);
end;
{--------------------------------------------------------------}
{ Emit an Instruction, Followed By a Newline }
procedure EmitLn(s: string);
begin
Emit(s);
WriteLn;
end;
end.
{--------------------------------------------------------------}
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Scanner, Parser;
begin
WriteLn(’MAIN:’);
EmitLn(’Hello, world!’);
end.
{--------------------------------------------------------------}
Did you see anything that surprised you? You may have been
surprised to see that you needed to type something, even though the
main program requires no input. That’s because of the initialization
in unit Input, which still requires something to put into the lookahead
character. Sorry, there’s no way out of that box, or rather, we don’t
WANT to get out. Except for simple test cases such as this, we
will always want a valid lookahead character, so the right thing to do
about this “problem” is ... nothing.
Perhaps more surprisingly, notice that the TAB character had no
effect; our line of “instructions” begins at column 1, same as the
fake label. That’s right: WinCRT doesn’t support tabs. We have a
problem.
There are a few ways we can deal with this problem. The one
thing we can’t do is to simply ignore it. Every assembler I’ve ever
used reserves column 1 for labels, and will rebel to see instructions
starting there. So, at the very least, we must space the instructions
over one column to keep the assembler happy. That’s easy enough
to do: Simply change, in procedure Emit, the line:
Write(TAB, s);
to:
Write(’ ’, s);
I must admit that I’ve wrestled with this problem before, and find
myself changing my mind as often as a chameleon changes color. For
the purposes we’re going to be using, 99% of which will be examining
the output code as it’s displayed on a CRT, it would be nice to see
neatly blocked out “object” code. The line:
SUB1:
        MOVE #4,D0
just plain looks neater than the different, but functionally identical
code,
SUB1:
MOVE #4,D0
THE ERROR UNIT
Our next set of routines are those that handle errors. To refresh your
memory, we take the approach, pioneered by Borland in Turbo Pas-
cal, of halting on the first error. Not only does this greatly simplify
our code, by completely avoiding the sticky issue of error recovery,
but it also makes much more sense, in my opinion, in an interactive
environment. I know this may be an extreme position, but I consider
the practice of reporting all errors in a program to be an anachronism,
a holdover from the days of batch processing. It’s time to scuttle the
practice. So there.
In our original Cradle, we had two error-handling procedures: Er-
ror, which didn’t halt, and Abort, which did. But I don’t think we
ever found a use for the procedure that didn’t halt, so in the new,
lean and mean unit Errors, shown next, procedure Error takes the
place of Abort.
{--------------------------------------------------------------}
unit Errors;
{--------------------------------------------------------------}
interface
procedure Error(s: string);
procedure Expected(s: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Write error Message and Halt }
procedure Error(s: string);
begin
WriteLn;
WriteLn(^G, ’Error: ’, s, ’.’);
Halt;
end;
{--------------------------------------------------------------}
{ Write "<something> Expected" }
procedure Expected(s: string);
begin
Error(s + ’ Expected’);
end;
end.
{--------------------------------------------------------------}
{--------------------------------------------------------------}
program Test;
uses WinCRT, Input, Output, Errors;
begin
Expected(’Integer’);
end.
{--------------------------------------------------------------}
Have you noticed that the “uses” line in our main program keeps
getting longer? That’s OK. In the final version, the main program
will only call procedures in our parser, so its use clause will only have
a couple of entries. But for now, it’s probably best to include all the
units so we can test procedures in them.
SCANNING AND PARSING
The next step is to collect the recognizers and scanner routines into
a unit of their own. The first version, Scanner1, handles only single-
character tokens:
{--------------------------------------------------------------}
unit Scanner1;
{--------------------------------------------------------------}
interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
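The rest of Scanner1 isn’t reproduced here. Apart from the recognizers
(identical to those in unit Scanner, shown a little later), its distinguishing
pieces are the single-character GetName and GetNumber; a sketch:
{--------------------------------------------------------------}
{ Get an Identifier (single character) }
function GetName: char;
begin
if not IsAlpha(Look) then Expected(’Name’);
GetName := UpCase(Look);
GetChar;
end;
{--------------------------------------------------------------}
{ Get a Number (single character) }
function GetNumber: char;
begin
if not IsDigit(Look) then Expected(’Integer’);
GetNumber := Look;
GetChar;
end;
{--------------------------------------------------------------}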
To exercise it, give the test program a body along these lines:
Write(GetName);
Match(’=’);
Write(GetNumber);
Match(’+’);
WriteLn(GetName);
and give it an input line such as:
x=0+y
THE SCANNER UNIT
The next, and by far the most important, version of the scanner
is the one that handles the multi-character tokens that all real lan-
guages must have. Only the two functions, GetName and GetNum-
ber, change between the two units, but just to be sure there are no
mistakes, I’ve reproduced the entire unit here. This is unit Scanner:
{--------------------------------------------------------------}
unit Scanner;
{--------------------------------------------------------------}
interface
uses Input, Errors;
function IsAlpha(c: char): boolean;
function IsDigit(c: char): boolean;
function IsAlNum(c: char): boolean;
function IsAddop(c: char): boolean;
function IsMulop(c: char): boolean;
procedure Match(x: char);
function GetName: string;
function GetNumber: string;
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Recognize an Alpha Character }
function IsAlpha(c: char): boolean;
begin
IsAlpha := UpCase(c) in [’A’..’Z’];
end;
{--------------------------------------------------------------}
{ Recognize a Numeric Character }
function IsDigit(c: char): boolean;
begin
IsDigit := c in [’0’..’9’];
end;
{--------------------------------------------------------------}
{ Recognize an Alphanumeric Character }
function IsAlnum(c: char): boolean;
begin
IsAlnum := IsAlpha(c) or IsDigit(c);
end;
{--------------------------------------------------------------}
{ Recognize an Addition Operator }
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’,’-’];
end;
{--------------------------------------------------------------}
{ Recognize a Multiplication Operator }
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’,’/’];
end;
{--------------------------------------------------------------}
{ Match One Character }
procedure Match(x: char);
begin
if Look = x then GetChar
else Expected(’’’’ + x + ’’’’);
end;
{--------------------------------------------------------------}
{ Get an Identifier }
function GetName: string;
var n: string;
begin
n := ’’;
if not IsAlpha(Look) then Expected(’Name’);
while IsAlnum(Look) do begin
n := n + Look;
GetChar;
end;
GetName := n;
end;
{--------------------------------------------------------------}
{ Get a Number }
function GetNumber: string;
var n: string;
begin
n := ’’;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
n := n + Look;
GetChar;
end;
GetNumber := n;
end;
end.
{--------------------------------------------------------------}
The same test program will test this scanner, also. Simply change
the “uses” clause to use Scanner instead of Scanner1. Now you should
be able to type multi-character names and numbers.
DECISIONS, DECISIONS
One of the decisions involved is case sensitivity. Some languages care
about the case of identifiers; others, like Pascal, don’t: the case of
characters doesn’t matter. For such lan-
guages, it’s easier to go ahead and map all identifiers to upper case in
the scanner, so we don’t have to worry later on when we’re comparing
strings for equality.
We could have even gone a step further, and map the characters to
upper case right as they come in, in GetChar. This approach works
too, and I’ve used it in the past, but it’s too confining. Specifically,
it will also map characters that may be part of quoted strings, which
is not a good idea. So if you’re going to map to upper case at all,
GetName is the proper place to do it.
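In the GetName shown above, that amounts to a one-character change
in the loop body, something like:
n := n + UpCase(Look);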
Note that the function GetNumber in this scanner returns a string,
just as GetName does. This is another one of those things I’ve oscil-
lated about almost daily, and the last swing was all of ten minutes
ago. The alternative approach, and one I’ve used many times in past
installments, returns an integer result.
Both approaches have their good points. Since we’re fetching a
number, the approach that immediately comes to mind is to return
it as an integer. But bear in mind that the eventual use of the
number will be in a write statement that goes back to the outside
world. Someone — either us or the code hidden inside the write
statement — is going to have to convert the number back to a string
again. Turbo Pascal includes such string conversion routines, but
why use them if we don’t have to? Why convert a number from
string to integer form, only to convert it right back again in the code
generator, only a few statements later?
Furthermore, as you’ll soon see, we’re going to need a temporary
storage spot for the value of the token we’ve fetched. If we treat the
number in its string form, we can store the value of either a variable
or a number in the same string. Otherwise, we’ll have to create a
second, integer variable.
On the other hand, we’ll find that carrying the number as a string
virtually eliminates any chance of optimization later on. As we get
to the point where we are beginning to concern ourselves with code
generation, we’ll encounter cases in which we’re doing arithmetic on
constants. For such cases, it’s really foolish to generate code that
performs the constant arithmetic at run time. Far better to let the
parser do the arithmetic at compile time, and merely code the result.
To do that, we’ll wish we had the constants stored as integers rather
than strings.
Just so you’ll have it on hand, here is the integer version of GetNumber:
{--------------------------------------------------------------}
{ Get a Number (integer version) }
function GetNumber: longint;
var n: longint;
begin
n := 0;
if not IsDigit(Look) then Expected(’Integer’);
while IsDigit(Look) do begin
n := 10 * n + (Ord(Look) - Ord(’0’));
GetChar;
end;
GetNumber := n;
end;
{--------------------------------------------------------------}
You might file this one away, as I intend to, for a rainy day.
PARSING
At this point, we have distributed all the routines that made up our
Cradle into units that we can draw upon as we need them. Obviously,
they will evolve further as we continue the process of bootstrapping
ourselves up again, but for the most part their content, and certainly
the architecture that they imply, is defined. What remains is to
embody the language syntax into the parser unit. We won’t do much
of that in this installment, but I do want to do a little, just to leave
us with the good feeling that we still know what we’re doing. So
before we go, let’s generate just enough of a parser to process single
factors in an expression. In the process, we’ll also, by necessity, find
we have created a code generator unit, as well.
Remember the very first installment of this series? We read an
integer value, say n, and generated the code to load it into the D0
register via an immediate move:
MOVE #n,D0
Shortly afterward, we did the same for a variable, with a PC-relative
load:
MOVE X(PC),D0
These instructions come from the code generator, so let’s begin unit
CodeGen with the first of them:
{--------------------------------------------------------------}
unit CodeGen;
{--------------------------------------------------------------}
interface
uses Output;
procedure LoadConstant(n: string);
{--------------------------------------------------------------}
implementation
{--------------------------------------------------------------}
{ Load the Primary Register with a Constant }
procedure LoadConstant(n: string);
begin
EmitLn(’MOVE #’ + n + ’,D0’ );
end;
end.
{--------------------------------------------------------------}
Copy and compile this unit, and execute the following main pro-
gram:
{--------------------------------------------------------------}
program Main;
uses WinCRT, Input, Output, Errors, Scanner, Parser;
begin
Factor;
end.
{--------------------------------------------------------------}
The parser unit itself doesn’t change, but we have a more complex
version of procedure Factor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
if IsDigit(Look) then
LoadConstant(GetNumber)
else if IsAlpha(Look) then
LoadVariable(GetName)
else
Error(’Unrecognized character ’ + Look);
end;
{--------------------------------------------------------------}
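Factor now calls a CodeGen routine LoadVariable that hasn’t appeared
in our listing of that unit; by analogy with LoadConstant and the PC-
relative load shown earlier, a sketch (remember to add its prototype to
the interface section) would be:
{--------------------------------------------------------------}
{ Load the Primary Register from a Variable }
procedure LoadVariable(Name: string);
begin
EmitLn(’MOVE ’ + Name + ’(PC),D0’);
end;
{--------------------------------------------------------------}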
Now, without altering the main program, you should find that our
program will process either a variable or a constant factor. At this
point, our architecture is almost complete; we have units to do all
the dirty work, and enough code in the parser and code generator
to demonstrate that everything works. What remains is to flesh out
the units we’ve defined, particularly the parser and code generator,
to support the more complex syntax elements that make up a real
language. Since we’ve done this many times before in earlier install-
ments, it shouldn’t take long to get us back to where we were before
the long hiatus. We’ll continue this process in Installment 16, coming
soon. See you then.
Part XVI
29 May 1995.
UNIT CONSTRUCTION
INTRODUCTION
From the very first installment, we’ve worked with tiny programs
and code fragments, testing one idea at a time. I still believe that’s
a good way to learn any subject; no one wants
to have to make changes to 100,000 line programs just to try out a
new idea. But the idea of just dealing with code snippets, rather than
complete programs, also has its drawbacks in that we often seemed
to be writing the same code fragments over and over. Although
repetition has been thoroughly proven to be a good way to learn new
ideas, it’s also true that one can have too much of a good thing. By
the time I had completed Installment 14 I seemed to have reached the
limits of my abilities to juggle multiple files and multiple versions of
the same software functions. Who knows, perhaps that’s one reason
I seemed to have run out of gas at that point.
By the end of Installment 15, we had the software organized into the
following Turbo Pascal units:
• Input
• Output
• Errors
• Scanner
• Parser
• CodeGen
JUST LIKE CLASSICAL?
From names like Scanner, Parser, and CodeGen, you might conclude
that our organization simply mirrors the structure classical compilers
have always had. A closer look, however, should convince you that,
while the names are similar, the functionalities are quite different.
Together, the scanner and parser of a classical implementation
comprise the so-called “front end,” and the code generator, the back
end. The front end routines process the language-dependent, syntax-
related aspects of the source language, while the code generator, or
back end, deals with the target machine-dependent parts of the prob-
lem. In classical compilers, the two ends communicate via a file of
instructions written in an intermediate language (IL).
Typically, a classical scanner is a single procedure, operating as a
coprocedure with the parser. It “tokenizes” the source file, reading
it character by character, recognizing language elements, translating
them into tokens, and passing them along to the parser. You can
think of the parser as an abstract machine, executing “op codes,”
which are the tokens. Similarly, the parser generates op codes of a
second abstract machine, which mechanizes the IL. Typically, the IL
file is written to disk by the parser, and read back again by the code
generator.
Our organization is quite different. We have no lexical scanner, in
the classical sense; our unit Scanner, though it has a similar name, is
not a single procedure or co-procedure, but merely a set of separate
subroutines which are called by the parser as needed.
Similarly, the classical code generator, the back end, is a translator
in its own right, reading an IL “source” file, and emitting an object
file. Our code generator doesn’t work that way. In our compiler, there
IS no intermediate language; every construct in the source language
syntax is converted into assembly language as it is recognized by
the parser. Like Scanner, the unit CodeGen consists of individual
procedures which are called by the parser as needed.
This “code them as you find them” philosophy may not produce
the world’s most efficient code — for example, we haven’t provided
(yet!) a convenient place for an optimizer to work its magic — but
it sure does simplify the compiler, doesn’t it?
And that observation prompts me to reflect, once again, on how
we have managed to reduce a compiler’s functions to such compar-
atively simple terms. I’ve waxed eloquent on this subject in past
installments, so I won’t belabor the point too much here. However,
because of the time that’s elapsed since those last soliloquies, I hope
you’ll grant me just a little time to remind myself, as well as you, how
we got here. We got here by applying several principles that writers
of commercial compilers seldom have the luxury of using. These are:
(Here, and elsewhere in this series, I’m only going to show you the
new routines. I’m counting on you to put them into the proper unit,
which you should normally have no trouble identifying. Don’t forget
to add the procedure’s prototype to the interface section of the unit.)
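The first of these new routines is SignedFactor, along with a small
CodeGen helper called Negate. Here's a sketch of how they might look,
patterned on the SignedTerm routine shown a bit further on; the bodies
are assumptions, not the original listings:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional Leading Sign }
procedure SignedFactor;
var Sign: char;
begin
Sign := Look;
if IsAddop(Look) then
GetChar;
Factor;
if Sign = '-' then Negate;
end;
{--------------------------------------------------------------}
{ Negate the Primary Register }
procedure Negate;
begin
EmitLn('NEG D0');
end;
{--------------------------------------------------------------}
SignedFactor belongs in Parser, Negate in CodeGen.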
In the main program, simply change the procedure called from
Factor to SignedFactor, and give the code a test. Isn’t it neat how
the Turbo linker and make facility handle all the details?
Yes, I know, the code isn’t very efficient. If we input a number,
-3, the generated code is:
MOVE #3,D0
NEG D0
I’m sure you know what’s coming next: We must, yet again, create
the rest of the procedures that implement the recursive-descent pars-
ing of an expression. We all know that the hierarchy of procedures
for arithmetic expressions is:
expression
term
factor
The three procedures Push, PopAdd, and PopSub are new code
generation routines. As the name implies, procedure Push generates
code to push the primary register (D0, in our 68000 implementation)
to the stack. PopAdd and PopSub pop the top of the stack again,
and add it to, or subtract it from, the primary register. The code is
shown next:
{--------------------------------------------------------------}
{ Push Primary to Stack }
procedure Push;
begin
EmitLn(’MOVE D0,-(SP)’);
end;
{--------------------------------------------------------------}
{ Add TOS to Primary }
procedure PopAdd;
begin
EmitLn(’ADD (SP)+,D0’);
end;
{--------------------------------------------------------------}
{ Subtract Primary from TOS }
procedure PopSub;
begin
EmitLn(’SUB (SP)+,D0’);
Negate;
end;
{--------------------------------------------------------------}
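The matching parser routines, Add, Subtract, and a first cut at
Expression, follow the same pattern as everything else in this
installment. Here's one plausible version, with the bodies assumed:
{--------------------------------------------------------------}
{ Recognize and Translate an Add }
procedure Add;
begin
Match('+');
Push;
Factor;
PopAdd;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Subtract }
procedure Subtract;
begin
Match('-');
Push;
Factor;
PopSub;
end;
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
SignedFactor;
while IsAddop(Look) do
case Look of
'+': Add;
'-': Subtract;
end;
end;
{--------------------------------------------------------------}
(Once Term shows up below, the calls to Factor here presumably become
calls to Term, and Expression starts with SignedTerm instead of
SignedFactor.)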
Add these routines to Parser and CodeGen, and change the main
program to call Expression. Voila!
The next step, of course, is to add the capability for dealing with
multiplicative terms. To that end, we’ll add a procedure Term, and
code generation procedures PopMul and PopDiv. These code gener-
ation procedures are shown next:
{--------------------------------------------------------------}
{ Multiply TOS by Primary }
procedure PopMul;
begin
EmitLn(’MULS (SP)+,D0’);
end;
{--------------------------------------------------------------}
{ Divide TOS by Primary }
procedure PopDiv;
begin
EmitLn(’MOVE (SP)+,D7’);
EmitLn(’EXT.L D7’);
EmitLn(’DIVS D0,D7’);
EmitLn(’MOVE D7,D0’);
end;
{--------------------------------------------------------------}
{--------------------------------------------------------------}
{ Parse and Translate a Term }
procedure Term;
begin
Factor;
while IsMulop(Look) do
case Look of
’*’: Multiply;
’/’: Divide;
end;
end;
{--------------------------------------------------------------}
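Term's two helpers, Multiply and Divide, aren't spelled out here, but
by analogy with the other operator routines a sketch might be:
{--------------------------------------------------------------}
{ Recognize and Translate a Multiply }
procedure Multiply;
begin
Match('*');
Push;
Factor;
PopMul;
end;
{--------------------------------------------------------------}
{ Recognize and Translate a Divide }
procedure Divide;
begin
Match('/');
Push;
Factor;
PopDiv;
end;
{--------------------------------------------------------------}
(When NotFactor appears later in this installment, these calls to
Factor become calls to NotFactor.)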
{--------------------------------------------------------------}
{ Parse and Translate a Term with Optional Leading Sign }
procedure SignedTerm;
var Sign: char;
begin
Sign := Look;
if IsAddop(Look) then
GetChar;
Term;
if Sign = ’-’ then Negate;
end;
{--------------------------------------------------------------}
...
{--------------------------------------------------------------}
{ Parse and Translate an Expression }
procedure Expression;
begin
SignedTerm;
while IsAddop(Look) do
case Look of
’+’: Add;
’-’: Subtract;
end;
end;
{--------------------------------------------------------------}
If you look at an expression such as:
-x*y
it’s very apparent that the sign goes with the whole TERM, x*y,
and not just the factor x, and that’s the way Expression is coded.
Test this new code by executing Main. It still calls Expression, so
you should now be able to deal with expressions containing any of
the four arithmetic operators.
While we're at it, we can also let Factor handle parenthesized
subexpressions, which is what gives the grammar its recursion. Here's
the extended version:
{--------------------------------------------------------------}
{ Parse and Translate a Factor }
procedure Factor;
begin
if Look = ’(’ then begin
Match(’(’);
Expression;
Match(’)’);
end
else if IsDigit(Look) then
LoadConstant(GetNumber)
else if IsAlpha(Look) then
LoadVariable(GetName)
else
Error(’Unrecognized character ’ + Look);
end;
{--------------------------------------------------------------}
ASSIGNMENTS
As long as we’re this close, we might as well create the code to deal
with an assignment statement. This code needs only to remember
the name of the target variable where we are to store the result of an
expression, call Expression, then store the result. The procedure is
shown next:
{--------------------------------------------------------------}
{ Parse and Translate an Assignment Statement }
procedure Assignment;
var Name: string;
begin
Name := GetName;
Match(’=’);
Expression;
StoreVariable(Name);
end;
{--------------------------------------------------------------}
Assignment needs one new code generation routine, StoreVariable:
{--------------------------------------------------------------}
{ Store the Primary Register to a Variable }
procedure StoreVariable(Name: string);
begin
EmitLn(’LEA ’ + Name + ’(PC),A0’);
EmitLn(’MOVE D0,(A0)’);
end;
{--------------------------------------------------------------}
Now, change the call in Main to call Assignment, and you should
see a full assignment statement being processed correctly. Pretty
neat, eh? And painless, too.
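For instance, typing in x=a+b should produce output along these lines
(assuming LoadVariable emits the PC-relative load sketched earlier):
MOVE a(PC),D0
MOVE D0,-(SP)
MOVE b(PC),D0
ADD (SP)+,D0
LEA x(PC),A0
MOVE D0,(A0)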
In the past, we’ve always tried to show BNF relations to define
the syntax we’re developing. I haven’t done that here, and it’s high
time I did. Here’s the BNF:
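(The listing below is pieced together from the procedures above, so
take the exact notation as approximate.)
<factor>      ::= <variable> | <constant> | '(' <expression> ')'
<term>        ::= <factor> [ <mulop> <factor> ]*
<signed term> ::= [ <addop> ] <term>
<expression>  ::= <signed term> [ <addop> <term> ]*
<assignment>  ::= <variable> '=' <expression>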
BOOLEANS
The next step, as we’ve learned several times before, is to add Boolean
algebra. In the past, this step has at least doubled the amount of
code we’ve had to write. As I’ve gone over this step in my mind, I’ve
found myself diverging more and more from what we did in previous
installments. To refresh your memory, I noted that Pascal treats
the Boolean operators pretty much identically to the way it treats
arithmetic ones. A Boolean “and” has the same precedence level
as multiplication, and the “or” as addition. C, on the other hand,
sets them at different precedence levels, and all told has a whopping
17 levels. In our earlier work, I chose something in between, with
seven levels. As a result, we ended up with things called Boolean
expressions, paralleling in most details the arithmetic expressions,
but at a different precedence level. All of this, as it turned out, came
about because I didn’t like having to put parentheses around the
Boolean expressions in statements like:
IF (c >= 'A') and (c <= 'Z') then ...
This time, let's try treating the Boolean operators at the very same
precedence levels as the arithmetic ones: the "or" operator '|' and
the exclusive or '~' go in with the addops. The first change, then,
goes into function IsAddop in unit Scanner:
{--------------------------------------------------------------}
function IsAddop(c: char): boolean;
begin
IsAddop := c in [’+’,’-’, ’|’, ’~’];
end;
{--------------------------------------------------------------}
Expression, in unit Parser, then picks up two new cases:
{--------------------------------------------------------------}
procedure Expression;
begin
SignedTerm;
while IsAddop(Look) do
case Look of
’+’: Add;
’-’: Subtract;
’|’: _Or;
’~’: _Xor;
end;
end;
{--------------------------------------------------------------}
(The underscores are needed, of course, because "or" and "xor" are
reserved words in Turbo Pascal.)
Next, the procedures _Or and _Xor:
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Or Operation }
procedure _Or;
begin
Match(’|’);
Push;
Term;
PopOr;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Boolean Exclusive Or Operation }
procedure _Xor;
begin
Match(’~’);
Push;
Term;
PopXor;
end;
{--------------------------------------------------------------}
The corresponding CodeGen routines are PopOr and PopXor:
{--------------------------------------------------------------}
{ Or TOS with Primary }
procedure PopOr;
begin
EmitLn(’OR (SP)+,D0’);
end;
{--------------------------------------------------------------}
{ Exclusive-Or TOS with Primary }
procedure PopXor;
begin
EmitLn(’EOR (SP)+,D0’);
end;
{--------------------------------------------------------------}
Now, let’s test the translator (you might want to change the call
in Main back to a call to Expression, just to avoid having to type
“x=” for an assignment every time).
So far, so good. The parser nicely handles expressions of the form:
x|y~z
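With Main calling Expression, that input should generate something
like:
MOVE x(PC),D0
MOVE D0,-(SP)
MOVE y(PC),D0
OR (SP)+,D0
MOVE D0,-(SP)
MOVE z(PC),D0
EOR (SP)+,D0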
It will just as happily accept a mixed-mode expression such as:
(a+b)*(c~d)
We’ve talked about this a bit, in the past. In general the rules
for what operations are legal or not cannot be enforced by the parser
itself, because they are not part of the syntax of the language, but
rather its semantics. A compiler that doesn’t allow mixed-mode ex-
pressions of this sort must recognize that c and d are Boolean vari-
ables, rather than numeric ones, and balk at multiplying them in the
next step. But this “policing” can’t be done by the parser; it must
be handled somewhere between the parser and the code generator.
We aren’t in a position to enforce such rules yet, because we haven’t
got either a way of declaring types, or a symbol table to store the
types in. So, for what we’ve got to work with at the moment, the
parser is doing precisely what it’s supposed to do.
Anyway, are we sure that we DON’T want to allow mixed-type
operations? We made the decision some time ago (or, at least, I did)
to adopt the value 0000 as a Boolean “false,” and -1, or FFFFh,
as a Boolean “true.” The nice part about this choice is that bitwise
operations work exactly the same way as logical ones. In other words,
when we do an operation on one bit of a logical variable, we do it on
all of them. This means that we don’t need to distinguish between
logical and bitwise operations, as is done in C with the operators &
and &&, and | and ||. Reducing the number of operators by half
certainly doesn’t seem all bad.
From the point of view of the data in storage, of course, the com-
puter and compiler couldn’t care less whether the number FFFFh
represents the logical TRUE, or the numeric -1. Should we? I sort
of think not. I can think of many examples (though they might be
frowned upon as “tricky” code) where the ability to mix the types
might come in handy. Example, the Dirac delta function, which could
be coded in one simple line:
-(x=0)
or the absolute value function:
x*(1+2*(x<0))
(If x is negative, the parenthesized factor is 1 + 2*(-1) = -1, so the
product is -x; otherwise it is just x.)
Please note, I’m not advocating coding like this as a way of life. I’d
almost certainly write these functions in more readable form, using
IFs, just to keep from confusing later maintainers. Still, a moral
question arises: Do we have the right to ENFORCE our ideas of
good coding practice on the programmer, by writing the language
so he can’t do anything else? That’s what Niklaus Wirth did, in
many places in Pascal, and Pascal has been criticized for it — for
not being as “forgiving” as C.
An interesting parallel presents itself in the example of the Mo-
torola 68000 design. Though Motorola brags loudly about the or-
thogonality of their instruction set, the fact is that it’s far from or-
thogonal. For example, you can read a variable from its address,
but you can’t write in the same way. To write, you must load
an address register with the address of X. The same is true for PC-
relative addressing:
MOVE X(PC),D0
is a legal instruction, but
MOVE D0,X(PC)
is not, which is exactly why StoreVariable has to go through the
LEA and MOVE pair.
BOOLEAN “AND”
With that bit of philosophy out of the way, we can press on to the
“and” operator, which goes into procedure Term. By now, you can
probably do this without me, but here’s the code, anyway:
In Scanner,
{--------------------------------------------------------------}
function IsMulop(c: char): boolean;
begin
IsMulop := c in [’*’,’/’, ’&’];
end;
{--------------------------------------------------------------}
In Parser,
{--------------------------------------------------------------}
procedure Term;
begin
Factor;
while IsMulop(Look) do
case Look of
’*’: Multiply;
’/’: Divide;
’&’: _And;
end;
end;
{--------------------------------------------------------------}
{ Parse and Translate a Boolean And Operation }
procedure _And;
begin
Match(’&’);
Push;
Factor;
PopAnd;
end;
{--------------------------------------------------------------}
and in CodeGen,
{--------------------------------------------------------------}
{ And Primary with TOS }
procedure PopAnd;
begin
EmitLn(’AND (SP)+,D0’);
end;
{--------------------------------------------------------------}
Your parser should now be able to process almost any sort of logical
expression, and (should you be so inclined), mixed-mode expressions
as well.
Why not “all sorts of logical expressions”? Because, so far, we
haven’t dealt with the logical “not” operator, and this is where it
gets tricky. The logical “not” operator seems, at first glance, to be
identical in its behavior to the unary minus, so my first thought
was to let the exclusive or operator, ’~’, double as the unary “not.”
That didn’t work. In my first attempt, procedure SignedTerm simply
ate my ’~’, because the character passed the test for an addop, but
SignedTerm ignores all addops except ’-’. It would have been easy
enough to add another line to SignedTerm, but that would still not
solve the problem, because note that Expression only accepts a signed
term for the FIRST argument.
Mathematically, an expression like:
-a * -b
is questionable at best; in fact our parser will reject the second
unary minus. The logical counterpart, !a & !b, on the other hand, is
perfectly reasonable, and turns up all the time.
In the case of these unary operators, choosing to make them act the
same way seems an artificial force fit, sacrificing reasonable behavior
on the altar of implementational ease. While I’m all for keeping the
implementation as simple as possible, I don’t think we should do so
at the expense of reasonableness. Patching like this would be missing
the main point, which is that the logical “not” is simply NOT the
same kind of animal as the unary minus. Consider the exclusive or,
which is most naturally written as:
(a & !b) | (!a & b)
If we allow the “not” to modify the whole term, the last term in
parentheses would be interpreted as:
not(a and b)
which is not the same thing at all. So it’s clear that the logical
“not” must be thought of as connected to the FACTOR, not the
term.
The idea of overloading the ’~’ operator also makes no sense from
a mathematical point of view. The implication of the unary minus is
that it’s equivalent to a subtraction from zero:
-x <=> 0-x
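By the same token, letting '~' double as a unary "not" would have to
mean:
~x <=> 0~x
but exclusive-oring x with zero just gives x back, while the bitwise
"not" of x is really FFFFh~x, a different operation entirely.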
Looking at this list, it’s certainly not hard to see why we had
trouble using ’~’ as the “not” symbol!
So how do we mechanize the rules? In the same way as we did
with SignedTerm, but at the factor level. We’ll define a procedure
NotFactor:
{--------------------------------------------------------------}
{ Parse and Translate a Factor with Optional "Not" }
procedure NotFactor;
begin
if Look = ’!’ then begin
Match(’!’);
Factor;
NotIt;
end
else
Factor;
end;
{--------------------------------------------------------------}
and call it from all the places where we formerly called Factor,
i.e., from Term, Multiply, Divide, and _And. Note the new code
generation procedure:
{--------------------------------------------------------------}
{ Bitwise Not Primary }
procedure NotIt;
begin
EmitLn(’EOR #-1,D0’);
end;
{--------------------------------------------------------------}
Try this now, with a few simple cases. In fact, try that exclusive
or example,
a&!b|!a&b
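Tracing through the parser by hand, the output ought to look something
like this:
MOVE a(PC),D0
MOVE D0,-(SP)
MOVE b(PC),D0
EOR #-1,D0
AND (SP)+,D0
MOVE D0,-(SP)
MOVE a(PC),D0
EOR #-1,D0
MOVE D0,-(SP)
MOVE b(PC),D0
AND (SP)+,D0
OR (SP)+,D0
which evaluates (a AND NOT b) OR (NOT a AND b), an exclusive or built
from the other three operators.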
That’s precisely what we’d like to get. So, at least for both arith-
metic and logical operators, our new precedence and new, slimmer
syntax hang together. Even the peculiar, but legal, expression with
leading addop:
~x
still comes out right; the parser simply swallows the leading ’~’,
which amounts to evaluating
0~x,
which is equal to x.
When we look at the BNF we’ve created, we find that our boolean
algebra now adds only one extra line:
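Judging from NotFactor, the new line should be something like:
<not factor> ::= [ '!' ] <factor>
with <term> (and the other places that used to call <factor>) now
referring to <not factor>.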
That’s a big improvement over earlier efforts. Will our luck con-
tinue to hold when we get to relational operators? We’ll find out
soon, but it will have to wait for the next installment. We’re at a
good stopping place, and I’m anxious to get this installment into your
hands. It’s already been a year since the release of Installment 15. I
blush to admit that all of this current installment has been ready for
almost as long, with the exception of relational operators. But the
information does you no good at all, sitting on my hard disk, and by
holding it back until the relational operations were done, I’ve kept
it out of your hands for that long. It’s time for me to let go of it
and get it out where you can get value from it. Besides, there are
quite a number of serious philosophical questions associated with the
relational operators, as well, and I’d rather save them for a separate
installment where I can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and
I’ll see you soon with relationals.
Copyright © 1988–1994 Jack W. Crenshaw. All rights reserved.