Nils M Holm
Write Your Own Compiler
Nils M Holm, 2017
Print and Distribution:
Lulu Press, Inc.; Raleigh, NC; USA
Put your heart and soul in whatever you do
Contents
Preface 7
Rules of the Game 9
The Language 12
Syntax 12
Semantics 14
The Target Architecture 16
Execution Model 16
The Compiler 19
Prelude 19
Symbol Table 23
Code Generator 27
Digression: Execution Model 29
The VSM Instructions 30
Marking and Resolving 32
Function Contexts 33
Back to the Code 35
Digression: The Accumulator 39
Back to the Code 40
Scanner 44
The Scanner Code 44
Parser 56
Parser Prelude 56
Declaration Parser 59
Expression Parser 71
Statement Parser 89
Program Parser 100
Initialization 101
Main Program 105
Preface
This text is the most minimal complete introduction to compiler-
writing that I can imagine. It covers the entire process from
analyzing source code to generating an executable file in about
100 pages of prose. The compiler discussed in the text is entirely
self-contained and does not rely on any third-party software,
except for an operating system.
The book covers lexical analysis, syntax analysis, and code
generation by means of a minimal high-level programming
language. The language is powerful enough to implement its own
compiler in less than 1600 lines of readable code. The main part
of the text consists of a tour through that code.
The language used in this book is a subset of T3X, a minimal
procedural language that was first described in 1996 in the book
‘‘Lightweight Compiler Techniques’’. Although T3X already is quite
minimal, T3X9, the dialect discussed here, is even smaller.
If you are familiar with Pascal, C, Java, or any other procedural
language, T3X will be easy to pick up. If you are completely new to
the field, there is a brief introduction to T3X9 in the appendix.
The T3X9 compiler runs on FreeBSD-386 and generates code for
FreeBSD-386, so you will need that system on your computer if
you want to experiment with the compiler. Of course it seems
curious to install 2G bytes of software in order to run a 32K-byte
executable, but then these are interesting times.
Like the compiler, this book is self-contained. It includes the full
compiler, its runtime support code, and enough information to
understand both the source language and the target platform.
Welcome to compiler writing, enjoy the tour!
Nils M Holm, April 2017
[Figure 1: Compilation. Labels: Source Language S, Implementation
Language I, Target Language T]
[Figure: Compiler, Linker, Library, Executable]
The Language
The source language and implementation language used in this
book is a subset of an obscure, little, procedural language called
T3X. It is a tiny language that once had a tiny community, and it
was even used to write some real-life software, like its own
compiler and linker, an integrated development system, a
database system, and a few simple games. It was also the subject of
a few college courses, most probably because (1) it was reasonably
well defined and documented and (2) due to the size of its
community, nobody would do your homework assignments for you.
T3X looks like a mixture of C, Pascal, and BCPL. It has untyped
data and typed operators, which simplifies the compiler a lot, but
also leaves all the type checking to the programmer, which is a
nightmare from the perspective of a modern software developer.
But this text is not about creating a product and making a shiny web
page about it. This is about diving right into the depths of the
matter and having some fun. And a fun language T3X is. It is
interesting to see how little you need to be able to write quite
comprehensible and expressive programs.
Syntax
Syntax is what a language looks like. T3X is a block-structured,
procedural language, which means that its programs describe
procedures for manipulating data, i.e. ‘‘what to do with data’’. It is
called block-structured, because it is a structured language that
divides programs into blocks. A block is a chunk of source code
that describes a part of a procedure.
A structured language uses certain constructs to describe the flow
of control while a program executes, typically selection and loops
(repetition).
Source code of procedural languages is mostly organized in the
form of a hierarchy consisting of programs, procedures or
functions, declarations, statements, and expressions. The most
[Figure: levels of abstraction, from more abstract to less abstract:
Program, Declaration, Statement, Expression]
In procedural languages:
• programs contain declarations, statements, and expressions
• declarations contain statements and expressions
• statements contain expressions
If you are familiar with C or Pascal or Java, the T3X syntax will
look quite familiar. Here is the infamous bubblesort algorithm in
T3X:
! This is a comment
bubblesort(n, v) do var i, swapped, t;
    swapped := %1;
    while (swapped) do
        swapped := 0;
        for (i=0, n-1) do
            if (v[i] > v[i+1]) do
                t := v[i];
                v[i] := v[i+1];
                v[i+1] := t;
                swapped := %1;
            end
        end
    end
end
bubblesort(n, v) starts the declaration of the procedure
bubblesort with the formal arguments n and v. The body of the
procedure is a block statement (or compound statement) enclosed
in the keywords DO and END. The compound statement declares
the local variables i, swapped, and t.
Assignment is done by the := operator (and equality is expressed
with =). The statement FOR (i=0, n-1) counts from 0 to
n − 2. The i-th element of a vector (or array) v is accessed using
v[i]. Elements are numbered v[0] ... v[n − 1].
The lexeme %1 denotes the number −1. You could also write -1,
but there is a subtle difference: the former is a value, and the latter
is an operator applied to a value, which will not work in contexts
where a constant is expected.
Furthermore, /\ and \/ denote logical (short-circuit) AND and
OR, and X->Y:Z means ‘‘if x then y else z’’, just like x?y:z in C.
IF with an ELSE is called IE (If/Else).
You will pick up the rest of the T3X syntax as we walk through the
compiler source code. If you are interested, there is a brief
introduction to T3X in the appendix.
Semantics
Semantics is how the syntax is interpreted. Note that ‘‘interpreted’’
does not imply the use of interpreting software here. Interpretation
can be done at various levels, and in the case of the T3X compiler
presented here, the code will eventually be interpreted by a 386
(or x86) CPU.
Interpretation in this case is a question of meaning. What does a
statement like
WHILE (swapped) DO ... END
Execution Model
In the ideal case, the execution model of a compiler is exactly the
target machine for which code is generated. In fact, CPUs and
compilers are designed in such a way that the mapping from
source code to machine instructions is as straightforward as
possible.
However, such an ideal mapping is also non-trivial, so in practice,
there is often an additional layer between the source language
and the target language. We call this additional layer the execution
model of the compiler.
Program      Stack
load a       a
load b       b a
load c       c b a
mul          b*c a
add          a+b*c
store a
Figure 4: Stack Machine Program Execution
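The trace in figure 4 can be reproduced with a few lines of code. The following is a minimal sketch in Python (not T3X); the function name run and the encoding of instructions as (opcode, operand) pairs are hypothetical choices made only for this illustration.

```python
def run(program, variables):
    """Execute (opcode, operand) pairs on a simple stack machine.

    Returns the stack as it is left after the last instruction.
    """
    stack = []
    for op, arg in program:
        if op == "load":
            stack.append(variables[arg])      # push the value of a variable
        elif op == "store":
            variables[arg] = stack.pop()      # pop the top into a variable
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right = stack.pop()
            left = stack.pop()
            stack.append(left * right)
        else:
            raise ValueError("unknown instruction: " + op)
    return stack


# The program of figure 4: a := a + b*c
program = [
    ("load", "a"), ("load", "b"), ("load", "c"),
    ("mul", None), ("add", None), ("store", "a"),
]
variables = {"a": 2, "b": 3, "c": 4}
leftover = run(program, variables)
print(variables["a"], leftover)
```

With the sample values above, the stack is empty after the final store, just as in the last row of the figure.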
Stack Machine    386 CPU
load a           pushl a
mul              pop %ecx
                 pop %eax
                 mul %ecx
                 push %eax
add              pop %ecx
                 pop %eax
                 add %ecx
                 push %eax
store a          popl a
Figure 5: Stack Machine Instruction Mapping
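The mapping in figure 5 amounts to a table lookup: each stack machine instruction expands to a fixed sequence of 386 instructions. Here is a small Python sketch of that idea. The names EXPANSION and translate are made up for this example, and the instruction text is copied from the figure rather than from the compiler, which emits machine code bytes instead of assembly text.

```python
# Expansion table taken from figure 5; "{0}" is replaced by the operand.
EXPANSION = {
    "load":  ["pushl {0}"],
    "store": ["popl {0}"],
    "mul":   ["pop %ecx", "pop %eax", "mul %ecx", "push %eax"],
    "add":   ["pop %ecx", "pop %eax", "add %ecx", "push %eax"],
}


def translate(program):
    """Expand (opcode, operand) pairs into a list of 386 instruction lines."""
    lines = []
    for op, arg in program:
        for template in EXPANSION[op]:
            lines.append(template.format(arg))
    return lines


program = [
    ("load", "a"), ("load", "b"), ("load", "c"),
    ("mul", None), ("add", None), ("store", "a"),
]
for line in translate(program):
    print(line)
```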
The Compiler
All code in this book, but not the prose, is provided
under the Creative Commons Zero license, i.e. it
can be used for any purpose, without attribution:
http://creativecommons.org/publicdomain/zero/1.0/
Prelude
! T3X9 -> ELF-FreeBSD-386 compiler
! Nils M Holm, 2017, CC0 license
We are assuming four bytes per word, which is fair enough for
most 32-bit processors and certainly true for the 386.
const BPW = 4;
PROG_SIZE is the maximum size of the input program. To keep
things simple, we read the whole program into memory in one
chunk. When the program is larger than this, we are out of luck.
const PROG_SIZE = 65536;
The next two constants specify the maximum size of the text
(code) segment and the data segment of the generated
executable. Since we are operating in a 32-bit environment, feel
free to increase those (and PROG_SIZE, above) as you see fit.
Due to the way in which the virtual load addresses in the resulting
ELF file will be organized, TEXT_SIZE must be a multiple of
PAGE_SIZE. (See page 27.)
const TEXT_SIZE = 65536; ! * PAGE_SIZE !
const DATA_SIZE = 65536;
NRELOC specifies the maximum number of relocation entries. A
relocation entry specifies a location in the generated executable
whose content has to be adjusted later, because some part of the
program is relocated in memory.
ntoa(x) do var i, k;
    if (x = 0) return "0";
    i := 0;
    k := x<0-> -x: x;
    while (k > 0) do
        i := i+1;
        k := k/10;
    end
    i := i+1;
    if (x < 0) i := i+1;
    ntoa_buf::i := 0;
    k := x<0-> -x: x;
    while (k > 0) do
        i := i-1;
        ntoa_buf::i := '0' + k mod 10;
        k := k/10;
    end
    if (x < 0) do
        i := i-1;
        ntoa_buf::i := '-';
    end
    return @ntoa_buf::i;
end
The str.length() function returns the length of a string. Since
strings are NUL-terminated, this can be done efficiently using
t.memscan(). The function will return wrong results for strings
longer than 32766 characters. Feel free to increase the number.
str.length(s) return t.memscan(s, 0, 32767);
Str.copy(sd,ss) copies the string ss (source) to the byte
vector sd (destination). Str.append(sd,ss) appends ss to sd .
Sd must provide enough space for the concatenated string.
str.copy(sd, ss)
    t.memcopy(ss, sd, str.length(ss)+1);
str.append(sd, ss)
    t.memcopy(ss, @sd::str.length(sd),
        str.length(ss)+1);
push(x) do
    if (Sp >= STACK_SIZE)
        oops("stack overflow", 0);
    Stack[Sp] := x;
    Sp := Sp+1;
end
pop() do
    if (Sp < 1) oops("stack underflow", 0);
    Sp := Sp-1;
    return Stack[Sp];
end
swap() do var t;
    if (Sp < 2) oops("stack underflow", 0);
    t := Stack[Sp-1];
    Stack[Sp-1] := Stack[Sp-2];
    Stack[Sp-2] := t;
end
The following functions define character classes by returning truth
when the character c belongs to the corresponding class.
numeric(c) return '0' <= c /\ c <= '9';
Symbol Table
A symbol table entry consists of three fields. SNAME holds the
address of the corresponding name in the name list. SFLAGS
contains the type of the symbol, and SVALUE its value (which is
the address of the symbol in case of variables, vectors, and
functions).
struct SYM = SNAME, SFLAGS, SVALUE;
Slot #0   SNAME    Syms[0*SYM+SNAME]
          SFLAGS   Syms[0*SYM+SFLAGS]
          SVALUE   Syms[0*SYM+SVALUE]
Slot #1   SNAME    Syms[1*SYM+SNAME]
          SFLAGS   Syms[1*SYM+SFLAGS]
          SVALUE   Syms[1*SYM+SVALUE]
...
The syntax v::n allocates a vector of n bytes. Note that this is still
a vector of machine words, it is just being allocated in units of
bytes.
Yp and Np are the offsets of the free regions of the above vectors.
A SYM structure is allocated in Syms by incrementing Yp by SYM,
and a name is allocated in Nlist by adding the length of the
name (plus 1 for the trailing NUL) to Np.
The connection between the symbol table and the name list is
illustrated in figure 7.
var Syms[SYM*SYMTBL_SIZE];
var Nlist::NLIST_SIZE;
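The interplay of Syms, Nlist, Yp, and Np may be easier to follow in a small model. The Python sketch below (not T3X) keeps the symbol table as one flat list with three fields per slot and the names as NUL-terminated strings in a byte buffer. The class SymbolTable and its method names are hypothetical, the table sizes are arbitrary, and SNAME stores an offset into the name buffer here, whereas the compiler stores an address.

```python
SNAME, SFLAGS, SVALUE = 0, 1, 2   # field offsets within a slot
SYM = 3                           # number of fields per slot


class SymbolTable:
    def __init__(self, max_symbols=8, nlist_size=64):
        self.syms = [0] * (SYM * max_symbols)   # flat table, like Syms
        self.nlist = bytearray(nlist_size)      # name storage, like Nlist
        self.yp = 0                             # free region of syms
        self.np = 0                             # free region of nlist

    def new_name(self, name):
        """Copy name plus a trailing NUL into nlist, return its offset."""
        data = name.encode("ascii") + b"\0"
        if self.np + len(data) >= len(self.nlist):
            raise OverflowError("too many symbol names")
        start = self.np
        self.nlist[start:start + len(data)] = data
        self.np += len(data)
        return start

    def add(self, name, flags, value):
        """Allocate a slot by advancing yp by SYM, then fill its fields."""
        if self.yp + SYM >= len(self.syms):
            raise OverflowError("too many symbols")
        slot = self.yp
        self.yp += SYM
        self.syms[slot + SNAME] = self.new_name(name)
        self.syms[slot + SFLAGS] = flags
        self.syms[slot + SVALUE] = value
        return slot

    def name_of(self, slot):
        """Read the NUL-terminated name that the slot refers to."""
        start = self.syms[slot + SNAME]
        end = self.nlist.index(0, start)
        return self.nlist[start:end].decode("ascii")


table = SymbolTable()
first = table.add("foo", 1, 100)
second = table.add("x", 2, 200)
print(first, second, table.name_of(second), table.np)
```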
The function also makes sure that the located symbol has the
proper type by checking the symbol flags against f. For instance,
to locate a constant, CNST would be passed to lookup() in the f
argument.
lookup(s, f) do var y;
    y := find(s);
    if (y = 0) aw("undefined", s);
    if (y[SFLAGS] & f \= f)
        aw("unexpected type", s);
    return y;
end
Newname() adds a name to the name list.
newname(s) do var k, new;
    k := str.length(s)+1;
    if (Np+k >= NLIST_SIZE)
        aw("too many symbol names", s);
    new := @Nlist::Np;
    t.memcopy(s, new, k);
    Np := Np+k;
    return new;
end
Add() adds the symbol s with flags f and value v to the symbol
table. Since there is no shadowing in T3X, a name already
existing in the table is a redefinition error, except when the existing
symbol is a forward declaration (DECL) and the new symbol
names a function.
add(s, f, v) do var y;
    y := find(s);
    if (y \= 0) do
        ie (y[SFLAGS] & FORW /\ f & FUNC)
            return y;
        else
            aw("redefined", s);
    end
    if (Yp+SYM >= SYMTBL_SIZE*SYM)
        aw("too many symbols", 0);
    y := @Syms[Yp];
    Yp := Yp+SYM;
    y[SNAME] := newname(s);
    y[SFLAGS] := f;
    y[SVALUE] := v;
    return y;
end
Code Generator
An ELF module typically has (at least) two segments: one holding
the text (code) and one holding the data of the program. The
following constants define the virtual memory addresses at which
the operating system will load these segments.
The (hexadecimal) address 8048000h is the typical load address
for the first segment on a 386-based system. [Lev99] The data
segment is placed right after the text segment.
Because of the way in which DATA_VADDR is computed,
TEXT_SIZE must be a multiple of PAGE_SIZE (all virtual
addresses must be page-aligned).
const TEXT_VADDR = 134512640; ! 08048000h
const DATA_VADDR = TEXT_VADDR + TEXT_SIZE;
HEADER_SIZE is the size of a minimal two-segment ELF header.
const HEADER_SIZE = 116; ! 74h
PAGE_SIZE specifies the size of a virtual memory page in the
target operating system. It is also used for segment alignment in
the ELF file.
const PAGE_SIZE = 4096;
A relocation entry consists of an address (RADDR) and a segment
(RSEG). Relocation entries are allocated in the same way as
symbol table entries (page 24).
struct RELOC = RADDR, RSEG;
var Rel[RELOC*NRELOC];
These byte vectors hold the two segments and the ELF header.
var Text_seg::TEXT_SIZE;
var Data_seg::DATA_SIZE;
var Header::HEADER_SIZE;
The following variables indicate:
• Rp - the free region in the relocation table
• Tp - the next address in the text segment
• Dp - the next address in the data segment
• Lp - the next address in a local stack frame
• Hp - the free region in the ELF header
When emitting code and data, this means that Tp always contains
the current address in the code segment, and Dp always contains
the current address in the data segment. Because jump
instructions are always relative to the current address on the 386,
text addresses never need relocation, only data addresses do.
This is why the RELOC structure has only one segment field, which
indicates the segment in which the relocation has to be performed
(but no field indicating relative to which segment it has to be
done).
var Rp, Tp, Dp, Lp, Hp;
The Acc and Codetbl variables as well as the CG structure are
part of the execution model of the T3X9 compiler. The model will
be explained shortly.
var Acc;
var Codetbl;
struct CG =
CG_PUSH, CG_CLEAR,
CG_LDVAL, CG_LDADDR, CG_LDLREF, CG_LDGLOB,
CG_LDLOCL,
CG_STGLOB, CG_STLOCL, CG_STINDR, CG_STINDB,
CG_INCGLOB, CG_INCLOCL,
CG_ALLOC, CG_DEALLOC, CG_LOCLVEC, CG_GLOBVEC,
CG_DREFB          A := b[A]
CG_MARK           see next section
CG_RESOLV         see next section
CG_CALL w         P := P − 1; S0 := I; I := w
CG_JUMPFWD w      I := w
CG_JUMPBACK w     I := w
CG_JMPFALSE w     if S0 = 0, then I := w; always: P := P + 1
CG_JMPTRUE w      if S0 ≠ 0, then I := w; always: P := P + 1
CG_FOR w          if S0 ≥ A, then I := w; always: P := P + 1
CG_FORDOWN w      if S0 ≤ A, then I := w; always: P := P + 1
CG_ENTER          P := P − 1; S0 := F; F := P
CG_EXIT           F := S0; I := S1; P := P + 2
CG_HALT w         halt program execution, return w
CG_NEG            A := −A
CG_INV            A := bitwise complement of A
CG_LOGNOT         if A = 0 then A := −1 else A := 0
CG_ADD            A := S0 + A; P := P + 1
CG_SUB            A := S0 − A; P := P + 1
CG_MUL            A := S0 ⋅ A; P := P + 1
CG_DIV            A := S0 div A; P := P + 1
CG_MOD            A := S0 mod A; P := P + 1
(x div y is the integer quotient of x and y and x mod y is the
remainder of the integer division.)
CG_AND            A := S0 AND A; P := P + 1
CG_OR             A := S0 OR A; P := P + 1
CG_XOR            A := S0 XOR A; P := P + 1
Then the code between the marked address and the destination is
generated. (F8.2)
Finally, when the destination of the jump is reached, the compiler
pops the mark off the stack and inserts the current address (Tp) in
the location pointed to by the mark. (F8.3)
[Figure 8: Resolving a Mark by Backpatching. Panels F8.1, F8.2, and
F8.3 show the emitted code together with the mark, the current
address Tp, and the destination.]
Because this last step modifies code that has been emitted earlier,
it is referred to as backpatching.
A mark can also be resolved by just generating a jump back to the
marked address. In this case, no backpatching is required,
because the destination is already known when the jump
instruction is generated.
The following VSM instructions generate forward jumps:
CG_JUMPFWD, CG_JMPFALSE, CG_JMPTRUE, CG_FOR,
CG_FORDOWN. They place a mark that has to be resolved later.
There is only one instruction involving a backward jump:
CG_JUMPBACK. It requires a mark to already be in place.
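Marking and resolving can be sketched in a few lines. The Python model below (not T3X) keeps a code buffer and a stack of marks. A forward jump emits a placeholder operand and pushes its address; resolving it later writes the displacement into the placeholder. The class and method names are hypothetical. The displacement formulas follow the code generator of this compiler (destination minus operand address minus BPW), and E9h is the 386 opcode for a near jump with a 32-bit relative displacement.

```python
BPW = 4       # bytes per machine word
JMP = 0xE9    # 386 near jump with 32-bit relative displacement
NOP = 0x90


class CodeBuffer:
    def __init__(self):
        self.text = bytearray()
        self.marks = []

    def emit(self, byte):
        self.text.append(byte & 255)

    def emitw(self, x):
        for shift in (0, 8, 16, 24):          # little-endian word
            self.emit(x >> shift)

    def patch(self, addr, x):
        for n in range(BPW):
            self.text[addr + n] = (x >> (8 * n)) & 255

    def jump_forward(self, opcode):
        """Emit a jump whose destination is not yet known."""
        self.emit(opcode)
        self.marks.append(len(self.text))     # address of the operand
        self.emitw(0)                         # placeholder

    def resolve(self):
        """Backpatch the most recent forward jump to land here."""
        mark = self.marks.pop()
        self.patch(mark, len(self.text) - mark - BPW)

    def mark(self):
        """Remember the current address as a backward jump target."""
        self.marks.append(len(self.text))

    def jump_back(self, opcode):
        """Emit a jump to the most recent mark; no backpatching needed."""
        self.emit(opcode)
        self.emitw(self.marks.pop() - len(self.text) - BPW)


buf = CodeBuffer()
buf.jump_forward(JMP)
for _ in range(3):
    buf.emit(NOP)
buf.resolve()
print(buf.text.hex())
```

In both directions the stored value is relative to the address following the operand, which is why BPW is subtracted.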
Function Contexts
A function context (or context) is a region on the stack where a
function stores its arguments and local variables. A context is set
up by the CG_CALL and CG_ENTER instructions.
CG_CALL puts the return address on the stack after the arguments
have been placed there by the caller. The return address is the
address where program execution will continue when the called
function returns.
CG_ENTER saves the current frame pointer F on the stack, and
then sets F to the value of P , thereby creating a new frame. The
names ‘‘frame’’ and ‘‘context’’ are used as synonyms here.
A function context with two arguments and three local variables is
shown in figure 9.
                 Argument #1          F+12
                 Argument #2          F+8
                 Return Address       F+4
                 Old Frame Pointer    F
                 Local Variable #1    F-4
Stack growth     Local Variable #2    F-8
            P    Local Variable #3    F-12
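The offsets in the frame layout follow a simple pattern, which the following Python sketch spells out. It assumes four-byte words, that every local variable occupies exactly one word (vectors would take more), and that the pattern of the two-argument example generalizes to any number of arguments. The function names are hypothetical.

```python
BPW = 4   # bytes per machine word


def arg_offset(i, nargs):
    """Offset from F of argument number i (1-based) out of nargs.

    The old frame pointer sits at F and the return address at F+BPW,
    so the last argument is at F+2*BPW and earlier ones lie above it.
    """
    return BPW * (nargs - i + 2)


def local_offset(k):
    """Offset from F of local variable number k (1-based)."""
    return -BPW * k


for i in (1, 2):
    print("argument", i, arg_offset(i, 2))
for k in (1, 2, 3):
    print("local", k, local_offset(k))
```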
emitw(x) do
    emit(255&x);
    emit(255&(x>>8));
    emit(255&(x>>16));
    emit(255&(x>>24));
end
The tag() function tags the current location in the text or data
segment for relocation. Note that BPW is subtracted from the
current location, because the address to relocate already has
been emitted when tag() is called.
tag(seg) do
    if (Rp+RELOC >= RELOC*NRELOC)
        oops("relocation buffer overflow", 0);
    Rel[Rp+RADDR] := seg = 't'-> Tp-BPW: Dp-BPW;
    Rel[Rp+RSEG] := seg;
    Rp := Rp+RELOC;
end
tpatch() patches the machine word at text segment location a
to contain the value x , and tfetch() retrieves the machine word
at text segment location a. These are used for backpatching and
relocation.
tpatch(a, x) do
    Text_seg::a := 255&x;
    Text_seg::(a+1) := 255&(x>>8);
    Text_seg::(a+2) := 255&(x>>16);
    Text_seg::(a+3) := 255&(x>>24);
end
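Patching and fetching are inverse operations on a little-endian word, i.e. the least significant byte is stored first. The Python sketch below (not T3X) shows the round trip. It passes the segment as an explicit argument, which the compiler does not do, and its tfetch is a guess at the behavior described above, returning the word as an unsigned value.

```python
def tpatch(seg, a, x):
    """Store the 32-bit word x at offset a, least significant byte first."""
    for n in range(4):
        seg[a + n] = (x >> (8 * n)) & 255


def tfetch(seg, a):
    """Read the 32-bit word at offset a as an unsigned value."""
    return seg[a] | seg[a + 1] << 8 | seg[a + 2] << 16 | seg[a + 3] << 24


seg = bytearray(8)
tpatch(seg, 2, 0x12345678)
print(seg.hex(), hex(tfetch(seg, 2)))
```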
dataw(x) do
    data(255&x);
    data(255&(x>>8));
    data(255&(x>>16));
    data(255&(x>>24));
end
Similarly, dpatch() and dfetch() patch and retrieve values in
the data segment.
dpatch(a, x) do
    Data_seg::a := 255&x;
    Data_seg::(a+1) := 255&(x>>8);
    Data_seg::(a+2) := 255&(x>>16);
    Data_seg::(a+3) := 255&(x>>24);
end
emitw(v);
end
else ie (s::1 = ’a’) do
emitw(v);
tag(’t’);
end
else ie (s::1 = ’m’) do
push(Tp);
end
else ie (s::1 = ’>’) do
push(Tp);
emitw(0);
end
else ie (s::1 = ’<’) do
emitw(pop()-Tp-BPW);
end
else ie (s::1 = ’r’) do
x := pop();
tpatch(x, Tp-x-BPW);
end
else do
oops("bad code", 0);
end
end
else do
emit(hex(s::0)*16+hex(s::1));
end
s := s+2;
end
end
Gen() is like rgen(), but generates code for the VSM instruction
id rather than emitting code from a raw hexdump.
gen(id, v) rgen(Codetbl[id][1], v);
clear() Acc := 0;
activate() Acc := 1;
The relocate() function resolves relocation entries by fixing the
addresses of data objects in the text and data segments. When
code is generated, data addresses begin at Dp = 0, so a VSM
load instruction for the first data object would read LDGLOB 0.
However, the data segment is not mapped to virtual address 0 at
run time, but to DATA_VADDR. Just adding DATA_VADDR to each
data reference would not suffice, though, due to the way in which
an ELF file is mapped to memory. This is illustrated in figure 11.
Because the ELF executable is loaded in chunks of pages, the last
page of the text segment will contain the first bytes of the data
segment and the first page of the data segment will contain the
last bytes of the text segment.
So the overlap between text and data segment causes all data
objects in the data segment to move towards the end of the
segment by the size of that overlap. The overlap is computed
using the formula
(HEADER_SIZE + Tp) mod PAGE_SIZE
and adding DATA_VADDR to that overlap gives exactly the distance
of the data segment from 0.
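As a worked example of this formula, the Python sketch below computes the run-time address of a data object from its offset in the data segment. The constants are those defined earlier in this chapter. The function names data_base and relocated are hypothetical, and the sample value of Tp is arbitrary.

```python
TEXT_VADDR = 0x08048000
TEXT_SIZE = 65536
DATA_VADDR = TEXT_VADDR + TEXT_SIZE
HEADER_SIZE = 116
PAGE_SIZE = 4096


def data_base(tp):
    """Distance of the data segment from 0, given the final text size tp."""
    overlap = (HEADER_SIZE + tp) % PAGE_SIZE
    return DATA_VADDR + overlap


def relocated(offset, tp):
    """Run-time address of the data object at the given segment offset."""
    return offset + data_base(tp)


tp = 1000   # sample size of the emitted text segment in bytes
print(hex(data_base(tp)), hex(relocated(8, tp)))
```

When header and text together fill a whole number of pages, the overlap is zero and the data segment starts exactly at DATA_VADDR.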
[Figure 11: Mapping an ELF File to Memory. Labels: ELF Header, Text
Segment, Data Segment, overlap]
b := b+2;
end
lewrite(x) do
    hdwrite(x & 255);
    hdwrite(x>>8 & 255);
    hdwrite(x>>16 & 255);
    hdwrite(x>>24 & 255);
end
The elfheader() function writes the complete ELF header to
the Header vector. See [ELF95] for further details. The FreeBSD
ELF loader probably ignores the physical load address.
elfheader() do
    hexwrite("7f454c46");            ! magic
    hexwrite("01");                  ! 32-bit
    hexwrite("01");                  ! little endian
    hexwrite("01");                  ! header version
    hexwrite("09");                  ! FreeBSD ABI
    hexwrite("0000000000000000");    ! padding
    hexwrite("0200");                ! executable
    hexwrite("0300");                ! 386
    lewrite(1);                      ! version
    lewrite(TEXT_VADDR+HEADER_SIZE); ! initial entry point
    lewrite(52);                     ! program header offset
    lewrite(0);                      ! no header segments
    lewrite(0);                      ! flags
    hexwrite("3400");                ! header size
    hexwrite("2000");                ! program header size
    hexwrite("0200");                ! number of prog headers
    hexwrite("2800");                ! segment hdr size (unused)
    hexwrite("0000");                ! number of segment headers
    hexwrite("0000");                ! string index (unused)
    ! text segment description
    lewrite(1);                      ! loadable segment
    lewrite(HEADER_SIZE);            ! offset in file
    lewrite(TEXT_VADDR);             ! virtual load address
Scanner
The scanner is the part of the compiler that reads the input
program and splits it up into small units called tokens. A token
consists of multiple elements: a small integer identifying the token
(in this text called the token ID), and a set of values that describe
the token in more detail.
Here is a small sample program:
fac(x) RETURN x<1-> 1: x*fac(x-1);
Its tokenized form is shown in figure 12.
Each time the scanner is called, it returns the token ID of the next
token in the source program and fills in the string, value, and
operator ID values, as outlined in figure 12.
var Prog::PROG_SIZE;
When the scanner extracts a token from the source code, it will fill
in the following values:
• T - the token value itself
• Str - the literal text of the token (or the value of a string literal)
• Val - the values of integers and characters
• Oid - the operator IDs of operators
(These are exactly the variables used in the tokenized program
example in figure 12.)
var T;
var Str::TOKEN_LEN;
var Val;
var Oid;
Here are the operator IDs of some frequently-used operators.
var Equal_op, Minus_op, Mul_op, Add_op;
The OPER structure describes a T3X operator. It contains the
following fields:
• OPREC - the operator precedence
• OLEN - the length of the operator symbol
• ONAME - the operator symbol
• OTOK - the token ID generated for the operator
• OCODE - the machine code associated with the operator
The Ops variable will be assigned to the operator table in the
initialization part of the compiler (see page 101).
struct OPER = OPREC, OLEN, ONAME, OTOK, OCODE;
var Ops;
The TOKENS structure contains all token IDs that are used by the
compiler. Note that the scanner returns BINOP for all binary
operators and UNOP for all unary operators. In addition it sets the
Oid variable to the index of the operator in the operator table.
readc() do var c;
    c := readrc();
    return 'A' <= c /\ c <= 'Z'-> c-'A'+'a': c;
end
Readec() reads an extended character, i.e. a (raw) character or
an escape sequence. Escape sequences are used to include
otherwise unrepresentable characters in string literals or character
constants. For instance, \n is used to include a newline character,
and \q is used to include a double quote character. See figure 49
on page 131 for a full list of escape sequences.
readec() do var c;
    c := readrc();
    if (c \= '\\') return c;
    c := readrc();
    if (c = 'a') return '\a';
    if (c = 'b') return '\b';
    if (c = 'e') return '\e';
    if (c = 'f') return '\f';
    if (c = 'n') return '\n';
    if (c = 'q') return '"' | META;
    if (c = 'r') return '\r';
    if (c = 's') return '\s';
    if (c = 't') return '\t';
    if (c = 'v') return '\v';
    return c;
end
This function backs up to the previous character in the buffer.
reject() Pp := Pp-1;
The skip() function is the first step when processing the input
program. It reads the program character by character and skips
over white space and comments. It returns the first character that
is neither a space character nor contained in a comment.
Remember: comments start with an exclamation point (!) and
extend up to (and including) the end of the line.
The function treats \r as a space character, so it can also process
input programs in DOS-format text files, where \r\n is used to
separate lines.
skip() do var c;
    c := readc();
    while (1) do
        while ( c = ' ' \/ c = '\t' \/
                c = '\n' \/ c = '\r' )
        do
            if (c = '\n') Line := Line+1;
            c := readc();
        end
        if (c \= '!')
            return c;
        while (c \= '\n' /\ c \= ENDFILE)
            c := readc();
    end
end
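For readers who would like to try the skipping logic outside of T3X, here is a rough Python equivalent. It works on a string and an index instead of a global buffer pointer, and it returns the position of the next significant character together with the updated line number. The signature is an assumption made for this sketch and not part of the compiler.

```python
def skip(src, pos, line):
    """Skip white space and '!' comments starting at src[pos].

    Returns (pos, line), where pos is the index of the first character
    that is neither white space nor part of a comment, or len(src) at
    the end of the input.
    """
    while pos < len(src):
        c = src[pos]
        if c in " \t\r\n":
            if c == "\n":
                line += 1          # count lines for error messages
            pos += 1
        elif c == "!":
            # A comment extends up to the end of the line; the newline
            # itself is consumed (and counted) by the branch above.
            while pos < len(src) and src[pos] != "\n":
                pos += 1
        else:
            break
    return pos, line


src = "  ! note\r\n\tx := 1;"
pos, line = skip(src, 0, 1)
print(src[pos], line)
```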
Findkw() tests whether the text in the argument s is a keyword of
the T3X language. If it is, it will return the token ID of the keyword
and otherwise it will return 0.
The function first tests the first character of s in order to minimize
the number of string comparisons. Hence it performs an average
of 1 string comparison instead of an average of 8 (there are 16
keywords).
Non-optimizing compilers spend a significant part of their time in
the scanning phase, so this optimization makes a lot of sense.
Note that mod is an operator that looks like a keyword. This will
cause a special case later in the code (see page 53).
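The effect of dispatching on the first character can be made visible by counting string comparisons. The Python sketch below does that. The keyword list is an assumption: it is pieced together from the keywords that appear in this text and is meant to match the 16 keywords mentioned above, but it is not taken from the compiler source. The names BY_FIRST and find_keyword are hypothetical.

```python
# Assumed list of the 16 keywords (including the operator "mod").
KEYWORDS = [
    "const", "decl", "do", "else", "end", "for", "halt", "ie",
    "if", "leave", "loop", "mod", "return", "struct", "var", "while",
]

# Group the keywords by their first character.
BY_FIRST = {}
for kw in KEYWORDS:
    BY_FIRST.setdefault(kw[0], []).append(kw)


def find_keyword(s):
    """Return (keyword or None, number of string comparisons made)."""
    comparisons = 0
    for kw in BY_FIRST.get(s[:1], []):
        comparisons += 1
        if kw == s:
            return kw, comparisons
    return None, comparisons


for word in ("end", "while", "x"):
    print(word, find_keyword(word))
```

A name whose first character starts no keyword is rejected without any string comparison at all.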
findkw(s) do
    if (s::0 = 'c') do
        if (str.equal(s, "const")) return KCONST;
        return 0;
    end
    if (s::0 = 'd') do
        if (str.equal(s, "do")) return KDO;
        if (str.equal(s, "decl")) return KDECL;
        return 0;
    end
    if (s::0 = 'e') do
        if (str.equal(s, "else")) return KELSE;
        if (str.equal(s, "end")) return KEND;
        return 0;
    end
    if (s::0 = 'f') do
        if (str.equal(s, "for")) return KFOR;
        return 0;
    end
    if (s::0 = 'h') do
j := j+1;
Str::j := 0;
aw("unknown operator", Str);
end
Str::j := 0;
reject();
return Ops[Oid][OTOK];
end
Findop() locates the operator of the given name in the Ops
table. Because this function does not process program input, only
valid operator symbols are passed to it, and the case of an
unknown operator should not happen.
findop(s) do var i;
    i := 0;
    while (Ops[i][OLEN] > 0) do
        if (str.equal(s, Ops[i][ONAME])) do
            Oid := i;
            return Oid;
        end
        i := i+1;
    end
    oops("operator not found", s);
end
The symbolic() function tests whether the character c is a
valid character for starting a symbol name. Subsequent characters
of a symbol name may also be numeric.
symbolic(c)
    return alphabetic(c) \/ c = '_' \/ c = '.';
Scan() is the principal input function of the compiler. Each time it
is called it extracts a token from the program buffer and returns its
token ID. In addition it returns:
• the value of integer and character literals in Val
• the value of string literals in Str
• the textual representation of other tokens in Str
• the offsets of operators in the Ops table in Oid
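As an illustration of how a literal value ends up in Val, here is a Python sketch of scanning an integer, including the % prefix that denotes a negative number. It returns the value and the position after the literal instead of setting global variables. The function name and signature are assumptions made for this sketch.

```python
def scan_integer(src, pos):
    """Scan an integer literal at src[pos]; '%' marks a negative number.

    Returns (value, position after the literal).
    """
    def numeric(c):
        return "0" <= c <= "9"

    sign = 1
    if src[pos] == "%":
        sign = -1
        pos += 1
        if pos >= len(src) or not numeric(src[pos]):
            raise SyntaxError("missing digits after '%'")
    value = 0
    while pos < len(src) and numeric(src[pos]):
        value = value * 10 + ord(src[pos]) - ord("0")
        pos += 1
    return sign * value, pos


print(scan_integer("%1;", 0))
print(scan_integer("42+x", 0))
```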
    if (c = ENDFILE) do
        str.copy(Str, "end of file");
        return ENDFILE;
    end
    if (symbolic(c)) do
        i := 0;
        while (symbolic(c) \/ numeric(c)) do
            if (i >= TOKEN_LEN-1) do
                Str::i := 0;
                aw("symbol too long", Str);
            end
            Str::i := c;
            i := i+1;
            c := readc();
        end
        Str::i := 0;
        reject();
        k := findkw(Str);
        if (k \= 0) do
            if (k = BINOP) findop(Str);
            return k;
        end
        return SYMBOL;
    end
    if (numeric(c) \/ c = '%') do
        sgn := 1;
        i := 0;
        if (c = '%') do
            sgn := %1;
            c := readc();
            Str::i := c;
            i := i+1;
            if (\numeric(c))
                aw("missing digits after '%'",0);
        end
        Val := 0;
        while (numeric(c)) do
            if (i >= TOKEN_LEN-1) do
                Str::i := 0;
                aw("integer too long", Str);
            end
            Str::i := c;
            i := i+1;
            Val := Val * 10 + c - '0';
            c := readc();
        end
        Str::i := 0;
        reject();
        Val := Val * sgn;
        return INTEGER;
    end
    if (c = '\'') do
        Val := readec();
        if (readc() \= '\'')
            aw("missing ''' in character", 0);
        return INTEGER;
    end
    if (c = '"') do
        i := 0;
        c := readec();
        while (c \= '"' /\ c \= ENDFILE) do
            if (i >= TOKEN_LEN-1) do
                Str::i := 0;
                aw("string too long", Str);
            end
            Str::i := c & (META-1);
            i := i+1;
            c := readec();
        end
        Str::i := 0;
        return STRING;
    end
    return scanop(c);
end
Parser
The parser is the main component of this compiler. It reads tokens
through the scanner and emits machine code through the code
generator. The structure of the input program directs the control
flow through the parser. This is why this approach is called syntax-
directed translation.
The parser analyzes the structure of the token stream
representing the input program. It makes sure that all sentences
of the input program are syntactically correct and reports errors
otherwise. A sentence is a self-contained unit of a program, such
as an expression, a statement, a declaration, etc. These units will
be covered in more detail in this section.
The process of parsing is similar to the process of scanning,
because the next input unit controls how the subsequent units are
processed. For instance, when the scanner finds a double quote
character, it will invoke a routine that scans a string literal. When
the parser finds an IF keyword, it will invoke a routine that parses
an IF statement.
There is, however, a difference in the level of abstraction. A
scanner deals with simple, linear structures, while the parser
handles recursive tree structures. For example, a string may not
contain a string, but an IF statement may contain another IF
statement, e.g.:
IF (a) IF (b) IF (c) DO END
Parser Prelude
The MAXTBL constant specifies the maximum size of a table (see
page 73). MAXLOOP is the maximum number of nested loops
and/or LEAVE and LOOP keywords per loop (see below and page
92 and following).
const MAXTBL = 128;
const MAXLOOP = 100;
xsemi() do
    expect(SEMI, "';'");
    T := scan();
end
xlparen() do
    expect(LPAREN, "'('");
    T := scan();
end
xrparen() do
    expect(RPAREN, "')'");
    T := scan();
end
Xsymbol() expects a SYMBOL token, but does not consume it in
case of success, because doing so would overwrite its Str
attribute.
xsymbol() expect(SYMBOL, "symbol");
In general, the parser presented in this chapter will be a recursive
descent parser (RDP), where each type of sentence is handled by
one or multiple functions.
Whenever a function is called, the type of sentence to analyze is
already known. For instance, the function analyzing an IF
statement would only be called when an IF keyword was found in
a statement context.
Because declarations may contain statements and statements
may contain expressions, a parser function analyzing a declaration
could eventually descend into the functions analyzing statements,
which could, in turn, descend into the functions analyzing
Declaration Parser
The constfac() and constval() functions parse constant
values, also called cvalues.
Constfac() expects a constant factor in the form of an integer or
a symbol that has been defined by a CONST declaration. Of course,
the ‘‘defined by CONST’’ part is impossible to assert on a syntactic
level, because constants and variables are both SYMBOLs. This is
why constfac() checks the attributes of the symbol by looking
at the symbol table.
Constfac() returns the value of a constant factor.
constfac() do var v, y;
    if (T = INTEGER) do
        v := Val;
        T := scan();
        return v;
    end
    if (T = SYMBOL) do
        y := lookup(Str, CNST);
        T := scan();
        return y[SVALUE];
    end
    aw("constant value expected", Str);
end
Constval() is the principal constant expression parser. It
accepts a constant factor c1 (by descending into constfac())
and when a + or * operator follows, it consumes it and expects
constval → constfac
| constfac * constfac
| constfac + constfac
Figure 13: Constant Value Syntax Rules
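The rules of figure 13 translate almost literally into two functions, one per rule. The following Python sketch (not T3X) works on a list of tokens, where integers stand for INTEGER tokens and strings stand for symbols and operators. The token representation, the constants dictionary that replaces the symbol table, and the function signatures are all assumptions made for this illustration.

```python
def constfac(tokens, pos, constants):
    """Parse a constant factor; return (value, next position)."""
    if pos < len(tokens):
        tok = tokens[pos]
        if isinstance(tok, int):
            return tok, pos + 1
        if tok in constants:                 # symbol defined by CONST
            return constants[tok], pos + 1
    raise SyntaxError("constant value expected")


def constval(tokens, pos, constants):
    """Parse: constfac | constfac * constfac | constfac + constfac."""
    value, pos = constfac(tokens, pos, constants)
    if pos < len(tokens) and tokens[pos] in ("*", "+"):
        op = tokens[pos]
        rhs, pos = constfac(tokens, pos + 1, constants)
        value = value * rhs if op == "*" else value + rhs
    return value, pos


constants = {"SYM": 3}
print(constval(["SYM", "*", 4], 0, constants))
print(constval([2, "+", 3, "*", 4], 0, constants))
```

Because the grammar allows at most one operator, the second call stops after 2 + 3 and leaves the rest of the input to the caller.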
vars → SYMBOL
| SYMBOL , vars
| SYMBOL subscript
| SYMBOL subscript , vars
subscript → [ constval ]
| :: constval
Figure 14: Variable Declaration Syntax Rules
[Figure: Vector, Anonymous Vector, Variable p, Variable v, Function,
Argument x]
BTW, the allocation of vectors on the stack is the reason why the
generated code jumps over functions. See figure 16. The parts of
the program that will be run before the main program is entered
are shown in gray.
[Figure 16: program layout from "Beginning of Program" to "Main Body"]
size := constval();
if (size < 1)
aw("invalid size", 0);
y[SFLAGS] := y[SFLAGS] | VECT;
expect(RBRACK, "']'");
T := scan();
end
else if (T = BYTEOP) do
T := scan();
size := constval();
if (size < 1)
aw("invalid size", 0);
size := (size + BPW-1) / BPW;
y[SFLAGS] := y[SFLAGS] | VECT;
end
ie (glob & GLOBF) do
if (y[SFLAGS] & VECT) do
gen(CG_ALLOC, size*BPW);
gen(CG_GLOBVEC, Dp);
end
dataw(0);
end
else do
gen(CG_ALLOC, size*BPW);
Lp := Lp - size*BPW;
if (y[SFLAGS] & VECT) do
gen(CG_LOCLVEC, 0);
Lp := Lp - BPW;
end
y[SVALUE] := Lp;
end
if (T \= COMMA) leave;
T := scan();
end
xsemi();
end
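The byte-vector branch in the code above converts a size given in bytes into a size in machine words by rounding up. A small Python sketch of that arithmetic follows; the function name is made up for this illustration, and BPW is assumed to be 4, as on the 386.

```python
BPW = 4  # bytes per machine word (assumed: 32-bit target)

def words_for_bytes(nbytes):
    """Round a byte count up to whole words: (size + BPW-1) / BPW."""
    if nbytes < 1:
        raise ValueError("invalid size")
    return (nbytes + BPW - 1) // BPW
```

So a five-byte vector occupies two words, which is why the allocation that follows always uses size*BPW bytes.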
The syntax rules of constant declaration are simple, see figure 17.
members → SYMBOL
| SYMBOL , members
Figure 18: Constant Declaration Syntax Rules
stcdecl(glob) do var y, i;
T := scan();
xsymbol();
y := add(Str, glob|CNST, 0);
T := scan();
xeqsign();
i := 0;
while (1) do
xsymbol();
add(Str, glob|CNST, i);
i := i+1;
T := scan();
if (T \= COMMA) leave;
T := scan();
end
y[SVALUE] := i;
xsemi();
end
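What stcdecl() leaves behind in the symbol table can be pictured with a short Python model. The dictionary and the function name are assumptions for the sake of the example; the real compiler stores the values in symbol table entries.

```python
def declare_struct(name, members):
    """Return the constants a structure declaration would define."""
    table = {}
    for offset, member in enumerate(members):
        table[member] = offset      # members are numbered 0, 1, 2, ...
    table[name] = len(members)      # the structure name holds the size
    return table
```

A structure named POINT with the members PX and PY would therefore yield PX = 0, PY = 1, and POINT = 2.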
A forward declaration is similar to a constant declaration, but sets
the values of its symbols to zero and expects cvalues in
parentheses after each symbol (see figure 19). It stores these
cvalues in the flags fields of the symbol table entries; they form the
types/arities of the declared functions.
fwddecl() do var y, n;
T := scan();
while (1) do
xsymbol();
y := add(Str, GLOBF|FORW, 0);
T := scan();
xlparen();
n := constval();
resolve_fwd() does.
The loc argument holds the address of the last generated forward
call and fn is the address of the actual function. Resolve_fwd()
traverses the linked list generated by calling the forward-declared
function and replaces each link by the displacement to the function
address.
resolve_fwd(loc, fn) do var nloc;
while (loc \= 0) do
nloc := tfetch(loc);
tpatch(loc, fn-loc-BPW);
loc := nloc;
end
end
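The list traversal is easy to try out in Python. In the sketch below the text segment is modelled as a dictionary from addresses to words, which is an assumption of this example; tfetch() and tpatch() are replaced by plain dictionary access.

```python
BPW = 4  # bytes per machine word (assumed)

def resolve_fwd(text, loc, fn):
    """Replace each link in the chain starting at loc by a displacement."""
    while loc != 0:
        nloc = text[loc]             # link to the previous call site
        text[loc] = fn - loc - BPW   # relative to the end of the operand
        loc = nloc

# Three forward calls whose operands live at addresses 16, 40, and 72.
# Each operand points at the previous one; 0 terminates the list.
text = {16: 0, 40: 16, 72: 40}
resolve_fwd(text, 72, 100)
```

After the call, every operand holds the distance from the end of that operand to the function at address 100.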
Fundecl() parses a function declaration and generates code for
the resulting function. It also creates code to jump over the
function, as described previously (page 62).
The function is added to the symbol table as a global symbol of
the type FUNC. Its arguments are added as local variables. Note
that arguments are passed in left-to-right order in T3X, so the
arguments are placed on the stack ‘‘upside down’’, i.e. with the
first argument buried deepest.
Argument   Allocated Address   Actual Address
1          F+8                 F+(4⋅n + 4)
2          F+12                F+(4⋅(n − 1) + 4)
...        ...                 ...
n          F+(4⋅n + 4)         F+8
Figure 21: Fixing Argument Addresses
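The mapping in figure 21 corresponds to the expression 12+na*BPW - Syms[i+SVALUE] used in fundecl(). A quick Python check of that formula, with a hypothetical helper name and BPW assumed to be 4:

```python
BPW = 4  # bytes per machine word (assumed)

def fix_arg_offsets(na):
    """Return the actual frame offsets of arguments 1..na."""
    allocated = [2 * BPW + i * BPW for i in range(na)]   # 8, 12, ...
    return [12 + na * BPW - a for a in allocated]
```

With three arguments the allocated offsets 8, 12, 16 become 16, 12, 8, so the first argument ends up deepest in the frame.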
create a new symbol, but removes the FORW flag from the existing
symbol table entry and sets the FUNC flag instead.
Finally, fundecl() generates code to create a function context,
parses the statement (body) of the function, and emits code to
delete the context and return a value of 0.
Before exiting, it also releases the symbol table slots and name list
space allocated by function arguments. It does so by memorizing
the Yp and Np pointers at the beginning and resetting them at the
end. It also sets Lp to zero, because no compound statements
may exist outside of a function (except for the main program, but
after parsing that, the state of Lp does not matter any longer).
decl compound(0), stmt(0);
fundecl() do
var l_base, l_addr;
var i, na, oyp, onp;
var y;
l_addr := 2*BPW;
na := 0;
gen(CG_JUMPFWD, 0);
y := add(Str, GLOBF|FUNC, Tp);
T := scan();
xlparen();
oyp := Yp;
onp := Np;
l_base := Yp;
while (T = SYMBOL) do
add(Str, 0, l_addr);
l_addr := l_addr + BPW;
na := na+1;
T := scan();
if (T \= COMMA) leave;
T := scan();
end
for (i = l_base, Yp, SYM) do
Syms[i+SVALUE] :=
12+na*BPW - Syms[i+SVALUE];
end
if (y[SFLAGS] & FORW) do
resolve_fwd(y[SVALUE], Tp);
if (na \= y[SFLAGS] >> 8)
aw("function does not match DECL",
y[SNAME]);
y[SFLAGS] := y[SFLAGS] & ~FORW;
y[SFLAGS] := y[SFLAGS] | FUNC;
y[SVALUE] := Tp;
end
xrparen();
y[SFLAGS] := y[SFLAGS] | (na << 8);
gen(CG_ENTER, 0);
Fun := 1;
stmt();
Fun := 0;
gen(CG_CLEAR, 0);
gen(CG_EXIT, 0);
gen(CG_RESOLV, 0);
Yp := oyp;
Np := onp;
Lp := 0;
end
The declaration() function delegates handling of all kinds of
declarations to the above functions. Note that it does not pass the
glob argument on to fwddecl() or fundecl(), because these
declarations can only appear in the global context. Therefore
declaration() is never called with glob ≠ GLOBF while the
current token indicates a forward or function declaration.
declaration(glob)
ie (T = KVAR)
vardecl(glob);
else ie (T = KCONST)
constdecl(glob);
else ie (T = KSTRUCT)
stcdecl(glob);
else ie (T = KDECL)
fwddecl();
else
fundecl();
Expression Parser
An expression is anything that has a value. A value may be a
numeric value in case of an integer, or the result of applying an
operator to integers, or it may be an address, like the location of a
string or a vector in memory.
The difference between an expression and a cvalue is that the
value of an expression is not known at compile time, but will be
computed only when the program containing the expression
executes.
This section generates code that computes values at run time.
Note that expressions are inherently recursive structures. For
instance, expressions may contain function calls and function
calls may contain expressions (as actual arguments).
arguments → expr
| expr , arguments
Figure 22: Function Call Syntax Rules
fncall(fn) do var i;
T := scan();
if (fn = 0) aw("call of non-function", 0);
i := 0;
while (T \= RPAREN) do
expr(0);
i := i+1;
if (T \= COMMA) leave;
T := scan();
if (T = RPAREN)
aw("syntax error", Str);
end
if (i \= fn[SFLAGS] >> 8)
aw("wrong number of arguments",
fn[SNAME]);
expect(RPAREN, "')'");
T := scan();
if (active())
spill();
ie (fn[SFLAGS] & FORW) do
gen(CG_CALL, fn[SVALUE]);
fn[SVALUE] := Tp-BPW;
end
else do
gen(CG_CALL, fn[SVALUE]-Tp-5); ! TP-BPW+1
end
if (i \= 0) gen(CG_DEALLOC, i*BPW);
activate();
end
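The operand fn[SVALUE]-Tp-5 in the non-forward case follows from the way a near call is encoded on the 386: one opcode byte (e8) followed by a four-byte displacement that is relative to the next instruction. The Python sketch below reproduces that computation; the function names are invented for this example.

```python
import struct

CALL_LEN = 5  # one opcode byte (0xe8) plus a four-byte displacement

def call_displacement(target, tp):
    """Displacement for a call emitted at address tp that reaches target."""
    return target - tp - CALL_LEN

def encode_call(target, tp):
    """Return the five bytes of the call instruction (little endian)."""
    return b"\xe8" + struct.pack("<i", call_displacement(target, tp))
```

A call to a function located before the call site yields a negative displacement, which is the usual case for functions that are already defined.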
table → [ members ]
member → constval
| STRING
| table
| ( exprs )
exprs → expr
| expr , exprs
Figure 23: Table Syntax Rules
[Figure: memory layout of a table holding [3,1,4,1,5,9,2,7], "Hello",
"World", 1, and 2]
T := scan();
dynamic := 0;
n := 0;
while (T \= RBRACK) do
if (n >= MAXTBL)
aw("table too big", 0);
ie (T = LPAREN /\ \dynamic) do
T := scan();
dynamic := 1;
loop;
end
else ie (dynamic) do
expr(1);
gen(CG_STGLOB, 0);
tbl[n] := 0;
af[n] := Tp-BPW;
n := n+1;
if (T = RPAREN) do
T := scan();
dynamic := 0;
end
end
else ie (T = INTEGER \/ T = SYMBOL) do
tbl[n] := constval();
af[n] := 0;
n := n+1;
end
else ie (T = STRING) do
tbl[n] := mkstring(Str);
af[n] := 1;
n := n+1;
T := scan();
end
else ie (T = LBRACK) do
tbl[n] := mktable();
af[n] := 1;
n := n+1;
end
else do
aw("invalid table element", Str);
end
if (T \= COMMA) leave;
T := scan();
if (T = RBRACK)
aw("syntax error", Str);
end
if (dynamic)
aw("missing ')' in dynamic table", 0);
expect(RBRACK, "']'");
if (n = 0) aw("empty table", 0);
T := scan();
a := Dp;
for (i=0, n) do
dataw(tbl[i]);
ie (af[i] = 1) do
tag('d');
end
else if (af[i] > 1) do
tpatch(af[i], Dp-4);
end
end
return a;
end
Load() and store() are shortcuts for loading values into the
accumulator and storing values from the accumulator, no matter if
the given symbol denotes a local or global variable.
load(y)
ie (y[SFLAGS] & GLOBF)
gen(CG_LDGLOB, y[SVALUE]);
else
gen(CG_LDLOCL, y[SVALUE]);
store(y)
address → SYMBOL
| SYMBOL subscripts
subscripts → [ expr ]
| [ expr ] subscripts
| :: factor
T := scan();
bp[0] := 1;
factor();
y := 0;
gen(CG_INDXB, 0);
if (lv = 0) gen(CG_DREFB, 0);
end
return y;
end
A factor is the smallest component of an expression. A single
factor is a complete expression, but most expressions are
comprised of multiple factors combined by operators. The syntax
rules for factors are listed in figure 26.
factor → address
| SYMBOL ()
| INTEGER
| STRING
| table
| @ address
| - factor
| UNOP factor
| ( expr )
Figure 26: Factor Syntax Rules
end
else ie (T = SYMBOL) do
y := address(0, @b);
if (T = LPAREN) fncall(y);
end
else ie (T = STRING) do
spill();
gen(CG_LDADDR, mkstring(Str));
T := scan();
end
else ie (T = LBRACK) do
spill();
gen(CG_LDADDR, mktable());
end
else ie (T = ADDROF) do
T := scan();
y := address(2, @b);
ie (y = 0) do
;
end
else ie (y[SFLAGS] & GLOBF) do
spill();
gen(CG_LDADDR, y[SVALUE]);
end
else do
spill();
gen(CG_LDLREF, y[SVALUE]);
end
end
else ie (T = BINOP) do
if (Oid \= Minus_op)
aw("syntax error", Str);
T := scan();
factor();
gen(CG_NEG, 0);
end
else ie (T = UNOP) do
op := Oid;
T := scan();
factor();
gen(Ops[op][OCODE], 0);
end
else ie (T = LPAREN) do
T := scan();
expr(0);
xrparen();
end
else do
aw("syntax error", Str);
end
end
The syntax rules for all arithmetic binary operators (BINOPs) are
simple. They are shown in figure 27.
arith → factor
| factor BINOP arith
Figure 27: Arithmetic Operation Syntax Rules
Note that these rules are, strictly speaking, wrong, because they
do not consider precedence or associativity.
Associativity is the direction in which equal operations are
grouped. For example, the ‘‘minus’’ operation groups to the left in
mathematics, so a − b − c equals (a − b) − c. The ‘‘power’’ (^)
operation, however, associates to the right, so a^b^c equals
a^(b^c). All T3X operators except for :: associate to the left.
The byte operator :: is a subscript delivering a byte value, so it
has to associate to the right: v::x::y = v::(x::y), because
(v::x)::y would create a reference to an invalid (byte-sized)
address.
Precedence specifies which operations will be computed first in
expressions involving different operators. An example for
precedence would be the ‘‘multiplication before addition’’ rule from
mathematics: a + b ⋅ c + d equals a + (b ⋅ c) + d .
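The effect of precedence and left-associativity can be made visible with a small Python parser that returns a fully parenthesized string. Note that it uses precedence climbing, which is a common textbook technique and not necessarily the algorithm of the compiler's own arithmetic parser; the token format and all names are assumptions of this sketch. The precedence values follow the T3X operator table.

```python
PREC = {"*": 7, "/": 7, "+": 6, "-": 6}   # higher binds more tightly

def parse(tokens):
    """Return a fully parenthesized string for a flat token list."""
    pos = 0

    def arith(min_prec):
        nonlocal pos
        left = tokens[pos]          # a factor: here just a name or number
        pos += 1
        while pos < len(tokens) and PREC[tokens[pos]] >= min_prec:
            op = tokens[pos]
            pos += 1
            # Left associativity: the right operand may only contain
            # operators that bind more tightly than op.
            right = arith(PREC[op] + 1)
            left = "(%s%s%s)" % (left, op, right)
        return left

    return arith(0)
```

Parsing a - b - c groups the first two operands first, and in a + b * c + d the product is grouped before either sum.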
end
Conjn() and disjn() are essentially the same procedure,
which will be explained here by the example of conjn(). The
syntax rules implemented by these functions are given in figure
29.
conjn → arith
| arith /\ conjn
disjn → conjn
| conjn \/ disjn
Figure 29: Conjunction and Disjunction Syntax Rules
[Figure: control flow of a /\ b and a /\ b /\ c, with branches for
a=0 and a≠0]
disjn() do var n;
conjn();
n := 0;
while (T = DISJ) do
T := scan();
gen(CG_JMPTRUE, 0);
clear();
conjn();
n := n+1;
end
while (n > 0) do
gen(CG_RESOLV, 0);
n := n-1;
end
end
The conditional operator of T3X is the only ternary operator of the
language. It is a generalization of the short-circuit logical operators
described above:
a /\ b = \a -> 0 : b
a \/ b = a -> a : b
The operator may also be thought of as an in-expression if-then-else
construct: in a->b:c, the result is b if a is true and c otherwise. The
syntax rules of the conditional operator are specified in figure 32.
expr → disjn
| disjn -> expr : expr
Figure 32: Conditional Operator Syntax Rules
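The two equivalences between the short-circuit operators and the conditional operator can be checked with a Python model in which the operands that must not be evaluated eagerly are passed as zero-argument functions. The names cond, conj, and disj are made up for this sketch.

```python
def cond(a, b, c):
    # a -> b : c, where b and c are zero-argument functions (thunks)
    return b() if a != 0 else c()

def conj(a, b):
    # a /\ b  =  \a -> 0 : b
    not_a = 0 if a != 0 else -1     # \a yields %1 (that is, -1) or 0
    return cond(not_a, lambda: 0, b)

def disj(a, b):
    # a \/ b  =  a -> a : b
    return cond(a, lambda: a, b)
```

Because the second operand is a thunk, conj(0, ...) and disj(5, ...) never evaluate it, which is exactly the short-circuit behavior.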
stack for backpatching. It then swaps the top stack entries before
resolving the jumps, leading to the overlapping branches in figure
33.
[Figure 33: control flow of a -> b : c, with branches for a=0 and a≠0]
Statement Parser
The statement is the fundamental building block of a procedural
program. While an expression has a value, a statement does
something. For instance, a RETURN statement ends interpretation
of a function and returns a value, an assignment changes the
value of a variable or vector element, and a loop repeats a part of
a program.
Many statements operate on other statements. For instance, the
IF statement runs another statement conditionally and WHILE
repeats another statement. So statements, like expressions, are
inherently recursive structures.
return_stmt → RETURN ;
| RETURN expr ;
Figure 35: RETURN Statement Syntax Rules
[Figure: control flow of IF (expr) stmt, with branches for expr = 0
and expr ≠ 0]
[Figure: control flow of IE (e) stmt ELSE stmt, with branches for e=0
and e≠0]
if_stmt(alt) do
T := scan();
xlparen();
expr(1);
gen(CG_JMPFALSE, 0);
xrparen();
stmt();
if (alt) do
gen(CG_JUMPFWD, 0);
swap();
gen(CG_RESOLV, 0);
expect(KELSE, "ELSE");
T := scan();
stmt();
end
gen(CG_RESOLV, 0);
end
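The interplay of forward jumps, swap(), and CG_RESOLV in if_stmt() can be followed in a Python model. Here instructions are list entries and jump targets are list indices; this representation, like the helper names, is an assumption of the sketch and much simpler than the real code generator.

```python
code = []     # emitted instructions; jump operands are patched later
stack = []    # indices of jumps still waiting for a target

def emit(op):
    code.append([op, None])

def jump_fwd(op):
    stack.append(len(code))       # remember where the jump lives
    code.append([op, None])       # target unknown for now

def swap():
    stack[-1], stack[-2] = stack[-2], stack[-1]

def resolve():
    code[stack.pop()][1] = len(code)   # land on the next instruction

def if_stmt(alt):
    emit("test")                  # stands for the condition code
    jump_fwd("JMPFALSE")
    emit("then")
    if alt:
        jump_fwd("JUMPFWD")
        swap()
        resolve()                 # JMPFALSE lands on the ELSE branch
        emit("else")
    resolve()                     # remaining jump lands after the statement

if_stmt(True)
```

Without the swap, the first resolve would patch the unconditional jump instead of the conditional one, and the ELSE branch would be unreachable.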
The WHILE loop, as depicted in figure 38, has a structure similar to
the IE statement. It also uses a jump around a jump, but the second
jump, after the statement, goes back to the beginning of the loop.
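[Figure 38: control flow of WHILE (expr) stmt, with branches for
expr = 0 and expr ≠ 0 and the LOOP and LEAVE targets]
[Figure 41: control flow of FOR (v=x1, x2, c) stmt, with the increment
v := v + c and the LOOP and LEAVE targets]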
Therefore, the head of the FOR construct (in gray) has been
spread around the statement in figure 41. After running the
statement, the variable v is incremented and then the loop is
repeated by jumping back to the test part.
T := scan();
oll := Llp;
olv := Lvp;
olp := Loop0;
Loop0 := 0;
xlparen();
xsymbol();
y := lookup(Str, 0);
if (y[SFLAGS] & (CNST|FUNC|FORW))
aw("unexpected type", y[SNAME]);
T := scan();
xeqsign();
expr(1);
store(y);
expect(COMMA, "','");
T := scan();
gen(CG_MARK, 0);
test := tos();
load(y);
expr(0);
ie (T = COMMA) do
T := scan();
step := constval();
end
else do
step := 1;
end
gen(step<0-> CG_FORDOWN: CG_FOR, 0);
xrparen();
stmt();
while (Llp > oll) do
push(Loops[Llp-1]);
gen(CG_RESOLV, 0);
Llp := Llp-1;
end
ie (y[SFLAGS] & GLOBF)
gen(CG_INCGLOB, y[SVALUE]);
else
gen(CG_INCLOCL, y[SVALUE]);
gen(CG_WORD, step);
swap();
gen(CG_JUMPBACK, 0);
gen(CG_RESOLV, 0);
while (Lvp > olv) do
push(Leaves[Lvp-1]);
gen(CG_RESOLV, 0);
Lvp := Lvp-1;
end
Loop0 := olp;
end
The LEAVE and LOOP syntax rules are very simple, see figure 42.
They are implemented by the leave_stmt() and loop_stmt()
functions. Both of the functions check whether the corresponding
keyword appears in a loop context.
leave_stmt → LEAVE ;
loop_stmt → LOOP ;
Figure 42: LEAVE and LOOP Syntax Rules
loop_stmt() do
T := scan();
if (Loop0 < 0)
aw("LOOP not in loop context", 0);
xsemi();
ie (Loop0 > 0) do
push(Loop0);
gen(CG_JUMPBACK, 0);
end
else do
if (Llp >= MAXLOOP)
aw("too many LOOPs", 0);
gen(CG_JUMPFWD, 0);
Loops[Llp] := pop();
Llp := Llp+1;
end
end
The asg_or_call() function parses both assignments and
function call statements, because they both begin with a symbol.
The corresponding syntax rules can be seen in figure 43. A
function call statement looks like a function call in an expression
(page 80), but with a terminating semicolon attached.
asg_or_call → assignment
| fncall ;
Figure 43: Assignment and Function Call Syntax Rules
stmt()
ie (T = KFOR)
for_stmt();
else ie (T = KHALT)
halt_stmt();
else ie (T = KIE)
if_stmt(1);
else ie (T = KIF)
if_stmt(0);
else ie (T = KELSE)
aw("ELSE without IE", 0);
else ie (T = KLEAVE)
leave_stmt();
else ie (T = KLOOP)
loop_stmt();
else ie (T = KRETURN)
return_stmt();
else ie (T = KWHILE)
while_stmt();
else ie (T = KDO)
compound();
else ie (T = SYMBOL)
asg_or_call();
else ie (T = SEMI)
T := scan();
else
expect(%1, "statement");
The compound statement or statement block is a set of
statements and optional declarations that is delimited by the DO
and END keywords. Its complete syntax rules can be seen in figure
44. While, formally, both the statements and the declarations are
optional, the case with declarations only is very uncommon,
because it essentially makes the statement a null statement, i.e. a
statement that ‘‘does nothing’’.
Declarations local to compound statements are limited to
variables, constants, and structures; no local forward declarations
or nested functions are allowed.
local_decls → local_decl
| local_decl local_decls
statements → stmt
| stmt statements
compound_stmt → DO END
| DO local_decls END
| DO statements END
| DO local_decls statements END
Figure 44: Compound Statement Syntax Rules
Program Parser
The syntax rules for a complete T3X program are shown in figure
45. The program() function, which parses a full program,
collects global declarations, expects one compound statement,
and also makes sure that no characters are following after the
compound statement forming the main body of the program.
global_decl → vardecl
| constdecl
| stcdecl
| fwddecl
| fundecl
global_decls → global_decl
| global_decl global_decls
program() do var i;
T := scan();
while ( T = KVAR \/ T = KCONST \/
T = SYMBOL \/ T = KDECL \/
T = KSTRUCT
)
declaration(GLOBF);
if (T \= KDO)
aw("DO or declaration expected", 0);
gen(CG_ENTER, 0);
compound();
if (T \= ENDFILE)
aw("trailing characters", Str);
gen(CG_HALT, 0);
for (i=0, Yp, SYM)
if (Syms[i+SFLAGS] & FORW /\ Syms[i+SVALUE])
aw("undefined function",
Syms[i+SNAME]);
end
Initialization
The init() function initializes the global variables of the
compiler, verifies the code table, and generates the built-in
functions.
The Codetbl variable is assigned a vector holding the code
fragments emitted by the compiler. Each fragment is associated
with a constant with a CG_ prefix. These constants have been
defined in a structure earlier in the program (see page 28). They
are included in the table so that fragments are easier to locate, for
making modifications, etc.
The init() function checks the consistency of the table by
making sure that the CG_ constants are in ascending order. This is
necessary, because fragments will be addressed using the CG_
constants. For instance, an LDVAL instruction will be generated by
emitting the fragment
Codetbl[CG_LDVAL][1]
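A consistency check of this kind might look as follows in Python. The sketch assumes that ‘‘ascending order’’ means that entry i of the table carries the constant i, and it uses a made-up three-entry table; the fragments are taken from the real table, but the numbering of the constants here is hypothetical.

```python
CG_NEG, CG_INV, CG_ADD = 0, 1, 2     # hypothetical numbering

codetbl = [
    [CG_NEG, "f7d8"],
    [CG_INV, "f7d0"],
    [CG_ADD, "5b01d8"],
    [-1,     ""],                    # end marker, like [ %1, "" ]
]

def check_table(tbl):
    """Return True if entry i carries the constant i up to the marker."""
    for i, (cg, _fragment) in enumerate(tbl):
        if cg == -1:
            return True
        if cg != i:
            return False
    return False                     # no end marker found
```

Once the check has passed, a fragment can be fetched by using its constant as an index, as in codetbl[CG_ADD][1].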
The local variables tread, twrite, etc. are used to store the machine
code for the built-in functions. Note that some strings in the
init() function have been split across multiple lines. This has
been done for typographical reasons and is not actually possible in
the T3X language; the fragments have to be contained in single
lines or have to be concatenated from smaller strings using
str.append() or a similar function.
[ CG_GLOBVEC, "8925,a" ],
[ CG_INDEX, "c1e0025b01d8" ],
[ CG_DEREF, "8b00" ],
[ CG_INDXB, "5b01d8" ],
[ CG_DREFB, "89c331c08a03" ],
[ CG_MARK, ",m" ],
[ CG_RESOLV, ",r" ],
[ CG_CALL, "e8,w" ],
[ CG_JUMPFWD, "e9,>" ],
[ CG_JUMPBACK, "e9,<" ],
[ CG_JMPFALSE, "09c00f84,>" ],
[ CG_JMPTRUE, "09c00f85,>" ],
[ CG_FOR, "5b39c30f8d,>" ],
[ CG_FORDOWN, "5b39c30f8e,>" ],
[ CG_ENTER, "5589e5" ],
[ CG_EXIT, "5dc3" ],
[ CG_HALT, "68,w5031c040cd80" ],
[ CG_NEG, "f7d8" ],
[ CG_INV, "f7d0" ],
[ CG_LOGNOT, "f7d819c0f7d0" ],
[ CG_ADD, "5b01d8" ],
[ CG_SUB, "89c35829d8" ],
[ CG_MUL, "5bf7eb" ],
[ CG_DIV, "89c35899f7fb" ],
[ CG_MOD, "89c35899f7fb89d0" ],
[ CG_AND, "5b21d8" ],
[ CG_OR, "5b09d8" ],
[ CG_XOR, "5b31d8" ],
[ CG_SHL, "89c158d3e0" ],
[ CG_SHR, "89c158d3e8" ],
[ CG_EQ, "5b39c30f95c20fb6c248" ],
[ CG_NEQ, "5b39c30f94c20fb6c248" ],
[ CG_LT, "5b39c30f9dc20fb6c248" ],
[ CG_GT, "5b39c30f9ec20fb6c248" ],
[ CG_LE, "5b39c30f9fc20fb6c248" ],
[ CG_GE, "5b39c30f9cc20fb6c248" ],
[ CG_WORD, ",w" ],
[ %1, "" ]
];
tread := "8b4424048744240c89442404b803000000
cd800f830300000031c048c3";
twrite := "8b4424048744240c89442404b80400000
0cd800f830300000031c048c3";
tcomp := "8b74240c8b7c24088b4c240441fcf3a609
c90f850300000031c0c38a46ff2a47ff66
9898c3";
tcopy := "8b74240c8b7c24088b4c2404fcf3a431c0c3";
tfill := "8b7c240c8b4424088b4c2404fcf3aa31c0c3";
tscan := "8b7c240c8b4424088b4c24044189fafcf2"
"ae09c90f840600000089f829d048c331c0"
"48c3";
Ops := [
[ 7, 3, "mod", BINOP, CG_MOD ],
[ 6, 1, "+", BINOP, CG_ADD ],
[ 7, 1, "*", BINOP, CG_MUL ],
[ 0, 1, ";", SEMI, 0 ],
[ 0, 1, ",", COMMA, 0 ],
[ 0, 1, "(", LPAREN, 0 ],
[ 0, 1, ")", RPAREN, 0 ],
[ 0, 1, "[", LBRACK, 0 ],
[ 0, 1, "]", RBRACK, 0 ],
[ 3, 1, "=", BINOP, CG_EQ ],
[ 5, 1, "&", BINOP, CG_AND ],
[ 5, 1, "|", BINOP, CG_OR ],
[ 5, 1, "^", BINOP, CG_XOR ],
[ 0, 1, "@", ADDROF, 0 ],
[ 0, 1, "~", UNOP, CG_INV ],
[ 0, 1, ":", COLON, 0 ],
[ 0, 2, "::", BYTEOP, 0 ],
[ 0, 2, ":=", ASSIGN, 0 ],
[ 0, 1, "\\", UNOP, CG_LOGNOT ],
[ 1, 2, "\\/", DISJ, 0 ],
[ 3, 2, "\\=", BINOP, CG_NEQ ],
[ 4, 1, "<", BINOP, CG_LT ],
[ 4, 2, "<=", BINOP, CG_LE ],
[ 5, 2, "<<", BINOP, CG_SHL ],
Main Program
Here comes the main program of the compiler. Note that program,
procedure, and function are often used as synonyms, so the main
program is in this case just a part of a larger program.
The T3X9 compiler main program works as follows: it first
initializes its internal state and emits the built-in functions. Then it
reads the source program, analyzes it, and relocates it. In the final
steps, it creates an ELF header for the resulting binary and writes
The ABI
The T3X9 compiler is in principle capable of generating binaries for
all 386-based operating systems that use the ELF executable file
format. The only parts of the code that have to change are those
that use the FreeBSD Application Binary Interface (ABI).
Operating system services, like reading input or writing output, are
requested in system calls. The exact structure of a system call is
defined by the ABI.
The FreeBSD ABI uses the stack to pass arguments to the
operating system. The number of the system call performing a
specific service is specified in the %eax register and the system
call itself is initiated by triggering software interrupt 128. The
return value of the call is returned in %eax and the carry flag
serves as an error flag. See figure 46 and [BSD14].
[Figure 46: FreeBSD system call. Arguments #1 to #n and the return
address are on the stack, the system call id is passed in %eax, the
result is returned in %eax, and the carry flag is the error flag]
Bootstrapping
[Figure: bootstrapping diagram; labels: Bootstrapping Compiler Source
Code (Language B), Full Compiler Source Code (Language S),
Pre-Existing B-Compiler (Stage 0), S-Compiler (Stage 1), read,
generate]
[Figure: triple test diagram; labels: Pre-existing Compiler, Stage-0
to Stage-3 Compilers, Bootstrapping Compiler Source Code, Full
Compiler Source Code, read, generate, equal stages]
Figure 48: Triple Test (light gray boxes indicate binaries)
Future Projects
Here are some things you can do with the T3X9 compiler once
you have become familiar with its internals. In order of increasing
complexity:
(1) Add more built-in functions from the T3X specification, see
[NMH96] or the home page at http://t3x.org.
(2) Extend the language. Are you missing a switch statement or
nested functions? Go ahead and implement them!
(3) Port the compiler to a different 386-based platform. In the
easiest case, you just have to change the ABI (see page 107), in
other cases you may have to generate a completely different
executable file format.
(4) Look for optimizations. Maybe there are obvious ways to make
the compiler generate better code.
(5) Port the compiler to a different 32-bit CPU. Includes everything
from (3) plus you have to write your own machine code fragments.
(6) Write a minimal bootstrapping compiler or interpreter. The one
coming with the package probably has some features that are not
used in the compiler source code.
(7) Port the compiler to a CPU with a word size other than 32 bits
or to a virtual machine, like the JVM or the Tcode machine (see
[NMH96] or the home page).
Have fun!
Appendix
VSM 386
Instruction Instructions
DREFB mov %eax,%ebx
xor %eax,%eax
mov (%ebx),%al
MARK
RESOLV
CALL w call w
JUMPFWD w jmp w
JUMPBACK w jmp w
JMPFALSE w or %eax,%eax
je w
JMPTRUE w or %eax,%eax
jne w
FOR w pop %ebx
cmp %eax,%ebx
jge w
FORDOWN w pop %ebx
cmp %eax,%ebx
jle w
ENTER push %ebp
mov %esp,%ebp
EXIT pop %ebp
ret
HALT w push $w
push %eax
xor %eax,%eax
inc %eax
int $128
NEG neg %eax
INV not %eax
LOGNOT neg %eax
sbb %eax,%eax
not %eax
VSM 386
Instruction Instructions
ADD pop %ebx
add %ebx,%eax
SUB mov %eax,%ebx
pop %eax
sub %ebx,%eax
MUL pop %ebx
imul %ebx
DIV mov %eax,%ebx
pop %eax
cltd
idiv %ebx
MOD mov %eax,%ebx
pop %eax
cltd
idiv %ebx
mov %edx,%eax
AND pop %ebx
and %ebx,%eax
OR pop %ebx
or %ebx,%eax
XOR pop %ebx
xor %ebx,%eax
SHL mov %eax,%ecx
pop %eax
shl %cl,%eax
SHR mov %eax,%ecx
pop %eax
shr %cl,%eax
EQ pop %ebx
cmp %eax,%ebx
setne %dl
movzbl %dl,%eax
dec %eax
VSM 386
Instruction Instructions
NEQ pop %ebx
cmp %eax,%ebx
sete %dl
movzbl %dl,%eax
dec %eax
LT pop %ebx
cmp %eax,%ebx
setge %dl
movzbl %dl,%eax
dec %eax
GT pop %ebx
cmp %eax,%ebx
setle %dl
movzbl %dl,%eax
dec %eax
LE pop %ebx
cmp %eax,%ebx
setg %dl
movzbl %dl,%eax
dec %eax
GE pop %ebx
cmp %eax,%ebx
setl %dl
movzbl %dl,%eax
dec %eax
T.READ() Function
mov 4(%esp),%eax
xchg %eax,12(%esp)
mov %eax,4(%esp)
mov $3,%eax
int $128
jnc 1
xor %eax,%eax
dec %eax
1:ret
T.WRITE() Function
mov 4(%esp),%eax
xchg %eax,12(%esp)
mov %eax,4(%esp)
mov $4,%eax
int $128
jnc 1
xor %eax,%eax
dec %eax
1:ret
T.MEMCOMP() Function
mov 12(%esp),%esi
mov 8(%esp),%edi
mov 4(%esp),%ecx
inc %ecx
cld
repz cmpsb
or %ecx,%ecx
jne 1
xor %eax,%eax
ret
1:mov -1(%esi),%al
sub -1(%edi),%al
cbtw
cwtl
ret
T.MEMCOPY() Function
mov 12(%esp),%esi
mov 8(%esp),%edi
mov 4(%esp),%ecx
cld
rep movsb
xor %eax,%eax
ret
T.MEMFILL() Function
mov 12(%esp),%edi
mov 8(%esp),%eax
mov 4(%esp),%ecx
cld
rep stosb
xor %eax,%eax
ret
T.MEMSCAN() Function
mov 12(%esp),%edi
mov 8(%esp),%eax
mov 4(%esp),%ecx
inc %ecx
mov %edi,%edx
cld
repnz scasb
or %ecx,%ecx
je 1
mov %edi,%eax
sub %edx,%eax
dec %eax
ret
1:xor %eax,%eax
dec %eax
ret
T3X9 Summary
Program
A program is a set of declarations followed by a compound
statement. Here is the smallest possible T3X program:
DO END
Comments
A comment is started with an exclamation point (!) and extends
up to the end of the current line. Example:
DO END ! Do nothing
Declarations
CONST name = cvalue, ... ;
Assign names to constant values.
Example:
CONST false = 0, true = %1;
Statements
name := expression;
Assign the value of an expression to a variable.
Example:
DO VAR x; x := 123; END
name[value]... := value;
name::value := value;
Assign the value of an expression to an element of a vector or
byte vector. Multiple subscripts may be applied to a vector:
vec[i][j] := i*j;
In general, vec[i][j] denotes the j-th element of the i-th element of
vec and vec::i indicates the i-th byte of vec.
Subscript and byte subscript operators can be mixed in the same
expression, but note that the byte operator :: associates to the
right, so v::x::i equals v::(x::i) and therefore,
vec[i]::j[k]
would denote the j[k]th byte of vec[i].
name();
name(expression_1, ...);
Call the function with the given name, passing the values of the
expressions to the function as arguments. An empty set of
parentheses is used to pass zero arguments. The result of the
function is discarded.
For further details see the description of function calls in the
expression section.
IF (condition) statement_1
IE (condition) statement_1 ELSE statement_2
Both of these statements run statement_1, if the given condition is
true.
In addition, IE/ELSE runs statement_2, if the condition is false.
IF just passes control to the subsequent statement in this case.
Example:
IE (0)
IF (1) RETURN 1;
ELSE
RETURN 2;
The example always returns 2, because only an IE statement can
have an ELSE branch. There is no ‘‘dangling else’’ problem.
name := expression_1
WHILE ( cvalue > 0 /\ name < expression \/
cvalue < 0 /\ name > expression )
DO
statement;
name := name + cvalue;
END
If cvalue is omitted, it defaults to 1.
Example:
DO VAR i;
FOR (i=1, 11); ! count from 1 to 10
FOR (i=10, 0, %1); ! count from 10 to 1
END
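The WHILE equivalent given above translates almost literally into a Python generator, which makes it easy to see which values the loop variable takes. The generator and its name are an illustration only.

```python
def for_values(start, limit, step=1):
    """Yield the values of the FOR variable, following the WHILE form."""
    name = start
    while (step > 0 and name < limit) or (step < 0 and name > limit):
        yield name
        name = name + step
```

list(for_values(1, 11)) gives the numbers 1 to 10 and list(for_values(10, 0, -1)) the numbers 10 down to 1, matching the comments in the example; the limit itself is never reached.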
LEAVE;
Leave the innermost WHILE or FOR loop, passing control to the
first statement following the loop (if any).
Example:
DO VAR i; ! Count from 1 to 50
FOR (i=1, 100)
IF (i=50) LEAVE;
END
LOOP;
Re-enter the innermost WHILE or FOR loop. WHILE loops are re-
entered at the point where the condition is tested, and FOR loops
are re-entered at the point where the counter is incremented.
Example:
DO VAR i; ! This program never prints X
FOR (i=0, 10) DO
LOOP;
T.WRITE(1, "X", 1);
END
END
RETURN expression;
Return a value from a function. For further details see the
description of function calls in the expression section.
Example:
inc(x) RETURN x+1;
HALT cvalue;
Halt program execution and return the given exit code to the
operating system.
Example:
HALT 1;
DO END
;
These are both empty statements or null statements. They do not
do anything when run and may be used as placeholders where a
statement would be expected. They are also used to show that
nothing is to be done in a specific situation, like in
IE (x = 0)
;
ELSE IE (x < 0)
statement;
ELSE
statement;
Example:
FOR (i=0, 100000) DO END ! waste some time
Expressions
An expression is an operand in the form of a variable, a literal, or a
function call, or a set of operators applied to operands. There are
unary, binary, and ternary operators.
Examples:
-a ! negate a
b*c ! product of b and c
x->y:z ! if x then y else z
f(x) ! the value returned by F of X
In the following, the symbols X, Y, and Z denote variables or
literals.
The operators of the T3X language are listed in figure 49.
The symbol P denotes precedence. Higher precedence means
that an operator binds stronger to its arguments, e.g. -X::Y
actually means -(X::Y).
Operator  P  A  Description
X[Y]      9  L  the Y’th element of the vector X
X::Y      9  R  the Y’th byte of the byte vector X
-X        8  -  the negative value of X
~X        8  -  the bitwise inverse of X
\X        8  -  %1, if X is 0, else 0 (logical NOT)
@X        8  -  the address of X
X*Y       7  L  the product of X and Y
X/Y       7  L  the integer quotient of X and Y
X mod Y   7  L  the division remainder of X and Y
X+Y       6  L  the sum of X and Y
X-Y       6  L  the difference between X and Y
X&Y       5  L  the bitwise AND of X and Y
X|Y       5  L  the bitwise OR of X and Y
X^Y       5  L  the bitwise XOR of X and Y
X<<Y      5  L  X shifted to the left by Y bits
X>>Y      5  L  X shifted to the right by Y bits
X<Y       4  L  %1, if X is less than Y, else 0
X>Y       4  L  %1, if X is greater than Y, else 0
X<=Y      4  L  %1, if X is less/equal Y, else 0
X>=Y      4  L  %1, if X is greater/equal Y, else 0
X=Y       3  L  %1, if X equals Y, else 0
X\=Y      3  L  %1, if X does not equal Y, else 0
X/\Y      2  L  if X then Y else 0
                (short-circuit logical AND)
X\/Y      1  L  if X then X else Y
                (short-circuit logical OR)
X->Y:Z    0  -  if X then Y else Z
Figure 49: T3X Operators
Conditions
A condition is an expression appearing in a condition context, like
the condition of an IF or WHILE statement or the first operand of
the X->Y:Z operator.
In a condition context, the value 0 is considered to be "false",
and any other value is considered to be "true". For example:
X=X is true
1=2 is false
"x" is true
5>7 is false
The canonical truth value, as returned by 1=1, is %1.
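The value %1 is the number −1, and it falls out of the EQ fragment listed among the VSM code fragments: setne, zero-extension, then a decrement. The Python model below mimics these three steps with 32-bit wraparound; the helper names are assumptions of this sketch.

```python
MASK = 0xFFFFFFFF

def to_signed(x):
    """Interpret the low 32 bits of x as a signed machine word."""
    x &= MASK
    return x - (1 << 32) if x & 0x80000000 else x

def eq(x, y):
    """Model of the EQ fragment."""
    flag = 1 if x != y else 0       # setne %dl ; movzbl %dl,%eax
    return to_signed(flag - 1)      # dec %eax

def is_true(x):
    """Any non-zero value counts as true in a condition context."""
    return x != 0
```

So eq(1, 1) delivers −1, which has all bits set, while eq(1, 2) delivers 0.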
Function Calls
When a function call appears in an expression, the result of the
function, as returned by RETURN, is used as an operand.
A function call is performed as follows:
Each actual argument in the call
function(argument_1, ...)
is passed to the function and stored in the corresponding formal
argument (‘‘argument’’) of the receiving function. The function then
runs its statement, which may produce a value via RETURN. When
no RETURN statement exists in the statement, 0 is returned.
Note that the order of argument evaluation is strictly left-to-right,
so in
f(g(a), g(b))
g(a) will always be called before g(b).
Example:
pow(x, y) DO VAR a;
a := 1;
WHILE (y) DO
a := a*x;
y := y-1;
END
RETURN a;
END
DO VAR x;
x := pow(2,10);
END
Literals
Integers
An integer is a decimal number representing its own value. Note
that negative numbers have a leading % sign rather than a - sign.
While the latter also works, it is, strictly speaking, the application
of the - operator to a positive number, so it may not appear in
cvalue contexts.
Examples:
0
12345
%1
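For readers who want to experiment, here is a minimal Python sketch that converts a T3X9 integer literal into a number. The function name is hypothetical, and the sketch deliberately rejects a leading minus sign, because that would be an operator application rather than a literal.

```python
def t3x_int(text):
    """Convert a literal such as '%1' or '12345' to a Python int."""
    neg = text.startswith("%")
    digits = text[1:] if neg else text
    if not digits.isdigit():
        raise ValueError("not a T3X integer literal: %r" % text)
    return -int(digits) if neg else int(digits)
```

With this, t3x_int("%1") yields −1 and t3x_int("-1") raises an error.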
Characters
Characters are integers internally. They are represented by single
characters enclosed in single quotes. In addition, the same
escape sequences as in strings may be used.
Examples:
'x'
'\\'
'''
'\e'
Strings
A string is a byte vector filled with characters. Strings are delimited
by " characters and NUL-terminated internally. All characters
between the delimiting double quotes represent themselves. In
addition, the escape sequences from figure 50 may be used to
include some special characters.
Examples:
""
"hello, world!\n"
"\qhi!\q, she said."
Tables
A table is a vector literal, i.e. a sequence of values. It is delimited
by square brackets and elements are separated by commas.
Table elements can be cvalues, strings, and tables.
Examples:
[1, 2, 3]
["5 times -7", %35]
[[1,0,0],[0,1,0],[0,0,1]]
Dynamic Tables
The dynamic table is a special case of the table in which one or
multiple elements are computed at program run time. Dynamic
table elements are enclosed in parentheses. E.g. in the table
["x times 7", (x*7)]
the value of the second element is computed and filled in each
time the table is evaluated. Note that dynamic table elements are
replaced in situ, so they keep a value only until they are
replaced again.
Multiple dynamic elements may be enclosed by a single pair of
parentheses. For instance, the following tables are the same:
[(x), (y), (z)]
[(x, y, z)]
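The in-situ replacement can be pictured with a small C model. It assumes that a table literal occupies one fixed storage area and that evaluating the table merely stores the computed elements into that area. The names table_storage and eval_table() are invented for this sketch, and the table modeled is [7, (x*7)], with an integer in place of the string to keep the example short.

```c
/* Model of the table [7, (x*7)]; all names are hypothetical. */
static int table_storage[2] = { 7, 0 };   /* one fixed area per table literal */

/* Evaluating the table fills in the dynamic element and yields its address. */
int *eval_table(int x) {
    table_storage[1] = x * 7;   /* the (x*7) element is replaced in situ */
    return table_storage;
}
```

Two evaluations yield the same address, so after p = eval_table(1) and q = eval_table(2), both p[1] and q[1] are 14. A value that has to survive the next evaluation must be copied out first.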
Cvalues
A cvalue (constant value) is an expression whose value is known
at compile time. In full T3X, this is a large subset of full
expressions, but in T3X9, it is limited to the following:
• integers
• characters
• constants
as well as (given that X and Y are one of the above):
• X+Y
• X*Y
Naming Conventions
Symbolic names for variables, constants, structures, and functions
are constructed from the following alphabet:
• the characters a-z
Shadowing
There is a single name space without any shadowing in T3X:
• all global names must be different
• no local name may have the same name as a global name
• all local names in the same scope must be different
The latter means that local names may be re-used in subsequent
scopes, e.g.:
f(x) RETURN x;
g(x) RETURN x;
would be a valid program. However,
f(x) DO VAR x; END !!! WRONG !!!
Variadic Functions
T3X implements variadic functions (i.e. functions of a variable
number of arguments) using dynamic tables. For instance, the
following function returns the sum of a vector of arguments:
sum(k, v) DO VAR i, n;
    n := 0;
    FOR (i=0, k)
        n := n+v[i];
    RETURN n;
END
It is an ordinary function returning the sum of a vector. It can be
considered to be a variadic function, though, because a dynamic
table can be passed to it in the V argument:
sum(5, [(a,b,c,d,e)])
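A similar idiom exists in C99, where a compound literal plays the role of the dynamic table. The following is a rough equivalent of the example above, offered as a sketch only: the names sum, k, and v follow the T3X code, while sum5() is a hypothetical wrapper added for illustration.

```c
/* Sum the first k elements of the vector v, like the T3X function above. */
int sum(int k, const int *v) {
    int n = 0;
    for (int i = 0; i < k; i++)
        n += v[i];
    return n;
}

/* Counterpart of sum(5, [(a,b,c,d,e)]): the array is built at run time. */
int sum5(int a, int b, int c, int d, int e) {
    return sum(5, (int[]){ a, b, c, d, e });
}
```

As in T3X, the callee stays an ordinary function; only the call site builds the vector of arguments.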
Built-In Functions
The following built-in functions exist in T3X9. They are generated
by the T3X9 compiler and do not have to be declared in any way.
The dot in the function names resembles the message operator of
the full T3X language, but is an ordinary symbolic character in
T3X9.
T.READ(fd, buf, len)
Read up to len characters from the file descriptor fd into the buffer
buf. Return the number of characters actually read. Return −1 in
case of an error.
Example:
DO VAR buffer::100;
t.read(0, buffer, 99);
END
T.MEMFILL(bv, b, len)
Fill the first len bytes of the byte vector bv with the byte value b.
Return 0.
Example:
DO VAR b::100; t.memfill(b, 0, 100); END
T.MEMSCAN(bv, b, len)
Locate the first occurrence of the byte value b in the first len bytes
of the byte vector bv and return its offset in the vector. When b
does not exist in the given region, return −1.
Example:
t.memscan("aaab", 'b', 4)     ! returns 3
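In C terms, the behavior described above corresponds closely to memchr() from the standard library, except that an offset is returned instead of a pointer. The sketch below only models the documented behavior; memscan_model() is a made-up name, and the code says nothing about how the T3X9 compiler implements the function.

```c
#include <string.h>

/* Hypothetical C model of T.MEMSCAN(bv, b, len). */
int memscan_model(const void *bv, int b, int len) {
    const unsigned char *start = bv;
    const unsigned char *hit = memchr(start, b, (size_t)len);

    return hit ? (int)(hit - start) : -1;   /* offset of b, or -1 if absent */
}
```

Note that only the first len bytes are searched, so scanning "aaab" for 'b' with a length of 3 reports −1.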
386 Assembly Summary
Registers
All registers are 32-bit registers. The ‘‘l’’-registers (like %al) are
used to access the lowermost 8 bits of a 32-bit register.
Addressing Modes
Key: %r denotes a register, m a memory location, n an offset, and
$n a constant.
Template  Example   Description
%r        %eax      content of register
m         8058000   content of memory address
(%r)      (%eax)    content of mem. addr. pointed to by %r
n(%r)     8(%ebp)   content of memory address %r + n
$n        $123      the value n
Syntax
This text uses AT&T syntax, which means that operands are in
source, destination order, e.g.
mov %eax,%ebx
means ‘‘copy the content of the %eax register to %ebx’’.
As a rule of thumb, all two-operand instructions support all
combinations of addressing modes, but only one of source and
destination may be a memory address (even indirect), and the
destination may not be a constant.
Move Instructions
lea x,y       load address of x into y
mov x,y       load content of x into y
movzbl x,y    like mov, but zero-extend byte to long
pop x         pop value from stack into x
push x        push x to stack
xchg x,y      exchange contents of x and y
Arithmetic Instructions
add x,y       add x to y
and x,y       bitwise AND of x and y, result in y
dec x         subtract 1 from x
idiv x        divide %edx:%eax (double) by x;
              result in %eax, remainder in %edx
imul x        multiply %eax by x; result in
              %edx:%eax (double)
inc x         add 1 to x
neg x         negate x (0 − x)
not x         bitwise NOT of x
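The use of the %edx:%eax register pair by imul and idiv can be modeled in C with 64-bit arithmetic. The sketch below is an illustration under stated assumptions, not an emulator: struct regs, imul_model(), and idiv_model() are hypothetical names, the divisor is assumed to be non-zero, and the quotient is assumed to fit in 32 bits (the real instruction raises a divide error otherwise).

```c
#include <stdint.h>

/* Hypothetical model of the register pair used by imul and idiv. */
struct regs {
    int32_t eax;
    int32_t edx;
};

/* imul x: %edx:%eax = %eax * x (signed 64-bit product). */
void imul_model(struct regs *r, int32_t x) {
    int64_t product = (int64_t)r->eax * x;

    r->eax = (int32_t)(uint32_t)product;                     /* low 32 bits  */
    r->edx = (int32_t)(uint32_t)((uint64_t)product >> 32);   /* high 32 bits */
}

/* idiv x: divide %edx:%eax by x; quotient in %eax, remainder in %edx.
   Assumes x != 0 and a quotient that fits in 32 bits. */
void idiv_model(struct regs *r, int32_t x) {
    int64_t dividend =
        (int64_t)(((uint64_t)(uint32_t)r->edx << 32) | (uint32_t)r->eax);

    r->eax = (int32_t)(dividend / x);   /* C truncates toward zero, like idiv */
    r->edx = (int32_t)(dividend % x);   /* remainder has the dividend's sign  */
}
```

The model also shows why a dividend that fits in %eax alone must be sign-extended into %edx before idiv: with %eax = −7, %edx has to hold −1 for the pair to represent −7.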
Function Calls
call x        call function at x, leaving return address on stack
ret           return from function call to address on stack
int x         trigger software interrupt, vector x
Conversion
cbtw          convert byte in %al to word in %ax with sign extension
cwtl          convert 16-bit word in %ax to long in %eax w/ sign extension
Block Instructions
cld           clear direction flag; all subsequent operations will
              increment source/destination registers
cmpsb         subtract byte at (%edi) from byte at (%esi);
              do not store result, but set flags for repz;
              increment %esi and %edi
movsb         move byte from (%esi) to (%edi);
              increment %esi and %edi
rep           repeat following movsb %ecx times, i.e.
              move block of %ecx bytes from %esi to %edi
              or
              repeat following stosb %ecx times, i.e.
              fill block of %ecx bytes at %edi with %al
repnz         repeat following scasb until %al = (%edi),
              but at most %ecx times; i.e. locate %al in block
              of %ecx bytes starting at %edi; result is %edi − 1
repz          repeat following cmpsb while (%edi) = (%esi),
              but at most %ecx times; i.e. compare blocks of size
              %ecx at %edi and %esi. %edi and %esi will point
              1 byte past first mismatch.
scasb         subtract byte at (%edi) from %al;
              do not store result, but set flags for repnz;
              increment %edi
stosb         store %al in byte at (%edi); increment %edi
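The interplay of a repeat prefix and a block instruction is easiest to see when written out as a loop. The following C sketch models the repnz/scasb pair from the summary above with the direction flag cleared. The function name repnz_scasb_model() and the zero_flag variable are assumptions made for the example, and the parameters merely mirror the registers they are named after.

```c
#include <stddef.h>

/* Hypothetical model of "repnz scasb" with the direction flag cleared.
   Returns the offset of the first byte equal to al, or -1 if none is found. */
ptrdiff_t repnz_scasb_model(const unsigned char *edi, unsigned char al,
                            size_t ecx) {
    const unsigned char *start = edi;
    int zero_flag = 0;

    while (ecx != 0) {
        ecx--;                      /* repnz counts %ecx down             */
        zero_flag = (*edi == al);   /* scasb compares %al with (%edi) ... */
        edi++;                      /* ... and then increments %edi       */
        if (zero_flag)
            break;                  /* repnz stops once the bytes match   */
    }
    /* After a match %edi is one past the byte found, hence %edi - 1. */
    return zero_flag ? (edi - 1) - start : -1;
}
```

Because scasb increments %edi even on the iteration that matches, the byte found lies at %edi − 1, which is the adjustment noted in the repnz entry.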
List of Figures
1 Compilation 9
2 Separate Compilation with Linking 10
3 Elements of Block-Structured Languages 13
4 Stack Machine Execution Model 18
5 Stack Machine Instruction Mapping 18
6 Symbol Table Layout 24
7 Symbol Table and Name List 25
8 Resolving a Mark by Backpatching 33
9 Function Context 34
10 Use of an Accumulator 39
11 Mapping an ELF File to Memory 41
12 Tokenized Program 45
13 Constant Value Syntax Rules 60
14 Variable Declaration Syntax Rules 61
15 Anonymous Vector and References 62
16 Merging Functions and Vector Allocation 63
17 Constant Declaration Syntax Rules 65
18 Constant Declaration Syntax Rules 65
19 Forward Declaration Syntax Rules 66
20 Chain of Forward References 67
21 Fixing Argument Addresses 68
22 Function Call Syntax Rules 71
23 Table Syntax Rules 73
24 Table Layout in Memory 74
25 Address Syntax Rules 77
26 Factor Syntax Rules 80
27 Arithmetic Operation Syntax Rules 82
28 Precedence Parsing 83
29 Conjunction and Disjunction Syntax Rules 85
30 Short-Circuit Logical AND 85
31 Short-Circuit Optimization 86
32 Conditional Operator Syntax Rules 87
33 Conditional Operator Control Flow 88
Bibliography
[BSD14] The FreeBSD Documentation Project;
‘‘FreeBSD Developers’ Handbook’’; 2014
Index
Program Symbols
ACC 28, 39
ACTIVATE() 40
ACTIVE() 40
ADD() 26
ADDRESS() 77
ADD_OP 46
ALIGN() 42
ALPHABETIC() 23
ARITH() 82
AW() 22
BINOP 82
BPW 19
BUILTIN() 41
CLEAR() 40
CNST 23
CODETBL 28, 101
COMPOUND() 98
CONJN() 85
CONST 64, 121
CONSTDECL() 64
CONSTFAC() 59
DATA() 36
DATAW() 36
DATA_SEG 27
DATA_SIZE 19
DATA_VADDR 27
DECL 26, 66, 122
DECLARATION() 70
DFETCH() 36
DISJN() 85
DO 98
DP 28, 61
DPATCH() 36
ELFHEADER() 43
EMIT() 35
EMITOP() 82
EMITW() 35
END 98
ENDFILE 20, 53
EQUAL_OP 46
EXPECT() 57
EXPR() 87
FACTOR() 80
FIND() 25
FINDKW() 49
FINDOP() 52
FNCALL() 71
FOR 124
FORW 23
FUN 56
FUNC 23
FUNDECL() 68
FWDDECL() 66
GEN() 38
GLOBF 23
HALT 126
HALT_STMT() 89
HDWRITE() 42
HEADER 27
HEADER_SIZE 27
HEX() 37
HEXWRITE() 42
HP 28
IE/ELSE 124
IF 124
IF_STMT() 90
INIT() 101
LEAVE 125
LEAVES 56
LEAVE_STMT() 95
LINE 20
LLP 56
LOAD() 76
LOG() 22
LOOKUP() 25
LOOP 125
LOOP0 56, 92
LOOPS 56
LOOP_STMT() 95
LP 28
LVP 56
MAXLOOP 56
MAXTBL 56
META 44
MINUS_OP 46
MKSTRING() 72
MKTABLE() 73
MUL_OP 46
NEWNAME() 26
NLIST 24
NLIST_SIZE 20
NP 24
NRELOC 19
NTOA() 20
NUMERIC() 23
OID 45, 52
OOPS() 22
OPER 46
OPS 46
PAGE_SIZE 27
POP() 22
PP 45
PROG 45
PROGRAM() 100
PROG_SIZE 19
PSIZE 45
PUSH() 22
READC() 47
READEC() 47
READPROG() 47
READRC() 47
REJECT() 48
RELOC 27
RELOCATE() 40
RESOLVE_FWD() 67
RETURN 126
RETURN_STMT() 89
RGEN() 37
RP 28
SCAN() 52
SCANOP() 50
SFLAGS 23
SKIP() 48
SNAME 23
SP 20
SPILL() 40
STACK 20
STACK_SIZE 20
STCDECL() 65
STORE() 76
STR 45, 52
STR.APPEND() 21
STR.COPY() 21
STR.EQUAL() 21
STR.LENGTH() 21
STRUCT 65, 122
SVALUE 23
SWAP() 22
SYM 23
SYMBOLIC() 52
SYMS 24
SYMTBL_SIZE 20
T 45
T.MEMCOMP() 135
T.MEMCOPY() 135
T.MEMFILL() 135
T.MEMSCAN() 135
T.READ() 134
T.WRITE() 135
TAG() 35
TEXT_SEG 27
TEXT_SIZE 19
TEXT_VADDR 27
TFETCH() 35
TOKENS 46
TOKEN_LEN 44
TOS() 22
TP 28
TPATCH() 35
UNOP 80
VAL 45, 52
VAR 61, 121
VARDECL() 61
VECT 23
WHILE 124
WHILE_STMT() 91
WRITES() 22
XEQSIGN() 57
XLPAREN() 57
XRPAREN() 57
XSEMI() 57
XSYMBOL() 58
YP 24
Definitions
386 16
ABI 16, 107
accumulator 39, 72
actual argument 71
address 23, 28, 71, 77
  relative 37
argument
  actual 71
  formal 129
assignment 123
associativity 82, 127
binary 106
binary operator 82
block 12, 98, 126
block-structure 12
bootstrapping 109
bottom-up parser 83
BSS 61
built-in function 41, 134
byte operator 77
byte vector 24
cache 39
case sensitivity 47, 133
character 130
comment 48, 121
compiler 9
  self-hosting 9, 109
compound statement 14, 98, 126
condition 129
conditional operator 87
conjunction 85
constant 64, 121
constant value 59, 132