Compiler Construction
Compilation
Definition. Compilation is a process that translates a program in one language (the source language)
into an equivalent program in another language (the object or target language).
[Diagram: source program -> Compiler -> target program, with error messages reported to the user.]
An important part of any compiler is the detection and reporting of errors; this will be discussed in
more detail later in the introduction. Commonly, the source language is a high-level programming
language (i.e. a problem-oriented language), and the target language is a machine language or
assembly language (i.e. a machine-oriented language). Thus compilation is a fundamental concept
in the production of software: it is the link between the (abstract) world of application development
and the low-level world of application execution on machines.
Types of Translators. An assembler is also a type of translator:
[Diagram: assembly program -> Assembler -> machine program.]
An interpreter is closely related to a compiler, but takes both source program and input data. The
translation and execution phases of the source program are one and the same.
[Diagram: source program and input data -> Interpreter -> output data.]
Although the above types of translator are the most well-known, we also need knowledge of
compilation techniques to deal with the recognition and translation of many other types of
languages including:
• Command-line interface languages;
[Diagram: source program -> preprocessor -> compiler -> assembly program -> assembler -> relocatable machine code -> link/load editor -> absolute machine code.]
Preprocessors
Preprocessing performs (usually simple) operations on the source file(s) prior to compilation.
Typical preprocessing operations include:
(a) Expanding macros (shorthand notations for longer constructs). For example, in C,
#define foo(x,y) (3*x+y*(2+x))
defines a macro foo that, when used later in the program, is expanded by the preprocessor.
For example, a = foo(a,b) becomes
a = (3*a+b*(2+a))
(b) Inserting named files. For example, in C,
#include "header.h"
is replaced by the contents of the file header.h.
Linkers
A linker combines the object code (machine code that has not yet been linked) produced from
compiling and assembling many source programs with standard library functions and resources
supplied by the operating system. This involves resolving references in each object file to
external variables and procedures declared in other files.
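For illustration, consider two hypothetical C source files that are compiled and assembled separately; the names counter and increment used in main.c are recorded as unresolved references in its object file, and the linker resolves them against the object file produced from counter.c:

    /* counter.c -- defines a variable and a procedure with external linkage. */
    int counter = 0;

    void increment(void)
    {
        counter = counter + 1;
    }

    /* main.c -- uses names it does not define.  The compiler and assembler
       leave these as unresolved references in main.o; the linker resolves
       them against the definitions in counter.o (or a library). */
    extern int counter;
    void increment(void);

    int main(void)
    {
        increment();
        return counter;
    }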
Loaders
Compilers, assemblers and linkers usually produce code whose memory references are made
relative to an undetermined starting location that can be anywhere in memory (relocatable machine
code). A loader calculates appropriate absolute addresses for these memory locations and amends
the code to use these addresses.
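A minimal sketch of the idea in C (the structure and field names are invented, not a real object-file format): the module records which of its code words hold module-relative addresses, and the loader adds the chosen load address to each of them.

    #include <stdint.h>
    #include <stddef.h>

    /* A toy relocatable 'module': some code/data words plus a relocation
       table saying which of those words contain module-relative addresses. */
    struct module {
        uint32_t *code;      /* the code and data words               */
        size_t   *reloc;     /* indices of words that need relocating */
        size_t    nreloc;
    };

    /* The loader chooses the absolute address 'base' at which the module
       will live and patches every relative address into an absolute one. */
    void relocate(struct module *m, uint32_t base)
    {
        for (size_t i = 0; i < m->nreloc; i++)
            m->code[m->reloc[i]] += base;
    }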
The Phases of Compilation
The process of compilation is split up into six phases, each of which interacts with a symbol table
manager and an error handler. This is called the analysis/synthesis model of compilation. There are
many variants on this model, but the essential elements are the same.
[Diagram: source program -> lexical analyser -> syntax analyser -> semantic analyser -> intermediate code generator -> code optimiser -> code generator -> target program; each phase interacts with the symbol table manager and the error handler.]
Lexical Analysis
A lexical analyser or scanner is a program that groups sequences of characters into lexemes, and
outputs (to the syntax analyser) a sequence of tokens. Here:
(a) Tokens are symbolic names for the entities that make up the text of the program;
e.g. if for the keyword if, and id for any identifier. These make up the output of the
lexical analyser.
(b) A pattern is a rule that specifies when a sequence of characters from the input
constitutes a token; e.g. the sequence i, f for the token if, and any sequence of
alphanumerics starting with a letter for the token id.
(c) A lexeme is a sequence of characters from the input that match a pattern (and hence
constitute an instance of a token); for example, if matches the pattern for if, and
foo123bar matches the pattern for id.
For example, applied to the Pascal statement a := x * y + z (used again below), the scanner would output the token sequence id assign id binop id binop id, recording the lexeme of each id (a, x, y and z) and of each binop (* and +).
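A minimal scanner along these lines might look as follows; this is only a sketch, handling just identifiers, the := symbol and the operators * and +, and the token names and the function next_token are illustrative rather than part of any particular compiler.

    #include <stdio.h>
    #include <ctype.h>

    /* Token names: symbolic constants for the entities making up the text. */
    enum token { ID, ASSIGN, BINOP, END, ERROR };

    char lexeme[64];     /* the characters matched for the most recent token */

    /* Group input characters into a lexeme and return the token it matches. */
    enum token next_token(FILE *in)
    {
        int c = fgetc(in);
        while (c == ' ' || c == '\t' || c == '\n')      /* skip white space */
            c = fgetc(in);
        if (c == EOF)
            return END;

        if (isalpha(c)) {            /* pattern for id: letter (letter|digit)* */
            int i = 0;
            while (c != EOF && isalnum(c)) {
                if (i < 63)
                    lexeme[i++] = (char)c;
                c = fgetc(in);
            }
            lexeme[i] = '\0';
            if (c != EOF)
                ungetc(c, in);
            return ID;
        }
        if (c == ':') {              /* pattern for the assignment symbol := */
            if (fgetc(in) == '=') {
                lexeme[0] = ':'; lexeme[1] = '='; lexeme[2] = '\0';
                return ASSIGN;
            }
            return ERROR;
        }
        if (c == '*' || c == '+') {  /* patterns for the binary operators */
            lexeme[0] = (char)c; lexeme[1] = '\0';
            return BINOP;
        }
        return ERROR;
    }

Note that the scanner simply works left to right through the input, one character at a time; as discussed below, no recursion is needed.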
Symbol Tables
A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures
etc.) of a source program together with all the attributes of each identifier.
For variables, typical attributes include:
• its type,
• how much memory it occupies,
• its scope.
The purpose of the symbol table is to provide quick and uniform access to identifier attributes
throughout the compilation process. Information is usually put into the symbol table during the
lexical analysis and/or syntax analysis phases.
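As a sketch (with invented field names, and a linear search where a real compiler would normally use a hash table), a symbol table and its lookup operation might be:

    #include <string.h>

    /* One entry per identifier, with some typical attributes. */
    struct sym_entry {
        char name[32];      /* the identifier itself                 */
        int  type;          /* e.g. integer, real, array of real ... */
        int  size;          /* how much memory it occupies, in bytes */
        int  scope;         /* block nesting level                   */
    };

    static struct sym_entry table[512];
    static int n_entries = 0;

    /* Return the entry for name, creating it if it is not yet present
       (no overflow check in this sketch). */
    struct sym_entry *lookup(const char *name)
    {
        for (int i = 0; i < n_entries; i++)
            if (strcmp(table[i].name, name) == 0)
                return &table[i];
        strncpy(table[n_entries].name, name, sizeof table[n_entries].name - 1);
        return &table[n_entries++];
    }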
Syntax Analysis
A syntax analyser or parser is a program that groups sequences of tokens from the lexical analysis
phase into phrases each with an associated phrase type.
A phrase is a logical unit with respect to the rules of the source language. For example, consider:
a := x * y + z
After lexical analysis, this statement has the structure
id1 assign id2 binop1 id3 binop2 id4
where id1, id2, id3 and id4 are tokens for the identifiers a, x, y and z respectively, and binop1 and binop2 are tokens for the operators * and +.
Now, a syntactic rule of Pascal is that there are objects called ‘expressions’ for which the rules are
(essentially):
(1) any identifier or constant is a phrase with phrase type ‘expression’;
(2) if exp1 and exp2 are phrases with phrase type ‘expression’ and binop is a binary operator, then exp1 binop exp2 is a phrase with phrase type ‘expression’.
Applying these rules to the token sequence above:
• by rule (1), exp1 = id2 and exp2 = id3 are both phrases with phrase type ‘expression’;
• by rule (2), exp3 = exp1 binop1 exp2 is also a phrase with phrase type ‘expression’;
• by rule (1), exp4 = id4 is a phrase with phrase type ‘expression’;
• by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type ‘expression’.
A further syntactic rule of Pascal is (essentially) that any sequence of the form
id assign exp
is a phrase with phrase type ‘assignment’, and so the Pascal statement above is a phrase of type
‘assignment’.
Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a
syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax
tree is a compacted form of parse tree in which the operators appear as the interior nodes. The
construction of a parse tree is a basic activity in compiler-writing.
A parse tree for the example Pascal statement is:
[Parse tree: the root ‘assignment’ has children id1, assign and exp5; exp5 has children exp3, binop2 and exp4; exp3 has children exp1, binop1 and exp2; and exp1, exp2 and exp4 have the single children id2, id3 and id4 respectively.]
and a syntax tree is:
[Syntax tree: the root assign has children id1 and binop2; binop2 has children binop1 and id4; binop1 has children id2 and id3.]
Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary.
The main criterion is whether the analyser needs recursion or not:
• lexical analysers hardly ever use recursion; they are sometimes called linear analysers
since they scan the input in a ‘straight line’ (from left to right).
• syntax analysers almost always use recursion; this is because phrase types are often defined in
terms of themselves (cf. the phrase type ‘expression’ above).
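To illustrate this recursive structure, a fragment of a recursive-descent parser for the expression rules above might look as follows; it is only a sketch that assumes the token names and lexeme buffer of the scanner sketch, ignores operator precedence and does no error handling.

    #include <stdlib.h>
    #include <string.h>

    /* Token names and lexeme buffer as in the scanner sketch above. */
    enum token { ID, ASSIGN, BINOP, END, ERROR };
    extern enum token current;          /* most recent token (assumed)    */
    extern char lexeme[64];             /* its lexeme (assumed)           */
    void advance(void);                 /* fetch the next token (assumed) */

    /* Syntax tree nodes: identifiers at the leaves, operators inside. */
    struct node {
        char op;                        /* '+' or '*', or 0 for a leaf  */
        char name[32];                  /* lexeme of an identifier leaf */
        struct node *left, *right;
    };

    static struct node *make_node(char op, struct node *l, struct node *r)
    {
        struct node *n = calloc(1, sizeof *n);
        n->op = op;
        n->left = l;
        n->right = r;
        return n;
    }

    /* rule (1): an identifier on its own is an expression. */
    static struct node *primary(void)
    {
        struct node *n = make_node(0, NULL, NULL);
        strncpy(n->name, lexeme, sizeof n->name - 1);
        advance();                      /* consume the id */
        return n;
    }

    /* rule (2): expression binop expression is again an expression.
       The phrase type is defined in terms of itself, so the analyser
       naturally calls itself recursively. */
    static struct node *expression(void)
    {
        struct node *left = primary();
        if (current == BINOP) {
            char op = lexeme[0];
            advance();                  /* consume the operator */
            return make_node(op, left, expression());
        }
        return left;
    }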
Semantic Analysis
A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree and a
symbol table. Its purpose is to determine if the input has a well-defined meaning; in practice
semantic analysers are mainly concerned with type checking and type coercion based on type rules.
Typical type rules for expressions and assignments are:
Expression Type Rules. Let exp be an expression.
(a) If exp is a constant then exp is well-typed and its type is the type of the constant.
(b) If exp is a variable then exp is well-typed and its type is the type of the variable.
(c) If exp is of the form exp1 binop exp2, where exp1 and exp2 are well-typed expressions
whose types are permitted by the operator binop, then exp is well-typed and its type is the
result type of the operator.
Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed
expression of type T2. If
(a) T1 = T2, or
(b) T2 can be coerced to T1 (for example, an integer value assigned to a real variable),
then var := exp is a well-typed assignment.
For example, consider the Pascal assignment
intvar := intvar + realarray
where intvar is stored in the symbol table as being an integer variable, and realarray as an
array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect since + is
only defined on numbers, whereas its second argument is an array. The semantic analyser checks for
such type errors using the parse tree, the symbol table and type rules.
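A sketch of such a check over the syntax tree (reusing the node structure from the parser sketch; type_of is an assumed lookup of the type attribute in the symbol table) might be:

    #include <stdio.h>

    enum type { T_INTEGER, T_REAL, T_ARRAY, T_ERROR };

    /* Syntax tree node, as in the parser sketch. */
    struct node {
        char op;                 /* '+' or '*', or 0 for an identifier leaf */
        char name[32];
        struct node *left, *right;
    };

    /* Assumed: look up the type attribute of an identifier in the symbol table. */
    enum type type_of(const char *name);

    /* Expression rules (b) and (c): a variable has the type recorded for it;
       an operator application is well-typed only if both operands are
       well-typed numbers, and then has a numeric result type. */
    enum type check_expr(const struct node *e)
    {
        if (e->op == 0)                          /* rule (b): a variable */
            return type_of(e->name);

        enum type l = check_expr(e->left);       /* rule (c): exp1 binop exp2 */
        enum type r = check_expr(e->right);
        if ((l == T_INTEGER || l == T_REAL) && (r == T_INTEGER || r == T_REAL))
            return (l == T_REAL || r == T_REAL) ? T_REAL : T_INTEGER;

        fprintf(stderr, "type error: '%c' applied to a non-numeric operand\n", e->op);
        return T_ERROR;
    }

Applied to the expression intvar + realarray above, the second recursive call would return T_ARRAY and the type error would be reported.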
Error Handling
Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On
detecting an error the compiler must report it to the user (in a helpful way), if possible make an
appropriate correction, and then continue processing the rest of the input so that further errors
can be detected.
Types of Error. Syntax errors are errors in the form of the program text; they may be either
lexical or grammatical:
(a) A lexical error is a mistake in a lexeme, for example typing tehn instead of then, or
missing off one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language,
for example if x = 7 y := 4 (missing then).
Semantic errors are mistakes concerning the meaning of a program construct; they may be either
type errors, logical errors or run-time errors:
(a) Type errors occur when an operator is applied to an argument of the wrong type, or to
the wrong number of arguments.
(b) Logical errors occur when a badly conceived program is executed, for example:
while x = y do ... when x and y initially have the same value and the body of the
loop need not change the value of either x or y, so the loop may never terminate.
(c) Run-time errors are errors that can be detected only when the program is executed, for
example an attempt to divide by zero, or to index an array outside its bounds.
Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way). If
possible, the compiler should make the appropriate correction(s). Semantic errors are much harder
and sometimes impossible for a computer to detect.
Intermediate Code Generation
After the analysis phases of the compiler have been completed, a source program has been
decomposed into a symbol table and a parse tree, both of which may have been modified by the
semantic analyser. From this information we begin the process of generating object code according
to either of two approaches:
(1) generate code for a specific machine directly from this information; or
(2) first generate code for an ‘abstract machine’ (an intermediate language), and then translate
this abstract code into code for specific machines.
Approach (2) is more modular and efficient provided the abstract machine language is simple
enough to be produced easily during the analysis phases and to be translated easily into the
required machine code.
One of the most widely used intermediate languages is Three-Address Code (TAC).
TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some
common TAC instructions include:
(a) var1 := var2 binop var3,
(b) var1 := var2,
(c) goto label, and
(d) if var1 relop var2 goto label.
There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will
use only the above for the following discussion.
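Inside a compiler such instructions are commonly represented as ‘quadruples’ (an operator and up to three addresses). A minimal sketch in C, with invented field and constant names, is:

    /* A TAC instruction as a quadruple: an operator and up to three addresses. */
    enum tac_op {
        TAC_COPY,       /* result := arg1                  */
        TAC_BINOP,      /* result := arg1 binop arg2       */
        TAC_GOTO,       /* goto target                     */
        TAC_IFGOTO      /* if arg1 relop arg2 goto target  */
    };

    struct tac {
        int         label;        /* optional label on this instruction (0 = none) */
        enum tac_op op;
        char        oper;         /* the binop (or relop) character, if any        */
        char        result[16];   /* variable or temporary being assigned to       */
        char        arg1[16];
        char        arg2[16];
        int         target;       /* label jumped to by the goto forms             */
    };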
Syntax-Directed Code Generation. In essence, code is generated by recursively walking
through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code
generation. For example, consider the code fragment:
z := x * y + x
A syntax tree for this fragment has := at the root, with children z and +; the + node has children
* and x, and the * node has children x and y.
Starting at the root, we see that the statement assigns to z the sum of two quantities. Assume that
we can produce TAC code that computes the values of these two quantities and stores them in
temp1 and temp2 respectively. Then the TAC instruction
z := temp1 + temp2
completes the translation of the assignment.
Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive
way.
For temp1 we see that it is the product of two quantities. Assume that we can produce TAC code
that computes the value of the first and second multiplicands and stores these values in temp3 and
temp4 respectively. Then the appropriate TAC for computing temp1 is
temp1 := temp3 * temp4
Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus the
TAC code
temp3 := x
is sufficient. Next we come to temp4 and, as with temp3, the appropriate code is
temp4 := y
This completes the code for computing temp1. Finally we consider temp2; it too is just a variable
(namely x), and so
temp2 := x
suffices.
Each code fragment is output when we leave the corresponding node; this results in the final
program:
temp3 := x
temp4 := y
temp1 := temp3 * temp4
temp2 := x
z := temp1 + temp2
Comment. Notice how a compound expression has been broken down and translated into a
sequence of very simple instructions, and furthermore, the process of producing the TAC code
was uniform and simple. Some redundancy has been brought into the TAC code but this can be
removed (along with redundancy that is not due to the TAC-generation) in the optimisation
phase.
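A sketch of such a tree walk in C (reusing the node structure from the parser sketch; new_temp and gen_assign are invented helpers) could look like this; note that it introduces exactly the kind of redundancy mentioned above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Syntax tree node, as in the parser sketch. */
    struct node {
        char op;                   /* '+' or '*', or 0 for an identifier leaf */
        char name[32];
        struct node *left, *right;
    };

    static int temp_count = 0;

    /* Invent a fresh temporary name: temp1, temp2, ... */
    static char *new_temp(void)
    {
        char *name = malloc(16);
        snprintf(name, 16, "temp%d", ++temp_count);
        return name;
    }

    /* Emit TAC that computes the value of subtree e and return the name of
       the temporary holding that value.  Each instruction is printed as the
       walk leaves the corresponding node (the numbering of the temporaries
       may differ from the hand translation above). */
    static char *gen_expr(const struct node *e)
    {
        char *place = new_temp();
        if (e->op == 0) {                                /* a leaf variable  */
            printf("%s := %s\n", place, e->name);
        } else {                                         /* an operator node */
            char *l = gen_expr(e->left);
            char *r = gen_expr(e->right);
            printf("%s := %s %c %s\n", place, l, e->op, r);
            free(l);
            free(r);
        }
        return place;
    }

    /* Translate an assignment  target := e.  The final copy is exactly the
       kind of redundancy that the optimisation phase removes later. */
    void gen_assign(const char *target, const struct node *e)
    {
        char *v = gen_expr(e);
        printf("%s := %s\n", target, v);
        free(v);
    }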
Code Optimisation
An optimiser attempts to improve the time and space requirements of a program. There are many
ways in which code can be optimised, but most are expensive in terms of time and space to
implement.
Common optimisations include removing redundant identifiers and removing unreachable
(redundant) code; the example below illustrates both. Note that here we are concerned with the
general optimisation of abstract code; machine-specific optimisation belongs to the code
generation phase.
Example. Consider the TAC code:
temp1 := x
temp2 := temp1
if temp1 = temp2 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp2 + temp3
First, the identifier temp2 is redundant, since it always holds the same value as temp1: every use
of temp2 can be replaced by temp1 and the assignment to temp2 removed, giving
temp1 := x
if temp1 = temp1 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp1 + temp3
Second, the test temp1 = temp1 is always true, so the jump to label 200 is always taken; the
conditional jump, and with it the unreachable (redundant) code, can therefore be removed, leaving
temp1 := x
200 temp3 := z
300 temp4 := temp1 + temp3
Notes. Attempting to find a ‘best’ optimisation is expensive for the following reasons:
• A given optimisation technique may have to be applied repeatedly until no further
optimisation can be obtained; see the sketch after these notes. (For example, removing one redundant identifier may introduce
another.)
• A given optimisation technique may give rise to other forms of redundancy and thus
sequences of optimisation techniques may have to be repeated. (For example, above we
removed a redundant identifier and this gave rise to redundant code, but removing redundant
code may lead to further redundant identifiers.)
• The order in which optimisations are applied may be significant. (How many ways are there of
applying n optimisation techniques to a given piece of code?)
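The repeated application described above is often organised as a fixed-point loop over a set of passes. A sketch, with hypothetical pass names corresponding to the two optimisations used in the example, is:

    #include <stdbool.h>
    #include <stddef.h>

    struct tac;   /* TAC instruction, as sketched earlier */

    /* Hypothetical passes; each returns true if it changed the program. */
    bool remove_redundant_identifiers(struct tac *prog, size_t n);
    bool remove_redundant_code(struct tac *prog, size_t n);

    /* Apply the passes repeatedly until neither finds anything further to
       improve (a fixed point).  Both the order of the passes and the number
       of repetitions can affect the result, as noted above. */
    void optimise(struct tac *prog, size_t n)
    {
        bool changed = true;
        while (changed) {
            changed = false;
            if (remove_redundant_identifiers(prog, n)) changed = true;
            if (remove_redundant_code(prog, n))        changed = true;
        }
    }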
Code Generation
The final phase of the compiler is to generate code for a specific machine. In
this phase we consider:
• memory management,
• register assignment and
• machine-specific optimisation.
The output from this phase is usually assembly language or relocatable machine code.
Example. The TAC code above could typically result in the ARM assembly
program shown below. Note that the example illustrates a mechanical
translation of TAC into ARM; it is not intended to illustrate compact ARM
programming!
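As an indication of how mechanical such a translation can be, the following sketch emits a naive load/operate/store sequence for each of the simple TAC forms used above; the mnemonics are illustrative pseudo-assembly rather than genuine ARM, and the tac structure is the one sketched earlier.

    #include <stdio.h>

    /* TAC quadruple, as sketched earlier. */
    enum tac_op { TAC_COPY, TAC_BINOP, TAC_GOTO, TAC_IFGOTO };
    struct tac {
        int         label;
        enum tac_op op;
        char        oper;
        char        result[16], arg1[16], arg2[16];
        int         target;
    };

    /* Mechanically translate one TAC instruction into schematic three-register
       pseudo-assembly: load the operands from memory, operate, store the result.
       No value is kept in a register between instructions, which is exactly why
       the output is correct but far from compact. */
    void gen_machine(const struct tac *i)
    {
        if (i->label)
            printf("L%d:\n", i->label);
        switch (i->op) {
        case TAC_COPY:                       /* result := arg1 */
            printf("        LOAD  R0, %s\n", i->arg1);
            printf("        STORE R0, %s\n", i->result);
            break;
        case TAC_BINOP:                      /* result := arg1 binop arg2 */
            printf("        LOAD  R0, %s\n", i->arg1);
            printf("        LOAD  R1, %s\n", i->arg2);
            printf("        %s   R0, R0, R1\n", i->oper == '+' ? "ADD" : "MUL");
            printf("        STORE R0, %s\n", i->result);
            break;
        case TAC_GOTO:                       /* goto target */
            printf("        JUMP  L%d\n", i->target);
            break;
        case TAC_IFGOTO:                     /* if arg1 = arg2 goto target
                                                (only the = relop used above) */
            printf("        LOAD  R0, %s\n", i->arg1);
            printf("        LOAD  R1, %s\n", i->arg2);
            printf("        CMP   R0, R1\n");
            printf("        JEQ   L%d\n", i->target);
            break;
        }
    }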