Compiler Construction

The document discusses the process of compilation, which translates a program from a source language into an equivalent program in a target language. It describes the main phases of compilation as lexical analysis, syntax analysis, semantic analysis, intermediate code generation, code optimization, and code generation. The complete compilation process involves preprocessing the source code, compiling it, assembling the output, linking object files, and loading the final machine code.


1 Introduction

Compilation

Definition. Compilation is a process that translates a program in one language (the source language)
into an equivalent program in another language (the object or target language).

source program --> [ Compiler ] --> target program
                       |
                       v
                error messages

An important part of any compiler is the detection and reporting of errors; this will be discussed in
more detail later in the introduction. Commonly, the source language is a high-level programming
language (i.e. a problem-oriented language), and the target language is a machine language or
assembly language (i.e. a machine-oriented language). Thus compilation is a fundamental concept
in the production of software: it is the link between the (abstract) world of application development
and the low-level world of application execution on machines.
Types of Translators. An assembler is also a type of translator:

assembly program --> [ Assembler ] --> machine program

An interpreter is closely related to a compiler, but takes both source program and input data. The
translation and execution phases of the source program are one and the same.

source program + input data --> [ Interpreter ] --> output data

Although the above types of translator are the most well-known, we also need knowledge of
compilation techniques to deal with the recognition and translation of many other types of
languages including:

• Command-line interface languages;

• Typesetting / word processing languages (e.g. TeX);


• Natural languages;
• Hardware description languages;
• Page description languages (e.g. PostScript);
• Set-up or parameter files.

Early Development of Compilers.


1940’s. Early stored-program computers were programmed in machine language. Later, assembly
languages were developed where machine instructions and memory locations were given symbolic
forms.
1950’s. Early high-level languages were developed, for example FORTRAN. Although more
problem-oriented than assembly languages, the first versions of FORTRAN still had many
machine-dependent features. Techniques and processes involved in compilation were not well
understood at this time, and compiler-writing was a huge task: e.g. the first FORTRAN compiler
took 18 man-years of effort to write.
Chomsky’s study of the structure of natural languages led to a classification of languages according
to the complexity of their grammars. The context-free languages proved to be useful in describing
the syntax of programming languages.
1960’s onwards. The study of the parsing problem for context-free languages during the 1960’s and
1970’s led to efficient algorithms for the recognition of context-free languages. These
algorithms, and associated software tools, are central to compiler construction today. Similarly, the
theory of finite state machines and regular expressions (which correspond to Chomsky’s regular
languages) has proven useful for describing the lexical structure of programming languages.
From Algol 60 onwards, high-level languages have become more problem-oriented and machine
independent, with features far removed from the machine languages into which they are compiled. The
theory and tools available today make compiler construction a manageable task, even for complex
languages. For example, your compiler assignment will take only a few weeks (hopefully) and will
only be about 1000 lines of code (although, admittedly, the source language is small).

The Context of a Compiler

The complete process of compilation is illustrated as:


skeletal source program
        | preprocessor
        v
source program
        | compiler
        v
assembly program
        | assembler
        v
relocatable m/c code
        | link/load editor
        v
absolute m/c code

Preprocessors

Preprocessing performs (usually simple) operations on the source file(s) prior to compilation.
Typical preprocessing operations include:

(a) Expanding macros (shorthand notations for longer constructs). For example, in C,
#define foo(x,y) (3*x+y*(2+x))
defines a macro foo that, when used later in the program, is expanded by the preprocessor.
For example, a = foo(a,b) becomes
a = (3*a+b*(2+a))
(b) Inserting named files. For example, in C,
#include "header.h"
is replaced by the contents of the file header.h.

Linkers

A linker combines object code (machine code that has not yet been linked) produced from
compiling and assembling many source programs, together with standard library functions and
resources supplied by the operating system. This involves resolving references in each object file to
external variables and procedures declared in other files.
Loaders

Compilers, assemblers and linkers usually produce code whose memory references are made
relative to an undetermined starting location that can be anywhere in memory (relocatable machine
code). A loader calculates appropriate absolute addresses for these memory locations and amends
the code to use these addresses.

The Phases of a Compiler

The process of compilation is split up into six phases, each of which interacts with a symbol table
manager and an error handler. This is called the analysis/synthesis model of compilation. There are
many variants on this model, but the essential elements are the same.

source program
        |
        v
lexical analyser
        |
        v
syntax analyser
        |
        v
semantic analyser
        |
        v
intermediate code gen’tor
        |
        v
code optimizer
        |
        v
code generator
        |
        v
target program

(each phase interacts with the symbol-table manager and the error handler)

Lexical Analysis

A lexical analyser or scanner is a program that groups sequences of characters into lexemes, and
outputs (to the syntax analyser) a sequence of tokens. Here:

(a) Tokens are symbolic names for the entities that make up the text of the program;
e.g. if for the keyword if, and id for any identifier. These make up the output of the
lexical analyser.
(b) A pattern is a rule that specifies when a sequence of characters from the input
constitutes a token; e.g. the sequence i, f for the token if , and any sequence of
alphanumerics starting with a letter for the token id.
(c) A lexeme is a sequence of characters from the input that match a pattern (and hence
constitute an instance of a token); for example if matches the pattern for if , and
foo123bar matches the pattern for id.

For example, the following code might result in the table given below.

program foo(input,output);var x:integer;begin readln(x);writeln('value read =',x) end.

Lexeme Token Pattern


program program p, r, o, g, r, a, m
newlines, spaces, tabs
foo id (foo) letter followed by seq. of alphanumerics
( leftpar a left parenthesis
input input i, n, p, u, t
, comma a comma
output output o, u, t, p, u, t
) rightpar a right parenthesis
; semicolon a semi-colon
var var v, a, r
x id (x) letter followed by seq. of alphanumerics
: colon a colon
integer integer i, n, t, e, g, e, r
; semicolon a semi-colon
begin begin b, e, g, i, n
newlines, spaces, tabs
readln readln r, e, a, d, l, n
( leftpar a left parenthesis
x id (x) letter followed by seq. of alphanumerics
) rightpar a right parenthesis
; semicolon a semi-colon
writeln writeln w, r, i, t, e, l, n
( leftpar a left parenthesis
’value read =’ literal (’value read =’) seq. of chars enclosed in quotes
, comma a comma
x id (x) letter followed by seq. of alphanumerics
) rightpar a right parenthesis
newlines, spaces, tabs
end end e, n, d
. fullstop a fullstop
It is the sequence of tokens in the middle column that is passed as output to the syntax analyser.
This token sequence represents almost all the important information from the input program
required by the syntax analyser. Whitespace (newlines, spaces and tabs), although often important
in separating lexemes, is usually not returned as a token. Also, when outputting an id or literal
token, the lexical analyser must also return the value of the matched lexeme (shown in parentheses)
or else this information would be lost.
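The scanning process just described can be sketched in Python using regular expressions. This is an illustrative sketch only, not part of the original notes: the token names follow the table above, but the patterns and the keyword list are our own simplifications.

```python
import re

# One (token name, pattern) pair per token class; order matters because
# keywords must be tried before the general identifier pattern.
TOKEN_SPEC = [
    ("whitespace", r"[ \t\n]+"),          # separates lexemes, not returned as a token
    ("literal",    r"'[^']*'"),           # seq. of chars enclosed in quotes
    ("keyword",    r"\b(program|var|begin|end|readln|writeln|input|output|integer)\b"),
    ("id",         r"[A-Za-z][A-Za-z0-9]*"),  # letter followed by seq. of alphanumerics
    ("leftpar",    r"\("),
    ("rightpar",   r"\)"),
    ("semicolon",  r";"),
    ("colon",      r":"),
    ("comma",      r","),
    ("fullstop",   r"\."),
]

def scan(source):
    """Group characters into lexemes; output a list of (token, lexeme) pairs."""
    tokens, pos = [], 0
    while pos < len(source):
        for name, pattern in TOKEN_SPEC:
            m = re.match(pattern, source[pos:])
            if m:
                lexeme = m.group(0)
                if name == "keyword":
                    tokens.append((lexeme, lexeme))   # keyword tokens are their own names
                elif name != "whitespace":            # whitespace is skipped
                    tokens.append((name, lexeme))
                pos += len(lexeme)
                break
        else:
            raise SyntaxError(f"lexical error at position {pos}")
    return tokens

print(scan("var x:integer;"))
# prints [('var', 'var'), ('id', 'x'), ('colon', ':'), ('integer', 'integer'), ('semicolon', ';')]
```

Note how the id token carries its matched lexeme as a value, exactly as required above.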

Symbol Table Management

A symbol table is a data structure containing all the identifiers (i.e. names of variables, procedures
etc.) of a source program together with all the attributes of each identifier.
For variables, typical attributes include:

• its type,
• how much memory it occupies,
• its scope.

For procedures and functions, typical attributes include:

• the number and type of each argument (if any),


• the method of passing each argument, and
• the type of value returned (if any).

The purpose of the symbol table is to provide quick and uniform access to identifier attributes
throughout the compilation process. Information is usually put into the symbol table during the
lexical analysis and/or syntax analysis phases.
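These ideas can be sketched as a small Python class. This is an illustrative sketch only; the particular attribute names (kind, type, size, scope, params) are our own choices, not part of the notes.

```python
# A minimal symbol table: identifiers mapped to attribute records.
class SymbolTable:
    def __init__(self):
        self._entries = {}

    def insert(self, name, **attributes):
        """Called during lexical/syntax analysis when an identifier is seen."""
        self._entries[name] = attributes

    def lookup(self, name):
        """Uniform access to an identifier's attributes from any later phase."""
        return self._entries.get(name)

table = SymbolTable()
# a variable: its type, how much memory it occupies, its scope
table.insert("x", kind="variable", type="integer", size=4, scope="global")
# a procedure-like entry: number and type of arguments
table.insert("foo", kind="program", params=["input", "output"])

assert table.lookup("x")["type"] == "integer"
assert table.lookup("foo")["params"] == ["input", "output"]
```

A real compiler would also handle nested scopes, typically as a stack of such tables.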

Syntax Analysis

A syntax analyser or parser is a program that groups sequences of tokens from the lexical analysis
phase into phrases each with an associated phrase type.
A phrase is a logical unit with respect to the rules of the source language. For example, consider:

a := x * y + z
After lexical analysis, this statement has the structure

id1 assign id2 binop1 id3 binop2 id4

Now, a syntactic rule of Pascal is that there are objects called ‘expressions’ for which the rules are
(essentially):

(1) Any constant or identifier is an expression.


(2) If exp1 and exp2 are expressions then so is exp1 binop exp2.

Taking all the identifiers to be variable names for simplicity, we have:

• By rule (1) exp1 = id2 and exp2 = id3 are both phrases with phrase type ‘expression’;

• by rule (2) exp3 = exp1 binop1 exp2 is also a phrase with phrase type ‘expression’;

• by rule (1) exp4 = id4 is a phrase with phrase type ‘expression’;

• by rule (2), exp5 = exp3 binop2 exp4 is a phrase with phrase type ‘expression’.

Of course, Pascal also has a rule that says:

id assign exp
is a phrase with phrase type ‘assignment’, and so the Pascal statement above is a phrase of type
‘assignment’.
Parse Trees and Syntax Trees. The structure of a phrase is best thought of as a parse tree or a
syntax tree. A parse tree is a tree that illustrates the grouping of tokens into phrases. A syntax
tree is a compacted form of parse tree in which the operators appear as the interior nodes. The
construction of a parse tree is a basic activity in compiler-writing.
A parse tree for the example Pascal statement is:

                 assignment
                /    |     \
             id1  assign    exp5
                          /   |    \
                      exp3 binop2 exp4
                     /  |   \        |
                 exp1 binop1 exp2   id4
                   |           |
                  id2         id3
and a syntax tree is:

            assign
           /      \
        id1      binop2
                /      \
            binop1     id4
            /    \
         id2     id3

Comment. The distinction between lexical and syntactical analysis sometimes seems arbitrary.
The main criterion is whether the analyser needs recursion or not:

• lexical analysers hardly ever use recursion; they are sometimes called linear analysers
since they scan the input in a ‘straight line’ (from left to right).
• syntax analysers almost always use recursion; this is because phrase types are often defined in
terms of themselves (cf. the phrase type ‘expression’ above).
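The recursive flavour of syntax analysis can be sketched with a tiny Python parser for expression rules (1) and (2) above. This is a sketch only: it groups strictly left-to-right and ignores operator precedence, which happens to reproduce the grouping of the worked example for x * y + z.

```python
# Tokens are (kind, lexeme) pairs, as produced by a lexical analyser.
# Trees are nested tuples (op, left, right); leaves are identifier names.

def parse_expression(tokens, pos=0):
    """Parse id (binop id)* left-to-right; return (syntax_tree, next_pos)."""
    kind, lexeme = tokens[pos]
    if kind != "id":
        raise SyntaxError(f"expected identifier, got {kind}")
    tree, pos = lexeme, pos + 1               # rule (1): an identifier is an expression
    while pos < len(tokens) and tokens[pos][0] == "binop":
        op = tokens[pos][1]
        rkind, right = tokens[pos + 1]
        if rkind != "id":
            raise SyntaxError("expected identifier after operator")
        pos += 2
        tree = (op, tree, right)              # rule (2): exp binop exp is an expression
    return tree, pos

# the right-hand side of  a := x * y + z  after lexical analysis
tokens = [("id", "x"), ("binop", "*"), ("id", "y"), ("binop", "+"), ("id", "z")]
tree, _ = parse_expression(tokens)
print(tree)   # prints ('+', ('*', 'x', 'y'), 'z')
```

The loop recursing on an already-built subtree mirrors how the phrase type ‘expression’ is defined in terms of itself.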

Semantic Analysis

A semantic analyser takes its input from the syntax analysis phase in the form of a parse tree and a
symbol table. Its purpose is to determine if the input has a well-defined meaning; in practice
semantic analysers are mainly concerned with type checking and type coercion based on type rules.
Typical type rules for expressions and assignments are:
Expression Type Rules. Let exp be an expression.

(a) If exp is a constant then exp is well-typed and its type is the type of the constant.

(b) If exp is a variable then exp is well-typed and its type is the type of the variable.

(c) If exp is an operator applied to further subexpressions such that:


(i) the operator is applied to the correct number of subexpressions,
(ii) each subexpression is well-typed and
(iii) each subexpression is of an appropriate type,

then exp is well-typed and its type is the result type of the operator.

Assignment Type Rules. Let var be a variable of type T1 and let exp be a well-typed
expression of type T2. If
(a) T1 = T2 and

(b) T1 is an assignable type

then var assign exp is a well-typed assignment. For


example, consider the following code fragment:

intvar := intvar + realarray

where intvar is stored in the symbol table as being an integer variable, and realarray as an
array of reals. In Pascal this assignment is syntactically correct, but semantically incorrect since + is
only defined on numbers, whereas its second argument is an array. The semantic analyser checks for
such type errors using the parse tree, the symbol table and the type rules.
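The type rules above can be sketched in Python over tuple-shaped syntax trees. This is an illustrative sketch; the table of result types for + and the string type names are our own assumptions, not from any particular Pascal compiler.

```python
# Symbol table contents for the example in the text (illustrative).
symbol_table = {"intvar": "integer", "realarray": "array of real"}

# Result type of "+" for each pair of argument types on which it is defined.
PLUS_TYPES = {("integer", "integer"): "integer",
              ("real", "real"): "real",
              ("integer", "real"): "real",
              ("real", "integer"): "real"}

def type_of(exp):
    """Return the type of a well-typed expression, or raise a type error."""
    if isinstance(exp, str):                  # rule (b): a variable
        return symbol_table[exp]
    op, left, right = exp                     # rule (c): an operator application
    arg_types = (type_of(left), type_of(right))
    if op == "+" and arg_types in PLUS_TYPES:
        return PLUS_TYPES[arg_types]          # subexpressions well-typed and appropriate
    raise TypeError(f"'{op}' not defined on {arg_types}")

def check_assignment(var, exp):
    if type_of(var) != type_of(exp):          # assignment rule (a): T1 = T2
        raise TypeError("assignment types differ")

# intvar := intvar + realarray  -- semantically incorrect, as in the text
try:
    check_assignment("intvar", ("+", "intvar", "realarray"))
except TypeError as e:
    print("type error:", e)
```

The checker walks the parse tree bottom-up, exactly as rule (c) suggests: an expression is typed only after all its subexpressions are.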

Error Handling

Each of the six phases (but mainly the analysis phases) of a compiler can encounter errors. On
detecting an error the compiler must:

• report the error in a helpful way,


• correct the error if possible, and
• continue processing (if possible) after the error to look for further errors.

Types of Error. Errors are either syntactic or semantic:


Syntax errors are errors in the program text; they may be either lexical or grammatical:

(a) A lexical error is a mistake in a lexeme, for example, typing tehn instead of then, or
omitting one of the quotes in a literal.
(b) A grammatical error is one that violates the (grammatical) rules of the language,
for example if x = 7 y := 4 (missing then).

Semantic errors are mistakes concerning the meaning of a program construct; they may be either
type errors, logical errors or run-time errors:

(a) Type errors occur when an operator is applied to an argument of the wrong type, or to
the wrong number of arguments.

(b) Logical errors occur when a badly conceived program is executed, for example:
while x = y do ... when x and y initially have the same value and the body of
the loop need not change the value of either x or y.
(c) Run-time errors are errors that can be detected only when the program is executed, for
example:

var x : real; readln(x); writeln(1/x)

which would produce a run time error if the user input 0.

Syntax errors must be detected by a compiler and at least reported to the user (in a helpful way). If
possible, the compiler should make the appropriate correction(s). Semantic errors are much harder
and sometimes impossible for a computer to detect.

Intermediate Code Generation

After the analysis phases of the compiler have been completed, a source program has been
decomposed into a symbol table and a parse tree both of which may have been modified by the
semantic analyser. From this information we begin the process of generating object code according
to either of two approaches:

(1) generate code for a specific machine, or


(2) generate code for a ‘general’ or abstract machine, then use further translators to turn the
abstract code into code for specific machines.

Approach (2) is more modular and efficient provided the abstract machine language is simple
enough to:

(a) produce and analyse (in the optimisation phase), and


(b) easily translate into the required language(s).

One of the most widely used intermediate languages is Three-Address Code (TAC).
TAC Programs. A TAC program is a sequence of optionally labelled instructions. Some
common TAC instructions include:

(i) var1 := var2 binop var3

(ii) var1 := unop var2

(iii) var1 := num


(iv) goto label
(v) if var1 relop var2 goto label

There are also TAC instructions for addresses and pointers, arrays and procedure calls, but we will
use only the above for the following discussion.
Syntax-Directed Code Generation. In essence, code is generated by recursively walking
through a parse (or syntax) tree, and hence the process is referred to as syntax-directed code
generation. For example, consider the code fragment:

z := x * y + x

and its syntax tree (with lexemes replacing tokens):


:=

z +

* x

x y

We use this tree to direct the compilation into TAC as follows.


At the root of the tree we see an assignment whose right-hand side is an expression, and this
expression is the sum of two quantities. Assume that we can produce TAC code that computes the
value of the first and second summands and stores these values in temp1 and temp2 respectively.
Then the appropriate TAC for the assignment statement is just

z := temp1 + temp2

Next we consider how to compute the values of temp1 and temp2 in the same top-down recursive
way.
For temp1 we see that it is the product of two quantities. Assume that we can produce TAC code
that computes the value of the first and second multiplicands and stores these values in temp3 and
temp4 respectively. Then the appropriate TAC for computing temp1 is

temp1 := temp3 * temp4

Continuing the recursive walk, we consider temp3. Here we see it is just the variable x and thus the
TAC code

temp3 := x

is sufficient. Next we come to temp4 and similar to temp3 the appropriate code is

temp4 := y

Finally, considering temp2, of course

temp2 := x

suffices.
Each code fragment is output when we leave the corresponding node; this results in the final
program:

temp3 := x
temp4 := y
temp1 := temp3 * temp4
temp2 := x
z := temp1 + temp2

Comment. Notice how a compound expression has been broken down and translated into a
sequence of very simple instructions, and furthermore, the process of producing the TAC code
was uniform and simple. Some redundancy has been brought into the TAC code but this can be
removed (along with redundancy that is not due to the TAC-generation) in the optimisation
phase.
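The recursive walk described above can be sketched in Python. This is a sketch only: temporaries are numbered here in order of emission, so the numbering differs from the worked example, but the post-order emission strategy (output code on leaving each node) is the same.

```python
import itertools

# Syntax trees are nested tuples ("+", left, right); leaves are variable names.

def gen_tac(tree, code, temps):
    """Emit TAC for `tree` into `code`; return the name holding its value."""
    if isinstance(tree, str):                 # a leaf: copy the variable to a temp
        name = f"temp{next(temps)}"
        code.append(f"{name} := {tree}")
        return name
    op, left, right = tree
    l = gen_tac(left, code, temps)            # recursive walk: left subtree first,
    r = gen_tac(right, code, temps)           # then right, then this node
    name = f"temp{next(temps)}"
    code.append(f"{name} := {l} {op} {r}")    # TAC form (i): var1 := var2 binop var3
    return name

# z := x * y + x, as in the text
code = []
result = gen_tac(("+", ("*", "x", "y"), "x"), code, itertools.count(1))
code.append(f"z := {result}")
print("\n".join(code))
# prints:
# temp1 := x
# temp2 := y
# temp3 := temp1 * temp2
# temp4 := x
# temp5 := temp3 + temp4
# z := temp5
```

As in the text, the generated code contains redundancy (x is loaded twice) that is left for the optimisation phase.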

Code Optimisation

An optimiser attempts to improve the time and space requirements of a program. There are many
ways in which code can be optimised, but most are expensive in terms of time and space to
implement.
Common optimisations include:

• removing redundant identifiers,


• removing unreachable sections of code,
• identifying common subexpressions,
• unfolding loops and
• eliminating procedures.

Note that here we are concerned with the general optimisation of abstract code.
Example. Consider the TAC code:

temp1 := x
temp2 := temp1
if temp1 = temp2 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp2 + temp3

Removing redundant identifiers (just temp2) gives

temp1 := x
if temp1 = temp1 goto 200
temp3 := temp1 * y
goto 300
200 temp3 := z
300 temp4 := temp1 + temp3

Removing redundant code gives

temp1 := x
200 temp3 := z
300 temp4 := temp1 + temp3

Notes. Attempting to find a ‘best’ optimisation is expensive for the following reasons:

• A given optimisation technique may have to be applied repeatedly until no further
optimisation can be obtained. (For example, removing one redundant identifier may introduce
another.)
• A given optimisation technique may give rise to other forms of redundancy and thus
sequences of optimisation techniques may have to be repeated. (For example, above we
removed a redundant identifier and this gave rise to redundant code, but removing redundant
code may lead to further redundant identifiers.)
• The order in which optimisations are applied may be significant. (How many ways are there of
applying n optimisation techniques to a given piece of code?)
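To make the repetition point concrete, one such pass can be sketched in Python: removing temporaries that are bare copies of other temporaries, as temp2 was above. This is a simplified illustration only; a production optimiser works on a proper flow graph, and the textual substitution used here would need care with overlapping names (e.g. temp1 inside temp10).

```python
def remove_redundant_copies(code):
    """Delete instructions of the form tempA := tempB and substitute tempB
    for tempA throughout, repeating until no further optimisation is possible."""
    changed = True
    while changed:                            # repeat: one removal may expose another
        changed = False
        for i, instr in enumerate(code):
            parts = instr.split()
            # match a bare copy "a := b" (three tokens, no operator on the right)
            if len(parts) == 3 and parts[1] == ":=":
                dst, src = parts[0], parts[2]
                if dst.startswith("temp") and src.startswith("temp") and dst != src:
                    del code[i]
                    code[:] = [c.replace(dst, src) for c in code]
                    changed = True
                    break
    return code

tac = ["temp1 := x",
       "temp2 := temp1",
       "if temp1 = temp2 goto 200"]
print(remove_redundant_copies(tac))
# prints ['temp1 := x', 'if temp1 = temp1 goto 200']
```

Note that the result now contains the always-true test temp1 = temp1: removing one kind of redundancy has exposed redundant code for a later pass, exactly as described above.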
Code Generation

The final phase of the compiler is to generate code for a specific machine. In this phase we
consider:

• memory management,
• register assignment and
• machine-specific optimisation.

The output from this phase is usually assembly language or relocatable machine code.
Example. The TAC code above could typically result in the ARM assembly program shown below.
Note that the example illustrates a mechanical translation of TAC into ARM; it is not intended to
illustrate compact ARM programming!

.x     EQUD 0            four bytes for x
.z     EQUD 0            four bytes for z
.temp  EQUD 0            four bytes each for temp1,
       EQUD 0            temp3, and
       EQUD 0            temp4

.prog  MOV R12,#temp     R12 = base address
       MOV R0,#x         R0 = address of x
       LDR R1,[R0]       R1 = value of x
       STR R1,[R12]      store R1 at R12
       MOV R0,#z         R0 = address of z
       LDR R1,[R0]       R1 = value of z
       STR R1,[R12,#4]   store R1 at R12+4
       LDR R1,[R12]      R1 = value of temp1
       LDR R2,[R12,#4]   R2 = value of temp3
       ADD R3,R1,R2      add temp1 to temp3
       STR R3,[R12,#8]   store R3 at R12+8
