Automata Theory and Compiler Design (AT&CD), VTU 5th Sem, 21CS51

Chapter 1

Introduction

Programming languages are notations for describing computations to people and to machines. The world as we know it depends on programming languages, because all the software running on all the computers was written in some programming language. But, before a program can be run, it first must be translated into a form in which it can be executed by a computer. The software systems that do this translation are called compilers.
This book is about how to design and implement compilers. We shall discover that a few basic ideas can be used to construct translators for a wide variety of languages and machines. Besides compilers, the principles and techniques for compiler design are applicable to so many other domains that they are likely to be reused many times in the career of a computer scientist. The study of compiler writing touches upon programming languages, machine architecture, language theory, algorithms, and software engineering.
In this preliminary chapter, we introduce the different forms of language translators, give a high-level overview of the structure of a typical compiler, and discuss the trends in programming languages and machine architecture that are shaping compilers. We include some observations on the relationship between compiler design and computer-science theory and an outline of the applications of compiler technology that go beyond compilation. We end with a brief outline of key programming-language concepts that will be needed for our study of compilers.

1.1 Language Processors


Simply stated, a compiler is a program that can read a program in one language — the source language — and translate it into an equivalent program in another language — the target language; see Fig. 1.1. An important role of the compiler is to report any errors in the source program that it detects during the translation process.


source program → Compiler → target program

Figure 1.1: A compiler

If the target program is an executable machine-language program, it can then be called by the user to process inputs and produce outputs; see Fig. 1.2.

input → Target Program → output

Figure 1.2: Running the target program

An interpreter is another common kind of language processor. Instead of producing a target program as a translation, an interpreter appears to directly execute the operations specified in the source program on inputs supplied by the user, as shown in Fig. 1.3.

source program, input → Interpreter → output

Figure 1.3: An interpreter

The machine-language target program produced by a compiler is usually much faster than an interpreter at mapping inputs to outputs. An interpreter, however, can usually give better error diagnostics than a compiler, because it executes the source program statement by statement.
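
To make the contrast concrete, here is a minimal sketch in Python (illustrative, not from the book) of an interpreter for tiny programs made of one assignment per line. Because it executes one statement at a time, it can report exactly which source line failed, the kind of diagnostic precision described above:

def interpret(source):
    env = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            name, expr = line.split("=", 1)
            # eval keeps the sketch short; a real interpreter would
            # parse and evaluate the expression itself.
            env[name.strip()] = eval(expr, {}, env)
        except Exception as err:
            print(f"error on line {lineno}: {err}")
            return None
    return env

print(interpret("x = 2 + 3\ny = x * 60"))   # {'x': 5, 'y': 300}

A compiler, by contrast, translates the whole program before any input is processed, which is why its target code runs faster even though its error reports are further removed from the running program.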

Example 1.1: Java language processors combine compilation and interpretation, as shown in Fig. 1.4. A Java source program may first be compiled into an intermediate form called bytecodes. The bytecodes are then interpreted by a virtual machine. A benefit of this arrangement is that bytecodes compiled on one machine can be interpreted on another machine, perhaps across a network.
In order to achieve faster processing of inputs to outputs, some Java compilers, called just-in-time compilers, translate the bytecodes into machine language immediately before they run the intermediate program to process the input. •

source program → Translator → intermediate program
intermediate program, input → Virtual Machine → output

Figure 1.4: A hybrid compiler

In addition to a compiler, several other programs may be required to create an executable target program, as shown in Fig. 1.5. A source program may be divided into modules stored in separate files. The task of collecting the source program is sometimes entrusted to a separate program, called a preprocessor. The preprocessor may also expand shorthands, called macros, into source language statements.
The modified source program is then fed to a compiler. The compiler may produce an assembly-language program as its output, because assembly language is easier to produce as output and is easier to debug. The assembly language is then processed by a program called an assembler that produces relocatable machine code as its output.
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked together with other relocatable object files and library files into the code that actually runs on the machine. The linker resolves external memory addresses, where the code in one file may refer to a location in another file. The loader then puts together all of the executable object files into memory for execution.

1.1.1 Exercises for Section 1.1


Exercise 1.1.1: What is the difference between a compiler and an interpreter?

Exercise 1.1.2: What are the advantages of (a) a compiler over an interpreter (b) an interpreter over a compiler?

Exercise 1.1.3: What advantages are there to a language-processing system in which the compiler produces assembly language rather than machine language?

Exercise 1.1.4: A compiler that translates a high-level language into another high-level language is called a source-to-source translator. What advantages are there to using C as a target language for a compiler?

Exercise 1.1.5: Describe some of the tasks that an assembler needs to perform.

source program
    ↓
Preprocessor
    ↓
modified source program
    ↓
Compiler
    ↓
target assembly program
    ↓
Assembler
    ↓
relocatable machine code
    ↓
Linker/Loader  ←  library files, relocatable object files
    ↓
target machine code

Figure 1.5: A language-processing system

1.2 The Structure of a Compiler


Up to this point we have treated a compiler as a single box t h a t maps a source
program into a semantically equivalent target program. If we open up this box
a little, we see t h a t there are two parts to this mapping: analysis and synthesis.
T h e analysis part breaks up the source program into constituent pieces and
imposes a grammatical structure on them. It then uses this structure to cre-
ate an intermediate representation of t h e source program. If the analysis part
detects t h a t t h e source program is either syntactically ill formed or semanti-
cally unsound, then it must provide informative messages, so the user can take
corrective action. T h e analysis part also collects information about t h e source
program and stores it in a d a t a structure called a symbol table, which is passed
along with the intermediate representation to t h e synthesis part.
The synthesis part constructs the desired target program from the interme-
diate representation and the information in t h e symbol table. T h e analysis part
is often called the front end of the compiler; the synthesis p a r t is t h e back end.
If we examine the compilation process in more detail, we see t h a t it operates
as a sequence of phases, each of which transforms one representation of t h e
source program to another. A typical decomposition of a compiler into phases
is shown in Fig. 1.6. In practice, several phases may be grouped together,
and the intermediate representations between t h e grouped phases need not be
constructed explicitly. T h e symbol table, which stores information about t h e

character stream
    ↓
Lexical Analyzer
    ↓  token stream
Syntax Analyzer
    ↓  syntax tree
Semantic Analyzer
    ↓  syntax tree
Intermediate Code Generator
    ↓  intermediate representation
Machine-Independent Code Optimizer
    ↓  intermediate representation
Code Generator
    ↓  target-machine code
Machine-Dependent Code Optimizer
    ↓  target-machine code

(The Symbol Table is consulted and updated by every phase.)

Figure 1.6: Phases of a compiler



Some compilers have a machine-independent optimization phase between the front end and the back end. The purpose of this optimization phase is to perform transformations on the intermediate representation, so that the back end can produce a better target program than it would have otherwise produced from an unoptimized intermediate representation. Since optimization is optional, one or the other of the two optimization phases shown in Fig. 1.6 may be missing.

1.2.1 Lexical Analysis


The first phase of a compiler is called lexical analysis or scanning. The lexical analyzer reads the stream of characters making up the source program

and groups the characters into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces as output a token of the form

(token-name, attribute-value)

that it passes on to the subsequent phase, syntax analysis. In the token, the first component token-name is an abstract symbol that is used during syntax analysis, and the second component attribute-value points to an entry in the symbol table for this token. Information from the symbol-table entry is needed for semantic analysis and code generation.
For example, suppose a source program contains the assignment statement

position = initial + rate * 60          (1.1)

The characters in this assignment could be grouped into the following lexemes
and mapped into the following tokens passed on to the syntax analyzer:

1. position is a lexeme that would be mapped into a token (id, 1), where id is an abstract symbol standing for identifier and 1 points to the symbol-table entry for position. The symbol-table entry for an identifier holds information about the identifier, such as its name and type.

2. The assignment symbol = is a lexeme that is mapped into the token (=). Since this token needs no attribute-value, we have omitted the second component. We could have used any abstract symbol such as assign for the token-name, but for notational convenience we have chosen to use the lexeme itself as the name of the abstract symbol.

3. initial is a lexeme that is mapped into the token (id, 2), where 2 points to the symbol-table entry for initial.

4. + is a lexeme that is mapped into the token (+).

5. rate is a lexeme that is mapped into the token (id, 3), where 3 points to the symbol-table entry for rate.

6. * is a lexeme that is mapped into the token (*).

7. 60 is a lexeme that is mapped into the token (60).¹

Blanks separating the lexemes would be discarded by the lexical analyzer.
Figure 1.7 shows the representation of the assignment statement (1.1) after lexical analysis as the sequence of tokens

(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)          (1.2)

In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and multiplication operators, respectively.
¹ Technically speaking, for the lexeme 60 we should make up a token like (number, 4), where 4 points to the symbol table for the internal representation of integer 60, but we shall defer the discussion of tokens for numbers until Chapter 2. Chapter 3 discusses techniques for building lexical analyzers.
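
To make this phase concrete, here is a minimal sketch of a lexical analyzer in Python (the function name, the regular expression, and the list-based symbol table are illustrative assumptions, not the book's code). On statement (1.1) it reproduces the token stream (1.2), with number tokens simply carrying their lexemes, as the footnote above permits for now:

import re

def lex(source):
    symtab, tokens = [], []
    for lexeme in re.findall(r"[A-Za-z_]\w*|\d+|[=+*]", source):
        if lexeme[0].isalpha() or lexeme[0] == "_":
            if lexeme not in symtab:           # first occurrence: new entry
                symtab.append(lexeme)
            tokens.append(("id", symtab.index(lexeme) + 1))
        else:                                  # operator or number lexeme
            tokens.append((lexeme,))
    return tokens, symtab

tokens, symtab = lex("position = initial + rate * 60")
print(tokens)   # [('id', 1), ('=',), ('id', 2), ('+',), ('id', 3), ('*',), ('60',)]
print(symtab)   # ['position', 'initial', 'rate']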
position = initial + rate * 60
    ↓
Lexical Analyzer
    ↓
(id, 1) (=) (id, 2) (+) (id, 3) (*) (60)
    ↓
Syntax Analyzer
    ↓
          =
        /   \
  (id, 1)    +
           /   \
     (id, 2)    *
              /   \
        (id, 3)    60
    ↓
Semantic Analyzer
    ↓
          =
        /   \
  (id, 1)    +
           /   \
     (id, 2)    *
              /   \
        (id, 3)    inttofloat
                       |
                       60
    ↓
Intermediate Code Generator
    ↓
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3
    ↓
Code Optimizer
    ↓
t1 = id3 * 60.0
id1 = id2 + t1
    ↓
Code Generator
    ↓
LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2
ADDF R1, R1, R2
STF  id1, R1

SYMBOL TABLE: 1: position  2: initial  3: rate

Figure 1.7: Translation of an assignment statement



1.2.2 Syntax Analysis


The second phase of the compiler is syntax analysis or parsing. The parser uses the first components of the tokens produced by the lexical analyzer to create a tree-like intermediate representation that depicts the grammatical structure of the token stream. A typical representation is a syntax tree in which each interior node represents an operation and the children of the node represent the arguments of the operation. A syntax tree for the token stream (1.2) is shown as the output of the syntax analyzer in Fig. 1.7.
This tree shows the order in which the operations in the assignment

position = initial + rate * 60

are to be performed. The tree has an interior node labeled * with (id, 3) as its left child and the integer 60 as its right child. The node (id, 3) represents the identifier rate. The node labeled * makes it explicit that we must first multiply the value of rate by 60. The node labeled + indicates that we must add the result of this multiplication to the value of initial. The root of the tree, labeled =, indicates that we must store the result of this addition into the location for the identifier position. This ordering of operations is consistent with the usual conventions of arithmetic, which tell us that multiplication has higher precedence than addition, and hence that the multiplication is to be performed before the addition.
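
A syntax-tree node can be represented quite directly. The sketch below (illustrative Python, not the book's data structure) builds exactly the tree just described: each interior node holds an operator, and its children are the operands:

class Node:
    def __init__(self, label, *children):
        self.label = label          # operator, (id, n) entry, or literal
        self.children = children

    def __repr__(self):
        if not self.children:       # a leaf prints as its label
            return str(self.label)
        kids = " ".join(map(repr, self.children))
        return f"({self.label} {kids})"

# position = initial + rate * 60, with * bound more tightly than +:
tree = Node("=", Node(("id", 1)),
                 Node("+", Node(("id", 2)),
                           Node("*", Node(("id", 3)), Node(60))))
print(tree)     # (= ('id', 1) (+ ('id', 2) (* ('id', 3) 60)))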
The subsequent phases of the compiler use the grammatical structure to help analyze the source program and generate the target program. In Chapter 4 we shall use context-free grammars to specify the grammatical structure of programming languages and discuss algorithms for constructing efficient syntax analyzers automatically from certain classes of grammars. In Chapters 2 and 5 we shall see that syntax-directed definitions can help specify the translation of programming language constructs.

1.2.3 Semantic Analysis


The semantic analyzer uses the syntax tree and the information in the symbol table to check the source program for semantic consistency with the language definition. It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent use during intermediate-code generation.
An important part of semantic analysis is type checking, where the compiler checks that each operator has matching operands. For example, many programming language definitions require an array index to be an integer; the compiler must report an error if a floating-point number is used to index an array.
The language specification may permit some type conversions called coercions. For example, a binary arithmetic operator may be applied to either a pair of integers or to a pair of floating-point numbers. If the operator is applied to a floating-point number and an integer, the compiler may convert or coerce the integer into a floating-point number.

Such a coercion appears in Fig. 1.7. Suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme 60 by itself forms an integer. The type checker in the semantic analyzer in Fig. 1.7 discovers that the operator * is applied to a floating-point number rate and an integer 60. In this case, the integer may be converted into a floating-point number. In Fig. 1.7, notice that the output of the semantic analyzer has an extra node for the operator inttofloat, which explicitly converts its integer argument into a floating-point number. Type checking and semantic analysis are discussed in Chapter 6.
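
The coercion step can be sketched as a bottom-up walk over the syntax tree. The Python below is illustrative only (the nested-tuple encoding of the tree and the types table are assumptions of the sketch, not the book's algorithm): each operand is typed first, and an integer operand that meets a floating-point one is wrapped in an inttofloat node:

def coerce(node, types):
    """Return (tree, type), inserting inttofloat where int meets float."""
    if not isinstance(node, tuple) or node[0] == "id":
        if isinstance(node, int):
            return node, "int"                  # integer literal
        return node, types[node]                # declared identifier type
    op, *args = node
    checked = [coerce(a, types) for a in args]
    if any(t == "float" for _, t in checked):   # mixed operands: coerce ints
        checked = [(("inttofloat", a), "float") if t == "int" else (a, t)
                   for a, t in checked]
    return (op, *[a for a, _ in checked]), checked[-1][1]

types = {("id", 1): "float", ("id", 2): "float", ("id", 3): "float"}
tree = ("=", ("id", 1), ("+", ("id", 2), ("*", ("id", 3), 60)))
print(coerce(tree, types)[0])
# ('=', ('id', 1), ('+', ('id', 2), ('*', ('id', 3), ('inttofloat', 60))))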

1.2.4 Intermediate Code Generation


In the process of translating a source program into target code, a compiler may construct one or more intermediate representations, which can have a variety of forms. Syntax trees are a form of intermediate representation; they are commonly used during syntax and semantic analysis.
After syntax and semantic analysis of the source program, many compilers generate an explicit low-level or machine-like intermediate representation, which we can think of as a program for an abstract machine. This intermediate representation should have two important properties: it should be easy to produce and it should be easy to translate into the target machine.
In Chapter 6, we consider an intermediate form called three-address code, which consists of a sequence of assembly-like instructions with three operands per instruction. Each operand can act like a register. The output of the intermediate code generator in Fig. 1.7 consists of the three-address code sequence

t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2                                    (1.3)
id1 = t3

There are several points worth noting about three-address instructions. First, each three-address assignment instruction has at most one operator on the right side. Thus, these instructions fix the order in which operations are to be done; the multiplication precedes the addition in the source program (1.1). Second, the compiler must generate a temporary name to hold the value computed by a three-address instruction. Third, some "three-address instructions" like the first and last in the sequence (1.3), above, have fewer than three operands.
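
A generator for such code can be sketched as a postorder walk of the syntax tree that emits one instruction per interior node and invents a fresh temporary for each result. The Python below is a minimal illustration (the nested-tuple tree and the naming scheme are assumptions of the sketch, not the book's generator); on the tree of Fig. 1.7 it reproduces sequence (1.3):

from itertools import count

def gen_three_address(tree):
    code, temps = [], count(1)
    def walk(node):
        if not isinstance(node, tuple):
            return str(node)                 # integer literal, e.g. 60
        if node[0] == "id":
            return f"id{node[1]}"            # ("id", 3) prints as id3
        op, *args = node
        operands = [walk(a) for a in args]
        if op == "=":                        # store into the left side
            code.append(f"{operands[0]} = {operands[1]}")
            return operands[0]
        t = f"t{next(temps)}"                # fresh temporary for the result
        rhs = (f"inttofloat({operands[0]})" if op == "inttofloat"
               else f"{operands[0]} {op} {operands[1]}")
        code.append(f"{t} = {rhs}")
        return t
    walk(tree)
    return code

tree = ("=", ("id", 1),
        ("+", ("id", 2), ("*", ("id", 3), ("inttofloat", 60))))
print("\n".join(gen_three_address(tree)))    # prints the four lines of (1.3)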
In Chapter 6, we cover the principal intermediate representations used in compilers. Chapter 5 introduces techniques for syntax-directed translation that are applied in Chapter 6 to type checking and intermediate-code generation for typical programming language constructs such as expressions, flow-of-control constructs, and procedure calls.

1.2.5 Code Optimization


The machine-independent code-optimization phase attempts to improve the intermediate code so that better target code will result. Usually better means faster, but other objectives may be desired, such as shorter code, or target code that consumes less power. For example, a straightforward algorithm generates the intermediate code (1.3), using an instruction for each operator in the tree representation that comes from the semantic analyzer.
A simple intermediate code generation algorithm followed by code optimization is a reasonable way to generate good target code. The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all at compile time, so the inttofloat operation can be eliminated by replacing the integer 60 by the floating-point number 60.0. Moreover, t3 is used only once to transmit its value to id1, so the optimizer can transform (1.3) into the shorter sequence

t1 = id3 * 60.0
id1 = id2 + t1                                   (1.4)

There is a great variation in the amount of code optimization different compilers perform. In those that do the most, the so-called "optimizing compilers," a significant amount of time is spent on this phase. There are simple optimizations that significantly improve the running time of the target program without slowing down compilation too much. The chapters from 8 on discuss machine-independent and machine-dependent optimizations in detail.
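
The two improvements just described can be sketched as a single pass over the three-address code. The Python below is deliberately simplistic and handles only the patterns in this example (it is an illustration, not a general optimizer): it folds inttofloat of an integer constant at compile time, and it merges a temporary that is immediately copied into a program variable:

import re

def optimize(code):
    subst, out = {}, []
    for instr in code:
        lhs, rhs = instr.split(" = ", 1)
        for name, value in subst.items():    # use folded constants
            rhs = re.sub(rf"\b{name}\b", value, rhs)
        folded = re.fullmatch(r"inttofloat\((\d+)\)", rhs)
        if folded:                           # do the conversion now
            subst[lhs] = folded.group(1) + ".0"
            continue
        if (re.fullmatch(r"\w+", rhs) and lhs.startswith("id")
                and out and out[-1].startswith(rhs + " = ")):
            # previous temporary is only copied into lhs: merge the two
            out[-1] = lhs + " = " + out[-1].split(" = ", 1)[1]
            continue
        out.append(f"{lhs} = {rhs}")
    return out

code = ["t1 = inttofloat(60)", "t2 = id3 * t1", "t3 = id2 + t2", "id1 = t3"]
print("\n".join(optimize(code)))
# t2 = id3 * 60.0
# id1 = id2 + t2

Applied to (1.3), this yields the two instructions of (1.4), up to the name of the surviving temporary.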

1.2.6 Code Generation


The code generator takes as input an intermediate representation of the source program and maps it into the target language. If the target language is machine code, registers or memory locations are selected for each of the variables used by the program. Then, the intermediate instructions are translated into sequences of machine instructions that perform the same task. A crucial aspect of code generation is the judicious assignment of registers to hold variables.
For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the machine code

LDF  R2, id3
MULF R2, R2, #60.0
LDF  R1, id2                                     (1.5)
ADDF R1, R1, R2
STF  id1, R1

The first operand of each instruction specifies a destination. The F in each instruction tells us that it deals with floating-point numbers. The code in

(1.5) loads the contents of address id3 into register R2, then multiplies it with floating-point constant 60.0. The # signifies that 60.0 is to be treated as an immediate constant. The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in register R2. Finally, the value in register R1 is stored into the address of id1, so the code correctly implements the assignment statement (1.1). Chapter 8 covers code generation.
This discussion of code generation has ignored the important issue of storage allocation for the identifiers in the source program. As we shall see in Chapter 7, the organization of storage at run-time depends on the language being compiled. Storage-allocation decisions are made either during intermediate code generation or during code generation.
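
A naive code generator for this style of target can be sketched as follows. The opcode names LDF, MULF, ADDF, STF, the # marker for immediates, and the register names follow example (1.5), but the allocation strategy of giving each loaded operand its own register is an assumption of this Python sketch, not the book's method:

OPS = {"*": "MULF", "+": "ADDF"}

def gen_target(code):
    asm, where = [], {}            # where: temporary -> holding register
    regs = iter(["R2", "R1"])      # just enough registers for (1.4)
    for instr in code:
        lhs, a, op, b = instr.replace(" = ", " ").split()
        reg = where.get(a)
        if reg is None:            # left operand is in memory: load it
            reg = next(regs)
            asm.append(f"LDF {reg}, {a}")
        rhs = where.get(b, f"#{b}" if b[0].isdigit() else b)
        asm.append(f"{OPS[op]} {reg}, {reg}, {rhs}")
        if lhs.startswith("id"):   # result is a program variable: store it
            asm.append(f"STF {lhs}, {reg}")
        else:                      # result is a temporary: keep it in reg
            where[lhs] = reg
    return asm

print("\n".join(gen_target(["t1 = id3 * 60.0", "id1 = id2 + t1"])))
# prints the five instructions of (1.5)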

1.2.7 Symbol-Table Management


An essential function of a compiler is to record the variable names used in the source program and collect information about various attributes of each name. These attributes may provide information about the storage allocated for a name, its type, its scope (where in the program its value may be used), and in the case of procedure names, such things as the number and types of its arguments, the method of passing each argument (for example, by value or by reference), and the type returned.
The symbol table is a data structure containing a record for each variable name, with fields for the attributes of the name. The data structure should be designed to allow the compiler to find the record for each name quickly and to store or retrieve data from that record quickly. Symbol tables are discussed in Chapter 2.
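
In its simplest form the symbol table is a hash table from names to attribute records, which provides the fast store and retrieve just described. The sketch below is illustrative Python (the field names are assumptions, and a real table would also track scopes, as discussed in Chapter 2):

class SymbolTable:
    def __init__(self):
        self.records = {}                       # name -> attribute record

    def insert(self, name, **attrs):
        # create the record on first sight, then add or update attributes
        self.records.setdefault(name, {"name": name}).update(attrs)

    def lookup(self, name):
        return self.records.get(name)           # None if undeclared

table = SymbolTable()
for name in ("position", "initial", "rate"):
    table.insert(name, type="float")
print(table.lookup("rate"))    # {'name': 'rate', 'type': 'float'}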

1.2.8 The Grouping of Phases into Passes


The discussion of phases deals with the logical organization of a compiler. In an implementation, activities from several phases may be grouped together into a pass that reads an input file and writes an output file. For example, the front-end phases of lexical analysis, syntax analysis, semantic analysis, and intermediate code generation might be grouped together into one pass. Code optimization might be an optional pass. Then there could be a back-end pass consisting of code generation for a particular target machine.
Some compiler collections have been created around carefully designed intermediate representations that allow the front end for a particular language to interface with the back end for a certain target machine. With these collections, we can produce compilers for different source languages for one target machine by combining different front ends with the back end for that target machine. Similarly, we can produce compilers for different target machines, by combining a front end with back ends for different target machines.

1.2.9 Compiler-Construction Tools


The compiler writer, like any software developer, can profitably use modern software-development environments containing tools such as language editors, debuggers, version managers, profilers, test harnesses, and so on. In addition to these general software-development tools, other more specialized tools have been created to help implement various phases of a compiler.
These tools use specialized languages for specifying and implementing specific components, and many use quite sophisticated algorithms. The most successful tools are those that hide the details of the generation algorithm and produce components that can be easily integrated into the remainder of the compiler. Some commonly used compiler-construction tools include

1. Parser generators that automatically produce syntax analyzers from a grammatical description of a programming language.

2. Scanner generators that produce lexical analyzers from a regular-expression description of the tokens of a language.

3. Syntax-directed translation engines that produce collections of routines for walking a parse tree and generating intermediate code.

4. Code-generator generators that produce a code generator from a collection of rules for translating each operation of the intermediate language into the machine language for a target machine.

5. Data-flow analysis engines that facilitate the gathering of information about how values are transmitted from one part of a program to each other part. Data-flow analysis is a key part of code optimization.

6. Compiler-construction toolkits that provide an integrated set of routines for constructing various phases of a compiler.

We shall describe many of these tools throughout this book.

1.3 The Evolution of Programming Languages


The first electronic computers appeared in the 1940's and were programmed in machine language by sequences of 0's and 1's that explicitly told the computer what operations to execute and in what order. The operations themselves were very low level: move data from one location to another, add the contents of two registers, compare two values, and so on. Needless to say, this kind of programming was slow, tedious, and error prone. And once written, the programs were hard to understand and modify.
