Chapter One-Introduction
Compiler and its various phases-Cousins of Compiler-The Grouping of Phases-Compiler Construction Tools
A compiler also reports any errors in the source program that it detects during the translation process
If the target program is an executable machine-language program, it can then be called by the user to process
input and produce output
There are two parts responsible for mapping a source program into a semantically equivalent target program:
analysis and synthesis
The analysis part breaks up the program into constituent pieces and imposes grammatical structure on
them
It then uses this structure to create an intermediate representation of the source program
If the analysis part detects errors (syntactic or semantic), it provides informative messages
The analysis part also collects information about the source program and stores it in a data structure called
a symbol table, which is passed along with the intermediate representation to the synthesis part
The synthesis part constructs the desired target program from the intermediate representation and the
information in the symbol table
The analysis part is often called the front end of the compiler; the synthesis part the back end
The compilation process operates as a sequence of phases, each of which transforms one representation of
the source program into another
A typical decomposition of a compiler into phases is shown in Figure 1.3
In practice several phases may be grouped together and the intermediate representations between need not
be constructed explicitly
The symbol table, which stores information about the entire source program, is used by all phases of the
compiler
Lexical Analysis
The characters in the assignment statement
position = initial + rate * 60 (1.1)
could be grouped into the following lexemes and mapped into the following tokens passed on to the syntax
analyzer:
1. position is a lexeme that would be mapped into a token <id, 1>, where id is an abstract symbol
standing for identifier and 1 points to the symbol table entry for position. The symbol table entry holds
information about the identifier, such as its name and type
2. = is a lexeme that is mapped into the token <=>. Since it needs no attribute value, the second
component is omitted
3. initial is a lexeme that would be mapped into a token <id, 2>, where 2 points to the symbol table entry
for initial
4. + is a lexeme that is mapped into the token <+>
5. rate is a lexeme that would be mapped into a token <id, 3>, where 3 points to the symbol table entry
for rate
6. * is a lexeme that is mapped into the token <*>
7. 60 is a lexeme that is mapped into the token <60>
After lexical analysis, the assignment statement (1.1) is represented as the sequence of tokens
<id, 1> <=> <id, 2> <+> <id, 3> <*> <60> (1.2)
In this representation, the token names =, +, and * are abstract symbols for the assignment, addition, and
multiplication operators, respectively
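The grouping of characters into lexemes and tokens described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `TOKEN_RE` and `tokenize`, the regular expression, and the use of a plain list as the symbol table are hypothetical choices, not part of any particular compiler.

```python
import re

# Hypothetical token pattern: identifiers, integer literals, and the operators = + *
TOKEN_RE = re.compile(r"\s*(?:(?P<id>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<op>[=+*]))")

def tokenize(source):
    """Return (tokens, symbol_table) for one assignment statement."""
    symbol_table = []   # entry number i (1-based) holds the name of identifier i
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        pos = match.end()
        if match.group("id"):
            name = match.group("id")
            if name not in symbol_table:
                symbol_table.append(name)          # first occurrence: new entry
            tokens.append(("id", symbol_table.index(name) + 1))
        elif match.group("num"):
            tokens.append((match.group("num"),))   # token without attribute value
        else:
            tokens.append((match.group("op"),))    # operator token, no attribute
    return tokens, symbol_table

tokens, symbol_table = tokenize("position = initial + rate * 60")
```

Here `tokens` holds the pairs corresponding to <id, 1>, <=>, <id, 2>, <+>, <id, 3>, <*>, <60>, and `symbol_table` holds the three identifier names in order of first appearance.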
Syntax Analysis
Semantic Analysis
The semantic analyzer uses the syntax tree and the information in the symbol table to check the source
program for semantic consistency with the language definition
It also gathers type information and saves it in either the syntax tree or the symbol table, for subsequent
use during intermediate code generation
An important part of semantic analysis is type checking, where the compiler checks that each operator has
matching operands, e.g., many programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used instead
A language specification may permit type coercion, e.g., if a binary arithmetic operator is applied to an integer
and a floating-point operand, the compiler may convert or coerce the integer into a floating-point number
Suppose that position, initial, and rate have been declared to be floating-point numbers, and that the lexeme
60 by itself forms an integer
The semantic analyzer then converts the integer 60 to a floating-point number before * is applied
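The coercion step can be illustrated with a small type checker. This is a sketch under assumptions: the tree is encoded as nested tuples, `("id", n)` for symbol table entry n, `("num", v)` for an integer literal, and `(op, left, right)` for a binary operator. The function name `check` and the node tag `inttofloat` are hypothetical, and only the types "int" and "float" are handled.

```python
def check(node, types):
    """Return (typed_node, type), inserting an inttofloat node where an
    integer operand meets a floating-point operand."""
    kind = node[0]
    if kind == "id":
        return node, types[node[1]]      # declared type from the symbol table
    if kind == "num":
        return node, "int"               # an integer literal such as 60
    op, left, right = node
    left, ltype = check(left, types)
    right, rtype = check(right, types)
    if ltype != rtype:
        # Coerce whichever operand is the integer
        if ltype == "int":
            left = ("inttofloat", left)
        else:
            right = ("inttofloat", right)
    result = "float" if "float" in (ltype, rtype) else "int"
    return (op, left, right), result

# initial + rate * 60, with entries 2 and 3 declared as floating-point
declared = {1: "float", 2: "float", 3: "float"}
tree = ("+", ("id", 2), ("*", ("id", 3), ("num", 60)))
typed_tree, result_type = check(tree, declared)
```

After the call, the literal 60 is wrapped in an `inttofloat` node and the whole expression has type "float".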
Intermediate Code Generation
After syntax and semantic analysis, many compilers generate an explicit low-level or machine-like
intermediate representation, which we can think of as a program for an abstract machine
This intermediate representation should have two properties: it should be easy to produce and it should be
easy to translate into the target machine
One such intermediate representation, called three-address code, consists of assembly-like instructions
with a maximum of three operands per instruction (and at most one operator on the right side of an
assignment operator)
Each operand can act like a register
The output of the intermediate code generator can consist of the three-address code sequence
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2
id1 = t3 (1.3)
The compiler must also generate a temporary name to hold the value computed by each three-address
instruction
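One way to see how a sequence like (1.3) arises is to walk the expression tree bottom-up and emit one instruction per operator, creating a fresh temporary each time. The sketch below assumes a tuple encoding of the tree (`("id", n)`, `("num", v)`, `("inttofloat", child)`, `(op, left, right)`); the function name `gen_three_address` and this encoding are hypothetical, and the conversion is spelled `inttofloat` in the output.

```python
def gen_three_address(target, expr):
    """Emit three-address code assigning expr to symbol table entry `target`."""
    code = []
    counter = 0

    def new_temp():
        # Generate a fresh temporary name: t1, t2, ...
        nonlocal counter
        counter += 1
        return f"t{counter}"

    def walk(node):
        kind = node[0]
        if kind == "id":
            return f"id{node[1]}"
        if kind == "num":
            return str(node[1])
        if kind == "inttofloat":
            arg = walk(node[1])
            temp = new_temp()
            code.append(f"{temp} = inttofloat({arg})")
            return temp
        # Binary operator: operands first, then one instruction for the operator
        left = walk(node[1])
        right = walk(node[2])
        temp = new_temp()
        code.append(f"{temp} = {left} {kind} {right}")
        return temp

    result = walk(expr)
    code.append(f"id{target} = {result}")
    return code

tree = ("+", ("id", 2), ("*", ("id", 3), ("inttofloat", ("num", 60))))
three_address = gen_three_address(1, tree)
```

For this tree the generated list contains the same four instructions as (1.3), in the same order.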
Code Optimization
The machine-independent code-optimization phase attempts to improve the intermediate code so that
better target code will result
Usually better means faster, but other objectives may be desired, such as shorter code or target code that
consumes less power
For example, an algorithm may generate the intermediate code (1.3), using one instruction for each operator in
the tree representation that comes from the semantic analyzer
The optimizer can deduce that the conversion of 60 from integer to floating point can be done once and for all
at compile time, so the int-to-float operation can be eliminated by replacing the integer 60 with the
floating-point number 60.0
Moreover, t3 is used only once, to transmit its value to id1, so the optimizer can transform (1.3) into the
shorter sequence
t1 = id3 * 60.0
id1 = id2 + t1 (1.4)
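The two improvements described above, performing the conversion at compile time and removing the copy through t3, can be sketched as passes over a list of instructions. The tuple format `(dest, op, arg1, arg2)`, the op names `inttofloat` and `copy`, and the function names are assumptions made for this illustration; the sketch also assumes every temporary is defined exactly once.

```python
def count_uses(code, name):
    """Count how often `name` appears as an operand."""
    return sum(1 for _, _, a, b in code for operand in (a, b) if operand == name)

def optimize(code):
    # Pass 1: do inttofloat on integer literals at compile time
    constants = {}
    folded = []
    for dest, op, a, b in code:
        a = constants.get(a, a)
        b = constants.get(b, b)
        if op == "inttofloat" and a.isdigit():
            constants[dest] = str(float(a))      # e.g. "60" becomes "60.0"
        else:
            folded.append((dest, op, a, b))

    # Pass 2: fold "x = t" into the preceding instruction when t is used only there
    merged = []
    for dest, op, a, b in folded:
        if op == "copy" and merged and merged[-1][0] == a and count_uses(folded, a) == 1:
            previous = merged.pop()
            merged.append((dest,) + previous[1:])
        else:
            merged.append((dest, op, a, b))

    # Pass 3: renumber the remaining temporaries as t1, t2, ...
    renamed = {}
    result = []
    for dest, op, a, b in merged:
        a = renamed.get(a, a)
        b = renamed.get(b, b)
        if dest.startswith("t") and dest[1:].isdigit():
            renamed[dest] = f"t{len(renamed) + 1}"
            dest = renamed[dest]
        result.append((dest, op, a, b))
    return result

code_1_3 = [
    ("t1", "inttofloat", "60", None),
    ("t2", "*", "id3", "t1"),
    ("t3", "+", "id2", "t2"),
    ("id1", "copy", "t3", None),
]
code_1_4 = optimize(code_1_3)
```

Applied to the encoding of (1.3), the result is the two-instruction sequence corresponding to (1.4).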
Code Generation
The code generator takes as input an intermediate representation of the source program and maps it into
the target language
If the target language is machine language, registers or memory locations are selected for each of the
variables used by the program
Then, the intermediate instructions are translated into sequences of machine instructions to perform the
same task
A crucial aspect of code generation is the judicious assignment of registers to hold variables
For example, using registers R1 and R2, the intermediate code in (1.4) might get translated into the
machine code
LDF R2, id3
MULF R2, R2, #60.0
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1 (1.5)
The first operand of each instruction specifies a destination
The F in each instruction tells us that it deals with floating-point numbers
The code in (1.5) loads the contents of address id3 into register R2, then multiplies it by the floating-point
constant 60.0
The # signifies that 60.0 is to be treated as an immediate constant
The third instruction moves id2 into register R1 and the fourth adds to it the value previously computed in
register R2
Finally, the value in register R1 is stored into the address id1, so the code correctly implements the
assignment statement (1.1)
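The translation from (1.4) to (1.5) can be imitated with a deliberately naive generator. This is a sketch under assumptions: instructions are tuples `(dest, op, arg1, arg2)`, names starting with `t` are temporaries kept in registers, registers are handed out in the order given and never reclaimed, and the first operand is never a constant. The names `generate`, `OPCODES`, and `is_constant` are hypothetical.

```python
OPCODES = {"+": "ADDF", "*": "MULF"}

def is_constant(operand):
    """True if the operand is a numeric literal rather than a name."""
    try:
        float(operand)
        return True
    except ValueError:
        return False

def generate(code, registers):
    free = list(registers)     # registers not yet assigned
    in_register = {}           # temporary name -> register holding its value
    out = []
    for dest, op, a, b in code:
        # First operand: reuse the register of a temporary, else load from memory
        if a in in_register:
            reg = in_register[a]
        else:
            reg = free.pop(0)
            out.append(f"LDF {reg}, {a}")
        # Second operand: immediate constant, register, or freshly loaded value
        if is_constant(b):
            rhs = f"#{b}"
        elif b in in_register:
            rhs = in_register[b]
        else:
            rhs = free.pop(0)
            out.append(f"LDF {rhs}, {b}")
        out.append(f"{OPCODES[op]} {reg}, {reg}, {rhs}")
        if dest.startswith("t"):
            in_register[dest] = reg             # temporary stays in its register
        else:
            out.append(f"STF {dest}, {reg}")    # program variable goes to memory
    return out

code_1_4 = [("t1", "*", "id3", "60.0"), ("id1", "+", "id2", "t1")]
machine_code = generate(code_1_4, ["R2", "R1"])
```

The register order `["R2", "R1"]` is chosen only so that the output lines up with (1.5); a real code generator would decide register assignment far more carefully.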
Symbol-Table Management
An essential function of a compiler is to record the variable names used in the source program and collect
information about various attributes of each name
These attributes may provide information about the storage allocated for a name, its type, its scope (where
in the program its value may be used), and, in the case of procedure names, such things as the number
and types of its arguments, the method of passing each argument (e.g., by value or by reference), and the
type returned
The symbol table is a data structure containing a record for each variable name, with fields for attributes of
the name
The data structure should be designed to allow the compiler to find the record for each name quickly and
to store or retrieve data from that record quickly
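A hash-based dictionary is one common way to meet the requirement of fast lookup and update. The sketch below is a minimal illustration, not a prescribed design: the class names `Entry` and `SymbolTable`, the method names, and the free-form `attributes` dictionary are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One record of the symbol table."""
    name: str
    type: str
    attributes: dict = field(default_factory=dict)  # e.g. scope, storage, parameters

class SymbolTable:
    def __init__(self):
        self._entries = {}   # name -> Entry; average constant-time access

    def insert(self, name, type_, **attributes):
        """Create the record for a newly declared name."""
        if name in self._entries:
            raise KeyError(f"{name!r} is already declared")
        entry = Entry(name, type_, attributes)
        self._entries[name] = entry
        return entry

    def lookup(self, name):
        """Return the record for `name`, or None if it was never declared."""
        return self._entries.get(name)

table = SymbolTable()
for identifier in ("position", "initial", "rate"):
    table.insert(identifier, "float", scope="global")
rate_entry = table.lookup("rate")
```

Later phases can then read or update the record, for example `rate_entry.attributes["storage"] = ...`, without searching the table again.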
Cousins of Compiler
An interpreter is another common kind of language processor: instead of producing a target program as
a translation, it appears to directly execute the operations specified in the source program on
input supplied by the user
The machine-language target program produced by a compiler is usually much faster than an interpreter at
mapping inputs to outputs
An interpreter can usually give better error diagnostics than a compiler, because it executes the source
program statement by statement
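To make the contrast concrete, the sketch below executes an assignment directly on an expression tree instead of translating it. The tuple encoding (`("name", identifier)`, `("num", value)`, `(op, left, right)`) and the function names `evaluate` and `interpret` are assumptions for illustration only.

```python
import operator

OPERATORS = {"+": operator.add, "*": operator.mul}

def evaluate(node, env):
    """Compute the value of an expression tree using the variable values in env."""
    kind = node[0]
    if kind == "name":
        return env[node[1]]      # value supplied by the user
    if kind == "num":
        return node[1]
    return OPERATORS[kind](evaluate(node[1], env), evaluate(node[2], env))

def interpret(target, expr, env):
    """Execute the assignment target = expr directly; no target program is produced."""
    env[target] = evaluate(expr, env)
    return env

# position = initial + rate * 60, with user-supplied inputs
env = {"initial": 10.0, "rate": 2.0}
tree = ("+", ("name", "initial"), ("*", ("name", "rate"), ("num", 60)))
interpret("position", tree, env)
```

With these inputs, `env["position"]` becomes 130.0; every run repeats the tree walk, which is one reason compiled code is usually faster.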
Several other programs may be needed in addition to a compiler to create an executable program as shown
in Figure 1.2.
The task of a preprocessor (a separate program) is to collect the modules of a program stored in separate files
It may also expand shorthands, called macros, into source language statements
The modified source program is fed to a compiler
The compiler may produce an assembly-language program as its output, because assembly language is
easier to produce as an output and easier to debug
The assembly-language program is then processed by a program called an assembler, which produces
relocatable machine code as its output
Large programs are often compiled in pieces, so the relocatable machine code may have to be linked
with other relocatable object files and library files into the code that actually runs on the machine
The linker resolves external memory addresses, where the code in one file may refer to a location in
another file
The loader then puts together all executable object files into memory for execution