Compiler Construction Lectures
COMPILER
INTRODUCTION
The name "compiler" is primarily used for programs that translate source code from
a high-level programming language to a lower level language (e.g., assembly
language or machine code). If the compiled program can run on a computer
whose CPU or operating system is different from the one on which the compiler runs,
the compiler is known as a cross-compiler. A program that translates from a low level
language to a higher level one is a decompiler. A program that translates between
high-level languages is usually called a language translator, source to source
translator, or language converter. A language rewriter is usually a program that
translates the form of expressions without a change of language.
Program faults caused by incorrect compiler behavior can be very difficult to track
down and work around; therefore, compiler implementors invest significant effort to
ensure the correctness of their software.
Structure of a compiler
The front end checks whether the program is correctly written in terms of the
programming language syntax and semantics. Here legal and illegal programs are
recognized. Errors are reported, if any, in a useful way. Type checking is also
performed by collecting type information. The frontend then generates
an intermediate representation or IR of the source code for processing by the middle-
end.
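As an illustration of the kind of intermediate representation a front end might produce, here is a minimal sketch (the AST classes and the three-address style are illustrative, not any particular compiler's IR) that lowers a small expression tree into three-address code:

from dataclasses import dataclass

@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def lower(node):
    """Lower an expression AST into a list of three-address instructions."""
    code = []
    counter = 0
    def walk(n):
        nonlocal counter
        if isinstance(n, Num):
            return str(n.value)
        left = walk(n.left)
        right = walk(n.right)
        counter += 1
        temp = f"t{counter}"
        code.append(f"{temp} = {left} {n.op} {right}")
        return temp
    walk(node)
    return code

# The expression 10 - 2 * 5, parsed as 10 - (2 * 5)
print(lower(BinOp("-", Num(10), BinOp("*", Num(2), Num(5)))))
# ['t1 = 2 * 5', 't2 = 10 - t1']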
The middle end is where optimization takes place. Typical transformations for
optimization are removal of useless or unreachable code, discovery and propagation
of constant values, relocation of computation to a less frequently executed place (e.g.,
out of a loop), or specialization of computation based on the context. The middle-end
generates another IR for the following backend. Most optimization efforts are focused
on this part.
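As a small sketch of one such transformation (the instruction format is invented for illustration and is not from these notes), constant values can be discovered and propagated through straight-line three-address code; instructions whose results become known constants simply disappear from the output:

def fold_constants(code):
    """code: list of (target, op, a, b) tuples meaning target = a op b."""
    known = {}                                   # variable -> known constant value
    remaining = []
    for target, op, a, b in code:
        a = known.get(a, a)                      # propagate known constants
        b = known.get(b, b)
        if isinstance(a, int) and isinstance(b, int):
            known[target] = {"+": a + b, "-": a - b, "*": a * b}[op]   # fold
        else:
            remaining.append((target, op, a, b))
    return remaining, known

code = [("t1", "*", 2, 5), ("t2", "-", 10, "t1"), ("t3", "+", "x", "t2")]
print(fold_constants(code))
# ([('t3', '+', 'x', 0)], {'t1': 10, 't2': 0})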
Compiler output
A native or hosted compiler is one whose output is intended to run directly on the
same type of computer and operating system that the compiler itself runs on. The
output of a cross compiler is designed to run on a different platform. Cross compilers
are often used when developing software for embedded systems that are not intended
to support a software development environment.
The output of a compiler that produces code for a virtual machine (VM) may or may
not be executed on the same platform as the compiler that produced it. For this reason
such compilers are not usually classified as native or cross compilers.
The lower level language that is the target of a compiler may itself be a high-level
programming language. C, often viewed as a kind of portable assembly language, can also
be the target language of a compiler. For example, Cfront, the original compiler for C++,
used C as its target language. The C created by such a compiler is usually not intended to
be read and maintained by humans, so indentation style and readable C intermediate code are
irrelevant. Some features of C make it a good target language. For example, C code
with #line directives can be generated to support debugging of the original source.
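A minimal sketch of this idea (the source file name and the translated statements are made up for illustration): a code generator that emits C and inserts #line directives so that tools working on the generated C can refer back to the original source lines.

def emit_c(statements, source_name="example.src"):
    """statements: list of (original_line_number, translated_C_statement)."""
    out = ["#include <stdio.h>", "int main(void) {"]
    for lineno, c_stmt in statements:
        out.append(f'#line {lineno} "{source_name}"')   # map back to the original source
        out.append("    " + c_stmt)
    out.append("    return 0;")
    out.append("}")
    return "\n".join(out)

print(emit_c([(1, "int x = 10 - 2 * 5;"),
              (2, 'printf("%d\\n", x);')]))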
Compiler construction
In the early days, the approach taken to compiler design was directly affected
by the complexity of the processing, the experience of the person(s) designing it, and
the resources available.
A compiler for a relatively simple language written by one person might be a single,
monolithic piece of software. When the source language is large and complex, and
high quality output is required, the design may be split into a number of relatively
independent phases. Having separate phases means development can be parceled up
into small parts and given to different people. It also becomes much easier to replace a
single phase by an improved one or to insert new phases later (e.g., additional
optimizations).
All but the smallest of compilers have more than two phases. However, these phases
are usually regarded as being part of the front end or the back end. The point at which
these two ends meet is open to debate. The front end is generally considered to be
where syntactic and semantic processing takes place, along with translation to a lower
level of representation (than source code).
The back end takes the output from the middle end. It may perform more analysis,
transformations and optimizations that are specific to a particular computer. Then, it
generates code for a particular processor and OS.
The ability to compile in a single pass has classically been seen as a benefit because it
simplifies the job of writing a compiler and one-pass compilers generally perform
compilations faster than multi-pass compilers. Thus, partly driven by the resource
limitations of early systems, many early languages were specifically designed so that
they could be compiled in a single pass (e.g., Pascal).
In some cases the design of a language feature may require a compiler to perform
more than one pass over the source. For instance, consider a declaration appearing on
line 20 of the source which affects the translation of a statement appearing on line 10.
In this case, the first pass needs to gather information about declarations appearing
after statements that they affect, with the actual translation happening during a
subsequent pass.
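A minimal two-pass sketch of this situation (the toy declaration syntax is invented for illustration): the first pass only records declarations, so that the second pass can translate a statement that uses a name declared later in the source.

source = [
    "print x",          # line 1 uses x ...
    "declare x int",    # ... but x is only declared on line 2
]

# Pass 1: gather declaration information from the whole program.
types = {}
for line in source:
    parts = line.split()
    if parts[0] == "declare":
        types[parts[1]] = parts[2]

# Pass 2: translate, now that every declaration is known.
for line in source:
    parts = line.split()
    if parts[0] == "print":
        name = parts[1]
        print(f"emit: print {name} as {types[name]}")   # emit: print x as int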
While the typical multi-pass compiler outputs machine code from its final pass, there
are several other types, such as source-to-source compilers and compilers whose final
output is code for a virtual machine.
Front end
The compiler frontend analyzes the source code to build an internal representation of
the program, called the intermediate representation or IR. It also manages the symbol
table, a data structure mapping each symbol in the source code to associated
information such as location, type and scope. This is done over several phases, which
include some of the following: lexical analysis, syntax analysis (parsing), and semantic
analysis.
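A minimal sketch of such a symbol table (the method names and stored fields are illustrative): a stack of scopes, each mapping a name to information such as its type and the line where it was declared.

class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                          # innermost scope is last

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, type_, line):
        self.scopes[-1][name] = {"type": type_, "line": line}

    def lookup(self, name):
        for scope in reversed(self.scopes):         # search inner scopes first
            if name in scope:
                return scope[name]
        return None                                 # undeclared symbol

table = SymbolTable()
table.declare("x", "int", line=3)
table.enter_scope()
table.declare("x", "float", line=7)                 # shadows the outer x
print(table.lookup("x"))                            # {'type': 'float', 'line': 7}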
Back end
The term back end is sometimes confused with code generator because of the
overlapped functionality of generating assembly code. Some literature uses middle
end to distinguish the generic analysis and optimization phases in the back end from
the machine-dependent code generators.
Compiler analysis is the prerequisite for any compiler optimization, and the two work
tightly together. For example, dependence analysis is crucial for loop transformation.
In addition, the scope of compiler analysis and optimization varies greatly, from as
small as a basic block to the procedure/function level, or even over the whole program
(inter-procedural optimization). Obviously, a compiler can potentially do a better job
using a broader view. But that broad view is not free: large scope analysis and
optimizations are very costly in terms of compilation time and memory space; this is
especially true for inter-procedural analysis and optimizations.
Due to the extra time and space needed for compiler analysis and optimizations, some
compilers skip them by default. Users have to use compilation options to explicitly
tell the compiler which optimizations should be enabled.
Compiler correctness
Compiler correctness is the branch of software engineering that deals with trying to
show that a compiler behaves according to its language specification. Techniques
include developing the compiler using formal methods and using rigorous testing
(often called compiler validation) on an existing compiler.
[Figure: two parse trees for the expression 10 - 2 * 5 built from the ambiguous
production E –> E OP E; one tree groups the subtraction first, (10 - 2) * 5, and the
other groups the multiplication first, 10 - (2 * 5).]
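The two groupings evaluate to different results, which is why it matters which tree the parser builds (a quick check, assuming ordinary integer arithmetic):

print((10 - 2) * 5)   # grouping the subtraction first gives 40
print(10 - (2 * 5))   # grouping the multiplication first gives 0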
Recursive productions
Productions are often defined in terms of themselves. For example a list of variables
in a programming language grammar could be specified by this production:
variable_list –> variable | variable_list , variable
Such productions are said to be recursive. If the recursive nonterminal is at the left of
the right-side of the production, e.g. A –> u | Av, we call the production left-recursive.
Similarly, we can define a right-recursive production: A –> u | vA. Some parsing
techniques have trouble with one or the other variant of recursive productions, so
sometimes we have to massage the grammar into a different but equivalent form. Left-
recursive productions can be especially troublesome in top-down parsers. Handily,
there is a simple technique for rewriting the grammar to move the
recursion to the other side. For example, consider this left-recursive rule:
X –> Xa | Xb | AB | C | DEF
To convert the rule, we introduce a new nonterminal X' that we append to the end of
all non-left-recursive productions for X. The expansion for the new nonterminal is
basically the reverse of the original left-recursive rule. The re-written productions are:
X –> ABX' | CX' | DEFX'
X' –> aX' | bX' | ε
It appears we just exchanged the left-recursive rules for an equivalent right-recursive
version. This might seem pointless, but some parsing algorithms prefer or even
require only left or right recursion.
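A minimal sketch of this rewriting for immediate left recursion (the grammar representation, a mapping from a nonterminal to a list of alternatives, is invented for illustration):

def eliminate_left_recursion(nonterminal, alternatives):
    """Each alternative is a list of symbols; left-recursive ones start with the nonterminal."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}
    new = nonterminal + "'"
    return {
        nonterminal: [alt + [new] for alt in others],          # X -> AB X' | C X' | DEF X'
        new: [tail + [new] for tail in recursive] + [["ε"]],   # X' -> a X' | b X' | ε
    }

grammar = eliminate_left_recursion(
    "X", [["X", "a"], ["X", "b"], ["A", "B"], ["C"], ["D", "E", "F"]])
for lhs, alts in grammar.items():
    print(lhs, "->", " | ".join(" ".join(alt) for alt in alts))
# X -> A B X' | C X' | D E F X'
# X' -> a X' | b X' | ε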
Left-factoring
The parser usually reads tokens from left to right and it is convenient if, upon reading
a token, it can make an immediate decision about which production from the grammar
to expand. However, this can cause trouble if there are productions that have common
first symbol(s) on the right side of the productions. Here is an example we often see in
programming language grammars:
Stmt –> if Cond then Stmt else Stmt | if Cond then Stmt | Other | ....
The common prefix is if Cond then Stmt. This causes problems because when a parser
encounters an “if”, it does not know which production to use. A useful technique called
left-factoring allows us to restructure the grammar to avoid this situation. We rewrite
the productions to defer the decision about which of the options to choose until we
have seen enough of the input to make the appropriate choice. We factor out the
common part of the two options into a shared rule that both will use and then add a
new rule that picks up where the tokens diverge.
Stmt –> if Cond then Stmt OptElse | Other | …
OptElse –> else Stmt | ε
In the re-written grammar, upon reading an “if” we expand the first production and wait
until if Cond then Stmt has been seen to decide whether to expand OptElse to else Stmt or
ε.
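A minimal recursive-descent sketch for the re-written grammar (token handling is simplified, and Cond and the non-if statement bodies are single tokens purely for illustration):

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def eat(self, expected):
        assert self.peek() == expected, f"expected {expected}, got {self.peek()}"
        self.pos += 1

    def next_token(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def stmt(self):
        if self.peek() == "if":                  # only one production now starts with "if"
            self.eat("if")
            cond = self.next_token()
            self.eat("then")
            then_part = self.stmt()
            return ("if", cond, then_part, self.opt_else())
        return ("other", self.next_token())

    def opt_else(self):
        if self.peek() == "else":                # decide only once "else" is (or is not) seen
            self.eat("else")
            return self.stmt()
        return None                              # the ε alternative

print(Parser(["if", "c", "then", "s1", "else", "s2"]).stmt())
# ('if', 'c', ('other', 's1'), ('other', 's2'))
print(Parser(["if", "c", "then", "s1"]).stmt())
# ('if', 'c', ('other', 's1'), None)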