What Are Compilers?
A compiler is a program which translates a program written in one language (the source language)
into an equivalent program in another language (the target language). Usually the source language is
a high-level language like Java, C, or Fortran, whereas the target language is machine code or
"code" that a computer's processor understands. The source language is optimized for humans: it
is more user-friendly and, to some extent, platform-independent. High-level programs are easier to
read, write, and maintain, and hence it is easier to avoid errors. Ultimately, programs written in a
high-level language must be translated into machine language by a compiler. The target machine
language is efficient for hardware but lacks readability.
Compilers
. Translate, typically from high-level source code to low-level machine code or object code
- Redundancy is reduced
How to translate?
High-level languages and machine languages differ in their level of abstraction. At the machine level
we deal with memory locations and registers, whereas these resources are never accessed directly in
high-level languages. The level of abstraction also differs from language to language, and some
languages are farther from machine code than others.
. Goals of translation
Good performance for generated code: A metric for the quality of the generated code is the
ratio between the size of the compiled machine code and that of hand-written code for the same
program. A better compiler is one which generates smaller code, so for optimizing compilers this
ratio is smaller.
- Maintainable code
Correctness: A compiler's most important goal is correctness - all valid programs must compile
correctly. How do we check whether a compiler is correct, i.e. whether a compiler for a programming
language generates correct machine code for programs in that language? The complexity of
writing a correct compiler is a major limitation on the amount of optimization that can be done.
Many modern compilers share a common 'two stage' design. The "front end" translates the
source language, i.e. the high-level program, into an intermediate representation. The second stage
is the "back end", which works with the intermediate representation to produce code in the output
language, which is a low-level code. The higher the abstraction a compiler can support, the better
it is.
All development systems are essentially a combination of many tools. Besides the compiler, the other
tools are the debugger, assembler, linker, loader, profiler, editor etc. If these tools have support for
each other, then program development becomes a lot easier.
This is how the various tools work in coordination to make programming easier and better. Each
has a specific task to accomplish in the process, from writing the code to compiling it and
running/debugging it; based on the debugging results, the programmer then makes manual
corrections in the code if needed. It is the combined contribution of these tools that makes
programming a lot easier and more efficient.
In order to translate high-level code to machine code one needs to go step by step, with each
step doing a particular task and passing its output to the next step in the form of another
program representation. The steps can be parse tree generation, high-level intermediate code
generation, low-level intermediate code generation, and then the machine language conversion.
As the translation proceeds the representation becomes more and more machine specific,
increasingly dealing with registers, memory locations etc.
. Translate in steps. Each step handles a reasonably simple, logical, and well defined task
. Representations become more machine specific and less language specific as the translation
proceeds
The first few steps :
The first few steps of compilation, like lexical, syntax and semantic analysis, can be understood
by drawing analogies to the human way of comprehending a natural language. The first step in
understanding a natural language is to recognize characters, i.e. the upper and lower case
alphabets, digits, punctuation marks, white spaces etc. Similarly the compiler has to
recognize the characters used in a programming language. The next step is to recognize the
words, which come from a dictionary. Similarly, programming languages have a dictionary as
well as rules to construct words (numbers, identifiers etc.).
- English text consists of lower and upper case alphabets, digits, punctuations and white spaces
- Written programs consist of characters from the ASCII character set (normally 9-13, 32-126)
. The next step to understand the sentence is recognizing words (lexical analysis)
- Programming languages have a dictionary (keywords etc.) and rules for constructing words
(identifiers, numbers etc.)
Lexical Analysis
. The language must define rules for breaking a sentence into a sequence of words.
. In programming languages a character from a different class may also be treated as word
separator.
. The lexical analyzer breaks a sentence into a sequence of words or tokens:
- If a == b then a = 1 ; else a = 2 ;
- Sequence of words (total 14 words): if, a, ==, b, then, a, =, 1, ;, else, a, =, 2, ;
In simple words, lexical analysis is the process of identifying the words in an input string of
characters, so that they may be handled more easily by a parser. These words must be separated by
some predefined delimiter, or there may be some rules imposed by the language for breaking the
sentence into tokens or words, which are then passed on to the next phase of syntax analysis. In
programming languages, a character from a different class may also be considered as a word
separator.
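The notes do not include an implementation, but a minimal sketch in C of such a lexical analyzer
(the token classes, the special handling of "==", and the output format are assumptions made only
for this illustration) might look like the following. Run on the example sentence above, it emits
the same 14 tokens.

#include <ctype.h>
#include <stdio.h>

/* Print the tokens found in 'input', one per line.  Words (identifiers and
   keywords), numbers, the two-character operator "==" and single-character
   symbols are the only classes handled in this sketch. */
static void tokenize(const char *input)
{
    const char *p = input;
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {            /* white space separates words */
            p++;
        } else if (isalpha((unsigned char)*p)) {     /* identifier or keyword */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("word:   %.*s\n", (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {     /* number */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("number: %.*s\n", (int)(p - start), start);
        } else if (*p == '=' && *(p + 1) == '=') {   /* two-character operator */
            printf("symbol: ==\n");
            p += 2;
        } else {                                     /* any other character is its own token */
            printf("symbol: %c\n", *p);
            p++;
        }
    }
}

int main(void)
{
    tokenize("if a == b then a = 1 ; else a = 2 ;");
    return 0;
}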
The next step
. Once the words are understood, the next step is to understand the structure of the sentence
Syntax analysis (also called parsing) is the process of imposing a hierarchical (tree-like)
structure on the token stream. It is basically like generating sentences of the language using
language-specific grammatical rules, as we have in our natural language.
Ex.: sentence -> subject + verb + object. Such an example shows how a sentence in
English (a natural language) can be broken down into a tree form depending on the construct of
the sentence.
Parsing
Just like a natural language, a programming language also has a set of grammatical rules and
hence can be broken down into a parse tree by the parser. It is on this parse tree that the further
steps of semantic analysis are carried out. The parse tree is also used during the generation of the
intermediate code. Yacc (yet another compiler compiler) is a program that generates parsers in the
C programming language.
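As a hedged sketch (hand-written rather than generated by Yacc, and not part of the original
notes), a recursive-descent parser for the tiny grammar expr -> term { '+' term }, term -> digit
shows how grammatical rules impose a tree structure on the token stream:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Parse tree node for the toy grammar
     expr -> term { '+' term }
     term -> digit                                   */
struct node {
    char op;                 /* '+' for an addition node, 'd' for a digit leaf */
    int value;               /* digit value for leaves */
    struct node *left, *right;
};

static const char *input;    /* current position in the token stream */

static struct node *new_node(char op, int value, struct node *l, struct node *r)
{
    struct node *n = malloc(sizeof *n);
    n->op = op; n->value = value; n->left = l; n->right = r;
    return n;
}

static struct node *parse_term(void)
{
    if (!isdigit((unsigned char)*input)) {
        fprintf(stderr, "syntax error at '%c'\n", *input);
        exit(1);
    }
    return new_node('d', *input++ - '0', NULL, NULL);
}

static struct node *parse_expr(void)
{
    struct node *left = parse_term();
    while (*input == '+') {              /* each '+' adds one level to the tree */
        input++;
        left = new_node('+', 0, left, parse_term());
    }
    return left;
}

static void print_tree(const struct node *n)
{
    if (n->op == 'd') { printf("%d", n->value); return; }
    printf("(+ "); print_tree(n->left); printf(" "); print_tree(n->right); printf(")");
}

int main(void)
{
    input = "1+2+3";
    struct node *tree = parse_expr();
    print_tree(tree);                    /* prints (+ (+ 1 2) 3) */
    printf("\n");
    return 0;
}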
Understanding the meaning
. Once the sentence structure is understood we try to understand the meaning of the sentence
(semantic analysis)
. How many Amits are there? Which one left the assignment?
Semantic analysis is the process of examining the statements to make sure that they make
sense. During semantic analysis, the types, values, and other required information about
statements are recorded, checked, and transformed appropriately to make sure the program
makes sense. Ideally there should be no ambiguity in the grammar of the language: each
sentence should have just one meaning.
Semantic Analysis
. Too hard for compilers. They do not have capabilities similar to human understanding
. However, compilers do perform analysis to understand the meaning and catch inconsistencies
{ int Amit = 4;
  { int Amit = 5; } }   /* a second Amit declared in an inner scope */
Since it is too hard for a compiler to do full semantic analysis, programming languages define
strict rules to avoid ambiguities and make the analysis easier. In the code written above, there is
a clear demarcation between the two instances of Amit. This has been done by putting one
outside the scope of the other, so that the compiler knows that these two Amits are different by
virtue of their different scopes.
Lexical analysis is based on finite state automata and hence finds the lexemes in the
input on the basis of corresponding regular expressions. If there is some input which it
cannot recognize then it generates an error. In the above example, the delimiter is a blank space.
See for yourself that the lexical analyzer recognizes identifiers, numbers, brackets etc.
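The notes do not give an automaton here, but as a rough sketch in C (the identifier class
[A-Za-z][A-Za-z0-9]* and the two-state encoding are assumptions chosen for this illustration),
a regular expression can be recognized by driving an explicit finite state machine:

#include <ctype.h>
#include <stdio.h>

/* Two-state DFA for the regular expression [A-Za-z][A-Za-z0-9]* (identifiers).
   State 0 is the start state, state 1 is the only accepting state. */
static int is_identifier(const char *s)
{
    int state = 0;
    for (; *s != '\0'; s++) {
        switch (state) {
        case 0:
            state = isalpha((unsigned char)*s) ? 1 : -1;
            break;
        case 1:
            state = isalnum((unsigned char)*s) ? 1 : -1;
            break;
        }
        if (state == -1)
            return 0;          /* no transition: input rejected */
    }
    return state == 1;
}

int main(void)
{
    printf("%d\n", is_identifier("count1"));   /* 1: accepted */
    printf("%d\n", is_identifier("1count"));   /* 0: rejected */
    return 0;
}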
Syntax Analysis
Syntax analysis is modeled on the basis of context free grammars. Programming languages
can be described using context free grammars. Based on the rules of the grammar, a syntax
tree can be made from a correct program in the language. A language described by a CFG is
recognized using a push down automaton. If there is any error in the syntax of the code then
an error is generated by the compiler. Some compilers also tell what exactly the error is,
if possible.
Semantic Analysis
. Check semantics
. Error reporting
. Disambiguate overloaded operators
. Type coercion
. Static checking
- Type checking
- Control flow checking
- Uniqueness checking
- Name checks
Semantic analysis should ensure that the code is unambiguous. Also it should do the type checking
wherever needed. Ex.: int y = "Hi"; should generate an error. Type coercion can be explained by the
following example: int y = 5.6 + 1; The actual value stored in y will be 6, since y is an integer. The
compiler knows that, since y is an integer, it cannot hold the value 6.6, so it converts the value
down to the greatest integer less than 6.6. This is called type coercion.
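A short C fragment (a sketch added here for illustration; not part of the original notes) showing
both the type error and the coercion:

#include <stdio.h>

int main(void)
{
    /* int y = "Hi";            a type error: the semantic analyzer should reject this */
    int    y = 5.6 + 1;      /* the double 6.6 is implicitly converted (coerced) to the int 6 */
    double d = 3;            /* the int 3 is promoted to the double 3.0 */
    printf("%d %f\n", y, d); /* prints 6 3.000000 */
    return 0;
}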
Code Optimization
- Run faster
- Copy propagation
- Code motion
- Strength reduction
- Constant folding
. Example: x = 15 * 3 is transformed to x = 45
There is no strong counterpart in English; this is similar to précis writing, where one cuts
down the redundant words. It basically cuts down the redundancy. We modify the
compiled code to make it more efficient, such that it can run faster and use fewer resources,
such as memory, registers, space, fewer fetches etc.
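For instance, here is a small before/after sketch in C (added for illustration; the "after" version
is one plausible result of the transformations named above, not the output of any particular
compiler):

/* Before optimization */
int before(void)
{
    int x = 15 * 3;        /* constant folding:   15 * 3 becomes 45        */
    int y = x;             /* copy propagation:   uses of y become x       */
    int z = y * 2;         /* strength reduction: y * 2 can become y << 1  */
    return z;
}

/* A plausible result after optimization */
int after(void)
{
    return 90;             /* the whole computation folds to a constant */
}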
Example of Optimizations

PI = 3.14159
Area = 4 * PI * R^2
Volume = (4/3) * PI * R^3            cost: 3A + 4M + 1D + 2E
--------------------------------
X = 3.14159 * R * R
Area = 4 * X
Volume = 1.33 * X * R                cost: 3A + 5M
--------------------------------
Area = 4 * 3.14159 * R * R
Volume = ( Area / 3 ) * R            cost: 2A + 4M + 1D
--------------------------------
Area = 12.56636 * R * R
Volume = ( Area / 3 ) * R            cost: 2A + 3M + 1D
--------------------------------
X = R * R                            cost: 3A + 4M
--------------------------------
A : assignment   M : multiplication   D : division   E : exponent
int x = 2;
int y = 3;
int array[5];
int i;
for (i = 0; i < 5; i++)
    array[i] = x + y;      /* x + y is recomputed in every iteration */
Because x and y are invariant and do not change inside of the loop, their addition doesn't need to
be performed for each loop iteration. Almost any good compiler optimizes the code. An
optimizer moves the addition of x and y outside the loop, thus creating a more efficient loop.
Thus, the optimized code in this case could look like the following:
int x = 2;
int y = 3;
int z = x + y;             /* the loop-invariant addition is hoisted out of the loop */
int array[5];
int i;
for (i = 0; i < 5; i++)
    array[i] = z;
Code Generation
. Intermediate languages are generally ordered in decreasing level of abstraction, from highest
(source) to lowest (machine)
. However, typically the representation produced after intermediate code generation is the most
important one
The final phase of the compiler is the generation of relocatable target code. First of all,
intermediate code is generated from the semantic representation of the source program, and this
intermediate code is used to generate machine code.
. Abstractions at the target level: memory locations, registers, stack, opcodes, addressing modes,
system libraries, interface to the operating system
. Code generation is a mapping from source level abstractions to target machine abstractions
. Lay out parameter passing protocols: locations for parameters, return values, layout of
activation frames etc.
Thus it must relate not only to identifiers, expressions, functions and classes but also to opcodes,
registers, etc., and it must map one abstraction to the other.
These are some of the things to be taken care of in intermediate code generation.
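As an illustrative sketch (not from the notes; the three-address form and the instruction names in
the comments are assumptions, not a real instruction set), consider how a single C statement might
be mapped through these abstractions:

int a, b, c;

void example(void)
{
    a = b + c * 2;
    /* A possible three-address intermediate code:
           t1 = c * 2
           t2 = b + t1
           a  = t2
       and a plausible register-level translation:
           load  r1, c
           mul   r1, r1, 2
           load  r2, b
           add   r2, r2, r1
           store a, r2                              */
}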
Post translation Optimizations
. Multiplication by 1
. Multiplication by 0
. Addition with 0
Instruction selection
- Opcode selection
- Peephole optimization
- Replacing
  if (false)
      a = 1;
  else
      a = 2;
  with a = 2;
5) Strength reduction - replacing more expensive expressions with cheaper ones, like pow(x,2)
with x*x
6) Common subexpression elimination - like a = b*c; f = b*c*d; with temp = b*c; a = temp;
f = temp*d;
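A small before/after sketch in C (added for illustration; the "after" version applies the
algebraic identities listed above and is one plausible result):

/* Before peephole-style simplification */
int simplify_before(int a, int b)
{
    int p = a * 1;         /* multiplication by 1 */
    int q = b * 0;         /* multiplication by 0 */
    int r = p + 0;         /* addition with 0     */
    return r + q;
}

/* A plausible result after simplification */
int simplify_after(int a, int b)
{
    (void)b;               /* b no longer contributes to the result */
    return a;              /* a*1 -> a,  b*0 -> 0,  a+0 -> a        */
}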
Code Generation
There is a clear intermediate code optimization here, with two different sets of code having two
different parse trees. The optimized code does away with the redundancy in the original
code and produces the same result.
Compiler structure
These are the various stages in the process of generation of the target code from the source code
by the compiler. These stages can be broadly classified into the front end (analysis) and the back
end (synthesis).
- Scope information is kept at a central repository (the symbol table)
- Every phase refers to the repository whenever the information is required
Each stage can access the symbol table; all the relevant information about the variables, classes,
functions etc. is stored in it.
Advantages of the model
The front end phases are lexical, syntax and semantic analysis. These form the "analysis
phase", as you can see that they all do some kind of analysis. The back end phases are called
the "synthesis phase", as they synthesize the intermediate and the target language, and hence the
program, from the representation created by the front end phases. The advantages are that not
only can a lot of code be reused, but also, since the compiler is well structured, it is easy to
maintain and debug.
. Compiler is retargetable.
. Optimization phase can be inserted after the front and back end phases have been developed
and deployed
Also, since each phase handles a logically different aspect of the working of a compiler, parts of
the code can be reused to make new compilers. E.g., in C compilers for Intel and Athlon the front
ends will be similar: for the same language, the lexical, syntax and semantic analyses are similar,
so code can be reused. Also, when adding optimizations, improving the performance of one phase
should not affect that of another phase; this is possible to achieve in this model.
- Compilers are required for all the languages and all the machines
- However, there is a lot of repetition of work because of similar activities in the front ends and
back ends
- Can we design only M front ends and N back ends, and somehow link them to get all M*N
compilers?
The compiler should fit into the integrated development environment. This opens many challenges
in design, e.g., appropriate information should be passed on to the debugger in case of erroneous
programs. The compiler should also find the erroneous line in the program and make error
recovery possible. Some features of programming languages make compiler design difficult, e.g.,
Algol 68 is a very neat language with most good features, but it never became widely implemented
because of the complexities in its compiler design.
. However, program proving techniques do not exist at a level where large and complex
programs like compilers can be proven to be correct
. Regression testing
- All the test programs are compiled using the compiler and deviations are reported to the
compiler writer
- Test programs should exercise every statement of the compiler at least once