Module 1
Compiler Design
BCSSS602
3:0:2
Pre-requisites
• Computer Organization
• Any programming language
• Data Structures
• Automata Theory
Course Learning Objectives (CLO)
• Understand the phases of the compiler
• Generate parse table, Intermediate Code, and Target Code
• Learn the concepts of System software – Assemblers and Loaders
MODULE 1
• Introduction
• Language Processors
• Structure of Compiler
• Evolution of programming languages
• Science of building a compiler
• Lexical Analysis
• Role of Lexical Analyzer
• Input Buffering
• Specifications of Token
• Recognition of Tokens
Language Processors
• A compiler is a program that can read a program
in one language - the source language - and
translate it into an equivalent program in another
language - the target language.
• An important role of the compiler is to report any
errors in the source program that it detects during the
translation process.
• For the running example position = initial + rate * 60, the lexical analyzer produces a
token stream in which the token names =, +, and * are abstract symbols for the assignment,
addition, and multiplication operators, respectively.
Syntax Analysis
• The second phase of the compiler is syntax analysis or parsing.
• The parser uses the first components of the tokens produced by the lexical
analyzer to create a tree-like intermediate representation that depicts the
grammatical structure of the token stream.
• A typical representation is a syntax tree in which each interior node represents
an operation and the children of the node represent the arguments of the
operation.
• A syntax tree for the token stream (2) is shown as the output of the syntactic
analyzer in Figure
• This tree shows the order in which the operations in the assignment are to be
performed.
• The tree has an interior node labeled * with
<id, 3> as its left child and the integer 60 as
its right child.
• The node <id, 3> represents the identifier
rate.
• The node labeled * makes it explicit that we
must first multiply the value of rate by 60.
• The node labeled + indicates that the result
of this multiplication is added to the value
initial.
• The root of the tree, labeled =, indicates that the result of the addition is stored into
the location for the identifier position.
• This ordering of operations is consistent with the usual conventions of
arithmetic, where multiplication has higher precedence than addition, and hence
that the multiplication is to be performed before the addition.
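The syntax tree above can be sketched in code. This is a minimal illustration, not the compiler's actual data structure: the tree for position = initial + rate * 60 is built from nested tuples, with <id, n> written as ("id", n), and a postorder walk recovers the evaluation order (multiplication first, assignment last).

```python
# Illustrative sketch: the syntax tree for  position = initial + rate * 60
# as nested tuples; ("id", n) stands for <id, n> in the symbol table.

tree = ("=",
        ("id", 1),                 # position
        ("+",
         ("id", 2),                # initial
         ("*",
          ("id", 3),               # rate
          60)))                    # integer literal 60

def postorder(node):
    """Yield operators in evaluation order: children before their parent."""
    if isinstance(node, tuple) and node[0] in ("=", "+", "*"):
        for child in node[1:]:
            yield from postorder(child)
        yield node[0]

print(list(postorder(tree)))  # ['*', '+', '=']
```

The postorder result makes the precedence explicit: * is performed before +, and = is performed last.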
Semantic Analysis
• The semantic analyzer uses the syntax tree and the information in the symbol table to
check the source program for semantic consistency with the language definition.
• It gathers type information and saves it in either the syntax tree or the symbol table, for
subsequent use during intermediate-code generation.
• An important part of semantic analysis is type checking, where the compiler checks that
each operator has matching operands.
• For example, the compiler reports an error if a floating-point number is used to index an
array.
• The language specification may permit some type conversions called coercions.
• For example, a binary arithmetic operator may be applied to either a pair of integers or
to a pair of floating-point numbers.
• If the operator is applied to a floating-point number and an integer, the compiler may
convert or coerce the integer into a floating-point number.
• Suppose that position, initial, and rate have been declared to be floating-point
numbers, and that the lexeme 60 by itself forms an integer.
• The type checker in the semantic analyzer in Figure discovers that the operator *
is applied to a floating-point number rate and an integer 60.
• In this case, the integer is converted into a floating-point number.
• In Figure, notice that the output of the semantic analyzer has an extra node for
the operator inttofloat, which explicitly converts its integer argument into a
floating-point number.
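The coercion step described above can be sketched as a small type-checking routine. The function name and node shapes are illustrative assumptions, not the compiler's actual API: when an operator is applied to an int and a float, the checker records an inttofloat conversion on the integer side.

```python
# Hedged sketch of type checking with coercion: not a real compiler's API,
# just the rule described in the text.

def check_and_coerce(op, left_type, right_type):
    """Return (result type, coercion) where coercion marks an inserted inttofloat."""
    if left_type == right_type:
        return left_type, None
    if {left_type, right_type} == {"int", "float"}:
        # coerce the integer operand to float, as the semantic analyzer does
        side = "left" if left_type == "int" else "right"
        return "float", ("inttofloat", side)
    raise TypeError(f"operator {op} cannot combine {left_type} and {right_type}")

# rate (float) * 60 (int): the integer operand gets an inttofloat node
print(check_and_coerce("*", "float", "int"))  # ('float', ('inttofloat', 'right'))
```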
Intermediate Code Generation
• In the process of translating a source program into target code, a compiler may
construct one or more intermediate representations, which can have a variety of
forms.
• Syntax trees are a form of intermediate representation; they are commonly used
during syntax and semantic analysis.
• After syntax and semantic analysis of the source program, many compilers
generate an explicit low-level or machine-like intermediate representation.
• This intermediate representation should have two important properties: it should
be easy to produce and it should be easy to translate into the target machine.
• An intermediate form called three-address code consists of a sequence of
assembly-like instructions with three operands per instruction.
• Each operand can act like a register.
• The output of the intermediate code generator in Figure consists of the three-
address code sequence
t1 = inttofloat(60)
t2 = id3 * t1
t3 = id2 + t2 ---- (3)
id1 = t3
• Each three-address assignment instruction has at most one operator on the right
side.
• Thus, these instructions fix the order in which operations are to be done; the
multiplication precedes the addition, as in the source program.
• The compiler must generate a temporary name to hold the value computed
by each three-address instruction.
• Finally, some "three-address instructions" like the first and last in the sequence
(3), above, have fewer than three operands.
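A tree walk that emits three-address code in the style of sequence (3) can be sketched as follows. The tuple-based tree and function names are illustrative assumptions; the point is that each operator gets one instruction and a fresh temporary.

```python
# Illustrative sketch: emitting three-address code from a syntax tree,
# generating a fresh temporary (t1, t2, ...) per operator.

counter = 0
code = []

def new_temp():
    global counter
    counter += 1
    return f"t{counter}"

def gen(node):
    """Return the name holding node's value, appending instructions to code."""
    if isinstance(node, tuple):
        op = node[0]
        if op == "inttofloat":               # unary conversion operator
            t = new_temp()
            code.append(f"{t} = inttofloat({gen(node[1])})")
            return t
        left, right = gen(node[1]), gen(node[2])
        t = new_temp()
        code.append(f"{t} = {left} {op} {right}")
        return t
    return str(node)  # identifier or constant: use its name directly

result = gen(("+", "id2", ("*", "id3", ("inttofloat", 60))))
code.append(f"id1 = {result}")
print("\n".join(code))
```

Running this reproduces the four instructions of sequence (3), including the two instructions with fewer than three operands.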
Code Optimization
• The machine-independent code-optimization phase attempts to improve the
intermediate code so that better target code will result.
• Better target code is faster, shorter, or consumes less power.
• For example, an algorithm generates the intermediate code (3), using an
instruction for each operator in the tree representation that comes from the
semantic analyzer.
• The optimizer can deduce that the conversion of 60 from integer to floating
point can be done once and for all at compile time, so the inttofloat operation
can be eliminated by replacing the integer 60 by the floating-point number 60.0.
• Moreover, t3 is used only once to transmit its value to id1 so the optimizer can
transform (3) into the shorter sequence
t1 = id3 * 60.0 -------------(4)
id1 = id2 + t1
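The two improvements above can be sketched as passes over a toy instruction list. The (dest, op, args) representation is an illustrative assumption; real optimizers work on richer intermediate representations.

```python
# Hedged sketch of the transformation from (3) to (4): fold the compile-time
# conversion inttofloat(60) into 60.0, then merge the single-use copy into id1.

code = [
    ("t1", "inttofloat", ["60"]),
    ("t2", "*", ["id3", "t1"]),
    ("t3", "+", ["id2", "t2"]),
    ("id1", "copy", ["t3"]),
]

# 1. Constant folding: inttofloat of a constant is done once at compile time.
env = {}
folded = []
for dest, op, args in code:
    if op == "inttofloat" and args[0].isdigit():
        env[dest] = args[0] + ".0"          # remember t1 stands for 60.0
        continue                             # drop the instruction entirely
    folded.append((dest, op, [env.get(a, a) for a in args]))

# 2. Copy elimination: a temporary used only to feed a copy is written directly.
optimized = []
for i, (dest, op, args) in enumerate(folded):
    nxt = folded[i + 1] if i + 1 < len(folded) else None
    if nxt and nxt[1] == "copy" and nxt[2] == [dest]:
        optimized.append((nxt[0], op, args))  # compute straight into id1
        break
    optimized.append((dest, op, args))

print(optimized)
```

The result is the two-instruction sequence of (4): a multiply by the constant 60.0, and an add that stores directly into id1 (only the temporary's name differs from the text).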
Code Generation
• The code generator takes as input an intermediate representation of the source program and
maps it into the target language.
• If the target language is machine code, registers or memory locations are selected for each of
the variables used by the program.
• A crucial aspect of code generation is the judicious assignment of registers to hold variables.
• For example, using registers R1 and R2, the intermediate code in (4) might get translated
into the machine code
LDF R2, id3
MULF R2, R2, #60.0 ------ (5)
LDF R1, id2
ADDF R1, R1, R2
STF id1, R1
• The first operand of each instruction specifies a destination. The F in each instruction tells us
that it deals with floating-point numbers.
• The code in (5) loads the contents of address id3 into register R2
• Then multiplies it by the floating-point constant 60.0. (The # signifies that 60.0 is
to be treated as an immediate constant.)
• The third instruction moves id2 into register R1
• The fourth adds to it the value previously computed in register R2.
• Finally, the value in register R1 is stored into the address of id1, so the code
correctly implements the assignment statement (1).
Symbol-Table Management
• Symbol tables are data structures that are used by compilers to hold information about
source-program constructs.
• The symbol table is a data structure containing a record for each variable name, with fields
for the attributes of the name.
• The data structure should be designed to allow the compiler to find the record for each name
quickly and to store or retrieve data from that record quickly
• The information is collected incrementally by the analysis phases of a compiler and used by
the synthesis phases to generate the target code.
• Entries in the symbol table contain information about an identifier such as its character
string (or lexeme), its type, its position in storage, and any other relevant information.
• Symbol tables typically need to support multiple declarations of the same identifier within a
program (scope - where in the program its value may be used), and in the case of procedure
names, such things as the number and types of its arguments, the method of passing each
argument (for example, by value or by reference), and the type returned.
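The requirements above (fast lookup, attribute records, multiple declarations of the same name in different scopes) can be sketched with a chained symbol table. Field names and the dictionary layout are illustrative assumptions, not a prescribed design.

```python
# Minimal sketch of a symbol table with scope chaining: lookup searches the
# current scope, then the enclosing scopes outward.

class SymbolTable:
    def __init__(self, parent=None):
        self.parent = parent      # enclosing scope, or None for globals
        self.entries = {}         # lexeme -> attribute record

    def insert(self, lexeme, **attrs):
        self.entries[lexeme] = attrs   # e.g. type, storage position

    def lookup(self, lexeme):
        scope = self
        while scope is not None:
            if lexeme in scope.entries:
                return scope.entries[lexeme]
            scope = scope.parent
        return None

globals_ = SymbolTable()
globals_.insert("rate", type="float", offset=8)
inner = SymbolTable(parent=globals_)
inner.insert("rate", type="int", offset=0)   # redeclaration in an inner scope

print(inner.lookup("rate"))     # the inner declaration shadows the outer one
print(globals_.lookup("rate"))  # the global declaration is still intact
```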
The Grouping of Phases into Passes
• In an implementation, activities from several phases may be grouped together into a
pass that reads an input file and writes an output file.
• For example, the front-end phases of lexical analysis, syntax analysis, semantic
analysis, and intermediate code generation might be grouped together into one pass.
• Code optimization might be an optional pass.
• The back-end pass consists of code generation for a particular target machine.
• Some compiler collections have been created around carefully designed intermediate
representations that allow the front end for a particular language to interface with the
back end for a certain target machine.
• With these collections, compilers for different source languages for one target
machine can be designed by combining different front ends with the back end for that
target machine.
• Similarly, compilers for different target machines can be designed by combining a front
end with back ends for those target machines.
Compiler-Construction Tools
• Some commonly used compiler-construction tools include
1. Parser generators that automatically produce syntax analyzers from a grammatical
description of a programming language.
2. Scanner generators that produce lexical analyzers from a regular-expression description
of the tokens of a language.
3. Syntax-directed translation engines that produce collections of routines for walking a
parse tree and generating intermediate code.
4. Code-generator generators that produce a code generator from a collection of rules for
translating each operation of the intermediate language into the machine language for a
target machine.
5. Data-flow analysis engines that facilitate the gathering of information about how values
are transmitted from one part of a program to each other part; data-flow analysis is a key
part of code optimization.
6. Compiler-construction toolkits that provide an integrated set of routines for
constructing various phases of a compiler
The Evolution of Programming Languages
• The first programming language was machine language, which used sequences of 0s and
1s that explicitly told the computer what operations to execute and in what
order.
• Programming languages can be classified in a variety of ways –
• by generation.
• First-generation languages are the machine languages,
• second-generation the assembly languages, and
• third-generation the higher-level languages like Fortran, Cobol, Lisp, C, C++, C#, and
Java.
• Fourth-generation languages are languages designed for specific applications like
NOMAD for report generation, SQL for database queries, and Postscript for text
formatting.
• Fifth-generation languages are logic- and constraint-based languages like Prolog and
OPS5.
• By Computation
• Imperative for languages in which a program specifies how a computation is to be done
• Languages such as C, C++, C#, and Java are imperative languages, which have a notion of program state
and statements that change the state.
• Declarative for languages in which a program specifies what computation is to be done.
• Functional languages such as ML and Haskell and constraint logic languages such as Prolog are often
considered to be declarative languages.
• By Architecture
• Von Neumann language is a programming language whose computational model is based
on the von Neumann computer architecture.
• Languages, such as Fortran and C are von Neumann languages.
• An object-oriented language is one that supports object-oriented programming, a
programming style in which a program consists of a collection of objects that interact with
one another.
• Simula 67 and Smalltalk are the earliest major object-oriented languages. Languages such as C++, C#,
Java, and Ruby are more recent object-oriented languages.
• Scripting languages are interpreted languages with high-level operators designed for "gluing
together" computations.
• Awk, JavaScript, Perl, PHP, Python, Ruby, and Tcl are examples of scripting languages
Modeling in Compiler Design and Implementation
• The study of compilers is mainly a study of how to design the right
mathematical models and choose the right algorithms, while balancing the need
for generality and power against simplicity and efficiency.
• Some of the most fundamental models are:
• finite-state machines and regular expressions, which are useful for describing the lexical
units of programs (keywords, identifiers, and such) and for describing the algorithms
used by the compiler to recognize those units.
• context-free grammars, used to describe the syntactic structure of programming
languages such as the nesting of parentheses or control constructs.
• trees for representing the structure of programs and their translation into object code.
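A finite-state machine of the kind mentioned above can be sketched directly. This toy recognizer for identifiers (a letter or underscore followed by letters, digits, or underscores) is an illustrative assumption about the identifier pattern, not a specification from the text.

```python
# Illustrative sketch: a two-state machine that recognizes identifiers.

def is_identifier(s):
    state = "start"
    for ch in s:
        if state == "start":
            state = "in_id" if ch.isalpha() or ch == "_" else "reject"
        elif state == "in_id":
            state = "in_id" if ch.isalnum() or ch == "_" else "reject"
    return state == "in_id"   # accept only if we ended in the identifier state

print(is_identifier("rate"))  # True
print(is_identifier("60"))    # False: starts with a digit
```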
The Science of Code Optimization
• The term "optimization" in compiler design refers to the attempts that a compiler makes to
produce code that is more efficient than the obvious code.
• "Optimization" is thus a misnomer, since there is no way that the code produced by a
compiler can be guaranteed to be as fast or faster than any other code that performs the same
task.
• In modern times, the optimization of code that a compiler performs has become both more
important and more complex.
WHY?
• It is more complex because processor architectures have become more complex, offering
more opportunities to improve the way code executes.
• It is more important because massively parallel computers require substantial optimization,
or their performance suffers by orders of magnitude.
• A rigorous mathematical foundation allows us to show that an optimization is
correct and that it produces the desired effect for all possible inputs.
• Models such as graphs, matrices, and linear programs are necessary for the
compiler to produce optimized code.
• Compiler optimizations must meet the following design objectives:
• The optimization must be semantically correct, that is, preserve the meaning of the
compiled program.
• The optimization must improve the performance of many programs: higher speed, lower
power consumption.
• The compilation time must be kept reasonable (debugging and testing cannot be
exhaustive).
• The engineering and maintenance effort required must be manageable
Lexical Analysis – Role of Lexical Analyzer
• First phase of a compiler
• The main task is to read the input characters of the source program, group them into lexemes,
and produce as output a sequence of tokens, one for each lexeme in the source program;
identifier lexemes are also stored in the symbol table.
• The stream of tokens is sent to the parser for syntax analysis.
• When the lexical analyzer discovers a lexeme constituting an identifier, it stores that lexeme
into the symbol table.
• It also reads information regarding the kind of identifier from the symbol table to assist it in
determining the proper token it must pass to the parser.
• Lexical analyzers are divided into a cascade of two processes:
a) Scanning consists of the simple processes such as deletion of comments and compaction of
consecutive whitespace characters into one.
b) Lexical analysis proper is the more complex portion, where the scanner produces the
sequence of tokens as output.
Interactions between the lexical analyzer and the parser
• The interaction is implemented by having the parser call the lexical analyzer.
• The call, suggested by the getNextToken command, causes the lexical analyzer
to read characters from its input until it can identify the next lexeme and
produce for it the next token, which it returns to the parser.
• The lexical analyser also removes comments and whitespace (blank, newline,
tab, and other characters that are used to separate tokens in the input).
• It also helps correlate error messages generated by the compiler with the
source program.
• For instance, the lexical analyzer keeps track of the number of newline characters seen,
and associates a line number with each error message.
• The lexical analyzer makes a copy of the source program with the error
messages inserted at the appropriate positions.
• If the source program uses a macro-preprocessor, the expansion of macros is
performed by the lexical analyzer.
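The getNextToken interaction can be sketched with a generator that the parser pulls tokens from one at a time. The regular expressions and token names here are illustrative assumptions for the running example, not a full language specification.

```python
# Hedged sketch of a lexical analyzer driven by the parser: each next()
# call plays the role of one getNextToken call.
import re

TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("OP",     r"[+*]"),
    ("SKIP",   r"[ \t\n]+"),   # whitespace is discarded, as described above
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokens(source):
    for m in MASTER.finditer(source):
        if m.lastgroup != "SKIP":            # whitespace never reaches the parser
            yield (m.lastgroup, m.group())

stream = tokens("position = initial + rate * 60")
print(next(stream))  # ('ID', 'position')
print(next(stream))  # ('ASSIGN', '=')
```

Each next() call advances over the input until the next lexeme is identified, exactly the pull-style interaction described above.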
Tokens, Patterns, and Lexemes
• A token is a pair consisting of a token name and an optional attribute value.
• The token name is an abstract symbol representing a kind of lexical unit, e.g., a
particular keyword, or a sequence of input characters denoting an identifier.
• The token names are the input symbols that the parser processes; a token is often
referred to by its token name.
• A pattern is a description of the form that the lexemes of a token may take.
• In the case of a keyword as a token, the pattern is just the sequence of characters
that form the keyword.
• For identifiers and some other tokens, the pattern is a more complex structure that
is matched by many strings.
• A lexeme is a sequence of characters in the source program that matches the
pattern for a token and is identified by the lexical analyzer as an instance of that
token.
Example
Attributes for Tokens
• When more than one lexeme can match a pattern, the lexical analyzer provides the
subsequent compiler phases with additional information about the particular lexeme that
matched.
• For example, the pattern for token number matches both 0 and 1, but it is extremely
important for the code generator to know which lexeme was found in the source program.
• Thus, the lexical analyzer returns to the parser not only a token name, but an attribute
value that describes the lexeme represented by the token;
• Information about an identifier - e.g., its lexeme, its type, and the location at which it is
first found (in case an error message about that identifier must be issued) - is kept in the
symbol table.
• Thus, the appropriate attribute value for an identifier is a pointer to the symbol-table
entry for that identifier.
• The token name influences parsing decisions, while the attribute value influences
translation of tokens after the parse.
• The token names and associated attribute values for the Fortran statement are
written below as a sequence of pairs.
E = M * C ** 2
<id, pointer to symbol-table entry for E>
<assign-op>
<id, pointer to symbol-table entry for M>
<mult-op>
<id, pointer to symbol-table entry for C>
<exp-op>
<number, integer value 2>
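The pair sequence above can be reproduced with a small sketch in which symbol-table "pointers" are modeled as list indices. The token names follow the text; the splitting-on-spaces lexing is an illustrative simplification.

```python
# Illustrative sketch: <token-name, attribute> pairs for  E = M * C ** 2,
# with identifier attributes modeled as indices into a symbol table.

symtab = []   # list of lexemes; the index stands in for a pointer

def intern(lexeme):
    if lexeme not in symtab:
        symtab.append(lexeme)
    return symtab.index(lexeme)

pairs = []
for lexeme in "E = M * C ** 2".split():
    if lexeme.isidentifier():
        pairs.append(("id", intern(lexeme)))          # attribute: table entry
    elif lexeme.isdigit():
        pairs.append(("number", int(lexeme)))         # attribute: integer value
    else:
        name = {"=": "assign-op", "*": "mult-op", "**": "exp-op"}[lexeme]
        pairs.append((name, None))                    # operators need no attribute

print(pairs)
```

Note that the operator tokens carry no attribute value, while each identifier carries its symbol-table entry and the number carries its integer value, matching the pairs listed above.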
Lexical Errors