Compiler Design I MSC
Core Course IV
COMPILER DESIGN
Subject Code : P22CSCC22
Prepared by
COURSE OBJECTIVES:
Lexical analysis - Role of lexical analyzer - Tokens, Patterns and lexemes - Input
buffering - Specification of tokens - Regular expressions - Recognition of tokens -
Transition diagrams - Implementing a transition diagram - Finite Automata -
Regular expression to NFA - Conversion of NFA to DFA
*****
UNIT – I
Introduction to compilers:
Compiler Design is the structure and set of principles that guide the translation, analysis,
and optimization process of a compiler.
A Compiler is computer software that transforms program source code which is written
in a high-level language into low-level machine code. It essentially translates the code
written in one programming language to another language without changing the logic of
the code.
The Compiler also makes the code output efficient and optimized for execution time and
memory space. The compiling process has basic translation mechanisms and error
detection; it can’t compile code if there is an error. The compiler process runs through
syntax, lexical, and semantic analysis in the front end and generates optimized code in the
back end.
When executing, the compiler first analyzes all the language statements one after the
other syntactically and then, if that is successful, builds the output code, making sure that
statements that refer to other statements are referenced appropriately. Traditionally, the
output code is called object code.
Types of Compiler
1. Cross Compiler: This enables the creation of code for a platform other than the one
on which the compiler is running. For instance, it runs on a machine 'A' and produces
code for another machine 'B'.
2. Single-Pass Compiler: This directly transforms source code into machine code. For
instance, Pascal compilers are typically single-pass.
3. Two-Pass Compiler: This goes through the code to be translated twice; on the first
pass it checks the syntax of statements and constructs a table of symbols, while on the
second pass it actually translates program statements into machine language.
4. Multi-Pass Compiler: This is a type of compiler that processes the source code or
abstract syntax tree of a program multiple times before translating it to machine
language.
Language Processing System
2. High-Level Language: These programs are closer to human language but far from
machines. The (#) tags are referred to as preprocessor directives; they tell the
pre-processor what to do.
3. Pre-Processor: This produces input for the compiler and also deals with file
inclusion, augmentation, macro-processing, language extension, etc. It removes all
the #include directives by including the files (called file inclusion) and all
the #define directives using macro expansion.
6. Relocatable Machine Code: This can be loaded at any point in time and run.
This enables the movement of a program using its unique address identifier.
7. Linker: It links and merges a variety of object files into a single file to make it
executable. The linker searches for defined modules in a program and finds out the
memory location where all modules are stored.
8. Loader: It loads the output from the Linker into memory and executes it. It basically
loads executable files into memory and runs them.
Features of a Compiler
Correctness: A major feature of a compiler is its correctness, i.e., its accuracy in
compiling the given code input into output object code with exactly the same logic.
Recognize legal and illegal program constructs: Compilers are designed so that they
can identify which parts of the program, formed from one or more lexical tokens using
the appropriate rules of the language, are syntactically allowable and which are not.
Good error reporting/handling: A compiler is designed to detect and handle the errors
it encounters, be it a syntax error, an insufficient-memory error, or a logic error; these
are meticulously handled and displayed to the user.
The speed of the target code: Compilers make sure that the target code is fast, because
in large programs slow code is a serious limitation; some compilers do so by translating
byte code into target code that runs on the specific processor, using classical compiling
methods.
Preserve the correct meaning of the code: A compiler makes sure that the code logic is
preserved down to the tiniest detail, because even a small loss of logic can change the
program's behaviour and produce wrong results; during the design process, the compiler
therefore goes through extensive testing to make sure that no code logic is lost in the
compiling process.
Code debugging help: Compilers help make debugging easier by pointing out the
erroneous line to the programmer and reporting the type of error encountered, so the
programmer knows where to start fixing it.
Reduced system load: Compiled programs run faster than interpreted programs because
the program is compiled only once, reducing system load and response time the next
time the program is run.
Protection for source code and programs: Compilers protect your program source by
discouraging other users from making unauthorized changes to it; as the author, you can
distribute your programs in object code form.
Linear Analysis: In which the stream of characters making up the source program is
read from left to right and grouped into tokens, which are sequences of characters having
a collective meaning.
Hierarchical Analysis: In which characters or tokens are grouped hierarchically into
nested collections with collective meaning.
Semantic Analysis: In which certain checks are performed to ensure that the
components of a program fit together meaningfully.
Phases of compiler
A compiler operates in phases, each of which transforms the source program from one
representation to another. Every phase takes its input from the previous stage and feeds
its output to the next phase of the compiler.
There are six phases in a compiler, each of which helps in converting the high-level
language into machine code. The phases of a compiler are:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generator
Code optimizer
Code generator
Lexical Analysis:
Here, the character stream from the source program is grouped into meaningful sequences
by identifying the tokens. It makes entries for the corresponding tokens in the symbol
table and passes each token to the next phase.
Example:
x = y + 10
Tokens:
x → identifier
= → assignment operator
y → identifier
+ → addition operator
10 → number
Syntax Analysis:
Syntax analysis checks the token stream against the grammar rules of the specific
programming language, constructing a parse tree from the tokens. It determines the
structure of the source program according to the grammar, or syntax, of the language.
Example
Any identifier/number is an expression
If x is an identifier and y+10 is an expression, then x= y+10 is a statement.
Consider the parse tree for the following example:
(a+b)*c
In a parse tree:
· Interior node: a record with an operator field and two fields for children
· Leaf: a record with two or more fields; one for the token and others for information
about the token
Semantic Analysis:
· Ensures that the components of the program fit together meaningfully
· Gathers type information and checks for type compatibility
· Checks that the operands are permitted by the source language
· Stores the type information gathered in the symbol table or syntax tree
· Performs type checking
· In the case of a type mismatch, where there are no type-correction rules that satisfy
the desired operation, a semantic error is reported
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will typecast the integer 30 to float 30.0 before
multiplication.
Intermediate Code Generation:
After semantic analysis, the compiler generates an intermediate representation of the
source program, such as three-address code. For example, a statement like
total := count + rate * 5 may be translated as:
t1 := int (5)
t2 := rate * t1
t3 := count + t2
total := t3
Code Optimization:
The code optimizer improves the intermediate code so that it runs faster or uses fewer
resources. Consider the following code
a = intofloat(10)
b=c*a
d=e+b
f=d
Can become
b =c * 10.0
f = e+b
Code Generation:
The target language is the machine code. Therefore, all the memory locations and
registers are selected and allotted during this phase. The code generated by this phase is
executed to take inputs and generate the expected outputs.
Example:
a = b + 60.0
might be translated to:
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
Error Handling:
The most common errors are invalid character sequences in scanning, invalid token
sequences in syntax analysis, and scope or type errors in semantic analysis.
An error may be encountered in any of the above phases. After finding an error, the
phase must deal with it so that the compilation process can continue. Errors are reported
to the error handler, which handles them so compilation can proceed; generally, errors
are reported in the form of a message.
GROUPING OF PHASES
The phases of a compiler can be grouped as Front end and Back end.
The front end comprises the phases that depend on the input (source language) and are
independent of the target machine (target language). It includes lexical and syntactic
analysis, symbol table management, semantic analysis and the generation of
intermediate code. Some code optimization can also be done by the front end. It also
includes error handling for the phases concerned.
Front end of a compiler consists of the phases
• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.
Back end
The back end comprises those phases of the compiler that depend on the target
machine and are independent of the source language. This includes code optimization
and code generation. In addition, it also encompasses error handling and symbol
table management operations.
Back end of a compiler contains
• Code optimization.
• Code generation.
Passes
• The phases of a compiler can be implemented in a single pass, the primary
actions being the reading of an input file and the writing of an output file.
• Several phases of compiler are grouped into one pass in such a way that the operations
in each and every phase are incorporated during the pass.
• (eg.) Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after lexical analysis
may be translated directly into intermediate code.
Reducing the Number of Passes
• Minimizing the number of passes improves the time efficiency as reading from and
writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program has to be kept in memory to
ensure proper information flow to each phase, because one phase may need information
in a different order than the order in which it is produced by a previous phase.
The source program or target program differs from its internal representation. So,
the memory for internal form may be larger than that of input and output.
COUSINS OF COMPILER
Cousins of compiler contains
1. Preprocessor
2. Compiler
3. Assembler
4. Linker
5. Loader
6. Memory
1) Preprocessor
A preprocessor is a program that processes its input data to produce output that is used
as input to another program. The output is said to be a preprocessed form of the input
data, which is often used by some subsequent programs like compilers. They may
perform the following functions.
1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension
1. Macro processing: A macro is a rule or pattern that specifies how a certain input
sequence should be mapped to an output sequence according to a defined procedure. The
mapping process that instantiates a macro into a specific output sequence is known as
macro expansion.
2. File Inclusion: Preprocessor includes header files into the program text. When the
preprocessor finds an #include directive it replaces it by the entire content of the
specified file.
3. Rational Preprocessors: These processors change older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extension: These processors attempt to add capabilities to the language by
what amounts to built-in macros. For example, the language Equel is a database query
language embedded in C.
2) Compiler
It takes pure high-level language as input and converts it into assembly code.
3) Assembler
It takes assembly code as input and converts it into relocatable machine code. An
assembler creates object code by translating assembly instruction mnemonics into
machine code. There are two types of assemblers. One-pass assemblers go through the
source code once and assume that all symbols will be defined before any instruction that
references them. Two-pass assemblers create a table with all symbols and their values in
the first pass, and then use the table in a second pass to generate code.
4) Linker
A linker performs the following tasks:
1. Allocation: It obtains memory portions from the operating system and stores
the object data there.
2. Relocation: It maps relative addresses to physical addresses, relocating the
object code.
3. Linking: It combines all the object modules into a single executable file.
5) Loader
A loader is the part of an operating system that is responsible for loading programs in
memory, one of the essential stages in the process of starting a program.
6) Memory
Language Definition
Syntax definition
Language
A language is a set of strings over an alphabet. For example, over the alphabet
A0 = {0,1}, one language is L0 = {0,1,100,101,...}.
Example: Grammar for expressions consisting of digits and plus and minus signs
(such as 9-5+2):
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
list, digit : grammar variables (nonterminals), grammar symbols
0,1,2,3,4,5,6,7,8,9,-,+ : tokens, terminal symbols
Conventions for specifying a grammar
· Terminal symbols : boldface strings such as if, num, id
· Nonterminal symbols (grammar variables) : italicized names such as list, digit, A, B
Grammar G=(N,T,P,S)
· N : a set of nonterminal symbols
· T : a set of terminal symbols, tokens
· P : a set of production rules
· S : a start symbol, S ∈ N
Grammar G for a language L = { 9-5+2, 3-1,….}
· G=(N,T,P,S)
· N={list,digit}
· T={0,1,2,3,4,5,6,7,8,9,-,+}
· P= list → list + digit
· list → list - digit
· list → digit
· digit → 0|1|2|3|4|5|6|7|8|9
Ambiguity
A grammar is said to be ambiguous if the grammar has more than one parsev tree for a
given string of tokens.
Associativity of operators
An operand with operators on both of its sides belongs to the operator with which it
associates: operators such as +, -, * and / are left-associative (9-5+2 is grouped as
(9-5)+2), while assignment and exponentiation associate to the right. Auxiliary
definitions used in the examples:
digit → 0|1|…|9
letter → a|b|…|z
Precedence of operators
We say that an operator (*) has higher precedence than another operator (+) if the
operator (*) takes its operands before the other operator (+) does; e.g., 9+5*2 is
grouped as 9+(5*2).
LEXICAL ANALYSIS
Lexical Analysis:
· reads and converts the input into a stream of tokens to be analyzed by parser.
· lexeme : a sequence of characters which comprises a single token.
· Lexical Analyzer →Lexeme / Token → Parser
Constants
· Constants: For now, consider only integers.
· Example: for input 31 + 28, the output (token representation) is:
output: <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attribute values (lexemes) of the integer token num
Recognizing identifiers
· Identifiers are names of variables, arrays, functions, ...
· A grammar treats an identifier as a token.
· e.g., input : count = count + increment;
output : <id,1> <=, > <id,1> <+, > <id,2> ;
(count and increment are entered in the symbol table as entries 1 and 2)
Symbol table
· Keywords are reserved, i.e., they cannot be used as identifiers; a character
string forms an identifier only if it is not a keyword.
· Punctuation symbols and operators are also treated as tokens:
operators : + - * / := <> …
Lexical analysis is the very first phase in compiler design. A lexer takes the modified
source code, which is written in the form of sentences, and converts this sequence of
characters into a sequence of tokens. The lexical analyzer breaks the syntax into a
series of tokens and removes any extra spaces or comments written in the source code.
Programs that perform lexical analysis in compiler design are called lexical analyzers
or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a
token is invalid, it generates an error. The role of the lexical analyzer in compiler design
is to read character streams from the source code, check for legal tokens, and pass the
data to the syntax analyzer when it demands.
Example
How Pleasant Is The Weather?
See this lexical analysis example: here, we can easily recognize that there are five
words: How, Pleasant, Is, The, Weather. This is very natural for us because we can
recognize the separators, blanks, and the punctuation symbol.
Basic Terminologies
What’s a lexeme?
A lexeme is a sequence of characters that are included in the source program according
to the matching pattern of a token. It is nothing but an instance of a token.
What’s a token?
Tokens in compiler design are the sequence of characters which represents a unit of
information in the source program.
What is Pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case
of a keyword used as a token, the pattern is just the keyword's sequence of characters.
Lexical analyzer scans the entire source code of the program. It identifies each token one
by one. Scanners are usually implemented to produce tokens only when requested by a
parser. Here is how recognition of tokens in compiler design works-
1. “Get next token” is a command which is sent from the parser to the lexical
analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the
next token.
3. It returns the token to Parser.
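This handshake can be sketched in C as follows. It is a minimal sketch: the token names
(TOK_ID, TOK_NUM, TOK_OP) and the helper get_next_token are illustrative names of our
own, not part of any standard API.

#include <stdio.h>
#include <ctype.h>

/* Hypothetical token kinds for this sketch. */
typedef enum { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF } TokenType;

typedef struct {
    TokenType type;
    char lexeme[32];
} Token;

/* Called by the parser whenever it needs the next token. */
Token get_next_token(const char **ip) {
    Token t = { TOK_EOF, "" };
    int n = 0;
    while (isspace((unsigned char)**ip)) (*ip)++;    /* skip whitespace */
    if (**ip == '\0') return t;
    if (isalpha((unsigned char)**ip)) {              /* identifier */
        t.type = TOK_ID;
        while (isalnum((unsigned char)**ip) && n < 31) t.lexeme[n++] = *(*ip)++;
    } else if (isdigit((unsigned char)**ip)) {       /* number */
        t.type = TOK_NUM;
        while (isdigit((unsigned char)**ip) && n < 31) t.lexeme[n++] = *(*ip)++;
    } else {                                         /* single-character operator */
        t.type = TOK_OP;
        t.lexeme[n++] = *(*ip)++;
    }
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    const char *src = "count = count + 1;";
    for (Token t = get_next_token(&src); t.type != TOK_EOF; t = get_next_token(&src))
        printf("<%d, %s>\n", t.type, t.lexeme);      /* parser's "get next token" loop */
    return 0;
}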
Lexical Analyzer skips whitespaces and comments while creating these tokens. If any
error is present, then Lexical analyzer will correlate that error with the source file and
line number.
#include <stdio.h>

int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of Tokens created
Lexeme      Token
int         Keyword
maximum     Identifier
(           Operator
int         Keyword
x           Identifier
,           Operator
int         Keyword
y           Identifier
)           Operator
{           Operator
if          Keyword
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
· Lexical errors are not very common, but it should be managed by a scanner
· Misspelling of identifiers, operators, keyword are considered as lexical errors
· Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.
Advantages of Lexical Analysis
· Simplicity of design: it eases the process of lexical analysis and syntax
analysis by eliminating unwanted tokens
· Improved compiler efficiency: it helps you to improve compiler efficiency
· The lexical analyzer method is used by programs like compilers, which can use the
parsed data from a programmer's code to create a compiled binary executable
· It is used by web browsers to format and display a web page with the help of
parsed data from JavaScript, HTML, CSS
· A separate lexical analyzer helps you to construct a specialized and potentially
more efficient processor for the task
Disadvantages of Lexical Analysis
· Significant time is spent reading the source program and partitioning it into tokens
· Some regular expressions are quite difficult to understand compared to PEG or
EBNF rules
· More effort is needed to develop and debug the lexer and its token descriptions
· Additional runtime overhead is required to generate the lexer tables and construct
the tokens
REGULAR EXPRESSIONS
A regular expression is a formula that describes a possible set of strings. Components of
a regular expression:
x : the character x
. : any character, usually except a newline
[xyz] : any of the characters x, y, z
R? : an R or nothing (i.e., an optional R)
R* : zero or more occurrences of R
…
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to
define regular expressions using these names as if they were symbols.
Identifiers are the set or string of letters and digits beginning with a letter. The following
regular definition provides a precise specification for this class of string.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Example 2: Pascal identifier
letter → A | B | …… | Z | a | b | …… | z
digit → 0 | 1 | 2 | …. | 9
id → letter (letter | digit)*
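As a small illustration, the regular definition for id can be checked directly in C. This is
a minimal sketch: the function name is our own, and underscores are deliberately not
accepted, matching the definition above.

#include <ctype.h>
#include <stdio.h>

/* Checks a string against id -> letter (letter | digit)* */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)s[0])) return 0;   /* must begin with a letter */
    for (int i = 1; s[i] != '\0'; i++)
        if (!isalnum((unsigned char)s[i])) return 0;
    return 1;
}

int main(void) {
    printf("%d %d %d\n",
           is_identifier("count"), is_identifier("x9"), is_identifier("9x"));
    return 0;   /* prints 1 1 0 */
}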
Recognition of tokens:
We have learned how to express patterns using regular expressions. Now, we must study
how to take the patterns for all the needed tokens and build a piece of code that examines
the input string and finds a prefix that is a lexeme matching one of the patterns.
Consider the grammar fragment:
stmt → if expr then stmt
     | if expr then stmt else stmt
     | ε
expr → term relop term
     | term
term → id
     | number
For relop, we use the comparison operators of languages like Pascal or SQL, where = is
"equals" and <> is "not equals", because it presents an interesting structure of lexemes.
The terminals of the grammar, which are if, then, else, relop, id and number, are the
names of tokens as far as the lexical analyzer is concerned; the patterns for these tokens
are described using regular definitions.
In addition, we assign the lexical analyzer the job of stripping out white space, by
recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when
we recognize it, we do not return it to the parser, but rather restart the lexical analysis
from the character that follows the white space. It is the following token that gets
returned to the parser.
TRANSITION DIAGRAM:
A transition diagram has a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking
for a lexeme that matches one of several patterns.
Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
If we are in state s and the next input symbol is a, we look for an edge out of state s
labeled by a. If we find such an edge, we advance the forward pointer and enter the state
of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all positions
between the lexemeBegin and forward pointers. We always indicate an accepting state by
a double circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we
shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge
labeled "start" entering from nowhere. The transition diagram always begins in the start
state before any input symbols have been used.
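A transition diagram is typically implemented in code with one case per state, each case
reading the next character and following the matching edge. The following C sketch
(with token names of our own choosing) recognizes the relational operators <, <=, <>,
=, > and >=, including the retraction performed at the *-marked accepting states.

#include <stdio.h>

typedef enum { LT, LE, NE, EQ, GT, GE, FAIL } Relop;

/* Each state of the diagram is a case; fwd plays the role of the forward pointer. */
Relop scan_relop(const char **fwd) {
    int state = 0;
    for (;;) {
        char c = **fwd;
        switch (state) {
        case 0:                          /* start state */
            (*fwd)++;
            if (c == '<') state = 1;
            else if (c == '=') return EQ;
            else if (c == '>') state = 6;
            else return FAIL;
            break;
        case 1:                          /* seen '<' */
            if (c == '=') { (*fwd)++; return LE; }
            if (c == '>') { (*fwd)++; return NE; }
            return LT;                   /* accept with retraction: c not consumed */
        case 6:                          /* seen '>' */
            if (c == '=') { (*fwd)++; return GE; }
            return GT;                   /* accept with retraction */
        }
    }
}

int main(void) {
    const char *p = "<=";
    printf("%d\n", scan_relop(&p));      /* prints 1 (LE) */
    return 0;
}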
FINITE AUTOMATON
· A recognizer for a language is a program that takes a string x, and answers “yes”
if x is a sentence of that language, and “no” otherwise.
· We call the recognizer of the tokens as a finite automaton.
· A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
· This means that we may use a deterministic or non-deterministic automaton as a
lexical analyzer.
· Both deterministic and non-deterministic finite automaton recognize regular sets.
· Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
· First, we define regular expressions for tokens; Then we convert them into a DFA
to get a lexical analyzer for our tokens.
A finite automaton accepts a string x if and only if there is a path from the start state to
one of the accepting states such that the edge labels along this path spell out x.
Converting RE to NFA
· This is one way to convert a regular expression into an NFA.
· There can be other (more efficient) ways for the conversion.
· Thompson's Construction is a simple and systematic method.
· It guarantees that the resulting NFA will have exactly one final state, and one start
state.
· Construction starts from simplest parts (alphabet symbols).
· To create a NFA for a complex regular expression, NFAs of its sub-expressions
are combined to create its NFA.
· To recognize an empty string ε, a single ε-transition from the start state to the final
state suffices; to recognize a symbol a of the alphabet, a single a-transition is used.
In the composite constructions (union r1|r2, concatenation r1r2, and closure r1*),
N(r1) and N(r2) denote the NFAs for the regular expressions r1 and r2.
Example:
For a RE (a|b) * a, the NFA construction is shown below.
· From the point of view of the input, any two states that are connected by an
ε-transition may as well be the same, since we can move from one to the other without
consuming any character. Thus states which are connected by an ε-transition will be
represented by the same state in the DFA.
· If it is possible to have multiple transitions based on the same symbol, then we
can regard a transition on a symbol as moving from a state to a set of states (i.e., the
union of all those states reachable by a transition on the current symbol). Thus these
states will be combined into a single DFA state.
To perform this operation, let us define two functions:
· The ε-closure function takes a state and returns the set of states reachable from it
based on (one or more) ε-transitions. Note that this will always include the state itself. We
should be able to get from a state to any state in its ε-closure without consuming any input.
· The function move takes a state and a character, and returns the set of states
reachable by one transition on this character.
We can generalise both these functions to apply to sets of states by taking the union of
the application to the individual states.
put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1, a))
        if (S2 is not in DS) then add S2 into DS as an unmarked state
        transfunc[S1, a] ← S2
    end
end
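As a rough illustration of the two helper functions, here is a minimal C sketch using
bitmasks for state sets. The NFA tables encode Thompson's construction for a|b
(states 0–5, with ε-edges 0→1, 0→3, 2→5, 4→5 and state 5 accepting); all names in
this sketch are our own assumptions.

#include <stdio.h>

#define N 6   /* states 0..5 of the assumed NFA for a|b */

/* eps[s]: bitmask of states reachable from s by one epsilon-transition */
static unsigned eps[N]      = { 0x0A, 0, 0x20, 0, 0x20, 0 };
/* delta[s][c]: states reachable from s on symbol c (c: 0='a', 1='b') */
static unsigned delta[N][2] = { {0,0}, {0x04,0}, {0,0}, {0,0x10}, {0,0}, {0,0} };

/* epsilon-closure of a state set: iterate until no new state is added */
unsigned eps_closure(unsigned set) {
    unsigned closure = set, prev;
    do {
        prev = closure;
        for (int s = 0; s < N; s++)
            if (closure & (1u << s))
                closure |= eps[s];
    } while (closure != prev);
    return closure;
}

/* move: union of the one-symbol transitions from every state in the set */
unsigned move_set(unsigned set, int c) {
    unsigned out = 0;
    for (int s = 0; s < N; s++)
        if (set & (1u << s))
            out |= delta[s][c];
    return out;
}

int main(void) {
    unsigned start = eps_closure(1u << 0);            /* DFA start state {0,1,3} */
    unsigned on_a  = eps_closure(move_set(start, 0)); /* DFA transition on 'a' */
    printf("start = %#x, on 'a' -> %#x, accepting: %s\n",
           start, on_a, (on_a & (1u << 5)) ? "yes" : "no");
    return 0;
}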
Lex specifications:
A Lex program (the .l file) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants,
and regular definitions.
2. The translation rules are statements of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are needed by the actions.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the
book.
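A minimal Lex specification in this three-part format might look like the following
sketch; the patterns and print actions are illustrative choices of our own, not taken from
the book.

%{
#include <stdio.h>
%}
digit    [0-9]
letter   [A-Za-z]
id       {letter}({letter}|{digit})*
%%
{digit}+    { printf("NUM: %s\n", yytext); }
{id}        { printf("ID: %s\n", yytext); }
[ \t\n]+    { /* skip white space */ }
.           { printf("OTHER: %s\n", yytext); }
%%
int yywrap(void) { return 1; }
int main(void) { yylex(); return 0; }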
Input buffering
Lexical Analysis has to access secondary memory each time to identify tokens. It is
time-consuming and costly. So, the input strings are stored in a buffer and then
scanned by the lexical analyzer.
Lexical Analysis scans input string from left to right one character at a time to identify
tokens. It uses two pointers to scan tokens −
· Begin Pointer (bptr) − It points to the beginning of the string to be read.
· Look Ahead Pointer (lptr) − It moves ahead to search for the end of the token.
Example − For the statement int a, b;
1. Both pointers start at the beginning of the string, which is stored in the buffer.
2. The look-ahead pointer scans ahead until it finds the end of the token ("int").
3. After processing the token ("int"), both pointers are set to the next token ('a'), and
this process is repeated for the whole program.
A buffer can be divided into two halves. If the look-ahead pointer reaches the end of the
first half, the second half is filled with new characters to be read. If the look-ahead
pointer reaches the right end of the second half, the first half is refilled with new
characters, and so on.
Sentinels − Sentinels are used for the boundary check: each time the forward pointer is
moved, a check is needed to ensure that one half of the buffer has not been moved off
the end; if it has, then the other half must be reloaded.
Buffer Pairs − A specialized buffering technique can decrease the amount of overhead
needed to process an input character. It uses two buffers, each of N-character size,
which are reloaded alternately.
Two pointers, lexemeBegin and forward, are maintained. LexemeBegin points to the
start of the current lexeme, which is yet to be discovered. Forward scans ahead until a
match for a pattern is discovered. Once a lexeme is found, lexemeBegin is set to the
character directly after the lexeme just discovered, and forward is set to the character at
its right end.
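A sketch of the buffer-pair scheme with sentinels in C follows; the buffer layout, the
sentinel character EOF_CH, and the reload() helper are all assumptions of this sketch.

#include <stdio.h>

#define N 16
#define EOF_CH '\0'

/* Two N-byte halves, each followed by one sentinel byte. */
static char buf[2 * N + 2];
static char *forward;

static void reload(char *half) {           /* refill one half from stdin */
    size_t n = fread(half, 1, N, stdin);
    half[n] = EOF_CH;                      /* sentinel right after the data */
}

char next_char(void) {
    char c = *forward++;
    if (c == EOF_CH) {
        if (forward == buf + N + 1) {      /* sentinel of the first half */
            reload(buf + N + 1);           /* refill the second half */
            c = *forward++;
        } else if (forward == buf + 2 * N + 2) {  /* sentinel of the second half */
            reload(buf);                   /* wrap around: refill the first half */
            forward = buf;
            c = *forward++;
        }
        /* otherwise the sentinel marks the real end of the input */
    }
    return c;
}

int main(void) {
    reload(buf);                           /* prime the first half */
    forward = buf;
    for (char c = next_char(); c != EOF_CH; c = next_char())
        putchar(c);                        /* echo the buffered input */
    return 0;
}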
Preliminary Scanning − Certain processes are best performed as characters are moved
from the source file to the buffer. For example, comments can be deleted. Languages like
FORTRAN, which ignore blanks, can delete them from the character stream, and strings
of several blanks can be collapsed into one blank. Pre-processing the character stream
being subjected to lexical analysis saves the trouble of moving the look-ahead pointer
back and forth over a string of blanks.
UNIT – II
Symbol tables: Symbol table entries – List data structures for symbol table – Hash
tables – Representation of scope information – Syntax Analysis: Role of parser –
Context free grammar – Writing a grammar – Top down parsing – Simple bottom up
parsing – Shift-reduce parsing.
Symbol Table
Symbol table is an important data structure created and maintained by compilers in order
to store information about the occurrence of various entities such as variable names,
function names, objects, classes, interfaces, etc. Symbol table is used by both the
analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
to store the names of all entities in a structured form in one place, to verify if a variable
has been declared, to implement type checking, and to determine the scope of a name.
A symbol table is simply a table which can be either linear or a hash table. It maintains
an entry for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable
declaration:
static int interest;
then it should store the entry:
<interest, int, static>
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be
implemented as an unordered list, which is easy to code but suitable only for small
tables. A symbol table can be implemented in one of the following ways:
· Linear (sorted or unsorted) list
· Binary search tree
· Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code
symbol itself is treated as a key for the hash function and the return value is the
information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the
compiler where tokens are identified and names are stored in the table. This operation is
used to add information in the symbol table about unique names occurring in the source
code. The format or structure in which the names are stored depends upon the compiler
in hand.
An attribute for a symbol in the source code is the information associated with that
symbol. This information contains the value, state, scope, and type about the symbol.
The insert() function takes the symbol and its attributes as arguments and stores the
information in the symbol table.
For example:
int a;
insert(a, int);
lookup()
The format of lookup() function varies according to the programming language. The
basic format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the
symbol exists in the symbol table, it returns its attributes stored in the table.
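A minimal sketch of insert() and lookup() over a chained hash table in C is shown
below; the single type-string attribute and the table size are assumptions of this sketch,
and real compilers store richer attribute records.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

typedef struct Symbol {
    char *name;
    char *type;              /* attribute of the symbol (illustrative) */
    struct Symbol *next;     /* chaining for collisions */
} Symbol;

static Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;   /* index in 0 .. TABLE_SIZE-1 */
}

void insert(const char *name, const char *type) {
    unsigned i = hash(name);
    Symbol *sym = malloc(sizeof *sym);
    sym->name = malloc(strlen(name) + 1); strcpy(sym->name, name);
    sym->type = malloc(strlen(type) + 1); strcpy(sym->type, type);
    sym->next = table[i];    /* prepend to the bucket's chain */
    table[i] = sym;
}

Symbol *lookup(const char *name) {
    for (Symbol *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0) return s;
    return NULL;             /* symbol does not exist */
}

int main(void) {
    insert("a", "int");
    Symbol *s = lookup("a");
    printf("%s : %s\n", s->name, s->type);   /* prints a : int */
    return 0;
}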
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be
accessed by all the procedures and scope symbol tables that are created for each scope
in the program.
To determine the scope of a name, symbol tables are arranged in hierarchical structure as
shown in the example below:
...
int value=10;
void pro_one()
{
int one_1;
int one_2;
{ \
int one_3; |_ inner scope 1
int one_4; |
} /
int one_5;
{ \
int one_6; |_ inner scope 2
int one_7; |
} /
}
void pro_two()
{
int two_1;
int two_2;
{ \
int two_3; |_ inner scope 3
int two_4; |
} /
int two_5;
}
...
The global symbol table contains names for one global variable (int value) and two
procedure names, which should be available to all the child nodes shown above. The
names mentioned in the pro_one symbol table (and all its child tables) are not available
for pro_two symbols and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyzer and
whenever a name needs to be searched in a symbol table, it is searched using the
following algorithm:
· first a symbol will be searched in the current scope, i.e. current symbol table.
· if a name is found, then search is completed, else it will be searched in the parent
symbol table until,
· either the name is found or global symbol table has been searched for the name.
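The search algorithm above can be sketched in C by linking each scope's table to its
parent; the structure and field names here are illustrative assumptions.

#include <stdio.h>
#include <string.h>

typedef struct Entry { char name[32]; struct Entry *next; } Entry;

typedef struct Scope {
    Entry *entries;          /* symbols declared in this scope */
    struct Scope *parent;    /* enclosing scope; NULL for the global table */
} Scope;

/* Search the current scope first, then climb toward the global table. */
Entry *lookup_scoped(Scope *current, const char *name) {
    for (Scope *s = current; s != NULL; s = s->parent)
        for (Entry *e = s->entries; e != NULL; e = e->next)
            if (strcmp(e->name, name) == 0)
                return e;    /* found in the nearest enclosing scope */
    return NULL;             /* searched up to the global table: not found */
}

int main(void) {
    Entry g = { "value", NULL };
    Scope global = { &g, NULL };
    Entry l = { "one_1", NULL };
    Scope inner = { &l, &global };
    /* "value" is found by climbing from the inner scope to the global one */
    printf("%s\n", lookup_scoped(&inner, "value") ? "found" : "not found");
    return 0;
}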
Operations of Symbol table – The basic operations defined on a symbol table include:
allocate (create a new empty table), free (remove all entries and release storage),
insert (add a name), lookup (search for a name), and setting or getting the attributes
of a symbol.
Implementation of Symbol table
Following are commonly used data structures for implementing symbol table:-
1. List
o In this method, an array is used to store names and associated information.
o A pointer "available" is maintained at the end of all stored records, and new
names are added in the order in which they arrive.
o To search for a name, we scan from the beginning of the list up to the available
pointer; if the name is not found, we get an error "use of undeclared name".
o While inserting a new name, we must ensure that it is not already present;
otherwise, an error occurs, i.e. "multiple defined names".
o Insertion is fast O(1), but lookup is slow for large tables – O(n) on average.
o The advantage is that it takes a minimum amount of space.
2. Linked List
o This implementation is using a linked list. A link field is added to each
record.
o Searching for names is done in the order indicated by the link fields.
o A pointer “First” is maintained to point to the first record of the symbol
table.
o Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
3. Hash Table
o In the hashing scheme, two tables are maintained – a hash table and a symbol
table; this is the most commonly used method of implementing symbol tables.
o A hash table is an array with an index range: 0 to table size – 1. These
entries are pointers pointing to the names of the symbol table.
o To search for a name we use a hash function that will result in an integer
between 0 to table size – 1.
o Insertion and lookup can be made very fast – O(1).
o The advantage is that quick search is possible; the disadvantage is that
hashing is complicated to implement.
4. Binary Search Tree
o Another approach to implementing a symbol table is to use a binary search
tree i.e. we add two link fields i.e. left and right child.
o All names are created as child of the root node that always follows the
property of the binary search tree.
o Insertion and lookup are O(log2 n) on average.
SYNTAX ANALYSIS
A parser for a grammar is a program that takes as input a string w (a stream of
tokens obtained from the lexical analyzer) and produces as output either a
parse tree for w, if w is a valid sentence of the grammar, or an error message
indicating that w is not a valid sentence of the given grammar. The goal of the
parser is to determine the syntactic validity of a source string. If the string is
valid, a tree is built for use by the subsequent phases of the compiler. The tree
reflects the sequence of derivations or reductions used during parsing; hence, it
is called a parse tree. If the string is invalid, the parser has to issue diagnostic
messages identifying the nature and cause of the errors in the string. Every
elementary subtree in the parse tree corresponds to a production of the
grammar.
There are two types of parsers:
a. Top-down parser: which builds parse trees from the top (root) to the
bottom (leaves).
b. Bottom-up parser: which builds parse trees from the leaves up towards the root.
Derivations
In general, a derivation step is
αAβ ⇒ αγβ
(αγβ is a sentential form) if there is a production rule A → γ in our
grammar, where α and β are arbitrary strings of terminal and non-terminal
symbols. A sequence α1 ⇒ α2 ⇒ ... ⇒ αn means that α1 derives αn. There are
two kinds of derivation:
1. At each derivation step, we can choose any of the non-terminals in the
sentential form of G for the replacement.
2. If we always choose the left-most non-terminal in each derivation step, the
derivation is called a left-most derivation; choosing the right-most non-terminal
gives a right-most derivation.
Example:
E → E + E | E – E | E * E | E / E | - E
E → ( E )
E → id
Leftmost derivation:
E ⇒ E+E ⇒ E*E+E ⇒ id*E+E ⇒ id*id+E ⇒ id*id+id
The string w = id*id+id is derived from the grammar and consists entirely of
terminal symbols.
Rightmost derivation:
E ⇒ E+E ⇒ E+E*E ⇒ E+E*id ⇒ E+id*id ⇒ id+id*id
Given grammar G : E → E+E | E*E | ( E ) | - E | id
Sentence to be derived : – (id+id)
LEFTMOST DERIVATION        RIGHTMOST DERIVATION
E ⇒ - E                    E ⇒ - E
  ⇒ - ( E )                  ⇒ - ( E )
  ⇒ - ( E+E )                ⇒ - ( E+E )
  ⇒ - ( id+E )               ⇒ - ( E+id )
  ⇒ - ( id+id )              ⇒ - ( id+id )
· Strings that appear in a leftmost derivation are called left sentential forms.
· Strings that appear in a rightmost derivation are called right sentential forms.
Sentential forms:
· Given a grammar G with start symbol S, if S ⇒ α, where α may
contain non-terminals or terminals, then α is called a sentential form
of G.
Yield or frontier of tree:
· Each interior node of a parse tree is a non-terminal. The children of a
node can be terminals or non-terminals of the sentential form, read
from left to right. The sentential form in the parse tree is called the yield
or frontier of the tree.
PARSE TREE
· Inner nodes of a parse tree are non-terminal symbols.
· The leaves of a parse tree are terminal symbols.
· A parse tree can be seen as a graphical representation of a derivation.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be
an ambiguous grammar.
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use
precedence of operators as follows:
^ (right to left)
/, * (left to right)
-, + (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Consider this example, G: stmt → if expr then stmt | if expr then stmt
else stmt | other. This grammar is ambiguous, since the string if E1 then if
E2 then S1 else S2 has two parse trees (the else can attach to either if).
Algorithm to eliminate left recursion:
1. Arrange the non-terminals in some order A1, A2, …, An.
2. For each non-terminal, eliminate immediate left recursion: a left-recursive
pair of productions A → Aα | β is replaced by
A → βA'
A' → αA' | ε
Left factoring:
When a nonterminal has two or more productions with a common prefix, e.g.
A → αβ1 | αβ2, left factoring rewrites them as A → αA', A' → β1 | β2, deferring the
choice until enough of the input has been seen.
Top-down parsing with backtracking – Example:
Consider the grammar S → cAd, A → ab | a and the input string w = cad.
Step 1:
The parse tree initially consists of the root S, expanded using the production S → cAd.
Step 2:
The leftmost leaf 'c' matches the first symbol of w, so advance the input pointer to
the second symbol of w, 'a', and consider the next leaf 'A'. Expand A using the first
alternative (A → ab).
Step 3:
The second symbol 'a' of w also matches the second leaf of the tree. So advance
the input pointer to the third symbol of w, 'd'. But the third leaf of the tree is 'b',
which does not match the input symbol 'd'.
Hence discard the chosen production and reset the pointer to the second position. This
is called backtracking.
Step 4:
Now try the second alternative for A (A → a).
RECURSIVE DESCENT PARSING
Recursive descent procedures for the grammar E → TE', E' → +TE' | ε,
T → FT', T' → *FT' | ε, F → (E) | id:
Procedure E( )
begin
    T( );
    EPRIME( );
end
Procedure EPRIME( )
begin
    if input-symbol = '+' then begin
        ADVANCE( );
        T( );
        EPRIME( );
    end
end
Procedure T( )
begin
    F( );
    TPRIME( );
end
Procedure TPRIME( )
begin
    if input-symbol = '*' then begin
        ADVANCE( );
        F( );
        TPRIME( );
    end
end
Procedure F( )
begin
    if input-symbol = 'id' then
        ADVANCE( )
    else if input-symbol = '(' then begin
        ADVANCE( );
        E( );
        if input-symbol = ')' then
            ADVANCE( )
        else ERROR( );
    end
    else ERROR( );
end
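These procedures translate almost directly into C. The following minimal sketch parses
single-letter identifiers combined with + and *; all helper names are our own.

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *ip;                 /* input pointer */

static void error(void) { printf("syntax error at '%c'\n", *ip); exit(1); }
static void advance(void) { ip++; }

static void E(void);

static void F(void) {
    if (isalpha((unsigned char)*ip)) advance();      /* F -> id */
    else if (*ip == '(') {                           /* F -> ( E ) */
        advance(); E();
        if (*ip == ')') advance(); else error();
    } else error();
}

static void Tprime(void) {
    if (*ip == '*') { advance(); F(); Tprime(); }    /* T' -> * F T' */
    /* else T' -> epsilon */
}

static void T(void) { F(); Tprime(); }               /* T -> F T' */

static void Eprime(void) {
    if (*ip == '+') { advance(); T(); Eprime(); }    /* E' -> + T E' */
    /* else E' -> epsilon */
}

static void E(void) { T(); Eprime(); }               /* E -> T E' */

int main(void) {
    ip = "a+b*c";
    E();
    if (*ip == '\0') printf("accepted\n"); else error();
    return 0;
}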
Stack implementation (calls made while parsing id+id*id):
CALL            REMAINING INPUT
E( )            id+id*id
T( )            id+id*id
F( )            id+id*id
ADVANCE( )      +id*id
TPRIME( )       +id*id
EPRIME( )       +id*id
ADVANCE( )      id*id
T( )            id*id
F( )            id*id
ADVANCE( )      *id
TPRIME( )       *id
ADVANCE( )      id
F( )            id
ADVANCE( )      (input consumed)
TPRIME( )       (input consumed)
EPRIME( )       (input consumed)
PREDICTIVE PARSING
The table-driven predictive parser has an input buffer, stack, a parsing table and an
output stream.
Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of
the stack. Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a
terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of
stack, and a, the current input symbol. These two symbols determine the parser
action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of
parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer
to the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing
table M; if the entry is a production, X is replaced on the stack by the body of
that production, and if the entry is error, an error recovery routine is called.
The predictive parsing program behaves as follows:
set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error( )
    else /* X is a non-terminal */
        if M[X, a] = X → Y1 Y2 … Yk then begin
            pop X from the stack;
            push Yk, Yk-1, …, Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 … Yk
        end
        else error( )
until X = $
Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating left recursion, the grammar is:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
First( ):
FIRST(E) = { ( , id }
FIRST(E') = { + , ε }
FIRST(T) = { ( , id }
FIRST(T') = { * , ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E') = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T') = { +, $, ) }
FOLLOW(F) = { +, *, $, ) }
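From these FIRST and FOLLOW sets, the predictive parsing table M[A, a] for this
grammar works out as follows (blank entries are errors):

        id          +            *            (           )          $
E       E → TE'                               E → TE'
E'                  E' → +TE'                             E' → ε     E' → ε
T       T → FT'                               T → FT'
T'                  T' → ε       T' → *FT'                T' → ε     T' → ε
F       F → id                                F → (E)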
LL(1) grammar:
If the parsing table entries are single entries, so that each location has not more than
one entry, the grammar is called an LL(1) grammar.
Consider the following grammar: S → iEtSS' | a, S' → eS | ε, E → b. The table
entry M[S', e] contains both S' → eS and S' → ε. Since there is more than one
production in this entry, the grammar is not an LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going
towards the root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a
parse tree for an input string beginning at the leaves (the bottom) and
working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B → d
The sentence to be recognized is abbcde. The reductions are:
abbcde  →  aAbcde   (A → b)
aAbcde  →  aAde     (A → Abc)
aAde    →  aABe     (B → d)
aABe    →  S        (S → aABe)
These reductions trace out the rightmost derivation
S → aABe → aAde → aAbcde → abbcde in reverse.
Handles:
A handle of a string is a substring that matches the right side of a production, and
whose reduction to the non-terminal on the left side of that production represents
one step along the reverse of a rightmost derivation.
Example:
Consider the grammar:
E → E+E
E → E*E
E → (E)
E → id
and the input string id1+id2*id3. The rightmost derivation is:
E → E+E
  → E+E*E
  → E+E*id3
  → E+id2*id3
  → id1+id2*id3
• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within a stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error – The parser discovers that a syntax error has occurred and calls an
error recovery routine.
Conflicts in shift-reduce parsing:
1. Shift-reduce conflict: the parser cannot decide whether to shift or to reduce.
Example: in the dangling-else grammar stmt → if expr then stmt | if expr then
stmt else stmt, with "if expr then stmt" on top of the stack and else as the next
input symbol, the parser can either reduce by the first production or shift else,
and has no way to decide.
2. Reduce-reduce conflict: the parser cannot decide which of two reductions to make.
Consider the grammar:
M → R+R | R+c | R
R → c
and input c+c:
STACK    INPUT    ACTION
$        c+c$     shift
$ c      +c$      reduce by R → c
$ R      +c$      shift
$ R+     c$       shift
$ R+c    $        reduce by R → c (heading for M → R+R) or reduce by M → R+c?
At this point the parser cannot tell which reduction to make; this is a
reduce-reduce conflict.
Viable prefixes:
· α is a viable prefix of the grammar if there is a w such that αw is a right
sentential form.
· The set of prefixes of right sentential forms that can appear on the stack of a
shift-reduce parser are called viable prefixes.
· The set of viable prefixes is a regular language.
OPERATOR-PRECEDENCE PARSING
Example:
E → EAE | (E) | -E | id
A → + | - | * | / | ↑
Since the right side EAE has three consecutive non-terminals, the grammar is
rewritten as follows:
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
Three disjoint precedence relations are defined between pairs of terminals:
<· less than
≐  equal to
·> greater than
The following relations are set between operators θ and the other terminals:
) ·> θ ,  θ ·> )
θ ·> $ ,  $ <· θ
Also make
( ≐ ) ,  ( <· ( ,  ) ·> ) ,  ( <· id ,  id ·> ) ,  $ <· id ,  id ·> $ ,
$ <· ( ,  ) ·> $
Example:
For E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id, the precedence relations are
given in the following table, assuming that ↑ has highest precedence and is
right-associative, * and / have the next higher precedence and are left-associative,
and + and - have lowest precedence and are left-associative.
Method: Initially the stack contains $ and the input buffer the string w$. To
parse, we execute the following program:
(1) repeat forever
(2)   if only $ is on the stack and only $ is in the input then
(3)     accept
      else begin
(4)     let a be the topmost terminal symbol on the stack and b the current input symbol;
(5)     if a <· b or a ≐ b then begin /* shift */
(6)       push b onto the stack;
(7)       advance the input to the next symbol
(8)     end
(9)     else if a ·> b then /* reduce */
(10)      repeat
(11)        pop the stack
(12)      until the top stack terminal is related by <· to the terminal most recently popped
(13)    else error( )
      end
Operator precedence parsing uses a stack and a precedence relation table for
its implementation of the above algorithm. It is a shift-reduce parsing method
containing all four actions: shift, reduce, accept and error.
The initial configuration of an operator precedence parsing is
STACK INPUT
$ w$
Example:
Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id. The input string
is id+id*id. The implementation is as follows:
STACK      INPUT        ACTION
$          id+id*id$    $ <· id : shift
$ id       +id*id$      id ·> + : pop id (reduce by E → id)
$          +id*id$      $ <· + : shift
$ +        id*id$       + <· id : shift
$ + id     *id$         id ·> * : pop id (reduce by E → id)
$ +        *id$         + <· * : shift
$ + *      id$          * <· id : shift
$ + * id   $            id ·> $ : pop id (reduce by E → id)
$ + *      $            * ·> $ : pop * (reduce by E → E*E)
$ +        $            + ·> $ : pop + (reduce by E → E+E)
$          $            accept
Advantages of operator precedence parsing:
1. It is easy to implement.
2. Once an operator precedence relation is made between all pairs of terminals
of a grammar, the grammar can be ignored; the grammar is not referred to
anymore during implementation.
Disadvantages:
1. It is hard to handle tokens like the minus sign (-), which has two different
precedences (unary and binary).
2. Only a small class of grammars can be parsed using an operator-precedence parser.
LR PARSERS
An LR parser reads its input from left to right and constructs a rightmost
derivation in reverse.
Advantages of LR parsing:
· It recognizes virtually all programming language constructs for which a
CFG can be written.
· It is an efficient non-backtracking shift-reduce parsing method.
· The class of grammars that can be parsed using LR methods is a proper
superset of the class of grammars that can be parsed with predictive parsers.
· It detects a syntactic error as soon as possible.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a typical
programming-language grammar; a specialized tool (an LR parser generator) is needed.
Types of LR parsing methods:
1. SLR – Simple LR
· Easiest to implement, least powerful.
2. CLR – Canonical LR
· Most powerful, most expensive.
3. LALR – Look-Ahead LR
· Intermediate in size and cost between the other two methods.
Model of an LR parser: the parser has an input buffer holding a1 … ai … an $, a
stack holding s0 X1 s1 … Xm sm (where each si is a state and each Xi a grammar
symbol, with sm on top), a driver program (the LR parsing program), an output
stream, and a parsing table with two parts: action and goto.
Goto: The function goto takes a state and grammar symbol as arguments and
produces a state.
LR Parsing algorithm:
Method: Initially, the parser has s0 on its stack, where s0 is the initial state,
and w$ in the input buffer. The parser then executes the following program:
set ip to point to the first input symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a, then s', on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s' be the state now on top of the stack;
        push A, then goto[s', A], on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error( )
end
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position
of the right side. For example, the production A → XYZ yields the four items:
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items
constructed from I by two rules: (1) initially, every item in I is added to
closure(I); (2) if A → α.Bβ is in closure(I) and B → γ is a production, then add
the item B → .γ to closure(I), if it is not already there. Apply this rule until no
more new items can be added.
Goto operation:
goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be
the closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I.
Construction of the SLR parsing table:
Input: An augmented grammar G'.
Output: The SLR parsing table functions action and goto for G'.
Method:
1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as
follows:
(a) If [A → α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j".
Here a must be a terminal.
(b) If [A → α.] is in Ii, then set action[i, a] to "reduce A → α" for all a in
FOLLOW(A).
(c) If [S' → S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by the above rules, we say the grammar is not
SLR(1).
3. The goto transitions for state i are constructed for all non-terminals A using
the rule: if goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error".
5. The initial state of the parser is the one constructed from the set of items
containing [S' → .S].
Example: Consider the grammar
E → E+T | T
T → T*F | F
F → (E) | id
Augmented grammar:
E' → E
E → E+T
E → T
T → T*F
T → F
F → (E)
F → id
Canonical collection of sets of LR(0) items:
I0 : E' → .E
     E → .E+T
     E → .T
     T → .T*F
     T → .F
     F → .(E)
     F → .id
GOTO ( I0 , E ) = I1 : E' → E.
                       E → E.+T
GOTO ( I0 , T ) = I2 : E → T.
                       T → T.*F
GOTO ( I0 , F ) = I3 : T → F.
GOTO ( I0 , ( ) = I4 : F → (.E)
                       E → .E+T
                       E → .T
                       T → .T*F
                       T → .F
                       F → .(E)
                       F → .id
GOTO ( I0 , id ) = I5 : F → id.
GOTO ( I1 , + ) = I6 : E → E+.T
                       T → .T*F
                       T → .F
                       F → .(E)
                       F → .id
GOTO ( I2 , * ) = I7 : T → T*.F
                       F → .(E)
                       F → .id
GOTO ( I4 , E ) = I8 : F → (E.)
                       E → E.+T
GOTO ( I4 , T ) = I2 ;  GOTO ( I4 , F ) = I3 ;  GOTO ( I4 , ( ) = I4 ;  GOTO ( I4 , id ) = I5
GOTO ( I6 , T ) = I9 : E → E+T.
                       T → T.*F
GOTO ( I6 , F ) = I3 ;  GOTO ( I6 , ( ) = I4 ;  GOTO ( I6 , id ) = I5
GOTO ( I7 , F ) = I10 : T → T*F.
GOTO ( I7 , ( ) = I4 ;  GOTO ( I7 , id ) = I5
GOTO ( I8 , ) ) = I11 : F → (E).
GOTO ( I8 , + ) = I6
GOTO ( I9 , * ) = I7
FOLLOW (E) = { $ , ) , + }
FOLLOW (T) = { $ , + , ) , * }
FOLLOW (F) = { * , + , ) , $ }
SLR parsing table:
STATE   ACTION                                GOTO
        id     +     *     (     )     $     E    T    F
0       s5                 s4                 1    2    3
1              s6                      acc
2              r2    s7          r2    r2
3              r4    r4          r4    r4
4       s5                 s4                 8    2    3
5              r6    r6          r6    r6
6       s5                 s4                      9    3
7       s5                 s4                           10
8              s6                s11
9              r1    s7          r1    r1
10             r3    r3          r3    r3
11             r5    r5          r5    r5
(Productions are numbered 1: E → E+T, 2: E → T, 3: T → T*F, 4: T → F,
5: F → (E), 6: F → id.)
Stack implementation:
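For example, using the table above, the parser makes the following moves on input
id+id*id:

STACK                   INPUT       ACTION
0                       id+id*id$   shift 5
0 id 5                  +id*id$     reduce by F → id
0 F 3                   +id*id$     reduce by T → F
0 T 2                   +id*id$     reduce by E → T
0 E 1                   +id*id$     shift 6
0 E 1 + 6               id*id$      shift 5
0 E 1 + 6 id 5          *id$        reduce by F → id
0 E 1 + 6 F 3           *id$        reduce by T → F
0 E 1 + 6 T 9           *id$        shift 7
0 E 1 + 6 T 9 * 7       id$         shift 5
0 E 1 + 6 T 9 * 7 id 5  $           reduce by F → id
0 E 1 + 6 T 9 * 7 F 10  $           reduce by T → T*F
0 E 1 + 6 T 9           $           reduce by E → E+T
0 E 1                   $           accept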
UNIT – III
Syntax-Directed Translation -Definition
The translation techniques in this chapter will be applied to type checking and
intermediate-code generation. The techniques are also useful for implementing little
languages for specialized tasks; this chapter includes an example from typesetting.
Consider the production and semantic rule
E → E1 + T    { E.code = E1.code || T.code || '+' }
This production has two nonterminals, E and T; the subscript in E1 distinguishes the
occurrence of E in the production body from the occurrence of E as the head. Both E
and T have a string-valued attribute code. The semantic rule specifies that the string
E.code is formed by concatenating E1.code, T.code, and the character '+'. While the rule
makes it explicit that the translation of E is built up from the translations of E1, T, and
'+', it may be inefficient to implement the translation directly by manipulating strings.
An SDT with the same effect uses a semantic action instead:
E → E1 + T { print '+' }
By convention, semantic actions are enclosed within curly braces. (If curly braces occur
as grammar symbols, we enclose them within single quotes, as in ' { ' and '}'.) The
position of a semantic action in a production body determines the order in which the
action is executed. In production (5.2), the action occurs at the end, after all the
grammar symbols; in general, semantic actions may occur at any position in a
production body.
Between the two notations, syntax-directed definitions can be more readable, and
hence more useful for specifications. However, translation schemes can be more
efficient, and hence more useful for implementations.
Example
E → E+T    { E.val = E.val + T.val }
T → T*F    { T.val = T.val * F.val }
F → INTLIT { F.val = INTLIT.lexval }
For understanding translation rules further, we take the first SDT rule, augmented to
the [E → E+T] production. The translation rule in consideration has val as an attribute
for both of the non-terminals E and T. The right-hand side of the translation rule
corresponds to attribute values of the right-side nodes of the production rule, and
vice versa. Generalizing, SDT are augmented rules to a CFG that associate (1) a set
of attributes with every node of the grammar and (2) a set of translation rules with
every production rule, using attributes, constants, and lexical values.
Let’s take a string to see how semantic analysis happens – S = 2+3*4. Parse tree
corresponding to S would be
To evaluate translation rules, we can employ one depth-first search traversal on the
parse tree. This is possible only because SDT rules don’t impose any specific order on
evaluation until children’s attributes are computed before parents for a grammar having
all synthesized attributes. Otherwise, we would have to figure out the best-suited plan to
traverse through the parse tree and evaluate all the attributes in one or more traversals.
For better understanding, we will move bottom-up in the left to right fashion for
computing the translation rules of our example.
The above example shows how semantic analysis could happen. The flow of
information happens bottom-up and all the children's attributes are computed before
parents, as discussed above. Right-hand side nodes are sometimes annotated with
their computed attribute values.
Synthesized Attributes are such attributes that depend only on the attribute values of
children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val
corresponding to node E. If all the semantic attributes in an augmented grammar are
synthesized, one depth-first search traversal in any order is sufficient for the semantic
analysis phase.
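A depth-first evaluation of synthesized attributes can be sketched in C as follows; the
Node layout and helper names are illustrative choices of our own.

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char op;                  /* '+', '*', or 'n' for an INTLIT leaf */
    int val;                  /* the synthesized attribute */
    struct Node *left, *right;
} Node;

/* Depth-first traversal: children are evaluated before the parent,
   exactly as E -> E+T { E.val = E.val + T.val } requires. */
int eval(Node *n) {
    if (n->op == 'n') return n->val;          /* F -> INTLIT */
    int l = eval(n->left), r = eval(n->right);
    n->val = (n->op == '+') ? l + r : l * r;  /* E.val / T.val rules */
    return n->val;
}

static Node *leaf(int v) {
    Node *n = calloc(1, sizeof *n);
    n->op = 'n'; n->val = v;
    return n;
}
static Node *node(char op, Node *l, Node *r) {
    Node *n = calloc(1, sizeof *n);
    n->op = op; n->left = l; n->right = r;
    return n;
}

int main(void) {
    /* parse tree for S = 2 + 3 * 4 */
    Node *root = node('+', leaf(2), node('*', leaf(3), leaf(4)));
    printf("%d\n", eval(root));   /* prints 14 */
    return 0;
}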
Inherited Attributes are such attributes that depend on parent and/or sibling’s
attributes.
Thus [ Ep → E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E and Ep are the
same production symbol, annotated to differentiate between parent and child, has an
inherited attribute val corresponding to node T.
TYPE CHECKING
A compiler must check that the source program follows both syntactic and semantic
conventions of the source language. This checking, called static checking, detects and
reports programming errors.
A type checker verifies that the type of a construct matches that expected by its
context. For example: arithmetic operator mod in Pascal requires integer operands, so a
type checker verifies that the operands of mod have type integer. Type information
gathered by a type checker may be needed when code is generated.
Type Systems
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to
language constructs.
For example : “ if both operands of the arithmetic operators of +,- and * are of type
integer, then the result is of type integer ”
Type Expressions
Constructors include:
Arrays : If T is a type expression then array (I,T) is a type expression denoting the type
of an array with elements of type T and index set I.
Records : The difference between a record and a product is that the fields of a record
have names. The record type constructor will be applied to a tuple formed from field
names and field types.
For example, the Pascal declaration
type row = record
    address : integer;
    lexeme : array [1..15] of char
end;
var table : array [1..101] of row;
declares the type name row, representing the type expression record((address X integer)
X (lexeme X array(1..15,char))), and the variable table to be an array of records of this
type.
Type systems
A type system is a collection of rules for assigning type expressions to the various parts
of a program. A type checker implements a type system. It is specified in a syntax-
directed manner. Different type systems may be used by different compilers or
processors of the same language.
Static and Dynamic Checking of Types
Checking done by a compiler is said to be static, while checking done when the target
program runs is termed dynamic. Any check can be done dynamically, if the target code
carries the type of an element along with the value of that element.
Error Recovery
Since type checking has the potential for catching errors in program, it is desirable for
type checker to recover from errors, so it can check the rest of the input. Error handling
has to be designed into the type system right from the start; the type checking rules must
be prepared to cope with errors.
A type checker for a simple language checks the type of each identifier. The type
checker is a translation scheme that synthesizes the type of each expression from the
types of its subexpressions. The type checker can handle arrays, pointers, statements
and functions.
A Simple Language
The following grammar generates programs consisting of a sequence of declarations D
followed by a single expression E:
P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
Translation scheme:
P → D ; E
D → D ; D
D → id : T       { addtype(id.entry, T.type) }
T → char         { T.type := char }
T → integer      { T.type := integer }
T → ↑ T1         { T.type := pointer(T1.type) }
T → array [ num ] of T1   { T.type := array(1..num.val, T1.type) }
In the following rules, the attribute type for E gives the type expression assigned to the
expression generated by E.
Here, constants represented by the tokens literal and num have type char and integer;
lookup ( e ) is used to fetch the type saved in the symbol table entry pointed to by e.
1. E → literal { E.type := char }
   E → num { E.type := integer }
2. E → id { E.type := lookup(id.entry) }
3. E → E1 mod E2 { E.type := if E1.type = integer and E2.type = integer then integer
else type_error }
The expression formed by applying the mod operator to two subexpressions of type
integer has type integer; otherwise, its type is type_error.
4. E → E1 [ E2 ] { E.type := if E2.type = integer and E1.type = array(s,t) then t
else type_error }
In an array reference E1 [ E2 ], the index expression E2 must have type integer. The
result is the element type t obtained from the type array(s,t) of E1.
5. E → E1 ↑ { E.type := if E1.type = pointer(t) then t else type_error }
The postfix operator ↑ yields the object pointed to by its operand. The type of E1 ↑
is the type t of the object pointed to by the pointer E1.
Type checking of statements:
Statements do not have values; hence the basic type void can be assigned to them. If an
error is detected within a statement, then type_error is assigned.
1. Assignment statement:
S → id := E { S.type := if id.type = E.type then void else type_error }
2. Conditional statement:
S → if E then S1 { S.type := if E.type = boolean then S1.type else type_error }
3. While statement:
S → while E do S1 { S.type := if E.type = boolean then S1.type else type_error }
4. Sequence of statements:
S → S1 ; S2 { S.type := if S1.type = void and S2.type = void then void
else type_error }
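Two of these rules, written out as C functions, might look like the following sketch;
the Type enum and the function names are our assumptions.

#include <stdio.h>

typedef enum { T_CHAR, T_INTEGER, T_VOID, T_ERROR } Type;

/* E.type := if E1.type = integer and E2.type = integer
             then integer else type_error */
Type check_mod(Type e1, Type e2) {
    return (e1 == T_INTEGER && e2 == T_INTEGER) ? T_INTEGER : T_ERROR;
}

/* S.type := if S1.type = void and S2.type = void then void else type_error */
Type check_seq(Type s1, Type s2) {
    return (s1 == T_VOID && s2 == T_VOID) ? T_VOID : T_ERROR;
}

int main(void) {
    printf("%d\n", check_mod(T_INTEGER, T_INTEGER));  /* 1 = integer */
    printf("%d\n", check_mod(T_INTEGER, T_CHAR));     /* 3 = type_error */
    return 0;
}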
UNIT –IV
Run-Time Environment
A program as a source code is merely a collection of text (code, statements etc.) and to
make it alive, it requires actions to be performed on the target machine. A program
needs memory resources to execute instructions. A program contains names for
procedures, identifiers etc., that require mapping with the actual memory location at
runtime.
Runtime support system is a package, mostly generated with the executable program
itself and facilitates the process communication between the process and the runtime
environment. It takes care of memory allocation and de-allocation while the program is
being executed.
Activation Trees
The execution of a procedure is called its activation. An activation record contains all
the necessary information required to call a procedure. An activation record may
contain the following units (depending upon the source language used).
Whenever a procedure is executed, its activation record is stored on the stack, also
known as control stack. When a procedure calls another procedure, the execution of the
caller is suspended until the called procedure finishes execution. At this time, the
activation record of the called procedure is stored on the stack.
We assume that the program control flows in a sequential manner and when a procedure
is called, its control is transferred to the called procedure. When a called procedure is
executed, it returns the control back to the caller. This type of control flow makes it
easier to represent a series of activations in the form of a tree, known as the activation
tree.
...
printf("Enter Your Name: ");
scanf("%s", username);
show_data(username);
printf("Press any key to continue...");
...
int show_data(char *user)
{
    printf("Your name is %s", user);
    return 0;
}
...
...
Now we understand that procedures are executed in depth-first manner, thus stack
allocation is the best suitable form of storage for procedure activations.
Storage Allocation
Runtime environment manages runtime memory requirements for the following entities:
· Code : It is known as the text part of a program that does not change at runtime.
Its memory requirements are known at the compile time.
· Procedures : Their text part is static but they are called in a random manner.
That is why, stack storage is used to manage procedure calls and activations.
· Variables : Variables are known at the runtime only, unless they are global or
constant. Heap memory allocation scheme is used for managing allocation and
de-allocation of memory for variables in runtime.
Static Allocation
In this allocation scheme, the compilation data is bound to a fixed location in the
memory and it does not change when the program executes. As the memory
requirement and storage locations are known in advance, runtime support package for
memory allocation and de-allocation is not required.
Stack Allocation
Procedure calls and their activations are managed by means of stack memory allocation.
It works in last-in-first-out (LIFO) method and this allocation strategy is very useful for
recursive procedure calls.
Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap
allocation is used to dynamically allocate memory to the variables and claim it back
when the variables are no more required.
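As a small illustrative C sketch (the function name make_counter is invented for this example), heap storage outlives the activation that created it and is reclaimed explicitly, in any order:

#include <stdlib.h>

/* Heap-allocated storage survives the return of the procedure that
   created it; it is reclaimed later with free(). */
int *make_counter(void)
{
    int *p = malloc(sizeof *p);   /* allocated from the heap at run time */
    if (p != NULL)
        *p = 0;
    return p;                     /* still valid after this activation ends */
}
/* ... later, when the variable is no longer required: free(p); */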
Unlike the statically allocated memory area, both stack and heap memory can grow and
shrink dynamically and unpredictably. Therefore, they cannot be given a fixed
amount of memory in advance.
As shown in the image above, the text part of the code is allocated a fixed amount of
memory. Stack and heap memory are arranged at the extremes of total memory
allocated to the program. Both shrink and grow against each other.
Parameter Passing
r-value
The value of an expression is called its r-value. The value contained in a single variable
also becomes an r-value if it appears on the right-hand side of the assignment operator.
r-values can always be assigned to some other variable.
l-value
The memory location (address) where an expression is stored is known as the l-value
of that expression. It always appears on the left-hand side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, 12, and variables like
day, week, month and year, all have r-values. Only variables have l-values as they also
represent the memory location assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory location.
Formal Parameters
Variables that take the information passed by the caller procedure are called formal
parameters. These variables are declared in the definition of the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called procedure are called
actual parameters. These variables are specified in the function call as arguments.
Example:
void fun_two(int formal_parameter)
{
    printf("%d", formal_parameter);
}

void fun_one()
{
    int actual_parameter = 10;
    fun_two(actual_parameter);   /* the actual parameter is passed in the call */
}
Formal parameters hold the information of the actual parameter, depending upon the
parameter passing technique used. It may be a value or an address.
Pass by Value
In pass by value mechanism, the calling procedure passes the r-value of actual
parameters and the compiler puts that into the called procedure’s activation record.
Formal parameters then hold the values passed by the calling procedure. If the values
held by the formal parameters are changed, it should have no impact on the actual
parameters.
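A minimal C sketch of pass by value (the names here are illustrative): the callee receives a copy, so the caller's variable is untouched.

#include <stdio.h>

void increment(int n)      /* n receives a copy of the actual parameter */
{
    n = n + 1;             /* modifies only the local copy */
}

int main(void)
{
    int a = 5;
    increment(a);
    printf("%d\n", a);     /* prints 5: the actual parameter is unchanged */
    return 0;
}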
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is copied to the
activation record of the called procedure. This way, the called procedure now has the
address (memory location) of the actual parameter and the formal parameter refers to
the same memory location. Therefore, if the value pointed by the formal parameter is
changed, the impact should be seen on the actual parameter as they should also point to
the same value.
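In C, pass by reference can be simulated by passing the l-value (address) explicitly; a minimal sketch with invented names:

#include <stdio.h>

void increment(int *n)     /* n holds the address (l-value) of the actual */
{
    *n = *n + 1;           /* updates the caller's variable through it */
}

int main(void)
{
    int a = 5;
    increment(&a);
    printf("%d\n", a);     /* prints 6: the change is visible to the caller */
    return 0;
}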
Pass by Copy-restore
This parameter-passing mechanism works like pass-by-reference, except that changes
to the actual parameters take effect only when the called procedure ends. Upon the
function call, the values of the actual parameters are copied into the activation record
of the called procedure. Manipulating the formal parameters has no immediate effect
on the actual parameters (only the copies are manipulated), but when the called
procedure ends, the final values of the formal parameters are copied back to the
l-values of the actual parameters.
Example:
int y;
calling_procedure()
{
    y = 10;
    copy_restore(y);   /* the value of y is passed; it is copied back on return */
    printf("%d", y);   /* prints 99 */
}
copy_restore(int x)
{
    x = 99;   /* y still has value 10 (x is a local copy) */
    y = 0;    /* y is now 0 */
}
When this function ends, the value of the formal parameter x is copied back to the
actual parameter y. Even though the value of y was changed inside the procedure, that
change is overwritten: the final value of x is copied to the l-value of y, so the call ends
up behaving like call by reference.
Pass by Name
Languages like Algol provide a new kind of parameter passing mechanism that works
like preprocessor in C language. In pass by name mechanism, the name of the procedure
being called is replaced by its actual body. Pass-by-name textually substitutes the
argument expressions in a procedure call for the corresponding parameters in the body
of the procedure so that it can now work on actual parameters, much like pass-by-
reference.
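A standard textbook illustration (a sketch, not from these pages): suppose swap(x, y) has the body temp := x; x := y; y := temp. Under pass by name, the call swap(i, a[i]) expands textually to temp := i; i := a[i]; a[i] := temp. Because a[i] is re-evaluated after i has already changed, the wrong element is updated; this is the classic reason swap cannot be written correctly under pass by name.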
When a procedure name appears within an executable statement, the procedure is
said to be called at that point.
Activation Tree
quicksort(int m, int n)
{
    if (n > m) {
        int i = partition(m, n);
        quicksort(m, i - 1);
        quicksort(i + 1, n);
    }
}
First, the main function is the root of the activation tree; main calls readarray and
quicksort. Quicksort in turn calls partition and quicksort again. The flow of control in a
program corresponds to a depth-first traversal of the activation tree, which starts at the
root.
Control Stack
A control stack (or runtime stack) is used to keep track of live procedure activations,
i.e. the procedures whose execution has not yet been completed.
A procedure name is pushed on to the stack when it is called (activation begins) and it
is popped when it returns (activation ends).
Information needed by a single execution of a procedure is managed using an
activation record.
When a procedure is called, an activation record is pushed onto the stack, and as soon
as control returns to the caller, the activation record is popped.
The contents of the control stack are related to paths to the root of the activation tree.
When node n is at the top of the control stack, the stack contains the nodes along the
path from n to the root.
Consider the above activation tree: when quicksort(4,4) is executing, the contents of
the control stack are main(), quicksort(1,10), quicksort(1,4), quicksort(4,4).
The Scope of a Declaration
A declaration is a syntactic construct that associates information with a name.
Declarations may be explicit, such as
var i : integer;
Binding of Names
Even if each name is declared once in a program, the same name may denote different
data objects at run time. A "data object" corresponds to a storage location that holds
values.
The term environment refers to a function that maps a name to a storage location.
The term state refers to a function that maps a storage location to the value held there.
STORAGE ORGANIZATION
The executing target program runs in its own logical address space, in which each
program value has a location.
The management and organization of this logical address space is shared between the
compiler, operating system and target machine. The operating system maps the
logical addresses into physical addresses, which are usually spread throughout
memory.
Typical subdivision of run-time memory:
The run-time stack is a LIFO structure used to hold information about each
instantiation of a procedure.
Procedure calls and returns are usually managed by a run-time stack called the
control stack.
Each live activation has an activation record on the control stack, with the root of the
activation tree at the bottom; the latest activation has its record at the top of the stack.
The contents of the activation record vary with the language being implemented. The
diagram below shows the contents of an activation record.
The purpose of the fields of an activation record is as follows, starting from the field
for temporaries:
1. Temporary values, such as those arising in the evaluation of
expressions, are stored in the field for temporaries.
2. The field for local data holds data that is local to an execution of a
procedure.
3. The field for saved machine status holds information about the
state of the machine just before the procedure is called. This
information includes the values of the program counter and
machine registers that have to be restored when control returns
from the procedure.
4. The optional access link is used to refer to nonlocal data held in
other activation records.
5. The optional control link points to the activation record of the caller.
General Activation Record (fields, from top to bottom):
    returned value
    actual parameters
    optional control link
    optional access link
    saved machine status
    local data
    temporaries
Heap allocation - allocates and deallocates storage as needed at run time from
a data area known as heap.
Static Allocation
In static allocation, names are bound to storage as the program is compiled, so there is
no need for a run-time support package.
Since the bindings do not change at run time, every time a procedure is activated its
names are bound to the same storage locations.
Therefore, the values of local names are retained across activations of a procedure.
That is, when control returns to a procedure, the values of the locals are the same as
they were when control last left.
Stack Allocation
All compilers for languages that use procedures, functions or methods as units of
user-defined actions manage at least part of their run-time memory as a stack, the
run-time stack.
Each time a procedure is called, space for its local variables is pushed onto the stack,
and when the procedure terminates, that space is popped off the stack.
Calling Sequences
Procedure calls are implemented by what is known as a calling sequence, which
consists of code that allocates an activation record on the stack and enters
information into its fields.
A return sequence is similar code that restores the state of the machine so the
calling procedure can continue its execution after the call.
The code in a calling sequence is often divided between the calling procedure (the
caller) and the procedure it calls (the callee).
When designing calling sequences and the layout of activation records, the
following principle, among others, is helpful: place fixed-length fields in the middle
of the activation record. Fixed-length data can then be accessed by fixed offsets,
known to the intermediate code generator, relative to the top-of-stack pointer.
The calling sequence and its division between caller and callee are as follows:
1. The caller evaluates the actual parameters.
2. The caller stores a return address and the old value of top_sp
into the callee’s activation record. The caller then increments
the top_sp to the respective positions.
3. The callee saves the register values and other status information.
4. The callee initializes its local data and begins execution.
Run-time memory management must also handle data whose size cannot be
determined at compile time, but which are local to a procedure and thus may be
allocated on the stack.
A common strategy for allocating variable-length arrays is shown in the following
figure.
Heap Allocation
Heap allocation parcels out pieces of contiguous storage, as needed for activation
records or other objects.
Pieces may be deallocated in any order, so over time the heap will consist of
alternating areas that are free and in use.
Suppose the record for an activation of a procedure r is retained when the activation
ends. Then the record for the new activation q(1, 9) cannot follow the record for s
physically.
If the retained activation record for r is deallocated, there will be free space in the
heap between the activation records for s and q.
INTERMEDIATE CODE GENERATION
A source program can be translated directly into the target language, but some
benefits of using an intermediate form are:
Ø Retargeting is facilitated: a compiler for a different machine can be
created by attaching a Back-end (which generate Target Code) for the
new machine to an existing Front-end (which generate Intermediate
Code).
Ø A machine Independent Code-Optimizer can be applied to the
Intermediate Representation.
INTERMEDIATE LANGUAGES
The most commonly used intermediate representations are:
Ø Syntax Tree
Ø DAG (Directed Acyclic Graph)
Ø Postfix Notation
Ø Three-Address Code
GRAPHICAL REPRESENTATION
Includes both
Ø Syntax Tree
Ø DAG (Directed Acyclic Graph)
Syntax Tree Or Abstract Syntax Tree (AST)
A graphical intermediate representation.
A syntax tree depicts the hierarchical structure of a source program.
An abstract syntax tree (AST) is a condensed form of parse tree useful for representing
language constructs.
EXAMPLE
Parse tree and syntax tree for 3 * 5 + 4 as follows.
Parse tree (using E → E + T | T, T → T * F | F, F → digit):

              E
            / | \
           E  +  T
           |     |
           T     F
         / | \   |
        T  *  F  digit(4)
        |     |
        F   digit(5)
        |
      digit(3)

Syntax tree:

        +
       / \
      *   4
     / \
    3   5
Parse Tree VS Syntax Tree
Each node in a syntax tree can be implemented as a record with several fields.
In the node of an operator, one field contains the operator and the remaining fields
contain pointers to the nodes for the operands.
When used for translation, the nodes in a syntax tree may contain additional fields to
hold the values of attributes attached to the node.
The following functions are used to create a syntax tree:
1. mknode(op, left, right): creates an operator node with label op and two fields
containing pointers to left and right.
2. mkleaf(id, entry): creates an identifier node with label id and a field containing
entry, a pointer to the symbol table entry for the identifier.
3. mkleaf(num, val): creates a number node with label num and a field containing val,
the value of the number.
Such functions return a pointer to a newly created node.
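These functions might be realized in C roughly as follows; the Node layout and the label encodings are assumptions made for this sketch, not the text's definition:

#include <stdlib.h>

typedef struct node {
    int op;                    /* operator, or a code marking an id / num leaf */
    struct node *left, *right; /* children; NULL for leaves */
    void *entry;               /* symbol-table pointer for id leaves */
    int val;                   /* numeric value for num leaves */
} Node;

Node *mknode(int op, Node *left, Node *right)
{
    Node *p = calloc(1, sizeof *p);
    p->op = op; p->left = left; p->right = right;
    return p;                  /* pointer to the newly created node */
}

Node *mkleaf_id(void *entry)   /* mkleaf(id, entry) */
{
    Node *p = calloc(1, sizeof *p);
    p->op = 'I'; p->entry = entry;
    return p;
}

Node *mkleaf_num(int val)      /* mkleaf(num, val) */
{
    Node *p = calloc(1, sizeof *p);
    p->op = 'N'; p->val = val;
    return p;
}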
EXAMPLE
a – 4 + c
The tree is constructed bottom-up:
P1 := mkleaf(id, entry-a)
P2 := mkleaf(num, 4)
P3 := mknode('-', P1, P2)
P4 := mkleaf(id, entry-c)
P5 := mknode('+', P3, P4)
Syntax trees for assignment statements are produced by a syntax-directed definition.
The nonterminal S generates an assignment statement.
The two binary operators + and * are examples of the full operator set in a typical
language. Operator associativities and precedences are the usual ones, even though
they have not been put into the grammar. This definition constructs the tree from the
input a := b * -c + b * -c.
The token id has an attribute place that points to the symbol-table entry for the
identifier.
A symbol-table entry can be found from an attribute id.name, representing the lexeme
associated with that occurrence of id.
If the lexical analyser holds all lexemes in a single array of characters, then attribute
name might be the index of the first character of the lexeme.
Two representations of the syntax tree are as follows.
In (a), each node is represented as a record with a field for its operator and additional
fields for pointers to its children.
In (b), nodes are allocated from an array of records, and the index or position of a
node serves as the pointer to the node.
All the nodes in the syntax tree can be visited by following pointers, starting from the
root at position 10.
Directed Acyclic Graph (DAG)
A graphical intermediate representation.
A DAG also gives the hierarchical structure of the source program, but in a more
compact way, because common subexpressions are identified.
EXAMPLE
a = b * -c + b * -c
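A sketch of the DAG (the original figure did not survive extraction): the common subexpression b * (uminus c) is represented once, and both operands of + point to that single shared node.

assign
 |-- id a
 |-- +
      |-- *  (shared: both operands of + point to this same node)
           |-- id b
           |-- uminus
                 |-- id c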
Postfix Notation
A linearized representation of the syntax tree.
In postfix notation, each operator appears immediately after its last operand.
Operators can be evaluated in the order in which they appear in the string.
EXAMPLE
Source String : a := b * -c + b * -c
Postfix String: a b c uminus * b c uminus * + assign
Postfix Rules
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2, then the postfix notation for E is
E1' E2' op, where E1' and E2' are the postfix notations for E1 and E2,
respectively.
3. If E is an expression of the form (E), then the postfix notation for E is
the same as the postfix notation for E.
4. For the unary operation -E, the postfix notation is E' uminus, where E' is the
postfix notation for E.
The postfix notation of an infix expression can be obtained using a stack.
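A compact C sketch of this stack method (operands are restricted to single letters or digits, and the function name and simple precedence table are assumptions of this example):

#include <stdio.h>
#include <ctype.h>

/* Precedence: higher binds tighter; '(' gets 0 so it stays on the stack. */
static int prec(char op)
{
    return (op == '+' || op == '-') ? 1
         : (op == '*' || op == '/') ? 2 : 0;
}

void to_postfix(const char *infix, char *postfix)
{
    char stack[64]; int top = -1, k = 0;
    for (const char *p = infix; *p; p++) {
        if (isalnum((unsigned char)*p))
            postfix[k++] = *p;                 /* operands go straight to output */
        else if (*p == '(')
            stack[++top] = *p;
        else if (*p == ')') {
            while (top >= 0 && stack[top] != '(')
                postfix[k++] = stack[top--];   /* pop until matching '(' */
            top--;                             /* discard '(' */
        } else {                               /* binary operator */
            while (top >= 0 && prec(stack[top]) >= prec(*p))
                postfix[k++] = stack[top--];   /* pop higher/equal precedence */
            stack[++top] = *p;
        }
    }
    while (top >= 0) postfix[k++] = stack[top--];
    postfix[k] = '\0';
}

int main(void)
{
    char out[64];
    to_postfix("a*(b+c)-d", out);
    printf("%s\n", out);   /* prints abc+*d- */
    return 0;
}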
THREE-ADDRESS CODE
In a three-address statement, at most 3 addresses are used to represent any statement.
The reason for the term "three-address code" is that each statement contains at most 3
addresses: two for the operands and one for the result.
General Form Of 3 Address Code
a = b op c
where a, b, c are the operands, which can be names, constants or compiler-generated
temporaries, and op represents an operator, such as a fixed- or floating-point
arithmetic operator or a logical operator on Boolean-valued data. Thus a source
language expression like x + y * z might be translated into the sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names.
1. Assignment statements
x := y op z, where op is a binary arithmetic or logical operation.
2. Assignment instructions
x : = op y, where op is a unary operation . Essential unary operations include
unary minus, logical negation, shift operators, and conversion operators that
for example, convert a fixed-point number to a floating-point number.
3. Copy statements
x : = y where the value of y is assigned to x.
4. Unconditional jump
goto L The three-address statement with label L is the next to be executed
5. Conditional jump
if x relop y goto L This instruction applies a relational operator (<, =, >=,
etc.) to x and y, and executes the statement with label L next if x stands in
relation relop to y. If not, the three-address statement following
if x relop y goto L is executed next, as in the usual sequence.
6. Procedure calls and returns
These are implemented using the statements
param x1
param x2
……….
param xn
call p, n
generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n
indicating the number of actual parameters in "call p, n" is not redundant
because calls can be nested.
7. Indexed Assignments
Indexed assignments of the form x = y[i] or x[i] = y
When three-address code is generated, temporary names are made up for the interior
nodes of a syntax tree. For example, id := E consists of code to evaluate E into some
temporary t, followed by the assignment id.place := t.
Given input a := b * - c + b * - c, it produces the three-address code given
above (page no: ). The synthesized attribute S.code represents the three-address
code for the assignment S. The nonterminal E has two attributes:
1. E.place, the name that will hold the value of E, and
2. E.code, the sequence of three-address statements evaluating E.
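For reference, a sketch of the standard syntax-directed definition that produces such code, consistent with the conventions just described (the full figure from the original pages is not reproduced here):

S → id := E   { S.code := E.code || gen(id.place ':=' E.place) }
E → E1 + E2   { E.place := newtemp;
                E.code := E1.code || E2.code || gen(E.place ':=' E1.place '+' E2.place) }
E → E1 * E2   { E.place := newtemp;
                E.code := E1.code || E2.code || gen(E.place ':=' E1.place '*' E2.place) }
E → - E1      { E.place := newtemp;
                E.code := E1.code || gen(E.place ':=' 'uminus' E1.place) }
E → ( E1 )    { E.place := E1.place; E.code := E1.code }
E → id        { E.place := id.place; E.code := '' }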
The function newtemp returns a sequence of distinct names t1, t2, ... in response to
successive calls. The notation gen(x ':=' y '+' z) is used to represent the three-address
statement x := y + z.
Expressions appearing in place of variables like x, y and z are evaluated when passed
to gen, and quoted operators or operands, like '+', are taken literally.
Flow-of-control statements can be added to the language of assignments. The code
for S → while E do S1 is generated using new attributes S.begin and S.after to mark
the first statement in the code for E and the statement following the code for S,
respectively.
QUADRUPLES
A quadruple is a record structure with four fields: op, arg1, arg2 and result.
The op field contains an internal code for the operator. The three-address statement
x := y op z is represented by placing y in arg1, z in arg2 and x in result.
The contents of arg1, arg2, and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be
entered into the symbol table as they are created.
EXAMPLE 1
Translate the following expression into quadruples, triples and indirect triples:
a + b * c / e ^ f + b * a
First, construct the three-address code for the expression:
t1 = e ^ f
t2 = b * c
t3 = t2 / t1
t4 = b * a
t5 = a + t3
t6 = t5 + t4
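From this three-address code, the quadruple table follows directly (rebuilt here, since the original tables did not survive extraction; the layout follows the op-arg1-arg2-result convention described above):

        op      arg1    arg2    result
(1)     ^       e       f       t1
(2)     *       b       c       t2
(3)     /       t2      t1      t3
(4)     *       b       a       t4
(5)     +       a       t3      t5
(6)     +       t5      t4      t6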
EXAMPLE 2
A ternary operation like x[i] := y requires two entries in the triple structure, while
x := y[i] is naturally represented as two operations:

x[i] := y                       x := y[i]
(0)  []=     x    i             (0)  =[]     y    i
(1)  assign  (0)  y             (1)  assign  x    (0)
INDIRECT TRIPLES
Indirect triples consist of a listing of pointers to triples, rather than a listing of the
triples themselves: a separate statement array lists the triples in the desired order of
execution.
Comparison
When we ultimately produce the target code, each temporary and programmer-defined
name will be assigned a runtime memory location.
This location will be entered into the symbol-table entry of that datum.
Using the quadruple notation, a three-address statement containing a temporary can
immediately access the location for that temporary via the symbol table.
But this is not possible with the triples notation.
With quadruple notation, statements can often be moved around, which makes
optimization easier. This is achieved because, with quadruples, the symbol table
interposes a high degree of indirection between the computation of a value and its
use: if we move a statement computing x, the statements using x require no change.
But with triples, moving a statement that defines a temporary value requires us to
change all references to that statement in the arg1 and arg2 arrays. This makes
triples difficult to use in an optimizing compiler.
With indirect triples there is no such problem: a statement can be moved simply by
reordering the statement list.
Space Utilization
Quadruples and indirect triples require the same amount of storage space in the
normal case.
But if the same temporary value is used more than once, indirect triples can save some
space, because two or more entries in the statement array can point to the same line of
the op-arg1-arg2 structure.
Triples require less storage space than the other two.
PROBLEM 1
Translate the following expression into quadruples, triples and indirect triples:
a = b * -c + b * -c
TAC
t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
QUADRUPLES
        op      arg1    arg2    result
(1)     uminus  c               t1
(2)     *       b       t1      t2
(3)     uminus  c               t3
(4)     *       b       t3      t4
(5)     +       t2      t4      t5
(6)     =       t5              a
TRIPLES
        op      arg1    arg2
(0)     uminus  c
(1)     *       b       (0)
(2)     uminus  c
(3)     *       b       (2)
(4)     +       (1)     (3)
(5)     assign  a       (4)
INDIRECT TRIPLES
statement list:                 triples:
(0)     (14)                    (14)    uminus  c
(1)     (15)                    (15)    *       b       (14)
(2)     (16)                    (16)    uminus  c
(3)     (17)                    (17)    *       b       (16)
(4)     (18)                    (18)    +       (15)    (17)
(5)     (19)                    (19)    assign  a       (18)
ASSIGNMENT STATEMENTS
BOOLEAN EXPRESSIONS
Boolean expressions have two primary purposes.
Ø They are used to compute logical values.
Ø But more often they are used as conditional expressions in
statements that alter the flow of control, such as if-then-else,
or while-do statements.
Boolean expressions are composed of the Boolean operators (and, or, and
not) applied to elements that are Boolean variables or relational
expressions.
Relational expressions are of the form E1 relop E2, where E1 and E2 are
arithmetic expressions and relop can be <, <=, =, !=, > or >=.
Here we consider Boolean expressions generated by the following grammar:
E → E or E | E and E | not E | ( E ) | id relop id | true | false
For example, the expression a or b and not c translates into the three-address
sequence
t1 := not c
t2 := b and t1
t3 := a or t2
The function emit( ) outputs a three-address statement to the output file, nextstat( )
gives the index of the next three-address statement in the output sequence, and emit
increments nextstat after producing each three-address statement.
Under the numerical representation, a relational expression such as a < b is translated
into the standard sequence
100: if a < b goto 103
101: t := 0
102: goto 104
103: t := 1
104: ...
S → if E then S1
  | if E then S1 else S2
  | while E do S1
The inherited label attributes for these productions are assigned as follows:
S → if E then S1:
    E.true := newlabel;
    E.false := S.next;
    S1.next := S.next;
S → if E then S1 else S2:
    E.true := newlabel;
    E.false := newlabel;
    S1.next := S.next;
    S2.next := S.next;
S → while E do S1:
    S.begin := newlabel;
    E.true := newlabel;
    E.false := S.next;
    S1.next := S.begin;
UNIT - V
CODE GENERATION
Issues in the design of a code generator
The code generator converts the intermediate representation of source code into a form
that can be readily executed by the machine. A code generator is expected to generate
correct code. The design of a code generator should be such that it can be easily
implemented, tested and maintained.
Target Program
The target program is the output of the code generator. The output may be absolute
machine language, relocatable machine language, or assembly language.
1. Absolute machine language as an output has advantages that it can be
placed in a fixed memory location and can be immediately executed.
2. Relocatable machine language as an output allows subprograms and
subroutines to be compiled separately. Relocatable object modules can be
linked together and loaded by linking loader.
3. Assembly language as an output makes the code generation easier. We can
generate symbolic instructions and use macro-facilities of assembler in
generating code.
Memory Management
Mapping the names in the source program to addresses of data objects is done
cooperatively by the front end and the code generator. A name in a three-address
statement refers to the symbol-table entry for the name. From the symbol-table entry,
a relative address can be determined for the name.
Instruction selection
Selecting the best instructions will improve the efficiency of the program. The
instruction set should be complete and uniform. Instruction speeds and machine
idioms also play a major role when efficiency is considered. But if we do not care
about the efficiency of the target program, instruction selection is straightforward.
For example, the three-address statements
P := Q + R
S := P + T
would be translated into the following code sequence:
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
Here the fourth statement is redundant, as it reloads the value of P that the previous
statement has just stored. It leads to an inefficient code sequence. A given
intermediate representation can be translated into many code sequences, with
significant cost differences between the different implementations. Prior knowledge
of instruction cost is needed in order to design good sequences, but accurate cost
information is difficult to predict.
· Register allocation issues –
Use of registers makes computations faster than computations in memory, so
efficient utilization of registers is important. The use of registers is subdivided into
two subproblems:
1. During register allocation, select the set of variables that will reside in registers
at each point in the program.
2. During register assignment, pick the specific register in which each such variable
will reside.
Certain machines require register pairs, consisting of an even and the next
odd-numbered register, for some operands and results. For example, a multiplicative
instruction of the form
M a, b
involves a register pair in which a, the multiplicand, is the even register and b, the
multiplier, is the odd register of the even/odd register pair.
· Evaluation order –
The code generator decides the order in which instructions will be executed. The
order of computations affects the efficiency of the target code. Among the many
possible computational orders, some require fewer registers to hold the intermediate
results. However, picking the best order in the general case is a difficult,
NP-complete problem.
Approaches to code generation issues: Code generator must always generate the
correct code. It is essential because of the number of special cases that a code
generator might face. Some of the design goals of code generator are:
§ Correct
§ Easily maintainable
§ Testable
v Target Machine
The various addressing modes associated with the target machine are discussed
below:
· In an instruction, a variable name x means there is a location in memory reserved for x.
· An indexed address of the form a(r), where 'a' is a variable and r is a register, can
also denote a location: the memory location denoted by a(r) is computed by taking the
l-value of 'a' and adding to it the value in register r.
· An integer indexed by a register can be a memory location. For example, LD R1,
100(R2) has the effect of setting R1 = contents(100 + contents(R2)).
· There are two indirect addressing modes: *r and *100(r). *r denotes the memory
location whose address is contents(r), and *100(r) denotes the memory location whose
address is contents(100 + contents(R2)) obtained by adding 100 to the contents of r.
· The immediate constant addressing mode is the last addressing mode; it is denoted
by the prefix #.
Program and Instruction Costs
The cost refers to compiling and running a program. The program's cost can be
determined by the length of compilation time and by the size, execution time, and
power consumption of the target program. Finding the actual cost of a program is a
tough task; therefore, code generation uses heuristic techniques to produce a good
target program. Each target-machine instruction has an associated cost. The
instruction cost is one plus the cost associated with the addressing modes of the
operands.
Example
LD R0, R1: This instruction copies the contents of register R1 into register R0. The
cost of this instruction is one because no additional memory word is required.
LD R0, M: This instruction loads the contents of memory location M into register R0.
Its cost is two, because the address of memory location M is found in the word
following the instruction.
LD R1, *100(R2): This instruction loads the value given by contents(contents(100 +
contents(R2))) into register R1. Its cost is two, because the constant 100 is stored in
the word following the instruction.
Optimized code :
Example 1 :
L1: a = b + c * d
optimization :
t0 = c * d
a = b + t0
Example 2 :
L2: e = f - g / d
optimization :
t0 = g / d
e = f - t0
Register Allocation :
Register allocation is the process of assigning program variables to registers and
reducing the number of swaps in and out of the registers. Movement of variables
across memory is time consuming, and this is the main reason why registers are used:
they are located within the CPU and are the fastest accessible storage locations.
Example 1:
R1<--- a
R2<--- b
R3<--- c
R4<--- d
MOV R3, c
MOV R4, d
MUL R3, R4
MOV R2, b
ADD R2, R3
MOV R1, R2
MOV a, R1
Example 2:
R1<--- e
R2<--- f
R3<--- g
R4<--- h
MOV R3, g
MOV R4, h
DIV R3, R4
MOV R2, f
SUB R2, R3
MOV R1, R2
MOV e, R1
Advantages :
Disadvantages :
During the execution of a program, the same name in the source can denote different
data objects in the computer. The allocation and deallocation of data objects is
managed by the run-time support package. Terminologies:
• Storage space → value: the current value of a storage space is called its state.
• If a procedure is recursive, several of its activations may exist at the same time.
• Lifetime: the time between the first and last steps in an activation of a procedure.
General run-time storage layout (from lower to higher memory addresses):
    code        : the program's instructions
    static data : storage space that won't change: global data, constants, ...
    stack       : activation records: local data, parameters, control info, ...
    heap        : dynamic memory allocated by the program
Activation record
ü returned value
ü actual parameters
ü optional control link
ü optional access link
ü saved machine status
ü local data
ü temporaries
Activation record:
• Parameters: the actual parameters passed by the caller.
• Links:
  . Access (or static) link: a pointer to nonlocal data held in other activation records.
  . Control (or dynamic) link: a pointer to the activation record of the caller.
Static storage allocation
There are two different approaches to run-time storage allocation:
• Static allocation.
• Dynamic allocation.
In static allocation, every time a procedure is called, its names refer to the same
preassigned locations.
• Disadvantages:
ü No recursion.
ü Wastes space when the procedure is inactive.
ü No dynamic allocation.
• Advantage:
ü Fast access, since no run-time allocation bookkeeping is needed.
On procedure calls, the calling procedure:
• Evaluates the arguments and copies them into the parameter space in the activation
record of the called procedure. (Convention: what is passed to a procedure is called an
argument on the calling side and a parameter on the called side.)
• May save some registers in its own activation record.
• Jump and link: jumps to the first instruction of the called procedure and puts the
address of the next instruction (the return address) into register RA (the return
address register).
The called procedure then:
• Copies the return address from RA into its activation record's return-address field.
• May save some registers.
• May initialize local data.
On procedure returns, the called procedure:
• Restores the values of saved registers.
• Jumps to the address in the return-address field.
• The calling procedure: may restore some registers; if the called procedure was
actually a function, puts the return value in an appropriate place.
In this section, we are going to learn how to work with basic blocks and flow graphs
in compiler design.
Basic Block
A basic block is a sequence of statements with no branches in or out except at the
entry and the exit: the flow of control enters at the beginning and leaves at the end
without halting or branching in between. The instructions of a basic block execute in
sequence.
Here, the first task is to partition a sequence of three-address code into basic blocks.
A new basic block always begins at its first instruction, and instructions are added
until a jump or a label is met. In the absence of jumps and labels, control flows
sequentially from one instruction to the next.
The algorithm for the construction of basic blocks is given below:
Input: The input for the basic blocks will be a sequence of three-address code.
Output: The output is a list of basic blocks with each three address statements in
exactly one block.
METHOD: First, we identify the leaders in the intermediate code. The rules for
finding leaders are given below (a small code sketch follows this list):
1. The first three-address instruction is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a
leader.
Each leader's basic block consists of all the instructions from the leader itself up to,
but not including, the next leader.
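A small C sketch of leader identification under the rules above; the encoding of instructions via a jump_target array is an assumption of this example:

#include <stdio.h>
#include <stdbool.h>

#define N 10  /* number of three-address instructions (hypothetical) */

/* jump_target[i] >= 0 means instruction i ends with a jump to that
   index; -1 means the instruction does not jump. */
int jump_target[N] = { -1, -1, 5, -1, -1, -1, 2, -1, -1, -1 };

int main(void)
{
    bool leader[N] = { false };
    leader[0] = true;                      /* rule 1: first instruction */
    for (int i = 0; i < N; i++) {
        if (jump_target[i] >= 0) {
            leader[jump_target[i]] = true; /* rule 2: target of a jump */
            if (i + 1 < N)
                leader[i + 1] = true;      /* rule 3: instruction after a jump */
        }
    }
    /* Each basic block runs from a leader up to, but not including,
       the next leader. */
    for (int i = 0; i < N; i++)
        if (leader[i])
            printf("basic block starts at instruction %d\n", i);
    return 0;
}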
Example:
So there are six basic blocks for the above code, which are given below:
B1 for statement 1
B2 for statement 2
B5 for statement 12
Flow Graph
It is a directed graph. After partitioning an intermediate code into basic blocks, the
flow of control among basic blocks is represented by a flow graph. There is an edge
from block X to block Y if Y's first instruction can immediately follow X's last
instruction: either X ends with a jump to the start of Y, or Y immediately follows X in
program order and X does not end in an unconditional jump.
· Block B1 is the entry point for the flow graph because B1 contains starting
instruction.
· Block B3 has two successors. One is B3 itself, because the first instruction of B3 is
the target of the conditional jump in B3's last instruction. The other is block B4,
reached when the conditional jump at the end of B3 falls through.
· The optimization must be correct, it must not, in any way, change the
meaning of the program.
· The optimization process should not delay the overall compiling process.
When to Optimize?
Why Optimize?
(i) A = 2*(22.0/7.0)*r
(ii) x = 12.4
y = x/2.3
Variable Propagation :
//Before Optimization
c=a*b
x=a
till
d=x*b+4
//After Optimization
c=a*b
x=a
till
d=a*b+4
Hence, after variable propagation, a*b and x*b will be identified as common
sub-expression.
c=a*b
x=a
till
d=a*b+4
//After elimination :
c=a*b
till
d=a*b+4
Code Motion :
//Before optimization
a = 200;
while (a > 0)
{
    b = x + y;
    if (a % b == 0)
        printf("%d", a);
    a = a - 1;   /* loop control, so the loop terminates */
}
//After code motion (b = x + y is loop-invariant)
a = 200;
b = x + y;
while (a > 0)
{
    if (a % b == 0)
        printf("%d", a);
    a = a - 1;
}
• Strength reduction means replacing a high-strength (costly) operator with a
low-strength (cheaper) one.
//Before Reduction
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}
//After Reduction
i = 1;
t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}
Now that we have learned the need for optimization and its two types, let us see
where to apply these optimizations.
Source program
Intermediate Code
Target Code
Phases of Optimization
Global Optimization: transformations applied across basic blocks, taking whole
procedures into account.
Local Optimization: transformations applied within a single basic block.
Function-Preserving Transformations
Common-subexpression elimination
Copy propagation
Dead-code elimination
Constant folding
For example:
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t4 := 4*i
t5 := n
t6 := b[t4] + t5
The above code can be optimized using common-subexpression elimination, since t4
computes the same value as t1:
t1 := 4*i
t2 := a[t1]
t3 := 4*j
t5 := n
t6 := b[t1] + t5
Copy Propagation:
After a copy statement such as x = Pi, later uses of x can be replaced by Pi.
• For example:
x = Pi;
A = x * r * r;
becomes, after copy propagation,
A = Pi * r * r;
and the assignment to x may then become dead code.
Dead-Code Eliminations:
Example:
i = 0;
if (i == 1)
{
    a = b + 5;
}
Here, the 'if' statement is dead code because its condition will never be satisfied.
Constant folding:
Expressions whose operands are all constants can be evaluated at compile time.
For example, the expression 2 * 3.14 can be replaced by 6.28.
Loop Optimizations:
In loops, especially in the inner loops, programs tend to spend the bulk of
their time. The running time of a program may be improved if the number of
instructions in an inner loop is decreased, even if we increase the amount of
code outside that loop.
Code Motion:
A loop-invariant computation, one that yields the same result no matter how many
times the loop is executed, can be moved in front of the loop. For example, the
evaluation of limit - 2 in
while (i <= limit - 2) ...
can be moved before the loop:
t = limit - 2;
while (i <= t) ...
Induction Variables :
Loops are usually processed inside out. For example consider the loop
around B3. Note that the values of j and t4 remain in lock-step; every time the
value of j decreases by 1, that of t4 decreases by 4 because 4*j is assigned to
t4. Such identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible
to get rid of all but one, by the process of induction-variable elimination. For
the inner loop around B3 in Fig.5.3 we cannot get rid of either j or t4
completely; t4 is used in B3 and j in B4.
Reduction In Strength:
Strength reduction replaces an expensive operation by an equivalent cheaper one on
the target machine; for example, x * x is usually cheaper than computing a power, and
a multiplication by a power of two can be implemented as a shift.
There are two type of basic block optimization. These are as follows:
1. Structure-Preserving Transformations
2. Algebraic Transformations
1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = a - d
In the above block, the second and fourth statements compute the same expression,
namely a - d, so the block can be transformed as follows:
1. a : = b + c
2. b : = a - d
3. c : = b + c
4. d : = b
o Dead code can arise when a variable is declared and defined but never used
afterwards; it then serves no purpose and can be removed.
o Suppose the statement x := y + z appears in a block and x is dead, that is, never
subsequently used. Then this statement can be safely removed without changing the
value of the basic block.
o Interchange of statements: two adjacent statements such as
t1 := b + c
t2 := x + y
can be interchanged without affecting the value of the block, provided neither x nor y
is t1 and neither b nor c is t2.
2. Algebraic transformations:
Algebraic identities can be used to expose common subexpressions. For example, the
block
1. a := b + c
2. e := c + d + b
can be transformed into
1. a := b + c
2. t := c + d
3. e := t + b