UNIT I
Introduction to compilers – Analysis of source program – Phase of compiler – Cousins of
compilers – Grouping of phases – Simple one pass compiler: overview – Syntax
definition Lexical analysis: removal of white space and comments – Constants –
Recognizing identifiers and keywords – Lexical analysis – Role of a lexical analyzer –
Input buffering – Specification of tokens – Recognition of tokens.
UNIT II
Symbol tables: Symbol table entries – List data structures for symbol table – Hash
tables – Representation of scope information – Syntax Analysis: Role of parser –
Context free grammar – Writing a grammar – Top down parsing – Simple bottom up
parsing – Shift-reduce parsing.
UNIT III
Syntax directed definition: Construction of syntax trees – Bottom up evaluation of S-
Attributed definition – L-Attributed definitions – Top down translation - Type checking:
Type systems – Specifications of simple type checker.
UNIT IV
Run-time environment: Source language issues – Storage organizations – Storage
allocation strategies - Intermediate code generation: Intermediate languages –
Declarations – Assignment statements.
UNIT V
Code generation: Issues in the design of a code generator – The target machine – Runtime
storage management – Basic blocks and flow graphs – Code optimization: Introduction –
Principal sources of code optimization – Optimization of basic blocks
Text Books:
1. Alfred V. Aho, Ravi Sethi, Jeffrey D. Ullman, Compilers: Principles, Techniques, and Tools.
UNIT – I
Introduction to compilers:
Compiler Design is the structure and set of principles that guide the translation, analysis,
and optimization process of a compiler.
A Compiler is computer software that transforms program source code which is written
in a high-level language into low-level machine code. It essentially translates the code
written in one programming language to another language without changing the logic of
the code.
The Compiler also makes the code output efficient and optimized for execution time and
memory space. The compiling process has basic translation mechanisms and error
detection; it can’t compile code if there is an error. The compiler process runs through
syntax, lexical, and semantic analysis in the front end and generates optimized code in the
back end.
When executing, the compiler first analyzes all the language statements one after the
other syntactically and then, if that is successful, builds the output code, making sure that
statements that refer to other statements are referred to appropriately. Traditionally, the
output code is called Object Code.
Types of Compiler
1. Cross Compiler: This enables the creation of code for a platform other than the one
on which the compiler is running. For instance, it runs on a machine 'A' and produces
code for another machine 'B'.
2. Single Pass Compiler: This directly transforms source code into machine code. For
instance, the Pascal programming language.
3. Two-Pass Compiler: This goes through the code to be translated twice; on the first
pass it checks the syntax of statements and constructs a table of symbols, while on the
second pass it actually translates program statements into machine language.
4. Multi-Pass Compiler: This is a type of compiler that processes the source code or
abstract syntax tree of a program multiple times before translating it to machine
language.
A language processing system built around the compiler typically contains:
1. High-Level Language: Programs written in a high-level language may contain (#) tags,
which are referred to as preprocessor directives; they tell the pre-processor what to do.
Such languages are closer to human language but far from machines.
2. Pre-Processor: This produces input for the compiler and also deals with file
inclusion, augmentation, macro-processing, language extension, etc. It removes all
the #include directives by including the files (file inclusion) and all
the #define directives using macro expansion.
3. Relocatable Machine Code: This can be loaded at any point in time and can be run.
This enables the movement of a program using its unique address identifier.
4. Linker: It links and merges a variety of object files into a single file to make it
executable. The linker searches for defined modules in a program and finds out the
memory location where all modules are stored.
5. Loader: It loads the output from the linker into memory and executes it. It basically
loads executable files into memory and runs them.
Features of a Compiler
Correctness: A major feature of a compiler is its correctness: the accuracy with which it
compiles the given code input into output object code with exactly the same logic. This is
ensured by developing the compiler using rigorous testing techniques (often called
compiler validation).
Recognize legal and illegal program constructs: Compilers are designed in such a way
that they can identify which part of the program formed from one or more lexical tokens
using the appropriate rules of the language is syntactically allowable and which is not.
Good error reporting/handling: A compiler is designed to handle whatever errors it
encounters, be it a syntactical error, an insufficient-memory error, or a logic error; these
are meticulously handled and displayed to the user.
The speed of the target code: Compilers make sure that the target code is fast, because
for large programs slow code is a serious limitation. Some compilers do so by
translating byte code into target code to run on the specific processor using classical
compiling methods.
Preserve the correct meaning of the code: A compiler makes sure that the code logic is
preserved down to the tiniest detail, because a single loss in the logic can change the
program's behavior and produce the wrong result. During the design process, the compiler
therefore goes through a great deal of testing to make sure that no code logic is lost during
the compiling process.
Code debugging help: Compilers help make the debugging process easier by pointing
out the error line to the programmer and telling them the type of error encountered,
so they know how to start fixing it.
Reduced system load: Compilers make your program run faster than interpreted
programs because the program is compiled only once, hence reducing system load and
response time the next time you run the program.
Protection for source code and programs: Compilers protect your program source by
discouraging other users from making unauthorized changes to your programs; as the
author, you can distribute your programs in object code.
Analysis of the Source Program
Linear Analysis: In which the stream of characters making up the source program is
read from left to right and grouped into tokens, that is, sequences of characters having a
collective meaning.
Hierarchical Analysis: In which characters or tokens are grouped hierarchically into
nested collections with collective meaning.
Semantic Analysis: In which certain checks are performed to ensure that the
components of a program fit together meaningfully.
Phases of compiler
A compiler operates in various phases; each phase transforms the source program from one
representation to another. Every phase takes its input from the previous stage and feeds its
output to the next phase of the compiler.
There are 6 phases in a compiler. Each of these phases helps in converting the high-level
language into machine code. The phases of a compiler are:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generator
Code optimizer
Code generator
Phase 1: Lexical Analysis
Lexical analysis is the first phase, in which the compiler scans the source code. The scan
proceeds left to right, character by character, grouping the characters into tokens.
Here, the character stream from the source program is grouped into meaningful sequences
by identifying the tokens. The lexical analyzer makes an entry for each token in the symbol
table and passes the token to the next phase.
Example:
x = y + 10
Tokens:
x     identifier
=     assignment operator
y     identifier
+     addition operator
10    number
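To make the token-level view concrete, here is a small C sketch written for these notes (an illustration only; the helper name kind is invented) that splits the statement x = y + 10 into lexemes and labels each with the token kind from the list above:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Classify each lexeme of "x = y + 10" the way the table above does. */
static const char *kind(const char *lex) {
    if (isalpha((unsigned char)lex[0])) return "identifier";
    if (isdigit((unsigned char)lex[0])) return "number";
    if (strcmp(lex, "=") == 0) return "assignment operator";
    if (strcmp(lex, "+") == 0) return "addition operator";
    return "unknown";
}

int main(void) {
    char stmt[] = "x = y + 10";
    /* Split on blanks and print <lexeme, token-kind> pairs. */
    for (char *lex = strtok(stmt, " "); lex; lex = strtok(NULL, " "))
        printf("<%s, %s>\n", lex, kind(lex));
    return 0;
}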
Phase 2: Syntax Analysis
Syntax analysis applies rules based on the specific programming language, constructing
the parse tree with the help of the tokens. It determines the structure of the source
language and the grammar or syntax of the language.
Example
Any identifier/number is an expression
If x is an identifier and y+10 is an expression, then x= y+10 is a statement.
Consider parse tree for the following example
(a+b)*c
In a parse tree:
Interior node: a record with an operator field and two fields for its children
Leaf: a record with 2 or more fields, one for the token and others for information about
the token
Phase 3: Semantic Analysis
Semantic analysis:
Ensures that the components of the program fit together meaningfully
Gathers type information and checks for type compatibility
Checks whether the operands are permitted by the source language
Stores the type information gathered in the symbol table or syntax tree
Performs type checking
In the case of a type mismatch, where there are no exact type correction rules which
satisfy the desired operation, a semantic error is shown
Example
float x = 20.2;
float y = x*30;
In the above code, the semantic analyzer will typecast the integer 30 to float 30.0 before
multiplication.
Phase 4: Intermediate Code Generation
Once semantic analysis is complete, the compiler generates an intermediate representation
of the source code for the target machine, for example three-address code such as:
t1 := inttofloat(5)
t2 := rate * t1
t3 := count + t2
total := t3
Phase 5: Code Optimization
This phase removes unnecessary code and arranges the sequence of statements to speed up
execution without wasting resources.
Example: the code
a = inttofloat(10)
b = c * a
d = e + b
f = d
can become
b = c * 10.0
f = e + b
Phase 6: Code Generation
The target language is the machine code. Therefore, all the memory locations and
registers are also selected and allotted during this phase. The code generated by this
phase is executed to take inputs and generate the expected outputs.
Example: a = b + 60.0 might be translated to the register-machine code
MOVF b, R1
ADDF #60.0, R1
MOVF R1, a
Symbol Table Management
A symbol table contains a record for each identifier with fields for the attributes of the
identifier. This component makes it easier for the compiler to search the identifier record
and retrieve it quickly. The symbol table also helps with scope management. The
symbol table and the error handler interact with all the phases, and the symbol table is
updated correspondingly.
Error Handling
The most common errors are invalid character sequences in scanning, invalid token
sequences in parsing, and scope and type errors in semantic analysis.
The error may be encountered in any of the above phases. After finding errors, the phase
needs to deal with the errors to continue with the compilation process. These errors need
to be reported to the error handler which handles the error to perform the compilation
process. Generally, the errors are reported in the form of message.
GROUPING OF PHASES
The phases of a compiler can be grouped as Front end and Back end.
The front end comprises the phases that depend on the input (source language) and are
independent of the target machine (target language). It includes lexical and syntactic
analysis, symbol table management, semantic analysis and the generation of
intermediate code. Some code optimization can also be done by the front end. It also
includes error handling for the phases concerned.
Front end of a compiler consists of the phases
• Lexical analysis.
• Syntax analysis.
• Semantic analysis.
• Intermediate code generation.
Back end
The back end comprises those phases of the compiler that depend on the target
machine and are independent of the source language. This includes code optimization and
code generation. In addition, it also encompasses error handling and symbol
table management operations.
Back end of a compiler contains
• Code optimization.
• Code generation.
Passes
• The phases of a compiler can be implemented in a single pass; the primary
actions of a pass are the reading of an input file and the writing of an output file.
• Several phases of the compiler are grouped into one pass in such a way that the operations
of each phase are performed during that pass.
• (e.g.) Lexical analysis, syntax analysis, semantic analysis and intermediate code
generation might be grouped into one pass. If so, the token stream after lexical analysis
may be translated directly into intermediate code.
Reducing the Number of Passes
• Minimizing the number of passes improves the time efficiency as reading from and
writing to intermediate files can be reduced.
• When grouping phases into one pass, the entire program may have to be kept in memory to
ensure proper information flow to each phase, because one phase may
need information in a different order than the order in which the previous phase produces it.
The internal representation of the source or target program differs from its external form,
so the memory for the internal form may be larger than that of the input and output.
COUSINS OF COMPILER
The cousins of the compiler are:
1. Preprocessor
2. Compiler
3. Assembler
4. Linker
5. Loader
6. Memory
1) Preprocessor
A preprocessor is a program that processes its input data to produce output that is used
as input to another program. The output is said to be a preprocessed form of the input
data, which is often used by some subsequent programs like compilers. They may
perform the following functions.
1. Macro processing
2. File Inclusion
3. Rational Preprocessors
4. Language extension
1. Macro processing: A macro is a rule or pattern that specifies how a certain input
sequence should be mapped to an output sequence according to a defined procedure. The
mapping process that instantiates a macro into a specific output sequence is known as
macro expansion.
2. File Inclusion: Preprocessor includes header files into the program text. When the
preprocessor finds an #include directive it replaces it by the entire content of the
specified file.
3. Rational Preprocessors: These processors augment older languages with more modern
flow-of-control and data-structuring facilities.
4. Language extension: These processors attempt to add capabilities to the language by
what amounts to built-in macros. For example, the language Equel is a database query
language embedded in C.
2) Compiler
It takes pure high-level language as input and converts it into assembly code.
3) Assembler
It takes assembly code as input and converts it into machine code. The assembler creates
object code by translating assembly instruction mnemonics into machine code. There are
two types of assemblers. One-pass assemblers go through the source code once and
assume that all symbols will be defined before any instruction that references them.
Two-pass assemblers create a table with all symbols and their values in the first pass,
and then use the table in a second pass to generate code.
4) Linker
1. Allocation: It means getting the memory portions from the operating system and
storing the object data.
2. Relocation: It maps relative addresses to physical addresses, relocating the
object code.
3. Linking: It combines all the object modules into a single executable
file.
5) Loader
A loader is the part of an operating system that is responsible for loading programs in
memory, one of the essential stages in the process of starting a program.
6) Memory
Simple one pass compiler: overview – Syntax definition
Language Definition
A language is a set of strings over some fixed alphabet. For example, over the alphabet
A0 = {0,1}, one language is L0 = {0, 1, 100, 101, ...}.
Syntax definition
Example :Grammar for expressions consisting of digits and plus and minus signs.
list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
list, digit : grammar variables, grammar symbols
0,1,2,3,4,5,6,7,8,9,-,+ : tokens, terminal symbols
Conventions for specifying a grammar:
Terminal symbols : boldface strings such as if, num, id
Nonterminal symbols (grammar variables) : italicized names such as list, digit, A, B
Grammar G=(N,T,P,S)
N : a set of nonterminal symbols
T : a set of terminal symbols, tokens
P : a set of production rules
S : a start symbol, S ∈ N
Grammar G for a language L = { 9-5+2, 3-1,….}
G=(N,T,P,S)
N={list,digit}
T={0,1,2,3,4,5,6,7,8,9,-,+}
P= list → list + digit
list → list - digit
list → digit
digit → 0|1|2|3|4|5|6|7|8|9
Ambiguity
A grammar is said to be ambiguous if the grammar has more than one parse tree for a
given string of tokens.
Two Parse tree for 9-5+2
Associativity of operators
An operator is left-associative if an operand with that operator on both sides of it
belongs to the operator on its left; for example, 9-5+2 is equivalent to (9-5)+2.
digit → 0|1|…|9
letter → a|b|…|z
Precedence of operators
We say that one operator (*) has higher precedence than another operator (+) if the
operator (*) takes its operands before the other operator (+) does.
LEXICAL ANALYSIS
Lexical Analysis:
reads and converts the input into a stream of tokens to be analyzed by the parser.
lexeme : a sequence of characters which comprises a single token.
Lexical Analyzer → Lexeme / Token → Parser
Removal of White Space and Comments
Remove white space(blank, tab, new line etc.) and comments
Constants
Constants: for now, consider only integers.
Example: for the input 31 + 28, the output (token representation) is:
input : 31 + 28
output : <num, 31> <+, > <num, 28>
num, + : tokens
31, 28 : attribute values (lexemes) of the integer token num
Recognizing Identifiers
o Identifiers are names of variables, arrays, functions, etc.
o A grammar treats an identifier as a token.
o e.g.) input : count = count + increment; output : <id,1><=, ><id,1><+, ><id,2>;
The attribute of each id token is a pointer to its entry in the symbol table.
Recognizing Keywords
Keywords are reserved, i.e., they cannot be used as identifiers. Then a character
string forms an identifier only if it is not a keyword.
punctuation symbols
operators : + - * / := <> …
Lexical analysis is the very first phase in compiler design. A lexer takes the
modified source code, which is written in the form of sentences. In other words, it helps
you convert a sequence of characters into a sequence of tokens. The lexical analyzer
breaks this text into a series of tokens. It removes any extra spaces or comments written
in the source code.
Programs that perform lexical analysis in compiler design are called lexical analyzers
or lexers. A lexer contains a tokenizer or scanner. If the lexical analyzer detects that a
token is invalid, it generates an error. The role of the lexical analyzer in compiler design is
to read character streams from the source code, check for legal tokens, and pass the data
to the syntax analyzer when it demands it.
Example
How Pleasant Is The Weather?
See this lexical analysis example: here, we can easily recognize that there are five
words: How, Pleasant, Is, The, Weather. This is very natural for us, as we can recognize
the separators, blanks, and the punctuation symbol.
Basic Terminologies
What’s a lexeme?
A lexeme is a sequence of characters in the source program that matches the
pattern for a token. It is nothing but an instance of a token.
What’s a token?
Tokens in compiler design are the sequence of characters which represents a unit of
information in the source program.
What is Pattern?
A pattern is a description of the form that the lexemes of a token may take. In the case
of a keyword used as a token, the pattern is the sequence of characters that forms the
keyword.
Lexical analyzer scans the entire source code of the program. It identifies each token one
by one. Scanners are usually implemented to produce tokens only when requested by a
parser. Here is how recognition of tokens in compiler design works-
1. “Get next token” is a command which is sent from the parser to the lexical
analyzer.
2. On receiving this command, the lexical analyzer scans the input until it finds the
next token.
3. It returns the token to Parser.
The lexical analyzer skips whitespace and comments while creating these tokens. If any
error is present, the lexical analyzer will correlate that error with the source file and
line number.
#include <stdio.h>
int maximum(int x, int y) {
    // This will compare 2 numbers
    if (x > y)
        return x;
    else {
        return y;
    }
}
Examples of tokens created:
Lexeme                            Token
int                               keyword
maximum                           identifier
(                                 operator
int                               keyword
x                                 identifier
,                                 operator
int                               keyword
y                                 identifier
)                                 operator
{                                 operator
if                                keyword
// This will compare 2 numbers    comment
#include <stdio.h>                pre-processor directive
#define NUMS 8,9                  pre-processor directive
NUMS                              macro
\n \b \t                          whitespace
Lexical Errors
A character sequence which is not possible to scan into any valid token is a lexical error.
Important facts about the lexical error:
Lexical errors are not very common, but they should be managed by the scanner.
Misspellings of identifiers, operators and keywords are considered lexical errors.
Generally, a lexical error is caused by the appearance of some illegal character,
mostly at the beginning of a token.
Advantages of Lexical Analysis
The simplicity of design: It eases the process of lexical analysis and syntax
analysis by eliminating unwanted tokens.
Improved compiler efficiency: It helps you to improve compiler efficiency.
Specialization: Specialized techniques can be applied to improve the lexical
analysis process.
Portability: Only the scanner is required to communicate with the outside world;
input-device-specific peculiarities are restricted to the lexer.
Lexical analyzer method is used by programs like compilers which can use the
parsed data from a programmer’s code to create a compiled binary executable
code
It is used by web browsers to format and display a web page with the help of
parsed data from JavaScript, HTML, and CSS.
A separate lexical analyzer helps you to construct a specialized and potentially
more efficient processor for the task
Disadvantages of Lexical Analysis
You need to spend significant time reading the source program and partitioning it
into tokens.
Some regular expressions are quite difficult to understand compared to PEG or
EBNF rules
More effort is needed to develop and debug the lexer and its token descriptions
Additional runtime overhead is required to generate the lexer tables and construct
the tokens
REGULAR EXPRESSIONS
Regular expression is a formula that describes a possible set of string. Component of
regular expression..
X - the character x
. any character, usually accept a new line [x y z] any of the characters x, y, z,
R? a R or nothing (=optionally as R)
R* zero or more occurrences…..
For example, an identifier is a letter followed by zero or more letters or digits. In regular
expression notation we would write:
identifier = letter (letter | digit)*
Here are the rules that define the regular expressions over an alphabet Σ:
ε is a regular expression denoting { ε }, that is, the language containing only the empty
string.
For each 'a' in Σ, a is a regular expression denoting { a }, the language with only one
string, consisting of the single symbol 'a'.
If R and S are regular expressions, then R | S, RS and R* are also regular expressions,
denoting the union, the concatenation and the Kleene closure of the corresponding
languages.
REGULAR DEFINITIONS
For notational convenience, we may wish to give names to regular expressions and to
define regular expressions using these names as if they were symbols.
Identifiers are the set or string of letters and digits beginning with a letter. The following
regular definition provides a precise specification for this class of string.
Example 1:
ab*|cd? is equivalent to (a(b*)) | (c(d?))
Regular definition for a Pascal identifier:
letter → A | B | … | Z | a | b | … | z
digit → 0 | 1 | 2 | … | 9
id → letter (letter | digit)*
Recognition of tokens:
We have learnt how to express patterns using regular expressions. Now, we must study how
to take the patterns for all the needed tokens and build a piece of code that examines the
input string and finds a prefix that is a lexeme matching one of the patterns.
stmt → if expr then stmt
      | if expr then stmt else stmt
      | ε
expr → term relop term
      | term
term → id
      | number
For relop ,we use the comparison operations of languages like Pascal or SQL where = is
“equals” and < > is “not equals” because it presents an interesting structure of lexemes.
The terminal of grammar, which are if, then , else, relop ,id and numbers are the names
of tokens as far as the lexical analyzer is concerned, the patterns for the tokens are
described using regular definitions.
digit → [0-9]
digits → digit+
number → digits (. digits)? (E [+-]? digits)?
letter → [A-Za-z]
id → letter (letter | digit)*
if → if
then → then
else → else
relop → < | > | <= | >= | = | <>
In addition, we assign the lexical analyzer the job of stripping out white space, by
recognizing the "token" ws defined by:
ws → (blank | tab | newline)+
Here, blank, tab and newline are abstract symbols that we use to express the ASCII
characters of the same names. Token ws is different from the other tokens in that, when
we recognize it, we do not return it to the parser, but rather restart the lexical analysis from
the character that follows the white space. It is the following token that gets returned to
the parser.
TRANSITION DIAGRAM:
Transition Diagram has a collection of nodes or circles, called states. Each state
represents a condition that could occur during the process of scanning the input looking
for a lexeme that matches one of several patterns .
Edges are directed from one state of the transition diagram to another. Each edge is
labeled by a symbol or set of symbols.
If we are in some state s and the next input symbol is a, we look for an edge out of state s
labeled by a. If we find such an edge, we advance the forward pointer and enter the state
of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are
1. Certain states are said to be accepting, or final. These states indicate that a
lexeme has been found, although the actual lexeme may not consist of all positions between
the lexemeBegin and forward pointers. We always indicate an accepting state by a double
circle.
2. In addition, if it is necessary to retract the forward pointer one position, then we
shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an edge labeled
"start" entering from nowhere. The transition diagram always begins in the start state
before any input symbols have been read.
Fig. 3.4: Transition diagram of Identifier
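The identifier diagram of Fig. 3.4 can be simulated directly in C. The sketch below is an illustration only; the state numbers 9, 10 and 11 and the helper name recognize_id are assumptions, not taken from the text:

#include <ctype.h>
#include <stdio.h>

/* States of the identifier diagram: 9 = start, 10 = in letters/digits,
 * 11 = accepting (with retraction of one character, the '*' convention). */
int recognize_id(const char *s, int *len) {
    int state = 9, i = 0;
    while (1) {
        char c = s[i];
        switch (state) {
        case 9:   /* start: need a letter */
            if (isalpha((unsigned char)c)) { state = 10; i++; }
            else return 0;
            break;
        case 10:  /* letters or digits keep us in this state */
            if (isalnum((unsigned char)c)) i++;
            else state = 11;
            break;
        case 11:  /* accept; the offending character was not consumed */
            *len = i;
            return 1;
        }
    }
}

int main(void) {
    int n;
    if (recognize_id("count2 = 5", &n))
        printf("identifier of length %d\n", n);  /* prints 6 */
    return 0;
}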
FINITE AUTOMATON
A recognizer for a language is a program that takes a string x, and answers “yes”
if x is a sentence of that language, and “no” otherwise.
We call the recognizer of the tokens as a finite automaton.
A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
This means that we may use a deterministic or non-deterministic automaton as a
lexical analyzer.
Both deterministic and non-deterministic finite automaton recognize regular sets.
Which one?
– deterministic – faster recognizer, but it may take more space
– non-deterministic – slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; Then we convert them into a DFA
to get a lexical analyzer for our tokens.
Example:
Converting RE to NFA
This is one way to convert a regular expression into a NFA.
There can be other (more efficient) ways to do the conversion.
Thompson's Construction is a simple and systematic method.
It guarantees that the resulting NFA will have exactly one final state, and one start
state.
Construction starts from simplest parts (alphabet symbols).
To create a NFA for a complex regular expression, NFAs of its sub-expressions
are combined to create its NFA.
To recognize an empty string ε:
N(r1) and N(r2) are NFAs for regular expressions r1 and r2.
Example:
For a RE (a|b) * a, the NFA construction is shown below.
From the point of view of the input, any two states that are connected by an ε-
transition may as well be the same, since we can move from one to the other without
consuming any character. Thus states which are connected by an ε-transition will be
represented by the same state in the DFA.
If it is possible to have multiple transitions based on the same symbol, then we
can regard a transition on a symbol as moving from a state to a set of states (i.e., the union
of all those states reachable by a transition on the current symbol). Thus these states will
be combined into a single DFA state.
To perform this operation, let us define two functions:
The ε-closure function takes a state and returns the set of states reachable from it
based on (one or more) ε-transitions. Note that this will always include the state itself. We
should be able to get from a state to any state in its ε-closure without consuming any input.
The function move takes a state and a character, and returns the set of states
reachable by one transition on this character.
We can generalize both these functions to apply to sets of states by taking the union of the
application to individual states.
put ε-closure({s0}) as an unmarked state into the set of DFA states (DS)
while (there is an unmarked state S1 in DS) do
begin
    mark S1
    for each input symbol a do
    begin
        S2 ← ε-closure(move(S1, a))
        if S2 is not in DS then add S2 into DS as an unmarked state
        transfunc[S1, a] ← S2
    end
end
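As a concrete illustration of the ε-closure function used above, the following self-contained C sketch computes the closure of a state by depth-first search; the toy NFA, its size and its transitions are invented for the example:

#include <stdio.h>

#define NSTATES 4

/* eps[u][v] = 1 if there is an epsilon-transition u -> v (toy NFA). */
static int eps[NSTATES][NSTATES] = {
    {0,1,0,0},   /* 0 -e-> 1 */
    {0,0,1,0},   /* 1 -e-> 2 */
    {0,0,0,0},
    {0,0,0,0},
};

/* Depth-first search: mark every state reachable by epsilon moves. */
static void closure(int s, int in[]) {
    in[s] = 1;                       /* a state is always in its own closure */
    for (int t = 0; t < NSTATES; t++)
        if (eps[s][t] && !in[t])
            closure(t, in);
}

int main(void) {
    int in[NSTATES] = {0};
    closure(0, in);
    printf("eps-closure({0}) = {");
    for (int t = 0; t < NSTATES; t++)
        if (in[t]) printf(" %d", t);
    printf(" }\n");                  /* prints { 0 1 2 } */
    return 0;
}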
Lex specifications:
A Lex program (the .l file ) consists of three parts:
declarations
%%
translation rules
%%
auxiliary procedures
1. The declarations section includes declarations of variables, manifest constants and
regular definitions.
2. The translation rules section contains rules of the form:
p1 {action 1}
p2 {action 2}
p3 {action 3}
……
where each p is a regular expression and each action is a program fragment describing
what action the lexical analyzer should take when a pattern p matches a lexeme. In Lex
the actions are written in C.
3. The third section holds whatever auxiliary procedures are
needed by the actions. Alternatively, these procedures can be compiled
separately and loaded with the lexical analyzer.
Note: You can refer to a sample lex program given in page no. 109 of chapter 3 of the
book:
Compilers: Principles, Techniques, and Tools by Aho, Sethi & Ullman for more clarity.
Input buffering
The lexical analyzer would otherwise have to access secondary memory each time to
identify tokens, which is time-consuming and costly. So the input strings are stored in a
buffer and then scanned by the lexical analyzer.
Lexical Analysis scans input string from left to right one character at a time to identify
tokens. It uses two pointers to scan tokens −
Begin Pointer (bptr) − It points to the beginning of the string to be read.
Look Ahead Pointer (lptr) − It moves ahead to search for the end of the token.
Example − For statement int a, b;
1. Both pointers start at the beginning of the string, which is stored in the buffer.
After processing the token ("int"), both pointers will be set to the next token ('a'), and this
process will be repeated for the whole program.
A buffer can be divided into two halves. If the look-ahead pointer moves past the
halfway point of the first half, the second half is filled with new characters to be read. If
the look-ahead pointer moves towards the right end of the second half, the first half
is filled with new characters, and so on.
Sentinels − Sentinels are used to make this check cheap: each time the forward pointer is
moved, a check is made to see whether it has moved off one half of the buffer. If it has,
the other half must be reloaded.
Buffer Pairs − A specialized buffering technique can decrease the amount of overhead,
which is needed to process an input character in transferring characters. It includes two
buffers, each includes N-character size which is reloaded alternatively.
Two pointers, lexemeBegin and forward, are maintained. LexemeBegin points to the
start of the current lexeme being discovered. Forward scans ahead until a match for a
pattern is found. Once the lexeme is found, lexemeBegin is set to the character directly
after the lexeme just found, and forward is set to the character at its right end.
Preliminary Scanning − Certain processes are best performed as characters are moved
from the source file to the buffer. For example, comments can be deleted. Languages like
FORTRAN, which ignore blanks, can have them deleted from the character stream, and
strings of several blanks can be collapsed into one blank. Pre-processing the character
stream before lexical analysis saves the trouble of moving the look-ahead pointer
back and forth over a string of blanks.
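A minimal C sketch of the buffer-pair scheme with sentinels follows; it is an illustration under stated assumptions (the buffer size, the file name input.txt, the helper names, and the use of '\0' as sentinel are all invented, and the source is assumed to contain no NUL bytes):

#include <stdio.h>

#define N 16                 /* size of each buffer half (assumed) */
char buf[2 * N + 2];         /* two halves, each ending in a sentinel slot */
char *forward;
FILE *src;

/* Reload one half and terminate it with the sentinel '\0'. */
static void load(char *half) {
    size_t n = fread(half, 1, N, src);
    half[n] = '\0';          /* sentinel: end of half or end of file */
}

/* Advance forward; reload a half only when a sentinel is hit. */
int next_char(void) {
    if (*forward == '\0') {
        if (forward == buf + N) {                 /* end of first half */
            load(buf + N + 1);
            forward = buf + N + 1;
        } else if (forward == buf + 2 * N + 1) {  /* end of second half */
            load(buf);
            forward = buf;
        } else {
            return EOF;                           /* real end of input */
        }
    }
    return *forward++;
}

int main(void) {
    src = fopen("input.txt", "r");
    if (!src) return 1;
    load(buf);
    forward = buf;
    for (int c; (c = next_char()) != EOF; )
        putchar(c);                               /* echo the buffered input */
    fclose(src);
    return 0;
}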
UNIT – II
Symbol tables: Symbol table entries – List data structures for symbol table – Hash
tables – Representation of scope information – Syntax Analysis: Role of parser –
Context free grammar – Writing a grammar – Top down parsing – Simple bottom up
parsing – Shift-reduce parsing.
Symbol Table
Symbol table is an important data structure created and maintained by compilers in order
to store information about the occurrence of various entities such as variable names,
function names, objects, classes, interfaces, etc. Symbol table is used by both the
analysis and the synthesis parts of a compiler.
A symbol table may serve the following purposes depending upon the language in hand:
To store the names of all entities in a structured form in one place.
To verify if a variable has been declared.
To implement type checking, by verifying that assignments and expressions are
semantically correct.
To determine the scope of a name (scope resolution).
A symbol table is simply a table which can be either linear or a hash table. It maintains
an entry for each name in the following format:
<symbol name, type, attribute>
For example, if a symbol table has to store information about the following variable
declaration:
static int interest;
then it should store an entry such as:
<interest, int, static>
Implementation
If a compiler is to handle a small amount of data, then the symbol table can be
implemented as an unordered list, which is easy to code but suitable only for small
tables. A symbol table can be implemented in one of the following ways:
Linear (sorted or unsorted) list
Binary search tree
Hash table
Among all, symbol tables are mostly implemented as hash tables, where the source code
symbol itself is treated as a key for the hash function and the return value is the
information about the symbol.
Operations
A symbol table, either linear or hash, should provide the following operations.
insert()
This operation is more frequently used by analysis phase, i.e., the first half of the
compiler where tokens are identified and names are stored in the table. This operation is
used to add information in the symbol table about unique names occurring in the source
code. The format or structure in which the names are stored depends upon the compiler
in hand.
An attribute for a symbol in the source code is the information associated with that
symbol. This information contains the value, state, scope, and type about the symbol.
The insert() function takes the symbol and its attributes as arguments and stores the
information in the symbol table.
For example:
int a;
insert(a, int);
lookup()
The format of lookup() function varies according to the programming language. The
basic format should match the following:
lookup(symbol)
This method returns 0 (zero) if the symbol does not exist in the symbol table. If the
symbol exists in the symbol table, it returns its attributes stored in the table.
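The insert() and lookup() operations can be sketched with a chained hash table in C. The hash function, table size and entry layout below are illustrative assumptions, not a prescribed implementation (a real compiler would also copy the strings it stores):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 211                      /* table size, a prime (assumed) */

struct entry {
    const char *name, *type;
    struct entry *next;               /* chain for collisions */
};
static struct entry *table[SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % SIZE;
}

void insert(const char *name, const char *type) {
    unsigned h = hash(name);
    struct entry *e = malloc(sizeof *e);
    e->name = name;                   /* sketch: store the pointer directly */
    e->type = type;
    e->next = table[h];
    table[h] = e;
}

struct entry *lookup(const char *name) {
    for (struct entry *e = table[hash(name)]; e; e = e->next)
        if (strcmp(e->name, name) == 0) return e;
    return NULL;                      /* NULL: symbol does not exist */
}

int main(void) {
    insert("a", "int");
    struct entry *e = lookup("a");
    if (e) printf("%s : %s\n", e->name, e->type);
    return 0;
}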
Scope Management
A compiler maintains two types of symbol tables: a global symbol table which can be
accessed by all the procedures and scope symbol tables that are created for each scope
in the program.
To determine the scope of a name, symbol tables are arranged in hierarchical structure as
shown in the example below:
...
int value=10;
void pro_one()
{
int one_1;
int one_2;
{ \
int one_3; |_ inner scope 1
int one_4; |
} /
int one_5;
{ \
int one_6; |_ inner scope 2
int one_7; |
} /
}
void pro_two()
{
int two_1;
int two_2;
{ \
int two_3; |_ inner scope 3
int two_4; |
} /
int two_5;
}
...
The global symbol table contains names for one global variable (int value) and two
procedure names, which should be available to all the child nodes shown above. The
names mentioned in the pro_one symbol table (and all its child tables) are not available
for pro_two symbols and its child tables.
This symbol table data structure hierarchy is stored in the semantic analyzer and
whenever a name needs to be searched in a symbol table, it is searched using the
following algorithm:
first, a symbol will be searched for in the current scope, i.e., the current symbol table;
if the name is found, the search is completed, else it will be searched in the parent
symbol table, until
either the name is found or the global symbol table has been searched for the name.
A C sketch of this search follows.
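This is a minimal sketch of the hierarchical search, assuming one table per scope linked to its parent; the struct layout, sizes and example names are invented for illustration:

#include <stdio.h>
#include <string.h>

#define MAX 64

/* One symbol table per scope, each linked to its enclosing (parent) scope. */
struct scope {
    const char *names[MAX];
    int count;
    struct scope *parent;             /* NULL for the global table */
};

/* Search the current scope first, then each parent, as in the algorithm. */
int found_in(struct scope *s, const char *name) {
    for (; s != NULL; s = s->parent)
        for (int i = 0; i < s->count; i++)
            if (strcmp(s->names[i], name) == 0)
                return 1;
    return 0;                         /* searched up to global: not found */
}

int main(void) {
    struct scope global  = {{"value", "pro_one", "pro_two"}, 3, NULL};
    struct scope pro_one = {{"one_1", "one_2", "one_5"}, 3, &global};
    printf("%d\n", found_in(&pro_one, "value"));  /* 1: found in global */
    printf("%d\n", found_in(&global, "one_1"));   /* 0: not visible here */
    return 0;
}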
Operations of Symbol table – The basic operations defined on a symbol table are
insert() and lookup(), described above.
Implementation of Symbol table
Following are commonly used data structures for implementing symbol table:-
1. List
o In this method, an array is used to store names and associated information.
o A pointer "available" is maintained at the end of all stored records, and new
names are added in the order in which they arrive.
o To search for a name, we start from the beginning of the list and scan up to the
available pointer; if the name is not found, we get an error "use of undeclared name".
o While inserting a new name we must ensure that it is not already present,
otherwise an error occurs, i.e., "multiply defined name".
o Insertion is fast, O(1), but lookup is slow for large tables: O(n) on average.
o The advantage is that it takes a minimum amount of space.
2. Linked List
o This implementation uses a linked list. A link field is added to each
record.
o Searching for names is done in the order indicated by the link fields.
o A pointer "First" is maintained to point to the first record of the symbol
table.
o Insertion is fast, O(1), but lookup is slow for large tables: O(n) on average.
3. Hash Table
o In the hashing scheme, two tables are maintained: a hash table and a symbol
table. This is the most commonly used method for implementing symbol tables.
o A hash table is an array with an index range 0 to tablesize − 1. Its
entries are pointers to entries of the symbol table.
o To search for a name we use a hash function that maps the name to an integer
between 0 and tablesize − 1.
o Insertion and lookup can be made very fast: O(1).
o The advantage is that searching is quick; the disadvantage is that
hashing is more complicated to implement.
4. Binary Search Tree
o Another approach to implementing a symbol table is to use a binary search
tree i.e. we add two link fields i.e. left and right child.
o All names are created as child of the root node that always follows the
property of the binary search tree.
o Insertion and lookup are O(log2 n) on average.
SYNTAX ANALYSIS
A parser for a grammar is a program that takes as input a string w (a sequence of
tokens obtained from the lexical analyzer) and produces as output either a
parse tree for w, if w is a valid sentence of the grammar, or an error message
indicating that w is not a valid sentence of the given grammar. The goal of the
parser is to determine the syntactic validity of a source string. If the string is valid,
a tree is built for use by the subsequent phases of the compiler. The tree reflects
the sequence of derivations or reductions used during parsing; hence, it is
called a parse tree. If the string is invalid, the parser has to issue diagnostic
messages identifying the nature and cause of the errors in the string. Every
elementary subtree in the parse tree corresponds to a production of the
grammar.
There are two main types of parser:
a. Top down parser: builds parse trees from the top (root) to the bottom (leaves).
b. Bottom up parser: builds parse trees from the leaves up to the root.
Fig . 4.1: position of parser in compiler
model.
CONTEXT FREE GRAMMARS
Inherently recursive structures of a programming language are defined by a
context-free grammar. A context-free grammar has the four components G =
(V, T, P, S).
Here, V is a finite set of non-terminals (syntactic
variables).
T is a finite set of terminals (in our case, this will be the set of tokens).
P is a finite set of production rules of the following form:
A → α, where A is a non-terminal and α is a string of terminals and
non-terminals (including the empty string).
S is the start symbol (one of the non-terminal symbols).
L(G) is the language of G (the language generated by G) which is a set of
sentences.
A sentence of L(G) is a string of terminal symbols of G. If S is the start
symbol of G, then ω is a sentence of L(G) iff S ⇒* ω, where ω is a string of
terminals of G. If G is a context-free grammar, L(G) is a context-free
language. Two grammars G1 and G2 are equivalent if they produce the same
language.
Consider a derivation of the form S ⇒* α. If α contains non-terminals, it is
called a sentential form of G. If α does not contain non-terminals, it is called
a sentence of G.
Derivations
In general, a derivation step is
αAβ ⇒ αγβ
if there is a production rule A → γ in our grammar, where α and β are arbitrary
strings of terminal and non-terminal symbols. A sequence α1 ⇒ α2 ⇒ ... ⇒ αn is a
derivation (αn derives from α1, or α1 derives αn). At each derivation step, we can
choose any of the non-terminals in the sentential form for replacement. There are
two kinds of derivation:
1. If we always choose the left-most non-terminal in each derivation step, the
derivation is called a left-most derivation.
2. If we always choose the right-most non-terminal in each derivation step, the
derivation is called a right-most derivation.
Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id
Leftmost derivation:
E ⇒ E+E ⇒ E*E+E ⇒ id*E+E ⇒ id*id+E ⇒ id*id+id
The string w = id*id+id is derived from the grammar and consists of all
terminal symbols.
Rightmost derivation:
E ⇒ E+E ⇒ E+E*E ⇒ E+E*id ⇒ E+id*id ⇒ id+id*id
Given grammar G : E → E+E | E*E | ( E ) | - E | id, sentence
to be derived : – (id+id)
PARSE TREE
Inner nodes of a parse tree are non-terminal symbols.
The leaves of a parse tree are terminal symbols.
A parse tree can be seen as a graphical representation of a derivation.
Ambiguity:
A grammar that produces more than one parse tree for some sentence is said to be an
ambiguous grammar.
Example : Given grammar G : E → E+E | E*E | ( E ) | - E | id
Example:
To disambiguate the grammar E → E+E | E*E | E^E | id | (E), we can use the
precedence of operators as follows:
^ (right to left)
*, / (left to right)
+, - (left to right)
We get the following unambiguous grammar:
E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)
Consider this example, G: stmt → if expr then stmt | if expr then stmt
else stmt | other. This grammar is ambiguous, since the string if E1 then if
E2 then S1 else S2 has two parse trees (the dangling-else ambiguity).
Algorithm to eliminate left recursion:
1. Arrange the non-terminals in some order A1, A2, ..., An.
2. for i := 1 to n do begin
       for j := 1 to i−1 do
           replace each production of the form Ai → Ajγ by the productions
           Ai → δ1γ | δ2γ | … | δkγ, where Aj → δ1 | δ2 | … | δk are all the
           current Aj-productions;
       eliminate the immediate left recursion among the Ai-productions
   end
Immediate left recursion A → Aα | β is eliminated by rewriting the productions as
A → βA' and A' → αA' | ε.
Left Factoring: When a non-terminal has two or more productions whose right sides
start with the same grammar symbols, the grammar is left factored: A → αβ1 | αβ2 is
rewritten as A → αA', A' → β1 | β2.
TOP-DOWN PARSING
Top-down parsing constructs the parse tree for the input string starting at the root and
creating the nodes of the parse tree in preorder, working from the root to the leaves.
Types of top-down parsing :
1. Recursive descent parsing
2. Predictive parsing
1. RECURSIVE DESCENT PARSING
Recursive descent parsing is one of the top-down parsing techniques that uses a set of
recursive procedures to scan its input; it may involve backtracking.
Example: Consider the grammar G: S → cAd, A → ab | a and the input string w = cad.
Step 1:
Initially create a tree with a single node labeled S. The input pointer points to 'c', the
first symbol of w. Expand the tree with the production S → cAd.
Step 2:
The leftmost leaf 'c' matches the first symbol of w, so advance the input pointer to
the second symbol of w, 'a', and consider the next leaf 'A'. Expand A using the first
alternative, A → ab.
Step 3:
The second symbol 'a' of w also matches the second leaf of the tree. So advance
the input pointer to the third symbol of w, 'd'. But the third leaf of the tree is 'b', which
does not match the input symbol 'd'.
Hence discard the chosen production and reset the pointer to the second position. This
is called backtracking.
Step 4:
Now try the second alternative for A, namely A → a. The leaf 'a' matches the second
symbol of w and the leaf 'd' matches the third symbol; the parse is complete.
The recursive descent parser for the grammar
E → TE'   E' → +TE' | ε   T → FT'   T' → *FT' | ε   F → (E) | id
can be written as the following set of procedures:
Procedure E( )
begin
    T( );
    EPRIME( );
end
Procedure EPRIME( )
begin
    if input-symbol = '+' then
        ADVANCE( );
        T( );
        EPRIME( );
end
Procedure T( )
begin
    F( );
    TPRIME( );
end
Procedure TPRIME( )
begin
    if input-symbol = '*' then
        ADVANCE( );
        F( );
        TPRIME( );
end
Procedure F( )
begin
    if input-symbol = 'id' then
        ADVANCE( );
    else if input-symbol = '(' then
        ADVANCE( );
        E( );
        if input-symbol = ')' then ADVANCE( );
        else ERROR( );
    else ERROR( );
end
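The same parser can be written directly in C. This is a sketch for illustration only, using the single character 'i' to stand for the token id and a global input pointer (both assumptions made to keep the sketch short):

#include <stdio.h>
#include <stdlib.h>

/* Recursive-descent parser for E -> T E', E' -> + T E' | e,
 * T -> F T', T' -> * F T' | e, F -> ( E ) | i   ('i' stands for id). */
static const char *ip;                 /* input pointer */

static void error(void) { printf("reject\n"); exit(0); }
static void E(void);

static void F(void) {
    if (*ip == 'i') ip++;
    else if (*ip == '(') { ip++; E(); if (*ip == ')') ip++; else error(); }
    else error();
}
static void Tprime(void) { if (*ip == '*') { ip++; F(); Tprime(); } }  /* else epsilon */
static void T(void) { F(); Tprime(); }
static void Eprime(void) { if (*ip == '+') { ip++; T(); Eprime(); } }  /* else epsilon */
static void E(void) { T(); Eprime(); }

int main(void) {
    ip = "i+i*i";                      /* i.e., id + id * id */
    E();
    if (*ip == '\0') printf("accept\n");
    else error();
    return 0;
}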
Stack implementation:
The table-driven predictive parser has an input buffer, a stack, a parsing table and an
output stream.
Input buffer:
It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
It contains a sequence of grammar symbols preceded by $ to indicate the bottom of
the stack. Initially, the stack contains the start symbol on top of $.
Parsing table:
It is a two-dimensional array M[A, a], where ‘A’ is a non-terminal and ‘a’ is a
terminal.
Predictive parsing program:
The parser is controlled by a program that considers X, the symbol on top of
stack, and a, the current input symbol. These two symbols determine the parser
action. There are three possibilities:
1. If X = a = $, the parser halts and announces successful completion of
parsing.
2. If X = a ≠ $, the parser pops X off the stack and advances the input pointer to
the next input symbol.
3. If X is a non-terminal, the program consults entry M[X, a] of the parsing table M;
this entry will be either an X-production of the grammar or an error entry. The
behavior of the parser is described by the following algorithm:
set ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a non-terminal */
        if M[X, a] = X → Y1Y2…Yk then begin
            pop X from the stack;
            push Yk, Yk−1, …, Y1 onto the stack, with Y1 on top;
            output the production X → Y1Y2…Yk
        end
        else error()
until X = $ /* stack is empty */
Rules for FOLLOW( ):
1. If S is the start symbol, then FOLLOW(S) contains $.
2. If there is a production A → αBβ, then everything in FIRST(β) except ε is placed
in FOLLOW(B).
3. If there is a production A → αB, or a production A → αBβ where FIRST(β)
contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
Example:
Consider the following grammar:
E → E+T | T
T → T*F | F
F → (E) | id
After eliminating left recursion the grammar is:
E → TE'
E' → +TE' | ε
T → FT'
T' → *FT' | ε
F → (E) | id
First( ):
FIRST(E) = { ( , id}
FIRST(E’) ={+ ,ε}
FIRST(T) = { ( , id}
FIRST(T’) = {*, ε }
FIRST(F) = { ( , id }
Follow( ):
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, $, ) }
FOLLOW(T’) = { +, $, ) }
FOLLOW(F) = {+, * , $ , ) }
LL(1) grammar:
If the parsing table entries are all single entries, i.e., each location has at most
one entry, the grammar is called an LL(1) grammar.
Consider the following grammar:
S → iEtS | iEtSeS | a
E → b
After left factoring, we have:
S → iEtSS' | a
S' → eS | ε
E → b
To construct a parsing table, we need FIRST( ) and FOLLOW( ) for all the non-terminals:
FIRST(S) = { i, a }
FIRST(S') = { e, ε }
FIRST(E) = { b }
FOLLOW(S) = { $, e }
FOLLOW(S') = { $, e }
FOLLOW(E) = { t }
Since the table entry M[S', e] contains more than one production, the grammar is not
an LL(1) grammar.
Actions performed in predictive parsing:
1. Shift
2. Reduce
3. Accept
4. Error
Implementation of predictive parser:
1. Elimination of left recursion, left factoring and ambiguous grammar.
2. Construct FIRST() and FOLLOW() for all non-terminals.
3. Construct predictive parsing table.
4. Parse the given input string using stack and parsing table.
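As an illustration of step 4, the following C sketch drives the parsing table built above for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id. The encoding of nonterminals as single characters (e for E', t for T') and of id as 'i' is an assumption made to keep the sketch short:

#include <stdio.h>
#include <string.h>

/* Nonterminals: E, e (E'), T, t (T'), F.  Terminals: i (id), + * ( ) $.
 * Each table entry M[X, a] is the production body to push ("" = epsilon). */
static const char *M(char X, char a) {
    switch (X) {
    case 'E': if (a=='i'||a=='(') return "Te";  break;
    case 'e': if (a=='+') return "+Te";
              if (a==')'||a=='$') return "";    break;
    case 'T': if (a=='i'||a=='(') return "Ft";  break;
    case 't': if (a=='*') return "*Ft";
              if (a=='+'||a==')'||a=='$') return ""; break;
    case 'F': if (a=='i') return "i";
              if (a=='(') return "(E)";         break;
    }
    return NULL;                                /* error entry */
}

int parse(const char *w) {
    char stack[128] = "$E";                     /* $ at bottom, start symbol on top */
    int top = 1;
    const char *ip = w;
    while (1) {
        char X = stack[top], a = *ip;
        if (X == '$' && a == '$') return 1;     /* accept */
        if (strchr("i+*()$", X)) {              /* X is a terminal */
            if (X != a) return 0;
            top--; ip++;                        /* pop and advance */
        } else {
            const char *body = M(X, a);
            if (!body) return 0;                /* error */
            top--;                              /* pop X */
            for (int k = (int)strlen(body) - 1; k >= 0; k--)
                stack[++top] = body[k];         /* push body reversed */
        }
    }
}

int main(void) {
    printf("%d\n", parse("i+i*i$"));            /* 1: accepted */
    printf("%d\n", parse("i+*i$"));             /* 0: rejected */
    return 0;
}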
BOTTOM-UP PARSING
Constructing a parse tree for an input string beginning at the leaves and going
towards the root is called bottom-up parsing.
A general type of bottom-up parser is a shift-reduce parser.
SHIFT-REDUCE PARSING
Shift-reduce parsing is a type of bottom-up parsing that attempts to construct a
parse tree for an input string beginning at the leaves (the bottom) and
working up towards the root (the top).
Example:
Consider the grammar:
S → aABe
A → Abc | b
B → d
The sentence to be recognized is abbcde. The reductions are:
abbcde  →  aAbcde   (A → b)
aAbcde  →  aAde     (A → Abc)
aAde    →  aABe     (B → d)
aABe    →  S        (S → aABe)
The reverse of this sequence of reductions is a rightmost derivation of abbcde from S.
Handles:
A handle of a string is a substring that matches the right side of a production, and
whose reduction to the non-terminal on the left side of the production represents one
step along the reverse of a rightmost derivation.
Example:
Consider the grammar:
E → E+E
E → E*E
E → (E)
E → id
and the input string id1+id2*id3. The rightmost derivation is:
E → E+E
  → E+E*E
  → E+E*id3
  → E+id2*id3
  → id1+id2*id3
Stack        Input           Action
$            id1+id2*id3 $   shift
$ id1        +id2*id3 $      reduce by E → id
$ E          +id2*id3 $      shift
$ E+         id2*id3 $       shift
$ E+id2      *id3 $          reduce by E → id
$ E+E        *id3 $          shift
$ E+E*       id3 $           shift
$ E+E*id3    $               reduce by E → id
$ E+E*E      $               reduce by E → E*E
$ E+E        $               reduce by E → E+E
$ E          $               accept
• shift – The next input symbol is shifted onto the top of the stack.
• reduce – The parser replaces the handle within a stack with a non-terminal.
• accept – The parser announces successful completion of parsing.
• error – The parser discovers that a syntax error has occurred and calls an
error recovery routine.
Conflicts in shift-reduce parsing:
1. Shift-reduce conflict:
Example:
Consider the grammar E → E+E | E*E | id and the input id+id*id. In the configuration
with $E+E on the stack and *id$ remaining in the input, the parser can either shift *
or reduce by E → E+E; it cannot decide, so there is a shift-reduce conflict.
2. Reduce-reduce conflict:
Consider the grammar:
M → R+R | R+c | R
R → c
and the input c+c:
Stack    Input    Action
$        c+c $    shift
$ c      +c $     reduce by R → c
$ R      +c $     shift
$ R+     c $      shift
$ R+c    $        reduce by R → c (towards M → R+R) or reduce by M → R+c
In the last configuration the parser cannot decide which of the two reductions to make;
this is a reduce-reduce conflict.
Viable prefixes:
➢ α is a viable prefix of the grammar if there is a w such that αw is a right sentential form.
➢ The set of prefixes of right sentential forms that can appear on the stack of a
shift-reduce parser are called viable prefixes.
➢ The set of viable prefixes is a regular language.
OPERATOR-PRECEDENCE PARSING
Example:
E → EAE | (E) | -E | id
A → + | - | * | / | ↑
Since the right side EAE has three consecutive non-terminals, this is not an operator
grammar; substituting for A, it can be written as follows:
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
In operator-precedence parsing, three disjoint precedence relations are defined between
pairs of terminals:
<· - less than
=· - equal to
·> - greater than
The relations give the following meanings:
a <· b — a yields precedence to b
a =· b — a has the same precedence as b
a ·> b — a takes precedence over b
For an operator θ and the end marker $, the common relations are:
) ·> θ ,  θ ·> )
θ ·> $ ,  $ <· θ
Also make:
( =· ) ,  ( <· ( ,  ) ·> ) ,  ( <· id ,  id ·> ) ,  $ <· id ,  id ·> $ ,  $ <· ( ,  ) ·> $
Example:
The operator-precedence relation table for the grammar
E → E+E | E-E | E*E | E/E | E↑E | (E) | -E | id
is given in the following table, assuming
1. ↑ is of highest precedence and right-associative,
2. * and / are of next higher precedence and left-associative, and
3. + and - are of lowest precedence and left-associative.
Note that the blanks in the table denote error entries.
Method: Initially the stack contains $ and the input buffer the string w$. To
parse, we execute the following program:
(1) if only $ is on the stack and only $ is on the input then accept
(2) else begin
(3)     let a be the topmost terminal symbol on the stack and b the current input symbol;
(4)     if a <· b or a =· b then begin push b onto the stack; advance the input pointer end
(5)     else if a ·> b then /* reduce */
(6)         repeat pop the stack
(7)         until the top stack terminal is related by <· to the terminal most recently popped
(8)     else error()
    end
Operator precedence parsing uses a stack and a precedence relation table to implement
the above algorithm. It is a shift-reduce parsing method containing all
four actions: shift, reduce, accept and error.
The initial configuration of an operator precedence parsing is
STACK INPUT
$ w$
Example:
Consider the grammar E → E+E | E-E | E*E | E/E | E↑E | (E) | id. The input string
is id+id*id. The implementation is as follows:
Stack      Input   Relation   Action
$ + * id   $       ·>         pop id
$ + *      $       ·>         pop *
$ +        $       ·>         pop +
$          $                  accept
Advantages of operator precedence parsing:
1. It is easy to implement.
2. Once an operator precedence relation is made between all pairs of terminals
of a grammar, the grammar can be ignored. The grammar is not referred to
anymore during implementation.
Disadvantages of operator precedence parsing:
1. It is hard to handle tokens like the minus sign (-), which has two different
precedences (unary and binary).
2. Only a small class of grammars can be parsed using an operator-precedence parser.
LR PARSERS
An efficient bottom-up syntax analysis technique that can be used to parse a large
class of CFGs is called LR(k) parsing: 'L' stands for left-to-right scanning of the input,
'R' for constructing a rightmost derivation in reverse, and 'k' for the number of input
symbols of lookahead.
Advantages of LR parsing:
✓ It recognizes virtually all programming language constructs for which a
CFG can be written.
✓ It is an efficient non-backtracking shift-reduce parsing method.
✓ The class of grammars that can be parsed using LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.
✓ It detects a syntactic error as soon as possible.
Drawbacks of LR method:
It is too much work to construct an LR parser by hand for a typical programming
language grammar; a specialized tool, called an LR parser generator, is needed.
Types of LR parsing method:
1. SLR- Simple LR
▪ Easiest to implement, least powerful.
2. CLR- Canonical LR
▪ Most powerful, most expensive.
3. LALR- Look-Ahead LR
▪ Intermediate in size and cost between the other two methods.
Fig: Model of an LR parser. The parser has an input buffer a1 … ai … an $, a stack
holding states and grammar symbols s0 X1 s1 … Xm sm (with sm on top), the LR parsing
program, an output stream, and a parsing table with action and goto parts.
➢ The parsing table consists of two parts: the action and goto functions.
Action: The parsing program determines sm, the state currently on top of the
stack, and ai, the current input symbol. It then consults action[sm, ai] in the action
table, which can have one of four values:
1. shift s, where s is a state,
2. reduce by a grammar production A → β,
3. accept, and
4. error.
Goto: The function goto takes a state and a grammar symbol as arguments and
produces a state.
LR Parsing algorithm:
Method: Initially, the parser has s0 on its stack, where s0 is the initial state,
and w$ in the input buffer. The parser then executes the following program:
set ip to point to the first symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a, then s', on top of the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2 * |β| symbols off the stack;
        let s' be the state now on top of the stack;
        push A, then goto[s', A], on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then return
    else error()
end
LR(0) items:
An LR(0) item of a grammar G is a production of G with a dot at some position of the
right side. For example, the production A → XYZ yields the four items:
A → .XYZ
A → X.YZ
A → XY.Z
A → XYZ.
Closure operation:
If I is a set of items for a grammar G, then closure(I) is the set of items constructed
from I by two rules:
1. Initially, every item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production, then add the item B → .γ to
closure(I), if it is not already there. Apply this rule until no more new items can be
added.
Goto operation:
goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the
closure of the set of all items [A → αX.β] such that [A → α.Xβ] is in I.
Construction of the SLR parsing table:
Method:
1. Construct C = {I0, I1, …, In}, the collection of sets of LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for state i are determined as
follows:
(a) If [A → α.aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j".
Here a must be a terminal.
(b) If [A → α.] is in Ii, then set action[i, a] to "reduce A → α" for all a in
FOLLOW(A).
(c) If [S' → S.] is in Ii, then set action[i, $] to "accept".
If any conflicting actions are generated by the above rules, we say the grammar is not
SLR(1).
3. The goto transitions for state i are constructed for all non-terminals A using the
rule: if goto(Ii, A) = Ij, then goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made "error".
5. The initial state of the parser is the one constructed from the set of items
containing [S' → .S].
Example: Consider the grammar
E → E+T | T
T → T*F | F
F → (E) | id
Augmented grammar:
E' → E
E → E+T
E → T
T → T*F
T → F
F → (E)
F → id
The canonical collection of sets of LR(0) items:
I0 : E' → .E
     E → .E+T
     E → .T
     T → .T*F
     T → .F
     F → .(E)
     F → .id
GOTO(I0, E) = I1 :  E' → E.
                    E → E.+T
GOTO(I0, T) = I2 :  E → T.
                    T → T.*F
GOTO(I0, F) = I3 :  T → F.
GOTO(I0, () = I4 :  F → (.E)
                    E → .E+T
                    E → .T
                    T → .T*F
                    T → .F
                    F → .(E)
                    F → .id
GOTO(I0, id) = I5 : F → id.
GOTO(I1, +) = I6 :  E → E+.T
                    T → .T*F
                    T → .F
                    F → .(E)
                    F → .id
GOTO(I2, *) = I7 :  T → T*.F
                    F → .(E)
                    F → .id
GOTO(I4, E) = I8 :  F → (E.)
                    E → E.+T
GOTO(I4, T) = I2,  GOTO(I4, F) = I3,  GOTO(I4, () = I4,  GOTO(I4, id) = I5
GOTO(I6, T) = I9 :  E → E+T.
                    T → T.*F
GOTO(I6, F) = I3,  GOTO(I6, () = I4,  GOTO(I6, id) = I5
GOTO(I7, F) = I10 : T → T*F.
GOTO(I7, () = I4,  GOTO(I7, id) = I5
GOTO(I8, )) = I11 : F → (E).
GOTO(I8, +) = I6
GOTO(I9, *) = I7
FOLLOW(E) = { $, ), + }
FOLLOW(T) = { $, +, ), * }
FOLLOW(F) = { $, +, ), * }
SLR parsing table:
State    ACTION                                    GOTO
         id     +      *      (      )      $     E    T    F
0        s5                   s4                   1    2    3
1               s6                          acc
2               r2     s7            r2     r2
3               r4     r4            r4     r4
4        s5                   s4                   8    2    3
5               r6     r6            r6     r6
6        s5                   s4                        9    3
7        s5                   s4                             10
8               s6                   s11
9               r1     s7            r1     r1
10              r3     r3            r3     r3
11              r5     r5            r5     r5
Blank entries are error entries.
Stack implementation:
Check whether the input id * id + id is valid or not, using the table above:
Stack             Input         Action
0                 id*id+id $    shift s5
0 id 5            *id+id $      reduce by F → id
0 F 3             *id+id $      reduce by T → F
0 T 2             *id+id $      shift s7
0 T 2 * 7         id+id $       shift s5
0 T 2 * 7 id 5    +id $         reduce by F → id
0 T 2 * 7 F 10    +id $         reduce by T → T*F
0 T 2             +id $         reduce by E → T
0 E 1             +id $         shift s6
0 E 1 + 6         id $          shift s5
0 E 1 + 6 id 5    $             reduce by F → id
0 E 1 + 6 F 3     $             reduce by T → F
0 E 1 + 6 T 9     $             reduce by E → E+T
0 E 1             $             accept
UNIT – III
Syntax-Directed Translation – Definition
The translation techniques in this chapter will be applied to type checking and
intermediate-code generation. The techniques are also useful for implementing little
languages for specialized tasks; this chapter includes an example from typesetting.
Consider the production and semantic rule
E → E1 + T     E.code = E1.code || T.code || '+'     (5.1)
This production has two nonterminals, E and T; the subscript in E1 distinguishes the
occurrence of E in the production body from the occurrence of E as the head. Both E
and T have a string-valued attribute code. The semantic rule specifies that the string
E.code is formed by concatenating E1.code, T.code, and the character '+'. While the rule
makes it explicit that the translation of E is built up from the translations of E1, T, and
'+', it may be inefficient to implement the translation directly by manipulating strings.
The same translation can be expressed as a translation scheme with an embedded
semantic action:
E → E1 + T { print '+' }     (5.2)
By convention, semantic actions are enclosed within curly braces. (If curly braces occur
as grammar symbols, we enclose them within single quotes, as in ' { ' and '}'.) The
position of a semantic action in a production body determines the order in which the
action is executed. In production (5.2), the action occurs at the end, after all the
grammar symbols; in general, semantic actions may occur at any position in a
production body.
Between the two notations, syntax-directed definitions can be more readable, and
hence more useful for specifications. However, translation schemes can be more
efficient, and hence more useful for implementations.
Example
E → E+T    { E.val = E.val + T.val }
E → T      { E.val = T.val }
T → T*F    { T.val = T.val * F.val }
T → F      { T.val = F.val }
F → INTLIT { F.val = INTLIT.lexval }
For understanding translation rules further, we take the first SDT augmented to [ E ->
E+T ] production rule. The translation rule in consideration has val as an attribute for
both the non-terminals – E & T. Right-hand side of the translation rule corresponds to
attribute values of right-side nodes of the production rule and vice-versa. Generalizing,
SDT are augmented rules to a CFG that associate 1) set of attributes to every node of
the grammar and 2) set of translation rules to every production rule using attributes,
constants, and lexical values.
Let’s take a string to see how semantic analysis happens – S = 2+3*4. Parse tree
corresponding to S would be
To evaluate translation rules, we can employ one depth-first search traversal on the
parse tree. This is possible only because SDT rules don’t impose any specific order on
evaluation until children’s attributes are computed before parents for a grammar having
all synthesized attributes. Otherwise, we would have to figure out the best-suited plan to
traverse through the parse tree and evaluate all the attributes in one or more traversals.
For better understanding, we will move bottom-up in the left to right fashion for
computing the translation rules of our example.
The above diagram shows how semantic analysis could happen. The flow of
information happens bottom-up and all the children’s attributes are computed before
parents, as discussed above. Right-hand side nodes are sometimes annotated with
subscript 1 to distinguish between children and parents.
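The bottom-up, left-to-right evaluation described above can be sketched in C for the string 2+3*4. This is an illustration only: the left-recursive productions are executed as loops (which computes the same val attributes), and the function names, which mirror the nonterminals, are assumptions:

#include <ctype.h>
#include <stdio.h>

/* Evaluate the synthesized attribute val while parsing, using the SDT
 * E -> E+T {E.val = E.val + T.val}, T -> T*F {T.val = T.val * F.val},
 * F -> INTLIT {F.val = INTLIT.lexval}. */
static const char *ip = "2+3*4";

static int F(void) {                 /* F -> INTLIT { F.val = INTLIT.lexval } */
    int v = 0;
    while (isdigit((unsigned char)*ip)) v = v * 10 + (*ip++ - '0');
    return v;
}
static int T(void) {                 /* T -> T*F { T.val = T.val * F.val } */
    int v = F();
    while (*ip == '*') { ip++; v = v * F(); }
    return v;
}
static int E(void) {                 /* E -> E+T { E.val = E.val + T.val } */
    int v = T();
    while (*ip == '+') { ip++; v = v + T(); }
    return v;
}

int main(void) {
    printf("val = %d\n", E());       /* prints val = 14 */
    return 0;
}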
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute values of
children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val
corresponding to node E. If all the semantic attributes in an augmented grammar are
synthesized, one depth-first search traversal in any order is sufficient for the semantic
analysis phase.
Inherited Attributes are such attributes that depend on parent and/or sibling’s
attributes.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E & Ep are same
production symbols annotated to differentiate between parent and child, has an inherited
attribute val corresponding to node T.
Difference between Synthesized and Inherited Attributes
S.No  Key         Synthesized Attribute                        Inherited Attribute
1     Definition  An attribute is synthesized if the value     An attribute is inherited if the value at a
                  at a parse-tree node is determined by the    parse-tree node is determined by the
                  attribute values of its children: for a      attribute values of the parent and/or
                  production S → ABC, if S takes its values    siblings: for S → ABC, if A, B or C takes
                  from A, B and C, it is a synthesized         its value from S or from a sibling, it is an
                  attribute.                                   inherited attribute.
2     Design      The production must have a non-terminal     The production must have a non-terminal
                  as its head.                                 as a symbol in its body.
3     Evaluation  Can be evaluated during a single             Can be evaluated during a single top-down
                  bottom-up traversal of the parse tree.       and sideways traversal of the parse tree.
4     Terminals   Both terminals and non-terminals can        Only non-terminals can have inherited
                  have synthesized attributes.                 attributes.
5     Usage       Used by both S-attributed and                Used only by L-attributed SDTs.
                  L-attributed SDTs.
TYPE CHECKING
A compiler must check that the source program follows both syntactic and semantic
conventions of the source language. This checking, called static checking, detects and
reports programming errors.
A type checker verifies that the type of a construct matches that expected by its
context. For example: arithmetic operator mod in Pascal requires integer operands, so a
type checker verifies that the operands of mod have type integer. Type information
gathered by a type checker may be needed when code is generated.
Type Systems
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to language constructs.
For example: “if both operands of the arithmetic operators +, - and * are of type integer, then the result is of type integer”.
Type Expressions
Constructors include:
Arrays : If T is a type expression then array (I,T) is a type expression denoting the type
of an array with elements of type T and index set I.
Records: The difference between a record and a product is that the fields of a record have names. The record type constructor is applied to a tuple formed from field names and field types.
For example:
type row = record
    address: integer;
    lexeme: array[1..15] of char
end;
var table: array[1..101] of row;
declares the type name row representing the type expression record((address × integer) × (lexeme × array(1..15, char))) and the variable table to be an array of records of this type.
Type systems
A type system is a collection of rules for assigning type expressions to the various parts
of a program. A type checker implements a type system. It is specified in a syntax-
directed manner. Different type systems may be used by different compilers or
processors of the same language.
Static and Dynamic Checking of Types
Checking done by a compiler is said to be static, while checking done when the target
program runs is termed dynamic. Any check can be done dynamically, if the target code
carries the type of an element along with the value of that element.
Sound type system
A sound type system eliminates the need for dynamic checking of type errors because it allows us to determine statically that these errors cannot occur when the target program runs. That is, if a sound type system assigns a type other than type_error to a program part, then type errors cannot occur when the target code for the program part is run.
Strongly typed language
A language is strongly typed if its compiler can guarantee that the programs it accepts
will execute without type errors.
Error Recovery
Since type checking has the potential for catching errors in a program, it is desirable for the type checker to recover from errors, so it can check the rest of the input. Error handling has to be designed into the type system right from the start; the type checking rules must be prepared to cope with errors.
A type checker for a simple language checks the type of each identifier. The type
checker is a translation scheme that synthesizes the type of each expression from the
types of its subexpressions. The type checker can handle arrays, pointers, statements
and functions.
A Simple Language
P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
Translation scheme:
P → D ; E
D → D ; D
D → id : T { addtype(id.entry, T.type) }
T → char { T.type := char }
T → integer { T.type := integer }
T → array [ num ] of T1 { T.type := array(1..num.val, T1.type) }
T → ↑ T1 { T.type := pointer(T1.type) }
In the following rules, the attribute type for E gives the type expression assigned to the expression generated by E.
1. E → literal { E.type := char }
   E → num { E.type := integer }
   Constants represented by the tokens literal and num have type char and integer, respectively.
2. E → id { E.type := lookup(id.entry) }
   lookup(e) is used to fetch the type saved in the symbol table entry pointed to by e.
3. E → E1 mod E2 { E.type := if E1.type = integer and E2.type = integer then integer else type_error }
The expression formed by applying the mod operator to two subexpressions of type integer has type integer; otherwise, its type is type_error.
4. E → E1 [ E2 ] { E.type := if E2.type = integer and E1.type = array(s, t) then t else type_error }
In an array reference E1 [ E2 ], the index expression E2 must have type integer. The result is the element type t obtained from the type array(s, t) of E1.
Statements do not have values; hence the basic type void can be assigned to them. If an error is detected within a statement, then type_error is assigned.
1. Assignment statement:
S → id := E { S.type := if id.type = E.type then void else type_error }
2. Conditional statement:
S → if E then S1 { S.type := if E.type = boolean then S1.type else type_error }
3. While statement:
S → while E do S1 { S.type := if E.type = boolean then S1.type else type_error }
4. Sequence of statements:
S → S1 ; S2 { S.type := if S1.type = void and S2.type = void then void else type_error }
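As a rough illustration of how such rules become code, here is a minimal type-checker sketch in C covering the mod rule and statement sequencing. The Type enum and the check_* helper names are inventions for this sketch:

#include <stdio.h>

typedef enum { T_INT, T_CHAR, T_VOID, T_ERROR } Type;

/* E -> E1 mod E2 { E.type := if E1.type = integer and
                    E2.type = integer then integer else type_error } */
Type check_mod(Type e1, Type e2) {
    return (e1 == T_INT && e2 == T_INT) ? T_INT : T_ERROR;
}

/* S -> S1 ; S2 { S.type := if S1.type = void and
                  S2.type = void then void else type_error } */
Type check_seq(Type s1, Type s2) {
    return (s1 == T_VOID && s2 == T_VOID) ? T_VOID : T_ERROR;
}

int main(void) {
    printf("%d\n", check_mod(T_INT, T_INT));   /* 0 = T_INT   */
    printf("%d\n", check_mod(T_INT, T_CHAR));  /* 3 = T_ERROR */
    return 0;
}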
UNIT – IV
Run-Time Environment
A program as source code is merely a collection of text (code, statements, etc.); to make it alive, actions must be performed on it on the target machine. A program needs memory resources to execute its instructions. A program contains names for procedures, identifiers, etc., that require mapping to actual memory locations at runtime.
The runtime support system is a package, mostly generated with the executable program itself, that facilitates communication between the process and the runtime environment. It takes care of memory allocation and de-allocation while the program is being executed.
Activation Trees
The execution of a procedure is called its activation. An activation record contains all
the necessary information required to call a procedure. An activation record may
contain the following units (depending upon the source language used).
Whenever a procedure is executed, its activation record is stored on the stack, also
known as control stack. When a procedure calls another procedure, the execution of the
caller is suspended until the called procedure finishes execution. At this time, the
activation record of the called procedure is stored on the stack.
We assume that the program control flows in a sequential manner and when a procedure
is called, its control is transferred to the called procedure. When a called procedure is
executed, it returns the control back to the caller. This type of control flow makes it
easier to represent a series of activations in the form of a tree, known as the activation
tree.
...
printf("Enter Your Name: ");
scanf("%s", username);
show_data(username);
printf("Press any key to continue...");
...
int show_data(char *user)
{
    printf("Your name is %s", user);   /* prints the parameter passed in */
    return 0;
}
...
Now we understand that procedures are executed in a depth-first manner; thus stack allocation is the most suitable form of storage for procedure activations.
Storage Allocation
The runtime environment manages runtime memory requirements for the following entities:
Code: Known as the text part of a program, it does not change at runtime. Its memory requirements are known at compile time.
Procedures: Their text part is static, but they are called in an unpredictable order. That is why stack storage is used to manage procedure calls and activations.
Variables: Variables are known only at runtime, unless they are global or constant. The heap memory allocation scheme is used for managing allocation and de-allocation of memory for variables at runtime.
Static Allocation
In this allocation scheme, the data is bound to a fixed location in memory at compile time and does not change when the program executes. As the memory requirements and storage locations are known in advance, a runtime support package for memory allocation and de-allocation is not required.
Stack Allocation
Procedure calls and their activations are managed by means of stack memory allocation. It works in a last-in-first-out (LIFO) manner, and this allocation strategy is very useful for recursive procedure calls.
Heap Allocation
Variables local to a procedure are allocated and de-allocated only at runtime. Heap allocation is used to dynamically allocate memory to such variables and claim it back when they are no longer required.
Apart from the statically allocated memory area, both stack and heap memory can grow and shrink dynamically and unexpectedly. Therefore, they cannot be provided with a fixed amount of memory in the system.
In a typical layout, the text part of the code is allocated a fixed amount of memory, while stack and heap memory are arranged at the opposite extremes of the total memory allocated to the program, growing and shrinking toward each other.
Parameter Passing
r-value
The value of an expression is called its r-value. The value contained in a single variable also becomes an r-value if it appears on the right-hand side of the assignment operator. r-values can always be assigned to some other variable.
l-value
The location of memory (address) where an expression is stored is known as the l-value of that expression. It always appears on the left-hand side of an assignment operator.
For example:
day = 1;
week = day * 7;
month = 1;
year = month * 12;
From this example, we understand that constant values like 1, 7, 12 and variables like day, week, month, and year all have r-values. Only variables have l-values, as they also represent the memory location assigned to them.
For example:
7 = x + y;
is an l-value error, as the constant 7 does not represent any memory location.
Formal Parameters
Variables that take the information passed by the caller procedure are called formal
parameters. These variables are declared in the definition of the called function.
Actual Parameters
Variables whose values or addresses are being passed to the called procedure are called
actual parameters. These variables are specified in the function call as arguments.
Example:
void fun_two(int formal_parameter)
{
    printf("%d", formal_parameter);
}

void fun_one(void)
{
    int actual_parameter = 10;
    fun_two(actual_parameter);   /* actual parameter passed in the call */
}
Formal parameters hold the information of the actual parameter, depending upon the
parameter passing technique used. It may be a value or an address.
Pass by Value
In pass by value mechanism, the calling procedure passes the r-value of actual
parameters and the compiler puts that into the called procedure’s activation record.
Formal parameters then hold the values passed by the calling procedure. If the values
held by the formal parameters are changed, it should have no impact on the actual
parameters.
Pass by Reference
In pass by reference mechanism, the l-value of the actual parameter is copied to the
activation record of the called procedure. This way, the called procedure now has the
address (memory location) of the actual parameter and the formal parameter refers to
the same memory location. Therefore, if the value pointed by the formal parameter is
changed, the impact should be seen on the actual parameter as they should also point to
the same value.
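C supports pass by value directly, and pass by reference can be simulated with pointers. The following self-contained sketch contrasts the two; the function names by_value and by_reference are invented for this illustration:

#include <stdio.h>

void by_value(int x)      { x = 99; }   /* changes only the local copy   */
void by_reference(int *x) { *x = 99; }  /* changes the caller's variable */

int main(void) {
    int a = 10, b = 10;
    by_value(a);       /* a's r-value is copied; a is unaffected */
    by_reference(&b);  /* b's l-value (address) is passed        */
    printf("a = %d, b = %d\n", a, b);   /* prints a = 10, b = 99 */
    return 0;
}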
Pass by Copy-restore
This parameter-passing mechanism works similarly to pass-by-reference, except that the changes to actual parameters are made only when the called procedure ends. Upon function call, the values of the actual parameters are copied into the activation record of the called procedure. Formal parameters, if manipulated, have no immediate effect on the actual parameters, but when the called procedure ends, the values of the formal parameters are copied back to the l-values (locations) of the actual parameters.
Example:
int y;
calling_procedure()
{
    y = 10;
    copy_restore(y);   /* l-value of y is passed */
    print y;           /* prints 99 */
}
copy_restore(int x)
{
    x = 99;   /* y still has value 10 (unaffected) */
    y = 0;    /* y is now 0 */
}
When this procedure ends, the value of the formal parameter x is copied back to the l-value of the actual parameter y. Even if the value of y is changed before the procedure ends, the value of x is copied into the location of y at the return, making it behave like call by reference.
Pass by Name
Languages like Algol provide a different kind of parameter-passing mechanism that works like the preprocessor in the C language. Pass-by-name textually substitutes the argument expressions in a procedure call for the corresponding parameters in the body of the procedure, so that the procedure body now works on the actual parameters, much like pass-by-reference.
When a procedure name appears within an executable statement, the procedure is said to be called at that point.
Activation Tree
quicksort(int m, int n)
{
    if (n > m)
    {
        int i = partition(m, n);
        quicksort(m, i - 1);
        quicksort(i + 1, n);
    }
}
(The guard n > m is needed for the recursion to terminate.)
The main function is the root; main calls readarray and quicksort. Quicksort in turn calls partition and quicksort again. The flow of control in the program corresponds to a depth-first traversal of the activation tree, starting at the root.
Control Stack
A control stack or runtime stack is used to keep track of the live procedure activations, i.e., the procedures whose execution has not yet completed.
A procedure name is pushed onto the stack when it is called (activation begins) and popped when it returns (activation ends).
Information needed by a single execution of a procedure is managed using an activation record.
When a procedure is called, an activation record is pushed onto the stack, and as soon as control returns to the caller, the activation record is popped.
The contents of the control stack correspond to paths to the root of the activation tree. When node n is at the top of the control stack, the stack contains the nodes along the path from n to the root.
Considering the above activation tree, when quicksort(4,4) is executing, the contents of the control stack are main(), quicksort(1,10), quicksort(1,4), quicksort(4,4).
Binding Of Names
Even if each name is declared once in a program, the same name may denote different data objects at run time. A “data object” corresponds to a storage location that holds values.
The term environment refers to a function that maps a name to a storage location. The term state refers to a function that maps a storage location to the value held there.
When an environment associates storage location s with a name x, we say that x is bound to s. This association is referred to as a binding of x.
STORAGE ORGANIZATION
The executing target program runs in its own logical address space, in which each program value has a location.
The management and organization of this logical address space is shared between the compiler, operating system, and target machine. The operating system maps the logical addresses into physical addresses, which are usually spread throughout memory.
Typical subdivision of run-time memory:
Code area: used to store the generated executable instructions; memory locations for the code are determined at compile time.
Static data area: the locations of data that can be determined at compile time.
Stack area: used to store data objects allocated at runtime, e.g., activation records.
Heap: used to store other dynamically allocated data objects at runtime (for example, via malloc).
This runtime storage can be subdivided to hold the different components
of an existing system
1. Generated executable code
2. Static data objects
3. Dynamic data objects-heap
4. Automatic data objects-stack
Activation Records
An activation record may contain the following fields:
1. Temporaries: values arising from the evaluation of expressions.
2. Local data: data that is local to an execution of the procedure.
3. Saved machine status: the state of the machine just before the procedure is called, such as the program counter and machine registers.
4. Access link: used to refer to non-local data held in other activation records.
5. Control link: points to the activation record of the caller.
6. The field for actual parameters is used by the calling procedure to supply parameters to the called procedure.
7. The field for the returned value is used by the called procedure to return a value to the calling procedure. Again, in practice this value is often returned in a register for greater efficiency.
General Activation Record (fields from top to bottom):
    returned value
    actual parameters
    optional control link
    optional access link
    saved machine status
    local data
    temporaries
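One way to picture these fields is as a C struct. The field types and sizes below are placeholders chosen for this sketch; a real activation record is raw stack memory laid out by the compiler, not declared in source code:

#include <stdio.h>

/* Illustrative layout of a general activation record. */
struct activation_record {
    long  returned_value;                    /* value handed back to the caller   */
    long  actual_params[4];                  /* parameters supplied by the caller */
    struct activation_record *control_link;  /* caller's record (optional)        */
    struct activation_record *access_link;   /* non-local data (optional)         */
    long  saved_machine_status[8];           /* PC, registers saved at the call   */
    long  local_data[8];                     /* the procedure's local variables   */
    long  temporaries[8];                    /* compiler-generated temporaries    */
};

int main(void) {
    printf("illustrative record size: %zu bytes\n",
           sizeof(struct activation_record));
    return 0;
}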
Heap allocation allocates and deallocates storage as needed at run time from a data area known as the heap.
Stack Allocation
Almost all compilers for languages that use procedures, functions, or methods as units of user-defined actions manage at least part of their runtime memory as a stack, called the run-time stack.
Each time a procedure is called, space for its local variables is pushed onto the stack, and when the procedure terminates, that space is popped off the stack.
Calling Sequences
Procedure calls are implemented by what are known as calling sequences, which consist of code that allocates an activation record on the stack and enters information into its fields.
A return sequence is similar code that restores the state of the machine so the calling procedure can continue its execution after the call.
The code in a calling sequence is often divided between the calling procedure (the caller) and the procedure it calls (the callee).
When designing calling sequences and the layout of activation records, the following principles are helpful:
Values communicated between the caller and the callee are generally placed at the beginning of the callee’s activation record; fixed-length items are placed in the middle; and items whose size may not be known early enough are placed at the end of the activation record.
Heap Allocation
In heap allocation, the record for an activation of a procedure r may be retained when the activation ends. In that case, the record for a new activation q(1, 9) cannot follow the record for s physically. If the retained activation record for r is deallocated, there will be free space in the heap between the activation records for s and q.
Need For ICG
A source program can be translated directly into the target language, but some benefits of using an intermediate form are:
Retargeting is facilitated: a compiler for a different machine can be created by attaching a back end (which generates target code) for the new machine to an existing front end (which generates intermediate code).
A machine-independent code optimizer can be applied to the intermediate representation.
INTERMEDIATE LANGUAGES
The most commonly used intermediate representations are:
Syntax Tree
DAG (Directed Acyclic Graph)
Postfix Notation
3 Address Code
GRAPHICAL REPRESENTATION
Includes both
Syntax Tree
DAG (Directed Acyclic Graph)
Syntax Tree Or Abstract Syntax Tree (AST)
[Figure: parse tree and syntax tree for the expression 3 * 5 + 4]
Parse Tree VS Syntax Tree
Constructing Syntax Tree For Expression
Each node in a syntax tree can be implemented as a record with several fields.
In the node for an operator, one field contains the operator and the remaining fields contain pointers to the nodes for the operands.
When used for translation, the nodes in a syntax tree may contain additional fields to hold the values of attributes attached to the node.
Following functions are used to create syntax tree
1. mknode(op,left,right): creates an operator node with label
op and two fields containing pointers to left and right.
2. mkleaf(id,entry): creates an identifier node with label id
and a field containing entry, a pointer to the symbol table
entry for identifier
3. mkleaf(num,val): creates a number node with label num
and a field containing val, the value of the number.
Such functions return a pointer to a newly created node.
EXAMPLE
a - 4 + c
The syntax tree is constructed bottom-up:
P1 = mkleaf(id, entry a)
P2 = mkleaf(num, 4)
P3 = mknode(-, P1, P2)
P4 = mkleaf(id, entry c)
P5 = mknode(+, P3, P4)
[Figure: the resulting syntax tree, with P5 as the root]
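A possible C rendering of these constructor functions, with a simplified node type invented for this sketch (the real functions would carry symbol-table pointers rather than strings):

#include <stdio.h>
#include <stdlib.h>

typedef struct SynNode {
    const char *label;      /* "id", "num", or an operator */
    const char *lexeme;     /* for id leaves               */
    int value;              /* for num leaves              */
    struct SynNode *left, *right;
} SynNode;

SynNode *mknode(const char *op, SynNode *l, SynNode *r) {
    SynNode *n = malloc(sizeof *n);
    n->label = op; n->lexeme = NULL; n->value = 0;
    n->left = l; n->right = r;
    return n;                       /* pointer to the newly created node */
}

SynNode *mkleaf_id(const char *lexeme) {   /* mkleaf(id, entry) */
    SynNode *n = mknode("id", NULL, NULL);
    n->lexeme = lexeme;
    return n;
}

SynNode *mkleaf_num(int val) {             /* mkleaf(num, val)  */
    SynNode *n = mknode("num", NULL, NULL);
    n->value = val;
    return n;
}

int main(void) {
    /* builds the syntax tree for a - 4 + c, bottom-up */
    SynNode *p1 = mkleaf_id("a");
    SynNode *p2 = mkleaf_num(4);
    SynNode *p3 = mknode("-", p1, p2);
    SynNode *p4 = mkleaf_id("c");
    SynNode *p5 = mknode("+", p3, p4);
    printf("root operator: %s\n", p5->label);   /* prints + */
    return 0;
}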
The token id has an attribute place that points to the symbol-table entry for
the identifier.
A symbol-table entry can be found from an attribute id.name, representing the
lexeme associated with that occurrence of id.
If the lexical analyser holds all lexemes in a single array of characters, then
attribute name might be the index of the first character of the lexeme.
Two representations of the syntax tree are as follows.
In (a), each node is represented as a record with a field for its operator and additional fields for pointers to its children.
In (b), nodes are allocated from an array of records, and the index or position of a node serves as the pointer to the node.
All the nodes in the syntax tree can be visited by following pointers, starting from the root at position 10.
Directed Acyclic Graph (DAG)
EXAMPLE
a = b * -c + b * -c
[Figure: DAG for the assignment, in which the common subexpression b * -c appears as a single shared node]
Postfix Notation
Operators can be evaluated in the order in which they appear in the string
EXAMPLE
Source String : a := b * -c + b * -c
Postfix String: a b c uminus * b c uminus * + assign
Postfix Rules
1. If E is a variable or constant, then the postfix notation for E is E itself.
2. If E is an expression of the form E1 op E2 then postfix notation for E is
E1’ E2’ op, here E1’ and E2’ are the postfix notations for E1and E2,
respectively
3. If E is an expression of the form (E1), then the postfix notation for E is the same as the postfix notation for E1.
4. For a unary operation -E, the postfix notation is E’ uminus, where E’ is the postfix notation for E.
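To see rule 2 in action, the short C sketch below evaluates a postfix string over single-digit operands using an explicit stack. The single-character token format is a simplification assumed for this sketch:

#include <stdio.h>
#include <ctype.h>

/* Evaluates a postfix string with single-digit operands,
   e.g. "34+5*" for (3 + 4) * 5. */
int eval_postfix(const char *s) {
    int stack[64], top = 0;
    for (; *s; s++) {
        if (isdigit((unsigned char)*s)) {
            stack[top++] = *s - '0';   /* rule 1: an operand denotes itself */
        } else {
            int b = stack[--top];      /* rule 2: E1' E2' op -- pop the two */
            int a = stack[--top];      /* operands and apply the operator   */
            stack[top++] = (*s == '+') ? a + b
                         : (*s == '-') ? a - b
                         : a * b;
        }
    }
    return stack[0];
}

int main(void) {
    printf("%d\n", eval_postfix("34+5*"));   /* prints 35 */
    return 0;
}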
THREE ADDRESS CODE
Three-address code is a sequence of statements of the general form
a = b op c
where a, b, c are operands that can be names, constants, or compiler-generated temporaries, and op represents an operator, such as a fixed- or floating-point arithmetic operator or a logical operator on Boolean-valued data. Thus a source language expression like x + y * z might be translated into the sequence
t1 := y * z
t2 := x + t1
Types of Three-Address Statements
1. Assignment statements
x := y op z, where op is a binary arithmetic or logical operation.
2. Assignment instructions
x := op y, where op is a unary operation. Essential unary operations include unary minus, logical negation, shift operators, and conversion operators that, for example, convert a fixed-point number to a floating-point number.
3. Copy statements
x := y, where the value of y is assigned to x.
4. Unconditional jump
goto L. The three-address statement with label L is the next to be executed.
5. Conditional jump
if x relop y goto L. This instruction applies a relational operator (<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation relop to y. If not, the three-address statement following if x relop y goto L is executed next, as in the usual sequence.
6. Procedure calls
param x1
param x2
...
param xn
call p, n
generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n, indicating the number of actual parameters in “call p, n”, is not redundant because calls can be nested.
7. Indexed Assignments
Indexed assignments of the form x = y[i] or x[i] = y
When three-address code is generated, temporary names are made up for the interior nodes of a syntax tree. For example, id := E consists of code to evaluate E into some temporary t, followed by the assignment id.place := t.
Given the input a := b * -c + b * -c, the three-address code given above is produced. The synthesized attribute S.code represents the three-address code for the assignment S. The nonterminal E has two attributes:
1. E.place, the name that will hold the value of E, and
2. E.code, the sequence of three-address statements evaluating E.
Semantic rules generating code for a while statement (figure omitted): S.begin and S.after are labels marking the first statement of the code for S and the statement following the code for S, respectively.
The function newlabel returns a new label every time it is called. We assume that a nonzero expression represents true; that is, when the value of E becomes zero, control leaves the while statement.
Implementation Of Three-Address Statements
QUADRUPLES
A quadruple is a record structure with four fields: op, arg1, arg2, and result. The op field contains an internal code for the operator. The three-address statement x := y op z is represented by placing y in arg1, z in arg2, and x in result.
The contents of arg1, arg2, and result are normally pointers to the symbol table entries for the names represented by these fields. If so, temporary names must be entered into the symbol table as they are created.
EXAMPLE 1
Translate the following expression into quadruples, triples, and indirect triples:
a + b * c / e ^ f + b * a
First, construct the three-address code for the expression:
t1 = e ^ f
t2 = b * c
t3 = t2 / t1
t4 = b * a
t5 = a + t3
t6 = t5 + t4
123
Locati O arg arg Resu
on P 1 2 lt
(0) ^ e f t1
(1) * b c t2
(2) / t2 t1 t3
(3) * b a t4
(4) + a t3 t5
(5) + t3 t4 t6
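A quadruple record of this shape is straightforward to declare in C. The fixed-size name fields below are a simplification assumed for this sketch; a real compiler would store symbol-table pointers instead:

#include <stdio.h>

/* One three-address statement in quadruple form. */
struct quad {
    char op[8];       /* internal operator code   */
    char arg1[8];     /* first operand (or empty) */
    char arg2[8];     /* second operand (or empty)*/
    char result[8];   /* destination name or label*/
};

int main(void) {
    /* the table above, written out as data */
    struct quad code[] = {
        {"^", "e",  "f",  "t1"}, {"*", "b",  "c",  "t2"},
        {"/", "t2", "t1", "t3"}, {"*", "b",  "a",  "t4"},
        {"+", "a",  "t3", "t5"}, {"+", "t5", "t4", "t6"},
    };
    for (int i = 0; i < 6; i++)
        printf("(%d) %-2s %-3s %-3s %s\n", i,
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}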
Exceptions
The statement x := op y, where op is a unary operator, is represented by placing op in the operator field, y in the arg1 field, and x in the result field; arg2 is not used.
A statement like param t1 is represented by placing param in the operator field and t1 in the arg1 field; neither arg2 nor the result field is used.
Unconditional and conditional jump statements are represented by placing the target label in the result field.
TRIPLES
In the triples representation, the use of temporary variables is avoided; instead, references to instructions are made.
Three-address statements are thus represented by records with only three fields: op, arg1, and arg2.
Since three fields are used, this intermediate code format is known as triples.
EXAMPLE
a + b * c / e ^ f + b * a
t1 = e ^ f
t2 = b * c
t3 = t2 / t1
t4 = b * a
t5 = a + t3
t6 = t5 + t4
[Table: the triple representation of these statements, in which arg1 and arg2 refer to earlier triples by number instead of naming temporaries]
EXAMPLE 2
A ternary operation like x[i] := y requires two entries in the triple structure, while x := y[i] is naturally represented as two operations:
x[i] := y        x := y[i]
INDIRECT TRIPLES
In indirect triples, a listing of pointers to triples is used, rather than a listing of the triples themselves; statements can then be reordered by reordering this listing, without renumbering the triples.
Comparison
When we ultimately produce the target code, each temporary and programmer-defined name will be assigned a runtime memory location. This location will be entered into the symbol table entry for that datum.
Using the quadruple notation, a three-address statement containing a temporary can immediately access the location for that temporary via the symbol table. This is not possible with the triples notation.
With quadruple notation, statements can often be moved around, which makes optimization easier. This is because, with quadruples, the symbol table interposes a high degree of indirection between the computation of a value and its use: if we move a statement computing x, the statements using x require no change.
But with triples, moving a statement that defines a temporary value requires us to change all references to that statement in the arg1 and arg2 arrays. This makes triples difficult to use in an optimizing compiler.
With indirect triples there is no such problem: a statement can be moved by reordering the statement list.
Space Utilization
Quadruples and indirect triples require roughly the same amount of storage in the normal case. But if the same temporary value is used more than once, indirect triples can save some space, because two or more entries in the statement array can point to the same line of the op-arg1-arg2 structure.
Triples require less storage space than the other two representations.
PROBLEM 1
Translate the following expression into quadruples, triples, and indirect triples:
a = b * -c + b * -c
TAC:
t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
[Tables: the quadruple, triple, and indirect triple representations of this code]
ASSIGNMENT STATEMENTS
S → id := E { p := lookup(id.name);
              if p ≠ nil then
                  emit(p ‘:=’ E.place)
              else error }
This rule is part of a syntax-directed definition to produce three-address code for assignments.
BOOLEAN EXPRESSIONS
Boolean expressions have two primary purposes. They are used to compute logical values. But more often they are used as conditional expressions in statements that alter the flow of control, such as if-then-else or while-do statements.
Boolean expressions are composed of the Boolean operators (and, or, and not) applied to elements that are Boolean variables or relational expressions.
Relational expressions are of the form E1 relop E2, where E1 and E2 are arithmetic expressions and relop can be <, <=, =, !=, > or >=.
Here we consider Boolean expressions generated by the following grammar:
E → E or E | E and E | not E | ( E ) | id relop id | true | false
For example, the expression a or b and not c translates into the three-address sequence
t1 := not c
t2 := b and t1
t3 := a or t2
where the function emit() outputs a three-address statement to the output file, and nextstat() gives the index of the next three-address statement in the output sequence; emit increments nextstat after producing each three-address statement.
A relational expression such as a < b is equivalent to the conditional statement t := (a < b), which can be translated into the sequence
100: if a < b goto 103
101: t := 0
102: goto 104
103: t := 1
104: ...
S → if E then S1
| if E then S1 else S2
| while E do S1
In each of these productions, E is the Boolean expression to be translated.
In the translation, we assume that a three-address statement can be
symbolically labeled, and that the function newlabel returns a new
symbolic label each time it is called.
With each E we associate two labels E.true and E.false. E.true is the label to
which control flows if E is true, and E.false is the label to which control
flows if E is false.
The inherited attribute S.next is a label that is attached to the first three-
address instruction to be executed after the code for S and another inherited
attribute S.begin is the first instruction of S
Syntax-directed definition for flow-of-control statements:
S → if E then S1
    { E.true := newlabel;
      E.false := S.next;
      S1.next := S.next }
S → if E then S1 else S2
    { E.true := newlabel;
      E.false := newlabel;
      S1.next := S.next;
      S2.next := S.next }
UNIT – V
CODE GENERATION
Issues in the design of a code generator
The code generator converts the intermediate representation of source code into a form that can be readily executed by the machine. A code generator is expected to generate correct code. The code generator should be designed so that it can be easily implemented, tested, and maintained.
Target Program
The target program is the output of the code generator. The output may be absolute machine language, relocatable machine language, or assembly language.
1. Absolute machine language as output has the advantage that it can be placed in a fixed memory location and immediately executed.
2. Relocatable machine language as output allows subprograms and subroutines to be compiled separately. Relocatable object modules can be linked together and loaded by a linking loader.
3. Assembly language as output makes code generation easier. We can generate symbolic instructions and use the macro facilities of the assembler in generating code.
Memory Management
Mapping the names in the source program to the addresses of data objects is done jointly by the front end and the code generator. A name in a three-address statement refers to the symbol table entry for that name. From the symbol table entry, a relative address can then be determined for the name.
Instruction selection
Selecting the best instructions will improve the efficiency of the program. The instruction set should be complete and uniform. Instruction speeds and machine idioms also play a major role when efficiency is considered. If we do not care about the efficiency of the target program, however, instruction selection is straightforward. For example, the three-address statements
P := Q + R
S := P + T
would be translated, statement by statement, into the code sequence
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
Here the fourth statement is redundant: it reloads the value of P that was just stored by the previous statement, so it can be deleted. Such statement-by-statement translation leads to an inefficient code sequence. A given intermediate representation can be translated into many code sequences, with significant cost differences between the different implementations. Prior knowledge of instruction cost is needed in order to design good sequences, but accurate cost information is difficult to predict.
Register allocation issues
Use of registers makes computations faster in comparison to memory, so efficient utilization of registers is important. The use of registers is subdivided into two subproblems:
1. Register allocation: selecting the set of variables that will reside in registers at each point in the program.
2. Register assignment: picking the specific register in which each such variable will reside.
Certain machines require register pairs, consisting of an even register and the next odd-numbered register, for some operands and results. For example, in a multiplication instruction of the form
M a, b
the multiplicand a is the even register and the multiplier b is the odd register of the even/odd register pair.
Evaluation order
The code generator decides the order in which the instructions will be executed. The order of computations affects the efficiency of the target code: among the many possible computational orders, some require fewer registers to hold intermediate results. However, picking the best order in the general case is a difficult, NP-complete problem.
Approaches to code generation issues: The code generator must always generate correct code. This is essential because of the number of special cases that a code generator might face. Some of the design goals of a code generator are:
Correctness
Easy implementation
Testability
Maintainability
Target Machine
Computations are generally assumed to be performed in high-speed storage locations known as registers. Performing operations on registers is efficient because registers are faster than memory. Compilers use this feature effectively; however, registers are not available in large numbers, and they are costly. Therefore, we should try to use the minimum number of registers to keep the overall cost low.
Optimized code :
Example 1 :
L1: a = b + c * d
optimization :
t0 = c * d
a = b + t0
Example 2 :
L2: e = f - g / d
optimization :
t0 = g / d
e = f - t0
Register Allocation:
Register allocation is the process of assigning program variables to registers, reducing the number of swaps in and out of registers. Moving variables between memory and registers is time consuming; this is the main reason registers are used, as they are the fastest-accessible storage locations available to the processor.
Example 1:
R1<--- a
R2<--- b
R3<--- c
R4<--- d
MOV R3, c
MOV R4, d
MUL R3, R4
MOV R2, b
ADD R2, R3
MOV R1, R2
MOV a, R1
Example 2:
R1<--- e
R2<--- f
R3<--- g
R4<--- h
MOV R3, g
MOV R4, h
DIV R3, R4
MOV R2, f
SUB R2, R3
MOV R1, R2
MOV e, R1
During the execution of a program, the same name in the source can denote different data objects in the computer. The allocation and deallocation of data objects is managed by the run-time support package. Terminology:
• Storage space → value: the current value of a storage space is called its state.
• If a procedure is recursive, several of its activations may exist at the same time.
• Lifetime: the time between the first and last steps in an activation of a procedure.
General run-time storage layout (from lower to higher memory addresses):
code        – the generated instructions
static data – storage space that won’t change: global data, constants, ...
stack       – for activation records: local data, parameters, control info, ...
heap        – for dynamic memory allocated by the program
Activation record (fields from top to bottom):
returned value
actual parameters
optional control link
optional access link
saved machine status
local data
temporaries
Activation record fields:
• Parameters: storage for the actual parameters supplied by the caller.
• Links:
  - Control (or dynamic) link: a pointer to the activation record of the caller.
  - Access (or static) link: a pointer used to locate non-local data in other activation records.
Static storage allocation
There are two different approaches to run-time storage allocation:
• Static allocation.
• Dynamic allocation.
In static allocation, every time a procedure is called, its names refer to the same preassigned locations.
• Disadvantages:
  No recursion.
  Lots of space is wasted when the procedure is inactive.
  No dynamic allocation.
• Advantage: simplicity; the addresses of all names are known at compile time, so no run-time allocation code is needed.
On procedure calls, the caller:
First evaluates the arguments, then copies them into the parameter space in the activation record (A.R.) of the called procedure. (Convention: what is passed to a procedure is called arguments on the calling side and parameters on the called side.) It may save some registers in its own A.R. It then performs a jump-and-link: jump to the first instruction of the called procedure and put the address of the next instruction (the return address) into register RA (the return address register).
The called procedure then:
Copies the return address from RA into its A.R.’s return address field. It may save some registers and may initialize local data.
On procedure returns, the called procedure:
Restores the values of saved registers and jumps to the address in the return address field.
• The calling procedure: may restore some registers; if the called procedure was actually a function, it puts the return value in an appropriate place.
In this section, we are going to learn how to work with basic blocks and flow graphs in compiler design.
Basic Block
A basic block is a set of statements with no in or out branches except at its entry and exit. The flow of control enters at the beginning and leaves at the end without halting in between; the instructions of a basic block execute in sequence.
Here, the first task is to partition a sequence of three-address code into basic blocks. A new basic block always begins at the first instruction and keeps adding instructions until a jump or a label is met. If no jumps or labels are found, control flows sequentially from one instruction to the next.
The algorithm for the construction of basic blocks is given below:
Algorithm: Partitioning three-address code into basic blocks.
Input: The input for the basic blocks will be a sequence of three-address code.
Output: The output is a list of basic blocks with each three address statements in
exactly one block.
METHOD: First, we identify the leaders in the intermediate code. The rules for finding leaders are:
1. The first instruction of the intermediate code is a leader.
2. Any instruction that is the target of a conditional or unconditional jump is a leader.
3. Any instruction that immediately follows a conditional or unconditional jump is a leader.
Each leader’s basic block consists of all the instructions from the leader itself up to, but not including, the next leader.
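The leader rules translate almost directly into code. Below is a sketch in C over a toy instruction encoding invented for this example, in which each instruction records whether it is a jump and, if so, the index of its target:

#include <stdio.h>

/* Toy encoding: each three-address instruction knows whether it
   jumps and, if so, the index of its target instruction. */
struct instr { int is_jump; int target; };

void mark_leaders(const struct instr *code, int n, int leader[]) {
    for (int i = 0; i < n; i++) leader[i] = 0;
    leader[0] = 1;                            /* rule 1: first instruction    */
    for (int i = 0; i < n; i++) {
        if (code[i].is_jump) {
            leader[code[i].target] = 1;       /* rule 2: target of a jump     */
            if (i + 1 < n) leader[i + 1] = 1; /* rule 3: instr. after a jump  */
        }
    }
}

int main(void) {
    /* instruction 2 jumps back to instruction 0 */
    struct instr code[] = {{0,0}, {0,0}, {1,0}, {0,0}};
    int leader[4];
    mark_leaders(code, 4, leader);
    for (int i = 0; i < 4; i++)
        if (leader[i]) printf("leader at %d\n", i);   /* prints 0 and 3 */
    return 0;
}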
Example: applying these rules to a sample sequence of three-address statements (not reproduced here) yields six basic blocks, for instance:
B1 for statement 1
B2 for statement 2
and so on.
Flow Graph
A flow graph is a directed graph. After partitioning the intermediate code into basic blocks, the flow of control among basic blocks is represented by a flow graph. There is an edge from block X to block Y when Y’s first instruction can immediately follow X’s last instruction, either because X ends with a jump to Y, or because Y textually follows X and X does not end in an unconditional jump.
Block B1 is the entry point of the flow graph because B1 contains the starting instruction.
Block B3 has two successors. One is block B3 itself, because the first instruction of B3 is the target of the conditional jump in the last instruction of B3. The other successor is block B4, due to the conditional jump at the end of block B3.
Code optimization must be correct: it must not, in any way, change the meaning of the program. The optimization process should also not delay the overall compiling process.
Types of Code Optimization – The optimization process can be broadly classified into two types: machine-independent and machine-dependent optimization. Some simple examples of optimizable statements:
(i) A = 2 * (22.0 / 7.0) * r — the constant subexpression 22.0 / 7.0 can be folded at compile time.
(ii) x = 12.4
     y = x / 2.3 — the known value of x can be propagated into the second statement.
Variable Propagation:
//Before Optimization
c = a * b
x = a
...
d = x * b + 4
//After Optimization
c = a * b
x = a
...
d = a * b + 4
Hence, after variable propagation, the occurrence of a * b in c = a * b and the occurrence in d = a * b + 4 will be identified as a common sub-expression.
Common Sub-expression Elimination:
//Before elimination
c = a * b
x = a
...
d = a * b + 4
//After elimination
c = a * b
...
d = c + 4
Code Motion:
//Before optimization
a = 200;
while (a > 0)
{
    b = x + y;
    if (a % b == 0)
        printf("%d", a);
}
//After optimization (the loop-invariant b = x + y is moved out of the loop)
a = 200;
b = x + y;
while (a > 0)
{
    if (a % b == 0)
        printf("%d", a);
}
• Strength reduction means replacing a high-strength (costly) operator by a low-strength (cheaper) one.
//Before reduction
i = 1;
while (i < 10)
{
    y = i * 4;
    i = i + 1;
}
//After reduction
i = 1;
t = 4;
while (t < 40)
{
    y = t;
    t = t + 4;
}
Now that we have learned the need for optimization and its two types, let us see where these optimizations can be applied: to the source program, to the intermediate code, or to the target code.
Phases of Optimization
Global Optimization: transformations applied to large program segments, spanning basic blocks or whole procedures.
Local Optimization: transformations applied within a single basic block.
Function-Preserving Transformations
Common sub-expression elimination
Copy propagation
Dead-code elimination
Constant folding
For example:
t1 := 4 * i
t2 := a[t1]
t3 := 4 * j
t4 := 4 * i
t5 := n
t6 := b[t4] + t5
The above code can be optimized using common sub-expression elimination (t4 := 4 * i recomputes t1) as:
t1 := 4 * i
t2 := a[t1]
t3 := 4 * j
t5 := n
t6 := b[t1] + t5
Copy Propagation:
• For example:
x = Pi;
A = x * r * r;
After copy propagation, the second statement becomes A = Pi * r * r, and the copy x = Pi may then become dead code.
Dead-Code Elimination:
Example:
i = 0;
if (i == 1)
    a = b + 5;
Here, the ‘if’ statement is dead code because its condition will never be satisfied.
Constant folding:
Constant folding evaluates constant expressions at compile time. For example, the expression 2 * (22.0 / 7.0) can be replaced by its computed value at compile time.
Loop Optimizations:
In loops, especially in the inner loops, programs tend to spend the bulk of
their time. The running time of a program may be improved if the number of
instructions in an inner loop is decreased, even if we increase the amount of
code outside that loop.
Code Motion:
An important modification that decreases the amount of code in a loop is code motion, which moves loop-invariant computations before the loop. For example, the evaluation of limit - 2 can be moved outside a loop that does not change limit:
t = limit - 2;
while (i <= t) ...
Induction Variables :
Loops are usually processed inside out. For example consider the loop
around B3. Note that the values of j and t4 remain in lock-step; every time the
value of j decreases by 1, that of t4 decreases by 4 because 4*j is assigned to
t4. Such identifiers are called induction variables.
When there are two or more induction variables in a loop, it may be possible
to get rid of all but one, by the process of induction-variable elimination. For
the inner loop around B3 in Fig.5.3 we cannot get rid of either j or t4
completely; t4 is used in B3 and j in B4.
Reduction In Strength: The replacement of a multiplication by a subtraction will speed up the object code if multiplication takes more time than addition or subtraction, as is the case on many machines.
Optimization of Basic Blocks
There are two types of basic block optimization:
1. Structure-Preserving Transformations
2. Algebraic Transformations
1. Structure-preserving transformations:
(a) Common sub-expression elimination:
a := b + c
b := a - d
c := b + c
d := a - d
In the above block, the second and fourth statements compute the same expression (a - d), so the block can be transformed as follows:
a := b + c
b := a - d
c := b + c
d := b
(b) Dead-code elimination:
o Dead code can arise when a variable is declared and defined but never used, and we forget to remove it; in that case it serves no purpose.
o Suppose the statement x := y + z appears in a block and x is dead, that is, never subsequently used. Then this statement can be safely removed without changing the value of the basic block.
(c) Renaming of temporary variables and interchange of statements:
t1 := b + c
t2 := x + y
These two statements are independent (neither uses the result of the other), so they can be interchanged without affecting the value of the block.
2. Algebraic transformations:
Algebraic transformations change the set of expressions computed by a basic block into an algebraically equivalent set. For example:
a := b + c
e := c + d + b
can be written as
a := b + c
t := c + d
e := t + b