Introduction to Compiler Design
Introduction of Compiler, Major data Structure in compiler, types of Compilers, Front-end and
Backend of compiler, Compiler structure: analysis-synthesis model of compilation, various phases of
a compiler, Lexical analysis: Input buffering, Specification & Recognition of Tokens, Design of a
Lexical Analyzer Generator, LEX.
L1
Introduction to Compiler Design
The compiler is software that converts a program written in a high-level language (Source Language)
to a low-level language (object/target/machine language, i.e., 0s and 1s).
A translator or language processor is a program that translates an input program written in a
programming language into an equivalent program in another language.
The compiler is a type of translator, which takes a program written in a high-level programming
language as input and translates it into an equivalent program in low-level languages such as machine
language or assembly language.
The program written in a high-level language is known as a source program, and the program
converted into a low-level language is known as an object (or target) program. Without compilation,
no program written in a high-level language can be executed. For every programming language, we
have a different compiler; however, the basic tasks performed by every compiler are the same.
The process of translating the source code into machine code involves several stages, including lexical
analysis, syntax analysis, semantic analysis, code generation, and optimization.
A compiler is a more intelligent program than an assembler: it verifies all kinds of limits, ranges, errors, and so on.
A compiler takes more time to run and occupies a large amount of memory space. Its speed is slower than that of other system software, because it reads through the entire program and then translates the whole of it.
When a compiler runs on a machine and produces machine code for the same machine on which it is running, it is called a self-compiler or resident compiler. When a compiler runs on one machine and produces machine code for another machine, it is called a cross compiler.
2. Syntax Tree
A syntax tree is a tree data structure in which each leaf node represents an operand and each interior
node represents an operator. It is a dynamically allocated, pointer-based tree data structure that is
created as parsing proceeds: the parser builds the tree node by node while it recognizes the input.
3. Symbol Table
The symbol table is a data structure that is used to keep the information of identifiers, functions,
variables, constants, and data types. It is created and maintained by the compiler because it keeps the
information about the occurrence of entities. The symbol table is used in almost every phase of the
compiler, as can be seen from the phases of a compiler described below. The scanner, parser, and
semantic phase may enter identifiers into the symbol table and the optimization and code generation
phase will access the symbol table to use the information provided by the symbol table to make
appropriate decisions. Given the frequency of access to the symbol table, the insertion, deletion, and
access operations should be well-optimized and efficient. The hash table is mainly used here.
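Because of this access pattern, a chained hash table is the usual choice. The following is a minimal sketch of such a symbol table in C; the entry fields (name, type, scope) and the hash function are illustrative assumptions, not a prescribed layout:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                    /* prime number of buckets */

/* One symbol-table entry; fields are illustrative. */
struct symbol {
    char name[32];
    char type[16];
    int  scope;
    struct symbol *next;                  /* chaining resolves collisions */
};

static struct symbol *table[TABLE_SIZE];

/* Simple string hash (djb2-style); any good string hash works here. */
static unsigned hash(const char *s) {
    unsigned h = 5381;
    while (*s) h = h * 33 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Insertion runs in O(1) expected time. */
void st_insert(const char *name, const char *type, int scope) {
    struct symbol *e = malloc(sizeof *e);
    strcpy(e->name, name);
    strcpy(e->type, type);
    e->scope = scope;
    e->next = table[hash(name)];
    table[hash(name)] = e;
}

/* Lookup also runs in O(1) expected time. */
struct symbol *st_lookup(const char *name) {
    for (struct symbol *e = table[hash(name)]; e != NULL; e = e->next)
        if (strcmp(e->name, name) == 0)
            return e;
    return NULL;
}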
4. Literal Table
A literal table is a data structure that is used to keep track of the literals used in the program. It
holds the constants and strings used in the program; each constant or string appears only once in the
literal table, and its contents apply to the whole program, which is why deletions from it are not
necessary. The literal table allows the reuse of constants and strings, which plays an important role in
reducing the program size.
5. Parse Tree
A parse tree is the hierarchical representation of symbols, which may be terminals or non-terminals.
In the parse tree the string is derived from the starting symbol, and the starting symbol is the root of
the parse tree. All the leaf nodes are terminal symbols and the inner nodes are non-terminals (the
operators, in an expression grammar). To get the output we can use an inorder traversal.
For example: - Parse tree for a+b*c.
And there is intermediate code which also needs data structures to store the data.
6. Intermediate Code
Once the intermediate code is generated, it can be stored as a linked list of structures, a text file, or
an array of strings; the choice depends on the type of intermediate code that is generated. According
to that, we choose the data structures on which optimization will be carried out.
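For instance, three-address code is often held as a list of quadruples; a minimal sketch of such a node in C (field names and sizes are illustrative) could look like this:

/* One three-address instruction (quadruple): result = arg1 op arg2.
 * Stored as a singly linked list so the optimizer can insert and
 * delete instructions cheaply. */
struct quad {
    char op[8];         /* operator, e.g. "+", "*", "=", "goto"   */
    char arg1[32];      /* first operand (name or constant)        */
    char arg2[32];      /* second operand, may be empty            */
    char result[32];    /* destination temporary or name           */
    struct quad *next;  /* next instruction in the list            */
};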
L2
Types of Compilers
There are mainly three types of compilers.
● Single Pass Compilers
● Two Pass Compilers
● Multi-pass Compilers
Multi-pass Compiler
When several intermediate forms of the code are created and the syntax tree is processed many times,
the compiler is called a multi-pass compiler. It breaks the code into smaller pieces that are processed
in separate passes.
Phases of a Compiler
The process of converting the source code into machine code involves several phases or stages, which
are collectively known as the phases of a compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning.
This phase reads the source code and breaks it into a stream of tokens, which are the basic
units of the programming language. The tokens are then passed on to the next phase for
further processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this phase
is usually an Abstract Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks
whether the code is semantically correct, i.e., whether it conforms to the language’s type
system and other semantic rules. In this stage, the compiler checks the meaning of the source
code to ensure that it makes sense. The compiler performs type checking, which ensures that
variables are used correctly and that operations are performed on compatible data types. The
compiler also checks for other semantic errors, such as undeclared variables and incorrect
function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code
generation. This phase generates an intermediate representation of the source code that can be
easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies various
optimization techniques to the intermediate code to improve the performance of the generated
machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase takes the
optimized intermediate code and generates the actual machine code that can be executed by
the target hardware.
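To make these phases concrete, here is the classic textbook walk-through for the assignment position = initial + rate * 60 (the temporaries t1, t2, t3 and the register names are illustrative):

Source statement:          position = initial + rate * 60
Lexical analysis:          id1 = id2 + id3 * 60        (identifiers entered in the symbol table)
Syntax/semantic analysis:  id1 = id2 + (id3 * inttofloat(60))   as an annotated syntax tree
Intermediate code:         t1 = inttofloat(60)
                           t2 = id3 * t1
                           t3 = id2 + t2
                           id1 = t3
Optimization:              t1 = id3 * 60.0
                           id1 = id2 + t1
Code generation:           LDF  R2, id3
                           MULF R2, R2, #60.0
                           LDF  R1, id2
                           ADDF R1, R1, R2
                           STF  id1, R1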
Symbol Table – It is a data structure being used and maintained by the compiler, consisting of all the
identifier’s names along with their types. It helps the compiler to function smoothly by finding the
identifiers quickly.
The analysis of a source program is divided into mainly three phases. They are:
Linear Analysis- This involves a scanning phase where the stream of characters is read from
left to right. It is then grouped into various tokens having a collective meaning.
Hierarchical Analysis- In this analysis phase, based on a collective meaning, the tokens are
categorized hierarchically into nested groups.
Semantic Analysis- This phase is used to check whether the components of the source
program are meaningful or not.
L3
Front-end and Backend of compiler
The compiler has two modules, namely the front end and the back end. The front end comprises the
lexical analyser, syntax analyser, semantic analyser, and intermediate code generator; the remaining
phases together form the back end.
Applications
1. Generating machine code or executable code for a specific platform: The synthesis phase
takes the intermediate code and generates code that can be run on a specific computer
architecture.
2. Instruction selection: The compiler selects appropriate machine instructions for the target
platform to implement the intermediate code.
3. Register allocation: The compiler assigns values to registers to improve the performance of
the generated code.
4. Memory management: The compiler manages the allocation and deallocation of memory to
ensure the generated code runs efficiently.
5. Optimization: The compiler performs various optimization techniques such as dead code
elimination, constant folding, and common subexpression elimination to improve the
performance of the generated code.
6. Creating executable files: The final output of the synthesis phase is typically a file
containing machine code or assembly code that can be directly executed by the computer’s
CPU.
L5
Lexical analysis
Lexical analysis is the first phase of the compiler, also known as scanning. It converts the
high-level input program into a sequence of tokens.
● Lexical analysis can be implemented with deterministic finite automata.
● The output is a sequence of tokens that is sent to the parser for syntax analysis.
What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages. Example of tokens:
Type token (id, number, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name, etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc
Example of non-tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token or a
sequence of input characters that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens. You can observe that we have omitted comments.
Lexeme      Token
(           LPAREN
a           IDENTIFIER
b           IDENTIFIER
)           RPAREN
=           ASSIGNMENT
a           IDENTIFIER
2           INTEGER
;           SEMICOLON
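A hand-written scanner loop makes the lexeme/token distinction concrete. The following is a minimal sketch in C; the token codes and input handling are illustrative assumptions, not the code a LEX generator would produce:

#include <ctype.h>
#include <stdio.h>

/* Illustrative token codes. */
enum token { IDENTIFIER, INTEGER, OPERATOR, SEPARATOR, END };

/* Scan one token from stdin; the lexeme is copied into buf. */
enum token next_token(char *buf) {
    int c = getchar();
    while (c == ' ' || c == '\t' || c == '\n')      /* skip blanks: non-tokens */
        c = getchar();
    if (c == EOF) return END;

    int i = 0;
    if (isalpha(c) || c == '_') {                   /* identifier or keyword   */
        while (isalnum(c) || c == '_') { buf[i++] = c; c = getchar(); }
        ungetc(c, stdin);
        buf[i] = '\0';
        return IDENTIFIER;
    }
    if (isdigit(c)) {                               /* integer constant        */
        while (isdigit(c)) { buf[i++] = c; c = getchar(); }
        ungetc(c, stdin);
        buf[i] = '\0';
        return INTEGER;
    }
    buf[0] = c; buf[1] = '\0';                      /* single-character token  */
    return (c == ',' || c == ';') ? SEPARATOR : OPERATOR;
}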
Advantages:
Efficiency: Lexical analysis improves the efficiency of the parsing process because it breaks down the
input into smaller, more manageable chunks. This allows the parser to focus on the structure of the
code, rather than the individual characters.
Flexibility: Lexical analysis allows for the use of keywords and reserved words in programming
languages. This makes it easier to create new programming languages and to modify existing ones.
Error Detection: The lexical analyzer can detect errors such as misspelled words, missing
semicolons, and undefined variables. This can save a lot of time in the debugging process.
Code Optimization: Lexical analysis can help optimize code by identifying common patterns and
replacing them with more efficient code. This can improve the performance of the program.
Disadvantages:
Complexity: Lexical analysis can be complex and require a lot of computational power. This can
make it difficult to implement in some programming languages.
Limited Error Detection: While lexical analysis can detect certain types of errors, it cannot detect all
errors. For example, it may not be able to detect logic errors or type errors.
Increased Code Size: The addition of keywords and reserved words can increase the size of the code,
making it more difficult to read and understand.
Reduced Flexibility: The use of keywords and reserved words can also reduce the flexibility of a
programming language. It may not be possible to use certain words or phrases in a way that is
intuitive to the programmer.
Syntax analysis: CFGs, Top-down parsing, Brute force approach, recursive descent parsing,
transformation on the grammars, predictive parsing, bottom-up parsing, operator precedence parsing,
LR parsers (SLR, LALR, LR), Parser generation. Syntax directed definitions: Construction of Syntax
trees, Bottom-up evaluation of S-attributed definition, L-attribute definition, Top-down translation,
Bottom-Up evaluation of inherited attributes Recursive Evaluation, Analysis of Syntax directed
definition.
L6
Syntax analysis :
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical
structure of the given input, i.e. whether the given input is in the correct syntax (of the language in
which the input has been written) or not. It does so by building a data structure, called a Parse tree or
Syntax tree. The parse tree is constructed by using the pre-defined Grammar of the language and the
input string. If the given input string can be produced with the help of the syntax tree (in the
derivation process), the input string is found to be in the correct syntax; if not, an error is reported by
the syntax analyser.
Syntax analysis, also known as parsing, is a process in compiler design where the compiler checks if
the source code follows the grammatical rules of the programming language. This is typically the
second stage of the compilation process, following lexical analysis.
A context-free grammar (CFG) is a type of formal grammar; the syntax or structure of a formal
language can be described using a CFG.
A grammar is a four-tuple (V, T, P, S), where:
V - It is the collection of variables or nonterminal symbols.
T - It is a set of terminals.
P - It is the production rules that consist of both terminals and nonterminal.
S - It is the Starting symbol.
A grammar is said to be a context-free grammar if every production is of the form:
A -> (V ∪ T)*, where A ∊ V
● The left-hand side, A, can only be a single variable; it cannot be a terminal.
● The right-hand side can be any combination of variables and terminals.
The rule above states that every production whose right-hand side is any combination of variables
from ‘V’ and terminals from ‘T’ is a context-free production.
For example, consider a grammar G with variables V, terminals {a, b}, a set of productions P, and start symbol S.
L7
Top-down parsing:
● In the top-down technique, the parse tree is constructed from the top (the root) and the input is read from left to right.
● A top-down parser starts from the start symbol and derives the input string from it.
Example –
S -> aABe
A -> Abc | b
B -> d
Now, let’s consider the input to read and to construct a parse tree with top-down approach.
Input –
abbcde$
Now let us see how the top-down approach works, i.e., how the input string is generated from the grammar.
● First, start with S -> aABe; the input string has a at the beginning and e at the end.
● Expand A -> Abc and then B -> d, which gives the sentential form aAbcde, while the input string is abbcde.
● Finally, expand A -> b, which gives abbcde and matches the input.
Given below is the Diagram explanation for constructing top-down parse tree. You can see clearly in
the diagram how you can generate the input string using grammar with top-down approach.
Brute Force Approach and its pros and cons
some features of the brute force algorithm are:
● It is an intuitive, direct, and straightforward technique of problem-solving in which all the
possible ways or all the possible solutions to a given problem are enumerated.
● Many problems are solved in day-to-day life using the brute force strategy, for example exploring
all the paths to a nearby market to find the shortest one.
● Arranging the books in a rack using all the possibilities to optimize the rack spaces, etc.
● In fact, daily life activities use a brute force nature, even though optimal algorithms are also
possible.
● The brute force approach is inefficient. For real-time problems, algorithm analysis often goes
above the O(N!) order of growth.
● This method relies more on compromising the power of a computer system for solving a
problem than on a good algorithm design.
● Brute force algorithms are slow.
● Brute force algorithms are not constructive or creative compared to algorithms that are
constructed using some other design paradigms
Parsing is the process to determine whether the start symbol can derive the program or not. If
the Parsing is successful then the program is a valid program otherwise the program is
invalid.
There are generally two types of Parsers:
1. Top-Down Parsers:
● In this Parsing technique we expand the start symbol to the whole program.
● Recursive Descent and LL parsers are the Top-Down parsers.
2. Bottom-Up Parsers:
● In this Parsing technique we reduce the whole program to start symbol.
● Operator Precedence Parser, LR(0) Parser, SLR Parser, LALR Parser and CLR Parser
are the Bottom-Up parsers.
L8
Recursive Descent Parser:
It is a kind of Top-Down Parser. A top-down parser builds the parse tree from the top down,
starting with the start non-terminal. A Predictive Parser is a special case of Recursive Descent Parser,
where no Back Tracking is required.
By carefully rewriting a grammar, that is, by eliminating left recursion and performing left factoring,
we obtain a grammar that can be parsed by a recursive descent parser.
Example:
Original grammar:
E -> E + T | T
T -> T * F | F
F -> ( E ) | id
After eliminating left recursion and left factoring:
E -> T E'
E' -> + T E' | e
T -> F T'
T' -> * F T' | e
F -> ( E ) | id
** Here e denotes epsilon (the empty string).
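A recursive descent parser for the transformed grammar can be sketched as follows in C; treating each letter of the input as an id token and keeping the lookahead in a global are illustrative simplifications:

#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static const char *input;      /* the string being parsed       */
static char lookahead;         /* current input symbol           */

static void error(void) { printf("syntax error\n"); exit(1); }
static void match(char t) {    /* consume the expected terminal  */
    if (lookahead == t) lookahead = *++input;
    else error();
}

static void E(void); static void Ep(void);
static void T(void); static void Tp(void);
static void F(void);

static void E(void)  { T(); Ep(); }                  /* E  -> T E'        */
static void Ep(void) {                               /* E' -> + T E' | e  */
    if (lookahead == '+') { match('+'); T(); Ep(); }
    /* else epsilon */
}
static void T(void)  { F(); Tp(); }                  /* T  -> F T'        */
static void Tp(void) {                               /* T' -> * F T' | e  */
    if (lookahead == '*') { match('*'); F(); Tp(); }
    /* else epsilon */
}
static void F(void)  {                               /* F -> ( E ) | id   */
    if (lookahead == '(') { match('('); E(); match(')'); }
    else if (isalpha((unsigned char)lookahead)) match(lookahead);
    else error();
}

int main(void) {
    input = "a+b*c";           /* single letters stand for id tokens */
    lookahead = *input;
    E();
    if (lookahead == '\0') printf("accepted\n"); else error();
    return 0;
}

Note that, because left recursion has been removed, no procedure calls itself on the same input position, so the parser needs no backtracking.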
1. Useless productions – The productions that can never take part in the derivation of any string are
called useless productions. Similarly, a variable that can never take part in the derivation of any string
is called a useless variable. For example:
S -> abS | abA | abB
A -> cd
B -> aB
C -> dc
In the example above, the production ‘C -> dc’ is useless because the variable ‘C’ will never occur in
the derivation of any string: the other productions are written in such a way that ‘C’ can never be
reached from the starting variable ‘S’.
The production ‘B -> aB’ is also useless because there is no way it will ever terminate. If it never
terminates, it can never produce a string. Hence the production can never take part in any derivation.
To remove useless productions, we first find all the variables which will never lead to a terminal
string, such as variable ‘B’, and then remove all the productions in which variable ‘B’ occurs. The
modified grammar becomes S -> abS | abA and A -> cd.
2. Unit productions – A production of the form X -> Y, where both X and Y are variables, is called a
unit production. To remove the unit productions from a grammar ‘G’, consider the grammar:
S -> Aa | B
A -> b | B
B -> A | a
Let us add all the non-unit productions of ‘G’ to ‘Guf’. ‘Guf’ now becomes –
S -> Aa
A -> b
B -> a
Now we find all the variables that satisfy ‘X *=> Z’. These are ‘S*=>B’, ‘A *=> B’ and ‘B *=> A’.
For ‘A *=> B’ , we add ‘A -> a’ because ‘B ->a’ exists in ‘Guf’. ‘Guf’ now becomes
S -> Aa
A -> b | a
B -> a
For ‘B *=> A’ , we add ‘B -> b’ because ‘A -> b’ exists in ‘Guf’. The new grammar now becomes
S -> Aa
A -> b | a
B -> a | b
We follow the same step for ‘S*=>B’ and finally get the following grammar –
S -> Aa | b | a
A -> b | a
B -> a | b
Now remove B -> a | b, since B is no longer reachable from the start symbol ‘S’; the grammar then
becomes:
S->Aa|b|a
A->b|a
Note: To remove all kinds of productions mentioned above, first remove the null productions, then the
unit productions, and finally the useless productions. Following this order is very important to get the
correct result.
L9
Predictive parsing
In this lecture, we cover an overview of the predictive parser and mainly focus on its role. We also
cover the algorithm for implementing a predictive parser and finally discuss an example that applies
the algorithm. Let us discuss these one by one.
Predictive Parser :
A predictive parser is a recursive descent parser with no backtracking or backup. It is a top-down
parser that does not require backtracking. At each step, the choice of the rule to be expanded is made
upon the next terminal symbol.
Consider
A -> A1 | A2 | ... | An
If the non-terminal ‘A’ is to be expanded further, the alternative to use is selected based on the
current input symbol ‘a’ only.
Predictive Parser Algorithm:
1. Make a transition diagram (DFA/NFA) for every rule of grammar.
2. Optimize the DFA by reducing the number of states, yielding the final transition diagram.
3. Simulate the string on the transition diagram to parse a string.
4. If the transition diagram reaches an accept state after the input is consumed, it is parsed.
Consider the following grammar –
E->E+T|T
T->T*F|F
F->(E)|id
After removing left recursion, left factoring
E->TT'
T'->+TT'|ε
T->FT''
T''->*FT''|ε
F->(E)|id
STEP 1:
Make a transition diagram (DFA/NFA) for every rule of grammar.
● E->TT’
● T’->+TT’|ε
● T->FT”
● T”->*FT”|ε
● F->(E)|id
STEP 2:
Optimize the DFA by reducing the number of states, yielding the final transition diagram.
● T’->+TT’|ε
The action part of the table contains all the terminals of the grammar, whereas the goto part contains
all the non-terminals. For every state of the goto graph we write all the goto operations in the table. If
goto is applied to a terminal, it is written in the action part; if goto is applied to a non-terminal, it is
written in the goto part. If, on applying goto, a production is reduced (i.e., the dot reaches the end of
the production and no further closure can be applied), it is denoted as Ri; if the production is not
reduced (it is shifted), it is denoted as Si. A reduced production is written under the terminals given
by FOLLOW of the left-hand side of the production that is reduced; for example, in I5, S -> AA is
reduced, so R1 is written under the terminals in FOLLOW(S) = {$}. If in a state the start symbol of
the grammar is reduced, it is written under the $ symbol as accept.
NOTE: If in any state both reduced and shifted productions are present or two reduced productions
are present it is called a conflict situation and the grammar is not LR grammar.
NOTE:
1. Two reduced productions in one state – RR conflict.
2. One reduced and one shifted production in one state – SR conflict.
If no SR or RR conflict is present in the parsing table, then the grammar is an LR(0) grammar. In the
above grammar there is no conflict, so it is an LR(0) grammar.
L11
operator precedence parsing
A grammar that is used to define mathematical operators is called an operator grammar or operator
precedence grammar. Such grammars have the restriction that no production has either an empty
right-hand side (null productions) or two adjacent non-terminals in its right-hand side. Examples
– This is an example of operator grammar:
E->E+E/E*E/id
However, the grammar given below is not an operator grammar because two non-terminals are
adjacent to each other:
S->SAS/a
A->bSb/b
We can convert it into an operator grammar, though:
S->SbSbS/SbS/a
A->bSb/b
Operator precedence parser – An operator precedence parser is a bottom-up parser that interprets an
operator grammar. This parser is only used for operator grammars. Ambiguous grammars are not
allowed in any parser except operator precedence parser. There are two methods for determining what
precedence relations should hold between a pair of terminals:
Figure – Operator precedence relation table for the grammar E->E+E/E*E/id. No relation is given
between id and id, because id will never be compared with id and two variables cannot come side by
side. There is also a disadvantage of this table: if we have n operators, then the size of the table will be
n*n and the complexity will be O(n²). In order to decrease the size of the table, we use an operator
function table, i.e., precedence functions.
Operator precedence parsers usually do not store the precedence table with the relations; rather they
are implemented in a special way. Operator precedence parsers use precedence functions that map
terminal symbols to integers, and the precedence relations between the symbols are implemented by
numerical comparison. The parsing table can be encoded by two precedence functions f and g that
map terminal symbols to integers. We select f and g such that:
1. f(a) < g(b) whenever a yields precedence to b
2. f(a) = g(b) whenever a and b have the same precedence
3. f(a) > g(b) whenever a takes precedence over b
Example – Consider the following grammar:
E -> E + E/E * E/( E )/id
This is the directed graph representing the precedence function:
Since there is no cycle in the graph, we can make this function table:
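As an illustration, the standard textbook precedence functions for this grammar assign f(+)=2, g(+)=1, f(*)=4, g(*)=3, f(id)=4, g(id)=5 and f($)=g($)=0. The small C sketch below shows how a parser would compare them to decide between shifting and reducing; the encoding of terminals as indices is an assumption of this sketch:

#include <stdio.h>

/* Terminal indices: 0 -> +, 1 -> *, 2 -> id, 3 -> $ (illustrative). */
enum { PLUS, STAR, ID, DOLLAR };

static const int f[] = { 2, 4, 4, 0 };   /* f: precedence of the terminal on the stack */
static const int g[] = { 1, 3, 5, 0 };   /* g: precedence of the incoming terminal     */

/* Returns '<' if the stack terminal yields precedence (shift),
 * '>' if it takes precedence (reduce), '=' if they are equal. */
char relation(int stack_top, int next_input) {
    if (f[stack_top] < g[next_input]) return '<';
    if (f[stack_top] > g[next_input]) return '>';
    return '=';
}

int main(void) {
    /* + yields precedence to *, so the parser shifts:  prints  + < *  */
    printf("+ %c *\n", relation(PLUS, STAR));
    /* * takes precedence over +, so the parser reduces: prints  * > +  */
    printf("* %c +\n", relation(STAR, PLUS));
    return 0;
}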
Description of LR parser:
In the term LR(k) parser, L refers to left-to-right scanning of the input, R refers to the rightmost
derivation constructed in reverse, and k refers to the number of unconsumed “look ahead” input
symbols that are used in making parser decisions. Typically, k is 1 and is often omitted. A context-free
grammar is called LR(k) if an LR(k) parser exists for it. The parser reduces the sequence of tokens
from left to right back to the start symbol; reading this sequence of reductions in reverse order gives a
rightmost derivation.
1. Initially the stack is empty, and we are looking to eventually reduce by the augmented rule S’ → S$.
2. A “.” (dot) placed in a rule represents how much of that rule is already on the stack.
3. A dotted item, or simply an item, is a production rule with a dot indicating how much of the RHS
has so far been recognized. The closure of an item is used to see what production rules can be used
to expand the current structure. It is calculated as follows:
Rules for LR parser:
The rules of the LR parser are as follows.
1. The first item from the given grammar rules adds itself as the first closed set.
2. If an item of the form A → α . B γ is present in the closure, where the symbol B immediately
after the dot is a non-terminal, add B’s production rules with the dot preceding their first
symbol.
3. Repeat step (2) for every new item added under it.
LR parser algorithm:
The LR parsing algorithm is the same for all LR parsers; only the parsing table is different for each
parser. It consists of the following components.
1. Input Buffer –
It contains the given string, and it ends with a $ symbol.
2. Stack –
The combination of state symbol and current input symbol is used to refer to the parsing table
in order to take the parsing decisions.
Parsing Table:
The parsing table is divided into two parts: the action table and the go-to table. The action table gives
the action to perform for the given current state and the current terminal in the input stream. There are
four cases used in the action table, as follows.
1. Shift Action – the present terminal is removed from the input stream, the state n is pushed onto
the stack, and it becomes the new present state.
2. Reduce Action – the rule number m is written to the output stream; for every symbol on the
right-hand side of rule m, one state is removed (popped) from the stack; the symbol on the
left-hand side of rule m is then looked up in the go-to table under the state now on top of the
stack, and the new state found there is made the current state by pushing it onto the stack.
3. Accept – the string is accepted.
4. No action – a syntax error is reported.
Note –
The go-to table indicates the state to which the parser should proceed after a reduction.
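The driver loop shared by all LR parsers can be sketched as follows in C; the ACTION and GOTO tables, the token encoding, and the rule descriptions are placeholders that a parser generator would fill in for a concrete grammar:

#define NUM_TERMINALS     16   /* placeholder table sizes */
#define NUM_NONTERMINALS  16

enum kind { SHIFT, REDUCE, ACCEPT, ERROR };
struct action { enum kind kind; int value; };       /* value = state or rule number */

extern struct action ACTION[][NUM_TERMINALS];       /* action part of the table      */
extern int GOTO[][NUM_NONTERMINALS];                /* go-to part of the table       */
extern int rhs_len[], lhs[];                        /* length and LHS of each rule   */

int parse(const int *tokens) {                      /* token stream, ending with $   */
    int stack[1000], top = 0;
    stack[top] = 0;                                  /* start in state 0              */
    for (;;) {
        struct action a = ACTION[stack[top]][*tokens];
        if (a.kind == SHIFT) {                       /* push the new state, consume token */
            stack[++top] = a.value;
            tokens++;
        } else if (a.kind == REDUCE) {               /* pop |rhs| states, consult go-to   */
            top -= rhs_len[a.value];
            stack[top + 1] = GOTO[stack[top]][lhs[a.value]];
            top++;
        } else if (a.kind == ACCEPT) {
            return 1;                                /* input accepted  */
        } else {
            return 0;                                /* syntax error    */
        }
    }
}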
LR parser diagram:
L13
Syntax directed definitions:
Construction of Syntax trees:
Syntax Directed Translation has augmented rules to the grammar that facilitate semantic analysis.
SDT involves passing information bottom-up and/or top-down to the parse tree in form of attributes
attached to the nodes. Syntax-directed translation rules use 1) lexical values of nodes, 2) constants &
3) attributes associated with the non-terminals in their definitions.
The general approach to Syntax-Directed Translation is to construct a parse tree or syntax tree and
compute the values of attributes at the nodes of the tree by visiting them in some order. In many cases,
translation can be done during parsing without building an explicit tree.
Example
E -> E+T | T
T -> T*F | F
F -> INTLIT
This is a grammar to syntactically validate an expression having additions and multiplications in it.
Now, to carry out semantic analysis we will augment SDT rules to this grammar, in order to pass some
information up the parse tree and check for semantic errors, if any. In this example, we will focus on
the evaluation of the given expression, as we don’t have any semantic assertions to check in this very
basic example.
The above diagram shows how semantic analysis could happen. The flow of information happens
bottom-up and all the children’s attributes are computed before parents, as discussed above.
Right-hand side nodes are sometimes annotated with subscript 1 to distinguish between children and
parents.
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute values of children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val corresponding to node E. If
all the semantic attributes in an augmented grammar are synthesized, one depth-first search traversal
in any order is sufficient for the semantic analysis phase.
Inherited Attributes are such attributes that depend on parent and/or sibling’s attributes.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E & Ep are same production
symbols annotated to differentiate between parent and child, has an inherited attribute val
corresponding to node T.
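Because synthesized attributes depend only on children, one bottom-up (post-order) traversal of the tree evaluates them; a rough C sketch of evaluating the val attribute over a syntax tree follows (the node layout is an illustrative assumption):

/* Syntax-tree node for expressions built from +, * and integer leaves. */
struct node {
    char op;                    /* '+', '*', or 0 for a leaf            */
    int  value;                 /* leaf value (INTLIT)                   */
    struct node *left, *right;
};

/* val is synthesized: it is computed from the children, so a single
 * post-order (depth-first) traversal suffices, as discussed above. */
int eval(struct node *n) {
    if (n->op == 0) return n->value;          /* F -> INTLIT: F.val = lexval   */
    int l = eval(n->left), r = eval(n->right);
    return n->op == '+' ? l + r : l * r;      /* E.val = E.val + T.val, etc.   */
}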
Terminologies:
● Parse Tree: A parse tree is a tree that represents the syntax of the production
hierarchically.
● Annotated Parse Tree: Annotated Parse tree contains the values and attributes at each
node.
● Synthesized Attributes: When the evaluation of a node’s attribute is based on its children.
● Inherited Attributes: When the evaluation of a node’s attribute is based on its parent or
siblings.
Dependency Graphs
A dependency graph provides information about the order of evaluation of attributes with the help of
edges. It is used to determine the order of evaluation of attributes according to the semantic rules of
the production. An edge from the first node attribute to the second node attribute gives the
information that first node attribute evaluation is required for the evaluation of the second node
attribute. Edges represent the semantic rules of the corresponding production.
Dependency Graph Rules: A node in the dependency graph corresponds to a node of the parse tree
for each attribute. An edge from the first node to the second node represents that the attribute of the
first node must be evaluated before the attribute of the second node.
Production Table
S.No.   Production    Semantic Rule
3       A1 ⇢ B        A1.syn = B.syn

Node    Attribute
1       digit.lexval
2       digit.lexval
3       digit.lexval
4       B.syn
5       B.syn
6       B.syn
7       A1.syn
8       A.syn
9       A1.inh
10      S.val

Table-2
From    To      Corresponding Semantic Rule (from the production table)
1       4       B.syn = digit.lexval
2       5       B.syn = digit.lexval
3       6       B.syn = digit.lexval
4       7       A1.syn = B.syn
6       10      S.val = A.syn + B.syn
8       9       A1.inh = A.syn
S-Attributed Definitions:
S-attributed SDD can have only synthesized attributes. In this type of definitions semantic rules are
placed at the end of the production only. Its evaluation is based on bottom up parsing.
Example: S ⇢ AB { S.x = f(A.x | B.x) }
L-Attributed Definitions:
L-attributed SDD can have both synthesized and inherited (restricted inherited as attributes can only
be taken from the parent or left siblings). In this type of definition, semantics rules can be placed
anywhere in the RHS of the production. Its evaluation is based on inorder (topological sorting).
Example: S ⇢ AB {A.x = S.x + 2} or S ⇢ AB { B.x = f(A.x | B.x) } or S ⇢ AB { S.x = f(A.x | B.x) }
Note:
● Every S-attributed grammar is also L-attributed.
● For L-attributed evaluation in order of the annotated parse tree is used.
● For S-attributed reverse of the rightmost derivation is used.
Semantic Rules with controlled side-effects:
Side effects are the program fragment contained within semantic rules. These side effects in SDD can
be controlled in two ways: Permit incidental side effects and constraint admissible evaluation orders
to have the same translation as any admissible order.
L15
Analysis of Syntax directed definition
Syntax Directed Definition (SDD) is a kind of abstract specification. It is a generalization of a
context-free grammar in which each grammar production X -> α has associated with it a set of
semantic rules of the form s = f(b1, b2, ..., bk), where s is an attribute obtained from the function f.
The attribute can be a string, a number, a type, or a memory location. Semantic rules are fragments of
code which are usually embedded at the end of the production and enclosed in curly braces ({ }).
Example:
E --> E1 + T { E.val = E1.val + T.val}
Annotated Parse Tree – The parse tree containing the values of attributes at each node for given input
string is called annotated or decorated parse tree.
1. Synthesized Attributes – These are the attributes that derive their values from their children nodes,
i.e., the value of a synthesized attribute at a node is computed from the attribute values of its children.
Let us assume an input string 4 * 5 + 6 for computing synthesized attributes. The annotated parse tree
for the input string is
For computation of attributes we start from leftmost bottom node. The rule F –> digit is used to
reduce digit to F and the value of digit is obtained from lexical analyzer which becomes value of F i.e.
from semantic action F.val = digit.lexval. Hence, F.val = 4 and since T is parent node of F so, we get
T.val = 4 from semantic action T.val = F.val. Then, for T –> T1 * F production, the corresponding
semantic action is T.val = T1.val * F.val . Hence, T.val = 4 * 5 = 20
Similarly, E.val is obtained by combining E1.val and T.val, i.e., E.val = E1.val + T.val = 20 + 6 = 26.
Then the production S –> E is applied, and the semantic action associated with it prints the result
E.val. Hence, the output will be 26.
2. Inherited Attributes – These are the attributes which derive their values from their parent or sibling
nodes i.e. value of inherited attributes are computed by value of parent or sibling nodes.
Example:
A --> BCD { C.in = A.in, C.type = B.type }
Computation of Inherited Attributes –
● Construct the SDD using semantic actions.
● The annotated parse tree is generated and attribute values are computed in top down manner.
Example: Consider the following grammar
S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows.
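The semantic rules themselves are not reproduced above; a standard formulation of this SDD, using the inherited attribute L.in and the Enter_type routine mentioned below, is roughly:

S -> T L        { L.in = T.type }
T -> int        { T.type = int }
T -> float      { T.type = float }
T -> double     { T.type = double }
L -> L1 , id    { L1.in = L.in ; Enter_type(id.entry, L.in) }
L -> id         { Enter_type(id.entry, L.in) }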
Let us assume an input string int a, c for computing inherited attributes. The annotated parse tree for
the input string is
The value of L nodes is obtained from T.type (sibling) which is basically lexical value obtained as int,
float or double. Then L node gives type of identifiers a and c. The computation of type is done in top
down manner or preorder traversal. Using function Enter_type the type of identifiers a and c is
inserted in symbol table at corresponding id.entry.
L16
Type checking:
Type checking is the process of verifying and enforcing constraints of types in values. A compiler
must check that the source program should follow the syntactic and semantic conventions of the
source language and it should also check the type rules of the language. It allows the programmer to
limit what types may be used in certain circumstances and assigns types to values. The type-checker
determines whether these values are used appropriately or not.
It checks the type of objects and reports a type error in the case of a violation, and incorrect types are
corrected. Whatever the compiler we use, while it is compiling the program, it has to follow the type
rules of the language. Every language has its own set of type rules for the language. We know that the
information about data types is maintained and computed by the compiler.
The information about data types like INTEGER, FLOAT, CHARACTER, and all the other data types
is maintained and computed by the compiler. The compiler contains modules, where the type checker
is a module of a compiler and its task is type checking.
Conversion
Conversion from one type to another type is known as implicit if it is to be done automatically by the
compiler. Implicit type conversions are also called Coercion and coercion is limited in many
languages.
Example: An integer may be converted to a real but real is not converted to an integer.
Conversion is said to be Explicit if the programmer writes something to do the Conversion.
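A type checker implementing this coercion rule might, as a rough sketch in C, look like the following; the type codes are illustrative:

enum type { T_INT, T_REAL, T_ERROR };

/* Result type of a binary arithmetic operator with implicit coercion:
 * int op int -> int; any mix involving real -> real (the int operand
 * is coerced to real); a real is never implicitly converted to int. */
enum type check_arith(enum type left, enum type right) {
    if (left == T_ERROR || right == T_ERROR) return T_ERROR;
    if (left == T_INT && right == T_INT)     return T_INT;
    return T_REAL;                           /* at least one operand is real */
}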
Tasks:
1. It has to ensure that indexing is applied only to an array.
2. It has to check the ranges of the data types used; for example:
3. INTEGER (int) has a range of -32,768 to +32,767 (on a 16-bit implementation).
4. FLOAT has a range of about 1.2E-38 to 3.4E+38.
Polymorphic functions
The word “polymorphism” means having many forms. In simple words, we can define polymorphism
as the ability of a message to be displayed in more than one form. A real-life example of
polymorphism is a person who at the same time can have different characteristics. A man at the same
time is a father, a husband, and an employee. So the same person exhibits different behavior in
different situations. This is called polymorphism. Polymorphism is considered one of the important
features of Object-Oriented Programming.
Types of Polymorphism
● Compile-time Polymorphism
● Runtime Polymorphism
1. Compile-Time Polymorphism
This type of polymorphism is achieved by function overloading or operator overloading.
A. Function Overloading
When there are multiple functions with the same name but different parameters, then the functions are
said to be overloaded, hence this is known as Function Overloading. Functions can be overloaded
by changing the number of arguments or/and changing the type of arguments. In simple terms, it is a
feature of object-oriented programming providing many functions that have the same name but
distinct parameters when numerous tasks are listed under one function name. There are certain Rules
of Function Overloading that should be followed while overloading a function.
2. Runtime Polymorphism
This type of polymorphism is achieved by Function Overriding. Late binding and dynamic
polymorphism are other names for runtime polymorphism. In runtime polymorphism, the function call
is resolved at runtime; in contrast with compile-time polymorphism, the compiler does not determine
the call at compile time, and the call is bound to the appropriate function definition only at runtime.
A. Function Overriding
Function Overriding occurs when a derived class has a definition for one of the member functions of
the base class. That base function is said to be overridden.
Virtual Function
A virtual function is a member function that is declared in the base class using the keyword virtual
and is re-defined (Overridden) in the derived class.
Some Key Points About Virtual Functions:
● Virtual functions are Dynamic in nature.
● They are defined by inserting the keyword “virtual” in a base class; they are always declared in a
base class and overridden in a child class.
● A virtual function is called during Runtime
L18
Run time Environment:
A translator needs to relate the static source text of a program to the dynamic actions that must occur
at runtime to implement the program. The program consists of names for procedures, identifiers, etc.,
that require mapping with the actual memory location at runtime. Runtime environment is a state of
the target machine, which may include software libraries, environment variables, etc., to provide
services to the processes running in the system.
Storage organization:
Activation Tree
A program consists of procedures. A procedure definition is a declaration that, in its simplest form,
associates an identifier (the procedure name) with a statement (the body of the procedure). Each
execution of the procedure is referred to as an activation of the procedure. The lifetime of an
activation is the sequence of steps present in the execution of the procedure. If ‘a’ and ‘b’ are two
procedures, then their activations will be either non-overlapping (when one is called after the other)
or nested (when one is called inside the other). A
procedure is recursive if a new activation begins before an earlier activation of the same procedure has
ended. An activation tree shows the way control enters and leaves activations. Properties of activation
trees are :-
● Each node represents an activation of a procedure.
● The root shows the activation of the main function.
● The node for procedure ‘x’ is the parent of node for procedure ‘y’ if and only if the control
flows from procedure x to procedure y.
Example – Consider the following program of Quicksort
main() {
    int n;
    readarray();
    quicksort(1, n);
}
quicksort(int m, int n) {
    int i = partition(m, n);
    quicksort(m, i - 1);
    quicksort(i + 1, n);
}
The activation tree for this program will be:
First main function as the root then main calls readarray and quicksort. Quicksort in turn calls
partition and quicksort again. The flow of control in a program corresponds to a pre-order depth-first
traversal of the activation tree which starts at the root.
CONTROL STACK AND ACTIVATION RECORDS
Control stack or runtime stack is used to keep track of the live procedure activations, i.e., the
procedures whose execution has not been completed. A procedure name is pushed onto the stack when
the procedure is called (activation begins) and popped when it returns (activation ends). Information
needed by a single execution of a procedure is managed using an activation record or frame. When a
procedure is called, an activation record is pushed onto the stack, and as soon as control returns to the
caller, the activation record is popped.
● Formal Parameters: Variables that take the information passed by the caller procedure are
called formal parameters. These variables are declared in the definition of the called function.
● Actual Parameters: Variables whose values are passed to the called function are called actual
parameters. These variables are specified in the function call as arguments.
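The fields of an activation record (which include space for these parameters) vary between implementations; a typical layout, sketched as a C structure with illustrative field types and sizes, is:

/* A typical activation record (stack frame). A real compiler lays this
 * out at fixed offsets from a frame pointer rather than as a struct. */
struct activation_record {
    int   return_value;          /* space for the value returned to the caller  */
    int   actual_params[4];      /* parameters passed by the caller              */
    void *control_link;          /* pointer to the caller's activation record    */
    void *access_link;           /* link for non-local data (nested procedures)  */
    void *saved_machine_status;  /* return address and saved registers           */
    int   local_data[8];         /* the procedure's local variables              */
    int   temporaries[8];        /* compiler-generated temporaries               */
};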
Different ways of passing the parameters to the procedure:
● Call by Value: In call by value the calling procedure passes the r-value of the actual
parameters and the compiler puts that into called procedure’s activation record. Formal
parameters hold the values passed by the calling procedure, thus any changes made in the
formal parameters do not affect the actual parameters.
● Call by Reference: In call by reference, the formal and actual parameters refer to the same
memory location. The l-value of the actual parameter is copied to the activation record of the
called function, so the called function has the address of the actual parameter. If an actual
parameter does not have an l-value (e.g., i+3), it is evaluated in a new temporary location and
the address of that location is passed. Any changes made to the formal parameter are reflected
in the actual parameter (because changes are made at the address).
● Call by Copy-Restore: In call by copy-restore, the compiler copies the values into the formal
parameters when the procedure is called and copies them back into the actual parameters when
control returns to the calling function. The r-values are passed, and on return, the r-values of
the formals are copied into the l-values of the actuals.
● Call by Name: In call by name, the actual parameters are substituted for the formals in all the
places the formals occur in the procedure. It is also referred to as lazy evaluation because a
parameter is evaluated only when it is needed.
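In C terms, the difference between the first two mechanisms can be illustrated as follows (call by reference is simulated with a pointer, since C itself passes everything by value):

#include <stdio.h>

void by_value(int x)      { x = x + 1; }   /* changes only the local copy     */
void by_reference(int *x) { *x = *x + 1; } /* changes the caller's variable   */

int main(void) {
    int a = 5;
    by_value(a);          /* r-value of a is copied; a is still 5 */
    printf("%d\n", a);    /* prints 5 */
    by_reference(&a);     /* l-value (address) of a is passed     */
    printf("%d\n", a);    /* prints 6 */
    return 0;
}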
Advantages:
Portability: A runtime environment can provide a layer of abstraction between the compiled code and
the operating system, making it easier to port the program to different platforms.
Resource management: A runtime environment can manage system resources, such as memory and
CPU time, making it easier to avoid memory leaks and other resource-related issues.
Dynamic memory allocation: A runtime environment can provide dynamic memory allocation,
allowing memory to be allocated and freed as needed during program execution.
Garbage collection: A runtime environment can perform garbage collection, automatically freeing
memory that is no longer being used by the program.
Exception handling: A runtime environment can provide exception handling, allowing the program
to gracefully handle errors and prevent crashes.
Disadvantages:
Performance overhead: A runtime environment can add performance overhead, as it requires
additional processing and memory usage.
Platform dependency: Some runtime environments may be specific to certain platforms, making it
difficult to port programs to other platforms.
Debugging: Debugging can be more difficult in a runtime environment, as the additional layer of
abstraction can make it harder to trace program execution.
Compatibility issues: Some runtime environments may not be compatible with certain operating
systems or hardware architectures, which can limit their usefulness.
Versioning: Different versions of a runtime environment may have different features or APIs, which
can lead to versioning issues when running programs compiled with different versions of the same
runtime environment.
L21
Dynamic storage allocation
Since C is a structured language, it has some fixed rules for programming. One of them includes
changing the size of an array. An array is a collection of items stored at contiguous memory locations.
Suppose the length (size) of an array is 9. But what if there is a requirement to change this length
(size)? For example,
● If there is a situation where only 5 elements are needed to be entered in this array. In this case,
the remaining 4 indices are just wasting memory in this array. So there is a requirement to
lessen the length (size) of the array from 9 to 5.
● Take another situation. In this, there is an array of 9 elements with all 9 indices filled. But
there is a need to enter 3 more elements in this array. In this case, 3 indices more are required.
So the length (size) of the array needs to be changed from 9 to 12.
This procedure is referred to as Dynamic Memory Allocation in C.
Therefore, C Dynamic Memory Allocation can be defined as a procedure in which the size of a data
structure (like Array) is changed during the runtime.
C provides some functions to achieve these tasks. There are 4 library functions provided by C defined
under <stdlib.h> header file to facilitate dynamic memory allocation in C programming. They are:
1. malloc()
2. calloc()
3. free()
4. realloc()
Let’s look at each of them in greater detail.
C malloc() method
The “malloc” or “memory allocation” method in C is used to dynamically allocate a single large
block of memory with the specified size. It returns a pointer of type void which can be cast into a
pointer of any form. It does not initialize the memory at execution time, so each block initially
contains a garbage value.
Syntax of malloc() in C
ptr = (cast-type*) malloc(byte-size)
For example:
ptr = (int*) malloc(100 * sizeof(int));
This statement allocates a contiguous block of 100 * sizeof(int) bytes of memory.
C calloc() method
The “calloc” or “contiguous allocation” method in C is used to dynamically allocate the
specified number of blocks of memory of the specified type. It is very much similar to malloc()
but differs on two points:
1. It initializes each block with a default value ‘0’.
2. It takes two parameters or arguments, as compared to malloc().
Syntax of calloc() in C
ptr = (cast-type*)calloc(n, element-size);
here, n is the no. of elements and element-size is the size of each element.
For Example:
ptr = (float*) calloc(25, sizeof(float));
This statement allocates contiguous space in memory for 25 elements each with the size of the float.
C free() method
The “free” method in C is used to dynamically de-allocate memory. The memory allocated using the
functions malloc() and calloc() is not de-allocated on its own; hence the free() method is used
whenever the dynamically allocated memory is no longer needed. It helps to reduce wastage of
memory by freeing it.
Syntax of free() in C
free(ptr);
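Putting malloc() and free() together, a small, self-contained example (the array size is chosen arbitrarily for illustration):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 5;
    /* Dynamically allocate space for n integers. */
    int *ptr = (int *)malloc(n * sizeof(int));
    if (ptr == NULL) {              /* malloc returns NULL on failure */
        printf("Memory not allocated.\n");
        return 1;
    }
    for (int i = 0; i < n; i++)     /* use the block like an ordinary array */
        ptr[i] = i + 1;
    for (int i = 0; i < n; i++)
        printf("%d ", ptr[i]);
    printf("\n");
    free(ptr);                      /* de-allocate the block when done */
    return 0;
}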
L22
Symbol table:
Symbol Table is an important data structure created and maintained by the compiler in order to keep
track of semantics of variables i.e. it stores information about the scope and binding information about
names, information about instances of various entities such as variable and function names, classes,
objects, etc.
● It is built during the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● It is used by various phases of the compiler as follows:-
1. Lexical Analysis: Creates new table entries in the table, for example like entries
about tokens.
2. Syntax Analysis: Adds information regarding attribute type, scope, dimension, line
of reference, use, etc in the table.
3. Semantic Analysis: Uses available information in the table to check for semantics
i.e. to verify that expressions and assignments are semantically correct(type checking)
and update it accordingly.
4. Intermediate Code generation: Refers symbol table for knowing how much and
what type of run-time is allocated and table helps in adding temporary variable
information.
5. Code Optimization: Uses information present in the symbol table for
machine-dependent optimization.
6. Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that support the
compiler in different phases.
Use of Symbol Table-
The symbol tables are typically used in compilers. Basically compiler is a program which scans the
application program (for instance: your C program) and produces machine code.
During this scan compiler stores the identifiers of that application program in the symbol table. These
identifiers are stored in the form of name, value address, type.
Here the name represents the name of identifier, value represents the value stored in an identifier, the
address represents memory location of that identifier and type represents the data type of identifier.
Thus compiler can keep track of all the identifiers with all the necessary information.
Items stored in Symbol table:
● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler generated temporaries
● Labels in source languages
Information used by the compiler from Symbol table:
● Data type and name
● Declaring procedures
● Offset in storage
● If structure or record then, a pointer to structure table.
● For parameters, whether parameter passing by value or by reference
● Number and type of arguments passed to function
● Base Address
Operations of Symbol table – The basic operations defined on a symbol table include:
Operations on Symbol Table :
Following operations can be performed on symbol table-
1. Insertion of an item in the symbol table.
2. Deletion of any item from the symbol table.
3. Searching of desired item from symbol table.
Implementation of Symbol table –
Following are commonly used data structures for implementing symbol table:-
1. List –
We use a single array, or equivalently several arrays, to store names and their associated
information. New names are added to the list in the order in which they are encountered. The position
of the end of the array is marked by a pointer, available, pointing to where the next symbol-table
entry will go. The search for a name proceeds backwards from the end of the array to the beginning;
when the name is located, the associated information can be found in the words following it.
id1 | info1 | id2 | info2 | ... | idn | infon
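A sketch of this list organization in C; searching backwards from available naturally finds the most recent (innermost) declaration first (the entry layout is an illustrative assumption):

#include <string.h>

#define MAX_ENTRIES 1000

struct entry {
    char name[32];
    char info[32];        /* type, scope, etc. */
};

static struct entry list[MAX_ENTRIES];
static int available = 0;             /* index of the next free slot */

/* New names are appended in the order they are encountered: O(1). */
void st_insert(const char *name, const char *info) {
    strcpy(list[available].name, name);
    strcpy(list[available].info, info);
    available++;
}

/* The search proceeds backwards from the end of the array: O(n). */
struct entry *st_lookup(const char *name) {
    for (int i = available - 1; i >= 0; i--)
        if (strcmp(list[i].name, name) == 0)
            return &list[i];
    return NULL;
}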
L24
Ad-Hoc and Systematic Methods.
Error recovery for syntactic phase error:
1. Panic Mode Recovery
● In this method, successive characters from the input are removed one at a time until a
designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ;
or } (a sketch of this strategy is given after this list).
● The advantage is that it is easy to implement and it is guaranteed not to go into an infinite loop.
● The disadvantage is that a considerable amount of input is skipped without checking it for
additional errors.
2. Statement Mode recovery
● In this method, when a parser encounters an error, it performs the necessary correction on the
remaining input so that the rest of the input statement allows the parser to parse ahead.
● The correction can be deletion of extra semicolons, replacing the comma with semicolons, or
inserting a missing semicolon.
● While performing correction, utmost care should be taken for not going in an infinite loop.
● A disadvantage is that it is difficult to handle situations where the actual error occurred before
the point of detection.
3. Error production
● If a user has knowledge of common errors that can be encountered then, these errors can be
incorporated by augmenting the grammar with error productions that generate erroneous
constructs.
● If this is used then, during parsing appropriate error messages can be generated and parsing
can be continued.
● The disadvantage is that it’s difficult to maintain.
4. Global Correction
● The parser examines the whole program and tries to find out the closest match for it which is
error-free.
● The closest-match program requires the fewest insertions, deletions, and changes of tokens to
recover from the erroneous input.
● Due to high time and space complexity, this method is not implemented practically.
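As referenced under panic-mode recovery above, a sketch of that strategy in C; the next_token() scanner interface and the end-of-input token code are assumptions of this sketch:

extern int next_token(void);   /* assumed scanner interface            */
#define EOF_TOKEN (-1)         /* illustrative end-of-input token code */

/* On detecting a syntax error, discard tokens until a synchronizing
 * token (here ';' or '}') or end of input is reached, then resume. */
void panic_mode_recover(void) {
    int tok = next_token();
    while (tok != ';' && tok != '}' && tok != EOF_TOKEN)
        tok = next_token();               /* skip the offending input  */
    /* parsing resumes after the synchronizing token */
}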
Semantic errors
These errors are detected during the semantic analysis phase. Typical semantic errors are
● Incompatible type of operands
● Undeclared variables
● Actual arguments that do not match the formal parameters
Example : int a[10], b;
.......
.......
a = b;
It generates a semantic error because of an incompatible type of a and b.
Error recovery for Semantic errors
● If the error “Undeclared Identifier” is encountered then, to recover from this a symbol table
entry for the corresponding identifier is made.
● If data types of two operands are incompatible then, automatic type conversion is done by the
compiler.
Advantages:
Improved code quality: Error detection and recovery in a compiler can improve the overall quality of
the code produced. This is because errors can be identified early in the compilation process and
addressed before they become bigger issues.
Increased productivity: Error recovery can also increase productivity by allowing the compiler to
continue processing the code after an error is detected. This means that developers do not have to stop
and fix every error manually, saving time and effort.
Better user experience: Error recovery can also improve the user experience of software
applications. When errors are handled gracefully, users are less likely to become frustrated and are
more likely to continue using the application.
Better debugging: Error recovery in a compiler can help developers to identify and debug errors
more efficiently. By providing detailed error messages, the compiler can assist developers in
pinpointing the source of the error, saving time and effort.
Consistent error handling: Error recovery ensures that all errors are handled in a consistent manner,
which can help to maintain the quality and reliability of the software being developed.
Reduced maintenance costs: By detecting and addressing errors early in the development process,
error recovery can help to reduce maintenance costs associated with fixing errors in later stages of the
software development lifecycle.
Improved software performance: Error recovery can help to identify and address code that may
cause performance issues, such as memory leaks or inefficient algorithms. By improving the
performance of the code, the overall performance of the software can be improved as well.
Disadvantages:
Slower compilation time: Error detection and recovery can slow down the compilation process,
especially if the recovery mechanism is complex. This can be an issue in large software projects
where the compilation time can be a bottleneck.
Increased complexity: Error recovery can also increase the complexity of the compiler, making it
harder to maintain and debug. This can lead to additional development costs and longer development
times.
Risk of silent errors: Error recovery can sometimes mask errors in the code, leading to silent errors
that go unnoticed. This can be particularly problematic if the error affects the behavior of the software
application in subtle ways.
Potential for incorrect recovery: If the error recovery mechanism is not implemented correctly, it
can potentially introduce new errors or cause the code to behave unexpectedly.
Dependency on the recovery mechanism: If developers rely too heavily on the error recovery
mechanism, they may become complacent and not thoroughly check their code for errors. This can
lead to errors being missed or not addressed properly.
Difficulty in diagnosing errors: Error recovery can make it more difficult to diagnose and debug
errors since the error message may not accurately reflect the root cause of the issue. This can make it
harder to fix errors and may lead to longer development times.
Compatibility issues: Error recovery mechanisms may not be compatible with certain programming
languages or platforms, leading to issues with portability and cross-platform development.
Unit –IV Code Generation
Intermediate code generation: Declarations, Assignment statements, Boolean expressions, Case
statements, Back patching, Procedure calls Code Generation: Issues in the design of code generator,
Basic block and flow graphs, Register allocation and assignment, DAG representation of basic blocks,
peephole optimization, generating code from DAG.
L25-27 PPT
L28
Issues in the design of a code generator
Code generator converts the intermediate representation of source code into a form that can be readily
executed by the machine. A code generator is expected to generate correct code. The code generator
should be designed in such a way that it can be easily implemented, tested, and maintained.
The following issues arise during the code generation phase:
Input to code generator – The input to the code generator is the intermediate code generated by the
front end, along with information in the symbol table that determines the run-time addresses of the
data objects denoted by the names in the intermediate representation. Intermediate codes may be
represented mostly in quadruples, triples, indirect triples, Postfix notation, syntax trees, DAGs, etc.
The code generation phase proceeds on the assumption that the input is free from all syntactic and
static semantic errors, that the necessary type checking has taken place, and that the type-conversion
operators have been inserted wherever necessary.
● Target program: The target program is the output of the code generator. The output may be
absolute machine language, relocatable machine language, or assembly language.
● Absolute machine language as output has the advantages that it can be placed in a
fixed memory location and can be immediately executed. For example, WATFIV is a
compiler that produces the absolute machine code as output.
● Relocatable machine language as an output allows subprograms and subroutines to be
compiled separately. Relocatable object modules can be linked together and loaded by
a linking loader. But there is added expense of linking and loading.
● Assembly language as output makes the code generation easier. We can generate
symbolic instructions and use the macro-facilities of assemblers in generating code.
And we need an additional assembly step after code generation.
● Memory Management – Mapping the names in the source program to the addresses of data
objects is done by the front end and the code generator. A name in the three address
statements refers to the symbol table entry for the name. Then from the symbol table entry, a
relative address can be determined for the name.
Instruction selection – Selecting the best instructions improves the efficiency of the program. The
target instruction set should be complete and uniform. Instruction speeds and machine idioms also
play a major role when efficiency is considered. But if we do not care about the efficiency of the
target program, instruction selection is straightforward: each three-address statement is translated
with a fixed template. For example, the three-address statements below would be translated into the
code sequence that follows them:
P:=Q+R
S:=P+T
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
Here the fourth statement (MOV P, R0) is redundant: it reloads the value of P that was stored by the
previous statement, which leads to an inefficient code sequence. A given intermediate representation
can be translated into many code sequences, with significant cost differences between the
implementations. Prior knowledge of instruction costs is needed in order to design good sequences,
but accurate cost information is often difficult to obtain.
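This redundancy can be avoided if the code generator remembers which name currently resides in the register. A minimal C++ sketch of the idea (the ThreeAddr struct, the single register R0, and the restriction to the + operator are simplifying assumptions made only for illustration):
#include <iostream>
#include <string>
#include <vector>

// Illustrative three-address statement: result := left + right
struct ThreeAddr { std::string result, left, right; };

int main() {
    std::vector<ThreeAddr> code = { {"P", "Q", "R"}, {"S", "P", "T"} };
    std::string inR0 = "";                    // name whose value currently sits in R0

    for (const auto& s : code) {
        if (s.left != inR0)                   // reload only if the operand is not already in R0
            std::cout << "MOV " << s.left << ", R0\n";
        std::cout << "ADD " << s.right << ", R0\n";
        std::cout << "MOV R0, " << s.result << "\n";
        inR0 = s.result;                      // R0 now holds the freshly computed result
    }
    return 0;
}
For the two statements above this emits only five instructions, dropping the redundant MOV P, R0.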
L29
Basic block and flow graphs
A basic block is a straight-line code sequence with no branches in except at the entry and no branches
out except at the end. In other words, a basic block is a set of statements that always execute one after
another, in sequence.
The first task is to partition a sequence of three-address codes into basic blocks. A new basic block is
begun with the first instruction and instructions are added until a jump or a label is met. In the absence
of a jump, control moves further consecutively from one instruction to another. The idea is
standardized in the algorithm below:
Algorithm: Partitioning three-address code into basic blocks.
Input: A sequence of three address instructions.
Process: Determine which instructions of the intermediate code are leaders. The following rules are
used for finding a leader:
1. The first three-address instruction of the intermediate code is a leader.
2. Instructions that are targets of unconditional or conditional jump/goto statements are leaders.
3. Instructions that immediately follow unconditional or conditional jump/goto statements are
considered leaders.
For each leader thus determined, its basic block consists of the leader itself and all instructions up to,
but not including, the next leader.
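A minimal C++ sketch of this partitioning, assuming each three-address instruction has been reduced to a flag that marks jumps and, for jumps, the 0-based index of the target (the Instr struct and the sample instruction stream are illustrative):
#include <iostream>
#include <iterator>
#include <set>
#include <vector>

// Illustrative instruction: isJump marks conditional/unconditional jumps,
// target is the index of the instruction jumped to (meaningful only if isJump).
struct Instr { bool isJump; int target; };

int main() {
    std::vector<Instr> code = { {false,0}, {false,0}, {false,0}, {true,1}, {false,0}, {true,0} };

    std::set<int> leaders;
    leaders.insert(0);                               // rule 1: first instruction
    for (int i = 0; i < (int)code.size(); ++i) {
        if (code[i].isJump) {
            leaders.insert(code[i].target);          // rule 2: target of a jump
            if (i + 1 < (int)code.size())
                leaders.insert(i + 1);               // rule 3: instruction following a jump
        }
    }
    // Each basic block runs from a leader up to, but not including, the next leader.
    for (auto it = leaders.begin(); it != leaders.end(); ++it) {
        auto next = std::next(it);
        int end = (next == leaders.end()) ? (int)code.size() : *next;
        std::cout << "Block: instructions " << *it << " to " << end - 1 << "\n";
    }
    return 0;
}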
Example 1:
The following sequence of three-address statements forms a basic block:
t1 := a*a
t2 := a*b
t3 := 2*t2
t4 := t1+t3
t5 := b*b
t6 := t4 +t5
A three address statement x:= y+z is said to define x and to use y and z. A name in a basic block is
said to be live at a given point if its value is used after that point in the program, perhaps in another
basic block.
Example 2:
Intermediate code to set a 10*10 matrix to an identity matrix:
1) i=1 //Leader 1 (First statement)
2) j=1 //Leader 2 (Target of 11th statement)
3) t1 = 10 * i //Leader 3 (Target of 9th statement)
4) t2 = t1 + j
5) t3 = 8 * t2
6) t4 = t3 - 88
7) a[t4] = 0.0
8) j = j + 1
9) if j <= 10 goto (3)
10) i = i + 1 //Leader 4 (Immediately following Conditional goto statement)
11) if i <= 10 goto (2)
12) i = 1 //Leader 5 (Immediately following Conditional goto statement)
13) t5 = i - 1 //Leader 6 (Target of 17th statement)
14) t6 = 88 * t5
15) a[t6] = 1.0
16) i = i + 1
17) if i <= 10 goto (13)
The given code converts a matrix into an identity matrix, i.e. a matrix with all diagonal elements 1 and
all other elements 0. Statements (3)-(7) compute an element's address and set the element to 0, while
statements (13)-(15) set a diagonal element to 1; these statements are executed repeatedly through the
goto statements.
There are 6 Basic Blocks in the above code :
B1) Statement 1
B2) Statement 2
B3) Statement 3-9
B4) Statement 10-11
B5) Statement 12
B6) Statement 13-17
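Reading the jumps in the code above, the flow graph over these blocks has the following edges (exit denotes the end of the code):
B1 → B2
B2 → B3
B3 → B3 (back edge from statement 9) and B3 → B4
B4 → B2 (back edge from statement 11) and B4 → B5
B5 → B6
B6 → B6 (back edge from statement 17) and B6 → exit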
L30
Register allocation and assignment
Registers are the fastest locations in the memory hierarchy, but unfortunately this resource is limited;
registers are among the most constrained resources of the target processor. Register allocation is an
NP-complete problem. However, the problem can be reduced to graph coloring to achieve allocation
and assignment. A good register allocator therefore computes an effective approximate solution to a
hard problem.
Figure – Input-Output of the register allocator
The register allocator determines which values will reside in a register and which register will hold
each of those values. It takes as input a program that uses an arbitrary number of (virtual) registers
and produces a program that uses only the finite register set of the target machine.
Allocation vs Assignment:
Allocation –
Maps an unlimited name space onto the register set of the target machine.
● Reg. to Reg. Model: Maps virtual registers to physical registers and spills the excess to memory.
● Mem. to Mem. Model: Maps some subset of the memory locations to a set of names that
models the physical register set.
Allocation ensures that the code will fit the target machine’s register set at each instruction.
Assignment –
Maps an allocated name set to the physical register set of the target machine.
● Assumes allocation has been done so that code will fit into the set of physical registers.
● No more than ‘k’ values are designated into the registers, where ‘k’ is the no. of physical
registers.
General register allocation is an NP-complete problem:
● Solved in polynomial time, when (no. of required registers) <= (no. of available physical
registers).
● An assignment can be produced in linear time using Interval-Graph Coloring.
Local Register Allocation And Assignment:
Allocation just inside a basic block is called Local Reg. Allocation. Two approaches for local reg.
allocation: Top-down approach and bottom-up approach.
The Top-Down Approach is a simple approach based on ‘Frequency Count’: it identifies which values
should be kept in registers and which should be kept in memory.
Algorithm:
1. Compute a priority for each virtual register.
2. Sort the registers into priority order.
3. Assign registers in priority order.
4. Rewrite the code.
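A minimal C++ sketch of this top-down, frequency-count scheme (the virtual register names, their usage counts, and the number k of physical registers are illustrative values):
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Step 1: usage frequency of each virtual register inside the block (illustrative data).
    std::map<std::string, int> freq = { {"v1", 7}, {"v2", 3}, {"v3", 5}, {"v4", 1} };
    const int k = 2;                                  // number of physical registers available

    // Step 2: sort the virtual registers by priority (here, raw frequency).
    std::vector<std::pair<std::string, int>> order(freq.begin(), freq.end());
    std::sort(order.begin(), order.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });

    // Steps 3-4: the k most frequently used values get registers, the rest stay in memory.
    for (int i = 0; i < (int)order.size(); ++i) {
        if (i < k)
            std::cout << order[i].first << " -> R" << i << "\n";
        else
            std::cout << order[i].first << " -> memory (spilled)\n";
    }
    return 0;
}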
Moving beyond single Blocks:
● More complicated because the control flow enters the picture.
● Liveness and Live Ranges: A live range consists of a set of definitions and uses of a value that
are related to each other; two live ranges that overlap cannot share the same register.
Following is a way to describe live ranges in a block: a live range is represented as an interval [i,j],
where i is the point of definition and j is the last use.
Global Register Allocation and Assignment:
1. The main issue of a register allocator is minimizing the impact of spill code;
● Execution time for spill code.
● Code space for spill operation.
● Data space for spilled values.
2. Global allocation can’t guarantee an optimal solution for the execution time of spill code.
3. Prime differences between Local and Global Allocation:
● The structure of a global live range is naturally more complex than the local one.
● Within a global live range, distinct references may execute a different number of times.
(When basic blocks form a loop)
4. To make the decision about allocation and assignments, the global allocator mostly uses graph
coloring by building an interference graph.
5. Register allocator then attempts to construct a k-coloring for that graph where ‘k’ is the no. of
physical registers.
● In case, the compiler can’t directly construct a k-coloring for that graph, it modifies the
underlying code by spilling some values to memory and tries again.
● Spilling actually simplifies that graph which ensures that the algorithm will halt.
6. The global allocator can use several approaches; here we consider top-down and bottom-up
allocation strategies. Subproblems associated with these approaches include:
● Discovering Global live ranges.
● Estimating Spilling Costs.
● Building an Interference graph.
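The coloring step itself can be sketched as follows, assuming the interference graph has already been built (the graph and k below are illustrative). Each live range greedily takes the lowest color not used by an already-colored neighbour; when no color is free the value is marked for spilling. A production allocator would then insert spill code and rebuild the graph, as described above:
#include <iostream>
#include <vector>

int main() {
    const int k = 3;                                  // number of physical registers
    // Interference graph: live range i interferes with every live range in adj[i] (illustrative).
    std::vector<std::vector<int>> adj = { {1, 2}, {0, 2, 3}, {0, 1, 3}, {1, 2} };

    std::vector<int> color(adj.size(), -1);           // -1 means not yet colored
    for (int v = 0; v < (int)adj.size(); ++v) {
        std::vector<bool> used(k, false);
        for (int u : adj[v])
            if (color[u] >= 0) used[color[u]] = true; // colors already taken by neighbours
        int c = 0;
        while (c < k && used[c]) ++c;
        if (c == k)
            std::cout << "live range " << v << ": spill to memory\n";
        else {
            color[v] = c;
            std::cout << "live range " << v << " -> R" << c << "\n";
        }
    }
    return 0;
}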
L31
DAG representation of basic blocks
The Directed Acyclic Graph (DAG) is used to represent the structure of a basic block, to visualize the
flow of values between the statements of the block, and to support optimization techniques within the
basic block. To apply an optimization technique to a basic block, a DAG is constructed from the
three-address code produced by intermediate code generation.
● Directed acyclic graphs are a type of data structure and they are used to apply transformations
to basic blocks.
● The Directed Acyclic Graph (DAG) facilitates the transformation of basic blocks.
● DAG is an efficient method for identifying common sub-expressions.
● It demonstrates how the statement’s computed value is used in subsequent statements.
Examples of directed acyclic graph :
Expression 2: T1 = T0 + c
Expression 3 : d = T0 + T1
Optimized code:
y = x + 5;
w = y * 3; // there is no i now
// We have removed two redundant variables, i and z, whose values were simply being copied from
// one variable to another.
B. Constant folding: Expressions whose operands are all constants are evaluated by the compiler
itself. Computations that would otherwise be performed at run time are replaced with their
precomputed results, avoiding additional computation.
Initial code:
x = 2 * 3;
Optimized code:
x = 6;
C. Strength Reduction: The operators that consume higher execution time are replaced by the
operators consuming less execution time.
Initial code:
y = x * 2;
Optimized code:
y = x + x; or y = x << 1;
Initial code:
y = x / 2;
Optimized code:
y = x >> 1;
D. Null sequences/ Simplify Algebraic Expressions : Useless operations are deleted.
a := a + 0;
a := a * 1;
a := a/1;
a := a - 0;
E. Combine operations: Several operations are replaced by a single equivalent operation.
F. Deadcode Elimination:- Dead code refers to portions of the program that are never executed or do
not affect the program’s observable behavior. Eliminating dead code helps improve the efficiency and
performance of the compiled program by reducing unnecessary computations and memory usage.
Initial Code:-
int Dead(void)
{
int a=10;
int z=50;
int c;
c=z*5;
printf("%d", c);
a=20;
a=a*10; // dead code: these two assignments are never used afterwards
return 0;
}
Optimized Code:-
int Dead(void)
{
int a=10;
int z=50;
int c;
c=z*5;
printf("%d", c);
return 0;
}
L33
Generating code from DAG.
Language translation: Three address code can also be used to translate code from one programming
language to another. By translating code to a common intermediate representation, it becomes easier
to translate the code to multiple target languages.
General representation –
a = b op c
where a, b, and c represent operands such as names, constants, or compiler-generated temporaries,
and op represents the operator.
Example-1: Convert the expression a * – (b + c) into three address code.
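One possible translation, writing uminus for the unary minus operator, is:
t1 = b + c
t2 = uminus t1
t3 = a * t2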
3. Indirect Triples – This representation makes use of pointers to a separately stored list of all
references to computations. It is similar in utility to the quadruple representation but requires less
space; temporaries are implicit, and it is easier to rearrange code.
Example – Consider expression a = b * – c + b * – c
Question – Write quadruple, triples and indirect triples for following expression : (x + y) * (y + z) + (x
+ y + z)
Explanation – The three address code is:
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
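A worked sketch of the three representations for this code (the statement numbers 35-39 in the indirect-triple list are illustrative):
Quadruples:
      op     arg1    arg2    result
(0)   +      x       y       t1
(1)   +      y       z       t2
(2)   *      t1      t2      t3
(3)   +      t1      z       t4
(4)   +      t3      t4      t5
Triples (results are referred to by position, so temporaries are implicit):
      op     arg1    arg2
(0)   +      x       y
(1)   +      y       z
(2)   *      (0)     (1)
(3)   +      (0)     z
(4)   +      (2)     (3)
Indirect Triples (a separate list of pointers into the triple table; statements can be reordered by rearranging this list without touching the triples):
(35)  (0)
(36)  (1)
(37)  (2)
(38)  (3)
(39)  (4)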
Unit –V Code Optimization
Introduction to Code optimization: sources of optimization of basic blocks,
loops in flow graphs, dead code elimination, loop optimization, Introduction to
global data flow analysis, Code Improving transformations, Data flow analysis
of structure flow graph Symbolic debugging of optimized code.
L34
Introduction to Code optimization:
The code optimization in the synthesis phase is a program transformation technique which tries to
improve the intermediate code by making it consume fewer resources (i.e., CPU and memory) so that
faster-running machine code will result. The compiler optimization process should meet the following
objectives:
● The optimization must be correct; it must not, in any way, change the meaning of the
program.
● Optimization should increase the speed and performance of the program.
● The compilation time must be kept reasonable.
● The optimization process should not delay the overall compiling process.
When to Optimize?
Optimization of the code is often performed at the end of the development stage since it reduces
readability and adds code that is used to increase the performance.
Why Optimize?
Optimizing the underlying algorithm is beyond the scope of the code optimization phase; instead, the
generated program itself is optimized, and this may also involve reducing the size of the code. So,
optimization helps to:
● Reduce the space consumed by the code and increase its execution speed.
● Manually analysing datasets takes a lot of time, so we use software such as Tableau for data
analysis; similarly, manually performing the optimization is tedious and is better done using a
code optimizer.
● An optimized code often promotes re-usability.
Types of Code Optimization: The optimization process can be broadly classified into two types:
1. Machine Independent Optimization: This code optimization phase attempts to improve
the intermediate code to get a better target code as the output. The part of the intermediate
code which is transformed here does not involve any CPU registers or absolute memory
locations.
2. Machine Dependent Optimization: Machine-dependent optimization is done after the target
code has been generated and when the code is transformed according to the target machine
architecture. It involves CPU registers and may have absolute memory references rather than
relative references. Machine-dependent optimizers put efforts to take maximum advantage of
the memory hierarchy.
Structure-Preserving Transformations:
The structure-preserving transformation on basic blocks includes:
1. Dead Code Elimination
2. Common Subexpression Elimination
3. Renaming of Temporary variables
4. Interchange of two independent adjacent statements
L35
Loops in flow graphs:
In the context of code optimization and program analysis, flow graphs are graphical representations
that depict the control flow of a program. Loops in flow graphs are visual representations of loops or
repetitive structures in the source code. Understanding loops in flow graphs is essential for analysing
the program's behaviour, identifying optimization opportunities, and improving the efficiency of the
code.
1. Basic Block:
● A basic block is a sequence of consecutive statements in which flow control enters at
the beginning and leaves at the end without any internal branches except at the end.
Basic blocks are the building blocks of flow graphs.
2. Flow Graph:
● A flow graph represents the control flow of a program using nodes and directed
edges. Nodes typically correspond to basic blocks, and edges represent the flow of
control between these blocks. Flow graphs help visualize the execution path of a
program.
3. Loop:
● A loop is a structure in a program that allows a set of statements to be executed
repeatedly based on a certain condition. In a flow graph, a loop is identified by a back
edge, which is a directed edge from a node inside the loop back to the loop header (the
loop's entry node).
4. Back Edge:
● A back edge is an edge in the flow graph that connects a node to an ancestor node in
the control flow hierarchy. In the context of loops, a back edge connects the end of the
loop body back to the loop header, indicating the loop's cyclic nature.
5. Loop Header:
● The loop header is the entry point of a loop. It is the node that is the target of the back
edge. The loop header typically contains the conditional branch that determines
whether the loop should continue or exit.
6. Loop Body:
● The loop body consists of the nodes and edges within the loop. It represents the set of
statements that are executed iteratively as long as the loop condition holds true.
7. Exit Node:
● The exit node is a node within the loop that leads to the loop's exit. It is the point
where the loop terminates, and control flow moves outside the loop.
1.Dead Code Elimination:
Dead code is that part of the program which never executes during program execution, so it is
eliminated as an optimization. Eliminating dead code reduces the size of the generated code and
increases the speed of the program, since the compiler does not have to translate code that can never
run.
Example:
// Program with Dead code
int main()
{
int x = 2;
if (x > 2)
cout << "code"; // Dead code
else
cout << "Optimization";
return 0;
}
// Optimized Program without dead code
int main()
{
int x = 2;
cout << "Optimization"; // Dead Code Eliminated
return 0;
}
2.Common Subexpression Elimination:
In this technique, common sub-expressions that are used repeatedly are computed only once and
reused when needed. A DAG (Directed Acyclic Graph) is used to identify and eliminate common
subexpressions.
Example:
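A small illustrative three-address fragment (assumed here for demonstration): the common sub-expression a + b is computed once, and its second occurrence is replaced by a copy.
t1 = a + b
t2 = a + b
t3 = t1 * t2
//After common subexpression elimination
t1 = a + b
t2 = t1
t3 = t1 * t2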
3.Renaming of Temporary Variables:
Statements containing instances of a temporary variable can be changed to instances of a new
temporary variable without changing the basic block value.
Example: Statement t = a + b can be changed to x = a + b where t is a temporary variable and x is a
new temporary variable without changing the value of the basic block.
4.Interchange of Two Independent Adjacent Statements:
If a block has two adjacent statements that are independent of each other, they can be interchanged
without affecting the value of the basic block.
Example:
t1 = a + b
t2 = c + d
These two independent statements of a block can be interchanged without affecting the value of the
block.
Algebraic Transformation:
Countless algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set. Some of the algebraic transformation on basic blocks
includes:
1. Constant Folding
2. Copy Propagation
3. Strength Reduction
1. Constant Folding:
Evaluate expressions whose operands are constants at compile time, so that the compiler does not
need to compute them at run time.
Example:
x = 2 * 3 + y ⇒ x = 6 + y (Optimized code)
2. Copy Propagation:
It is of two types, Variable Propagation, and Constant Propagation.
Variable Propagation:
x = y
z = x + 2 ⇒ z = y + 2 (Optimized code)
Constant Propagation:
x = 3
z = x + a ⇒ z = 3 + a (Optimized code)
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
L36
Loop Optimization:
Loop optimization includes the following strategies:
1. Code motion & Frequency Reduction
2. Induction variable elimination
3. Loop merging/combining
4. Loop Unrolling
1. Code Motion & Frequency Reduction
Move loop invariant code outside of the loop.
// Program with loop-invariant code inside the loop
int main()
{
for (i = 0; i < n; i++) {
x = 10;
y = y + i;
}
return 0;
}
// Program with loop-invariant code moved outside the loop
int main()
{
x = 10;
for (i = 0; i < n; i++)
y = y + i;
return 0;
}
2. Induction Variable Elimination:
Eliminate various unnecessary induction variables used in the loop.
// Program with multiple induction variables
int main()
{
i1 = 0;
i2 = 0;
for (i = 0; i < n; i++) {
A[i1++] = B[i2++];
}
return 0;
}
// Program with one induction variable
int main()
{
for (i = 0; i < n; i++) {
A[i] = B[i]; // Only one induction variable
}
return 0;
}
3. Loop Merging/Combining:
If the operations performed can be done in a single loop then, merge or combine the loops.
// Program with multiple loops
int main()
{
for (i = 0; i < n; i++)
A[i] = i + 1;
for (j = 0; j < n; j++)
B[j] = j - 1;
return 0;
}
// Program with one loop when multiple loops are merged
int main()
{
for (i = 0; i < n; i++) {
A[i] = i + 1;
B[i] = i - 1;
}
return 0;
}
4. Loop Unrolling:
If a loop executes a small, fixed number of times, it can be replaced by straight-line code that repeats
the loop body, which removes the loop-control overhead.
// Program with loops
int main()
{
for (i = 0; i < 3; i++)
cout << "Cd";
return 0;
}
// Program with simple code without loops
int main()
{
cout << "Cd";
cout << "Cd";
cout << "Cd";
return 0;
}
L37
Introduction to global data flow analysis
Global Data Flow Analysis (also known as Global Flow Analysis or Global Data-Flow Analysis) is a
compiler optimization technique used to analyze the flow of data throughout an entire program. It
provides insights into how values are computed and propagated across different parts of the program,
helping compilers make informed decisions to optimize the code for performance, memory usage, and
other aspects.
Here's an introduction to Global Data Flow Analysis:
1. Objective:
● The primary goal of Global Data Flow Analysis is to gather information about how
data is used and modified throughout the entire program. By understanding how
values flow through variables and expressions, compilers can make optimizations that
improve performance, reduce memory usage, and eliminate redundant computations.
2. Scope:
● Global Data Flow Analysis considers the entire program rather than focusing on
individual functions or basic blocks. It takes into account the relationships and
dependencies between variables and expressions across different parts of the program.
3. Data Flow Graph:
● The analysis often involves constructing a data flow graph that represents the flow of
values between variables and expressions. Nodes in the graph represent program
points, and edges represent the flow of data from one point to another. This graph
provides a visual representation of how data propagates through the program.
4. Reaching Definitions:
● One common aspect of Global Data Flow Analysis is the identification of reaching
definitions. A reaching definition is a point in the program where a variable is
defined, and its value may reach a certain program point during execution. This
information is crucial for understanding how values are computed and used across
different parts of the code.
5. Uses and Optimizations:
● Global Data Flow Analysis is used by compilers to perform various optimizations,
such as:
● Dead Code Elimination: Identifying and removing code that does not
contribute to the final output, improving program efficiency.
● Common Subexpression Elimination: Identifying repeated computations
and replacing them with a single computation, reducing redundant work.
● Constant Propagation: Propagating constant values through the program to
eliminate unnecessary computations and simplify expressions.
● Register Allocation: Optimizing the allocation of registers for variables
based on their usage throughout the program.
6. Iterative Algorithms:
● Global Data Flow Analysis often employs iterative algorithms to refine the
information gathered about data flow. Algorithms such as the worklist algorithm or
iterative data flow analysis are commonly used to repeatedly update the data-flow sets
until a stable solution (a fixed point) is reached; a minimal sketch of such a pass is
given after this list.
7. Challenges:
● Analyzing data flow globally can be computationally expensive, especially for large
programs. Balancing the precision of the analysis with the computational cost is a key
challenge in implementing effective Global Data Flow Analysis.
8. Interprocedural Analysis:
● Global Data Flow Analysis can also be extended to analyze data flow across different
functions in the program, known as interprocedural analysis. This provides a more
comprehensive understanding of how data is exchanged between functions.
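A minimal C++ sketch of an iterative reaching-definitions pass, assuming the gen and kill bit sets have already been computed for each basic block; the four-block flow graph and its sets are illustrative. IN[B] is the union of OUT over B's predecessors, OUT[B] = gen[B] ∪ (IN[B] - kill[B]), and the loop repeats until nothing changes:
#include <bitset>
#include <iostream>
#include <vector>

int main() {
    const int NDEFS = 4;                              // total number of definitions (illustrative)
    // Flow graph: pred[b] lists the predecessors of block b (illustrative shape).
    std::vector<std::vector<int>> pred = { {}, {0, 2}, {1}, {2} };
    std::vector<std::bitset<NDEFS>> gen  = { 0b0001, 0b0010, 0b0100, 0b1000 };
    std::vector<std::bitset<NDEFS>> kill = { 0b0010, 0b0001, 0b0000, 0b0100 };

    int n = (int)pred.size();
    std::vector<std::bitset<NDEFS>> in(n), out(n);
    bool changed = true;
    while (changed) {                                 // iterate until a fixed point is reached
        changed = false;
        for (int b = 0; b < n; ++b) {
            std::bitset<NDEFS> newIn;
            for (int p : pred[b]) newIn |= out[p];                   // IN[b] = union of OUT[pred]
            std::bitset<NDEFS> newOut = gen[b] | (newIn & ~kill[b]); // OUT[b] = gen U (IN - kill)
            if (newIn != in[b] || newOut != out[b]) changed = true;
            in[b] = newIn;
            out[b] = newOut;
        }
    }
    for (int b = 0; b < n; ++b)
        std::cout << "B" << b << ": IN=" << in[b] << " OUT=" << out[b] << "\n";
    return 0;
}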
L38
Code Improving transformations
Code improving transformations, also known as code optimizations, refer to techniques and strategies
used to enhance the performance, maintainability, and efficiency of software programs. These
transformations aim to produce code that executes faster, uses fewer resources, and is easier to
understand. Here are several common code improving transformations:
Code Optimization is done in the following different ways:
1. Compile Time Evaluation:
(i) A = 2*(22.0/7.0)*r
Perform 2*(22.0/7.0)*r at compile time.
(ii) x = 12.4
y = x/2.3
Evaluate x/2.3 as 12.4/2.3 at compile time.
2. Variable Propagation:
//Before Optimization
c = a * b
x = a
...
d = x * b + 4
//After Optimization
c = a * b
x = a
...
d = a * b + 4
3. Constant Propagation:
● If the value of a variable is known to be a constant, replace the variable with that constant.
The replacement is valid only where the variable is guaranteed to hold that constant value.
Example:
(i) A = 2*(22.0/7.0)*r
Performs 2*(22.0/7.0)*r at compile time.
(ii) x = 12.4
y = x/2.3
Evaluates x/2.3 as 12.4/2.3 at compile time.
(iii) int k=2;
if(k) go to L3;
It is evaluated as :
go to L3 ( Because k = 2 which implies condition is always true)
4. Constant Folding:
● Consider an expression : a = b op c and the values b and c are constants, then the value of a
can be computed at compile time.
Example:
#define k 5
x=2*k
y=k+5
This can be computed at compile time and the values of x and y are :
x = 10
y = 10
Dead Code Elimination:
● An assignment whose result is never used afterwards is dead and can be removed; here, after
propagation, x = a is eliminated.
//Before elimination
c = a * b
x = a
...
d = a * b + 4
//After elimination :
c = a * b
...
d = a * b + 4
Unreachable Code Elimination:
● Code that control flow can never reach (for example, statements after a return) is removed.
Example (C++):
#include <iostream>
using namespace std;
int main() {
int num;
num=10;
cout << "GFG!";
return 0;
cout << num; //unreachable code
}
//after elimination of unreachable code
int main() {
int num;
num=10;
cout << "GFG!";
return 0;
}
9. Function Inlining:
● Here, a function call is replaced by the body of the function itself.
● This saves a lot of time in copying all the parameters, storing the return address, etc.
10. Function Cloning:
● Here, specialized codes for a function are created for different calling parameters.
● Example: Function Overloading
11. Induction Variable and Strength Reduction:
● An induction variable is a variable that is updated inside the loop by an assignment of the form
i = i + constant. Handling induction variables is a kind of loop optimization technique.
● Strength reduction means replacing a high-strength (expensive) operator with a lower-strength
(cheaper) one.
Examples:
Example 1 :
Multiplication with powers of 2 can be replaced by shift left operator which is less
expensive than multiplication
a=a*16
// Can be modified as :
a = a<<4
Example 2 :
i = 1;
while (i < 10)
{
y = i * 4;
i = i + 1;
}
//After Reduction
i = 1;
t = 4;
while (t < 40)
{
y = t;
t = t + 4;
}
Loop Optimization Techniques:
1. Code Motion (Frequency Reduction):
● A computation that yields the same result on every iteration (loop-invariant code) is moved out
of the loop so that it is evaluated only once; the optimized form is shown after this code.
//Before optimization: b = x + y is recomputed on every iteration
a = 200;
while (a > 0)
{
b = x + y;
if (a % b == 0)
printf("%d", a);
}
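Moving the invariant assignment out of the loop gives:
//After optimization: b is computed only once, before the loop
a = 200;
b = x + y;
while (a > 0)
{
if (a % b == 0)
printf("%d", a);
}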
2. Loop Jamming:
● Two or more loops are combined into a single loop. This reduces the loop overhead, since the
loop test and increment are executed once per iteration instead of once per loop.
Example:
//Before loop jamming
for(int k=0;k<10;k++)
{
x = k*2;
}
for(int k=0;k<10;k++)
{
y = k+3;
}
//After loop jamming
for(int k=0;k<10;k++)
{
x = k*2;
y = k+3;
}
3. Loop Unrolling:
● It helps in optimizing the execution time of the program by reducing the iterations.
● It increases the program’s speed by eliminating the loop control and test instructions.
Example:
for(int i=0;i<2;i++)
{
printf("Hello");
}
printf("Hello");
printf("Hello");