
Unit-I Introduction to compiling & Lexical Analysis

Introduction of Compiler, Major data Structure in compiler, types of Compilers, Front-end and
Backend of compiler, Compiler structure: analysis-synthesis model of compilation, various phases of
a compiler, Lexical analysis: Input buffering, Specification & Recognition of Tokens, Design of a
Lexical Analyzer Generator, LEX.
L1
Introduction to Compiler Design

The compiler is software that converts a program written in a high-level language (the source language)
into a low-level language (the object/target/machine language, i.e. 0s and 1s).
A translator or language processor is a program that translates an input program written in a
programming language into an equivalent program in another language.
The compiler is a type of translator, which takes a program written in a high-level programming
language as input and translates it into an equivalent program in low-level languages such as machine
language or assembly language.
The program written in a high-level language is known as a source program, and the program
converted into a low-level language is known as an object (or target) program. Without compilation,
no program written in a high-level language can be executed. For every programming language, we
have a different compiler; however, the basic tasks performed by every compiler are the same.
The process of translating the source code into machine code involves several stages, including lexical
analysis, syntax analysis, semantic analysis, code generation, and optimization.
A compiler is a more intelligent program than an assembler: it verifies all kinds of limits, ranges,
errors, and so on.
A compiler takes more time to run and occupies a large amount of memory.
Compilation is slower than many other system programs because the compiler reads through the
entire program and then translates the whole program.
When a compiler runs on a machine and produces machine code for that same machine, it is called a
self-compiler or resident compiler.
When a compiler runs on one machine and produces machine code for a different machine, it is called
a cross compiler.

Various Data Structures Used in Compiler


A compiler uses several data structures to perform its operations; these data structures are needed by
the different phases of the compiler. The main data structures used in compilers are:
1. Tokens
2. Syntax Tree
3. Symbol Table
4. Literal Table
5. Parse Tree
1. Tokens
When the scanner gathers the stream of input characters into tokens, it usually represents each token
symbolically, as a value of an enumerated data type representing the set of tokens in the source
language. It is also important to keep the character string (the lexeme) and the information derived
from it together with the token.
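As a minimal sketch in C (the type names, token kinds, and buffer size here are illustrative assumptions,
not taken from the notes), a token can be held as an enumerated kind together with the lexeme it was
built from:

#include <stdio.h>

/* Illustrative token kinds for a small source language */
typedef enum { TOK_ID, TOK_NUMBER, TOK_PLUS, TOK_STAR, TOK_ASSIGN, TOK_SEMI } TokenKind;

/* A token keeps its kind plus the lexeme, i.e. the character string it was gathered from */
typedef struct {
    TokenKind kind;
    char lexeme[64];
} Token;

int main(void) {
    Token t = { TOK_ID, "rate" };
    printf("kind=%d lexeme=%s\n", t.kind, t.lexeme);
    return 0;
}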

2. Syntax Tree
A syntax tree is a tree data structure in which each leaf node represents an operand and each interior
node represents an operator. It is a dynamically allocated, pointer-based tree data structure that is
created as parsing proceeds; the parser can build it directly while it recognizes the input.

For example: the syntax tree for a + b*c (here * binds more tightly, so the * subtree is a child of the + node).
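A minimal pointer-based representation of such a tree can be sketched in C as below (the struct layout
and helper function are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Each interior node holds an operator; each leaf holds an operand name. */
typedef struct Node {
    char label[8];              /* "+", "*", "a", "b", "c", ... */
    struct Node *left, *right;  /* NULL for leaf nodes */
} Node;

Node *mk(const char *label, Node *l, Node *r) {
    Node *n = malloc(sizeof(Node));
    strcpy(n->label, label);
    n->left = l; n->right = r;
    return n;
}

int main(void) {
    /* Syntax tree for a + b * c: '*' binds tighter, so it becomes the right child of '+' */
    Node *tree = mk("+", mk("a", NULL, NULL),
                         mk("*", mk("b", NULL, NULL), mk("c", NULL, NULL)));
    printf("root=%s, right child=%s\n", tree->label, tree->right->label);
    return 0;
}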

3. Symbol Table
The symbol table is a data structure that is used to keep the information of identifiers, functions,
variables, constants, and data types. It is created and maintained by the compiler because it keeps the
information about the occurrence of entities. The symbol table is used in almost every phase of the
compiler, as can be seen in the diagram of the phases of a compiler. The scanner, parser, and
semantic phase may enter identifiers into the symbol table and the optimization and code generation
phase will access the symbol table to use the information provided by the symbol table to make
appropriate decisions. Given the frequency of access to the symbol table, the insertion, deletion, and
access operations should be well-optimized and efficient. The hash table is mainly used here.
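A minimal sketch of such a hash-based symbol table with chaining is given below (the table size, field
widths, and hash function are illustrative assumptions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUCKETS 211   /* a small prime, as in many textbook symbol tables */

typedef struct Symbol {
    char name[64];
    char type[16];           /* e.g. "int", "float" */
    struct Symbol *next;     /* chaining resolves hash collisions */
} Symbol;

static Symbol *table[BUCKETS];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % BUCKETS;
}

/* Insert the identifier if it is absent; return its entry either way. */
Symbol *lookup_or_insert(const char *name, const char *type) {
    unsigned h = hash(name);
    for (Symbol *p = table[h]; p; p = p->next)
        if (strcmp(p->name, name) == 0) return p;
    Symbol *s = malloc(sizeof(Symbol));
    strcpy(s->name, name);
    strcpy(s->type, type);
    s->next = table[h];
    table[h] = s;
    return s;
}

int main(void) {
    lookup_or_insert("a", "int");     /* entered by the scanner/parser */
    lookup_or_insert("b", "int");
    printf("a has type %s\n", lookup_or_insert("a", "?")->type);  /* later phases just look it up */
    return 0;
}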
4. Literal Table
A literal table is a data structure that is used to keep track of the literals (constants and strings) used in
the program. Each constant or string appears only once in the literal table, and its contents apply to the
whole program, which is why deletions are not necessary. The literal table allows constants and strings
to be reused, which plays an important role in reducing the program size.

5. Parse Tree
A parse tree is a hierarchical representation of the derivation of a string; its symbols may be terminals
or non-terminals. The string is derived from the start symbol, and the start symbol is the root of the
parse tree. All the leaf nodes are terminals and the inner nodes are non-terminals. The derived string
can be read back from the leaves from left to right.
For example: the parse tree for a + b*c.

The intermediate code generated by the compiler also needs a data structure to hold it.

6. Intermediate Code
Once the intermediate code is generated, it can be stored as a linked list of structures, a text file, or an
array of strings, depending on the kind of intermediate code produced. The choice of data structure
also affects how conveniently the optimization phase can work over the code.
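As an illustrative sketch (the record layout and field names are assumptions), three-address
intermediate code can be kept as an array of quadruple structures, which later phases can scan for
optimization and code generation:

#include <stdio.h>

/* One three-address instruction: result := arg1 op arg2 */
typedef struct {
    char op[4];
    char arg1[16], arg2[16], result[16];
} Quad;

int main(void) {
    /* Intermediate code for a + b * c */
    Quad code[] = {
        { "*", "b", "c",  "t1" },
        { "+", "a", "t1", "t2" },
    };
    for (int i = 0; i < 2; i++)
        printf("%s = %s %s %s\n", code[i].result, code[i].arg1, code[i].op, code[i].arg2);
    return 0;
}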
L2
Types of Compilers
There are mainly three types of compilers.
● Single Pass Compilers
● Two Pass Compilers
● Multi-pass Compilers

Single Pass Compiler


When all the phases of the compiler are present inside a single module, it is simply called a
single-pass compiler. It performs the work of converting source code to machine code.

Two Pass Compiler


A two-pass compiler translates the program in two passes: the first pass (the front end) analyses the
source program and builds an intermediate representation, and the second pass (the back end)
generates the target code from that representation.

Multi-pass Compiler
When several intermediate representations are created and the syntax tree or intermediate code is
processed several times, the compiler is called a multi-pass compiler. It breaks the work of compilation
into a sequence of smaller passes.

Phases of a Compiler
The process of converting the source code into machine code involves several phases or stages, which
are collectively known as the phases of a compiler. The typical phases of a compiler are:
1. Lexical Analysis: The first phase of a compiler is lexical analysis, also known as scanning.
This phase reads the source code and breaks it into a stream of tokens, which are the basic
units of the programming language. The tokens are then passed on to the next phase for
further processing.
2. Syntax Analysis: The second phase of a compiler is syntax analysis, also known as parsing.
This phase takes the stream of tokens generated by the lexical analysis phase and checks
whether they conform to the grammar of the programming language. The output of this phase
is usually an Abstract Syntax Tree (AST).
3. Semantic Analysis: The third phase of a compiler is semantic analysis. This phase checks
whether the code is semantically correct, i.e., whether it conforms to the language’s type
system and other semantic rules. In this stage, the compiler checks the meaning of the source
code to ensure that it makes sense. The compiler performs type checking, which ensures that
variables are used correctly and that operations are performed on compatible data types. The
compiler also checks for other semantic errors, such as undeclared variables and incorrect
function calls.
4. Intermediate Code Generation: The fourth phase of a compiler is intermediate code
generation. This phase generates an intermediate representation of the source code that can be
easily translated into machine code.
5. Optimization: The fifth phase of a compiler is optimization. This phase applies various
optimization techniques to the intermediate code to improve the performance of the generated
machine code.
6. Code Generation: The final phase of a compiler is code generation. This phase takes the
optimized intermediate code and generates the actual machine code that can be executed by
the target hardware.
Symbol Table – It is a data structure used and maintained by the compiler, consisting of all the
identifiers' names along with their types. It helps the compiler to function smoothly by finding the
identifiers quickly.
The analysis of a source program is divided into mainly three phases. They are:
Linear Analysis- This involves a scanning phase where the stream of characters is read from
left to right. It is then grouped into various tokens having a collective meaning.
Hierarchical Analysis- In this analysis phase, based on a collective meaning, the tokens are
categorized hierarchically into nested groups.
Semantic Analysis- This phase is used to check whether the components of the source
program are meaningful or not.
L3
Front-end and Backend of compiler
The compiler has two modules, namely the front end and the back end. The front end consists of the
lexical analyser, syntax analyser, semantic analyser, and intermediate code generator. The remaining
phases (code optimization and code generation) together form the back end.

Compiler structure: analysis-synthesis model of compilation


We basically have two phases of compilers, namely the Analysis phase and Synthesis phase. The
analysis phase creates an intermediate representation from the given source code. The synthesis phase
creates an equivalent target program from the intermediate representation.

Analysis phase of a Compiler:


The Analysis phase, also known as the front end of a compiler, is the first step in the compilation
process. It is responsible for breaking down the source code, written in a high-level programming
language, into a more manageable and structured form that can be understood by the compiler. This
phase includes several sub-phases such as lexical analysis, parsing, and semantic analysis.
● The lexical analysis phase is responsible for breaking down the source code into small,
meaningful units called tokens. These tokens are then used by the next phase of the compiler,
parsing, to build a structured representation of the source code, such as an Abstract Syntax Tree
(AST).
● The parsing phase takes the tokens generated by the lexical analysis phase and uses them to
build a hierarchical representation of the source code. This representation is known as the
Abstract Syntax Tree (AST) and it captures the grammatical structure of the source code.
● The semantic analysis phase is responsible for checking the source code for semantic errors,
such as type mismatches and undefined variables. It also attaches meaning to the various
elements of the source code, such as variables and functions.
Issues in the Analysis Phase:
The analysis phase (the front end) is the first part of a compiler. Its main task is to
analyze the source code of a program and ensure that it is syntactically and semantically correct.
During the analysis phase, the compiler constructs an intermediate representation of the source code,
usually in the form of an abstract syntax tree (AST). This intermediate representation captures the
structure and meaning of the source code and is used by the compiler to generate the target code in the
next phase (the synthesis phase).
There are several issues that can arise during the analysis phase, including:
1. Syntax errors: These are errors in the source code that violate the rules of the programming
language’s syntax. For example, a missing closing brace or an extra comma in an array
declaration could cause a syntax error.
2. Semantic errors: These are errors in the source code that are not necessarily syntax errors,
but still result in the code not behaving as intended. For example, using a variable before it
has been initialized, or calling a function with the wrong number of arguments, could cause a
semantic error.
3. Ambiguities: In some cases, the source code may be ambiguous and could be interpreted in
multiple ways. This can cause the compiler to produce incorrect intermediate code or target
code.
4. Unsupported features: If the source code uses features that are not supported by the
compiler, it may fail to generate the correct intermediate code or target code.

Importance of the Analysis Phase:


The analysis phase of a compiler is an important step in the compilation process as it plays a crucial
role in understanding and verifying the structure and meaning of the source code. Some key benefits
of the analysis phase include:
1. Error detection: The analysis phase detects and reports errors in the source code, such as
syntax errors or semantic errors, helping the developer fix them before the code is compiled.
2. Intermediate representation: The output of the analysis phase is typically an intermediate
representation of the code, such as an abstract syntax tree, that can be used by the next phase
of the compiler. This allows the compiler to work with a simplified and structured
representation of the code, making it easier to generate machine code.
3. Improved code quality: By performing lexical, syntactic, and semantic analysis, the analysis
phase can help ensure that the source code is well-structured, consistent, and semantically
correct, resulting in better-quality code.
4. Language-independent techniques: The techniques used in the analysis phase (regular
expressions, context-free grammars, parser generators) are largely language independent, so
they can be applied to a wide variety of programming languages, making the front end a
versatile and reusable component in a compiler.
5. Enabling optimization: The analysis phase provides the compiler with a deeper
understanding of the source code which could lead to better optimization opportunities in the
next phases.
6. Improving the development process: By detecting and reporting errors early in the
development process, the analysis phase can save time and resources by reducing the need for
debugging and testing later on.

Applications of the Analysis Phase:


The analysis phase of a compiler is an important step in the process of converting source code into an
executable form. Some of the key applications of the analysis phase include:
1. Tokenization: During the analysis phase, the compiler breaks the source code down into
smaller units called tokens. These tokens represent the different elements of the code, such as
keywords, identifiers, operators, and punctuation.
2. Parsing: After tokenization, the compiler uses the tokens to build a representation of the code
in a form that can be easily understood and processed. This is done through a process called
parsing, which involves recognizing the patterns and structures that make up the code and
building a tree-like structure called an abstract syntax tree (AST).
3. Semantic analysis: The compiler also performs semantic analysis during the analysis phase
to ensure that the source code is well-formed and follows the rules of the programming
language. This includes checking for type errors, undeclared identifiers, and other issues that
could cause the code to be incorrect or difficult to understand.
4. Intermediate representation: The result of the analysis phase is typically an intermediate
representation of the code, which is a simplified, abstract version of the source code that can
be easily processed by the compiler. This intermediate representation is often used as the
input to the synthesis phase, which generates the target code that can be executed by the target
platform.
L4
Synthesis Phase in Compiler Design:
The synthesis phase, also known as the code generation or code optimization phase, is the final
step of a compiler. It takes the intermediate code generated by the front end of the compiler and
converts it into machine code or assembly code, which can be executed by a computer. The
intermediate code can be in the form of an abstract syntax tree, intermediate representation, or
some other form of representation.
The back end of the compiler, which includes the synthesis phase, is responsible for generating
efficient and fast code by performing various optimization techniques such as register allocation,
instruction scheduling, and memory management. These optimization techniques are intended to
minimize the code size and increase performance by reducing the number of instructions and
cycles required for execution.
The output of the synthesis phase is a binary file that can be loaded into memory and executed by
the CPU. The generated code is platform-specific and depends on the target architecture that the
compiler was designed for. The synthesis phase is crucial for producing efficient and
high-performance code that can run on different platforms.
There are several potential issues that may arise during the synthesis phase of compilation. Some
common issues include:
1. Code generation errors: These are errors that occur when the compiler is unable to generate
machine code or intermediate code that is correct or complete. This may be due to errors in
the AST or problems with the code generator itself.
2. Code size and performance: The compiler may generate machine code that is larger or slower
than desired, which can impact the performance of the resulting program.
3. Compatibility issues: The generated code may not be compatible with the target platform or
with other libraries or frameworks that the program is intended to use.
4. Linking errors: If the generated code references external symbols or functions that are not
defined, the linker may generate errors when trying to combine the generated code with other
object files.
To address these and other issues that may arise during the synthesis phase, compiler designers
and developers must carefully design and test their code generators to ensure that they produce
high-quality machine code or intermediate code.

Applications
1. Generating machine code or executable code for a specific platform: The synthesis phase
takes the intermediate code and generates code that can be run on a specific computer
architecture.
2. Instruction selection: The compiler selects appropriate machine instructions for the target
platform to implement the intermediate code.
3. Register allocation: The compiler assigns values to registers to improve the performance of
the generated code.
4. Memory management: The compiler manages the allocation and deallocation of memory to
ensure the generated code runs efficiently.
5. Optimization: The compiler performs various optimization techniques such as dead code
elimination, constant folding, and common subexpression elimination to improve the
performance of the generated code.
6. Creating executable files: The final output of the synthesis phase is typically a file
containing machine code or assembly code that can be directly executed by the computer’s
CPU.

L5
Lexical analysis
Lexical Analysis is the first phase of the compiler also known as a scanner. It converts the
High-level input program into a sequence of Tokens.
● Lexical Analysis can be implemented with the Deterministic finite Automata.
● The output is a sequence of tokens that is sent to the parser for syntax analysis

What is a token?
A lexical token is a sequence of characters that can be treated as a unit in the grammar of the
programming languages. Example of tokens:
Type token (id, number, real, . . . )
Punctuation tokens (IF, void, return, . . . )
Alphabetic tokens (keywords)
Keywords; Examples-for, while, if etc.
Identifier; Examples-Variable name, function name, etc.
Operators; Examples '+', '++', '-' etc.
Separators; Examples ',' ';' etc

Example of non-tokens:
Comments, preprocessor directive, macros, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token or a
sequence of input characters that comprises a single token is called a lexeme. eg- “float”,
“abs_zero_Kelvin”, “=”, “-”, “273”, “;” .

How Lexical Analyzer works-


1. Input preprocessing: This stage involves cleaning up the input text and preparing it for
lexical analysis. This may include removing comments, whitespace, and other non-essential
characters from the input text.
2. Tokenization: This is the process of breaking the input text into a sequence of tokens. This is
usually done by matching the characters in the input text against a set of patterns or regular
expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type of each token. For example,
in a programming language, the lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token is valid according to the
rules of the programming language. For example, it might check that a variable name is a
valid identifier, or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the output of the lexical analysis
process, which is typically a list of tokens. This list of tokens can then be passed to the next
stage of compilation or interpretation.
● The lexical analyzer identifies errors with the help of the automaton and the grammar of the
given language on which it is based (such as C or C++), and reports the row number and
column number of the error.
Suppose we pass a statement through lexical analyzer – a = b + c;
It will generate token sequence like this: id=id+id;
where each id refers to its entry in the symbol table, which stores all of its details. For example,
consider the program
int main ()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:

'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens. You can observe that we have omitted comments.
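A hand-written scanner that produces such a token stream can be sketched in C as follows (a
simplified illustration only; it ignores comments, string literals and multi-character operators, and real
lexical analyzers are usually generated with tools such as LEX):

#include <stdio.h>
#include <ctype.h>

/* Print one token per lexeme for a tiny C-like fragment: identifiers/keywords,
   integer constants, and single-character operators or separators. */
void tokenize(const char *p) {
    while (*p) {
        if (isspace((unsigned char)*p)) { p++; continue; }       /* skip blanks, tabs, newlines */
        if (isalpha((unsigned char)*p) || *p == '_') {            /* identifier or keyword */
            printf("ID/KEYWORD: ");
            while (isalnum((unsigned char)*p) || *p == '_') putchar(*p++);
            putchar('\n');
        } else if (isdigit((unsigned char)*p)) {                  /* integer constant */
            printf("NUMBER: ");
            while (isdigit((unsigned char)*p)) putchar(*p++);
            putchar('\n');
        } else {                                                  /* operator or separator */
            printf("SYMBOL: %c\n", *p++);
        }
    }
}

int main(void) {
    tokenize("int a = 10, b; a = a + b;");
    return 0;
}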

Consider a simple printf statement, for example printf("sum");

There are 5 valid tokens in this printf statement: printf, (, "sum", ), and ;.
Exercise 1: Count number of tokens:
int main ()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
Exercise 2: Count number of tokens: int max(int i);
● The lexical analyzer first reads int, finds it to be valid and accepts it as a token.
● max is read and found to be a valid function name after reading (.
● int is also a token, then i is another token, and finally ;.
● Answer: Total number of tokens: 7:
● int, max, (, int, i, ), ;
● We can represent a statement such as  while (a >= b) a = a - 2;  in the form of lexemes and
tokens as under:

Lexemes    Tokens          Lexemes    Tokens

while      WHILE           a          IDENTIFIER
(          LPAREN          =          ASSIGNMENT
a          IDENTIFIER      a          IDENTIFIER
>=         COMPARISON      -          ARITHMETIC
b          IDENTIFIER      2          INTEGER
)          RPAREN          ;          SEMICOLON

Advantages:
Efficiency: Lexical analysis improves the efficiency of the parsing process because it breaks down the
input into smaller, more manageable chunks. This allows the parser to focus on the structure of the
code, rather than the individual characters.
Flexibility: Lexical analysis allows for the use of keywords and reserved words in programming
languages. This makes it easier to create new programming languages and to modify existing ones.
Error Detection: The lexical analyzer can detect errors such as misspelled words, missing
semicolons, and undefined variables. This can save a lot of time in the debugging process.
Code Optimization: Lexical analysis can help optimize code by identifying common patterns and
replacing them with more efficient code. This can improve the performance of the program.
Disadvantages:
Complexity: Lexical analysis can be complex and require a lot of computational power. This can
make it difficult to implement in some programming languages.
Limited Error Detection: While lexical analysis can detect certain types of errors, it cannot detect all
errors. For example, it may not be able to detect logic errors or type errors.
Increased Code Size: The addition of keywords and reserved words can increase the size of the code,
making it more difficult to read and understand.
Reduced Flexibility: The use of keywords and reserved words can also reduce the flexibility of a
programming language. It may not be possible to use certain words or phrases in a way that is
intuitive to the programmer.

Unit-II Syntax Analysis &Syntax Directed Translation

Syntax analysis: CFGs, Top-down parsing, Brute force approach, recursive descent parsing,
transformation on the grammars, predictive parsing, bottom-up parsing, operator precedence parsing,
LR parsers (SLR, LALR, LR), Parser generation. Syntax directed definitions: Construction of Syntax
trees, Bottom-up evaluation of S-attributed definition, L-attribute definition, Top-down translation,
Bottom-Up evaluation of inherited attributes Recursive Evaluation, Analysis of Syntax directed
definition.

L6
Syntax analysis :
Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical
structure of the given input, i.e. whether the given input is in the correct syntax (of the language in
which the input has been written) or not. It does so by building a data structure, called a Parse tree or
Syntax tree. The parse tree is constructed by using the pre-defined Grammar of the language and the
input string. If the given input string can be produced with the help of the syntax tree (in the
derivation process), the input string is found to be in the correct syntax; if not, an error is reported by
the syntax analyser.
Syntax analysis, also known as parsing, is a process in compiler design where the compiler checks if
the source code follows the grammatical rules of the programming language. This is typically the
second stage of the compilation process, following lexical analysis.

Context-Free Grammar (CFGs)

A context-free grammar (CFG) is a type of formal grammar that is used to describe the syntax or
structure of a formal language.
The grammar is defined by the 4-tuple (V, T, P, S):
V - It is the collection of variables or nonterminal symbols.
T - It is a set of terminals.
P - It is the production rules that consist of both terminals and nonterminal.
S - It is the Starting symbol.
A grammar is said to be context-free if every production is of the form:
A -> α, where α ∈ (V ∪ T)* and A ∈ V
● The left-hand side of a production can only be a single variable (non-terminal); it cannot be a
terminal.
● The right-hand side can be any combination of variables and terminals.
In other words, every production whose right-hand side is any string over variables and terminals, and
whose left-hand side is a single variable, is allowed in a context-free grammar.
For example, consider the grammar G = ({S}, {a, b}, P, S) with the productions:
S -> aS
S -> bSa
● Here S is the starting symbol and the only variable.
● {a, b} is the set of terminals, generally represented by small (lowercase) characters.
● P is the set of production rules.
These productions satisfy the definition of a CFG. On the other hand,
a -> bSa, or
a -> ba
is not a CFG, because the left-hand side is a terminal, which does not follow the CFG rule that the
left-hand side must be a single variable.
In the computer science field, context-free grammars are frequently used, especially in the areas of
formal language theory, compiler development, and natural language processing. It is also used for
explaining the syntax of programming languages and other formal languages.

Limitations of Context-Free Grammar


Apart from all the uses and importance of context-free grammar in compiler design and computer
science, there are some limitations. CFGs have limited expressive power: neither full natural language
nor every constraint of a programming language (for example, declare-before-use) can be captured by
a context-free grammar. A context-free grammar can be ambiguous, meaning that multiple parse trees
can be generated for the same input. For some grammars, general CFG parsing can be inefficient, with
a high (in the worst case exponential for naive backtracking methods) time cost. Finally, error
reporting based purely on a CFG is not very precise, so it cannot by itself give detailed error messages
and information.

L7
Top-down parsing:
● In the top-down technique, the parse tree is constructed from the top (root), and the input is
read from left to right.
● A top-down parser starts from the start symbol and proceeds towards the input string.

● It follows the leftmost derivation.

● The difficulty with a top-down parser is that when a variable has more than one alternative,
selecting the correct one is difficult.

Working of Top-Down Parser:


Let’s consider an example where grammar is given and you need to construct a parse tree by using
top-down parser technique.

Example –

S -> aABe
A -> Abc | b
B -> d
Now, let’s consider the input to read and to construct a parse tree with top-down approach.

Input –

abbcde$
Now let us see how the top-down approach works, i.e. how the input string is generated from the
grammar using the top-down approach.

● First, you can start with S -> a A B e and then you will see input string a in the beginning and
e in the end.

● Now, you need to generate abbcde .

● Expand A-> Abc and Expand B-> d.

● Now, You have string like aAbcde and your input string is abbcde.

● Expand A->b.

● Final string, you will get abbcde.

The diagram for this example illustrates the construction of the top-down parse tree; it shows how the
input string is generated from the grammar with the top-down approach.
Brute Force Approach and its pros and cons
In parsing, the brute-force approach means trying every alternative of a production and backtracking
when a choice fails. Some general features of the brute force algorithm are:
● It is an intuitive, direct, and straightforward technique of problem-solving in which all the
possible ways or all the possible solutions to a given problem are enumerated.
● Many problems are solved in day-to-day life using the brute force strategy, for example
exploring all the paths to a nearby market to find the shortest one.
● Arranging the books in a rack using all the possibilities to optimize the rack spaces, etc.

● In fact, daily life activities use a brute force nature, even though optimal algorithms are also
possible.

PROS AND CONS OF BRUTE FORCE ALGORITHM:


Pros:
● The brute force approach is a guaranteed way to find the correct solution by listing all the
possible candidate solutions for the problem.
● It is a generic method and not limited to any specific domain of problems.
● The brute force method is ideal for solving small and simpler problems.
● It is known for its simplicity and can serve as a comparison benchmark.
Cons:

● The brute force approach is inefficient; for realistic problem sizes, the running time can grow
as fast as O(N!) or worse.
● This method relies more on compromising the power of a computer system for solving a
problem than on a good algorithm design.
● Brute force algorithms are slow.
● Brute force algorithms are not constructive or creative compared to algorithms that are
constructed using some other design paradigms

Parsing is the process to determine whether the start symbol can derive the program or not. If
the Parsing is successful then the program is a valid program otherwise the program is
invalid.
There are generally two types of Parsers:
1. Top-Down Parsers:
● In this Parsing technique we expand the start symbol to the whole program.
● Recursive Descent and LL parsers are the Top-Down parsers.
2. Bottom-Up Parsers:
● In this Parsing technique we reduce the whole program to start symbol.
● Operator Precedence Parser, LR(0) Parser, SLR Parser, LALR Parser and CLR Parser
are the Bottom-Up parsers.

L8
Recursive Descent Parser:
It is a kind of Top-Down Parser. A top-down parser builds the parse tree from the top to down,
starting with the start non-terminal. A Predictive Parser is a special case of Recursive Descent Parser,
where no Back Tracking is required.
By carefully rewriting the grammar, that is, eliminating left recursion and performing left factoring, we
obtain a grammar that can be parsed by a recursive descent parser.
Example:

Before removing left recursion:
E -> E + T | T
T -> T * F | F
F -> ( E ) | id

After removing left recursion:
E -> T E'
E' -> + T E' | e
T -> F T'
T' -> * F T' | e
F -> ( E ) | id

**Here e is Epsilon
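A recursive descent parser for the transformed grammar can be sketched in C as below, with one
function per non-terminal; the epsilon alternatives of E' and T' simply return. This is an illustrative
sketch (the single character 'i' stands for the token id), not part of the original notes:

#include <stdio.h>
#include <stdlib.h>

static const char *input;    /* remaining input, e.g. "i+i*i" */

static void error(void) { printf("syntax error at '%c'\n", *input); exit(1); }
static void match(char c)  { if (*input == c) input++; else error(); }

static void E(void); static void Ep(void); static void T(void); static void Tp(void); static void F(void);

static void E(void)  { T(); Ep(); }                                      /* E  -> T E'       */
static void Ep(void) { if (*input == '+') { match('+'); T(); Ep(); } }   /* E' -> + T E' | e */
static void T(void)  { F(); Tp(); }                                      /* T  -> F T'       */
static void Tp(void) { if (*input == '*') { match('*'); F(); Tp(); } }   /* T' -> * F T' | e */
static void F(void)  {                                                   /* F  -> ( E ) | id */
    if (*input == '(')      { match('('); E(); match(')'); }
    else if (*input == 'i') match('i');                                  /* 'i' represents id */
    else error();
}

int main(void) {
    input = "i+i*i";
    E();
    if (*input == '\0') printf("parsed successfully\n"); else error();
    return 0;
}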

Transformation on the grammars


The definition of context free grammars (CFGs) allows us to develop a wide variety of grammars.
Most of the time, some of the productions of CFGs are not useful and are redundant. This happens
because the definition of CFGs does not restrict us from making these redundant productions.
By simplifying CFGs we remove all these redundant productions from a grammar , while keeping the
transformed grammar equivalent to the original grammar. Two grammars are called equivalent if they
produce the same language. Simplifying CFGs is necessary to later convert them into Normal forms.
Types of redundant productions and the procedure of removing them are mentioned below.

1. Useless productions – The productions that can never take part in derivation of any string ,
are called useless productions. Similarly , a variable that can never take part in derivation of any string
is called a useless variable. For eg.
S -> abS | abA | abB
A -> cd
B -> aB
C -> dc
In the example above , production ‘C -> dc’ is useless because the variable ‘C’ will never occur in
derivation of any string. The other productions are written in such a way that variable ‘C’ can never
reached from the starting variable ‘S’.
Production ‘B ->aB’ is also useless because there is no way it will ever terminate . If it never
terminates , then it can never produce a string. Hence the production can never take part in any
derivation.
To remove useless productions , we first find all the variables which will never lead to a terminal
string such as variable ‘B’. We then remove all the productions in which variable ‘B’ occurs.
So the modified grammar becomes –

S -> abS | abA


A -> cd
C -> dc
We then try to identify all the variables that can never be reached from the starting variable such as
variable ‘C’. We then remove all the productions in which variable ‘C’ occurs.
The grammar below is now free of useless productions –

S -> abS | abA


A -> cd
2. ε productions – The productions of type 'A -> ε' are called ε productions (also called lambda
productions and null productions). These productions can only be removed from those grammars that
do not generate ε (the empty string). It is possible for a grammar to contain null productions and yet
not produce the empty string.
To remove null productions, we first have to find all the nullable variables. A variable 'A' is called
nullable if ε can be derived from 'A'. For all the productions of type 'A -> ε', 'A' is a nullable
variable. For all the productions of type 'B -> A1A2…An', where all 'Ai's are nullable variables,
'B' is also a nullable variable.
After finding all the nullable variables, we can now start to construct the null production free
grammar. For all the productions in the original grammar, we add the original production as well as
all the combinations of the production that can be formed by replacing the nullable variables in the
production by ε. If all the variables on the RHS of the production are nullable, then we do not add 'A
-> ε' to the new grammar. An example will make the point clear. Consider the grammar –

S -> ABCd    (1)

A -> BC      (2)
B -> bB | ε  (3)
C -> cC | ε  (4)
Let us first find all the nullable variables. Variables 'B' and 'C' are clearly nullable because they
contain 'ε' on the RHS of their productions. Variable 'A' is also nullable because in (2), both variables
on the RHS are nullable. So variables 'A', 'B' and 'C' are nullable variables.
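The nullable set can also be computed mechanically by a fixed-point iteration over the productions.
The C sketch below does this for the grammar above (an illustrative sketch; '#' is used here to stand
for ε):

#include <stdio.h>
#include <string.h>

/* Productions of the example grammar; '#' stands for epsilon. */
static const char *lhs[] = { "S",    "A",  "B",  "B", "C",  "C" };
static const char *rhs[] = { "ABCd", "BC", "bB", "#", "cC", "#" };
#define NPROD 6

static int is_nullable(char v, const char *nullable) { return strchr(nullable, v) != NULL; }

int main(void) {
    char nullable[8] = "";              /* set of nullable variables found so far */
    int changed = 1;
    while (changed) {                   /* repeat until no new nullable variable is found */
        changed = 0;
        for (int i = 0; i < NPROD; i++) {
            char v = lhs[i][0];
            if (is_nullable(v, nullable)) continue;
            int all = 1;                /* is every symbol of this right-hand side nullable? */
            for (const char *p = rhs[i]; *p; p++)
                if (*p != '#' && !is_nullable(*p, nullable)) { all = 0; break; }
            if (all) {
                size_t n = strlen(nullable);
                nullable[n] = v; nullable[n + 1] = '\0';
                changed = 1;
            }
        }
    }
    printf("nullable variables: %s\n", nullable);   /* prints BCA; S is not nullable because of 'd' */
    return 0;
}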
Let us create the new grammar. We start with the first production and add it as it is. Then
we create all the possible combinations that can be formed by replacing the nullable variables with ε.
Therefore line (1) now becomes 'S -> ABCd | ABd | ACd | BCd | Ad | Bd | Cd | d'. We apply the same
rule to line (2) but we do not add 'A -> ε' even though it is a possible combination. We remove all the
productions of type 'V -> ε'. The new grammar now becomes –

S -> ABCd | ABd | ACd | BCd | Ad | Bd |Cd | d


A -> BC | B | C
B -> bB | b
C -> cC | c
3. Unit productions – The productions of type ‘A -> B’ are called unit productions.
To create a unit production free grammar ‘Guf’ from the original grammar ‘G’ , we follow the
procedure mentioned below.
First add all the non-unit productions of ‘G’ in ‘Guf’. Then for each variable ‘A’ in grammar ‘G’ ,
find all the variables ‘B’ such that ‘A *=> B’. Now , for all variables like ‘A ’ and ‘B’, add ‘A -> x1 |
x2 | …xn’ to ‘Guf’ where ‘B -> x1 | x2 | …xn ‘ is in ‘Guf’ . None of the x1 , x2 … xn are single
variables because we only added non-unit productions in ‘Guf’. Hence the resultant grammar is unit
production free. For eg.

S -> Aa | B
A -> b | B
B -> A | a
Lets add all the non-unit productions of ‘G’ in ‘Guf’. ‘Guf’ now becomes –

S -> Aa
A -> b
B -> a
Now we find all the variables that satisfy ‘X *=> Z’. These are ‘S*=>B’, ‘A *=> B’ and ‘B *=> A’.
For ‘A *=> B’ , we add ‘A -> a’ because ‘B ->a’ exists in ‘Guf’. ‘Guf’ now becomes

S -> Aa
A -> b | a
B -> a
For ‘B *=> A’ , we add ‘B -> b’ because ‘A -> b’ exists in ‘Guf’. The new grammar now becomes

S -> Aa
A -> b | a
B -> a | b
We follow the same step for ‘S*=>B’ and finally get the following grammar –

S -> Aa | b | a
A -> b | a
B -> a | b
Now remove B -> a | b, since B is no longer reachable from the start symbol S (it does not occur in any
production of 'S' or 'A'); the grammar then becomes:
S -> Aa | b | a
A -> b | a
Note: To remove all kinds of productions mentioned above, first remove the null productions, then the
unit productions and finally , remove the useless productions. Following this order is very important
to get the correct result.
L9
Predictive parsing
This section gives an overview of the predictive parser and its role, covers the algorithm for
implementing a predictive parser, and finally discusses an example that applies the algorithm. Let us
discuss it one by one.

Predictive Parser :
A predictive parser is a recursive descent parser with no backtracking or backup. It is a top-down
parser that does not require backtracking. At each step, the choice of the rule to be expanded is made
upon the next terminal symbol.
Consider
A -> A1 | A2 | ... | An
If the non-terminal 'A' is to be expanded further, the alternative to use is selected based on the current
input symbol 'a' only.
Predictive Parser Algorithm:
1. Make a transition diagram (DFA/NFA) for every rule of grammar.
2. Optimize the DFA by reducing the number of states, yielding the final transition diagram.
3. Simulate the string on the transition diagram to parse a string.
4. If the transition diagram reaches an accept state after the input is consumed, it is parsed.
Consider the following grammar –
E->E+T|T
T->T*F|F
F->(E)|id
After removing left recursion, left factoring
E->TT'
T'->+TT'|ε
T->FT''
T''->*FT''|ε
F->(E)|id
STEP 1:
Make a transition diagram (DFA/NFA) for every rule of grammar.
● E->TT’

● T’->+TT’|ε

● T->FT”

● T”->*FT”|ε

● F->(E)|id
STEP 2:
Optimize the DFA by reducing the number of states, yielding the final transition diagram.
● T’->+TT’|ε

It can be optimized ahead by combining it with DFA for E->TT’

Accordingly, we optimize the other structures to produce the following DFA


STEP 3:
Simulation on the input string.
Steps involved in the simulation procedure are:
1. Start from the starting state.
2. If a terminal arrives, consume it and move to the next state.
3. If a non-terminal arrives, go to the DFA of that non-terminal and return once its final state is
reached.
4. Return to the original DFA and keep parsing.
5. If the whole input string has been read and a final state is reached, the string is successfully
parsed.
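The same simulation can be done without recursion by a stack-driven (table-driven) predictive parser.
The C sketch below parses the transformed grammar using a hard-coded LL(1) table; it is illustrative
only (P stands for E', Q for T'', and 'i' for the token id, so that every grammar symbol fits in one
character):

#include <stdio.h>
#include <string.h>

/* Grammar: E -> T P,  P -> + T P | e,  T -> F Q,  Q -> * F Q | e,  F -> ( E ) | i */

static char stk[100];
static int top = 0;
static void push(const char *s) { for (int i = (int)strlen(s) - 1; i >= 0; i--) stk[top++] = s[i]; }

/* The LL(1) parsing table: body to expand non-terminal X on lookahead a, or NULL on error. */
static const char *table(char X, char a) {
    switch (X) {
    case 'E': return (a == 'i' || a == '(') ? "TP" : NULL;
    case 'P': return (a == '+') ? "+TP" : (a == ')' || a == '$') ? "" : NULL;
    case 'T': return (a == 'i' || a == '(') ? "FQ" : NULL;
    case 'Q': return (a == '*') ? "*FQ" : (a == '+' || a == ')' || a == '$') ? "" : NULL;
    case 'F': return (a == 'i') ? "i" : (a == '(') ? "(E)" : NULL;
    }
    return NULL;                                   /* a terminal that did not match the lookahead */
}

int main(void) {
    const char *input = "i+i*i$";
    push("$"); push("E");                          /* start symbol on top of the end marker */
    while (top > 0) {
        char X = stk[--top], a = *input;
        if (X == a) { input++; continue; }         /* terminal on stack: match and advance */
        const char *body = table(X, a);            /* non-terminal: consult the parsing table */
        if (body == NULL) { printf("error at '%c'\n", a); return 1; }
        push(body);                                /* replace X by the chosen production body */
    }
    printf("string parsed successfully\n");
    return 0;
}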
L10
Bottom-up parsing or Shift Reduce Parsers
Bottom-up parsers (shift-reduce parsers) build the parse tree from the leaves to the root. Bottom-up
parsing can be defined as an attempt to reduce the input string w to the start symbol of the grammar by
tracing out the rightmost derivation of w in reverse.

Classification of Bottom-up Parsers:


A general shift reduce parsing is LR parsing. The L stands for scanning the input from left to right and
R stands for constructing a rightmost derivation in reverse.
Benefits of LR parsing:
1. Many programming languages are implemented using some variation of an LR parser. It
should be noted that C++ and Perl are exceptions to it.
2. LR Parser can be implemented very efficiently.
3. Of all the Parsers that scan their symbols from left to right, LR Parsers detect syntactic errors,
as soon as possible.
In this discussion, we will explore the construction of the GOTO graph for a grammar using all four
LR parsing techniques. The GOTO graph is particularly useful in solving questions in the GATE exam
as it allows for a more efficient analysis of the given grammar.
To construct the GOTO graph using LR(0) parsing, we rely on two essential
functions: Closure() and Goto().
Firstly, we introduce the concept of an augmented grammar. In the augmented grammar, a new start
symbol, S’, is added, along with a production S’ -> S. This addition helps the parser determine when
to stop parsing and signal the acceptance of input. For example, if we have a grammar S -> AA and A
-> aA | b, the augmented grammar will be S’ -> S and S -> AA.
Next, we define LR(0) items. An LR(0) item of a grammar G is a production of G with a dot (.)
positioned at some point in the right-hand side. For instance, the production S -> ABC yields four
LR(0) items: S -> .ABC, S -> A.BC, S -> AB.C, and S -> ABC. (dot at the end). It is worth noting that
the production A -> ε generates only one item: A -> . (dot only).
By utilizing the Closure() function, we can calculate the closure of a set of LR(0) items. The closure
operation involves expanding the items by considering the productions that have the dot right before
the non-terminal symbol. This step helps us identify all the possible items that can be derived from the
current set.
The Goto() function is employed to construct the transitions between LR(0) items in the GOTO graph.
It determines the next set of items by shifting the dot one position to the right. This process allows us
to navigate through the graph and track the parsing progress.
Augmented Grammar: If G is a grammar with start symbol S then G’, the augmented grammar for
G, is the grammar with new start symbol S’ and a production S’ -> S. The purpose of this new starting
production is to indicate to the parser when it should stop parsing and announce acceptance of input.
Let a grammar be S -> AA A -> aA | b, The augmented grammar for the above grammar will be S’ ->
S S -> AA A -> aA | b.
LR(0) Items: An LR(0) item of a grammar G is a production of G with a dot at some position in
its right side. S -> ABC yields four items: S -> .ABC, S -> A.BC, S -> AB.C, S -> ABC. (dot at the
end). The production A -> ε generates only one item, A -> .
Closure Operation: If I is a set of items for a grammar G, then closure(I) is the set of items
constructed from I by the two rules:
1. Initially every item in I is added to closure(I).
2. If A -> α.Bβ is in closure(I) and B -> γ is a production, then add the item B -> .γ to closure(I), if
it is not already there. We apply this rule until no more items can be added to closure(I).
Goto Operation: Goto(I, X) is computed as follows:

1. Take every item in I with the dot immediately before X and move the dot to just after X.
2. Apply the closure operation to the resulting set of items.

Construction of GOTO graph-


● State I0 – closure of augmented LR(0) item
● Using I0 find all collection of sets of LR(0) items with the help of DFA
● Convert DFA to LR(0) parsing table
Construction of LR(0) parsing table:
● The action function takes as arguments a state i and a terminal a (or $ , the input end marker).
The value of ACTION[i, a] can have one of four forms:
1. Shift j, where j is a state.
2. Reduce by a production A -> β.
3. Accept
4. Error
● We extend the GOTO function, defined on sets of items, to states: if GOTO[Ii , A] = Ij then
GOTO also maps a state i and a nonterminal A to state j.
Eg: Consider the grammar S -> AA, A -> aA | b. The augmented grammar is S' -> S, S -> AA,
A -> aA | b. The LR(0) parsing table for the above GOTO graph will be –

Action part of the table contains all the terminals of the grammar whereas the goto part contains all
the nonterminals. For every state of goto graph we write all the goto operations in the table. If goto is
applied to a terminal then it is written in the action part if goto is applied on a nonterminal it is written
in goto part. If on applying goto a production is reduced ( i.e if the dot reaches at the end of
production and no further closure can be applied) then it is denoted as Ri and if the production is not
reduced (shifted) it is denoted as Si. If a production is reduced it is written under the terminals given
by follow of the left side of the production which is reduced for ex: in I5 S->AA is reduced so R1 is
written under the terminals in follow(S)={$}) in LR(0) parser. If in a state the start symbol of
grammar is reduced it is written under $ symbol as accepted.

NOTE: If in any state both reduced and shifted productions are present or two reduced productions
are present it is called a conflict situation and the grammar is not LR grammar.
NOTE:
1. Two reduced productions in one state – RR conflict.
2. One reduced and one shifted production in one state – SR conflict. If no SR or RR conflict present
in the parsing table then the grammar is LR(0) grammar. In above grammar no conflict so it is LR(0)
grammar.
L11
operator precedence parsing
A grammar that is used to define mathematical operators is called an operator grammar or operator
precedence grammar. Such grammars have the restriction that no production has either an empty
right-hand side (null productions) or two adjacent non-terminals in its right-hand side. Examples
– This is an example of operator grammar:
E->E+E/E*E/id
However, the grammar given below is not an operator grammar because two non-terminals are
adjacent to each other:
S->SAS/a
A->bSb/b
We can convert it into an operator grammar, though:
S->SbSbS/SbS/a
A->bSb/b
Operator precedence parser – An operator precedence parser is a bottom-up parser that interprets an
operator grammar. This parser is only used for operator grammars. Ambiguous grammars are not
allowed in any parser except operator precedence parser. There are two methods for determining what
precedence relations should hold between a pair of terminals:

1. Use the conventional associativity and precedence of operator.


2. The second method of selecting operator-precedence relations is first to construct an
unambiguous grammar for the language, a grammar that reflects the correct associativity and
precedence in its parse trees.
This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b means a “yields precedence to” b.
a ⋗ b means a “takes precedence over” b.
a ≐ b means a “has the same precedence as” b.
Figure – Operator precedence relation table for grammar E->E+E/E*E/id.
No relation is given between id and id, since id is never compared with id and two operands cannot
appear side by side. A disadvantage of this table is that if we have n terminals, the size of the table
will be n*n and the space complexity will be O(n²). In order to decrease the size of the table, we use
the operator precedence function table.
Operator precedence parsers usually do not store the precedence table with the relations; rather they
are implemented in a special way. Operator precedence parsers use precedence functions that map
terminal symbols to integers, and the precedence relations between the symbols are implemented by
numerical comparison. The parsing table can be encoded by two precedence functions f and g that
map terminal symbols to integers. We select f and g such that:
1. f(a) < g(b) whenever a yields precedence to b
2. f(a) = g(b) whenever a and b have the same precedence
3. f(a) > g(b) whenever a takes precedence over b
Example – Consider the following grammar:
E -> E + E/E * E/( E )/id
This is the directed graph representing the precedence function:

Since there is no cycle in the graph, we can make this function table:

fid -> g* -> f+ -> g+ -> f$

gid -> f* -> g* -> f+ -> g+ -> f$
The size of the function table is 2n. One disadvantage of function tables is that even where the relation
table has blank entries we have non-blank entries in the function table. Blank entries are also called
error entries. Hence the error detection capability of the relation table is greater than that of the
function table.
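A small C sketch of how the precedence functions are used is given below; the f and g values are the
longest-path lengths read off the chains above (only id, +, * and $ are handled, and 'i' stands for id):

#include <stdio.h>

/* f(id)=4  f(+)=2  f(*)=4  f($)=0      g(id)=5  g(+)=1  g(*)=3  g($)=0 */
static int f(char a) { switch (a) { case 'i': return 4; case '+': return 2; case '*': return 4; default: return 0; } }
static int g(char a) { switch (a) { case 'i': return 5; case '+': return 1; case '*': return 3; default: return 0; } }

/* Compare the terminal nearest the top of the stack with the lookahead terminal. */
static void relate(char top, char next) {
    if (f(top) < g(next))      printf("%c yields precedence to %c : shift\n", top, next);
    else if (f(top) > g(next)) printf("%c takes precedence over %c : reduce\n", top, next);
    else                       printf("%c has the same precedence as %c\n", top, next);
}

int main(void) {
    relate('+', '*');   /* + yields precedence to *  -> shift  */
    relate('*', '+');   /* * takes precedence over + -> reduce */
    relate('i', '$');   /* id takes precedence over $ -> reduce */
    return 0;
}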
L12
LR parsers
The LR parser is a bottom-up parser for context-free grammars that is very widely used by
programming language compilers and other associated tools. An LR parser reads its input from left to
right and produces a rightmost derivation in reverse. It is called a bottom-up parser because it attempts
to reduce towards the top-level grammar productions by building up from the leaves. LR parsers are
the most powerful of all deterministic parsers used in practice.

Description of LR parser:
In the term LR(k) parser, the L refers to left-to-right scanning, the R refers to the rightmost derivation
in reverse, and k refers to the number of unconsumed “lookahead” input symbols that are used in
making parsing decisions. Typically, k is 1 and is often omitted. A context-free grammar is called
LR(k) if an LR(k) parser exists for it. The parser reduces the token sequence from the left; read from
the top down, the sequence of reductions in reverse gives a rightmost derivation.
1. Initially the stack is empty, and the overall goal is eventually to reduce by the rule S'→S$.
2. A dot “.” in a rule represents how much of that rule has already been recognized and is on the
stack.
3. A dotted item, or simply an item, is a production rule with a dot indicating how much of its
RHS has been recognized so far. The closure of an item is used to see which production rules
can be used to expand the current structure. It is calculated as described below.
Rules for LR parser:
The rules of the LR parser are as follows.
1. The first item from the given grammar rules adds itself as the first closed set.
2. If an item of the form A → α.Bγ is present in the closure, where the symbol B after the dot is a
non-terminal, add B's production rules as items with the dot placed before the first symbol of
their right-hand sides.
3. Repeat step 2 for any new items added in step 2.
LR parser algorithm:
The LR parsing algorithm is the same for every LR parser; only the parsing table is different for each
parser. It consists of the following components.
1. Input Buffer –
It contains the given string, and it ends with a $ symbol.

2. Stack –
The combination of state symbol and current input symbol is used to refer to the parsing table
in order to take the parsing decisions.
Parsing Table:
Parsing table is divided into two parts- Action table and Go-To table. The action table gives a
grammar rule to implement the given current state and current terminal in the input stream. There are
four cases used in action table as follows.
1. Shift Action - the present terminal is removed from the input stream and the state n is pushed
onto the stack, and it becomes the new present state.
2. Reduce Action (reduce by rule m) - the number m is written to the output stream; one state is
popped from the stack for each symbol on the right-hand side of rule m; the left-hand side
non-terminal of rule m and the newly exposed state are then used to look up a new state in the
goto table, which is pushed onto the stack and becomes the new current state.
3. Accept - the string is accepted.
4. Error (no action) - a syntax error is reported.
Note –
The go-to table indicates the state to which the parser should proceed after a reduction.
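The driver loop described above can be sketched in C for the grammar S -> AA, A -> aA | b used
earlier; the ACTION and GOTO tables below encode the table built from the GOTO graph (positive
entries shift, negative entries reduce, 99 accepts, 0 is an error). The encoding and names are
illustrative assumptions:

#include <stdio.h>

/* Productions: 1: S -> AA   2: A -> aA   3: A -> b */
static const int  rhs_len[] = { 0, 2, 2, 1 };        /* length of each right-hand side */
static const char lhs_sym[] = { 0, 'S', 'A', 'A' };

/* ACTION[state][terminal]; terminals indexed 0='a', 1='b', 2='$' */
static const int ACTION[7][3] = {
    {  3,  4,  0 },   /* 0 */
    {  0,  0, 99 },   /* 1: accept on $ */
    {  3,  4,  0 },   /* 2 */
    {  3,  4,  0 },   /* 3 */
    { -3, -3, -3 },   /* 4: reduce A -> b  */
    {  0,  0, -1 },   /* 5: reduce S -> AA */
    { -2, -2, -2 },   /* 6: reduce A -> aA */
};

/* GOTO[state][nonterminal]; nonterminals indexed 0='S', 1='A' */
static const int GOTO_T[7][2] = { {1,2}, {0,0}, {0,5}, {0,6}, {0,0}, {0,0}, {0,0} };

static int tindex(char c) { return c == 'a' ? 0 : c == 'b' ? 1 : 2; }

int main(void) {
    const char *input = "abb$";                      /* S => AA => aAA => abA => abb */
    int states[100], top = 0;
    states[top] = 0;                                 /* start in state 0 */
    for (;;) {
        int act = ACTION[states[top]][tindex(*input)];
        if (act == 99)      { printf("accepted\n"); return 0; }
        else if (act > 0)   { states[++top] = act; input++; }          /* shift  */
        else if (act < 0) {                                            /* reduce */
            int p = -act;
            top -= rhs_len[p];                                         /* pop |RHS| states */
            int next = GOTO_T[states[top]][lhs_sym[p] == 'S' ? 0 : 1]; /* consult the go-to table */
            states[++top] = next;
            printf("reduce by production %d\n", p);
        }
        else { printf("syntax error at '%c'\n", *input); return 1; }
    }
}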
LR parser diagram:
L13
Syntax directed definitions:
Construction of Syntax trees:
Syntax Directed Translation has augmented rules to the grammar that facilitate semantic analysis.
SDT involves passing information bottom-up and/or top-down to the parse tree in form of attributes
attached to the nodes. Syntax-directed translation rules use 1) lexical values of nodes, 2) constants &
3) attributes associated with the non-terminals in their definitions.
The general approach to Syntax-Directed Translation is to construct a parse tree or syntax tree and
compute the values of attributes at the nodes of the tree by visiting them in some order. In many cases,
translation can be done during parsing without building an explicit tree.
Example
E -> E+T | T
T -> T*F | F
F -> INTLIT
This is a grammar to syntactically validate an expression having additions and multiplications in it.
Now, to carry out semantic analysis we will augment SDT rules to this grammar, in order to pass some
information up the parse tree and check for semantic errors, if any. In this example, we will focus on
the evaluation of the given expression, as we don’t have any semantic assertions to check in this very
basic example.

E -> E+T { E.val = E.val + T.val } PR#1


E -> T { E.val = T.val } PR#2
T -> T*F { T.val = T.val * F.val } PR#3
T -> F { T.val = F.val } PR#4
F -> INTLIT { F.val = INTLIT.lexval } PR#5
For understanding translation rules further, we take the first SDT augmented to [ E -> E+T ]
production rule. The translation rule in consideration has val as an attribute for both the non-terminals
– E & T. Right-hand side of the translation rule corresponds to attribute values of the right-side nodes
of the production rule and vice-versa. Generalizing, SDT are augmented rules to a CFG that associate
1) set of attributes to every node of the grammar and 2) a set of translation rules to every production
rule using attributes, constants, and lexical values.
Let us take a string to see how semantic analysis happens – S = 2+3*4. The parse tree corresponding
to S is built from the grammar above and then annotated with the attribute values.
To evaluate the translation rules, we can employ one depth-first traversal of the parse tree. This works
because, for a grammar having only synthesized attributes, the SDT rules do not impose any particular
evaluation order as long as the children's attributes are computed before their parents'. Otherwise, we
would have to work out a suitable plan to traverse the parse tree and evaluate all the attributes in one
or more traversals. For better understanding, we will move bottom-up, left to right, while computing
the translation rules of our example.

The above diagram shows how semantic analysis could happen. The flow of information happens
bottom-up and all the children’s attributes are computed before parents, as discussed above.
Right-hand side nodes are sometimes annotated with subscript 1 to distinguish between children and
parents.
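The same evaluation can be sketched directly in C: build the tree for 2 + 3 * 4 and compute the
synthesized attribute val in a post-order (depth-first) traversal, exactly as rules PR#1 to PR#5
prescribe. The node layout and helper names are illustrative assumptions:

#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char op;                    /* '+', '*', or 0 for an INTLIT leaf */
    int  val;                   /* the synthesized attribute         */
    struct Node *left, *right;
} Node;

static Node *leaf(int v)                     { Node *n = calloc(1, sizeof *n); n->val = v; return n; }
static Node *node(char op, Node *l, Node *r) { Node *n = calloc(1, sizeof *n); n->op = op; n->left = l; n->right = r; return n; }

/* Children's attributes are computed before the parent's, which is all that
   an S-attributed definition requires.                                       */
static int eval(Node *n) {
    if (n->op == 0) return n->val;                   /* F -> INTLIT { F.val = INTLIT.lexval }         */
    int l = eval(n->left), r = eval(n->right);
    n->val = (n->op == '+') ? l + r : l * r;         /* E.val = E.val + T.val ; T.val = T.val * F.val */
    return n->val;
}

int main(void) {
    Node *root = node('+', leaf(2), node('*', leaf(3), leaf(4)));  /* tree for 2 + 3 * 4 */
    printf("E.val = %d\n", eval(root));                            /* prints E.val = 14  */
    return 0;
}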
Additional Information
Synthesized Attributes are such attributes that depend only on the attribute values of children nodes.
Thus [ E -> E+T { E.val = E.val + T.val } ] has a synthesized attribute val corresponding to node E. If
all the semantic attributes in an augmented grammar are synthesized, one depth-first search traversal
in any order is sufficient for the semantic analysis phase.
Inherited Attributes are such attributes that depend on parent and/or sibling’s attributes.
Thus [ Ep -> E+T { Ep.val = E.val + T.val, T.val = Ep.val } ], where E & Ep are same production
symbols annotated to differentiate between parent and child, has an inherited attribute val
corresponding to node T.

Advantages of Syntax Directed Translation:


Ease of implementation: SDT is a simple and easy-to-implement method for translating a
programming language. It provides a clear and structured way to specify translation rules using
grammar rules.
Separation of concerns: SDT separates the translation process from the parsing process, making it
easier to modify and maintain the compiler. It also separates the translation concerns from the parsing
concerns, allowing for more modular and extensible compiler designs.
Efficient code generation: SDT enables the generation of efficient code by optimizing the translation
process. It allows for the use of techniques such as intermediate code generation and code
optimization.
Disadvantages of Syntax Directed Translation:
Limited expressiveness: SDT has limited expressiveness in comparison to other translation methods,
such as attribute grammars. This limits the types of translations that can be performed using SDT.
Inflexibility: SDT can be inflexible in situations where the translation rules are complex and cannot
be easily expressed using grammar rules.
Limited error recovery: SDT is limited in its ability to recover from errors during the translation
process. This can result in poor error messages and may make it difficult to locate and fix errors in the
input program.
L14
Bottom-up evaluation of S-attributed definition, L-attribute definition,
Before coming to S-attributed and L-attributed SDTs, here is a brief introduction to the two types of attributes – Synthesized and Inherited.
1. Synthesized attributes – A Synthesized attribute is an attribute of the non-terminal on the
left-hand side of a production. Synthesized attributes represent information that is being
passed up the parse tree. The attribute can take value only from its children (Variables in the
RHS of the production). The non-terminal concerned must be in the head (LHS) of
production. For e.g. let’s say A -> BC is a production of a grammar, and A’s attribute is
dependent on B’s attributes or C’s attributes then it will be synthesized attribute.
2. Inherited attributes – An attribute of a nonterminal on the right-hand side of a production is
called an inherited attribute. The attribute can take value either from its parent or from its
siblings (variables in the LHS or RHS of the production). The non-terminal concerned must
be in the body (RHS) of production. For example, let’s say A -> BC is a production of a
grammar and B’s attribute is dependent on A’s attributes or C’s attributes then it will be
inherited attribute because A is a parent here, and C is a sibling.
Now, let’s discuss about S-attributed and L-attributed SDT.
1. S-attributed SDT :
● If an SDT uses only synthesized attributes, it is called as S-attributed SDT.
● S-attributed SDTs are evaluated in bottom-up parsing, as the values of the parent
nodes depend upon the values of the child nodes.
● Semantic actions are placed in rightmost place of RHS.
2. L-attributed SDT:
● If an SDT uses both synthesized attributes and inherited attributes with a restriction
that inherited attribute can inherit values from left siblings only, it is called as
L-attributed SDT.
● Attributes in L-attributed SDTs are evaluated by depth-first and left-to-right parsing
manner.
● Semantic actions are placed anywhere in RHS.
● Example : S->ABC, Here attribute B can only obtain its value either from the parent –
S or its left sibling A but It can’t inherit from its right sibling C. Same goes for A & C
– A can only get its value from its parent & C can get its value from S, A, & B as well
because C is the rightmost attribute in the given production.
Terminologies:
● Parse Tree: A parse tree is a tree that represents the syntax of the production
hierarchically.
● Annotated Parse Tree: Annotated Parse tree contains the values and attributes at each
node.
● Synthesized Attributes: When the evaluation of any node’s attribute is based on children.
● Inherited Attributes: When the evaluation of a node’s attribute is based on its parent and/or siblings.
Dependency Graphs
A dependency graph provides information about the order of evaluation of attributes with the help of
edges. It is used to determine the order of evaluation of attributes according to the semantic rules of
the production. An edge from the first node attribute to the second node attribute gives the
information that first node attribute evaluation is required for the evaluation of the second node
attribute. Edges represent the semantic rules of the corresponding production.
Dependency Graph Rules: A node in the dependency graph corresponds to a node of the parse tree for each attribute. An edge from the first node to the second node indicates that the attribute of the first node must be evaluated before the attribute of the second node.
Ordering the Evaluation of Attributes:
The dependency graph provides the evaluation order of attributes of the nodes of the parse tree. An
edge ( i.e. first node to the second node) in the dependency graph represents that the attribute of the
second node is dependent on the attribute of the first node for further evaluation. This order of
evaluation gives a linear order called topological order.
There is no way to evaluate SDD on a parse tree when there is a cycle present in the graph and due to
the cycle, no topological order exists.
Production Table

S.No.   Production        Semantic Rules
1.      S ⇢ A & B         S.val = A.syn + B.syn
2.      A ⇢ A1 # B        A.syn = A1.syn * B.syn ;  A1.inh = A.syn
3.      A1 ⇢ B            A1.syn = B.syn
4.      B ⇢ digit         B.syn = digit.lexval
Annotated Parse Tree For 1#2&3
Dependency Graph For 1#2&3
Explanation of dependency graph:
Node number in the graph represents the order of the evaluation of the associated attribute. Edges in
the graph represent that the second value is dependent on the first value.
Table-1 represents the attributes corresponding to each node.
Table-2 represents the semantic rules corresponding to each edge.
Table-1

Node    Attribute
1       digit.lexval
2       digit.lexval
3       digit.lexval
4       B.syn
5       B.syn
6       B.syn
7       A1.syn
8       A.syn
9       A1.inh
10      S.val
Table-2

Edge (From -> To)    Corresponding Semantic Rule (from the production table)
1 -> 4               B.syn = digit.lexval
2 -> 5               B.syn = digit.lexval
3 -> 6               B.syn = digit.lexval
4 -> 7               A1.syn = B.syn
5 -> 8               A.syn = A1.syn * B.syn
6 -> 10              S.val = A.syn + B.syn
7 -> 8               A.syn = A1.syn * B.syn
8 -> 10              S.val = A.syn + B.syn
8 -> 9               A1.inh = A.syn
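To see how such a topological order can be obtained mechanically, the following is a minimal C sketch (not part of the notes): the edge list is copied from Table-2, and a node is evaluated as soon as all of its predecessors have been evaluated (Kahn's algorithm).

#include <stdio.h>

#define N 10   /* attribute nodes, numbered 1..10 as in Table-1 */

int main(void) {
    /* Edges (from, to) taken from Table-2. */
    int edges[][2] = { {1,4},{2,5},{3,6},{4,7},{5,8},{6,10},{7,8},{8,10},{8,9} };
    int m = sizeof edges / sizeof edges[0];
    int indeg[N + 1] = {0};

    for (int i = 0; i < m; i++) indeg[edges[i][1]]++;

    int done[N + 1] = {0}, count = 0;
    printf("evaluation order:");
    while (count < N) {
        /* pick any remaining node with no unevaluated predecessors */
        int picked = 0;
        for (int v = 1; v <= N; v++) {
            if (!done[v] && indeg[v] == 0) {
                printf(" %d", v);
                done[v] = 1; count++; picked = 1;
                for (int i = 0; i < m; i++)      /* release its successors */
                    if (edges[i][0] == v) indeg[edges[i][1]]--;
            }
        }
        if (!picked) { printf(" (cycle: no topological order)"); break; }
    }
    printf("\n");
    return 0;
}

For this graph the printed order is 1 2 3 4 5 6 7 8 9 10, matching the node numbering used above.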

S-Attributed Definitions:
S-attributed SDD can have only synthesized attributes. In this type of definition, semantic rules are placed at the end of the production only, and its evaluation is based on bottom-up parsing.
Example: S ⇢ AB { S.x = f(A.x | B.x) }
L-Attributed Definitions:
L-attributed SDD can have both synthesized and inherited attributes (restricted so that an inherited attribute can take values only from the parent or left siblings). In this type of definition, semantic rules can be placed anywhere in the RHS of the production. Its evaluation is based on a depth-first, left-to-right traversal of the annotated parse tree, which yields a valid topological order.
Example: S ⇢ AB {A.x = S.x + 2} or S ⇢ AB { B.x = f(A.x | B.x) } or S ⇢ AB { S.x = f(A.x | B.x) }
Note:
● Every S-attributed grammar is also L-attributed.
● For L-attributed evaluation in order of the annotated parse tree is used.
● For S-attributed reverse of the rightmost derivation is used.
Semantic Rules with controlled side-effects:
Side effects are the program fragments contained within semantic rules. These side effects in an SDD can be controlled in two ways: permit incidental side effects that do not disturb attribute evaluation, or constrain the admissible evaluation orders so that every admissible order yields the same translation.
L15
Analysis of Syntax directed definition
Syntax Directed Definition (SDD) is a kind of abstract specification. It is a generalization of a context-free grammar in which each grammar production X --> a is associated with a set of semantic rules of the form s = f(b1, b2, ……, bk), where s is an attribute obtained from the function f. The attribute can be
a string, number, type or a memory location. Semantic rules are fragments of code which are
embedded usually at the end of production and enclosed in curly braces ({ }).
Example:
E --> E1 + T { E.val = E1.val + T.val}
Annotated Parse Tree – The parse tree containing the values of attributes at each node for given input
string is called annotated or decorated parse tree.
Features –
● High level specification
● Hides implementation details
● Explicit order of evaluation is not specified
Types of attributes – There are two types of attributes:
1. Synthesized Attributes – These are those attributes which derive their values from their children
nodes i.e. value of synthesized attribute at node is computed from the values of attributes at children
nodes in parse tree.
Example:
E --> E1 + T { E.val = E1.val + T.val}
In this, E.val derive its values from E1.val and T.val
Computation of Synthesized Attributes –
● Write the SDD using appropriate semantic rules for each production in given grammar.
● The annotated parse tree is generated and attribute values are computed in bottom up manner.
● The value obtained at root node is the final output.
Example: Consider the following grammar
S --> E
E --> E1 + T
E --> T
T --> T1 * F
T --> F
F --> digit
The SDD for the above grammar can be written as follows:
S --> E        { print(E.val) }
E --> E1 + T   { E.val = E1.val + T.val }
E --> T        { E.val = T.val }
T --> T1 * F   { T.val = T1.val * F.val }
T --> F        { T.val = F.val }
F --> digit    { F.val = digit.lexval }
Let us assume an input string 4 * 5 + 6 for computing synthesized attributes. The annotated parse tree for the input string is built bottom-up as follows.
For computation of attributes we start from leftmost bottom node. The rule F –> digit is used to
reduce digit to F and the value of digit is obtained from lexical analyzer which becomes value of F i.e.
from semantic action F.val = digit.lexval. Hence, F.val = 4 and since T is parent node of F so, we get
T.val = 4 from semantic action T.val = F.val. Then, for T –> T1 * F production, the corresponding
semantic action is T.val = T1.val * F.val . Hence, T.val = 4 * 5 = 20
Similarly, combination of E1.val + T.val becomes E.val i.e. E.val = E1.val + T.val = 26. Then, the
production S –> E is applied to reduce E.val = 26 and semantic action associated with it prints the
result E.val . Hence, the output will be 26.
2. Inherited Attributes – These are the attributes which derive their values from their parent or sibling
nodes i.e. value of inherited attributes are computed by value of parent or sibling nodes.
Example:
A --> BCD { C.in = A.in, C.type = B.type }
Computation of Inherited Attributes –
● Construct the SDD using semantic actions.
● The annotated parse tree is generated and attribute values are computed in top down manner.
Example: Consider the following grammar
S --> T L
T --> int
T --> float
T --> double
L --> L1, id
L --> id
The SDD for the above grammar can be written as follows:
S --> T L          { L.inh = T.type }
T --> int          { T.type = integer }
T --> float        { T.type = float }
T --> double       { T.type = double }
L --> L1 , id      { L1.inh = L.inh ; Enter_type(id.entry, L.inh) }
L --> id           { Enter_type(id.entry, L.inh) }
Let us assume an input string int a, c for computing inherited attributes. The annotated parse tree for the input string is decorated top-down as follows.
The value of L nodes is obtained from T.type (sibling) which is basically lexical value obtained as int,
float or double. Then L node gives type of identifiers a and c. The computation of type is done in top
down manner or preorder traversal. Using function Enter_type the type of identifiers a and c is
inserted in symbol table at corresponding id.entry.
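A minimal C sketch of this top-down flow (the list node, decorate() and enter_type() names are made up for illustration): the type computed at T is handed down the identifier list exactly like the inherited attribute, and each identifier's type is entered in the symbol table.

#include <stdio.h>

/* A declaration-list node: L -> L1 , id | id.
   'name' is the identifier at this node; 'rest' is L1 (or NULL). */
struct idlist {
    const char *name;
    struct idlist *rest;
};

/* Stand-in for the semantic action Enter_type(id.entry, type):
   here we just print what would be stored in the symbol table. */
static void enter_type(const char *name, const char *type) {
    printf("symbol table: %s : %s\n", name, type);
}

/* The type is passed down (preorder), just like the inherited
   attribute in the SDD: L1.inh = L.inh. */
static void decorate(struct idlist *L, const char *inh_type) {
    if (L == NULL) return;
    enter_type(L->name, inh_type);
    decorate(L->rest, inh_type);
}

int main(void) {
    /* Hand-built list for the declaration "int a, c". */
    struct idlist c = { "c", NULL };
    struct idlist a = { "a", &c };
    decorate(&a, "int");     /* T.type = int becomes L.inh */
    return 0;
}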
Unit-III Type Checking & Run Time Environment
Type checking: type system, specification of simple type checker, equivalence of expression, types,
type conversion, overloading of functions and operations, polymorphic functions.
Run time Environment: storage organization, Storage allocation strategies, parameter passing,
dynamic storage allocation, Symbol table, Error Detection & Recovery, Ad-Hoc and Systematic
Methods.

L16
Type checking:
Type checking is the process of verifying and enforcing constraints of types in values. A compiler
must check that the source program should follow the syntactic and semantic conventions of the
source language and it should also check the type rules of the language. It allows the programmer to
limit what types may be used in certain circumstances and assigns types to values. The type-checker
determines whether these values are used appropriately or not.
It checks the type of objects and reports a type error in the case of a violation, and incorrect types are
corrected. Whichever compiler we use, while it is compiling the program, it has to follow the type
rules of the language. Every language has its own set of type rules for the language. We know that the
information about data types is maintained and computed by the compiler.
The information about data types like INTEGER, FLOAT, CHARACTER, and all the other data types
is maintained and computed by the compiler. The compiler contains modules, where the type checker
is a module of a compiler and its task is type checking.
Conversion
Conversion from one type to another type is known as implicit if it is to be done automatically by the
compiler. Implicit type conversions are also called Coercion and coercion is limited in many
languages.
Example: An integer may be converted to a real but real is not converted to an integer.
Conversion is said to be Explicit if the programmer writes something to do the Conversion.
Tasks of the type checker include (a small C illustration follows this list):
1. It has to ensure that indexing is applied only to an array.
2. It has to check the ranges of the data types used; for example, a 16-bit INTEGER (int) has a range of -32,768 to +32,767, and FLOAT has a range of about 1.2E-38 to 3.4E+38.
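As a small C illustration of the two kinds of conversion (variable names are arbitrary): the int operand is coerced to float implicitly, while converting the float back to an int is written explicitly by the programmer.

#include <stdio.h>

int main(void) {
    int   i = 7;
    float f = 2.5f;

    /* Implicit conversion (coercion): i is promoted to float
       before the addition, so the result has type float. */
    float sum = i + f;

    /* Explicit conversion: the programmer writes the cast,
       deliberately discarding the fractional part. */
    int truncated = (int) f;

    printf("sum = %f, truncated = %d\n", sum, truncated);
    return 0;
}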
Specification of simple type checker
Types of Type Checking:
There are two kinds of type checking:
1. Static Type Checking.
2. Dynamic Type Checking.
Static Type Checking:
Static type checking is defined as type checking performed at compile time. It checks the type
variables at compile-time, which means the type of the variable is known at the compile time. It
generally examines the program text during the translation of the program. Using the type rules of a
system, a compiler can infer from the source text that a function (fun) will be applied to an operand
(a) of the right type each time the expression fun(a) is evaluated.
Examples of Static checks include:
● Type-checks: A compiler should report an error if an operator is applied to an incompatible
operand. For example, if an array variable and function variable are added together.
● The flow of control checks: Statements that cause the flow of control to leave a construct
must have someplace to which to transfer the flow of control. For example, a break statement
in C causes control to leave the smallest enclosing while, for, or switch statement, an error
occurs if such an enclosing statement does not exist.
● Uniqueness checks: There are situations in which an object must be defined only once. For
example, in Pascal an identifier must be declared uniquely, labels in a case statement must be
distinct, and elements of a scalar type may not be repeated.
● Name-related checks: Sometimes the same name may appear two or more times. For example
in Ada, a loop may have a name that appears at the beginning and end of the construct. The
compiler must check that the same name is used at both places.
The Benefits of Static Type Checking:
1. Runtime Error Protection.
2. It catches syntactic errors like spurious words or extra punctuation.
3. It catches wrong names like Math and Predefined Naming.
4. Detects incorrect argument types.
5. It catches the wrong number of arguments.
6. It catches wrong return types, like return “70”, from a function that’s declared to return an int.
Dynamic Type Checking:
Dynamic Type Checking is defined as the type checking being done at run time. In Dynamic Type
Checking, types are associated with values, not variables. In implementations of dynamically type-checked languages, each runtime object carries a type tag, i.e. a reference to a type containing its type information. Dynamic typing is more flexible. A
static type system always restricts what can be conveniently expressed. Dynamic typing results in
more compact programs since it is more flexible and does not require types to be spelled out.
Programming with a static type system often requires more design and implementation effort.
Languages like Pascal and C have static type checking. Type checking is used to check the correctness of the program before its execution; its main purpose is to verify the correctness of data-type assignments and type casts before the program runs.
Static Type-Checking is also used to determine the amount of memory needed to store the variable.
The design of the type-checker depends on the following (a minimal checker is sketched after this list):
1. Syntactic Structure of language constructs.
2. The Expressions of languages.
3. The rules for assigning types to constructs (semantic rules).
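The following is a minimal C sketch of such a checker for a tiny expression language with only int, float and an addition operator; the node layout, the deliberately strict rule that both operands of '+' must have the same type, and all names are assumptions made for this example, not a full specification.

#include <stdio.h>

/* Assumed, simplified type universe and expression nodes. */
enum type { T_INT, T_FLOAT, T_ERROR };

struct expr {
    char op;                    /* '+' for addition, 0 for a leaf */
    enum type leaf_type;        /* used only when op == 0 */
    struct expr *left, *right;
};

/* Rule assumed for the sketch: both operands of '+' must have the same
   type; a real checker could instead insert a coercion (int -> float). */
static enum type check(struct expr *e) {
    if (e->op == 0)
        return e->leaf_type;
    enum type l = check(e->left), r = check(e->right);
    if (l == T_ERROR || r == T_ERROR)
        return T_ERROR;
    if (l != r) {
        fprintf(stderr, "type error: operands of '%c' differ\n", e->op);
        return T_ERROR;
    }
    return l;
}

int main(void) {
    struct expr i = {0, T_INT, NULL, NULL};
    struct expr f = {0, T_FLOAT, NULL, NULL};
    struct expr bad = {'+', 0, &i, &f};   /* int + float -> error reported */
    printf("result type tag: %d\n", check(&bad));
    return 0;
}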
The Position of the Type checker in the Compiler:
Type checking in Compiler
The token streams from the lexical analyzer are passed to the PARSER. The PARSER will generate a
syntax tree. When a program (source code) is converted into a syntax tree, the type-checker plays a
Crucial Role. So, by seeing the syntax tree, you can tell whether each data type is handling the correct
variable or not. The type checker verifies the tree and, where necessary, annotates or corrects it. The result is a checked syntax tree, after which INTERMEDIATE CODE generation is done.
L17
Overloading of functions and operations
An Overloading symbol is one that has different operations depending on its context.
Overloading is of two types:
1. Operator Overloading
2. Function Overloading
Operator Overloading: In mathematics, the addition operator ‘+’ in the arithmetic expression “x+y” is overloaded because ‘+’ denotes different operations when ‘x’ and ‘y’ are integers, reals, complex numbers, or matrices.
Example: In Ada, the parentheses ‘()’ are overloaded, the ith element of the expression A(i) of an
Array A has a different meaning such as a ‘call to function ‘A’ with argument ‘i’ or an explicit
conversion of expression i to type ‘A’. In most languages the arithmetic operators are overloaded.
Function Overloading: The Type Checker resolves the Function Overloading based on types of
arguments and Numbers.
Example:
E --> E1 ( E2 )
{
   E.type := if E2.type = s and
                E1.type = s --> t
             then t
             else type_error
}

Polymorphic functions
The word “polymorphism” means having many forms. In simple words, we can define polymorphism
as the ability of a message to be displayed in more than one form. A real-life example of
polymorphism is a person who at the same time can have different characteristics. A man at the same
time is a father, a husband, and an employee. So the same person exhibits different behavior in
different situations. This is called polymorphism. Polymorphism is considered one of the important
features of Object-Oriented Programming.
Types of Polymorphism
● Compile-time Polymorphism
● Runtime Polymorphism
1. Compile-Time Polymorphism
This type of polymorphism is achieved by function overloading or operator overloading.
A. Function Overloading
When there are multiple functions with the same name but different parameters, then the functions are
said to be overloaded, hence this is known as Function Overloading. Functions can be overloaded
by changing the number of arguments or/and changing the type of arguments. In simple terms, it is a
feature of object-oriented programming providing many functions that have the same name but
distinct parameters when numerous tasks are listed under one function name. There are certain Rules
of Function Overloading that should be followed while overloading a function.

2. Runtime Polymorphism

This type of polymorphism is achieved by Function Overriding. Late binding and dynamic polymorphism are other names for runtime polymorphism. In runtime polymorphism, the function call is resolved at runtime; in contrast, with compile-time polymorphism, the compiler determines which function call to bind at compile time itself.
A. Function Overriding
Function Overriding occurs when a derived class has a definition for one of the member functions of
the base class. That base function is said to be overridden.
Function overriding Explanation
Virtual Function
A virtual function is a member function that is declared in the base class using the keyword virtual
and is re-defined (Overridden) in the derived class.
Some Key Points About Virtual Functions:
● Virtual functions are Dynamic in nature.
● They are defined by inserting the keyword “virtual” inside a base class and are always
declared with a base class and overridden in a child class
● A virtual function is called during Runtime

L18
Run time Environment:
A translation needs to relate the static source text of a program to the dynamic actions that must occur
at runtime to implement the program. The program consists of names for procedures, identifiers, etc.,
that require mapping with the actual memory location at runtime. Runtime environment is a state of
the target machine, which may include software libraries, environment variables, etc., to provide
services to the processes running in the system.

Storage organization:
Activation Tree
A program consists of procedures, a procedure definition is a declaration that, in its simplest form,
associates an identifier (procedure name) with a statement (body of the procedure). Each execution of
the procedure is referred to as an activation of the procedure. Lifetime of an activation is the sequence
of steps present in the execution of the procedure. If ‘a’ and ‘b’ are two procedures, then their activations will either be non-overlapping (when one is called after the other) or nested (one inside the other). A
procedure is recursive if a new activation begins before an earlier activation of the same procedure has
ended. An activation tree shows the way control enters and leaves activations. Properties of activation
trees are :-
● Each node represents an activation of a procedure.
● The root shows the activation of the main function.
● The node for procedure ‘x’ is the parent of node for procedure ‘y’ if and only if the control
flows from procedure x to procedure y.
Example – Consider the following program of Quicksort
main() {
   int n;
   readarray();
   quicksort(1, n);
}

quicksort(int m, int n) {
   int i = partition(m, n);
   quicksort(m, i - 1);
   quicksort(i + 1, n);
}
The activation tree for this program will be:

First, the main function is at the root; then main calls readarray and quicksort. Quicksort in turn calls
partition and quicksort again. The flow of control in a program corresponds to a pre-order depth-first
traversal of the activation tree which starts at the root.
CONTROL STACK AND ACTIVATION RECORDS
Control stack or runtime stack is used to keep track of the live procedure activations i.e the procedures
whose execution have not been completed. A procedure name is pushed on to the stack when it is
called (activation begins) and it is popped when it returns (activation ends). Information needed by a
single execution of a procedure is managed using an activation record or frame. When a procedure is
called, an activation record is pushed into the stack and as soon as the control returns to the caller
function the activation record is popped.
A general activation record consists of the following things (an illustrative C struct follows the list):
● Local variables: hold the data that is local to the execution of the procedure.
● Temporary values: stores the values that arise in the evaluation of an expression.
● Machine status: holds the information about the status of the machine just before the function
call.
● Access link (optional): refers to non-local data held in other activation records.
● Control link (optional): points to activation record of caller.
● Return value: used by the called procedure to return a value to calling procedure
● Actual parameters
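One way to picture these fields is as a C struct; the layout below is purely illustrative (field types and array sizes are invented), since a real compiler lays the record out on the control stack rather than declaring it as a struct.

#include <stdio.h>

/* An illustrative layout of a general activation record. */
struct activation_record {
    int   actual_params[4];    /* actual parameters passed by the caller      */
    int   return_value;        /* value handed back to the calling procedure  */
    void *control_link;        /* points to the caller's activation record    */
    void *access_link;         /* reaches non-local data in other records     */
    void *saved_machine_state; /* return address, saved registers, etc.       */
    int   locals[8];           /* local variables of this activation          */
    int   temporaries[8];      /* intermediate values from expressions        */
};

int main(void) {
    printf("illustrative record size: %zu bytes\n",
           sizeof(struct activation_record));
    return 0;
}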

Control stack for the above quicksort example:
SUBDIVISION OF RUNTIME MEMORY
Runtime storage can be subdivided to hold :
● Target code- the program code, is static as its size can be determined at compile time
● Static data objects
● Dynamic data objects- heap
● Automatic data objects- stack
L19
Storage allocation strategies
I. Static Storage Allocation
The names are bound to storage at compile time only, and hence every time the procedure is invoked its names are bound to the same storage locations. So the values of local names can be retained across activations of a procedure. Here the compiler can decide where the activation records go
with respect to the target code and can also fill the addresses in the target code for the data it operates
on.
● For any program, if we create a memory at compile time, memory will be created in the static
area.
● For any program, if we create a memory at compile-time only, memory is created only once.
● It doesn’t support dynamic data structure i.e memory is created at compile-time and
deallocated after program completion.
● The drawback with static storage allocation is recursion is not supported.
● Another drawback is the size of data should be known at compile time
Eg- FORTRAN was designed to permit static storage allocation.
II. Stack Storage Allocation
● Storage is organized as a stack and activation records are pushed and popped as activations begin and end, respectively. Locals are contained in activation records so they are bound to
fresh storage in each activation.
● Recursion is supported in stack allocation
III. Heap Storage Allocation
● Memory allocation and deallocation can be done at any time and at any place depending on
the requirement of the user.
● Heap allocation is used to dynamically allocate memory to the variables and claim it back
when the variables are no more required.
● Recursion is supported.
L20
Parameter passing
The communication medium among procedures is known as parameter passing. The values of the
variables from a calling procedure are transferred to the called procedure by some mechanism.
Basic terminology:
● R-value: The value of an expression is called its r-value. The value contained in a single variable also becomes an r-value if it appears on the right side of the assignment operator. An r-value can always be assigned to some other variable.
● L-value: The location of the memory (address) where the expression is stored is known as the l-value of that expression. It always appears on the left side of the assignment operator.
● Formal Parameter: Variables that take the information passed by the caller procedure are called formal parameters. These variables are declared in the definition of the called function.
● Actual Parameter: Variables or expressions whose values are passed to the called function are called actual parameters. These variables are specified in the function call as arguments.
Different ways of passing the parameters to the procedure (a small C illustration of the first two follows this list):
● Call by Value: In call by value the calling procedure passes the r-value of the actual
parameters and the compiler puts that into called procedure’s activation record. Formal
parameters hold the values passed by the calling procedure, thus any changes made in the
formal parameters do not affect the actual parameters.
● Call by Reference: In call by reference, the formal and actual parameters refer to the same memory location. The l-value of the actual parameter is copied into the activation record of the called function, so the called function has the address of the actual parameter. If an actual parameter does not have an l-value (e.g. i+3), it is evaluated in a new temporary location and the address of that location is passed. Any change made to the formal parameter is reflected in the actual parameter (because changes are made at the address).
● Call by Copy-Restore: In call by copy-restore, the compiler copies the values into the formal parameters when the procedure is called and copies them back into the actual parameters when control returns to the calling function. The r-values are passed, and on return the r-values of the formals are copied into the l-values of the actuals.
● Call by Name: In call by name, the actual parameters are substituted for the formals in all the places the formals occur in the procedure. It is also referred to as lazy evaluation because a parameter is evaluated only when it is needed.
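In C, call by value is the default, and call by reference is obtained by passing an address explicitly. A short example (function and variable names are arbitrary):

#include <stdio.h>

/* Call by value: the r-value is copied into the formal parameter,
   so the change below never reaches the caller's variable. */
void by_value(int x)      { x = 100; }

/* Call by reference (simulated in C with a pointer): the l-value
   (address) is passed, so the callee updates the caller's variable. */
void by_reference(int *x) { *x = 100; }

int main(void) {
    int a = 1, b = 1;
    by_value(a);       /* a is still 1 */
    by_reference(&b);  /* b becomes 100 */
    printf("a = %d, b = %d\n", a, b);
    return 0;
}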

Advantages:
Portability: A runtime environment can provide a layer of abstraction between the compiled code and
the operating system, making it easier to port the program to different platforms.
Resource management: A runtime environment can manage system resources, such as memory and
CPU time, making it easier to avoid memory leaks and other resource-related issues.
Dynamic memory allocation: A runtime environment can provide dynamic memory allocation,
allowing memory to be allocated and freed as needed during program execution.
Garbage collection: A runtime environment can perform garbage collection, automatically freeing
memory that is no longer being used by the program.
Exception handling: A runtime environment can provide exception handling, allowing the program
to gracefully handle errors and prevent crashes.
Disadvantages:
Performance overhead: A runtime environment can add performance overhead, as it requires
additional processing and memory usage.
Platform dependency: Some runtime environments may be specific to certain platforms, making it
difficult to port programs to other platforms.
Debugging: Debugging can be more difficult in a runtime environment, as the additional layer of
abstraction can make it harder to trace program execution.
Compatibility issues: Some runtime environments may not be compatible with certain operating
systems or hardware architectures, which can limit their usefulness.
Versioning: Different versions of a runtime environment may have different features or APIs, which
can lead to versioning issues when running programs compiled with different versions of the same
runtime environment.

L21
Dynamic storage allocation
Since C is a structured language, it has some fixed rules for programming. One of them includes
changing the size of an array. An array is a collection of items stored at contiguous memory locations.
Consider, for example, an array of length 9. But what if there is a requirement to change
this length (size)? For example,
● If there is a situation where only 5 elements are needed to be entered in this array. In this case,
the remaining 4 indices are just wasting memory in this array. So there is a requirement to
lessen the length (size) of the array from 9 to 5.
● Take another situation. In this, there is an array of 9 elements with all 9 indices filled. But
there is a need to enter 3 more elements in this array. In this case, 3 indices more are required.
So the length (size) of the array needs to be changed from 9 to 12.
This procedure is referred to as Dynamic Memory Allocation in C.
Therefore, C Dynamic Memory Allocation can be defined as a procedure in which the size of a data
structure (like Array) is changed during the runtime.
C provides some functions to achieve these tasks. There are 4 library functions provided by C defined
under <stdlib.h> header file to facilitate dynamic memory allocation in C programming. They are:
1. malloc()
2. calloc()
3. free()
4. realloc()
Let’s look at each of them in greater detail.
C malloc() method
The “malloc” or “memory allocation” method in C is used to dynamically allocate a single large
block of memory with the specified size. It returns a pointer of type void which can be cast into a
pointer of any form. It doesn’t initialize the memory at execution time, so each block initially contains a default garbage value.
Syntax of malloc() in C
ptr = (cast-type*) malloc(byte-size)
For Example:
ptr = (int*) malloc(100 * sizeof(int));
Since the size of int is 4 bytes on most machines, this statement will allocate 400 bytes of memory. And, the pointer ptr
holds the address of the first byte in the allocated memory.

If space is insufficient, allocation fails and returns a NULL pointer.

C calloc() method
The “calloc” or “contiguous allocation” method in C is used to dynamically allocate the specified number of blocks of memory of the specified type. It is very similar to malloc() but differs in two ways:
1. It initializes each block with a default value ‘0’.
2. It takes two parameters (arguments), as compared to malloc()’s one.
Syntax of calloc() in C
ptr = (cast-type*)calloc(n, element-size);
here, n is the no. of elements and element-size is the size of each element.
For Example:
ptr = (float*) calloc(25, sizeof(float));
This statement allocates contiguous space in memory for 25 elements each with the size of the float.

C free() method
“free” method in C is used to dynamically de-allocate the memory. The memory allocated using
functions malloc() and calloc() is not de-allocated on their own. Hence the free() method is used,
whenever the dynamic memory allocation takes place. It helps to reduce wastage of memory by
freeing it.
Syntax of free() in C
free(ptr);
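Putting the functions together, here is a small self-contained C program; the sizes echo the 9-element array discussed above, and realloc(), which is listed among the four functions, is used here to change the length of an already allocated block at run time.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* malloc: one block of 5 ints, contents initially indeterminate. */
    int *p = (int *) malloc(5 * sizeof(int));
    if (p == NULL) {               /* allocation can fail */
        fprintf(stderr, "malloc failed\n");
        return 1;
    }
    for (int i = 0; i < 5; i++) p[i] = i;

    /* realloc: grow the same block from 5 to 9 elements at run time. */
    int *q = (int *) realloc(p, 9 * sizeof(int));
    if (q == NULL) { free(p); return 1; }
    p = q;
    for (int i = 5; i < 9; i++) p[i] = i;

    /* calloc: 9 floats, every block zero-initialized. */
    float *f = (float *) calloc(9, sizeof(float));
    if (f == NULL) { free(p); return 1; }

    for (int i = 0; i < 9; i++) printf("%d ", p[i]);
    printf("\n%f\n", f[0]);        /* prints 0.000000 */

    free(p);                       /* return both blocks to the heap */
    free(f);
    return 0;
}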
L22
Symbol table:
Symbol Table is an important data structure created and maintained by the compiler in order to keep
track of semantics of variables i.e. it stores information about the scope and binding information about
names, information about instances of various entities such as variable and function names, classes,
objects, etc.
● It is built in the lexical and syntax analysis phases.
● The information is collected by the analysis phases of the compiler and is used by the
synthesis phases of the compiler to generate code.
● It is used by the compiler to achieve compile-time efficiency.
● It is used by various phases of the compiler as follows:-
1. Lexical Analysis: Creates new table entries in the table, for example like entries
about tokens.
2. Syntax Analysis: Adds information regarding attribute type, scope, dimension, line
of reference, use, etc in the table.
3. Semantic Analysis: Uses available information in the table to check for semantics
i.e. to verify that expressions and assignments are semantically correct(type checking)
and update it accordingly.
4. Intermediate Code generation: Refers to the symbol table to know how much and what type of run-time storage is allocated; the table also helps in adding temporary variable information.
5. Code Optimization: Uses information present in the symbol table for
machine-dependent optimization.
6. Target Code generation: Generates code by using address information of identifier
present in the table.
Symbol Table entries – Each entry in the symbol table is associated with attributes that support the
compiler in different phases.
Use of Symbol Table-
The symbol tables are typically used in compilers. Basically compiler is a program which scans the
application program (for instance: your C program) and produces machine code.
During this scan compiler stores the identifiers of that application program in the symbol table. These
identifiers are stored in the form of name, value address, type.
Here the name represents the name of identifier, value represents the value stored in an identifier, the
address represents memory location of that identifier and type represents the data type of identifier.
Thus compiler can keep track of all the identifiers with all the necessary information.
Items stored in Symbol table:
● Variable names and constants
● Procedure and function names
● Literal constants and strings
● Compiler generated temporaries
● Labels in source languages
Information used by the compiler from Symbol table:
● Data type and name
● Declaring procedures
● Offset in storage
● If structure or record then, a pointer to structure table.
● For parameters, whether parameter passing by value or by reference
● Number and type of arguments passed to function
● Base Address
Operations on Symbol Table – The following basic operations can be performed on a symbol table:
1. Insertion of an item in the symbol table.
2. Deletion of any item from the symbol table.
3. Searching of desired item from symbol table.
Implementation of Symbol table –
Following are commonly used data structures for implementing a symbol table (a chained hash-table sketch in C follows this list):
1. List –
We use a single array (or, equivalently, several arrays) to store names and their associated information. New names are added to the list in the order in which they are encountered. The position of the end of the array is marked by the pointer available, pointing to where the next symbol-table entry will go. The search for a name proceeds backwards from the end of the array to the beginning; when the name is located, the associated information can be found in the words following it.

id1 info1 | id2 info2 | …….. | id_n info_n

● In this method, an array is used to store names and associated information.
● A pointer “available” is maintained at end of all stored records and new names are added in
the order as they arrive
● To search for a name we start from the beginning of the list till available pointer and if not
found we get an error “use of the undeclared name”
● While inserting a new name we must ensure that it is not already present otherwise an error
occurs i.e. “Multiple defined names”
● Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
● The advantage is that it takes a minimum amount of space.
2. Linked List –
● This implementation is using a linked list. A link field is added to each record.
● Searching of names is done in order pointed by the link of the link field.
● A pointer “First” is maintained to point to the first record of the symbol table.
● Insertion is fast O(1), but lookup is slow for large tables – O(n) on average
3. Hash Table –
● In hashing scheme, two tables are maintained – a hash table and symbol table and are
the most commonly used method to implement symbol tables.
● A hash table is an array with an index range: 0 to table size – 1. These entries are
pointers pointing to the names of the symbol table.
● To search for a name we use a hash function that will result in an integer between 0 to
table size – 1.
● Insertion and lookup can be made very fast – O(1).
● The advantage is that quick search is possible; the disadvantage is that hashing is complicated to implement.
4. Binary Search Tree –
● Another approach to implementing a symbol table is to use a binary search tree i.e.
we add two link fields i.e. left and right child.
● All names are created as child of the root node that always follows the property of the
binary search tree.
● Insertion and lookup are O(log2 n) on average.
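As mentioned above, here is a minimal chained hash-table symbol table in C; the table size, field widths and the stored attributes (name, type, address) are illustrative choices, not a prescribed layout.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 31   /* illustrative; any reasonable size works */

/* One symbol-table record: name, type and a (dummy) address. */
struct symbol {
    char name[32];
    char type[16];
    int  address;
    struct symbol *next;   /* chaining resolves hash collisions */
};

static struct symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* insert(): O(1) on average */
static void insert(const char *name, const char *type, int address) {
    struct symbol *sym = malloc(sizeof *sym);
    strncpy(sym->name, name, sizeof sym->name - 1);
    sym->name[sizeof sym->name - 1] = '\0';
    strncpy(sym->type, type, sizeof sym->type - 1);
    sym->type[sizeof sym->type - 1] = '\0';
    sym->address = address;
    unsigned h = hash(name);
    sym->next = table[h];
    table[h] = sym;
}

/* lookup(): O(1) on average */
static struct symbol *lookup(const char *name) {
    for (struct symbol *s = table[hash(name)]; s; s = s->next)
        if (strcmp(s->name, name) == 0)
            return s;
    return NULL;
}

int main(void) {
    insert("count", "int", 0);
    insert("rate",  "float", 4);
    struct symbol *s = lookup("rate");
    if (s) printf("%s : %s at offset %d\n", s->name, s->type, s->address);
    if (!lookup("undeclared")) printf("use of the undeclared name\n");
    return 0;
}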
Advantages of Symbol Table
1. The efficiency of a program can be increased by using symbol tables, which give quick and
simple access to crucial data such as variable and function names, data kinds, and memory
locations.
2. Better code structure: Symbol tables can be used to organize and simplify code, making it simpler to comprehend and making problems easier to discover and correct.
3. Faster code execution: By offering quick access to information like memory addresses,
symbol tables can be utilized to optimize code execution by lowering the number of memory
accesses required during execution.
4. Symbol tables can be used to increase the portability of code by offering a standardized
method of storing and retrieving data, which can make it simpler to migrate code between
other systems or programming languages.
5. Improved code reuse: By offering a standardized method of storing and accessing
information, symbol tables can be utilized to increase the reuse of code across multiple
projects.
6. Symbol tables can be used to facilitate easy access to and examination of a program’s state
during execution, enhancing debugging by making it simpler to identify and correct mistakes.
Disadvantages of Symbol Table
1. Increased memory consumption: Systems with low memory resources may suffer from
symbol tables’ high memory requirements.
2. Increased processing time: The creation and processing of symbol tables can take a long
time, which can be problematic in systems with constrained processing power.
3. Complexity: Developers who are not familiar with compiler design may find symbol tables
difficult to construct and maintain.
4. Limited scalability: Symbol tables may not be appropriate for large-scale projects or
applications that require the management of enormous amounts of data due to their limited
scalability.
5. Upkeep: Maintaining and updating symbol tables on a regular basis can be time- and
resource-consuming.
6. Limited functionality: It’s possible that symbol tables don’t offer all the features a developer
needs, and therefore more tools or libraries will be needed to round out their capabilities.
Applications of Symbol Table
1. Resolution of variable and function names: Symbol tables are used to identify the data
types and memory locations of variables and functions as well as to resolve their names.
2. Resolution of scope issues: To resolve naming conflicts and ascertain the range of variables
and functions, symbol tables are utilized.
3. Symbol tables, which offer quick access to information such as memory locations, are used to
optimize code execution.
4. Code generation: By giving details like memory locations and data kinds, symbol tables are
utilized to create machine code from source code.
5. Error checking and code debugging: By supplying details about the status of a program
during execution, symbol tables are used to check for faults and debug code.
6. Code organization and documentation: By supplying details about a program’s structure,
symbol tables can be used to organize code and make it simpler to understand.
L23
Error Detection & Recovery
In this phase of compilation, all possible errors made by the user are detected and reported to the user
in form of error messages. This process of locating errors and reporting them to users is called
the Error Handling process.
Functions of an Error handler.
● Detection
● Reporting
● Recovery
Classification of Errors
Compile-time errors
Compile-time errors are of three types:-
Lexical phase errors
These errors are detected during the lexical analysis phase. Typical lexical errors are:
● Exceeding length of identifier or numeric constants.
● The appearance of illegal characters
● Unmatched string
Example 1 : printf("Geeksforgeeks");$
This is a lexical error since an illegal character $ appears at the end of statement.
Example 2 : This is a comment */
This is a lexical error since the end of a comment is present but its beginning is not.
Error recovery for lexical errors:
Panic Mode Recovery
● In this method, successive characters from the input are removed one at a time until a
designated set of synchronizing tokens is found. Synchronizing tokens are delimiters such as ‘;’ or ‘}’.
● The advantage is that it is easy to implement and guarantees not to go into an infinite loop
● The disadvantage is that a considerable amount of input is skipped without checking it for
additional errors
Syntactic phase errors:
These errors are detected during the syntax analysis phase. Typical syntax errors are:
● Errors in structure
● Missing operator
● Misspelled keywords
● Unbalanced parenthesis
Example : swich(ch)
{
.......
.......
}
The keyword switch is incorrectly written as a swich. Hence, an “Unidentified
keyword/identifier” error occurs.

L24
Ad-Hoc and Systematic Methods.
Error recovery for syntactic phase error:
1. Panic Mode Recovery
● In this method, successive characters from the input are removed one at a time until a
designated set of synchronizing tokens is found. Synchronizing tokens are deli-meters such
as; or }
● The advantage is that it’s easy to implement and guarantees not to go into an infinite loop
● The disadvantage is that a considerable amount of input is skipped without checking it for
additional errors
2. Statement Mode recovery
● In this method, when a parser encounters an error, it performs the necessary correction on the
remaining input so that the rest of the input statement allows the parser to parse ahead.
● The correction can be deletion of extra semicolons, replacing the comma with semicolons, or
inserting a missing semicolon.
● While performing correction, utmost care should be taken for not going in an infinite loop.
● A disadvantage is that it is difficult to handle situations where the actual error occurred before the point of detection.
3. Error production
● If a user has knowledge of common errors that can be encountered then, these errors can be
incorporated by augmenting the grammar with error productions that generate erroneous
constructs.
● If this is used then, during parsing appropriate error messages can be generated and parsing
can be continued.
● The disadvantage is that it’s difficult to maintain.
4. Global Correction
● The parser examines the whole program and tries to find out the closest match for it which is
error-free.
● The closest match program has less number of insertions, deletions, and changes of tokens to
recover from erroneous input.
● Due to high time and space complexity, this method is not implemented practically.
Semantic errors
These errors are detected during the semantic analysis phase. Typical semantic errors are
● Incompatible type of operands
● Undeclared variables
● Not matching of actual arguments with a formal one
Example : int a[10], b;
.......
.......
a = b;
It generates a semantic error because of an incompatible type of a and b.
Error recovery for Semantic errors
● If the error “Undeclared Identifier” is encountered then, to recover from this a symbol table
entry for the corresponding identifier is made.
● If data types of two operands are incompatible then, automatic type conversion is done by the
compiler.
Advantages:
Improved code quality: Error detection and recovery in a compiler can improve the overall quality of
the code produced. This is because errors can be identified early in the compilation process and
addressed before they become bigger issues.
Increased productivity: Error recovery can also increase productivity by allowing the compiler to
continue processing the code after an error is detected. This means that developers do not have to stop
and fix every error manually, saving time and effort.
Better user experience: Error recovery can also improve the user experience of software
applications. When errors are handled gracefully, users are less likely to become frustrated and are
more likely to continue using the application.
Better debugging: Error recovery in a compiler can help developers to identify and debug errors
more efficiently. By providing detailed error messages, the compiler can assist developers in
pinpointing the source of the error, saving time and effort.
Consistent error handling: Error recovery ensures that all errors are handled in a consistent manner,
which can help to maintain the quality and reliability of the software being developed.
Reduced maintenance costs: By detecting and addressing errors early in the development process,
error recovery can help to reduce maintenance costs associated with fixing errors in later stages of the
software development lifecycle.
Improved software performance: Error recovery can help to identify and address code that may
cause performance issues, such as memory leaks or inefficient algorithms. By improving the
performance of the code, the overall performance of the software can be improved as well.
Disadvantages:
Slower compilation time: Error detection and recovery can slow down the compilation process,
especially if the recovery mechanism is complex. This can be an issue in large software projects
where the compilation time can be a bottleneck.
Increased complexity: Error recovery can also increase the complexity of the compiler, making it
harder to maintain and debug. This can lead to additional development costs and longer development
times.
Risk of silent errors: Error recovery can sometimes mask errors in the code, leading to silent errors
that go unnoticed. This can be particularly problematic if the error affects the behavior of the software
application in subtle ways.
Potential for incorrect recovery: If the error recovery mechanism is not implemented correctly, it
can potentially introduce new errors or cause the code to behave unexpectedly.
Dependency on the recovery mechanism: If developers rely too heavily on the error recovery
mechanism, they may become complacent and not thoroughly check their code for errors. This can
lead to errors being missed or not addressed properly.
Difficulty in diagnosing errors: Error recovery can make it more difficult to diagnose and debug
errors since the error message may not accurately reflect the root cause of the issue. This can make it
harder to fix errors and may lead to longer development times.
Compatibility issues: Error recovery mechanisms may not be compatible with certain programming
languages or platforms, leading to issues with portability and cross-platform development.
Unit –IV Code Generation
Intermediate code generation: Declarations, Assignment statements, Boolean expressions, Case
statements, Back patching, Procedure calls Code Generation: Issues in the design of code generator,
Basic block and flow graphs, Register allocation and assignment, DAG representation of basic blocks,
peephole optimization, generating code from DAG.

L25-27 PPT
L28
Issues in the design of a code generator
Code generator converts the intermediate representation of source code into a form that can be readily
executed by the machine. A code generator is expected to generate the correct code. Designing of the
code generator should be done in such a way that it can be easily implemented, tested, and
maintained.
The following issue arises during the code generation phase:
Input to code generator – The input to the code generator is the intermediate code generated by the
front end, along with information in the symbol table that determines the run-time addresses of the
data objects denoted by the names in the intermediate representation. Intermediate codes may be
represented mostly in quadruples, triples, indirect triples, Postfix notation, syntax trees, DAGs, etc.
The code generation phase just proceeds on an assumption that the input is free from all syntactic and
static semantic errors, that the necessary type checking has taken place, and that the type-conversion operators
have been inserted wherever necessary.
● Target program: The target program is the output of the code generator. The output may be
absolute machine language, relocatable machine language, or assembly language.
● Absolute machine language as output has the advantages that it can be placed in a
fixed memory location and can be immediately executed. For example, WATFIV is a
compiler that produces the absolute machine code as output.
● Relocatable machine language as an output allows subprograms and subroutines to be
compiled separately. Relocatable object modules can be linked together and loaded by
a linking loader. But there is added expense of linking and loading.
● Assembly language as output makes the code generation easier. We can generate
symbolic instructions and use the macro-facilities of assemblers in generating code.
And we need an additional assembly step after code generation.
● Memory Management – Mapping the names in the source program to the addresses of data
objects is done by the front end and the code generator. A name in the three address
statements refers to the symbol table entry for the name. Then from the symbol table entry, a
relative address can be determined for the name.
Instruction selection – Selecting the best instructions will improve the efficiency of the program. It
includes the instructions that should be complete and uniform. Instruction speeds and machine idioms
also play a major role when efficiency is considered. But if we do not care about the efficiency of the
target program then instruction selection is straightforward. For example, the respective three-address
statements would be translated into the latter code sequence as shown below:
P:=Q+R
S:=P+T

MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
Here the fourth statement is redundant as the value of the P is loaded again in that statement that just
has been stored in the previous statement. It leads to an inefficient code sequence. A given
intermediate representation can be translated into many code sequences, with significant cost
differences between the different implementations. Prior knowledge of instruction cost is needed in
order to design good sequences, but accurate cost information is difficult to predict.

L29
Basic block and flow graphs
A Basic Block is a straight-line code sequence with no branches in except to the entry and no branches out except at the end. A basic block is a set of statements that always execute one after another, in sequence.
The first task is to partition a sequence of three-address codes into basic blocks. A new basic block is
begun with the first instruction and instructions are added until a jump or a label is met. In the absence
of a jump, control moves further consecutively from one instruction to another. The idea is
standardized in the algorithm below:
Algorithm: Partitioning three-address code into basic blocks.
Input: A sequence of three address instructions.
Process: Instructions from intermediate code which are leaders are determined. The following are the
rules used for finding a leader:
1. The first three-address instruction of the intermediate code is a leader.
2. Instructions that are targets of unconditional or conditional jump/goto statements are leaders.
3. Instructions that immediately follow unconditional or conditional jump/goto statements are
considered leaders.
For each leader thus determined, its basic block consists of the leader itself and all instructions up to, but not including, the next leader.
Example 1:
The following sequence of three-address statements forms a basic block:
t1 := a*a
t2 := a*b
t3 := 2*t2
t4 := t1+t3
t5 := b*b
t6 := t4 +t5
A three address statement x:= y+z is said to define x and to use y and z. A name in a basic block is
said to be live at a given point if its value is used after that point in the program, perhaps in another
basic block.
Example 2:
Intermediate code to set a 10*10 matrix to an identity matrix:
1) i=1 //Leader 1 (First statement)
2) j=1 //Leader 2 (Target of 11th statement)
3) t1 = 10 * i //Leader 3 (Target of 9th statement)
4) t2 = t1 + j
5) t3 = 8 * t2
6) t4 = t3 - 88
7) a[t4] = 0.0
8) j = j + 1
9) if j <= 10 goto (3)
10) i = i + 1 //Leader 4 (Immediately following Conditional goto statement)
11) if i <= 10 goto (2)
12) i = 1 //Leader 5 (Immediately following Conditional goto statement)
13) t5 = i - 1 //Leader 6 (Target of 17th statement)
14) t6 = 88 * t5
15) a[t6] = 1.0
16) i = i + 1
17) if i <= 10 goto (13)
The given algorithm is used to convert a matrix into identity matrix i.e. a matrix with all diagonal
elements 1 and all other elements as 0.
Steps (3)-(6) are used to make elements 0, step (14) is used to make an element 1. These steps are
used recursively by goto statements.
There are 6 Basic Blocks in the above code (a leader-finding sketch in C follows this list):
B1) Statement 1
B2) Statement 2
B3) Statement 3-9
B4) Statement 10-11
B5) Statement 12
B6) Statement 13-17
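The leader rules can be coded directly. The C sketch below models only the jump structure of the 17 instructions of Example 2 (jump_target[i] holds the 0-based target of instruction i+1, or -1 if that instruction is not a goto) and prints the leaders; the output reproduces blocks B1-B6, starting at statements 1, 2, 3, 10, 12 and 13.

#include <stdio.h>

#define N 17

/* goto targets of the 17 instructions in Example 2 (0-based indices):
   instruction 9 -> (3), instruction 11 -> (2), instruction 17 -> (13). */
int jump_target[N] = { -1, -1, -1, -1, -1, -1, -1, -1,  2,
                       -1,  1, -1, -1, -1, -1, -1, 12 };

int main(void) {
    int leader[N] = {0};
    leader[0] = 1;                            /* rule 1: first instruction        */
    for (int i = 0; i < N; i++) {
        if (jump_target[i] != -1) {
            leader[jump_target[i]] = 1;       /* rule 2: target of a jump         */
            if (i + 1 < N) leader[i + 1] = 1; /* rule 3: instruction after a jump */
        }
    }
    /* Each basic block runs from one leader up to (excluding) the next. */
    for (int i = 0; i < N; i++)
        if (leader[i]) printf("basic block starts at instruction %d\n", i + 1);
    return 0;
}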

L30
Register allocation and assignment
Registers are the fastest locations in the memory hierarchy. But unfortunately, this resource is limited.
It comes under the most constrained resources of the target processor. Register allocation is an
NP-complete problem. However, this problem can be reduced to graph coloring to achieve allocation
and assignment. Therefore a good register allocator computes an effective approximate solution to a
hard problem.

Figure – Input-Output
The register allocator determines which values will reside in the register and which register will hold
each of those values. It takes as its input a program with an arbitrary number of registers and produces
a program with a finite register set that can fit into the target machine. (See image)
Allocation vs Assignment:
Allocation –
Maps an unlimited namespace onto that register set of the target machine.
● Reg. to Reg. Model: Maps virtual registers to physical registers but spills excess amount to
memory.
● Mem. to Mem. Model: Maps some subset of the memory location to a set of names that
models the physical register set.
Allocation ensures that code will fit the target machine’s reg. set at each instruction.
Assignment –
Maps an allocated name set to the physical register set of the target machine.
● Assumes allocation has been done so that code will fit into the set of physical registers.
● No more than ‘k’ values are designated into the registers, where ‘k’ is the no. of physical
registers.
General register allocation is an NP-complete problem:
● Solved in polynomial time, when (no. of required registers) <= (no. of available physical
registers).
● An assignment can be produced in linear time using Interval-Graph Coloring.
Local Register Allocation And Assignment:
Allocation just inside a basic block is called Local Reg. Allocation. Two approaches for local reg.
allocation: Top-down approach and bottom-up approach.
The top-down approach is a simple approach based on ‘frequency count’: identify the values which
should be kept in registers and which should be kept in memory (a sketch follows the steps below).
Algorithm:
1. Compute a priority for each virtual register.
2. Sort the registers into priority order.
3. Assign registers in priority order.
4. Rewrite the code.
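A minimal sketch of these four steps, assuming the priority is simply the frequency (use) count and that k physical registers are available:

#include <algorithm>
#include <vector>

struct VirtualReg {
    int id;
    int useCount;          // how many times the value is referenced in the block
    int physicalReg = -1;  // -1 means "kept in memory / spilled"
};

void topDownAllocate(std::vector<VirtualReg>& vregs, int k) {
    // Steps 1-2: compute a priority (here simply the use count) and sort by it.
    std::sort(vregs.begin(), vregs.end(),
              [](const VirtualReg& a, const VirtualReg& b) { return a.useCount > b.useCount; });
    // Step 3: assign the k physical registers in priority order; the rest stay in memory.
    for (int i = 0; i < (int)vregs.size(); ++i)
        vregs[i].physicalReg = (i < k) ? i : -1;
    // Step 4: rewriting the code to use vregs[i].physicalReg (or load/store
    // sequences for spilled values) is left out of this sketch.
}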
Moving beyond single Blocks:
● More complicated, because control flow now enters the picture.
● Liveness and Live Ranges: a live range consists of a set of definitions and uses that are related
to one another (the definitions can reach the uses), so the whole range must be treated as a
single unit when it is assigned a register or a memory location.
Within a basic block, a live range can be represented as an interval [i, j], where i is the point of
definition and j is the last use.
Global Register Allocation and Assignment:
1. The main issue for a register allocator is minimizing the impact of spill code:
● Execution time for spill code.
● Code space for spill operation.
● Data space for spilled values.
2. Global allocation can’t guarantee an optimal solution for the execution time of spill code.
3. Prime differences between Local and Global Allocation:
● The structure of a global live range is naturally more complex than the local one.
● Within a global live range, distinct references may execute a different number of times.
(When basic blocks form a loop)
4. To make the decision about allocation and assignments, the global allocator mostly uses graph
coloring by building an interference graph.
5. Register allocator then attempts to construct a k-coloring for that graph where ‘k’ is the no. of
physical registers.
● In case, the compiler can’t directly construct a k-coloring for that graph, it modifies the
underlying code by spilling some values to memory and tries again.
● Spilling actually simplifies that graph which ensures that the algorithm will halt.
6. The global allocator uses several approaches; here we consider the top-down and bottom-up
allocation strategies. Subproblems associated with these approaches include:
● Discovering Global live ranges.
● Estimating Spilling Costs.
● Building an Interference graph.
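A minimal sketch of the simplify/select style of graph coloring outlined above, assuming the interference graph is given as adjacency lists; spill-code insertion and rebuilding of the graph are left out:

#include <stack>
#include <vector>

std::vector<int> colorInterferenceGraph(const std::vector<std::vector<int>>& adj, int k) {
    int n = (int)adj.size();
    std::vector<int> degree(n), color(n, -1);
    std::vector<bool> removed(n, false);
    for (int v = 0; v < n; ++v) degree[v] = (int)adj[v].size();

    // Simplify: repeatedly remove the lowest-degree remaining node (nodes with
    // degree < k are always colorable; others are optimistic spill candidates).
    std::stack<int> order;
    for (int step = 0; step < n; ++step) {
        int pick = -1;
        for (int v = 0; v < n; ++v)
            if (!removed[v] && (pick == -1 || degree[v] < degree[pick])) pick = v;
        removed[pick] = true;
        order.push(pick);
        for (int w : adj[pick]) if (!removed[w]) --degree[w];
    }
    // Select: pop nodes and give each the lowest color not used by an already-colored neighbour.
    while (!order.empty()) {
        int v = order.top(); order.pop();
        std::vector<bool> used(k, false);
        for (int w : adj[v]) if (color[w] >= 0) used[color[w]] = true;
        for (int c = 0; c < k; ++c) if (!used[c]) { color[v] = c; break; }
        // color[v] stays -1 when no color is free: that value must be spilled and,
        // in a real allocator, the code is rewritten and the graph rebuilt.
    }
    return color;
}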

L31
DAG representation of basic blocks
The Directed Acyclic Graph (DAG) is used to represent the structure of basic blocks, to visualize the
flow of values between basic blocks, and to support optimization techniques within a basic block. To
apply an optimization technique to a basic block, a DAG is constructed from the three-address code
that is produced by the intermediate code generation phase.
● Directed acyclic graphs are a type of data structure and they are used to apply transformations
to basic blocks.
● The Directed Acyclic Graph (DAG) facilitates the transformation of basic blocks.
● DAG is an efficient method for identifying common sub-expressions.
● It demonstrates how the statement’s computed value is used in subsequent statements.
Examples of directed acyclic graph :

Directed Acyclic Graph Characteristics :
A Directed Acyclic Graph for Basic Block is a directed acyclic graph with the following labels on
nodes.
● The graph’s leaves each have a unique identifier, which can be variable names or constants.
● The interior nodes of the graph are labelled with an operator symbol.
● In addition, nodes are given a string of identifiers to use as labels for storing the computed
value.
● Directed Acyclic Graphs have their own definitions for transitive closure and transitive
reduction.
● Directed Acyclic Graphs have topological orderings defined.
Algorithm for construction of Directed Acyclic Graph :
There are three possible scenarios for building a DAG on three address codes:
Case 1 – x = y op z
Case 2 – x = op y
Case 3 – x = y
Directed Acyclic Graph for the above cases can be built as follows :
Step 1 –
● If node(y) is undefined, create a leaf labelled y and let node(y) refer to it.
● For case (1), if node(z) is undefined, create a leaf labelled z and let node(z) refer to it.
Step 2 –
● For case (1), check whether a node labelled OP already exists with node(y) as its left child and
node(z) as its right child; if not, create it. Let n be this node.
● For case (2), check whether a node labelled OP already exists with node(y) as its only child; if
not, create it. Let n be this node.
● For case (3), let n be node(y).
Step 3 –
Delete x from the list of identifiers attached to node(x), if any, and append x to the list of attached
identifiers of the node n found in Step 2; finally, set node(x) to n.
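A minimal sketch of case (1), x = y op z, following the three steps above (the node layout and names are illustrative assumptions):

#include <map>
#include <string>
#include <vector>

struct DagNode {
    std::string op;                        // operator, or the lexeme of a leaf
    int left, right;                       // child indices, -1 for a leaf
    std::vector<std::string> identifiers;  // attached identifiers (Step 3)
};

struct DagBuilder {
    std::vector<DagNode> nodes;
    std::map<std::string, int> nodeOf;     // node currently holding each name's value

    int nodeFor(const std::string& name) {                 // Step 1: create a leaf only if undefined
        auto it = nodeOf.find(name);
        if (it != nodeOf.end()) return it->second;
        nodes.push_back({name, -1, -1, {name}});
        return nodeOf[name] = (int)nodes.size() - 1;
    }

    void assign(const std::string& x, const std::string& y,
                const std::string& op, const std::string& z) {   // case (1): x = y op z
        int ny = nodeFor(y), nz = nodeFor(z);
        int n = -1;                                         // Step 2: reuse an existing OP node if any
        for (int i = 0; i < (int)nodes.size(); ++i)
            if (nodes[i].op == op && nodes[i].left == ny && nodes[i].right == nz) { n = i; break; }
        if (n == -1) { nodes.push_back({op, ny, nz, {}}); n = (int)nodes.size() - 1; }
        nodes[n].identifiers.push_back(x);                  // Step 3: attach x to node n
        nodeOf[x] = n;                                      // (a full version would also delete x from
    }                                                       //  the node it was previously attached to)
};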
Example :
T0 = a + b —Expression 1
T1 = T0 + c —-Expression 2
d = T0 + T1 —–Expression 3
Expression 1 : T0 = a + b

Expression 2: T1 = T0 + c

Expression 3 : d = T0 + T1

Final Directed acyclic graph
Example :
T1 = a + b
T2 = T1 + c
T3 = T1 x T2
Example :
T1 = a + b
T2 = a – b
T3 = T1 * T2
T4 = T1 – T3
T5 = T4 + T3

Final Directed acyclic graph
Example :
a = b * c
d = b
e = d * c
b = e
f = b + c
g = f + d
Final Directed acyclic graph
Example :
T1:= 4*I0
T2:= a[T1]
T3:= 4*I0
T4:= b[T3]
T5:= T2 * T4
T6:= prod + T5
prod:= T6
T7:= I0 + 1
I0:= T7
if I0 <= 20 goto 1

Final Directed acyclic graph
Application of Directed Acyclic Graph:
● Directed acyclic graph determines the subexpressions that are commonly used.
● Directed acyclic graph determines the names used within the block as well as the names
computed outside the block.
● Determines which statements in the block may have their computed value outside the block.
● Code can be represented by a Directed acyclic graph that describes the inputs and outputs of
each of the arithmetic operations performed within the code; this representation allows the
compiler to perform common subexpression elimination efficiently.
● Several programming languages describe value systems that are linked together by a directed
acyclic graph. When one value changes, its successors are recalculated; each value in the
DAG is evaluated as a function of its predecessors.
L32
Peephole optimization
Peephole optimization is a type of code Optimization performed on a small part of the code. It is
performed on a very small set of instructions in a segment of code.
The small set of instructions or small part of code on which peephole optimization is performed is
known as peephole or window.
It basically works on the principle of replacement, in which a part of the code is replaced by shorter and
faster code without changing the output. Peephole optimization is a machine-dependent optimization.
Objectives of Peephole Optimization:
The objective of peephole optimization is as follows:
1. To improve performance
2. To reduce memory footprint
3. To reduce code size
Peephole Optimization Techniques
A. Redundant load and store elimination: In this technique, redundancy is eliminated.
Initial code:
y = x + 5;
i = y;
z = i;
w = z * 3;

Optimized code:
y = x + 5;
w = y * 3; // there is no i (or z) now

// We've removed the two redundant variables i and z, whose values were just being copied from one
another.
B. Constant folding: Expressions whose operands are all known at compile time are evaluated by the
compiler itself, and the expression is replaced by its computed value, avoiding the additional
computation at run time.
Initial code:
x = 2 * 3;

Optimized code:
x = 6;
C. Strength Reduction: The operators that consume higher execution time are replaced by the
operators consuming less execution time.
Initial code:
y = x * 2;

Optimized code:
y = x + x; or y = x << 1;

Initial code:
y = x / 2;

Optimized code:
y = x >> 1;
D. Null sequences/ Simplify Algebraic Expressions : Useless operations are deleted.
a := a + 0;
a := a * 1;
a := a/1;
a := a - 0;
E. Combine operations: Several operations are replaced by a single equivalent operation.
F. Deadcode Elimination:- Dead code refers to portions of the program that are never executed or do
not affect the program’s observable behavior. Eliminating dead code helps improve the efficiency and
performance of the compiled program by reducing unnecessary computations and memory usage.
Initial Code:-
int Dead(void)
{
int a=10;
int z=50;
int c;
c=z*5;
printf("%d", c);
a=20;
a=a*10; // these two assignments are dead code: a is never used afterwards
return 0;
}
Optimized Code:-
int Dead(void)
{
int a=10;
int z=50;
int c;
c=z*5;
printf("%d", c);
return 0;
}
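A minimal sketch of a peephole pass over three-address instructions that applies two of the patterns above, null-sequence removal and strength reduction (the Instr representation is an assumption for illustration):

#include <string>
#include <vector>

struct Instr { std::string dst, op, src1, src2; };   // dst = src1 op src2

std::vector<Instr> peephole(const std::vector<Instr>& code) {
    std::vector<Instr> out;
    for (const Instr& ins : code) {
        if (ins.op == "+" && ins.src2 == "0" && ins.dst == ins.src1)
            continue;                                            // null sequence: a = a + 0 removed
        if (ins.op == "*" && ins.src2 == "2")
            out.push_back({ins.dst, "+", ins.src1, ins.src1});   // strength reduction: y = x * 2 -> y = x + x
        else
            out.push_back(ins);
    }
    return out;
}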
L33
Generating code from DAG.
Language translation: Three address code can also be used to translate code from one programming
language to another. By translating code to a common intermediate representation, it becomes easier
to translate the code to multiple target languages.
General representation –
a = b op c
Where a, b or c represents operands like names, constants or compiler generated temporaries and op
represents the operator
Example-1: Convert the expression a * – (b + c) into three address code.
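The figure for this example is missing; one possible three-address sequence, using the uminus notation used later in this section, is:
t1 = b + c
t2 = uminus t1
t3 = a * t2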

Example-2: Write three address code for following code
for(i = 1; i<=10; i++)
{
a[i] = x * 5;
}
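The figure for this example is also missing; one possible translation (an illustrative sketch — the labels, temporaries and the assumed 4-byte element size may differ) is:
      i = 1
L1:   if i > 10 goto L2
      t1 = x * 5
      t2 = i * 4
      a[t2] = t1
      i = i + 1
      goto L1
L2: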
Implementation of Three Address Code –
There are 3 representations of three address code namely
1. Quadruple
2. Triples
3. Indirect Triples
1. Quadruple – It is a structure which consists of 4 fields namely op, arg1, arg2 and result. op denotes
the operator and arg1 and arg2 denotes the two operands and result is used to store the result of the
expression.
Advantage –
● Easy to rearrange code for global optimization.
● One can quickly access value of temporary variables using symbol table.
Disadvantage –
● Contain lot of temporaries.
● Temporary variable creation increases time and space complexity.
Example – Consider expression a = b * – c + b * – c. The three address code is:
t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
2. Triples – This representation doesn’t make use of an extra temporary variable to represent a single
operation; instead, when a reference to another triple’s value is needed, a pointer to that triple is used.
So it consists of only three fields, namely op, arg1 and arg2.
Disadvantage –
● Temporaries are implicit, which makes it difficult to rearrange the code.
● It is difficult to optimize because optimization involves moving intermediate code. When a
triple is moved, any other triple referring to it must be updated as well. (With the help of a
pointer, one can directly access a symbol table entry.)
Example – Consider expression a = b * – c + b * – c

3. Indirect Triples – This representation makes use of pointers to a separately stored listing of all
references to computations. It is similar in utility to the quadruple representation but requires less
space. Temporaries are implicit, and it is easier to rearrange the code.
Example – Consider expression a = b * – c + b * – c
Question – Write quadruple, triples and indirect triples for following expression : (x + y) * (y + z) + (x
+ y + z)
Explanation – The three address code is:
t1 = x + y
t2 = y + z
t3 = t1 * t2
t4 = t1 + z
t5 = t3 + t4
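A minimal sketch of how these three representations can be stored in a compiler (the field names follow the text; everything else is an illustrative assumption):

#include <string>
#include <vector>

struct Quadruple { std::string op, arg1, arg2, result; };   // explicit temporaries in 'result'

struct Triple    { std::string op, arg1, arg2; };           // results referred to by position,
                                                            // e.g. arg1 == "(0)" points to triple 0

// Indirect triples: the triples themselves plus a separate listing that gives
// the execution order, so statements can be reordered without renumbering.
struct IndirectTriples {
    std::vector<Triple> table;
    std::vector<int>    listing;   // indices into 'table', in execution order
};

// t1 = x + y as a quadruple:      {"+", "x", "y", "t1"}
// the same statement as a triple: {"+", "x", "y"}   (its value is later referenced as "(0)")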
Unit –V Code Optimization
Introduction to Code optimization: sources of optimization of basic blocks,
loops in flow graphs, dead code elimination, loop optimization, Introduction to
global data flow analysis, Code Improving transformations, Data flow analysis
of structured flow graphs, Symbolic debugging of optimized code.
L34
Introduction to Code optimization:
The code optimization in the synthesis phase is a program transformation technique, which tries to
improve the intermediate code by making it consume fewer resources (i.e. CPU, Memory) so that
faster-running machine code will result. Compiler optimizing process should meet the following
objectives:
● The optimization must be correct, it must not, in any way, change the meaning of the
program.
● Optimization should increase the speed and performance of the program.
● The compilation time must be kept reasonable.
● The optimization process should not delay the overall compiling process.
When to Optimize?
Optimization of the code is often performed at the end of the development stage since it reduces
readability and adds code that is used to increase the performance.
Why Optimize?
Optimizing the algorithm itself is beyond the scope of the code optimization phase; instead, the
generated program is optimized, which may also involve reducing the size of the code. Optimization
helps to:
● Reduce the space consumed and increase the speed of the compiled program.
● Manually analysing datasets involves a lot of time. Hence, we make use of software like
Tableau for data analysis. Similarly, manually performing the optimization is also tedious and
is better done using a code optimizer.
● An optimized code often promotes re-usability.
Types of Code Optimization: The optimization process can be broadly classified into two types:
1. Machine Independent Optimization: This code optimization phase attempts to improve
the intermediate code to get a better target code as the output. The part of the intermediate
code which is transformed here does not involve any CPU registers or absolute memory
locations.
2. Machine Dependent Optimization: Machine-dependent optimization is done after the target
code has been generated and when the code is transformed according to the target machine
architecture. It involves CPU registers and may have absolute memory references rather than
relative references. Machine-dependent optimizers put efforts to take maximum advantage of
the memory hierarchy.

Sources of optimization of basic blocks:
Optimization is applied to the basic blocks after the intermediate code generation phase of the
compiler. Optimization is the process of transforming a program that improves the code by consuming
fewer resources and delivering high speed. In optimization, high-level codes are replaced by their
equivalent efficient low-level codes. Optimization of basic blocks can be machine-dependent or
machine-independent. These transformations are useful for improving the quality of code that will be
ultimately generated from basic block.
There are two types of basic block optimizations:
1. Structure preserving transformations
2. Algebraic transformations

Structure-Preserving Transformations:
The structure-preserving transformation on basic blocks includes:
1. Dead Code Elimination
2. Common Subexpression Elimination
3. Renaming of Temporary variables
4. Interchange of two independent adjacent statements
L35
Loops in flow graphs:
In the context of code optimization and program analysis, flow graphs are graphical representations
that depict the control flow of a program. Loops in flow graphs are visual representations of loops or
repetitive structures in the source code. Understanding loops in flow graphs is essential for analysing
the program's behaviour, identifying optimization opportunities, and improving the efficiency of the
code.

1. Basic Block:
● A basic block is a sequence of consecutive statements in which flow control enters at
the beginning and leaves at the end without any internal branches except at the end.
Basic blocks are the building blocks of flow graphs.
2. Flow Graph:
● A flow graph represents the control flow of a program using nodes and directed
edges. Nodes typically correspond to basic blocks, and edges represent the flow of
control between these blocks. Flow graphs help visualize the execution path of a
program.
3. Loop:
● A loop is a structure in a program that allows a set of statements to be executed
repeatedly based on a certain condition. In a flow graph, a loop is often represented
by a back edge, which is a directed edge that connects a node inside the loop to a
node outside the loop.
4. Back Edge:
● A back edge is an edge in the flow graph that connects a node to an ancestor node in
the control flow hierarchy. In the context of loops, a back edge connects a node inside
the loop to a node outside the loop, indicating the loop's cyclic nature.
5. Loop Header:
● The loop header is the entry point of a loop. It is the node that is the target of the back
edge. The loop header typically contains the conditional branch that determines
whether the loop should continue or exit.
6. Loop Body:
● The loop body consists of the nodes and edges within the loop. It represents the set of
statements that are executed iteratively as long as the loop condition holds true.
7. Exit Node:
● The exit node is a node within the loop that leads to the loop's exit. It is the point
where the loop terminates, and control flow moves outside the loop.
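A minimal sketch of detecting back edges with a depth-first search over the flow graph (successor lists are an assumed representation; a full treatment would also verify that the target of the back edge dominates its source):

#include <utility>
#include <vector>

void dfs(int u, const std::vector<std::vector<int>>& succ,
         std::vector<int>& state,                       // 0 = unvisited, 1 = on current path, 2 = done
         std::vector<std::pair<int,int>>& backEdges) {
    state[u] = 1;
    for (int v : succ[u]) {
        if (state[v] == 1) backEdges.push_back({u, v}); // v is an ancestor: u -> v is a back edge, v is the header
        else if (state[v] == 0) dfs(v, succ, state, backEdges);
    }
    state[u] = 2;
}

std::vector<std::pair<int,int>> findBackEdges(const std::vector<std::vector<int>>& succ, int entry) {
    std::vector<int> state(succ.size(), 0);
    std::vector<std::pair<int,int>> backEdges;
    dfs(entry, succ, state, backEdges);
    return backEdges;
}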
1.Dead Code Elimination:
Dead code is defined as that part of the code that never executes during program execution, so for
optimization such code is eliminated. Code that is never executed still takes translation time and space,
so for optimization and speed it is removed. Eliminating dead code increases the speed of the program,
as the compiler does not have to translate it.
Example:
// Program with Dead code
int main()
{
int x = 2;
if (x > 2)
cout << "code"; // Dead code
else
cout << "Optimization";
return 0;
}
// Optimized Program without dead code
int main()
{
int x = 2;
cout << "Optimization"; // Dead Code Eliminated
return 0;
}
2. Common Subexpression Elimination:
In this technique, sub-expressions that occur more than once (common sub-expressions) are calculated
only once and the result is reused wherever it is needed. A DAG (Directed Acyclic Graph) is used to
identify and eliminate common subexpressions.
Example:
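For instance (an illustrative example, since the original figure is missing):
Before elimination:
t1 = a + b
t2 = c + t1
t3 = a + b
t4 = c + t3
After elimination (a + b is computed once and t3 reuses t1):
t1 = a + b
t2 = c + t1
t4 = c + t1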
3.Renaming of Temporary Variables:
Statements containing instances of a temporary variable can be changed to instances of a new
temporary variable without changing the basic block value.
Example: Statement t = a + b can be changed to x = a + b where t is a temporary variable and x is a
new temporary variable without changing the value of the basic block.
4.Interchange of Two Independent Adjacent Statements:
If a block has two adjacent statements which are independent, they can be interchanged without
affecting the basic block value.
Example:
t1 = a + b
t2 = c + d
These two independent statements of a block can be interchanged without affecting the value of the
block.
Algebraic Transformation:
Countless algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set. Some of the algebraic transformation on basic blocks
includes:
1. Constant Folding
2. Copy Propagation
3. Strength Reduction
1. Constant Folding:
Evaluate constant terms that appear together at compile time, so that the compiler does not need to
evaluate the expression at run time.
Example:
x = 2 * 3 + y ⇒ x = 6 + y (Optimized code)
2. Copy Propagation:
It is of two types, Variable Propagation, and Constant Propagation.
Variable Propagation:
x=y ⇒ z = y + 2 (Optimized code)
z=x+2
Constant Propagation:
x=3 ⇒ z = 3 + a (Optimized code)
z=x+a
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
L36
Loop Optimization:
Loop optimization includes the following strategies:
1. Code motion & Frequency Reduction
2. Induction variable elimination
3. Loop merging/combining
4. Loop Unrolling
1. Code Motion & Frequency Reduction
Move loop invariant code outside of the loop.
// Program with loop-invariant code inside the loop
int main()
{
for (i = 0; i < n; i++) {
x = 10;
y = y + i;
}
return 0;
}
// Program with loop-invariant code moved outside the loop
int main()
{
x = 10;
for (i = 0; i < n; i++)
y = y + i;
return 0;
}
2. Induction Variable Elimination:
Eliminate various unnecessary induction variables used in the loop.
// Program with multiple induction variables
int main()
{
i1 = 0;
i2 = 0;
for (i = 0; i < n; i++) {
A[i1++] = B[i2++];
}
return 0;
}
// Program with one induction variable
int main()
{
for (i = 0; i < n; i++) {
A[i] = B[i]; // Only one induction variable
}
return 0;
}
3. Loop Merging/Combining:
If the operations performed can be done in a single loop then, merge or combine the loops.
// Program with multiple loops
int main()
{
for (i = 0; i < n; i++)
A[i] = i + 1;
for (j = 0; j < n; j++)
B[j] = j - 1;
return 0;
}
// Program with one loop when multiple loops are merged
int main()
{
for (i = 0; i < n; i++) {
A[i] = i + 1;
B[i] = i - 1;
}
return 0;
}
4. Loop Unrolling:
If the loop body is simple and the number of iterations is known, the loop can be replaced by
straight-line copies of the body, which removes the loop-control overhead.
// Program with loops
int main()
{
for (i = 0; i < 3; i++)
cout << "Cd";
return 0;
}
// Program with simple code without loops
int main()
{
cout << "Cd";
cout << "Cd";
cout << "Cd";
return 0;
}
L37
Introduction to global data flow analysis

Global Data Flow Analysis (also known as Global Flow Analysis or Global Data-Flow Analysis) is a
compiler optimization technique used to analyze the flow of data throughout an entire program. It
provides insights into how values are computed and propagated across different parts of the program,
helping compilers make informed decisions to optimize the code for performance, memory usage, and
other aspects.
Here's an introduction to Global Data Flow Analysis:
1. Objective:
● The primary goal of Global Data Flow Analysis is to gather information about how
data is used and modified throughout the entire program. By understanding how
values flow through variables and expressions, compilers can make optimizations that
improve performance, reduce memory usage, and eliminate redundant computations.
2. Scope:
● Global Data Flow Analysis considers the entire program rather than focusing on
individual functions or basic blocks. It takes into account the relationships and
dependencies between variables and expressions across different parts of the program.
3. Data Flow Graph:
● The analysis often involves constructing a data flow graph that represents the flow of
values between variables and expressions. Nodes in the graph represent program
points, and edges represent the flow of data from one point to another. This graph
provides a visual representation of how data propagates through the program.
4. Reaching Definitions:
● One common aspect of Global Data Flow Analysis is the identification of reaching
definitions. A reaching definition is a point in the program where a variable is
defined, and its value may reach a certain program point during execution. This
information is crucial for understanding how values are computed and used across
different parts of the code.
5. Uses and Optimizations:
● Global Data Flow Analysis is used by compilers to perform various optimizations,
such as:
● Dead Code Elimination: Identifying and removing code that does not
contribute to the final output, improving program efficiency.
● Common Subexpression Elimination: Identifying repeated computations
and replacing them with a single computation, reducing redundant work.
● Constant Propagation: Propagating constant values through the program to
eliminate unnecessary computations and simplify expressions.
● Register Allocation: Optimizing the allocation of registers for variables
based on their usage throughout the program.
6. Iterative Algorithms:
● Global Data Flow Analysis often employs iterative algorithms to refine the
information gathered about data flow. Algorithms such as the worklist algorithm or
iterative data flow analysis are commonly used to iteratively update the data flow
graph until a stable solution is reached.
7. Challenges:
● Analyzing data flow globally can be computationally expensive, especially for large
programs. Balancing the precision of the analysis with the computational cost is a key
challenge in implementing effective Global Data Flow Analysis.
8. Interprocedural Analysis:
● Global Data Flow Analysis can also be extended to analyze data flow across different
functions in the program, known as interprocedural analysis. This provides a more
comprehensive understanding of how data is exchanged between functions.
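A minimal sketch of the iterative (round-robin) algorithm for reaching definitions mentioned above, using bitsets for the GEN/KILL/IN/OUT sets; block and definition numbering are assumptions for illustration:

#include <bitset>
#include <vector>

constexpr int MAX_DEFS = 64;
using DefSet = std::bitset<MAX_DEFS>;

struct Block {
    DefSet gen, kill;
    std::vector<int> preds;   // indices of predecessor blocks
};

// Standard data-flow equations:
//   IN[B]  = union of OUT[P] over all predecessors P of B
//   OUT[B] = GEN[B] union (IN[B] - KILL[B])
void reachingDefinitions(const std::vector<Block>& blocks,
                         std::vector<DefSet>& in, std::vector<DefSet>& out) {
    int n = (int)blocks.size();
    in.assign(n, DefSet());
    out.assign(n, DefSet());
    bool changed = true;
    while (changed) {                                   // iterate until a fixed point is reached
        changed = false;
        for (int b = 0; b < n; ++b) {
            DefSet newIn;
            for (int p : blocks[b].preds) newIn |= out[p];
            DefSet newOut = blocks[b].gen | (newIn & ~blocks[b].kill);
            if (newIn != in[b] || newOut != out[b]) changed = true;
            in[b] = newIn;
            out[b] = newOut;
        }
    }
}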
L38
Code Improving transformations
Code improving transformations, also known as code optimizations, refer to techniques and strategies
used to enhance the performance, maintainability, and efficiency of software programs. These
transformations aim to produce code that executes faster, uses fewer resources, and is easier to
understand. Here are several common code improving transformations:
Code Optimization is done in the following different ways:
1. Compile Time Evaluation:

(i) A = 2*(22.0/7.0)*r
The constant sub-expression 2*(22.0/7.0) can be evaluated at compile time (r is not known until run time).
(ii) x = 12.4
y = x/2.3
Evaluate x/2.3 as 12.4/2.3 at compile time.

2. Variable Propagation:

//Before Optimization
c=a*b
x=a
// ... (intervening statements) ...
d=x*b+4

//After Optimization
c=a*b
x=a
// ... (intervening statements) ...
d=a*b+4

3. Constant Propagation:
● If the value of a variable is known to be a constant, then replace the variable with that constant.
(The variable may not always be a constant, so this must be checked first.)
Example:
(i) A = 2*(22.0/7.0)*r
The constant sub-expression 2*(22.0/7.0) is evaluated at compile time (r is not known until run time).
(ii) x = 12.4
y = x/2.3
Evaluates x/2.3 as 12.4/2.3 at compile time.
(iii) int k=2;
if(k) go to L3;
It is evaluated as :
go to L3 ( Because k = 2 which implies condition is always true)

4. Constant Folding:
● Consider an expression a = b op c; if the values of b and c are constants, then the value of a
can be computed at compile time.
Example:

#define k 5
x=2*k
y=k+5

This can be computed at compile time and the values of x and y are :
x = 10
y = 10

Note: Difference between Constant Propagation and Constant Folding:


● In Constant Propagation, the variable is substituted with its assigned constant, whereas in
Constant Folding, expressions whose values can be computed at compile time are evaluated
and replaced by their computed values.
5. Copy Propagation:
● It is an extension of constant propagation.
● After the copy x = a, use a in place of x until x is assigned again to another variable, value
or expression.
● It helps by reducing the number of copy operations.
Example :
//Before Optimization
c=a*b
x=a
till
d=x*b+4

//After Optimization
c=a*b
x=a
till
d=a*b+4

6. Common Sub Expression Elimination:
● In the propagation example above, a*b and x*b (which becomes a*b after propagation) are
common sub-expressions, so the second computation can be eliminated.
7. Dead Code Elimination:
● Copy propagation often leads to making assignment statements into dead code.
● A variable is said to be dead if it is never used after its last definition.
● In order to find the dead variables, a data flow analysis should be done.
Example:

c=a*b
x=a
till
d=a*b+4

//After elimination :
c=a*b
till
d=a*b+4

8. Unreachable Code Elimination:


● First, Control Flow Graph should be constructed.
● The block which does not have an incoming edge is an Unreachable code block.
● After constant propagation and constant folding, the unreachable branches can be eliminated.


#include <iostream>
using namespace std;

int main() {
int num;
num=10;
cout << "GFG!";
return 0;
cout << num; //unreachable code
}
//after elimination of unreachable code
int main() {
int num;
num=10;
cout << "GFG!";
return 0;
}

9. Function Inlining:
● Here, a function call is replaced by the body of the function itself.
● This saves a lot of time in copying all the parameters, storing the return address, etc.
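For example (an illustrative sketch, not from the original text):
// Before inlining
int square(int n) { return n * n; }
int main() { int y = square(5); return 0; }

// After inlining: the call is replaced by the body of square
int main() { int y = 5 * 5; return 0; }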
10. Function Cloning:
● Here, specialized codes for a function are created for different calling parameters.
● Example: Function Overloading
11. Induction Variable and Strength Reduction:
● An induction variable is used in the loop for the following kind of assignment i = i + constant.
It is a kind of Loop Optimization Technique.
● Strength reduction means replacing a high-strength (expensive) operator with a lower-strength (cheaper) one.
Examples:
Example 1 :
Multiplication with powers of 2 can be replaced by shift left operator which is less
expensive than multiplication
a=a*16
// Can be modified as :
a = a<<4

Example 2 :
i = 1;
while (i < 10)
{
y = i * 4;
i = i + 1;
}

//After Reduction (the multiplication inside the loop is replaced by repeated addition)
t = 4;
while (t < 40)
{
y = t;
t = t + 4;
}

Loop Optimization Techniques:
1. Code Motion or Frequency Reduction:
● The evaluation frequency of expression is reduced.
● The loop invariant statements are brought out of the loop.
Example:

a = 200;
while(a>0)
{
b = x + y;
if (a % b == 0)
printf("%d", a);
}

//This code can be further optimized as


a = 200;
b = x + y;
while(a>0)
{
if (a % b == 0)
printf("%d", a);
}

2. Loop Jamming:
● Two or more loops are combined into a single loop. It helps in reducing the loop overhead and the compile time.
Example:

// Before loop jamming


for(int k=0;k<10;k++)
{
x = k*2;
}

for(int k=0;k<10;k++)
{
y = k+3;
}
//After loop jamming
for(int k=0;k<10;k++)
{
x = k*2;
y = k+3;
}

3. Loop Unrolling:
● It helps in optimizing the execution time of the program by reducing the iterations.
● It increases the program’s speed by eliminating the loop control and test instructions.
Example:

//Before Loop Unrolling

for(int i=0;i<2;i++)
{
printf("Hello");
}

//After Loop Unrolling

printf("Hello");
printf("Hello");

Where to apply Optimization?


Now that we learned the need for optimization and its two types, now let’s see where to apply these
optimizations.
● Source program: Optimizing the source program involves making changes to the algorithm
or changing the loop structures. The user is the actor here.
● Intermediate Code: Optimizing the intermediate code involves changing the address
calculations and transforming the procedure calls involved. Here compiler is the actor.
● Target Code: Optimizing the target code is done by the compiler. Usage of registers, and
select and move instructions are part of the optimization involved in the target code.
● Local Optimization: Transformations are applied to small basic blocks of statements.
Techniques followed are Local Value Numbering and Tree Height Balancing.
● Regional Optimization: Transformations are applied to Extended Basic Blocks. Techniques
followed are Super Local Value Numbering and Loop Unrolling.
● Global Optimization: Transformations are applied to large program segments that include
functions, procedures, and loops. Techniques followed are Live Variable Analysis and Global
Code Replacement.
● Interprocedural Optimization: As the name indicates, the optimizations are applied inter
procedurally. Techniques followed are Inline Substitution and Procedure Placement.
Advantages of Code Optimization:
Improved performance: Code optimization can result in code that executes faster and uses fewer
resources, leading to improved performance.
Reduction in code size: Code optimization can help reduce the size of the generated code, making it
easier to distribute and deploy.
Increased portability: Code optimization can result in code that is more portable across different
platforms, making it easier to target a wider range of hardware and software.
Reduced power consumption: Code optimization can lead to code that consumes less power, making it
more energy-efficient.
Improved maintainability: Code optimization can result in code that is easier to understand and
maintain, reducing the cost of software maintenance.
L39

Data Flow Analysis in Control Flow Graphs:


1. Control Flow Graph (CFG):
● A Control Flow Graph is a representation of a program's control flow. It consists of
nodes representing basic blocks (sequential code with no branches within) and
directed edges representing the flow of control between these blocks.
2. Data Flow Analysis:
● Data flow analysis involves examining how data values propagate through a program.
It helps identify how variables are defined, used, and modified within different parts
of the code.
3. Reaching Definitions:
● Reaching definitions analysis determines the points in a program where the value
assigned to a variable is guaranteed to reach during execution. This information is
essential for understanding the flow of values through the program.
4. Use-Def Chains and Def-Use Chains:
● Use-Def chains represent the uses of a variable that are defined somewhere else in the
program, while Def-Use chains represent the points in the program where a variable
is defined and subsequently used.
5. Live Variables:
● Live variable analysis identifies variables whose values may be used later in the
program. Eliminating unnecessary computations on non-live variables is a common
optimization technique.
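For example (an illustrative block; assume only d is live at the end of the block):
a = b + c      // a is used in the next statement, so a is live until that use
d = a * 2      // d is live on exit from the block
e = b - c      // e is never used afterwards, so it is not live and this statement can be removed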
L40
Symbolic Debugging of Optimized Code:
1. Optimization and Symbolic Debugging:
● Optimized code generated by compilers can be challenging to debug using traditional
methods because the optimized code may not precisely reflect the source code's
structure. Symbolic debugging attempts to address this issue by providing a way to
map optimized code back to the original source code.
2. Debug Information:
● Debug information, generated by the compiler, includes mappings between the
optimized code and the original source code. This information may include variable
names, line numbers, and other metadata.
3. Symbolic Expressions:
● Symbolic debugging involves working with symbolic expressions that relate the
optimized code to the source code. Developers can use these symbolic expressions to
understand how the optimized code corresponds to their original programming
constructs.
4. Breakpoints and Inspection:
● Symbolic debugging tools allow developers to set breakpoints in the source code and
inspect variables at different points in the program. The debugger translates these
breakpoints and inspections into corresponding locations in the optimized code.
5. Step-by-Step Execution:
● Symbolic debuggers enable step-by-step execution through the source code, even if
the actual execution involves optimized code. This helps developers understand the
flow of execution and identify discrepancies between the optimized and source code.
6. Variable Tracking:
● Symbolic debugging tools can track variables and their values as they change during
execution, providing insights into the data flow and helping developers understand
the impact of optimizations.
7. Source-Level Views:
● Some symbolic debuggers offer source-level views of the optimized code, allowing
developers to navigate and debug the program as if it were still in its original,
unoptimized form.