
STUDY MATERIAL

REGULATION : 2022
PROGRAMME : M.Sc., Computer Science
COURSE TYPE : CORE COURSE IV
COURSE TITLE : COMPILER DESIGN
COURSE CODE : P22CSCC22
CREDITS : 5
SEMESTER/YEAR : II/I

Compiled By
MS.M.LAKSHMI PRIYA
Assistant Professor
Department of Computer Science


COURSE OBJECTIVES:
∙ Define the design and intrinsic functioning of compilers
∙ Identify the purpose and functions of phases of the compiler
∙ Describe the contents and data structures of the symbol table and the handling of errors
UNIT – I INTRODUCTION TO COMPILERS: Compilers - Analysis - Synthesis model of
compilation - Analysis of the source program - The phases of a compiler -Cousins of the compiler -
Compiler construction tools - Error handling.
UNIT – II LEXICAL ANALYZER: Lexical analysis - Role of lexical analyzer - Tokens, Patterns and
lexemes - Input buffering - Specification of tokens - Regular expressions - Recognition of tokens -
Transition diagrams - Implementing a transition diagram - Finite Automata - Regular expression to NFA
- Conversion of NFA to DFA
UNIT – III SYNTAX ANALYZER: Syntax analysis - Role of parser - Context-free grammars -
Derivations - Writing a grammar - Top Down parsing - Recursive descent parsing - Predictive parsers -
Non-recursive predictive parsers - Construction of predictive parsing tables - Bottom up parsing -
Handles - Shift reduce parser - Operator precedence parsing - LR parsers - Canonical collection of LR
(0) items -Constructing SLR parsing tables.
UNIT – IV INTERMEDIATE CODE GENERATION: Intermediate code generation - Intermediate
languages - Graphical Representation - Three Address Code - Assignment statements - Boolean
expressions - Flow of Control Statements - Case Statements - Syntax directed translation of case
statements
UNIT – V CODE OPTIMIZATION AND CODE GENERATION: An Organization for an
Optimizing Compiler - the Principle sources of optimization - Function Preserving Transformations -
Common Subexpression - Copy propagation - Optimization of basic blocks - The use of Algebraic
identities - Loops in flow graphs - Code generation - issues in the design of a code generator - The target
machine.
UNIT – VI CURRENT CONTOURS (For continuous internal assessment only): Contemporary
Developments Related to the Course during the Semester Concerned
REFERENCES
1. "Compilers: Principles, Techniques, and Tools", Alfred V. Aho, Monica S. Lam, Ravi Sethi, Jeffrey
D. Ullman, Second Edition, Pearson Addison Wesley, 2007.
2. Compiler Construction Principles and Practice – D.M. Dhamadhere, McMillan India Ltd., Madras,
1983.
3. Alfred V. Aho, Ravi Sethi and Jeffrey D Ullman, "Compilers, Principles, Techniques and Tools",
Addison Wesley Longman (Singapore Pvt. Ltd.), 2011.
4. Alfred V. Aho, Jeffrey D Ullman, "Principles of Compiler Design", Addison Wesley, 1988.
5. David Galles, "Modern Compiler Design", Pearson Education, 2008
UNIT-I
1.​ INTRODUCTION TO COMPILER
● A compiler is software that typically takes code written in a high-level language (like C++ or Java) as input and converts it to a lower-level language all at once.
● It lists all the errors if the input code does not follow the rules of its language.
● A compiler is a translating program that translates the instructions of a high-level language into machine-level language.
● The program given as input to the compiler is called the Source program.
● The program produced by the compiler in machine-level language is known as the Object code.
● The main purpose of a compiler is to translate code written in one language into another language without changing the meaning of the program.
● In the first part, the source program is compiled and translated into the object program (low-level language).
● In the second part, the object program is translated into the target program by the assembler.

1.1.ANALYSIS - SYNTHESIS MODEL OF COMPILATION


●​Analysis Phase is also known as the front-end of the compiler.
●​The analysis phase generates an intermediate representation of the source program and symbol table,
which should be fed to the Synthesis phase as input.
The analysis consists of three phases:
1. Linear analysis, in which the stream of characters making up the source program is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning.
2. Hierarchical analysis, in which characters or tokens are grouped hierarchically into nested collections
with collective meaning.
3. Semantic analysis, in which certain checks are performed to ensure that the components of a program
fit together meaningfully. A compiler operates in phases, each of which transforms the source program
from one representation to another
●​The Synthesis phase is also known as the back-end of the compiler.
●​The synthesis phase generates the target program with the help of intermediate source code
representation and symbol table.
●​The synthesis phase, also known as the code generation or code optimization phase, is the final step of
a compiler.
●​It takes the intermediate code generated by the front end of the compiler and converts it into machine
code or assembly code, which can be executed by a computer.
●​The intermediate code can be in the form of an abstract syntax tree, intermediate representation, or some
other form of representation.
●​After the semantic analysis phase completes, the compiler generates an intermediate representation (IR)
of the source code.
●​Synthesis Phase consist of
1.​ Code Optimization
2.​ Code generation

1.2. THE PHASES OF A COMPILER


●​The compilation process contains the sequence of various phases.
●​We basically have two phases of compilers, namely the Analysis phase and Synthesis phase.
●​ The analysis phase creates an intermediate representation from the given source code.
●​The synthesis phase creates an equivalent target program from the intermediate representation
●​ Each phase takes source program in one representation and produces output in another representation.
Each phase takes input from its previous stage.
●​The process of converting the source code into machine code involves several phases or stages, which
are collectively known as the phases of a compiler
●​The 6 phases of a compiler are:
1.​ Lexical Analysis
2.​ Syntax Analysis
3.​ Semantic Analysis
4.​ Intermediate Code Generation
5.​ Code Optimization
6.​ Code Generation
1.​ Lexical Analysis
●​The first phase of a compiler is lexical analysis, also known as scanning or linear analysis.
●​This phase reads the source code and breaks it into a stream of tokens, which are the basic units of the
programming language.
●​ The lexical analyzer reads the stream of characters making up the source program and groups the
characters into meaningful sequences called lexemes
●​The tokens are then passed on to the next phase for further processing.

x = y + 10

Lexeme   Token                 Attribute
x        Identifier            id1
=        Assignment operator   =
y        Identifier            id2
+        Addition operator     +
10       Number (constant)     10
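A minimal sketch in C of how a scanner might produce such tokens for the statement above. The token kinds, the Token structure, and the next_token function are illustrative, assumed names rather than part of any particular compiler, and bounds checks are omitted for brevity:

#include <ctype.h>
#include <stdio.h>

/* Hypothetical token kinds for this tiny example. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_PLUS, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* the matched characters, e.g. "x" or "10" */
} Token;

/* Read the next token starting at *pos in src and advance *pos. */
static Token next_token(const char *src, int *pos) {
    Token t = { TOK_EOF, "" };
    int n = 0;
    while (src[*pos] == ' ') (*pos)++;            /* skip blanks       */
    char c = src[*pos];
    if (c == '\0') return t;
    if (isalpha((unsigned char)c)) {              /* identifier        */
        while (isalnum((unsigned char)src[*pos])) t.lexeme[n++] = src[(*pos)++];
        t.kind = TOK_ID;
    } else if (isdigit((unsigned char)c)) {       /* number constant   */
        while (isdigit((unsigned char)src[*pos])) t.lexeme[n++] = src[(*pos)++];
        t.kind = TOK_NUM;
    } else {                                      /* single-char ops   */
        t.lexeme[n++] = src[(*pos)++];
        t.kind = (c == '=') ? TOK_ASSIGN : TOK_PLUS;
    }
    t.lexeme[n] = '\0';
    return t;
}

int main(void) {
    const char *src = "x = y + 10";
    int pos = 0;
    for (Token t = next_token(src, &pos); t.kind != TOK_EOF; t = next_token(src, &pos))
        printf("kind=%d lexeme=%s\n", t.kind, t.lexeme);
    return 0;
}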

2.​ Syntax Analysis


●​Syntax analysis is the second phase of compilation process.
●​It is also called Parsing or Hierarchical Analysis.
●​ It takes tokens as input and generates a parse tree as output.
●​In syntax analysis phase, the parser checks that the expression made by the tokens is syntactically
correct or not.
●​The tokens from the previous phase are used to create an intermediate tree-like data structure known as
the syntax tree in this phase.
●​Each node has an operator, and the operator's operands are the node's children.
x = y+z; (Correct)
y+z = x; (Produces an error)

Example expression to be parsed into a syntax tree: (a+b)*c
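The following C sketch shows one possible shape for such a syntax tree and builds the tree for (a+b)*c. The Node structure and helper functions are illustrative assumptions, not a fixed design:

#include <stdio.h>
#include <stdlib.h>

/* Each interior node holds an operator; leaves hold an operand name. */
typedef struct Node {
    char op;                 /* '+', '*' for interior nodes, 0 for leaves */
    char name;               /* 'a', 'b', 'c' for leaf nodes              */
    struct Node *left, *right;
} Node;

static Node *leaf(char name) {
    Node *n = calloc(1, sizeof(Node));
    n->name = name;
    return n;
}

static Node *binary(char op, Node *l, Node *r) {
    Node *n = calloc(1, sizeof(Node));
    n->op = op; n->left = l; n->right = r;
    return n;
}

/* Print the tree in fully parenthesized (inorder) form. */
static void show(const Node *n) {
    if (n->op == 0) { printf("%c", n->name); return; }
    printf("(");
    show(n->left); printf("%c", n->op); show(n->right);
    printf(")");
}

int main(void) {
    /* Syntax tree for (a+b)*c: '*' at the root, '+' as its left child. */
    Node *tree = binary('*', binary('+', leaf('a'), leaf('b')), leaf('c'));
    show(tree);                 /* prints ((a+b)*c) */
    printf("\n");
    return 0;
}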

3. Semantic Analysis
●​Semantic analysis is the third phase of compilation process.
●​It checks whether the parse tree follows the rules of language.
●​An important part of semantic analysis is type checking, where the compiler checks that each operator
has matching operands.
●​For example, many programming language definitions require an array index to be an integer; the
compiler must report an error if a floating-point number is used to index an array.
●​The semantic analyzer keeps track of identifiers, their types and expressions; whether identifiers are
declared before use or not etc.
●​A method called with the wrong arguments, an undeclared variable, a type mismatch, incompatible
operands, etc. will all be checked for by Semantic Analyzer.
float x = 20.2;
float y = x*30;
●​Before multiplication, the semantic analyzer in the code above will typecast the integer 30 to float 30.0.
4. Intermediate Code Generation
● In intermediate code generation, the compiler translates the source code into an intermediate code.
●​Intermediate code is generated between the high-level language and the machine language.
●​A code that is neither high-level nor machine code, but a middle-level code is an intermediate code.
●​We can translate this code to machine code later.
●​This stage serves as a bridge or way from analysis to synthesis.

Example of Intermediate Code Generation
Total = count + rate * 5
Intermediate code with the help of address code method is
t1 = int_to_float (5)
t2 = rate * t1
t3 = count + t2
Total = t3
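One common way to store such three-address code is as a list of quadruples (operator, two arguments, result). Below is a small sketch in C holding the four statements above; the Quad structure and the "itof" operator name are assumptions made for illustration:

#include <stdio.h>

/* One three-address instruction, stored as a quadruple:
   result := arg1 op arg2  ("itof" marks the int-to-float conversion). */
typedef struct {
    const char *op, *arg1, *arg2, *result;
} Quad;

int main(void) {
    /* Intermediate code for  Total = count + rate * 5  */
    Quad code[] = {
        { "itof", "5",     "",   "t1" },    /* t1 = int_to_float(5) */
        { "*",    "rate",  "t1", "t2" },    /* t2 = rate * t1       */
        { "+",    "count", "t2", "t3" },    /* t3 = count + t2      */
        { "=",    "t3",    "",   "Total" }  /* Total = t3           */
    };
    for (int i = 0; i < 4; i++)
        printf("%-6s %-6s %-4s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}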
5. Code Optimization
●​Code optimization is an optional phase.
●​It is used to improve the intermediate code so that the output of the program could run faster and take
less space.
●​ It removes the unnecessary lines of the code and arranges the sequence of statements in order to speed
up the program execution.
t2=rate*5.0
Total=count+t2
6. Code Generation
●​Code generation is the final stage of the compilation process.
●​ It takes the optimized intermediate code as input and maps it to the target machine language.
●​Code generator translates the intermediate code into the machine code of the specified computer.
Symbol Table
●​It is a data-structure maintained throughout all the phases of a compiler.
●​ All the identifier's names along with their types are stored here.
●​The symbol table makes it easier for the compiler to quickly search the identifier record and retrieve it.
●​The symbol table connects or interacts with all phases of the compiler and error handler for updates
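A very small sketch in C of a symbol table kept as a linear list of (name, type) records with insert and lookup operations. The structure, array size, and function names are illustrative assumptions; real compilers usually use a hash table for faster searching:

#include <stdio.h>
#include <string.h>

/* One symbol-table record: identifier name and its declared type. */
typedef struct {
    char name[32];
    char type[16];
} Symbol;

static Symbol table[100];
static int count = 0;

/* Insert a new identifier, or return its index if already present. */
static int insert(const char *name, const char *type) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;                       /* already declared */
    strcpy(table[count].name, name);
    strcpy(table[count].type, type);
    return count++;
}

/* Look an identifier up; returns -1 if it was never declared. */
static int lookup(const char *name) {
    for (int i = 0; i < count; i++)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}

int main(void) {
    insert("count", "int");
    insert("rate", "float");
    int i = lookup("rate");
    printf("rate -> %s\n", i >= 0 ? table[i].type : "undeclared");
    printf("sum  -> %s\n", lookup("sum") >= 0 ? "declared" : "undeclared");
    return 0;
}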
Error Handling Routine
●​ An Error Handling Routine is a set of instructions implemented within a compiler to detect and handle
errors during compilation.
●​ An error is a blank entry in the symbol table.
●​ The Errors may occur in all phases of the compiler.
●​ Whenever a phase of the compiler discovers an error, it must report it to the error handler.
1.3. COUSINS OF COMPILER/LANGUAGE PROCESSING SYSTEM
●​ We have learnt that any computer system is made of hardware and software.
●​ The hardware understands a language, which humans cannot understand.
●​ So we write programs in high-level language, which is easier for us to understand and remember.
●​ These programs are then fed into a series of tools and OS components to get the desired code that can be
used by the machine. This is known as Language Processing System.
●​ The high-level language is converted into binary language in various phases.
●​ A compiler is a program that converts high-level language to assembly language.
●​ Similarly, an assembler is a program that converts the assembly language to machine-level language.
Preprocessor
● The preprocessor handles all the #include directives by performing file inclusion, and all the #define directives by performing macro expansion.
●​Preprocessors produce input to compilers. They may perform the following functions:
1. Macro processing. A preprocessor may allow a user to define macros that are shorthands for longer
constructs.
#define PI 3.14

2. File inclusion. A preprocessor may include header files into the program text. For example, the C preprocessor replaces the #include statement below with the contents of the file stdio.h when it processes a file containing this statement.
#include <stdio.h>
Interpreter/Compiler
●​An interpreter, like a compiler, translates high-level language into low-level machine language.
●​ The difference lies in the way they read the source code or input.
● A compiler reads the whole source code at once, creates tokens, checks semantics, generates intermediate code, translates the whole program, and may involve many passes.
●​ In contrast, an interpreter reads a statement from the input, converts it to an intermediate code, executes
it, then takes the next statement in sequence.
●​If an error occurs, an interpreter stops execution and reports it. Whereas a compiler reads the whole
program even if it encounters several errors.

Assembler
●​An assembler translates assembly language programs into machine code.
●​The output of an assembler is called an object file, which contains a combination of machine
instructions as well as the data required to place these instructions in memory.
There are two types of assemblers-
● One-Pass assembler: It goes through the source code (the output of the compiler) only once and assumes that all symbols will be defined before any instruction that references them.
● Similarly, if all the phases of compiler design are combined into a single module, the result is known as a single-pass compiler.
●​ Two-Pass assembler: Two-pass assemblers work by creating a symbol table with the symbols and their
values in the first pass, and then using the symbol table in a second pass, they generate code.
●​ A Two pass/multi-pass Compiler is a type of compiler that processes the source code or abstract syntax
tree of a program multiple times. In multi-pass Compiler, we divide phases into two passes as: First part
is referred to as front end and second part is referred to as Back end.
Linker
●​Linker is a computer program that links and merges various object files together in order to make an
executable file.
●​All these files might have been compiled by separate assemblers.
● The major task of a linker is to search and locate referenced modules/routines in a program and to determine the memory locations where these codes will be loaded, so that the program instructions have absolute references.
Loader
● The loader is a part of the operating system and is responsible for loading executable files into memory and executing them. It calculates the size of a program (instructions and data) and creates memory space for it.
1.4. COMPILER CONSTRUCTION TOOLS
●​ Compiler Construction Tools are specialized tools that help in the implementation of various phases of
a compiler. These tools help in the creation of an entire compiler or its parts.
●​ These systems have often been referred to as compiler-compilers, compiler-generators, or
translator-writing systems.
●​ Some of the commonly used compiler constructions tools are:-
1.​Scanner Generator
2.​Parser Generator
3.​Syntax Directed Translation Engines
4.​Automatic Code Generators
5.​Data-Flow Analysis Engines
Scanner Generator
●​ Scanner Generator generates lexical analyzers from the input that consists of regular
expression descriptions based on tokens of a language.
●​It generates a finite automaton to recognize the regular expression. Example: Lex
Parser Generator
●​It produces syntax analyzers (parsers) from the input that is based on a grammatical description of
programming language or on a context-free grammar.
● It is useful because the syntax analysis phase is highly complex and consumes a large amount of manual effort and compilation time. Example: Yacc

Syntax Directed Translation Engines


●​ It generates intermediate code with three address format from the input that consists of a parse tree.
●​ These engines have routines to traverse the parse tree and then produces the intermediate code. In this,
each node of the parse tree is associated with one or more translations.
Automatic Code Generators
●​It generates the machine language for a target machine.
●​ Each operation of the intermediate language is translated using a collection of rules and then is taken as
an input by the code generator.
●​A template matching process is used.
●​An intermediate language statement is replaced by its equivalent machine language statement using
template.
Data-Flow Engines
●​It is used in code optimization.
●​Data flow analysis is a key part of the code optimization that gathers the information, that is the values
that flow from one part of a program to another.
1.5. ERROR HANDLING.
●​The tasks of the Error Handling process are to detect each error, report it to the user, and then make
some recovery strategy and implement them to handle the error.
●​Features of an Error handler:
1.​ Error Detection
2.​ Error Reporting
3.​ Error Recovery
Error handler = Error Detection + Error Report + Error Recovery
●​An Error is the blank entries in the symbol table.
●​Errors in the program should be detected and reported by the parser.
●​Whenever an error occurs, the parser can handle it and continue to parse the rest of the input.
●​Although the parser is mostly responsible for checking for errors, errors may occur at various stages of
the compilation process.
There are three kinds of errors:
1.​ Compile-time errors
2.​ Runtime errors
3.​ Logical errors

COMPILE TIME ERRORS


●​ Compile-time errors appear during the compilation process before the program is executed. This
can be due to a syntax error or a missing file reference that stops the application from compiling
properly.
●​ Types of Compile Time Errors
1.​ Lexical Phase errors
2.​ Syntactic phase errors
3.​ Semantic errors

Lexical Phase Errors


●​ Misspellings of identifiers, keywords, or operators fall into this category. These errors occur both at the
lexical phase and during program execution.
●​ When a series of characters does not satisfy the pattern of any token, a lexical error occurs.
●​ Exceeding length of identifier or numeric constants.
●​ The appearance of illegal characters
●​ Unmatched string
Example 1 : printf("Compiler Design");$
This is a lexical error since an illegal character $ appears at the end of statement.
Example 2 : This is a comment */
This is a lexical error since the end-of-comment marker is present but the beginning is not.
Syntactic Phase Errors
●​These problems arise during the syntactic phase and execution.
●​ These issues occur when there is an imbalance in the parenthesis or when some operators are missing.
●​For example, a semicolon that isn't there or a parenthesis that isn't balanced.
●​Errors in structure
●​Missing operator
●​Misspelled keywords
●​Unbalanced parenthesis
swich(ch)
{
.......
.......
}
● The keyword switch is incorrectly written as swich.
●​Hence, an “Unidentified keyword/identifier” error occurs.
void printHelloNinja( String s )
{
// function - body
// missing closing brace
x = a + b * c //missing semicolon
a = (b+c * (c+d); //missing closing parentheses
i = j * + c ; // missing argument between “*” and “+”
Semantic Errors
●​ These errors are detected during the compilation time of a program when they occur during the semantic
analysis step.
●​ These problems occur when operators, variables, or undeclared variables are used incorrectly.
●​ Operands of incompatible types
●​ Variable not declared
●​ The failure to match the actual argument with the formal argument
Use of an undeclared variable
#include <stdio.h>
int main()
{
int a = 0, b = 7;
sum = a + b; // sum is undefined
return 0;
}
Incompatible Types
#include <stdio.h>
int main()
{
int a = 0;
int b = "ABC"; // b is having non-integer value
return 0;
}
RUNTIME ERRORS
●​A run-time error occurs during the execution of a program and is most commonly caused by incorrect
system parameters or improper input data.
●​This can include a lack of memory to run an application, a memory conflict with another software, or a
logical error, which is an example of run-time error.
LOGICAL ERRORS
●​ When programs execute poorly and yet don't terminate abnormally, logic errors occur.
●​ A logic error can result in unexpected or unwanted outputs or other behavior, even if it is not
immediately identified as such.
●​ These are errors that occur when the specified code is unreachable or when an infinite loop is present.
FINDING ERROR OR REPORTING AN ERROR – Viable-prefix is the property of a parser which allows
early detection of syntax errors.
●​Goal: detection of an error as soon as possible without further consuming unnecessary input
●​How: detect an error as soon as the prefix of the input does not match a prefix of any string in the​
language.
● Example: for(;) will be reported as an error, since a for statement requires two semicolons inside its parentheses.
ERROR RECOVERY
●​The basic requirement for the compiler is to simply stop and issue a message, and cease compilation.
1. Panic Mode Recovery
● This method involves removing successive characters from the input one by one until a synchronizing token is found.
● Delimiters such as a semicolon or an opening or closing parenthesis are examples of synchronizing tokens.
For example,
int a, $b, sum, 5z;
● Since the variable declarations start with an invalid symbol ($) and a digit (5), panic mode recovery will discard the offending input until a synchronizing token such as a comma or semicolon is found.
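A rough sketch in C of this idea, treating input characters as tokens, flagging '$' or a digit as the error, and using ',' and ';' as the synchronizing tokens. The function name and the error test are purely illustrative (a real scanner would only flag a digit that begins an identifier):

#include <stdio.h>

/* Panic-mode sketch: on an unexpected character, skip input until a
   synchronizing token (',' or ';') is reached, then resume. */
static void parse_declaration(const char *input) {
    int i = 0;
    while (input[i] != '\0') {
        char c = input[i];
        if (c == '$' || (c >= '0' && c <= '9')) {   /* unexpected token */
            printf("error near '%c': skipping to next ',' or ';'\n", c);
            while (input[i] != ',' && input[i] != ';' && input[i] != '\0')
                i++;                                /* discard input    */
            if (input[i] == '\0')
                break;                              /* no sync token left */
            printf("synchronized at '%c'\n", input[i]);
        }
        i++;
    }
}

int main(void) {
    parse_declaration("int a, $b, sum, 5z;");
    return 0;
}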
2. Phase level recovery /Statement Mode Recovery
●​ When an error is discovered, the parser performs local correction on the remaining input.
●​ If a parser encounters an error, it makes the necessary corrections on the remaining input so that the
parser can continue to parse the rest of the statement.
● We can correct the error by deleting extra semicolons, replacing commas with semicolons, or reintroducing missing semicolons.
3. Global Correction
●​ The parser examines the whole program and tries to find out the closest match for it which is error-free.
● The closest-match program requires the fewest insertions, deletions, and changes of tokens to recover from the erroneous input.

UNIT-II
1.LEXICAL ANALYSIS
●​ Lexical Analysis is the first phase of a compiler that takes the input as a source code written in a
high-level language.
●​ The purpose of lexical analysis is that it aims to read the input code and break it down into meaningful
elements called tokens.
●​ Those tokens are turned into building blocks for other phases of compilation.
●​ These tokens can be individual words or symbols in a sentence, such as keywords, variable names,
numbers, and punctuation.
●​ It is also known as a scanner.
2. ROLE OF LEXICAL ANALYZER
●​ The lexical analyzer is the first phase of a compiler. Its main task is to read the input characters and
produce as output a sequence of tokens that the parser uses for syntax analysis
●​ The lexical analyzer is responsible for removing the white spaces and comments from the source
program.
● It correlates error messages from the compiler with the source program.
●​ It helps to identify the tokens.
●​ The input characters are read by the lexical analyzer from the source code.
3. TOKENS, PATTERNS AND LEXEMES
Token
●​A token is the smallest unit of meaningful data; it may be an identifier, keyword, operator, or symbol. A
token represents a series or sequence of characters that cannot be decomposed further.
1.​ Keywords : Those reserved words in C like ` int `, ` char `, ` float `, ` const `, ` goto `, etc.
2.​ Identifiers: Names of variables and user-defined functions.
3.​ Operators : ` + `, ` – `, ` * `, ` / `, etc.
4. Delimiters/Punctuators: Symbols such as commas (,), semicolons (;), and braces ({}).
By and large, tokens may be divided into three categories:
1.​ Terminal Symbols (TRM) : Keywords and operators.
2.​ Literals (LIT) : Values like numbers and strings.
3.​ Identifiers (IDN) : Names defined by the user.
Example 1:
int a = 10; //Input Source code ​
Tokens​
int (keyword), a(identifier), =(operator), 10(constant) and ;(punctuation-semicolon)
Answer – Total number of tokens = 5
Example 2:
int main() {​

// printf() sends the string inside quotation to​
// the standard output (the display)​
printf("Compiler Design");​
return 0;​
}​
Tokens​
'int', 'main', '(', ')', '{', 'printf', '(', '"Compiler Design"', ')', ';', 'return', '0', ';', '}'
Answer – Total number of tokens = 14
Lexeme
●​A lexeme is a sequence of characters in the source program that is matched by the pattern for a token.
●​const pi = 3.1416;
●​The substring pi is a lexeme for the token “identifier.”
Pattern
●​A pattern is a rule or syntax that designates how tokens are identified in a programming language.
●​For a keyword to be identified as a valid token, the pattern is the sequence of characters that make the
keyword.
●​For identifier to be identified as a valid token, the pattern is the predefined rules that it must start with
alphabet, followed by alphabet or a digit.

Token         Lexeme    Pattern
Keyword       while     The character sequence w-h-i-l-e
Relop         <         < or > or >= or <= or != or ==
Integer       7         A sequence of digits (0-9) with at least one digit
String        "Hi"      Characters enclosed by " "
Punctuation   ,         ; , . ! etc.
Identifier    number    A sequence of letters (A-Z, a-z) and digits, beginning with a letter

4. INPUT BUFFERING
●​Input buffering is an important concept in compiler design that refers to the way in which the compiler
reads input from the source code
●​In many cases, the compiler reads input one character at a time, which can be a slow and inefficient
process.
●​Input buffering is a technique that allows the compiler to read input in larger chunks, which can improve
performance and reduce overhead.
●​The basic idea behind input buffering is to read a block of input from the source code into a buffer, and
then process that buffer before reading the next block.
●​The size of the buffer can vary depending on the specific needs of the compiler and the characteristics of
the source code being compiled.
●​ For example, a compiler for a high-level programming language may use a larger buffer than a
compiler for a low-level language.
●​Lexical Analysis scans input string from left to right one character at a time to identify tokens. It uses
two pointers to scan tokens
●​Begin Pointer (bptr)/Lexeme Begin − It points to the beginning of the string to be read.
●​ Look Ahead Pointer (lptr) /Forward− It moves ahead to search for the end of the token
For the statement int a, b;
● Both pointers start at the beginning of the string, which is stored in the buffer.
● The look-ahead pointer scans the buffer until the token is found.
● The character beyond the token ("int"), here a blank space, has to be examined before the token ("int") can be determined.
● After processing the token ("int"), both pointers are set to the next token ('a'), and this process is repeated for the whole program.


Sentinels
●​ Sentinels play a vital role in managing buffer pairs effectively. A sentinel is a special character,
often used at the end of the input buffer, to mark its boundaries.
●​ The sentinel is a special character that should not be a part of the source code.
●​ An eof character is used as a Sentinel
●​ Sentinels are essential for two primary reasons:
1. Boundary Detection: Sentinels provide a clear demarcation of the end of the loaded input in the buffer. This ensures that the compiler knows where to stop reading characters and where a lexeme must end.
2. Safety against Buffer Overflow: Sentinels act as guards against buffer overflow. By signalling the end of the buffer, they prevent the forward pointer from running past the loaded input, which could lead to data corruption and unpredictable behaviour.
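A simplified sketch in C of this scheme, using a single buffer half and the character '\0' as the sentinel. The buffer size, the sentinel choice, and the variable names are assumptions made only for illustration; the full buffer-pair scheme alternates between two such halves:

#include <stdio.h>
#include <string.h>

#define BUF_SIZE 16               /* illustrative half-buffer size      */
#define EOF_CHAR '\0'             /* the sentinel placed after the data */

int main(void) {
    char buf[BUF_SIZE + 1];
    const char *source = "int a, b;";
    strncpy(buf, source, BUF_SIZE);
    buf[strlen(source)] = EOF_CHAR;       /* plant the sentinel */

    int lexeme_begin = 0, forward = 0;
    /* Advance forward until the first blank ends the lexeme "int".
       The sentinel lets the end-of-input test be a single comparison. */
    while (buf[forward] != EOF_CHAR && buf[forward] != ' ')
        forward++;
    printf("lexeme: %.*s\n", forward - lexeme_begin, &buf[lexeme_begin]);
    return 0;
}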
5. SPECIFICATION OF TOKENS
●​ Regular expressions are an important notation for specifying patterns. Each pattern matches a set
of strings.
●​ Specification of tokens depends on the pattern of the lexeme
● Regular expressions are used to specify the different types of patterns that can form tokens.
● Regular expressions alone, however, are insufficient for specifying every pattern that can form a token.
There are 3 specifications of tokens:
1.​ String
2.​ Language
3.​ Regular Expression
1. String
●​ An alphabet or character class is a finite set of symbols.
●​ A string over an alphabet is a finite sequence of symbols drawn from that alphabet.
● In language theory, the term "word" is often used as a synonym for "string".
●​ The length of a string s, usually written |s|, is the number of occurrences of symbols in s. For example,
"banana" is a string of length six.
●​ The empty string, denoted ε, is the string of length zero.
Operations on String:
Prefix of String
●​ The prefix of String S is any string that is extracted by removing zero or more characters from the
end of string S. For example, if the String is "NINJA", the prefix can be "NIN" which is obtained by
removing "JA" from that String.
●​For Example: s = abcd
●​ The prefix of the string abcd: ∈, a, ab, abc, abcd
Suffix of String
●​ The suffix of string S is any string that is extracted by removing any number of characters from
the beginning of string S. For example, if the String is "NINJA", the suffix can be "JA," which is
obtained by removing "NIN" from that String.
●​ For Example: s = abcd
●​ Suffix of the string abcd: ∈, d, cd, bcd, abcd
Proper Prefix of String
●​ The proper prefix of the string includes all the prefixes of the string excluding ∈ and the string(s)
itself.
●​ Proper Prefix of the string abcd: a, ab, abc.
Proper Suffix of String
●​ The proper suffix of the string includes all the suffixes excluding ∈ and the string(s) itself.​
Proper Suffix of the string abcd: d, cd, bcd
Substring
●​A substring of a string S is any string obtained by removing any prefixes and suffixes of that String. For
example, if the String is "AYUSHI," then the substring can be "US," which is formed by removing the
prefix "AY" and suffix "HI." Every String is a substring of itself.
Subsequence​
● A subsequence of s is any string formed by deleting zero or more not-necessarily-contiguous symbols from s; e.g., baaa is a subsequence of banana.
Concatenation
●​Concatenation is defined as the addition of two strings. For example, if we have two strings S=" Cod"
and T=" ing," then the concatenation ST would be "Coding."
2. Language
●​ A language can be defined as a finite set of strings over some symbols or alphabets.
Operations on Language
Union
●​ Union is one of the most common operations we perform on a set. In terms of languages also, it
will hold a similar meaning.
●​ Suppose there are two languages, L and S. Then the union of these two languages will be
L ∪ S will be equal to { x | x belongs to either L or S }
For example If L = {a, b} and S = {c, d}Then L ∪ S = {a, b, c, d}
Concatenation
●​ Concatenation links the string from one language to the string of another language in a series in all
possible ways. The concatenation of two different languages is denoted by:​
L ⋅ M = {st | s is in L and t is in M} If L = {a, b} and M = {c, d}
Then L ⋅ M = {ac, ad, bc, bd}
Kleene Closure
●​ Kleene closure of a language L provides you with a set of strings. This set of strings is obtained
by concatenating L zero or more time. The Kleene closure of the language L is denoted by:​
If L = {a, b} then L* = {∈, a, b, aa, bb, aaa, bbb, …}
Positive Closure
● The positive closure of a language L provides a set of strings obtained by concatenating L one or more times. It is denoted by L+.
● It is similar to the Kleene closure, except that the term L0 is omitted; that is, L+ excludes ∈ unless ∈ is in L itself.
If L = {a, b} then L+ = {a, b, aa, bb, aaa, bbb, …}
3. Regular Expression
●​A regular expression is a sequence of symbols used to specify lexeme patterns. A regular expression is
helpful in describing the languages that can be built using operators such as union, concatenation, and
closure over the symbols.
●​ Each regular expression r denotes a language L(r).
●​A regular expression ‘r’ that denotes a language L(r) is built recursively over the smaller regular
expression using the rules given below.
●​The following rules define the regular expression over some alphabet Σ and the languages denoted by
these regular expressions.
1.​ ∈ is a regular expression that denotes a language L(∈). The language L(∈) has a set of strings {∈} which
means that this language has a single empty string.
2.​ If there is a symbol ‘a’ in Σ then ‘a’ is a regular expression that denotes a language L(a). The language
L(a) = {a} i.e. the language has only one string of length one and the string holds ‘a’ in the first position.
3.​ Consider the two regular expressions r and s then:
●​ r|s denotes the language L(r) ∪ L(s).
●​ (r) (s) denotes the language L(r) ⋅ L(s).
●​ (r)* denotes the language (L(r))*.
● (r)+ denotes the language (L(r))+.
●​ The unary operator * has the highest precedence and is left associative,
●​ Concatenation has the second highest precedence and is left associative,
●​ | has the lowest precedence and is left associative.
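As a small illustration of specifying a token pattern with a regular expression, the sketch below uses the POSIX <regex.h> functions regcomp and regexec to test candidate lexemes against an identifier pattern (a letter followed by letters or digits). The pattern string and the sample lexemes are chosen only for illustration, and a POSIX environment is assumed:

#include <regex.h>
#include <stdio.h>

int main(void) {
    regex_t id;
    /* Identifier pattern: letter (letter | digit)*, as an extended RE. */
    regcomp(&id, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED);

    const char *samples[] = { "count", "x1", "5z", "rate_" };
    for (int i = 0; i < 4; i++) {
        int ok = (regexec(&id, samples[i], 0, NULL, 0) == 0);
        printf("%-6s %s\n", samples[i], ok ? "identifier" : "no match");
    }
    regfree(&id);
    return 0;
}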
​ 5. RECOGNITION OF TOKENS
● We now address the question of how tokens are recognized. Consider a grammar fragment for branching (if-then-else) statements whose terminals if, then, else, relop, id, and num generate sets of strings given by regular definitions.

●​ For this language fragment the lexical analyzer will recognize the keywords if, then, else, as well
as the lexemes denoted by relop, id, and num. To simplify matters, we assume keywords are reserved;
that is, they cannot be used as identifiers.
●​ We assume lexemes are separated by white space, consisting of nonnull sequences of blanks, tabs,
and newlines.
●​ Our lexical analyzer will strip out white space. It will do so by comparing a string against the
regular definition ws, below.
delim → blank | tab | newline
ws → delim+
●​ If a match for ws is found, the lexical analyzer does not return a token to the parser. Rather, it proceeds
to find a token following the white space and returns that to the parser.
●​ Tokens obtained during lexical analysis are recognized by Finite Automata.
●​ Finite Automata (FA) is a simple idealized machine that can be used to recognize patterns within input
taken from a character set or alphabet (denoted as C). The primary task of an FA is to accept or reject an
input based on whether the defined pattern occurs within the input.
●​ There are two notations for representing Finite Automata. They are:
1.​ Transition Table
2.​ Transition Diagram
Transition Table
●​ It is a tabular representation that lists all possible transitions for each state and input symbol
combination.
●​ Our goal is to construct a lexical analyzer that will isolate the lexeme for the next token in the input
buffer and produce as output a pair consisting of the appropriate token and attribute-value, using the
translation table.

Transition Diagram
●​ It is a directed labeled graph consisting of nodes and edges. Nodes represent states, while edges
represent state transitions.
●​ Edges are directed from one state of the transition diagram to another.
●​ One state is labelled the Start State. It is the initial state of transition diagram where control
resides when we begin to recognize a token.
● Positions in transition diagrams are drawn as circles and are called states.
● The states are connected by arrows called edges. Labels on the edges indicate the input characters.
● One or more final or accepting states, in which a token has been found, are represented by double circles.
●​ A transition diagram that recognizes the lexemes matching the token relop. We begin in state 0,
the start state. If we see < as the first input symbol, then among the lexemes that match the pattern
for relop we can only be looking at <, <>, or <=.
●​ We therefore go to state 1, and look at the next character. If it is =, then we recognize lexeme <=,
enter state 2, and return the token relop with attribute LE, the symbolic constant representing this
particular comparison operator.
●​ If in state 1 the next character is >, then instead we have lexeme <>, and enter state 3 to return an
indication that the not-equals operator has been found.
●​ On any other character, the lexeme is <, and we enter state 4 to return that information. Note,
however, that state 4 has a * to indicate that we must retract the input one position.

●​ On the other hand, if in state 0 the first character we see is =, then this one character must be the
lexeme. We immediately return that fact from state 5.
●​ The remaining possibility is that the first character is >. Then, we must enter state 6 and decide, on
the basis of the next character, whether the lexeme is >= (if we next see the = sign), or just > (on any
other character).
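The relop transition diagram can be coded directly as a case analysis on the current state. The sketch below in C is one possible rendering: the state numbers follow the diagram, while the Relop names and the get_relop function are assumed, illustrative names. The retract required by the starred states is done by moving the forward pointer back one position:

#include <stdio.h>

/* Attribute values returned with the relop token (names are illustrative). */
typedef enum { LT, LE, NE, EQ, GT, GE, NONE } Relop;

/* Simulate the relop transition diagram. *pos is the forward pointer. */
static Relop get_relop(const char *s, int *pos) {
    int state = 0;
    while (1) {
        char c = s[*pos];
        switch (state) {
        case 0:
            (*pos)++;
            if (c == '<') state = 1;
            else if (c == '=') return EQ;       /* state 5 */
            else if (c == '>') state = 6;
            else return NONE;
            break;
        case 1:
            (*pos)++;
            if (c == '=') return LE;            /* state 2 */
            if (c == '>') return NE;            /* state 3 */
            (*pos)--;                           /* state 4: retract */
            return LT;
        case 6:
            (*pos)++;
            if (c == '=') return GE;            /* state 7 */
            (*pos)--;                           /* state 8: retract */
            return GT;
        }
    }
}

int main(void) {
    const char *tests[] = { "<=", "<>", "<a", "=", ">=", ">b" };
    const char *names[] = { "LT", "LE", "NE", "EQ", "GT", "GE", "NONE" };
    for (int i = 0; i < 6; i++) {
        int pos = 0;
        printf("%-3s -> %s\n", tests[i], names[get_relop(tests[i], &pos)]);
    }
    return 0;
}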
6. FINITE AUTOMATA
●​ Finite automata are used to recognize patterns.
●​ It takes the string of symbol as input and changes its state accordingly. When the desired symbol
is found, then the transition occurs.
●​ At the time of transition, the automata can either move to the next state or stay in the same state.
● When the input string has been processed, a finite automaton reaches one of two outcomes: it either accepts or rejects the string.
●​ Finite means a finite number of states and Automata means the Automatic Machine which works
without any interference of human beings.
●​ Finite automata consist of a set of finite states and a set of transitions from state to state that
appears on input symbols chosen from an alphabet ∑.
Formal Definition of FA
A finite automaton is a 5-tuple (Q, ∑, δ, q0, F), where:
1. Q: finite set of states
2. ∑: finite set of input symbols
3. q0: initial state
4. F: set of final states
5. δ: transition function
●​ States: States of FA are represented by circles. State names are written inside circles.
●​ Start state: The state from where the automata start is known as the start state. Start state has an
arrow pointed towards it.
●​ Intermediate states: All intermediate states have at least two arrows; one pointing to and another
pointing out from them.
●​ Final state: If the input string is successfully parsed, the automata are expected to be in this state.
Final state is represented by double circles.
●​ Transition: The transition from one state to another state happens when a desired symbol in the
input is found. Upon transition, automata can either move to the next state or stay in the same state.
Movement from one state to another is shown as a directed arrow, where the arrows points to the
destination state. If an automaton stays on the same state, an arrow pointing from a state to itself is
drawn.
Types of Automata
There are two types of finite automata:
1.​ DFA(deterministic finite automata)
2.​ NFA(non-deterministic finite automata)
DFA (DETERMINISTIC FINITE AUTOMATA)
● DFA refers to deterministic finite automata. Deterministic refers to the uniqueness of the computation: the machine reads an input string one symbol at a time.
●​ In DFA, there is only one path for specific input from the current state to the next state.
●​ DFA does not accept the null move, i.e., the DFA cannot change state without any input character.
●​ DFA can contain multiple final states.

●​ We can see that from state q0 for input a, there is only one path which is going to q1. Similarly,
from q0, there is only one path for input b going to q2.
Formal Definition of DFA
●​ A DFA is a collection of 5-tuples same as we described in the definition of FA.
1.​ Q: finite set of states
2.​ ∑: finite set of the input symbol
3.​ q0: initial state
4.​ F: final state
5.​ δ: Transition function
Transition function can be defined as:
δ: Q x ∑→Q

Graphical Representation of DFA


A DFA is composed of:
1.​ States: Represented by circles.
2.​ Transitions: Arrows between states indicating how the automaton moves from one state to another
based on input symbols.
3.​ Start State: The initial state often depicted with an incoming arrow pointing to it.
4.​ Accept States: States that indicate acceptance of the input string, usually represented by double circles.
5.​ Deterministic Transitions: Single transitions for a single input symbol from a given state.
Example 1:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition Diagram:
Transition Table:

Present State   Next State for Input 0   Next State for Input 1
→q0             q0                       q1
q1              q2                       q1
*q2             q2                       q2

Example 2:
DFA with ∑ = {0, 1} accepts all starting with 0.
Solution:

● We can see that on input 0 in state q0, the DFA changes state to q1, the final state, and it always reaches the final state q1 on a starting input of 0.
● It can accept 00, 01, 000, 001, etc. It cannot accept any string which starts with 1, because it will never reach the final state on a string starting with 1.
Example 3:
DFA with ∑ = {0, 1} accepts all ending with 0.
Solution:

Explanation:
● We can see that on input 0 in state q0, the DFA changes state to q1. It can accept any string which ends with 0, like 00, 10, 110, 100, etc. It cannot accept any string which ends with 1, because it will never be in the final state q1 after reading a 1, so a string ending with 1 will be rejected.
Dead state in DFA
●​ If the machine has successfully progressed to the final string accepting state, we can say that the
string has been accepted by the DFA.
●​ However, if we arrive at a point where the machine can no longer progress to its final state, we
have arrived at a dead state. A Dummy state is another name for a dead state.
●​ The machine will begin, and if we want to read strings that begin with 0, the machine will reach its final
state B, which will allow it to accept a string.
●​ However, if we start the machine with 1, it will not be able to advance to the final state. It will reach
another intermediate state which is C. We are now in a dead state after reading 1. And the string will not
be accepted by DFA.
The transition table for the above DFA is as follows:

Present State   Next State for Input 0   Next State for Input 1
→A              B                        C
*B              B                        B
C               No transition            No transition
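A short C sketch of this DFA, treating C as a dead state that absorbs all further input; the state names and the accepts function are illustrative:

#include <stdio.h>

/* States A (start), B (accepting) and C (dead). A string is accepted
   exactly when it starts with 0. */
enum { A, B, C };

static int step(int state, char symbol) {
    if (state == A) return symbol == '0' ? B : C;
    if (state == B) return B;       /* stays in B on 0 and 1 */
    return C;                       /* C is the dead state   */
}

static int accepts(const char *w) {
    int state = A;
    for (int i = 0; w[i] != '\0'; i++)
        state = step(state, w[i]);
    return state == B;
}

int main(void) {
    const char *tests[] = { "01", "000", "10", "1" };
    for (int i = 0; i < 4; i++)
        printf("%-4s %s\n", tests[i], accepts(tests[i]) ? "accepted" : "rejected");
    return 0;
}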
NFA (NON-DETERMINISTIC FINITE AUTOMATA)
●​ NFA stands for non-deterministic finite automata.
●​ The finite automata are called NFA when there exist many paths for specific input from the current state
to the next state.
●​ ε-transition: state transition can be made without reading a symbol; and
●​ Non determinism: state transition can have zero or more than one possible value.

●​ We can see that from state q0 for input a, there are two next states q1 and q2, similarly, from q0 for input
b, the next states are q0 and q1.
●​ Thus it is not fixed or determined that with a particular input where to go next. Hence this FA is called
non-deterministic finite automata.
Formal definition of NFA:
An NFA is also defined by a 5-tuple, the same as a DFA, but with a different transition function:
δ: Q × ∑ → 2^Q
where,
Q: finite set of states
∑: finite set of the input symbol
q0: initial state
F: final state
δ: Transition function
Graphical Representation of an NFA
An NFA can be represented by digraphs called state diagram. In which:
1.​ The state is represented by vertices.
2.​ The arc labeled with an input character show the transitions.
3.​ The initial state is marked with an arrow.
4.​ The final state is denoted by the double circle.
Example 1:
Q = {q0, q1, q2}
∑ = {0, 1}
q0 = {q0}
F = {q2}
Solution:
Transition diagram:

Transition Table:

Present State   Next State for Input 0   Next State for Input 1
→q0             q0, q1                   q1
q1              q2                       q0
*q2             q2                       q1, q2

●​ In the above diagram, we can see that when the current state is q0, on input 0, the next state will
be q0 or q1, and on 1 input the next state will be q1.
●​ When the current state is q1, on input 0 the next state will be q2 and on 1 input, the next state will
be q0. When the current state is q2, on 0 input the next state is q2, and on 1 input the next state will be
q1 or q2.
Example 2:
NFA with ∑ = {0, 1} accepts all strings with 01.
Solution:

Transition Table:

Present State   Next State for Input 0   Next State for Input 1
→q0             q1                       ε
q1              ε                        q2
*q2             q2                       q2

Comparison of DFA and NFA:

Determinism
  DFA: Every state has exactly one transition per input symbol.
  NFA: States can have zero, one, or multiple transitions for a given input symbol.
Transition Function
  DFA: δ: Q × Σ → Q
  NFA: δ: Q × Σ → P(Q), where P(Q) is the power set of Q
Epsilon (ε) Transitions
  DFA: Not allowed.
  NFA: Allowed.
State Complexity
  DFA: Generally requires more states to represent certain languages.
  NFA: Can often represent the same language with fewer states.
Implementation
  DFA: Easier to implement due to determinism.
  NFA: More complex to implement due to non-determinism.
Simulation
  DFA: Can be simulated by an NFA without modification.
  NFA: Requires conversion to a DFA for simulation on deterministic machines.
Acceptance
  DFA: Accepts input if there is a unique sequence of transitions leading to an accept state.
  NFA: Accepts input if there exists at least one sequence of transitions leading to an accept state.
Backtracking
  DFA: Does not require backtracking.
  NFA: Can use backtracking to explore multiple transitions.
Practical Use
  DFA: Widely used in lexical analysis, text parsing, etc.
  NFA: Used in more theoretical contexts and where non-determinism provides a clearer solution.
7. REGULAR EXPRESSION TO NFA
● Regular expressions and finite automata are equally powerful: for every regular expression there is an equivalent finite automaton, and for every finite automaton there is an equivalent regular expression.
● A regular expression is a representation of a token. But to recognize a token we need a token recognizer, which is a finite automaton (NFA). So we convert the regular expression into an NFA.

Regular Expression
A Regular Expression can be recursively defined as follows
● ε is a Regular Expression that denotes the language L(ε) = {ε}, containing only the empty string.
● φ is a Regular Expression denoting the empty language, L(φ) = { }.
●​x is a Regular Expression where L = {x}
●​If X is a Regular Expression denoting the language L(X) and Y is a Regular Expression denoting the
language L(Y), then
●​X + Y is a Regular Expression corresponding to the language L(X) ∪ L(Y) where L(X+Y) = L(X) ∪
L(Y).
●​ X.Y is a Regular Expression corresponding to the language L(X) . L(Y) where L(X.Y) = L(X) . L(Y)
●​R* is a Regular Expression corresponding to the language L(R*)where L(R*) = (L(R))*
Construction of an FA from an RE
●​We can use Thompson's Construction to find out a Finite Automaton from a Regular Expression.
●​To convert a regular expression to a nondeterministic finite automaton, we can follow an algorithm
given first by McNaughton and Yamada, and then by Ken Thompson.
●​ First, we define automata corresponding to the base cases of REs:

●​ Now, suppose that we have already constructed NFAs for the regular expressions A and B,
indicated below by rectangles.
●​ Both A and B have a single start state (on the left) and accepting state (on the right). If we write
the concatenation of A and B as AB, then the corresponding NFA is simply A and B connected by an Ɛ
transition.
●​ The start state of A becomes the start state of the combination, and the accepting state of B
becomes the accepting state of the combination
●​ In a similar fashion, the alternation of A and B written as A|B can be expressed as two automata
joined by common starting and accepting nodes, all connected by Ɛ transitions

●​ Finally, the Kleene closure A* is constructed by taking the automaton for A, adding starting and
accepting nodes, then adding Ɛ transitions to allow zero or more repetitions

●​ Let’s consider the process for an example regular expression a(cat|cow)*.


●​ First, we start with the innermost expression cat and assemble it into three transitions resulting in
an accepting state. Then, do the same thing for cow, yielding these two FAs:

●​ The alternation of the two expressions cat|cow is accomplished by adding a new starting and
accepting node, with epsilon transitions.
●​ Then, the Kleene closure (cat|cow)* is accomplished by adding another starting and accepting
state around the previous FA, with epsilon transitions between

● Finally, the concatenation a(cat|cow)* is achieved by adding a single state at the beginning for a.

Another standard example handled the same way is the regular expression (a|b)*abb.
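The fragment-combining steps above can be sketched in C. In the sketch below, each fragment carries one start and one accepting state, and the combinators concat, alternate, and star add the ε edges described above; all structure and function names are assumptions made for illustration. The main function builds an NFA for a(b|c)*, analogous to the a(cat|cow)* example but with single-symbol alternatives:

#include <stdio.h>
#include <stdlib.h>

#define EPS 0   /* label value used for an epsilon transition */

typedef struct State {
    int id;
    char label[2];            /* label of each outgoing edge */
    struct State *out[2];
    int nedges;
} State;

typedef struct { State *start, *accept; } Frag;

static int next_id = 0;

static State *new_state(void) {
    State *s = calloc(1, sizeof(State));
    s->id = next_id++;
    return s;
}

static void add_edge(State *from, char label, State *to) {
    from->label[from->nedges] = label;
    from->out[from->nedges] = to;
    from->nedges++;
}

/* Base case: an NFA for a single symbol a. */
static Frag symbol(char a) {
    Frag f = { new_state(), new_state() };
    add_edge(f.start, a, f.accept);
    return f;
}

/* Concatenation AB: accepting state of A joined to start of B by epsilon. */
static Frag concat(Frag a, Frag b) {
    add_edge(a.accept, EPS, b.start);
    return (Frag){ a.start, b.accept };
}

/* Alternation A|B: new start and accept states with epsilon edges around A and B. */
static Frag alternate(Frag a, Frag b) {
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, a.start);   add_edge(f.start, EPS, b.start);
    add_edge(a.accept, EPS, f.accept); add_edge(b.accept, EPS, f.accept);
    return f;
}

/* Kleene closure A*: epsilon edges allow zero or more repetitions of A. */
static Frag star(Frag a) {
    Frag f = { new_state(), new_state() };
    add_edge(f.start, EPS, a.start);   add_edge(f.start, EPS, f.accept);
    add_edge(a.accept, EPS, a.start);  add_edge(a.accept, EPS, f.accept);
    return f;
}

int main(void) {
    /* NFA for a(b|c)* built from the pieces above. */
    Frag nfa = concat(symbol('a'), star(alternate(symbol('b'), symbol('c'))));
    printf("start state %d, accepting state %d, %d states total\n",
           nfa.start->id, nfa.accept->id, next_id);
    return 0;
}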

8. CONVERSION OF NFA TO DFA


●​ We can convert any NFA into an equivalent DFA using the technique of subset construction. The basic
idea is to create a DFA such that each state in the DFA corresponds to multiple states in the NFA,
according to the “many-worlds” interpretation.
●​ An NFA can have zero, one or more than one move from a given state on a given input symbol. An NFA
can also have NULL moves (moves without input symbol). On the other hand, DFA has one and only
one move from a given state on a given input symbol.
Steps for converting NFA to DFA
Step 1: Convert the given NFA to its equivalent transition table
●​To convert the NFA to its equivalent transition table, we need to list all the states, input symbols, and the
transition rules. The transition rules are represented in the form of a matrix, where the rows represent the
current state, the columns represent the input symbol, and the cells represent the next state.
Step 2: Create the DFA’s start state
● The DFA's start state is the set of all possible starting states in the NFA. This set is called the "epsilon closure" of the NFA's start state. The epsilon closure is the set of all states that can be reached from the start state by following epsilon (ε) transitions.
Step 3: Create the DFA’s transition table
●​ The DFA’s transition table is similar to the NFA’s transition table, but instead of individual states,
the rows and columns represent sets of states. For each input symbol, the corresponding cell in the
transition table contains the epsilon closure of the set of states obtained by following the transition rules
in the NFA’s transition table.
Step 4: Create the DFA’s final states
●​ The DFA’s final states are the sets of states that contain at least one final state from the NFA.
Step 5: Simplify the DFA
The DFA obtained in the previous steps may contain unnecessary states and transitions.
To simplify the DFA, we can use the following techniques:
●​ Remove unreachable states: States that cannot be reached from the start state can be removed
from the DFA.
●​ Remove dead states: States that cannot lead to a final state can be removed from the DFA.
●​ Merge equivalent states: States that have the same transition rules for all input symbols can be
merged into a single state.
Step 6: Repeat steps 3-5 until no further simplification is possible
●​ After simplifying the DFA, we repeat steps 3-5 until no further simplification is possible. The
final DFA obtained is the minimized DFA equivalent to the given NFA.
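A compact C sketch of steps 1 to 4, using bit masks to represent sets of NFA states. The NFA used here is an assumed example over the alphabet {a, b} with moves q0 --a--> {q0, q1}, q0 --b--> {q0}, q1 --b--> {q2}, no ε-moves, and q2 as the final state; all names and the table layout are illustrative:

#include <stdio.h>

#define NSTATES 3
#define NSYMS   2                      /* symbol 0 = 'a', symbol 1 = 'b' */

/* NFA transition table: bit i of a mask means NFA state qi is in the set. */
static const int nfa[NSTATES][NSYMS] = {
    { 0x3, 0x1 },                      /* q0: a -> {q0,q1}, b -> {q0} */
    { 0x0, 0x4 },                      /* q1: a -> {},      b -> {q2} */
    { 0x0, 0x0 }                       /* q2: no moves                */
};

/* move(T, x): union of NFA transitions on symbol x from every state in T. */
static int move(int set, int sym) {
    int result = 0;
    for (int s = 0; s < NSTATES; s++)
        if (set & (1 << s))
            result |= nfa[s][sym];
    return result;
}

static void print_set(int set) {
    printf("{");
    for (int s = 0; s < NSTATES; s++)
        if (set & (1 << s)) printf(" q%d", s);
    printf(" }");
}

int main(void) {
    int dstates[1 << NSTATES];         /* DFA states discovered so far */
    int ndstates = 1;
    dstates[0] = 0x1;                  /* start state = {q0}           */

    for (int i = 0; i < ndstates; i++) {
        for (int sym = 0; sym < NSYMS; sym++) {
            int next = move(dstates[i], sym);
            int found = 0;
            for (int j = 0; j < ndstates; j++)
                if (dstates[j] == next) found = 1;
            if (!found && next != 0)
                dstates[ndstates++] = next;   /* record a new DFA state */
            print_set(dstates[i]); printf(" --%c--> ", "ab"[sym]);
            print_set(next); printf("\n");
        }
    }
    printf("%d DFA states; a DFA state is accepting if it contains q2\n", ndstates);
    return 0;
}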

● Following are the various parameters for the NFA: states = { q0, q1, q2 }, input alphabet = { a, b }, F = { q2 }.
●​For each state in Q’, find the states for each input symbol. Currently, state in Q’ is q0, find moves from
q0 on input symbol a and b using transition function of NFA and update the transition table of DFA.

●​ Now { q0, q1 } will be considered as a single state.

●​Now { q0, q2 } will be considered as a single state.

● As there is no new state generated, we are done with the conversion. The final states of the DFA will be those states which have q2 as a component, i.e., { q0, q2 }. Following are the various parameters for the DFA: states = { q0, { q0, q1 }, { q0, q2 } }, input alphabet = { a, b }, F = { { q0, q2 } }.
UNIT-III
SYNTAX ANALYSIS
●​ Syntax analysis, often known as parsing, is an important step in the compilation process.
● Syntax analysis ensures that these tokens are arranged according to the programming language's grammar.
●​ This process helps in detecting and reporting errors, ensuring the source code adheres to the rules
before further processing in the compiler.
3.1. ROLE OF THE PARSER
●​ Syntax analysis (parsing) is the second phase of the compilation process, following lexical
analysis. Its primary goal is to verify the syntactical correctness of the source code.
●​ It takes the tokens generated by the lexical analyzer and attempts to build a Parse
Tree or Abstract Syntax Tree (AST), representing the program’s structure.
●​ During this phase, the syntax analyzer checks whether the input string adheres to the grammatical
rules of the language using context-free grammar.
●​ If the syntax is correct, the analyzer moves forward; otherwise, it reports an error.
●​ The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of the
source code, which is a hierarchical representation of the source code that reflects the grammatical
structure of the program.
● In the syntactic checking step, the compiler checks whether the tokens grouped by the lexical analyzer conform to the syntactic rules of the language. This checking is done by the parser.
● The parser receives a string of tokens from the lexical analyzer and checks that the string can be generated by the grammar of the source language. It detects and reports any syntax errors and generates a parse tree from which intermediate code can be generated.
1.1. Types of Parsing
The parsing is divided into two types, which are as follows:
1.​ Top-down Parsing
2.​ Bottom-up Parsing

Top-Down Parsing
●​ Top-down parsing attempts to build the parse tree from the root node to the leaf node. The
top-down parser will start from the start symbol and proceed to the string. It follows the leftmost
derivation.
Leftmost derivation:
●​It is a process of exploring the production rules from left to right and selecting the leftmost non-terminal
in the current string as the next symbol to expand.
●​This approach ensures that the parser always chooses the leftmost derivation and tries to match the input
string. If a match cannot be found, the parser backtracks and tries another production rule.
●​This process continues until the parser reaches the end of the input string or fails to find a valid parse
tree.
● Consider the lexical analyzer's input string 'acb' for the following grammar, using leftmost derivation.
S->aAb
A->cd|c
1.​ Recursive-descent parsers:Recursive-descent parsers are a type of top-down parser that uses a set of
recursive procedures to parse the input. Each non-terminal symbol in the grammar corresponds to a
procedure that parses input for that symbol.
2.​ Backtracking parsers: Backtracking parsers are a type of top-down parser that can handle
non-deterministic grammar. When a parsing decision leads to a dead end, the parser can backtrack and
try another alternative. Backtracking parsers are not as efficient as other top-down parsers because they
can potentially explore many parsing paths.
3.​ Non-backtracking parsers: Non-backtracking is a technique used in top-down parsing to ensure that
the parser doesn’t revisit already-explored paths in the parse tree during the parsing process. This is
achieved by using a predictive parsing table that is constructed in advance and selecting the appropriate
production rule based on the top non-terminal symbol on the parser’s stack and the next input symbol.
By not backtracking, predictive parsers are more efficient than other types of top-down parsers, although
they may not be able to handle all grammar.
4. Predictive parsers: Predictive parsers are top-down parsers that use a parsing table to predict which production rule to apply based on the next input symbol. Predictive parsers are also called LL parsers because they construct a left-to-right, leftmost derivation of the input string.
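To illustrate recursive-descent parsing with backtracking (the first two kinds above), here is a small C sketch for the example grammar S → aAb, A → cd | c given earlier in this section. The function names and the global position variable are illustrative choices:

#include <stdio.h>

/* Each non-terminal becomes a function; pos is the current input index. */
static const char *input;
static int pos;

static int match(char c) {
    if (input[pos] == c) { pos++; return 1; }
    return 0;
}

static int A(void) {
    int save = pos;
    if (match('c') && match('d')) return 1;   /* try A -> c d */
    pos = save;                               /* backtrack    */
    return match('c');                        /* try A -> c   */
}

static int S(void) {
    return match('a') && A() && match('b');
}

static int parse(const char *s) {
    input = s;
    pos = 0;
    return S() && input[pos] == '\0';         /* all input consumed */
}

int main(void) {
    printf("acb  : %s\n", parse("acb")  ? "accepted" : "rejected");
    printf("acdb : %s\n", parse("acdb") ? "accepted" : "rejected");
    printf("ab   : %s\n", parse("ab")   ? "accepted" : "rejected");
    return 0;
}

For the input acb, the parser first tries A → cd, fails on 'd', backtracks to the saved position, and then succeeds with A → c, exactly as a backtracking top-down parser is described above.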
Bottom-Up Parsing
●​ Bottom-up parsing builds the parse tree from the leaf node to the root node. The bottom-up
parsing will reduce the input string to the start symbol. It traces the rightmost derivation of the string in
reverse.
LR Parser
●​ The LR parser is a non-recursive, shift-reduce, bottom-up parser. It uses a wide class of
context-free grammar which makes it the most efficient syntax analysis technique.
●​ LR parsers are also known as LR(k) parsers, where L stands for left-to-right scanning of the input
stream; R stands for the construction of right-most derivation in reverse, and k denotes the number of
lookahead symbols to make decisions.
Operator precedence parser
●​ An operator precedence parser is a bottom-up parser that interprets an operator grammar. This
parser is only used for operator grammars. Ambiguous grammars are not allowed in any parser except
operator precedence parser. There are two methods for determining what precedence relations should
hold between a pair of terminals:
1.​ Use the conventional associativity and precedence of operator.
2.​ The second method of selecting operator-precedence relations is first to construct an unambiguous
grammar for the language, a grammar that reflects the correct associativity and precedence in its parse
trees.
1.2. Syntax Error Handling
●	If a compiler had to process only correct programs, its design and implementation would be
greatly simplified. But programmers frequently write incorrect programs, and a good compiler should
assist the programmer in identifying and locating errors.
●​ We know that programs can contain errors at many different levels.
●​ For example, errors can be
1.​ Lexical, such as misspelling an identifier, keyword, or operator
2.​ Syntactic, such as an arithmetic expression with unbalanced parentheses
3.​ Semantic, such as an operator applied to an incompatible operand
4.​ Logical, such as an infinitely recursive call
●	The error handler in a parser has simple-to-state goals:
1.	It should report the presence of errors clearly and accurately.
2.	It should recover from each error quickly enough to be able to detect subsequent errors.
3.	It should not significantly slow down the processing of correct programs.
●​ Error-Recovery Strategies
●​ There are many different general strategies that a parser can employ to recover from a syntactic
error.
●​ Panic Mode
●​ Phrase Level
●​ Error Productions
●​ Global Correction.
3.2. CONTEXT-FREE GRAMMARS
●​ Context-free grammars (CFGs) are used to describe context-free languages. A context-free
grammar is a set of recursive rules used to generate patterns of strings. A context-free grammar can
describe all regular languages and more, but they cannot describe all possible languages.

Context-free grammar G can be defined by four tuples as:


G = (V, T, P, S)
●​ G is the grammar, which consists of a set of the production rule. It is used to generate the string of
a language.
●	V is the finite set of non-terminal symbols. It is denoted by capital letters. A set of non terminal
symbols (or variables) which are placeholders for patterns of terminal symbols that can be generated by
the non terminal symbols. These are the symbols that will always appear on the left-hand side of the
production rules, though they can be included on the right-hand side.
●	T is the finite set of terminal symbols. It is denoted by lower case letters. A set of terminal
symbols which are the characters that appear in the language/strings generated by the grammar.
Terminal symbols never appear on the left-hand side of the production rule and are always on the
right-hand side.
●​ P is a set of production rules, which is used for replacing non-terminals symbols (on the left side
of the production) in a string with other terminal or non-terminal symbols (on the right side of the
production).
●​ S is the start symbol which is used to derive the string. We can derive the string by repeatedly
replacing a non-terminal by the right-hand side of the production until all non-terminal have been
replaced by terminal symbols.
●	For example, consider a grammar with symbols { S, a, b } and the following productions:
●​ Here S is the starting symbol.
●​ {a, b} are the terminals generally represented by small characters.
●​ S is the variable.
S-> aS​
S-> bSa
a->bSa, or
a->ba is not a valid CFG production, because here the left-hand side is a terminal symbol, and the left-hand side of every CFG production must be a single non-terminal.
●	Let us consider the string “aba” and try to derive it from the productions given. We
start with the symbol S, apply the production rule S->bSa and then S->aS (S->a) to get the string “aba”.

Parse tree of string “aba”


●​ In the computer science field, context-free grammars are frequently used, especially in the areas of
formal language theory, compiler development, and natural language processing. It is also used for
explaining the syntax of programming languages and other formal languages.
●​ The grammar with the following productions defines simple arithmetic expressions.
●	In this grammar, the terminal symbols are id + - * / ↑ ( )
●​ The nonterminal symbols are expr and op, and expr is the start symbol.

1.These symbols are terminals:


i)​ Lower-case letters early in the alphabet such as a, b, c.
ii)​ Operator symbols such as +, -, etc.
iii)​ Punctuation symbols such as parentheses, comma, etc.
iv)​ The digits 0, 1, . . . , 9.
v)​ Boldface strings such as id or if.
2. These symbols are nonterminals:
i)	Upper-case letters early in the alphabet such as A, B, C.
ii)	The letter S, which, when it appears, is usually the start symbol.
iii) Lower-case italic names such as expr or stmt.
3. Upper-case letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either
nonterminals or terminals.
4. Lower-case letters late in the alphabet, chiefly u, v, . . . , z, represent strings of terminals.
5. Lower-case Greek letters, α, β, γ, for example, represent strings of grammar symbols. Thus, a generic
production could be written as A → α, indicating that there is a single nonterminal A on the left of the
arrow (the left side of the production) and a string of grammar symbols α to the right of the arrow (the
right side of the production).
●​ Initially, we have a string that only contains the start symbol.
●	We expand the start symbol using one of its productions (i.e., using a production whose left side (head)
is the start symbol) – that is, we replace the start symbol with the string that appears on the right side of a
production rule belonging to the start symbol.
●​ If the resulting string contains at least one variable, we further expand the resulting string by replacing
one of its variables with the right side (body) of one of its productions.We can continue these
replacements until we derive a string consisting entirely of terminals.
●	The language of the grammar is the set of all strings of terminals that can be obtained in this way.
●​ Replacement of a variable (in a string) with the right side of one of its productions is called as
derivation.
●​ Let a CFG {N,T,P,S} be
●​ N = {S}, T = {a, b}, Starting symbol = S, P = S → SS | aSb | ε
●​ One derivation from the above CFG is “abaabb”
●​ S → SS → aSbS → abS → abaSb → abaaSbb → abaabb

Sentential Form and Partial Derivation Tree


●​ A partial derivation tree is a sub-tree of a derivation tree/parse tree such that either all of its children are
in the sub-tree or none of them are in the sub-tree.
●​ Example
●​ If in any CFG the productions are −
●​ S → AB, A → aaA | ε, B → Bb| ε
●​ The partial derivation tree can be the following

●	If a partial derivation tree contains the root S, its yield is called a sentential form. The yield of the above
sub-tree is also a sentential form.
Leftmost and Rightmost Derivation of a String
Leftmost derivation − A leftmost derivation is obtained by applying production to the leftmost variable
in each step.
Rightmost derivation − A rightmost derivation is obtained by applying production to the rightmost
variable in each step.
Example

Let any set of production rules in a CFG be

X → X+X | X*X |X| a

over an alphabet {a}.

The leftmost derivation for the string "a+a*a" may be −

X → X+X → a+X → a + X*X → a+a*X → a+a*a


The stepwise derivation of the above string is shown as below
Left and Right Recursive Grammars
●​ In a context-free grammar G, if there is a production in the form X → Xa where X is a non-terminal
and ‘a’ is a string of terminals, it is called a left recursive production. The grammar having a left
recursive production is called a left recursive grammar.
●​ And if in a context-free grammar G, if there is a production is in the form X → aX where X is a
non-terminal and ‘a’ is a string of terminals, it is called a right recursive production. The grammar
having a right recursive production is called a right recursive grammar.
3.3.TOP DOWN PARSING
The top-down parsing technique parses the input, and starts constructing a parse tree from the root node
gradually moving down to the leaf nodes.

Recursive Descent Parsing


●​ Recursive descent is a top-down parsing technique that constructs the parse tree from the top and
the input is read from left to right. It uses procedures for every terminal and non-terminal entity.
●​ This parsing technique recursively parses the input to make a parse tree, which may or may not
require back-tracking. But the grammar associated with it (if not left factored) cannot avoid
back-tracking.
●​ A form of recursive-descent parsing that does not require any back-tracking is known
as predictive parsing.
●	This parsing technique is regarded as recursive because it uses a context-free grammar, which is recursive in
nature.
Back-tracking
●​ Top- down parsers start from the root node (start symbol) and match the input string against the
production rules to replace them (if matched). To understand this, take the following example of CFG:
S → rXd | rZd
X → oa | ea
Z → ai
For an input string ‘read’, a top-down parser will behave like this:
●	It will start with S from the production rules and will match its yield to the left-most letter of the
input, i.e. ‘r’. The first production of S (S → rXd) matches it, so the top-down parser advances to
the next input letter (i.e. ‘e’).
●​ The parser tries to expand non-terminal ‘X’ and checks its production from the left (X → oa). It
does not match with the next input symbol. So the top-down parser backtracks to obtain the next
production rule of X, (X → ea).
●​ Now the parser matches all the input letters in an ordered manner. The string is accepted.
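The following is a minimal sketch of how such a backtracking top-down parser could be written for the grammar S → rXd | rZd, X → oa | ea, Z → ai used above; the function names and the global position variable are illustrative choices, not part of any standard library.
#include <iostream>
#include <string>

// Each procedure returns true and advances 'pos' on success,
// or restores 'pos' and returns false (backtracking).
std::string input;
size_t pos = 0;

bool match(char c) {
    if (pos < input.size() && input[pos] == c) { ++pos; return true; }
    return false;
}

bool X() {
    size_t save = pos;
    if (match('o') && match('a')) return true;      // X -> oa
    pos = save;
    if (match('e') && match('a')) return true;      // X -> ea
    pos = save;
    return false;
}

bool Z() {
    size_t save = pos;
    if (match('a') && match('i')) return true;      // Z -> ai
    pos = save;
    return false;
}

bool S() {
    size_t save = pos;
    if (match('r') && X() && match('d')) return true;   // S -> rXd
    pos = save;
    if (match('r') && Z() && match('d')) return true;   // S -> rZd
    pos = save;
    return false;
}

int main() {
    input = "read";
    bool ok = S() && pos == input.size();
    std::cout << (ok ? "accepted" : "rejected") << "\n";   // prints "accepted"
}
On the input ‘read’, the first alternative of X (X → oa) fails, the parser backtracks, X → ea succeeds, and the string is accepted, matching the walkthrough above.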

Predictive Parser
●​ Predictive parser is a recursive descent parser, which has the capability to predict which production is to
be used to replace the input string.
●​ The predictive parser does not suffer from backtracking.
●​ To accomplish its tasks, the predictive parser uses a look-ahead pointer, which points to the next input
symbols.
●​ To make the parser back-tracking free, the predictive parser puts some constraints on the grammar and
accepts only a class of grammar known as LL(k) grammar.
●​ Predictive parsing uses a stack and a parsing table to parse the input and generate a parse tree.
●​ Both the stack and the input contains an end symbol $ to denote that the stack is empty and the input is
consumed.
●	The parser refers to the parsing table to take any decision on the input and stack element combination.

●​ In recursive descent parsing, the parser may have more than one production to choose from for a
single instance of input, whereas in predictive parser, each step has at most one production to choose.
●​ There might be instances where there is no production matching the input string, making the
parsing procedure to fail.
Non Recursive Predictive Parser
●​ The table-driven predictive parser has an input buffer, stack, a parsing table and an output stream.
Input buffer:
●​ It consists of strings to be parsed, followed by $ to indicate the end of the input string.
Stack:
●​ It contains a sequence of grammar symbols preceded by $ to indicate the bottom of the stack.
Initially, the stack contains the start symbol on top of $.
Parsing table:
●​ It is a two-dimensional array M [A, a], where ‘A’ is a non-terminal and ‘a’ is a terminal.
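A minimal sketch of the table-driven parsing loop is shown below. The tiny grammar used (S → aAb with only the A → c alternative, since A → cd | c as written is not LL(1)) and all identifiers are illustrative assumptions, not a fixed algorithm taken from any particular library.
#include <iostream>
#include <map>
#include <stack>
#include <string>
#include <vector>

int main() {
    using Symbol = std::string;
    // M[{A, a}] = right-hand side of the production to apply, left to right.
    std::map<std::pair<Symbol, Symbol>, std::vector<Symbol>> M = {
        {{"S", "a"}, {"a", "A", "b"}},     // S -> aAb
        {{"A", "c"}, {"c"}},               // A -> c
    };
    std::vector<Symbol> input = {"a", "c", "b", "$"};   // input ends with $
    std::stack<Symbol> st;
    st.push("$"); st.push("S");            // stack initially holds $ and the start symbol

    size_t ip = 0;
    while (st.top() != "$") {
        Symbol top = st.top();
        if (top == input[ip]) {            // terminal on top matches the lookahead
            st.pop(); ++ip;
        } else if (M.count({top, input[ip]})) {
            st.pop();                      // expand the non-terminal: push the body in reverse
            auto& body = M[{top, input[ip]}];
            for (auto it = body.rbegin(); it != body.rend(); ++it) st.push(*it);
        } else {
            std::cout << "error\n"; return 1;
        }
    }
    std::cout << (input[ip] == "$" ? "accepted" : "error") << "\n";
}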

Construction of Predictive Parsing Table


●	Parsing is an essential part of compiler construction and of interpreters. Among the various
parsing techniques, LL(1) parsing is one of the simplest and most efficient. It uses a predictive, top-down approach.
●	This allows efficient parsing without backtracking. This section explores LL(1) parsing: its structure,
how to build an LL(1) parsing table, and its benefits.
What is LL(1) Parsing?
●​ Here the 1st L represents that the scanning of the Input will be done from the Left to Right manner and
the second L shows that in this parsing technique, we are going to use the Left most Derivation Tree.
●​ And finally, the 1 represents the number of look-ahead, which means how many symbols you will see
when you want to make a decision.

Conditions for an LL(1) Grammar


To construct a working LL(1) parsing table, a grammar must satisfy these conditions:
●​ No Left Recursion: Avoid recursive definitions like A -> A + b.
●​ Unambiguous Grammar: Ensure each string can be derived in only one way.
●​ Left Factoring: Make the grammar deterministic, so the parser can proceed without guessing.
Algorithm to Construct LL(1) Parsing Table
Step 1: First check all the essential conditions mentioned above and go to step 2.
Step 2: Calculate First() and Follow() for all non-terminals.
1.	First(X): the set of terminal symbols that can appear at the beginning of the strings derived from the
variable X.
2.	Follow(A): the set of terminal symbols that can appear immediately after the variable A in some derivation.
Step 3: For each production A –> α. (A tends to alpha)
1.​ Find First(α) and for each terminal in First(α), make entry A –> α in the table.
2.​ If First(α) contains ε (epsilon) as terminal, then find the Follow(A) and for each terminal in
Follow(A), make entry A –> ε in the table.
3.​ If the First(α) contains ε and Follow(A) contains $ as terminal, then make entry A –> ε in the
table for the $.
To construct the parsing table, we use these two functions, First and Follow.
In the table, rows will contain the Non-Terminals and the columns will contain the Terminal Symbols. All
the null (ε) productions of the grammar will go under the Follow elements and the remaining productions
will lie under the elements of their First sets.
Now, let’s understand with an example.
Example 1: Consider the Grammar:
E --> TE'​
E' --> +TE' | ε ​
T --> FT'​
T' --> *FT' | ε​
F --> id | (E)​
*ε denotes epsilon
Step 1: The grammar satisfies all properties in step 1.
Step 2: Calculate first() and follow().
Find their First and Follow sets:

Production          First         Follow
E  -> TE'           { id, ( }     { $, ) }
E' -> +TE' | ε      { +, ε }      { $, ) }
T  -> FT'           { id, ( }     { +, $, ) }
T' -> *FT' | ε      { *, ε }      { +, $, ) }
F  -> id | (E)      { id, ( }     { *, +, $, ) }

Step 3: Make a parser table.


Now, the LL(1) Parsing Table is:
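The table itself is not reproduced here, but it can be reconstructed from the First and Follow sets above; the non-blank entries are:
M[E, id] = M[E, (] = E -> TE'
M[E', +] = E' -> +TE';  M[E', )] = M[E', $] = E' -> ε
M[T, id] = M[T, (] = T -> FT'
M[T', *] = T' -> *FT';  M[T', +] = M[T', )] = M[T', $] = T' -> ε
M[F, id] = F -> id;  M[F, (] = F -> (E)
All remaining entries are error entries.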
Advantages of Construction of LL(1) Parsing Table
●​ Clear Decision-Making: With an LL(1) parsing table, the parser can decide what to do by
looking at just one symbol ahead. This makes it easy to choose the right rule without confusion
or guessing.
●​ Fast Parsing: Since there’s no need to go back and forth or guess the next step, LL(1) parsing is
quick and efficient. This is useful for applications like compilers where speed is important.
●​ Easy to Spot Errors: The table helps identify errors right away. If the current symbol doesn’t
match any rule in the table, the parser knows there’s an error and can handle it immediately.
●​ Simple to Implement: Once the table is set up, the parsing process is straightforward. You just
follow the instructions in the table, making it easier to build and maintain.
●​ Good for Predictive Parsing: LL(1) parsing is often called “predictive parsing” because the table
lets you predict the next steps based on the input. This makes it reliable for parsing programming
languages and structured data.

3.4.BOTTOM UP PARSING
●​ Bottom-up parsing starts from the leaf nodes of a tree and works in upward direction till it reaches the
root node. Here, we start from a sentence and then apply production rules in reverse manner in order to
reach the start symbol.

Shift-Reduce Parsing
Shift-reduce parsing uses two unique steps for bottom-up parsing. These steps are known as shift-step
and reduce-step.
●​Shift step: The shift step refers to the advancement of the input pointer to the next input symbol, which
is called the shifted symbol. This symbol is pushed onto the stack. The shifted symbol is treated as a
single node of the parse tree.
●​Reduce step : When the parser finds a complete grammar rule (RHS) and replaces it to (LHS), it is
known as reduce-step. This occurs when the top of the stack contains a handle. To reduce, a POP
function is performed on the stack which pops off the handle and replaces it with LHS non-terminal
symbol.
Example
Grammar:
1.​ S → S+S
2.​ S → S-S
3.​ S → (S)
4.​ S → a
Input string: a1-(a2+a3)
Parsing table
Stack contents	Input String	Actions
$	a1-(a2+a3)$	Shift a1
$a1	-(a2+a3)$	Reduce by S → a
$S	-(a2+a3)$	Shift -
$S-	(a2+a3)$	Shift (
$S-(	a2+a3)$	Shift a2
$S-(a2	+a3)$	Reduce by S → a
$S-(S	+a3)$	Shift +
$S-(S+	a3)$	Shift a3
$S-(S+a3	)$	Reduce by S → a
$S-(S+S	)$	Reduce by S → S+S
$S-(S	)$	Shift )
$S-(S)	$	Reduce by S → (S)
$S-S	$	Reduce by S → S-S
$S	$	Accept
Operator Precedence Parser
●​ A grammar that is used to define mathematical operators is called an operator
grammar or operator precedence grammar. Such grammars have the restriction that no production has
either an empty right-hand side (null productions) or two adjacent non-terminals in its right-hand
side. Examples – This is an example of operator grammar:
E->E+E/E*E/id

However, the grammar given below is not an operator grammar because two non-terminals are adjacent to
each other:

S->SAS/a

A->bSb/b

We can convert it into an operator grammar, though:

S->SbSbS/SbS/a

A->bSb/b

●​An operator precedence parser is a bottom-up parser that interprets an operator grammar. This parser is
only used for operator grammars.
●​Ambiguous grammars are not allowed in any parser except operator precedence parser. There are two
methods for determining what precedence relations should hold between a pair of terminals:
1.​ Use the conventional associativity and precedence of operator.
2.​ The second method of selecting operator-precedence relations is first to construct an
unambiguous grammar for the language, a grammar that reflects the correct associativity and
precedence in its parse trees.
●	This parser relies on the following three precedence relations: ⋖, ≐, ⋗
a ⋖ b means a “yields precedence to” b.
a ≐ b means a “has the same precedence as” b.
a ⋗ b means a “takes precedence over” b.
●	In order to decrease the size of the table, we use an operator precedence function table. Operator precedence parsers usually
do not store the precedence table with the relations; rather, they are implemented in a special way.
●​Operator precedence parsers use precedence functions that map terminal symbols to integers, and the
precedence relations between the symbols are implemented by numerical comparison.
●​The parsing table can be encoded by two precedence functions f and g that map terminal symbols to
integers. We select f and g such that:
1.​ f(a) < g(b) whenever a yields precedence to b
2.​ f(a) = g(b) whenever a and b have the same precedence
3.​ f(a) > g(b) whenever a takes precedence over b

Consider the following grammar:


E -> E + E/E * E/( E )/id

This is the directed graph representing the precedence function:

Since there is no cycle in the graph, we can make this function table:
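The function table is not reproduced here; a commonly quoted assignment for the operators id, +, * and the end marker $ (leaving the parentheses aside) is f(id)=4, g(id)=5; f(+)=2, g(+)=1; f(*)=4, g(*)=3; f($)=0, g($)=0. For example, since f(+) = 2 < g(*) = 3, the relation + ⋖ * holds, so * is parsed with higher precedence than +.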
LR Parser
●	The LR parser is a non-recursive, shift-reduce, bottom-up parser. It can handle a wide class of
context-free grammars, which makes it one of the most powerful and widely used syntax analysis techniques. LR parsers are also
known as LR(k) parsers, where L stands for left-to-right scanning of the input stream; R stands for the
construction of right-most derivation in reverse, and k denotes the number of lookahead symbols to
make decisions.
Rules for LR parser :
The rules for building the collection of LR items (the closure) are as follows.
1.	The first item from the given (augmented) grammar adds itself to the first closed set.
2.	If an item of the form A → α.Bγ is present in the closure, where the symbol B immediately after the dot is a
non-terminal, add all of B’s production rules as items with the dot preceding their first symbol.
3.	Repeat step 2 for any new items added, until no more items can be added.
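For instance, with a small augmented grammar S' → S, S → (S) | a, the closure of { S' → .S } also contains the items S → .(S) and S → .a, because the dot in S' → .S immediately precedes the non-terminal S.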
LR parser algorithm :
The LR parsing algorithm is the same for every LR parser, but the parsing table is different for each parser. It
consists of the following components.
1.​ Input Buffer –
It contains the given string, and it ends with a $ symbol.​

2.​ Stack –
The combination of state symbol and current input symbol is used to refer to the parsing table in order to
take the parsing decisions.
Parsing Table :
The parsing table is divided into two parts – the Action table and the Go-To table. The action table tells the parser
what to do for the given current state and current terminal in the input stream; the two main actions are shift and
reduce.
1.	Shift Action- In a shift action the present terminal is removed from the input stream and the state n is
pushed onto the stack, and it becomes the new present state.
2.	Reduce Action- For a reduction by rule m, the number m is written to the output stream.
3.	As many grammar symbols (and their states) as appear on the right-hand side of rule m are popped from the stack.
4.	For the non-terminal on the left-hand side of rule m, a new state is looked up in the goto
table and made the new current state by pushing it onto the stack.
Canonical LR
●	The CLR parser stands for canonical LR parser. It is a more powerful LR parser and makes use of lookahead
symbols. This method uses a large set of items called LR(1) items.
●	The main difference between LR(0) and LR(1) items is that, in LR(1) items, it is possible to carry more
information in a state, which will rule out useless reduction states.
●	This extra information is incorporated into the state by the lookahead symbol.
●	The general syntax becomes [A → α.B, a]
where A → α.B is the production and a is a terminal or the right end marker $.
LR(1) item = LR(0) item + lookahead.
CASE 1 –
A → α.BC, a
Suppose this is the 0th production. Since ‘.’ precedes B, we have to write B’s productions as well.
B → .D [1st production]
Suppose this is B’s production. The lookahead of this production is obtained by looking at the previous
production, i.e. the 0th production: whatever follows B, we take its FIRST set, and that becomes the lookahead of
the 1st production. Here, in the 0th production, C comes after B; assuming FIRST(C) = d, the 1st production
becomes
B → .D, d
CASE 2 –
Now if the 0th production was like this,
A → α.B, a
Here, we can see there is nothing after B. So the lookahead of the 0th production becomes the lookahead of the 1st
production, i.e.
B → .D, a
CASE 3 –​
Assume a production A->a|b
A->a,$ [0th production]
A->b,$ [1st production]
Here, the 1st production is a part of the previous production, so the lookahead will be the same as that of
its previous production.
These are the rules for computing the lookaheads.
Steps for constructing CLR parsing table :
1.​ Writing augmented grammar
2.​ LR(1) collection of items to be found
3.​ Defining 2 functions: goto[list of terminals] and action[list of non-terminals] in the CLR parsing
table
SLR Parsing

●	SLR (1) refers to simple LR parsing. It is similar to LR(0) parsing; the only difference is in the
parsing table. To construct the SLR (1) parsing table, we use the canonical collection of LR (0) items.
●​ In the SLR (1) parsing, we place the reduce move only in the follow of left hand side.
Various steps involved in the SLR (1) Parsing:
●​ For the given input string write a context free grammar
●​ Check the ambiguity of the grammar
●​ Add Augment production in the given grammar
●​ Create Canonical collection of LR (0) items
●	Draw the DFA (deterministic finite automaton) of the canonical collection of LR (0) item sets
●​ Construct a SLR (1) parsing table

●​ If a state (Ii) is going to some other state (Ij) on a terminal then it corresponds to a shift move in the
action part.

●​ If a state (Ii) is going to some other state (Ij) on a variable then it correspond to go to
move in the Go to part.

●​ If a state (Ii) contains the final item like A → ab• which has no transitions to the next state
then the production is known as reduce production. For all terminals X in FOLLOW (A),
write the reduce entry along with their production numbers.

Example
S -> •Aa
A->αβ•
Follow(S) = {$}
Follow (A) = {a}
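Here the item A → αβ• in the state is complete, so the reduce entry for A → αβ is placed only in the column of the terminal a (since Follow(A) = {a}); an LR(0) parser, by contrast, would place the reduce entry in every column of that state.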
UNIT-IV
INTERMEDIATE CODE GENERATION
●​ Intermediate Representation(IR), as the name suggests, is any representation of a program between the source and
target languages.​

GRAPHICAL REPRESENTATIONS
●​ An Abstract Syntax Tree (AST) is a hierarchical tree-like data structure that represents the structure of source
code in a programming language.
●​ Each node in the tree corresponds to a programming construct or element, such as statements, expressions, or
variables.
●​ Each interior node represents an operator.
●​ Each leaf node represents an operand.

THREE ADDRESS CODE


●​ The directed acyclic graph (DAG) is one step simplified from the AST. A DAG is similar to the AST, except that
it can have an arbitrary graph structure, and the individual nodes are greatly simplified.
●​ Each node of it contains a unique value.
●​ It does not contain any cycles in it, hence called Acyclic.
●​ Now suppose we compile a simple expression like x=(a+10)*(a+10). The AST representation of this expression
would directly capture the syntactic structure

●​ After performing typechecking, we may learn that a is a floating point value, and therefore 10 must be converted
into a float before performing floating point arithmetic. In addition, the computation a+10 need only be performed
once, and the resulting value used twice.
●​ All of that can be represented with the following DAG, which introduces a new type of node ITOF to perform
integer-to-float conversion, and nodes FADD and FMUL to perform floating point arithmetic.

●​ It is also common for a DAG to represent address computations related to pointers and arrays in greater detail, so
that they can be shared and optimized, where possible. For example, the array lookup x=a[i] would have a very
simple representation in the AST.
●​ An array lookup is actually accomplished by adding the starting address of the array a with the index of the item i
multiplied by the size of objects in the array, determined by consulting the symbol table.


●​ The value-number method can be used to construct a DAG from an AST. The idea is to build an array where each
entry consists of a DAG node type, and the array index of the child nodes. Every time we wish to add a new node
to the DAG, we search the array for a matching node and re-use it to avoid duplication.
●​ One easy optimization is constant folding. This is the process of reducing an expression consisting of only
constants into a single value.
●​ Suppose you have an expression that computes the number of seconds present in the number of days. The
programmer expresses this as secs=days*24*60*60 to make it clear that there are 24 hours in a day, 60 minutes in
an hour, and 60 seconds in a minute.
●​ The algorithm descends through the tree and combines IMUL(60,60) into 3600, and then IMUL(3600,24) into
86400.
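A small sketch of how such a bottom-up folding pass might be written is given below; the node structure, the helper names, and the right-grouped tree for days*24*60*60 are assumptions made only for illustration.
#include <iostream>
#include <memory>
#include <string>

// A minimal, hypothetical AST: a node is a constant, a variable reference,
// or an IMUL with two children.
struct Node {
    enum { CONST, VAR, IMUL } kind;
    long value;                       // valid for CONST
    std::string name;                 // valid for VAR
    std::unique_ptr<Node> l, r;       // valid for IMUL
};

std::unique_ptr<Node> cst(long v)        { auto n = std::make_unique<Node>(); n->kind = Node::CONST; n->value = v; return n; }
std::unique_ptr<Node> var(std::string s) { auto n = std::make_unique<Node>(); n->kind = Node::VAR; n->name = std::move(s); return n; }
std::unique_ptr<Node> imul(std::unique_ptr<Node> a, std::unique_ptr<Node> b) {
    auto n = std::make_unique<Node>(); n->kind = Node::IMUL; n->l = std::move(a); n->r = std::move(b); return n;
}

// Constant folding: descend post-order; when both operands of an IMUL are
// constants, replace the whole subtree by a single constant node.
void fold(std::unique_ptr<Node>& n) {
    if (n->kind != Node::IMUL) return;
    fold(n->l);
    fold(n->r);
    if (n->l->kind == Node::CONST && n->r->kind == Node::CONST)
        n = cst(n->l->value * n->r->value);
}

int main() {
    // secs = days * (24 * (60 * 60)); the grouping is chosen so the constant
    // subtree folds in two steps: 60*60 -> 3600, then 24*3600 -> 86400.
    auto secs = imul(var("days"), imul(cst(24), imul(cst(60), cst(60))));
    fold(secs);
    std::cout << secs->r->value << "\n";   // prints 86400
}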
Control Flow Graph
●​ It is important to note that a DAG by itself is suitable for encoding expressions, but it isn’t as effective for control
flow or other ordered program structures.
●	We can use a control flow graph to represent the higher-level structure of the program. The control flow graph is a
directed graph (possibly cyclic) where each node of the graph consists of a basic block of sequential statements.
●​ The edges of the graph represent the possible flows of control between basic blocks.
●​ A conditional construct (like if or switch) results in branches in the graph, while a loop construct (like for or
while) results in reverse edges.
for(i=0;i<10;i++) {
if(i%2==0) {
print "even";
} else {
print "odd";
}
print "\n";
}
return;
ASSIGNMENT STATEMENT​
●​ The static single assignment (SSA) [1] form is a commonly-used representation for complex optimizations.
●​ SSA uses the information in the control flow and updates each basic block with a new restriction: variables
cannot change their values. Instead, whenever a variable is assigned a new value, it is given a new version
number.
For example, suppose that we have this bit of code:
int x = 1;
int a = x;
int b = a + 10;
x = 20 * b;
x = x + 30;
We could re-write it in SSA form like this:
int x_1 = 1;
int a_1 = x_1;
int b_1 = a_1 + 10;
x_2 = 20 * b_1;
x_3 = x_2 + 30;

LINEAR IR/BOOLEAN EXPRESSIONS


●​ A linear IR is an ordered sequence of instructions that is closer to the final goal of an assembly language.
●​ It can capture expressions, statements, and control flow all within one data structure. This enables some
optimization techniques that span multiple expressions.
●​ A linear IR typically looks like an idealized assembly language with a large (or infinite) number of registers and
the usual arithmetic and control flow operations
●​ let us assume an IR where LOAD and STOR are used to move values between memory and registers, and
three-address arithmetic operations read two registers and write to a third, from left to right.
●	Each instruction can be a fixed-size 4-tuple representing the operation and (at most) three arguments.
●	It is most convenient to pretend that there are an infinite number of virtual registers, such that every new value
computed writes to a new register.
●	In this form, we can easily identify the lifetime of a value by observing the first point where a register is
written and the last point where a register is used.
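Under these assumed conventions, the expression x=(a+10)*(a+10) from earlier might be encoded as the following sequence (the register names and opcode spellings are illustrative):
LOAD a, %r1
LOAD $10, %r2
ITOF %r2, %r3
FADD %r1, %r3, %r4
FMUL %r4, %r4, %r5
STOR %r5, x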

STACK MACHINE IR/CASE STATEMENTS


●	An even more compact intermediate representation is a stack machine IR. Such a representation is designed
to execute on a virtual stack machine that has no traditional registers, but only a stack to hold intermediate
values.
●​ A stack machine IR typically has a PUSH instruction which pushes a variable or literal value on to the
stack and a POP instruction which removes an item and stores it in memory
●​ Binary arithmetic operators (like FADD or FMUL) implicitly pop two values off the stack and push the
result on the stack
●​ Our example expression would look like this in a stack machine IR:
PUSH a
PUSH 10
ITOF
FADD
COPY
FMUL
POP x
If we suppose that a has the value 5.0, then executing the IR directly would result in this
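A plausible reconstruction of that execution trace (stack contents after each instruction, top of stack on the right) is:
PUSH a   -> 5.0
PUSH 10  -> 5.0 10
ITOF     -> 5.0 10.0
FADD     -> 15.0
COPY     -> 15.0 15.0
FMUL     -> 225.0
POP x    -> (empty), with x = 225.0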

Examples
float f( int a, int b, float x )
{
float y = a*x*x + b*x + 100;
return y;
}
GIMPLE - GNU Simple Representation
●​ The GNU Simple Representation (GIMPLE) is an internal IR used at the earliest stages of the GNU C compiler.
●​ All expressions have been broken down into individual operators on values in static single assignment form.
f (int a, int b, float x)
{
float D.1597;
float D.1598;
float D.1599;
float D.1600;
float D.1601;
float D.1602;
float D.1603;
float y;
D.1597 = (float) a;
D.1598 = D.1597 * x;
D.1599 = D.1598 * x;
D.1600 = (float) b;
D.1601 = D.1600 * x;
D.1602 = D.1599 + D.1601;
y = D.1602 + 1.0e+2;
D.1603 = y;
return D.1603;
}
LLVM - Low Level Virtual Machine
●​ The Low Level Virtual Machine (LLVM)3 project is a language and a corresponding suite of tools for building
optimizing compilers and interpreters.
●	A variety of compiler front-ends support the generation of LLVM intermediate code, which can be optimized by a
variety of independent tools, and then translated again into native machine code, or bytecode for virtual machines
like Oracle’s JVM or Microsoft’s CLR.
●	Note that the first few alloca instructions allocate space for local variables, followed by store instructions that
move the parameters into those local variables.
define float @f(i32 %a, i32 %b, float %x) #0 {
%1 = alloca i32, align 4
%2 = alloca i32, align 4
%3 = alloca float, align 4
%y = alloca float, align 4
store i32 %a, i32* %1, align 4
store i32 %b, i32* %2, align 4
store float %x, float* %3, align 4
%4 = load i32* %1, align 4
%5 = sitofp i32 %4 to float
%6 = load float* %3, align 4
%7 = fmul float %5, %6
%8 = load float* %3, align 4
%9 = fmul float %7, %8
%10 = load i32* %2, align 4
%11 = sitofp i32 %10 to float
%12 = load float* %3, align 4
%13 = fmul float %11, %12
%14 = fadd float %9, %13
%15 = fadd float %14, 1.000000e+02
store float %15, float* %y, align 4
%16 = load float* %y, align 4
ret float %16
}​
JVM - Java Virtual Machine
●	The Java Virtual Machine (JVM) is an abstract definition of a stack-based machine. High-level code written in Java
is compiled into .class files which contain a binary representation of the JVM bytecode.
●	Note that each of the iload instructions refers to a local variable, where parameters are considered as the first few
local variables. So, iload 1 pushes the first local variable (int a) on to the stack, while fload 3 pushes the third
local variable (float x) on to the stack.
0: iload 1
1: i2f
2: fload 3
3: fmul
4: fload 3
5: fmul
6: iload 2
7: i2f
8: fload 3
9: fmul
10: fadd
11: ldc #2
12:fadd
13: fstore 4
14: fload 4
15: freturn
SYNTAX DIRECTED TRANSLATION
●​ Parser uses a CFG(Context-free-Grammar) to validate the input string and produce output for the
next phase of the compiler. Output could be either a parse tree or an abstract syntax tree. Now to
interleave semantic analysis with the syntax analysis phase of the compiler, we use Syntax Directed
Translation.

●​ Conceptually, with both syntax-directed definition and translation schemes, we parse the input
token stream, build the parse tree, and then traverse the tree as needed to evaluate the semantic rules at
the parse tree nodes.
●​ Evaluation of the semantic rules may generate code, save information in a symbol table, issue
error messages, or perform any other activities.
●​ The translation of the token stream is the result obtained by evaluating the semantic rules.
●​ Syntax Directed Translation has augmented rules to the grammar that facilitate semantic analysis.
●​ SDT involves passing information bottom-up and/or top-down to the parse tree in form of
attributes attached to the nodes.
●​ Syntax-directed translation rules use
1) lexical values of nodes
2) constants
3) attributes associated with the non-terminals in their definitions.
●​ The general approach to Syntax-Directed Translation is to construct a parse tree or syntax tree and
compute the values of attributes at the nodes of the tree by visiting them in some order.
●​ In many cases, translation can be done during parsing without building an explicit tree
Example
E -> E+T | T
T -> T*F | F
F -> INTLIT
●​ This is a grammar to syntactically validate an expression having additions and multiplications in
it. Now, to carry out semantic analysis we will augment SDT rules to this grammar, in order to pass
some information up the parse tree and check for semantic errors.
E -> E+T { E.val = E.val + T.val } PR#1
E -> T { E.val = T.val } PR#2
T -> T*F { T.val = T.val * F.val } PR#3
T -> F { T.val = F.val } PR#4
F -> INTLIT { F.val = INTLIT.lexval } PR#5
●​ Let’s take a string to see how semantic analysis happens – S = 2+3*4. Parse tree corresponding to
S would be
●	To evaluate the translation rules, we can employ one depth-first (post-order) traversal of the parse tree.
This is possible because, for a grammar having only synthesized attributes, the SDT rules impose no specific
evaluation order other than that children’s attributes must be computed before their parents’.
●​ Otherwise, we would have to figure out the best-suited plan to traverse through the parse tree and
evaluate all the attributes in one or more traversals.
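As a small illustration, the sketch below evaluates the synthesized attribute val for the input 2+3*4 with a single post-order traversal, mirroring rules PR#1–PR#5; the parse tree is built by hand here (a real parser would construct it), and all names are illustrative assumptions.
#include <iostream>
#include <memory>

struct Node {
    char op;                            // '+', '*', or 'n' for an INTLIT leaf
    int val;                            // the synthesized attribute
    std::unique_ptr<Node> left, right;
};

std::unique_ptr<Node> lit(int v) {
    auto n = std::make_unique<Node>(); n->op = 'n'; n->val = v; return n;
}
std::unique_ptr<Node> bin(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    auto n = std::make_unique<Node>(); n->op = op;
    n->left = std::move(l); n->right = std::move(r); return n;
}

// Post-order evaluation: children first, then the parent.
void evaluate(Node* n) {
    if (n->op == 'n') return;                              // PR#5: F.val = INTLIT.lexval
    evaluate(n->left.get());
    evaluate(n->right.get());
    n->val = (n->op == '+') ? n->left->val + n->right->val  // PR#1
                            : n->left->val * n->right->val; // PR#3
}

int main() {
    // Parse tree for 2 + 3 * 4 (multiplication binds tighter than addition)
    auto root = bin('+', lit(2), bin('*', lit(3), lit(4)));
    evaluate(root.get());
    std::cout << root->val << "\n";     // prints 14
}
Running it prints 14, i.e. E.val = 2 + (3 * 4), which is the value propagated up the parse tree by the semantic rules.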
Advantages of Syntax Directed Translation:
1.​Ease of implementation: SDT is a simple and easy-to-implement method for translating a
programming language. It provides a clear and structured way to specify translation rules using
grammar rules.
2.​Separation of concerns: SDT separates the translation process from the parsing process, making it
easier to modify and maintain the compiler. It also separates the translation concerns from the parsing
concerns, allowing for more modular and extensible compiler designs.
3.​Efficient code generation: SDT enables the generation of efficient code by optimizing the translation
process. It allows for the use of techniques such as intermediate code generation and code optimization.
Disadvantages of Syntax Directed Translation:
1.​ Limited expressiveness: SDT has limited expressiveness in comparison to other translation methods,
such as attribute grammars. This limits the types of translations that can be performed using SDT.
2.​ Inflexibility: SDT can be inflexible in situations where the translation rules are complex and cannot be
easily expressed using grammar rules.
3.​ Limited error recovery: SDT is limited in its ability to recover from errors during the translation
process. This can result in poor error messages and may make it difficult to locate and fix errors in the
input program

UNIT V
5.1. CODE OPTIMIZATION
●​Code optimization is a crucial phase in compiler design aimed at enhancing the performance and
efficiency of the executable code.
●​By improving the quality of the generated machine code optimizations can reduce execution time,
minimize resource usage, and improve overall system performance.
●​This process involves the various techniques and strategies applied during compilation to produce more
efficient code without altering the program’s functionality.
●​The code optimization in the synthesis phase is a program transformation technique, which tries to
improve the intermediate code by making it consume fewer resources (i.e. CPU, Memory) so that
faster-running machine code will result. The compiler optimizing process should meet the following
objectives:

✔​ The optimization must be correct, it must not, in any way, change the meaning of the program.

✔​ Optimization should increase the speed and performance of the program.

✔​ The compilation time must be kept reasonable.

✔​ The optimization process should not delay the overall compiling process.

Types of Code Optimization


The optimization process can be broadly classified into two types:
1.Machine Independent Optimization: This code optimization phase attempts to improve
the intermediate code to get a better target code as the output. The part of the intermediate code which is
transformed here does not involve any CPU registers or absolute memory locations.
2.Machine Dependent Optimization: Machine-dependent optimization is done after the target code has
been generated and when the code is transformed according to the target machine architecture. It involves
CPU registers and may have absolute memory references rather than relative references.
Machine-dependent optimizers put efforts to take maximum advantage of the memory hierarchy.
5.2. PRINCIPLE SOURCE OF OPTIMIZATION
●​ A transformation of a program is called local if it can be performed by looking only at the statements
in a basic block; otherwise, it is called global.
●​ Many transformations can be performed at both the local and global levels. Local transformations are
usually performed first.
Function-Preserving Transformations
●​There are a number of ways in which a compiler can improve a program without changing the function it
computes.
●​Function preserving transformations examples:
1.​ Common sub expression elimination
2.​ Copy propagation,
3.​ Dead-code elimination
4.​ Constant folding
●​The other transformations come up primarily when global optimizations are performed.
●​Frequently, a program will include several calculations of the offset in an array. Some of the duplicate
calculations cannot be avoided by the programmer because they lie below the level of detail accessible
within the source language.
1. Common Sub expressions elimination:
●​An occurrence of an expression E is called a common sub-expression if E was previously computed, and
the values of variables in E have not changed since the previous computation.
●​We can avoid recomputing the expression if we can use the previously computed value.
For example
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t4: = 4*i
t5: = n
t6: = b [t4] +t5
The above code can be optimized using the common sub-expression elimination as
t1: = 4*i
t2: = a [t1]
t3: = 4*j
t5: = n
t6: = b [t1] +t5

●​The common sub expression t4: =4*i is eliminated as its computation is already in t1 and the value of i is
not been changed from definition to use.
2.Copy Propagation:
●​Assignments of the form f : = g called copy statements, or copies for short.
●​The idea behind the copy-propagation transformation is to use g for f, whenever possible after the copy
statement f: = g. Copy propagation means use of one variable instead of another.
●​This may not appear to be an improvement, but as we shall see it gives us an opportunity to eliminate x.
For example:
x=Pi;
A=x*r*r;
The optimization using copy propagation can be done as follows: A=Pi*r*r;
Here the variable x is eliminated
3. Dead-Code Eliminations:
●​ A variable is live at a point in a program if its value can be used subsequently; otherwise, it is dead at
that point. A related idea is dead or useless code, statements that compute values that never get used.
While the programmer is unlikely to introduce any dead code intentionally, it may appear as the result
of previous transformations.
Example:
i = 0;
if (i == 1)
{
a=b+5;
}
Here, ‘if’ statement is dead code because this condition will never get satisfied.
4.Constant folding:
●​Deducing at compile time that the value of an expression is a constant and using the constant instead is
known as constant folding. One advantage of copy propagation is that it often turns the copy statement
into dead code.
For example,
a=3.14157/2 can be replaced by
a=1.570, thereby eliminating a division operation.
5.Loop Optimizations:
●​ In loops, especially in the inner loops, programs tend to spend the bulk of their time. The running time
of a program may be improved if the number of instructions in an inner loop is decreased, even if we
increase the amount of code outside that loop.
Three techniques are important for loop optimization:

✔​ Code motion, which moves code outside a loop;

✔	Induction-variable elimination, which we apply to eliminate redundant induction variables from inner loops.

✔	Reduction in strength, which replaces an expensive operation by a cheaper one, such as a
multiplication by an addition.
6.Code Motion
●​An important modification that decreases the amount of code in a loop is code motion. This
transformation takes an expression that yields the same result independent of the number of times a loop
is executed (a loop-invariant computation) and places the expression before the loop.
●​Note that the notion “before the loop” assumes the existence of an entry for the loop. For example,
evaluation of limit-2 is a loop-invariant computation in the following while-statement:
while (i <= limit-2) /* statement does not change limit*/
Code motion will result in the equivalent of
t= limit-2;
while (i<=t) /* statement does not change limit or t */
7. Induction Variables :
●​Loops are usually processed inside out. For example consider the loop around B3. Note that the values of
j and t4 remain in lock-step; every time the value of j decreases by 1, that of t4 decreases by 4 because 4*j
is assigned to t4. Such identifiers are called induction variables.
●​When there are two or more induction variables in a loop, it may be possible to get rid of all but one, by
the process of induction-variable elimination. For the inner loop around B3 in Fig.5.3 we cannot get rid of
either j or t4 completely; t4 is used in B3 and j in B4.
●​However, we can illustrate reduction in strength and illustrate a part of the process of induction-variable
elimination. Eventually j will be eliminated when the outer loop of B2- B5 is considered.
Example:
●​ As the relationship t4:=4*j surely holds after such an assignment to t4 in Fig. and t4 is not changed
elsewhere in the inner loop around B3, it follows that just after the statement j:=j-1 the relationship t4:=
4*j-4 must hold.
●​ We may therefore replace the assignment t4:= 4*j by t4:= t4-4. The only problem is that t4 does not
have a value when we enter block B3 for the first time.
●​ Since we must maintain the relationship t4=4*j on entry to the block B3, we place an initializations
of t4 at the end of the block where j itself is initialized,
●​ The replacement of a multiplication by a subtraction will speed up the object code if multiplication
takes more time than addition or subtraction, as is the case on many machines.
Reduction In Strength:
●​Reduction in strength replaces expensive operations by equivalent cheaper ones on the target machine.
Certain machine instructions are considerably cheaper than others and can often be used as special cases
of more expensive operators.
●​ For example, x² is invariably cheaper to implement as x*x than as a call to an exponentiation routine.
Fixed-point multiplication or division by a power of two is cheaper to implement as a shift. Floating-point
division by a constant can be implemented as multiplication by a constant, which may be cheaper.
5.3. FUNCTION PRESERVING TRANSFORMATION
●​ Optimization is applied to the basic blocks after the intermediate code generation phase of the compiler.
Optimization is the process of transforming a program that improves the code by consuming fewer
resources and delivering high speed.
●​ In optimization, high-level codes are replaced by their equivalent efficient low-level codes.
Optimization of basic blocks can be machine-dependent or machine-independent. These transformations
are useful for improving the quality of code that will be ultimately generated from basic block.
There are two types of basic block optimizations:
1.​ Structure preserving transformations
2.​ Algebraic transformations

Structure-Preserving Transformations:
The structure-preserving transformation on basic blocks includes:
1.​ Dead Code Elimination
2.​ Common Subexpression Elimination
3.​ Renaming of Temporary variables
4.​ Interchange of two independent adjacent statements
1.Dead Code Elimination:
Dead code is defined as that part of the code that never executes during the program execution. So, for
optimization, such code or dead code is eliminated. The code which is never executed during the
program (Dead code) takes time so, for optimization and speed, it is eliminated from the code.
Eliminating the dead code increases the speed of the program as the compiler does not have to translate
the dead code.
Example:
// Program with Dead code
int main()
{
int x = 2;
if (x > 2)
cout << "code"; // Dead code
else
cout << "Optimization";
return 0;
}
// Optimized Program without dead code
int main()
{
int x = 2;
cout << "Optimization"; // Dead Code Eliminated
return 0;
}
2.Common Subexpression Elimination:
In this technique, the sub-expressions which are common and used frequently are calculated only once
and reused when needed. A DAG (Directed Acyclic Graph) is used to eliminate common subexpressions.
Example:
3.Renaming of Temporary Variables:
Statements containing instances of a temporary variable can be changed to instances of a new temporary
variable without changing the basic block value.
Example: Statement t = a + b can be changed to x = a + b where t is a temporary variable and x is a new
temporary variable without changing the value of the basic block.
4.Interchange of Two Independent Adjacent Statements:
If a block has two adjacent statements which are independent can be interchanged without affecting the
basic block value.
Example:
t1 = a + b
t2 = c + d
These two independent statements of a block can be interchanged without affecting the value of the
block.
Algebraic Transformation:
Countless algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set. Some of the algebraic transformation on basic blocks
includes:
1.​ Constant Folding
2.​ Copy Propagation
3.​ Strength Reduction
1. Constant Folding:
Solve the constant terms which are continuous so that compiler does not need to solve this expression.
Example:
x=2*3+y ⇒x=6+y (Optimized code)
2. Copy Propagation:
It is of two types, Variable Propagation, and Constant Propagation.
Variable Propagation:
x=y ⇒ z = y + 2 (Optimized code)
z=x+2
Constant Propagation:
x=3 ⇒ z = 3 + a (Optimized code)
z=x+a
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
Loop Optimization:
Loop optimization includes the following strategies:
1.​ Code motion & Frequency Reduction
2.​ Induction variable elimination
3.​ Loop merging/combining
4.​ Loop Unrolling
1. Code Motion & Frequency Reduction
Move loop invariant code outside of the loop.
// Program with loop variant inside loop
int main()
{
for (i = 0; i < n; i++) {
x = 10;
y = y + i;
}
return 0;
}
// Program with loop variant outside loop
int main()
{
x = 10;
for (i = 0; i < n; i++)
y = y + i;
return 0;
}
2. Induction Variable Elimination:
Eliminate various unnecessary induction variables used in the loop.
// Program with multiple induction variables
int main()
{
i1 = 0;
i2 = 0;
for (i = 0; i < n; i++) {
A[i1++] = B[i2++];
}
return 0;
}
// Program with one induction variable
int main()
{
for (i = 0; i < n; i++) {
A[i] = B[i]; // Only one induction variable
}
return 0;
}
3. Loop Merging/Combining:
If the operations performed can be done in a single loop then, merge or combine the loops.
// Program with multiple loops
int main()
{
for (i = 0; i < n; i++)
A[i] = i + 1;
for (j = 0; j < n; j++)
B[j] = j - 1;
return 0;
}
// Program with one loop when multiple loops are merged
int main()
{
for (i = 0; i < n; i++) {
A[i] = i + 1;
B[i] = i - 1;
}
return 0;
}
4. Loop Unrolling:
If there exists simple code which can reduce the number of times the loop executes then, the loop can be
replaced with these codes.
// Program with loops
int main()
{
for (i = 0; i < 3; i++)
cout << "Cd";
return 0;
}
// Program with simple code without loops
int main()
{
cout << "Cd";
cout << "Cd";
cout << "Cd";
return 0;
}
5.4 COMMON SUBEXPRESSION
●​ The expression or sub-expression that has been appeared and computed before and appears again
during the computation of the code is the common sub-expression. Elimination of that sub-expression is
known as Common sub-expression elimination.
●​ The advantage of this elimination method is to make the computation faster and better by
avoiding the re-computation of the expression. In addition, it utilizes memory efficiently.
Types of common sub-expression elimination
The two types of elimination methods in common sub-expression elimination are:
1. Local Common Sub-expression elimination – It is used within a single basic block, where a basic
block is a straight-line code sequence that has no branches.
2. Global Common Sub-expression elimination– It is used for an entire procedure of common
sub-expression elimination.
Example 1:
Before elimination –
a = 10;
b = a + 1 * 2;
c = a + 1 * 2;
//’c’ has common expression as ‘b’
d = c + a;
After elimination –
a = 10;
b = a + 1 * 2;
d = b + a;

The result of ‘d’ would be the same with both versions. So, we will eliminate one of the common
subexpressions, as this helps in faster execution and efficient memory utilization.
Example 2:
Before elimination –
x = 11;
y = 11 * 24;
z = x * 24;
//'z' computes the same value as 'y', since 'x' is 11 and can be substituted directly as done in 'y'.
After elimination –
y = 11 * 24;
Benefits of Common Subexpression Elimination
●​Improved Performance: The primary advantage of CSE is enhanced performance. By reducing
redundant computations, CSE significantly decreases the execution time of a program. This optimization
is particularly beneficial in computationally intensive applications.
●​Simplified Code: CSE simplifies code by removing unnecessary redundancy. Cleaner code is easier to
read, maintain, and debug, leading to fewer programming errors and improved software quality.
●​Reduced Resource Usage: Reducing redundant calculations conserves computational resources such
as CPU time and memory. This can be especially advantageous in resource-constrained environments.
// before CSE
int main()
{
a = b + c * 2; //(c*2) is a common subexpression
x = y + c * 2;
ans = a + x;
}

// after CSE
int main()
{
int temp = c * 2; //(c*2) is assigned to a variable temp
a = b + temp; // thus saving the time of computing (c*2)
// again
x = y + temp;
ans = a + x;
}
5.5. COPY PROPAGATION
●	Copy propagation is an optimization technique used in compiler design. Copy propagation
replaces occurrences of variables that are the targets of direct assignments (copies) with their values.
●	Copy propagation is related to the approach of common subexpression elimination: in both, the
value being reused must not have changed since it was first computed.
●​ The goal of copy propagation is to reduce the unnecessary expression variables. Which in turn results in
faster execution and less memory utilization.
●​ Copy propagation can be used or applied only when the exact value of the variable is known at the time
of compilation and the value can be inferred from the context of the program.
Types of Copy Propagation
1. Local Copy Propagation
●​ Local copy propagation is a type of propagation that is limited or restricted within a specific block
of code. The optimization of code is limited to the current working block in local copy propagation and
cannot be extended beyond its boundaries.
// Example for Local Copy Propagation
// Before applying Copy Propagation
#include <iostream>
using namespace std;
int main()
{
int a = 1 + 2;
int b = a;
int ans = b + 6;
cout << "Before copy propagation, ans= " << ans;
return 0;
}
Output:
Before copy propagation, ans= 9
After Local Copy Propagation
// Example for Local Copy Propagation
// After applying Copy Propagation
#include <iostream>
using namespace std;
int main()
{
int a = 1 + 2;
int ans = a + 6;
cout << "After copy propagation, ans= " << ans;
return 0;
}
Output:
After copy propagation, ans= 9

2. Global Copy Propagation

●	Global copy propagation is a type of propagation that is not limited to a specific block of code. It can
access and use the values that belong to other blocks than the current working block.
●	Global copy propagation has an advantage as compared to local copy propagation in that it
eliminates redundant computations on a large scale.
Example:
Before Global Copy Propagation
// Example of Global copy propagation
// Before Global copy propagation
#include <iostream>
using namespace std;
int globalVariable = 10;
int h;
int function1() { int h = globalVariable + 2; }
int function2()
{
int a = globalVariable;
int b = a + 15;
int ans = b;
return ans;
}
int main()
{
int result = function2();
cout << "Result before Global copy propagation: "
<< result;
return 0;
}

Output:
Result before Global copy propagation: 25
After Global Copy Propagation
// Example of Global copy propagation
// After Global copy propagation
#include <iostream>
using namespace std;
int globalVariable = 10;
int h;
int function1() { int h = globalVariable + 2; }
int function2()
{
int ans = globalVariable + 15;
return ans;
}
int main()
{
int result = function2();
cout << "Result after Global copy propagation: "
<< result;
return 0;
}
Output:
The result after Global copy propagation: 25
Advantages of Copy Propagation
●​Copy propagation reduces the required computation time by eliminating the redundant and unnecessary
variable assignments in the expressions.
●​Copy propagation ensures memory optimization. Only the required memory assignments and variables
require memory, irrelevant memory expressions are being eliminated.
●​Copy propagation simplifies the available code by eliminating the expressions that are not required
making code easily understandable.
5.6. OPTIMIZATION OF BASIC BLOCKS - ALGEBRAIC TRANSFORMATION
Algebraic Transformation:
Countless algebraic transformations can be used to change the set of expressions computed by a basic
block into an algebraically equivalent set. Some of the algebraic transformation on basic blocks includes:
1.​ Constant Folding
2.​ Copy Propagation
3.​ Strength Reduction
1. Constant Folding:
Solve the constant terms which are continuous so that compiler does not need to solve this expression.
Example:
x = 2 * 3 + y ⇒ x = 6 + y (Optimized code)
2. Copy Propagation:
It is of two types, Variable Propagation, and Constant Propagation.
Variable Propagation:
x=y ⇒ z = y + 2 (Optimized code)
z=x+2
Constant Propagation:
x=3 ⇒ z = 3 + a (Optimized code)
z=x+a
3. Strength Reduction:
Replace expensive statement/ instruction with cheaper ones.
x = 2 * y (costly) ⇒ x = y + y (cheaper)
x = 2 * y (costly) ⇒ x = y << 1 (cheaper)
Loop Optimization:
Loop optimization includes the following strategies:
1.​ Code motion & Frequency Reduction
2.​ Induction variable elimination
3.​ Loop merging/combining
4.​ Loop Unrolling
1. Code Motion & Frequency Reduction
Move loop invariant code outside of the loop.
// Program with loop variant inside loop
int main()
{
for (i = 0; i < n; i++) {
x = 10;
y = y + i;
}
return 0;
}
// Program with loop-invariant statement moved outside the loop
int main()
{
x = 10;
for (i = 0; i < n; i++)
y = y + i;
return 0;
}
2. Induction Variable Elimination:
Eliminate various unnecessary induction variables used in the loop.
// Program with multiple induction variables
int main()
{
i1 = 0;
i2 = 0;
for (i = 0; i < n; i++) {
A[i1++] = B[i2++];
}
return 0;
}
// Program with one induction variable
int main()
{
for (i = 0; i < n; i++) {
A[i] = B[i]; // Only one induction variable
}
return 0;
}
3. Loop Merging/Combining:
If the operations performed by several loops can be carried out in a single loop, merge or combine the loops.
// Program with multiple loops
int main()
{
for (i = 0; i < n; i++)
A[i] = i + 1;
for (j = 0; j < n; j++)
B[j] = j - 1;
return 0;
}
// Program with one loop when multiple loops are merged
int main()
{
for (i = 0; i < n; i++) {
A[i] = i + 1;
B[i] = i - 1;
}
return 0;
}
4. Loop Unrolling:
If the loop body is simple and the number of iterations is small and known, the loop can be
replaced by repeated copies of its body, eliminating the loop-control overhead.
// Program with loops
int main()
{
for (i = 0; i < 3; i++)
cout << "Cd";
return 0;
}
// Program with simple code without loops
int main()
{
cout << "Cd";
cout << "Cd";
cout << "Cd";
return 0;
}
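Complete unrolling, as above, is possible only when the trip count is small and known. The sketch below is an assumed illustration (not from the text) of partial unrolling with an unroll factor of 2: the body is duplicated and the loop counter advances by 2, halving the number of loop-control tests; n is assumed to be even here.
// Partial loop unrolling (factor 2), illustrative sketch
#include <iostream>
using namespace std;
int main()
{
    const int n = 6;            // assumed even for this sketch
    int A[n];
    for (int i = 0; i < n; i += 2) {
        A[i] = i + 1;           // original body
        A[i + 1] = i + 2;       // duplicated body for the next index
    }
    cout << A[n - 1];           // prints 6
    return 0;
}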
5.7. LOOPS IN FLOW GRAPHS
●	A flow graph is a directed graph that contains the flow-of-control information for a set of basic
blocks.
●	A control flow graph is used to depict how program control is passed among the blocks. It is
useful in loop optimization.
●	A flow graph is used to illustrate the flow of control between basic blocks once the intermediate code
has been partitioned into basic blocks.
●	An edge can flow from one block X to another block Y when the first instruction of block Y can
follow the last instruction of block X.
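As an illustration (this example is not from the text), the small program below can be partitioned into four basic blocks; the comments name the blocks and the flow-graph edges that connect them, including the back edge that forms the loop.
// Basic blocks and flow-graph edges for a simple counting loop
#include <iostream>
using namespace std;
int main()
{
    int sum = 0;                  // B1: sum = 0, i = 0        edge B1 -> B2
    for (int i = 0; i < 5; i++)   // B2: if i < 5 goto B3      edges B2 -> B3, B2 -> B4
        sum = sum + i;            // B3: sum = sum + i; i++    edge B3 -> B2 (back edge: the loop)
    cout << "sum = " << sum;      // B4: exit block            prints sum = 10
    return 0;
}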

Peephole Optimization
●​Peephole optimization technique is carried out at the assembly language level. This optimization
technique examines a short sequence of target instructions in a window (peephole) and replaces the
instructions by a faster and/or shorter sequence when possible.
●​Peephole optimization can also be carried out at the intermediate code level. The typical optimizations
that are carried out using peephole optimization techniques are the following
✔	Redundant instruction elimination
✔	Flow-of-control optimizations
✔	Algebraic simplifications
✔	Use of machine idioms

Redundant instruction elimination

●​ This optimization technique eliminates redundant loads and stores. Consider the following
sequence of instructions which are typical of the simple code generator algorithm that was discussed in
one of the previous modules:
MOV R0,a
MOV a,R0
●	When this sequence is observed through a peephole, the second instruction can be deleted,
provided it is not labeled with a target label (a peephole represents a sequence of instructions with at
most one entry point). The first instruction can also be deleted, using the next-use information, if
live(a) = false.
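Flow-of-control optimizations, mentioned in the list above, remove jumps whose target is itself a jump. The snippet below is only an illustration written with C++ gotos so the pattern is easy to see; a real peephole optimizer applies it to target or intermediate instructions, not to source code. After the optimization the first jump would target L2 directly, and the jump at L1 becomes unreachable if no other instruction refers to L1.
// "Jump over jump" pattern, before peephole optimization (illustrative only)
#include <iostream>
using namespace std;
int main()
{
    int x = 0;
    goto L1;        // jumps to an instruction that is itself a jump
L1:
    goto L2;        // the peephole pass would retarget the first jump to L2
L2:
    x = x + 1;
    cout << "x = " << x;   // prints x = 1
    return 0;
}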
Deleting Unreachable Code
●​ Codes that are never to be reached during a control flow could be removed. This optimization can
be carried out at the intermediate code level or final code level.
●​ Unlabeled blocks can be removed along with their instructions.
●	For example, a block that starts with an instruction such as “b := x + y” but cannot be reached along
any path from the entry of the control flow graph can be removed.
●	Generally, a loop is a subgraph of the flow graph whose nodes can all reach one another along some
path. This includes “unstructured” loops, with multiple entry and multiple exit points. A structured loop,
also called a normal loop, has one entry point and (generally) a single point of exit.
●	Loops created by mapping high-level source programs to intermediate code or assembly code are
normal. A “goto” statement can create an arbitrary loop, whereas a “break” statement creates additional
exits.
Reducible flow graphs:

●	Reducible flow graphs are special flow graphs for which several code optimization transformations are
especially easy to perform: loops are unambiguously defined, dominators can be easily calculated, and data
flow analysis problems can be solved efficiently.
●	Exclusive use of structured flow-of-control statements such as if-then-else, while-do, continue, and
break produces programs whose flow graphs are always reducible.
●	The most important properties of reducible flow graphs are that
1. There are no jumps into the middle of loops from outside;
2. The only entry to a loop is through its header.
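For example (an assumed illustration, not from the text), the structured loop below uses only while and break, so its flow graph is reducible: the while test is the unique loop header and the only way into the loop, and the break merely adds an extra exit. Replacing the structured control flow with a goto that jumps from outside directly into the middle of the loop body would violate property 1 and could make the graph irreducible.
// Structured control flow gives a reducible flow graph (illustrative)
#include <iostream>
using namespace std;
int main()
{
    int i = 0, sum = 0;
    while (i < 10) {        // loop header: the only entry into the loop
        if (i == 5) break;  // extra exit, but the graph stays reducible
        sum += i;
        i++;
    }
    cout << "sum = " << sum;   // prints sum = 10
    return 0;
}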
5.8.CODE GENERATION
●​Code generator is used to produce the target code for three-address statements. It uses registers to store
the operands of the three address statement.
Example:
Consider the three-address statement x := y + z. It can be translated into the following sequence of code (the result x is left in register R0):
MOV y, R0
ADD z, R0
Register and Address Descriptors:
●	A register descriptor keeps track of what is currently in each register. Initially, the register
descriptors show that all the registers are empty.
●​ An address descriptor is used to store the location where current value of the name can be found at
run time.
A code-generation algorithm:
The algorithm takes a sequence of three-address statements as input. For each three-address statement of
the form x := y op z, perform the following actions:
1.	Invoke a function getreg to find out the location L where the result of the computation y op z should be
stored.
2.	Consult the address descriptor for y to determine y', the current location of y. If the value of y is
currently both in memory and in a register, prefer the register as y'. If the value of y is not already in L,
generate the instruction MOV y', L to place a copy of y in L.
3.	Generate the instruction OP z', L where z' is the current location of z. If z is in both a register and a
memory location, prefer the register. Update the address descriptor of x to indicate that x is in location
L. If x is in L, update its descriptor and remove x from all other descriptors.
4.	If the current values of y or z have no next uses, are not live on exit from the block, and are in registers,
alter the register descriptors to indicate that, after execution of x := y op z, those registers will no longer
contain y or z.
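The following C++ sketch is a simplified, assumed illustration of this descriptor-driven algorithm, not the complete method: it knows only two registers, its getreg() never spills, the prefer-register rule of step 2 is reduced to a simple location lookup, and the liveness bookkeeping of step 4 is omitted. Run on the statements t := a - b, u := a - c and v := t + u, it emits the same instructions that appear in the code-sequence table of the next subsection.
// Simplified descriptor-driven code generation for x := y op z (illustrative sketch)
#include <iostream>
#include <map>
#include <string>
using namespace std;

map<string, string> regDesc;   // register descriptor: register -> name it currently holds
map<string, string> addrDesc;  // address descriptor:  name -> current location of its value

// Naive stand-in for getreg(): first empty register, else fall back to R0 (no spilling).
string getreg() {
    for (string r : {"R0", "R1"})
        if (regDesc[r].empty()) return r;
    return "R0";
}

// Emit code for one statement x := y op z and update both descriptors.
void gen(const string& x, const string& y, const string& op, const string& z) {
    string L = getreg();
    if (addrDesc[y] != L)                        // step 2: copy y into L if it is not already there
        cout << "MOV " << y << ", " << L << "\n";
    // step 3: use z's register if it has one, otherwise its memory name
    string zLoc = (addrDesc[z].empty() || addrDesc[z] == "memory") ? z : addrDesc[z];
    cout << op << " " << zLoc << ", " << L << "\n";
    regDesc[L] = x;                              // L now holds x (the old occupant is simply dropped here)
    addrDesc[x] = L;                             // x currently lives only in L
}

int main() {
    addrDesc["a"] = addrDesc["b"] = addrDesc["c"] = "memory";
    gen("t", "a", "SUB", "b");   // t := a - b  ->  MOV a, R0 ; SUB b, R0
    gen("u", "a", "SUB", "c");   // u := a - c  ->  MOV a, R1 ; SUB c, R1
    gen("v", "t", "ADD", "u");   // v := t + u  ->  ADD R1, R0
    return 0;
}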
Generating Code for Assignment Statements:
The assignment statement d:= (a-b) + (a-c) + (a-c) can be translated into the following sequence of three
address code:
1.​ t:= a-b
2.​ u:= a-c
3.​ v:= t +u
4.​ d:= v+u
Code sequence for the example is as follows:

Statement      Code Generated        Register descriptor          Address descriptor
                                     Registers empty
t := a - b     MOV a, R0             R0 contains t                t in R0
               SUB b, R0
u := a - c     MOV a, R1             R0 contains t                t in R0
               SUB c, R1             R1 contains u                u in R1
v := t + u     ADD R1, R0            R0 contains v                u in R1
                                     R1 contains u                v in R0
d := v + u     ADD R1, R0            R0 contains d                d in R0
               MOV R0, d                                          d in R0 and memory

5.9.ISSUES IN THE DESIGN OF CODE GENERATOR


●​ Code generator converts the intermediate representation of source code into a form that can be readily
executed by the machine. A code generator is expected to generate the correct code. Designing of the
code generator should be done in such a way that it can be easily implemented, tested, and maintained.
The following issues arise during the code generation phase:
Input to code generator
●​ The input to the code generator is the intermediate code generated by the front end, along with
information in the symbol table that determines the run-time addresses of the data objects denoted by
the names in the intermediate representation.
●​ Intermediate codes may be represented mostly in quadruples, triples, indirect triples, Postfix
notation, syntax trees, DAGs, etc. The code generation phase just proceeds on an assumption that the
input is free from all syntactic and state semantic errors, the necessary type checking has taken place
and the type-conversion operators have been inserted wherever necessary.
Target program:
●​ The target program is the output of the code generator. The output may be absolute machine
language, relocatable machine language, or assembly language.
●​ Absolute machine language as output has the advantages that it can be placed in a fixed memory
location and can be immediately executed. For example, WATFIV is a compiler that produces the
absolute machine code as output.
●​ Relocatable machine language as an output allows subprograms and subroutines to be compiled
separately. Relocatable object modules can be linked together and loaded by a linking loader. But there
is added expense of linking and loading.
●	Assembly language as output makes code generation easier. We can generate symbolic
instructions and use the macro facilities of the assembler in generating code. However, an additional
assembly step is then needed after code generation.
Memory Management
●	Mapping the names in the source program to the addresses of data objects is done cooperatively by
the front end and the code generator. A name in a three-address statement refers to the symbol table
entry for that name, and from the symbol table entry a relative address can be determined for the name.
Instruction selection
●	Selecting the best instructions improves the efficiency of the program; the target instruction set
should be complete and uniform. Instruction speeds and machine idioms also play a major role when
efficiency is considered. But if we do not care about the efficiency of the target program, then
instruction selection is straightforward. For example, the following three-address statements would be
translated into the code sequence shown below them:
P:=Q+R
S:=P+T
MOV Q, R0
ADD R, R0
MOV R0, P
MOV P, R0
ADD T, R0
MOV R0, S
Here the fourth statement is redundant: it reloads the value of P that the previous statement has just
stored, which leads to an inefficient code sequence. A given intermediate representation can be translated
into many code sequences, with significant cost differences between the different implementations. Prior
knowledge of instruction costs is needed in order to design good sequences, but accurate cost information
is difficult to predict.
Register allocation issues – Use of registers makes computations faster than keeping operands in
memory, so efficient utilization of registers is important. The use of registers is subdivided into two
subproblems:
1.​ During Register allocation – we select only those sets of variables that will reside in the
registers at each point in the program.
2.​ During a subsequent Register assignment phase, the specific register is picked to access the
variable.
To understand the concept consider the following three address code sequence
t:=a+b
t:=t*c
t:=t/d
Their efficient machine code sequence is as follows:
MOV a,R0
ADD b,R0
MUL c,R0
DIV d,R0
MOV R0,t
Evaluation order – The code generator decides the order in which the instructions will be executed. The
order of computations affects the efficiency of the target code: among the many possible computation
orders, some require fewer registers to hold intermediate results than others. However, picking the best
order in the general case is a difficult NP-complete problem.
Approaches to code generation issues: The code generator must always generate correct code. This is
essential because of the number of special cases a code generator might face. Some of the design
goals of a code generator are:
●​ Correct
●​ Easily maintainable
●​ Testable
●​ Efficient
Disadvantages in the design of a code generator:
Limited flexibility: Code generators are typically designed to produce a specific type of code, and as a
result, they may not be flexible enough to handle a wide range of inputs or generate code for different
target platforms. This can limit the usefulness of the code generator in certain situations.
Maintenance overhead: Code generators can add a significant maintenance overhead to a project, as
they need to be maintained and updated alongside the code they generate. This can lead to additional
complexity and potential errors.
Debugging difficulties: Debugging generated code can be more difficult than debugging hand-written
code, as the generated code may not always be easy to read or understand. This can make it harder to
identify and fix issues that arise during development.
Performance issues: Depending on the complexity of the code being generated, a code generator may
not be able to generate optimal code that is as performant as hand-written code. This can be a concern in
applications where performance is critical.
Learning curve: Code generators can have a steep learning curve, as they typically require a deep
understanding of the underlying code generation framework and the programming languages being
used. This can make it more difficult to onboard new developers onto a project that uses a code
generator.
Over-reliance: It’s important to ensure that the use of a code generator doesn’t lead to over-reliance on
generated code, to the point where developers are no longer able to write code manually when
necessary. This can limit the flexibility and creativity of a development team, and may also result in
lower quality code overall.
5.10.THE TARGET MACHINE
Target code generation is the final Phase of Compiler.
1.​ Input : Optimized Intermediate Representation.
2.​ Output : Target Code.
3.​ Task Performed : Register allocation methods and optimization, assembly level code.
4.​ Method : Three popular strategies for register allocation and optimization.
5.​ Implementation : Algorithms.
●	Target code generation converts the optimized intermediate code into a machine-understandable
form; the target code can be machine code or assembly code.
●	Each line of optimized code may map to one or more lines of machine (or assembly) code, hence
there is a 1:N mapping between them.

●	Computations are generally assumed to be performed on high-speed storage locations known as
registers. Performing operations on registers is efficient because registers are faster than cache memory,
and compilers exploit this fact.
●	However, registers are available only in limited numbers and they are costly, so the compiler should
try to use the minimum number of registers needed to keep the overall cost low.
Optimized code :
Example 1 :
L1: a = b + c * d
optimization :
t0 = c * d
a = b + t0
Example 2 :
L2: e = f - g / d
optimization :
t0 = g / d
e = f - t0
Register Allocation :
Register allocation is the process of assigning program variables to registers and reducing the number of
swaps in and out of the registers. Moving variables between registers and memory is time consuming;
this is the main reason registers are used, since they sit inside the processor and are the fastest
accessible storage locations.
Example 1:
R1<--- a
R2<--- b
R3<--- c
R4<--- d

MOV R3, c
MOV R4, d
MUL R3, R4
MOV R2, b
ADD R2, R3
MOV R1, R2
MOV a, R1
Example 2:
R1<--- e
R2<--- f
R3<--- g
R4<--- h
MOV R3, g
MOV R4, h
DIV R3, R4
MOV R2, f
SUB R2, R3
MOV R1, R2
MOV e, R1
Advantages :
●​ Fast accessible storage
●​ Allows computations to be performed on them
●​ Deterministic as it incurs no miss
●​ Reduce memory traffic
●​ Reduces overall computation time
Disadvantages :
●	Registers are available only in small numbers (a typical processor exposes just a few dozen of them)
●	Register sizes are fixed and vary from one processor to another
●	Register allocation and assignment are complicated problems
●​ Need to save and restore changes during context switch and procedure calls
