
Unit 1 - Introduction

1. Explain overview of translation process.


 A translator is a kind of program that takes one form of program as input and converts
it into another form.
 The input is called source program and output is called target program.
 The source language can be assembly language or higher level language like C, C++,
FORTRAN, etc...
 There are three types of translators,
1. Compiler
2. Interpreter
3. Assembler

2. What is compiler? List major functions done by compiler.


 A compiler is a program that reads a program written in one language and translates
it into an equivalent program in another language.

Fig.1.1. A Compiler (Source Program → Compiler → Target Program, with an error report
produced when compilation fails)
Major functions done by compiler:
 Compiler is used to convert one form of program to another.
 A compiler should convert the source program to a target machine code in such a way
that the generated target code should be easy to understand.
 Compiler should preserve the meaning of the source code.
 Compiler should report errors that occur during compilation process.
 The compilation must be done efficiently.

3. Write the difference between compiler, interpreter and assembler.
1. Compiler v/s Interpreter
No. | Compiler | Interpreter
1 | Compiler takes the entire program as input. | Interpreter takes a single instruction as input.
2 | Intermediate code is generated. | No intermediate code is generated.
3 | Memory requirement is more. | Memory requirement is less.
4 | Errors are displayed after the entire program is checked. | Errors are displayed for every instruction interpreted.
5 | Example: C compiler | Example: BASIC
Table 1.1 Difference between Compiler & Interpreter


2. Compiler v/s Assembler
No. | Compiler | Assembler
1 | It translates higher level language to machine code. | It translates mnemonic operation codes to machine code.
2 | Types of compiler: single pass compiler, multi pass compiler. | Types of assembler: single pass assembler, two pass assembler.
3 | Example: C compiler | Example: 8085, 8086 instruction set
Table 1.2 Difference between Compiler & Assembler

4. Analysis-synthesis model of compilation. OR
Explain structure of compiler. OR
Explain phases of compiler. OR
Write output of phases of a compiler for a = a + b * c * 2; type of a, b, c is float.
There are mainly two parts of the compilation process.
1. Analysis phase: The main objective of the analysis phase is to break the source code
into parts and then arrange these pieces into a meaningful structure.
2. Synthesis phase: The synthesis phase is concerned with generation of the target language
statement which has the same meaning as the source statement.
Analysis Phase: Analysis part is divided into three sub parts,
I. Lexical analysis
II. Syntax analysis
III. Semantic analysis
Lexical analysis:
 Lexical analysis is also called linear analysis or scanning.
 The lexical analyzer reads the source program and breaks it into a stream of units;
such units are called tokens.
 It then classifies the units into different lexical classes, e.g. identifiers, constants, keywords,
etc., and enters them into different tables.
 For example, in lexical analysis the assignment statement a = a + b * c * 2 would be
grouped into the following tokens:
a    Identifier 1
=    Assignment sign
a    Identifier 1
+    The plus sign
b    Identifier 2
*    Multiplication sign
c    Identifier 3
*    Multiplication sign
2    Number 2
Syntax Analysis:
 Syntax analysis is also called hierarchical analysis or parsing.
 The syntax analyzer checks whether the stream of tokens conforms to the grammar of the
language and reports any syntax error the programmer has committed.
 If the code is error free then the syntax analyzer generates the syntax tree shown below.
          =
        /   \
       a     +
           /   \
          a     *
              /   \
             b     *
                 /   \
                c     2
Semantic analysis:
 The semantic analyzer determines the meaning of a source string.
 For example, it checks matching of parentheses in an expression, matching of if..else
statements, whether arithmetic operations are performed on type-compatible operands, and the
scope of operations.
          =
        /   \
       a     +
           /   \
          a     *
              /   \
             b     *
                 /   \
                c   inttofloat
                       |
                       2
Synthesis phase: The synthesis part is divided into three sub parts,
I. Intermediate code generation
II. Code optimization
III. Code generation
Intermediate code generation:
 The intermediate representation should have two important properties: it should be
easy to produce and easy to translate into the target program.
 We consider an intermediate form called “three address code”.
 Three address code consists of a sequence of instructions, each of which has at most
three operands.
 The source program might appear in three address code as,
t1 = inttofloat(2)
t2 = id3 * t1
t3 = t2 * id2
t4 = t3 + id1
id1 = t4
Code optimization:
 The code optimization phase attempts to improve the intermediate code.
 This is necessary to have faster executing code or less consumption of memory.
 Thus, by optimizing the code, the overall running time of the target program can be
improved.
t1 = id3 * 2.0
t2 = id2 * t1
id1 = id1 + t2
Code generation:
 In the code generation phase the target code gets generated. The intermediate code
instructions are translated into a sequence of machine instructions.
MOV id3, R1
MUL #2.0, R1
MOV id2, R2
MUL R2, R1
MOV id1, R2
ADD R2, R1
MOV R1, id1
Symbol Table
 A symbol table is a data structure used by a language translator such as a compiler or
interpreter.
 It is used to store names encountered in the source program, along with the relevant
attributes for those names.
 Information about the following entities is stored in the symbol table.
 Variable/Identifier
 Procedure/function
 Keyword
 Constant
 Class name
 Label name
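For illustration only (the notes above do not prescribe a layout), one entry of such a symbol table
might be declared in C as below; the field names and kinds are assumptions made for this sketch.

enum sym_kind { SYM_VARIABLE, SYM_FUNCTION, SYM_CONSTANT, SYM_KEYWORD, SYM_CLASS, SYM_LABEL };

struct symbol {
    char           name[64];   /* lexeme, e.g. the identifier text        */
    enum sym_kind  kind;       /* variable, function, constant, ...       */
    char           type[16];   /* e.g. "int", "float"                     */
    int            scope;      /* nesting level where the name is visible */
    int            offset;     /* storage offset assigned by the compiler */
    struct symbol *next;       /* chaining for a hash-table bucket        */
};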


Fig.1.2. Phases of Compiler:
Source program → Lexical Analysis → Syntax Analysis → Semantic Analysis → Intermediate Code
Generation → Code Optimization → Code Generation → Target Program. The Symbol Table and the
error detection and recovery routines interact with all of the phases.

5. The context of a compiler. OR
Cousins of compiler. OR
What does the linker do? What does the loader do? What does the preprocessor do?
Explain their role(s) in the compilation process.
 In addition to a compiler, several other programs may be required to create an
executable target program.
Preprocessor
Preprocessor produces input to the compiler. It may perform the following functions,
1. Macro processing: A preprocessor may allow a user to define macros that are shorthand
for longer constructs.
2. File inclusion: A preprocessor may include header files into the program text.
3. Rational preprocessor: Such a preprocessor provides the user with built-in macros for
constructs like while statements or if statements.
4. Language extensions: These preprocessors attempt to add capabilities to the language by
what amount to built-in macros. Ex: the language Equel is a database query language
embedded in C. Statements beginning with ## are taken by the preprocessor to be database
access statements unrelated to C and translated into procedure calls on routines that
perform the database access.

Fig.1.3. Context of Compiler:
Skeletal source program → Preprocessor → Source program → Compiler → Target assembly
program → Assembler → Relocatable machine code → Linker / Loader → Absolute machine code
Assembler
Assembler is a translator which takes an assembly program as input and generates
machine code as output. An assembly program is a mnemonic version of machine code, in which
names are used instead of binary codes for operations.
Linker
Linker allows us to make a single program from several files of relocatable machine code.
These files may have been the result of several different compilations, and one or more may be
library files of routines provided by the system.
Loader
The process of loading consists of taking relocatable machine code, altering the relocatable
addresses and placing the altered instructions and data in memory at the proper locations.

6. Explain front end and back end in brief. (Grouping of phases)


 The phases are collected into a front end and back end.
Front end
 The front end consists of those phases that depend primarily on the source language and
are largely independent of the target machine.
 The front end includes lexical analysis, syntax analysis, semantic analysis, intermediate code
generation and creation of the symbol table.
 A certain amount of code optimization can be done by the front end.
Back end
 The back end consists of those phases that depend on the target machine and do not
depend on the source program.



TYPES OF COMPILERS:
Based on the specific input it takes and the output it produces, compilers can be classified into
the following types.
1) Traditional Compilers (C, C++, Pascal):
These compilers convert a source program in a HLL into its equivalent in native machine code or
object code.
2) Interpreters (LISP, SNOBOL, Java 1.0):
These first convert source code into intermediate code, and then interpret (emulate) it
to its equivalent machine code.
3) Cross-Compilers:
These are the compilers that run on one machine and produce code for another machine.
4) Incremental Compilers:
These compilers separate the source into user defined steps, compiling/recompiling step by step
and interpreting the steps in a given order.
5) Converters (e.g. COBOL to C++):
These programs compile from one high level language to another.
6) Just-In-Time (JIT) Compilers (Java, Microsoft .NET):
These are runtime compilers from intermediate language (byte code, MSIL) to executable code
or native machine code. These perform type-based verification which makes the executable code
more trustworthy.
7) Ahead-of-Time (AOT) Compilers (e.g., .NET ngen):
These are pre-compilers to the native code for Java and .NET.
8) Binary Compilation:
These compilers compile object code of one platform into object code of another platform.

LEXICAL ANALYSIS:
→ As the first phase of a compiler, the main task of the lexical analyzer is to read the input
characters of the source program, group them into lexemes, and produce as output tokens for each
lexeme in the source program. This stream of tokens is sent to the parser for syntax analysis. It is
common for the lexical analyzer to interact with the symbol table as well.
→When the lexical analyzer discovers a lexeme constituting an identifier, it needs to enter that
lexeme into the symbol table. This process is shown in the following figure.

→When lexical analyzer identifies the first token it will send it to the parser, the parser receives the
token and calls the lexical analyzer to send next token by issuing the getNextToken() command.
This Process continues until the lexical analyzer identifies all the tokens. During this process the
lexical analyzer will neglect or discard the white spaces and comment lines.
LEXICAL ANALYSIS Vs PARSING:
There are a number of reasons why the analysis portion of a compiler is normally separated into
lexical analysis and parsing (syntax analysis) phases.
 1. Simplicity of design is the most important consideration. The separation of Lexical
and Syntactic analysis often allows us to simplify at least one of these tasks. For
example, a parser that had to deal with comments and whitespace as syntactic units
would be considerably more complex than one that can assume comments and
whitespace have already been removed by the lexical analyzer.
 2. Compiler efficiency is improved. A separate lexical analyzer allows us to apply
specialized techniques that serve only the lexical task, not the job of parsing. In addition,
specialized buffering techniques for reading input characters can speed up the compiler
significantly.
 3. Compiler portability is enhanced: Input-device-specific peculiarities can be
restricted to the lexical analyzer.
INPUT BUFFERING:

This section considers some ways in which the simple but important task of reading the source
program can be speeded up. The task is made difficult by the fact that we often have to look one
or more characters beyond the next lexeme before we can be sure we have the right lexeme. There
are many situations where we need to look at least one additional character ahead. For instance,
we cannot be sure we've seen the end of an identifier until we see a character that is not a letter
or digit, and therefore is not part of the lexeme for id. In C, single-character operators like -, =, or <
could also be the beginning of a two-character operator like ->, ==, or <=. Thus, we shall
introduce a two-buffer scheme that handles large lookaheads safely. We then consider an
improvement involving "sentinels" that saves time checking for the ends of buffers.
Buffer Pairs

Each buffer is of the same size N, and N is usually the size of a disk block, e.g., 4096
bytes. Using one system read command we can read N characters in to a buffer, rather than
using one system call per character. If fewer than N characters remain in the input file, then a
special character, represented by eof, marks the end of the source file and is different from any
possible character of the source program.
 Two pointers to the input are maintained:
1. The Pointer lexemeBegin, marks the beginning of the current lexeme, whose extent
we are attempting to determine.
2. Pointer forward scans ahead until a pattern match is found; the exact strategy
whereby this determination is made will be covered in the balance of this chapter.

→Once the next lexeme is determined, forward is set to the character at its right end. Then,
after the lexeme is recorded as an attribute value of a token returned to the parser, lexemeBegin
is set to the character immediately after the lexeme just found. For example, when forward has passed
the end of the lexeme ** (the FORTRAN exponentiation operator), it must be retracted
one position to its left.
Advancing forward requires that we first test whether we have reached the end of one
of the buffers, and if so, we must reload the other buffer from the input, and move forward to
the beginning of the newly loaded buffer. As long as we never need to look so far ahead of the
actual lexeme that the sum of the lexeme's length plus the distance we look ahead is greater
than N, we shall never overwrite the lexeme in its buffer before determining it.
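The following is only an illustrative sketch (not part of the text above) of how the forward pointer
might be advanced in a two-buffer scheme with sentinels; the buffer names, the SENTINEL marker
and the helper reload_buffer are assumptions made for this sketch.

#define N 4096               /* buffer size, typically one disk block            */
#define SENTINEL '\0'        /* stands in for the eof marker described above     */

static char buf1[N + 1], buf2[N + 1];   /* each buffer ends with a sentinel      */
static char *forward;                   /* scanning pointer                      */

extern void reload_buffer(char *buf);   /* hypothetical: refill buf from input and
                                           place a sentinel after the last char  */

/* Advance forward by one character, switching buffers at a sentinel.
   Returns 0 when the real end of input has been reached. */
static int advance_forward(void)
{
    forward++;
    if (*forward == SENTINEL) {
        if (forward == buf1 + N) {          /* sentinel at end of buffer 1       */
            reload_buffer(buf2);
            forward = buf2;
        } else if (forward == buf2 + N) {   /* sentinel at end of buffer 2       */
            reload_buffer(buf1);
            forward = buf1;
        } else {
            return 0;                       /* sentinel inside a buffer: real eof */
        }
    }
    return 1;                               /* *forward is the next input character */
}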

LEX the Lexical Analyzer generator


Lex is a tool used to generate a lexical analyzer; the input notation for the Lex tool is
referred to as the Lex language and the tool itself is the Lex compiler. Behind the scenes, the
Lex compiler transforms the input patterns into a transition diagram and generates code in a
file called lex.yy.c; this is a C program which, when compiled with the C compiler, gives the object
code of the lexical analyzer. Here we need to know how to write the Lex language. The structure
of a Lex program is given below.
Structure of LEX Program : A Lex program has the following form:
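declarations
%%
translation rules
%%
auxiliary functions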

The declarations section includes declarations of variables, manifest constants (identifiers
declared to stand for a constant, e.g., the name of a token), and regular definitions. C declarations
appear between %{ and %}.
→In the translation rules section, we place pattern-action pairs, where each pair has the form
Pattern {Action}
→The auxiliary function definitions section includes the definitions of functions used to install
identifiers and numbers in the symbol table.
LEX Program Example:
%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE, IF, THEN, ELSE, ID, NUMBER, RELOP */
%}
/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?
%%
{ws}     {/* no action and no return */}
if       {return(IF);}
then     {return(THEN);}
else     {return(ELSE);}
{id}     {yylval = (int) installID(); return(ID);}
{number} {yylval = (int) installNum(); return(NUMBER);}
"<"      {yylval = LT; return(RELOP);}
"<="     {yylval = LE; return(RELOP);}
"="      {yylval = EQ; return(RELOP);}
"<>"     {yylval = NE; return(RELOP);}
">"      {yylval = GT; return(RELOP);}
">="     {yylval = GE; return(RELOP);}
%%
int installID() {/* function to install the lexeme, whose first character is pointed to by yytext
                    and whose length is yyleng, into the symbol table and return a pointer
                    thereto */}
int installNum() {/* similar to installID, but puts numerical constants into a separate table */}

SYNTAX ANALYSIS (PARSER)


THE ROLE OF THE PARSER:
In our compiler model, the parser obtains a string of tokens from the lexical analyzer,
as shown in the below Figure, and verifies that the string of token names can be generated
by the grammar for the source language. We expect the parser to report any syntax errors in
an intelligible fashion and to recover from commonly occurring errors to continue processing the
remainder of the program. Conceptually, for well-formed programs, the parser constructs a parse

tree and passes it to the rest of the compiler for further processing.
→During the process of parsing it may encounter some error and present the error information back
to the user
Syntactic errors include misplaced semicolons or extra or missing braces; that is,
"{" or "}". As another example, in C or Java, the appearance of a case statement without
an enclosing switch is a syntactic error (however, this situation is usually allowed by the
parser and caught later in the processing, as the compiler attempts to generate code).
→Based on the way/order the Parse Tree is constructed, Parsing is basically classified in to
following two types:
1. Top Down Parsing : Parse tree construction start at the root node and moves to the
children nodes (i.e., top down order).
2. Bottom up Parsing: Parse tree construction begins from the leaf nodes and proceeds
towards the root node (called the bottom up order).

LANGUAGE PROCESSING SYSTEM:


Based on the input the translator takes and the output it produces, a language translator can be
called as any one of the following.
Preprocessor: A preprocessor takes the skeletal source program as input and produces an extended
version of it, which is the resultant of expanding the Macros, manifest constants if any, and
including header files etc in the source file.
→For example, the C preprocessor is a macro processor
that is used automatically by the C compiler to transform our source before actual compilation.
Over and above a preprocessor performs the following activities:
 Collects all the modules, files in case if the source program is divided into different modules
stored at different files.
 Expands short hands / macros into source language statements.
Compiler: Is a translator that takes as input a source program written in high level language and
converts it into its equivalent target program in machine language. In addition to above the compiler
also
 Reports to its user the presence of errors in the source program.
 Facilitates the user in rectifying the errors, and execute the code.
Assembler: Is a program that takes as input an assembly language program and converts it into its
equivalent machine language code.
Loader / Linker: This is a program that takes as input a relocatable code and collects the library
functions, relocatable object files, and produces its equivalent absolute machine code.
Specifically,
 Loading consists of taking the relocatable machine code, altering the relocatable addresses,
and placing the altered instructions and data in memory at the proper locations.
 Linking allows us to make a single program from several files of relocatable machine
code. These files may have been result of several different compilations, one or more
may be library routines provided by the system available to any program that needs them.

→In addition to these translators, programs like interpreters, text formatters etc. may be used in
a language processing system. To translate a program in a high level language to an
executable one, the compiler performs by default the compile and linking functions.
→Normally the steps in a language processing system includes Preprocessing the skeletal Source
program which produces an extended or expanded source program or a ready to compile unit of
the source program, followed by compiling the resultant, then linking / loading , and finally its
equivalent executable code is produced.
→As noted earlier, not all these steps are mandatory. In
some cases, the compiler performs the linking and loading functions implicitly.
UNIT-II
Context Free Grammar (CFG):
→A CFG is used to describe or denote the syntax of the programming language constructs. The
CFG is denoted as G, and defined using a four tuple notation.
→Let G be a CFG; then G is written as G = (V, T, P, S)
Where,
 V is a finite set of Non terminal; Non terminals are syntactic variables that denote sets of
strings. The sets of strings denoted by non terminals help define the language generated
by the grammar. Non terminals impose a hierarchical structure on the language that
is key to syntax analysis and translation.
 T is a Finite set of Terminal; Terminals are the basic symbols from which strings are
formed. The term "token name" is a synonym for '"terminal" and frequently we will use
the word "token" for terminal when it is clear that we are talking about just the token
name. We assume that the terminals are the first components of the tokens output by the
lexical analyzer.
 S is the Starting Symbol of the grammar; one non terminal is distinguished as the start
symbol, and the set of strings it denotes is the language generated by the grammar.
 P is a finite set of Productions; the productions of a grammar specify the manner in which the
terminals and non terminals can be combined to form strings. Each production is of the form
α -> β, where α is a single non terminal and β is a string in (V U T)*. Each production consists of:
(a) A non terminal called the head or left side of the production; this
production defines some of the strings denoted by the head.
(b) The symbol ->. Sometimes ::= has been used in place of the arrow.
(c) A body or right side consisting of zero or more terminals and non-
terminals. The components of the body describe one way in which strings of the non
terminal at the head can be constructed.
 Conventionally, the productions for the start symbol are listed first.
Example: Context Free Grammar to accept Arithmetic expressions.

→The terminals are +, *, -, (,), id.


The Non terminal symbols are expression, term, factor and expression is the starting
symbol.
expression →expression + term
expression →expression – term
expression → term
term → term * factor
term →term / factor
term →factor
factor → ( expression )
factor → id

BACK TRACKING:
This parsing method uses the technique called the Brute Force method
during the parse tree construction process. This allows the process to go back (back track)
and redo the steps by undoing the work done so far at the point of processing.
→Brute force method: It is a top down parsing technique, used when there is more
than one alternative in the productions to be tried while parsing the input string. It selects
alternatives in the order they appear and, when it realizes that something has gone wrong, it tries
the next alternative.
→For example, consider the grammar below.
S → cAd
A → ab | a
To generate the input string "cad", initially the first parse tree (using A → ab) is generated.
As the string generated is not "cad", the input pointer is back tracked to the position of "A", to
examine the next alternative of "A". Now a match to the input string occurs with the
second parse tree (using A → a).

Constructing Predictive Or LL (1) Parse Table:


It is the process of placing the all productions of the grammar in the parse table based on the
FIRST and FOLLOW values of the Productions.
The rules to be followed to construct the parsing table (M) are:
1. For each production A -> α of the grammar, do the below steps.
2. For each terminal symbol 'a' in FIRST(α), add the production A -> α to M[A, a].
3. i. If ε is in FIRST(α), add the production A -> α to M[A, b], where b is each terminal in
FOLLOW(A).
ii. If ε is in FIRST(α) and $ is in FOLLOW(A), then add the production A -> α to
M[A, $].
4. Mark all other entries in the parsing table as error.
LL (1) Parsing Algorithm:
The parser acts on the basis of two symbols:
i. A, the symbol on the top of the stack
ii. a, the current input symbol
There are three conditions for A and 'a' that are used by the parsing program.
1. If A = a = $ then parsing is successful.
2. If A = a ≠ $ then the parser pops A off the stack and advances the current input pointer to the
next symbol.
3. If A is a non terminal the parser consults the entry M[A, a] in the parsing table. If
M[A, a] is a production A -> X1X2..Xn, then the program replaces the A on the top of
the stack by X1X2..Xn in such a way that X1 comes on the top.
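As an illustrative sketch only (not part of the original notes), the table-driven loop above could
look roughly as follows in C. The symbol encoding, the table M, the production array prods and the
stack helpers are assumptions made for this sketch.

#define MAX_STACK 256

/* Hypothetical encodings: terminals first, then non terminals. */
enum { SYM_DOLLAR, SYM_a, SYM_b, NT_S, NT_A, NT_B, SYM_COUNT };

/* A production is a head plus a right-hand side (empty rhs = epsilon). */
struct production { int head; int rhs[4]; int rhs_len; };

extern struct production prods[];          /* the grammar's productions            */
extern int M[SYM_COUNT][SYM_COUNT];        /* M[nonterminal][terminal]: production
                                              index, or -1 for an error entry      */

static int stack[MAX_STACK], sp;
static void push(int s) { stack[sp++] = s; }
static void pop(void)   { sp--; }

/* input: array of terminal codes ending with SYM_DOLLAR; returns 1 on success. */
int ll1_parse(const int *input)
{
    int ip = 0;
    sp = 0;
    push(SYM_DOLLAR);
    push(NT_S);                             /* start symbol on top of $             */

    for (;;) {
        int A = stack[sp - 1];
        int a = input[ip];

        if (A == SYM_DOLLAR && a == SYM_DOLLAR)
            return 1;                       /* condition 1: A = a = $, success      */
        if (A < NT_S) {                     /* A is a terminal                      */
            if (A != a) return 0;           /* mismatch: error                      */
            pop(); ip++;                    /* condition 2: pop and advance         */
        } else {                            /* condition 3: A is a non terminal     */
            int p = M[A][a];
            if (p < 0) return 0;            /* empty table entry: error             */
            pop();
            for (int i = prods[p].rhs_len - 1; i >= 0; i--)
                push(prods[p].rhs[i]);      /* push rhs so its first symbol is on top */
        }
    }
}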

RECURSIVE DESCENT PARSING:

A recursive-descent parsing program consists of a set of recursive procedures, one for each
nonterminal. Each procedure is responsible for parsing the constructs defined by its non
terminal. Execution begins with the procedure for the start symbol, which halts and
announces success if its procedure body scans the entire input string.
If the given grammar is
E → TE′
E′ → +TE′ | ε
T → FT′
T′ → *FT′ | ε
F → (E) | id
recursive procedures for the recursive descent parser for the given grammar are given
below.
procedure E( )
{
    T( );
    E′( );
}
procedure T( )
{
    F( );
    T′( );
}
procedure E′( )
{
    if input = '+'
    {
        advance( );
        T( );
        E′( );
        return true;
    }
    else return true;    /* E′ → ε, so nothing to match */
}
procedure T′( )
{
    if input = '*'
    {
        advance( );
        F( );
        T′( );
        return true;
    }
    else return true;    /* T′ → ε, so nothing to match */
}
procedure F( )
{
    if input = '('
    {
        advance( );
        E( );
        if input = ')'
        {
            advance( );
            return true;
        }
        else return error;
    }
    else if input = 'id'
    {
        advance( );
        return true;
    }
    else return error;
}
procedure advance( )
{
    input = next token;
}

What is Predictive Parsing?


→Predictive parsing is a form of recursive descent parsing in which no backtracking is
needed, so it can predict which production to use as the replacement for the input
string. There's a version called LL(1) parsing that uses a table to help make decisions,
which makes it very efficient and straightforward.
→Predictive parsing is a parsing technique used in compiler construction and syntax
analysis. It helps compilers analyze and understand the structure of code by predicting the
part that comes next depending on what's already there. This simplifies the process of
translating code into machine-readable instructions.
→Predictive parsing is a parsing technique used in compiler design to analyze and
validate the syntactic structure of a given input string based on a grammar. It
predicts the production rules to apply without backtracking, making it efficient and
deterministic.

Type of Predictive Parsing


1. Top-down Parsing
In top-down parsing, the parser starts with the top-level non-terminal of the grammar and
attempts to construct a parse tree by repeatedly expanding non-terminals until the input
string is derived. It predicts the production rules to apply from left to right, hence the name
"top-down."
Eg. Consider the grammar:
S -> A B
A -> a
B -> b
Let's parse the input string “ab” using top-down predictive parsing.
1. Construct the predictive parsing table:

       a           b
S      S -> A B
A      A -> a
B                  B -> b

2. Initialise the stack and input buffer:
Stack: $ S
Input Buffer: ab $
3. Apply the top-down predictive parsing algorithm:

Stack    Input Buffer    Action
$S       ab$             Expand S -> A B
$BA      ab$             Expand A -> a
$Ba      ab$             Match terminal 'a'
$B       b$              Expand B -> b
$b       b$              Match terminal 'b'
$        $               Accept
2. Bottom up Parsing
In bottom-up parsing, the parser begins with the input string and attempts to reduce it to
the start symbol of the grammar by applying production rules in a right-to-left manner. It
builds the parse tree from the bottom-up, hence the name "bottom-up."
Eg. Consider the grammar:
S -> A B
A -> a
B -> b
Let's parse the input string “ab” using bottom-up predictive parsing.
1. Initialize the stack and input buffer.
Stack: $
Input Buffer: ab $

2. Applying the bottom-up parsing algorithm:

Stack    Input Buffer    Action
$        ab$             Shift 'a' onto stack
$a       b$              Reduce using production A -> a
$A       b$              Shift 'b' onto stack
$Ab      $               Reduce using production B -> b
$AB      $               Reduce using production S -> A B
$S       $               Accept

Components of Predictive Parsing


1. Input Buffer
The input buffer is a temporary storage area that holds the input symbols yet to be
processed by the parser. It provides a lookahead mechanism, allowing the parser to
examine the next input symbol without consuming it.
In this diagram, the input buffer is shown as a series of unprocessed tokens (T) and non-
terminals (N). The end-of-input marker is denoted by the "$" symbol. Based on the top of
the parsing stack and the current input symbol, the parsing table is a data structure that
directs the parser's activities. The parser's next token to process is indicated by the arrow
"^" in the input buffer. The arrow moves to the right as the parser reads tokens from the
input buffer, processing each token until the end-of-input marker is reached.
2. Stack
The stack, also known as the parsing stack or the pushdown stack, is a data structure
used by the parser to keep track of the current state and to guide the parsing process. It
stores grammar symbols, both terminals and non-terminals, during the parsing process.

The stack in predictive parsing is a vertical sequence of symbols representing terminals


(T) and non-terminals (N) from the language's grammar. The top of the stack, indicated by
the "^" symbol, represents the symbol currently being processed by the parser. By
comparing the input symbol with the top of the stack, the parser determines the next action
to take. The parser can shift new symbols onto the stack or reduce symbols using
grammar productions as it processes symbols from the input stream. The stack is crucial
for maintaining the parser's state and enabling parsing actions. It allows the parser to track
encountered symbols, make decisions based on the grammar, and interact with the
parsing table. In practical implementations, the stack may contain additional information,
such as attributes or semantic actions associated with the symbols.
3. Predictive Parsing Table
The predictive parsing table is a data structure used in predictive parsing. It maps the
combination of a non-terminal and a lookahead symbol to a production rule. It guides the
parser's decisions, determining which production to apply based on the current non-
terminal and the next input symbol.
This diagram represents the parsing table as a grid with rows and columns. The rows
correspond to the non-terminals in the grammar, and the columns represent the input
symbols (terminals and the end-of-input marker $).
Each cell in the table contains information about the action or the production to be applied
when a specific non-terminal and input symbol combination is encountered. Here's what
each entry in the table represents:
•S, A, B: Non-terminals in the grammar.
•S->, A->, B->: Indicates a production rule to be applied.
•a, b: Terminal symbols from the input alphabet.
•ε: Represents an empty or epsilon production.
•Empty cells: Indicate that no action or production is defined for that combination of
non-terminal and input symbols.

For example, if the parser is in state S (row) and the next input symbol is "a" (column), the
corresponding cell entry "S->aAB" instructs the parser to apply the production rule "S ->
aAB".
The parsing table is a critical component of predictive parsing as it systematically guides
the parser's decisions and actions. By consulting the table, the parser can determine the
appropriate production or action based on the current state and the lookahead input
symbol, enabling it to construct the parse tree or detect syntax errors during the parsing
process.

Drawback of Predictive Parsing


These are some drawbacks of Predictive Parsing:
•Left Recursion: Predictive parsing cannot handle left-recursive grammars directly.
Left recursion occurs when a non-terminal directly or indirectly references itself as the
first symbol in its production rule.
•Backtracking: Predictive parsing does not support backtracking, meaning it cannot
change the decision it has made based on the input. This limitation restricts the class
of grammars that can be handled by predictive parsing.
•Difficulty handling ambiguous grammars: Predictive parsing has difficulties when
dealing with ambiguous grammars, where a single input can have numerous correct
parse trees. The LL(1) parsing approach cannot directly handle ambiguity. To resolve
ambiguities, grammar adjustments or additional strategies like disambiguation rules or
semantic actions may be needed, which might make the parser implementation more
complex.

Algorithm to Construct Predictive Parsing Table


The algorithm to construct a Predictive Parsing table involves several steps:
1. Determine the First Sets for all non-terminals in the grammar. The First Set of a non-
terminal represents the set of terminals that can appear as the first symbol of any
derivation for that non-terminal.
2. Determine the Follow Sets for all non-terminals in the grammar. The Follow Set of a
non-terminal represents the set of terminals that can appear immediately after the non-
terminal in any derivation.
3. For each production rule A -> α in the grammar:
a. For each terminal 'a' in First(α), add A -> α to the table entry [A, 'a'].
b. If ε (empty string) is in First(α), for each terminal 'b' in Follow(A), add A -> α to the
table entry [A, 'b'].
4. For each production rule A -> ε in the grammar, add A -> ε to the table entry [A, 'b'] for
each terminal 'b' in Follow(A).
Examples of Predictive Parsing
Consider the grammar:
S -> A B
A -> a
B -> b | ε
1. Determine the First Sets:
First(S) = {a}
First(A) = {a}
First(B) = {b, ε}
2. Determine the Follow Sets:
Follow(S) = {$}
Follow(A) = {b}
Follow(B) = {$}
3. Construct the Predictive Parsing Table:

       a           b           $
S      S -> A B
A      A -> a
B                  B -> b      B -> ε
4. Table entries based on the algorithm:
•From the production S -> A B:
For each terminal 'a' in First(A), add S -> A B to the entry [S, 'a'].
•From the production A -> a:
Add A -> a to the entry [A, 'a'].
•From the production B -> b:
Add B -> b to the entry [B, 'b'].
•From the production B -> ε:
For each terminal 'b' in Follow(B), add B -> ε to the entry [B, 'b'].
Add B -> ε to the entry [B, '$'].

Q. What is a parser?
A parser is a component of a compiler or interpreter that divides data into smaller chunks
for easier translation into another language. A parser breaks down input in the form of a
sequence of tokens, interactive commands, or computer instructions into portions that can
be used by other programming components.
Q. What is a recursive descent parser?
It can be defined as a Parser that processes the input string using numerous recursive
procedures with no backtracking. Recursive Descent Parser employs Top-Down Parsing
without the need for retracing.
Q. What is the main purpose of parser?
Parsers are employed to abstractly represent input data, such as source code, as a
structured format. This facilitates syntax validation by verifying adherence to language
rules. Parsing is utilized in various coding languages and technologies for this purpose.
Q. What is the difference between predictive and recursive
parsing?
Predictive parsing uses lookahead tokens to make parsing decisions without
backtracking, suitable for LL grammars. Recursive parsing involves functions
calling themselves to process input, adaptable but potentially less efficient due to
backtracking in some cases.

SHIFT-REDUCE PARSING:
→Shift-reduce parsing is a form of bottom-up parsing in which a stack holds grammar
symbols and an input buffer holds the rest of the string to be parsed. We use $ to mark the
bottom of the stack and also the right end of the input. The parser uses the shift and reduce
actions to accept the input string.
→Here, the parse tree is Constructed bottom up from the leaf nodes towards the root node.
When we are parsing the given input string, if the match occurs the parser takes the
reduce action otherwise it will go for shift action. And it can accept ambiguous grammars
also.
→For example, consider the below grammar and accept the input string "id * id" using an S-R
parser.

E → E + T | T
T → T * F | F
F → (E) | id
→Actions of the shift-reduce parser using stack implementation:
STACK     INPUT     ACTION
$         id*id$    Shift
$id       *id$      Reduce with F → id
$F        *id$      Reduce with T → F
$T        *id$      Shift
$T*       id$       Shift
$T*id     $         Reduce with F → id
$T*F      $         Reduce with T → T*F
$T        $         Reduce with E → T
$E        $         Accept

Consider the following grammar:

S → aAcBe
A → Ab | b
B → d
Let the input string be "abbcde". The series of shifts and reductions to the start symbol is as
follows.
abbcde → aAbcde → aAcde → aAcBe → S
Note: in the above example there are two actions possible in the second step, these are as
follows:
1. Shift action, going to the 3rd step
2. Reduce action, that is A -> b
If the parser takes the 1st action then it successfully accepts the given input string;
if it goes for the second action then it can't accept the given input string. This is called a
shift-reduce conflict. Where the S-R parser is not able to take the proper decision, it is not
recommended for parsing.
LR Parsing:
The most prevalent type of bottom up parsing is LR(k) parsing, where L is left to right scan of the
given input string, R is rightmost derivation in reverse and k is the number of input symbols of
lookahead.
 It is the most general non backtracking shift reduce parsing method.
 The class of grammars that can be parsed using the LR methods is a proper superset of
the class of grammars that can be parsed with predictive parsers.
 An LR parser can detect a syntactic error as soon as it is possible to do so, on a left to
right scan of the input.
LR Parser Consists of
 An input buffer that contains the string to be parsed followed by a $ Symbol, used to
indicate end of input.
 A stack containing a sequence of grammar symbols with a $ at the bottom of the stack,
which initially contains the Initial state of the parsing table on top of $.
 A parsing table (M), it is a two dimensional array M[ state, terminal or Non terminal] and
it contains two parts
1. ACTION Part
The ACTION part of the table is a two dimensional array indexed by state and the
input symbol, i.e. ACTION[state][input], An action table entry can have one of
following four kinds of values in it. They are:
1. Shift X, where X is a State number.
2. Reduce X, where X is a Production number.
3. Accept, signifying the completion of a successful parse.
4. Error entry.
2. GO TO Part
The GO TO part of the table is a two dimensional array indexed by state and a
Non terminal, i.e. GOTO[state][NonTerminal]. A GO TO entry has a state
number in the table.
 A parsing algorithm uses the current state X and the next input symbol 'a' to consult the
entry at action[X][a]. It makes one of the following four actions:
1. If action[X][a] = shift Y, the parser executes a shift of Y on to the top of the stack
and advances the input pointer.
2. If action[X][a] = reduce Y (Y is the number of the production reduced in state X), and
the production is A -> β, then the parser pops 2*|β| symbols from the stack, pushes A,
and then pushes the state given by goto[X'][A], where X' is the state now on top of the stack.
3. If action[X][a] = accept, then the parsing is successful and the input string is
accepted.
4. If action[X][a] = error, then the parser has discovered an error and calls the error
routine.
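For illustration only (not from the notes), the driver loop described above might be sketched in C
as follows. The table layout and the names action, go_to and prods are assumptions of this sketch;
it also keeps only states on the stack, so a reduction pops |β| entries rather than the 2*|β|
symbol-state pairs described above.

#define SHIFT 0
#define REDUCE 1
#define ACCEPT 2
#define ERROR 3

struct act { int kind; int arg; };            /* arg = target state or production number */
struct production { int head; int body_len; };

extern struct act action[][64];               /* action[state][terminal]                 */
extern int go_to[][64];                       /* go_to[state][nonterminal]               */
extern struct production prods[];

int lr_parse(const int *input)                /* input: terminal codes ending with $ (=0) */
{
    int stack[512], sp = 0, ip = 0;
    stack[sp] = 0;                            /* initial state on top of $               */

    for (;;) {
        int X = stack[sp];
        int a = input[ip];
        struct act e = action[X][a];

        if (e.kind == SHIFT) {                /* push the new state, advance the input   */
            stack[++sp] = e.arg;
            ip++;
        } else if (e.kind == REDUCE) {        /* pop |body| states, then push goto entry */
            struct production p = prods[e.arg];
            sp -= p.body_len;
            stack[sp + 1] = go_to[stack[sp]][p.head];
            sp++;
        } else if (e.kind == ACCEPT) {
            return 1;                         /* successful parse                        */
        } else {
            return 0;                         /* error entry                             */
        }
    }
}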
LR parsing is classified into
1. LR ( 0 )
2. Simple LR ( 1 )
3. Canonical LR ( 1 )
4. Look ahead LR ( 1 )
LALR (1) Parsing
The CLR Parser avoids the conflicts in the parse table. But it produces more number of
States when compared to SLR parser. Hence more space is occupied by the table in the
memory.
→So LALR parsing can be used. Here, the tables obtained are smaller than the CLR parse
table, but it is almost as efficient as the CLR parser. Here LR(1) items that have the same
productions but different look-aheads are combined to form a single set of items.
→For example, consider the grammar in the previous example. Consider the states I4 and I7
as given below:
I4 = Goto(I0, d) = Closure(C -> d•, c/d) = C -> d•, c/d
I7 = Goto(I2, d) = Closure(C -> d•, $) = C -> d•, $
→These states are differing only in the look-aheads. They have the same productions.
Hence these states are combined to form a single state called as I47.
Similarly the states I3 and I6 differing only in their look-aheads as given below:
I3 = Goto(I0, c) =
C -> c•C, c/d
C -> •cC, c/d
C -> •d, c/d
I6 = Goto(I2, c) =
C -> c•C, $
C -> •cC, $
C -> •d, $
→These states are differing only in the look-aheads. They have the same productions.
Hence these states are combined to form a single state called as I36.
Similarly, the states I8 and I9 differ only in their look-aheads; hence they are combined to form
the state I89.

Error recovery in parsing


A parser should be able to detect and report any error in the program. It is
expected that when an error is encountered, the parser should be able to
handle it and carry on parsing the rest of the input. Mostly it is expected from
the parser to check for errors but errors may be encountered at various stages
of the compilation process. A program may have the following kinds of errors
at various stages:

•Lexical : name of some identifier typed incorrectly


•Syntactical : missing semicolon or unbalanced parenthesis
•Semantical : incompatible value assignment
•Logical : code not reachable, infinite loop
There are four common error-recovery strategies that can be implemented in
the parser to deal with errors in the code.
Panic mode
When a parser encounters an error anywhere in the statement, it ignores the
rest of the statement by not processing input from the point of error up to a
delimiter, such as a semicolon. This is the easiest way of error-recovery and
also, it prevents the parser from developing infinite loops.
Statement mode
When a parser encounters an error, it tries to take corrective measures so that
the rest of inputs of statement allow the parser to parse ahead. For example,
inserting a missing semicolon, replacing comma with a semicolon etc. Parser
designers have to be careful here because one wrong correction may lead to
an infinite loop.
Error productions
Some common errors are known to the compiler designers that may occur in
the code. In addition, the designers can create augmented grammar to be
used, as productions that generate erroneous constructs when these errors are
encountered.
Global correction
The parser considers the program in hand as a whole and tries to figure out
what the program is intended to do and tries to find out a closest match for it,
which is error-free. When an erroneous input (statement) X is fed, it creates a
parse tree for some closest error-free statement Y. This may allow the parser
to make minimal changes in the source code, but due to the complexity (time
and space) of this strategy, it has not been implemented in practice yet.
Abstract Syntax Trees
A parse tree representation is not convenient for the later phases of the compiler, as it
contains more detail than is actually needed. Take a parse tree for an expression as an
example:
If watched closely, we find most of the leaf nodes are single child to their
parent nodes. This information can be eliminated before feeding it to the next
phase. By hiding extra information, we can obtain a tree as shown below:

Abstract tree can be represented as:

ASTs are important data structures in a compiler with least unnecessary


information. ASTs are more compact than a parse tree and can be easily used
by a compiler.
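As a purely illustrative sketch (the notes do not prescribe a representation), an AST node for
expressions might be declared in C as below; the kind names, fields and the constructor are
assumptions made for this sketch.

#include <stdlib.h>

enum ast_kind { AST_NUM, AST_ID, AST_BINOP };   /* leaf and interior node kinds */

struct ast {
    enum ast_kind kind;
    union {
        double num;                  /* AST_NUM: literal value            */
        const char *name;            /* AST_ID: identifier lexeme         */
        struct {                     /* AST_BINOP: operator with children */
            char op;                 /* '+', '-', '*', '/'                */
            struct ast *left, *right;
        } bin;
    } u;
};

/* Build an interior node; operators become interior nodes, not leaves. */
struct ast *ast_binop(char op, struct ast *l, struct ast *r)
{
    struct ast *n = malloc(sizeof *n);
    n->kind = AST_BINOP;
    n->u.bin.op = op;
    n->u.bin.left = l;
    n->u.bin.right = r;
    return n;
}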

Ambiguous Grammar
Introduction
Before heading towards ambiguous grammar, let's see about Context-Free Grammar. A
context-free grammar is a formal grammar used to generate all the possible patterns of
strings.
A context-free grammar is classified based on:
•Number of Derivation trees
•Number of Strings
•→The number of Derivation trees is further classified into
•Ambiguous grammar
•Unambiguous grammar.

Ambiguity in Grammar
A grammar or a Context-Free Grammar(CFG) is said to be ambiguous if there exists more
than one leftmost derivation(LMDT) or more than one rightmost derivation(RMDT), or
more than one parse tree for a given input string.
→Technically, we can say that a context-free grammar (CFG) represented by G = (N, T, P,
S) is said to be an ambiguous grammar if there exists at least one string in L(G) that has more
than one derivation tree. Otherwise, the grammar is unambiguous.
→One thing that should be clear is that ambiguity is a property of the grammar and not
of the language.
→Since Ambiguous Grammar can produce two Parse trees for the same expression, it's
often confusing for a compiler to find out which one among all available Parse Trees is the
correct one according to the context of the work. This is the reason ambiguity is not
suitable for compiler construction.
Example 1
Let the production rules be given as:
S -> AB|aaB
A -> a|Aa
B -> b
Let us generate string aab from the given grammar. Parse trees for generating string aab
are as follows :

Here for the same string, we are getting more than one parse tree. Hence, grammar is
ambiguous grammar.
Example 2
Let the production rules be given as:
E -> EE+
E -> E(E)
E -> id
Parse tree for id(id)id + is:

Only one parse tree is possible for id(id)id+, so the given grammar is unambiguous.
Example 3
Check whether the given grammar is ambiguous or not.
S → aSb | SS
S→ε
For the string "aabb" the above grammar can generate two parse trees.

Since there are two parse trees for a single string, "aabb", the grammar G is ambiguous.

YACC
YACC stands for Yet Another Compiler Compiler.

YACC provides a tool to produce a parser for a given grammar.

YACC is a program designed to compile a LALR (1) grammar.

It is used to produce the source code of the syntactic analyzer of the language
produced by LALR (1) grammar.

The input of YACC is the rule or grammar and the output is a C program.
These are some points about YACC:

Input: a CFG in file.y

Output: a parser y.tab.c (yacc)

The output file "file.output" contains the parsing tables.

The file "file.tab.h" contains declarations.

The parser routine generated is called yyparse().

The parser expects to use a function called yylex() to get tokens.

The basic operational sequence is as follows:
gram.y (the desired grammar in YACC format) → YACC → y.tab.c (the C source program created by
YACC) → C compiler → a.out (an executable file that will parse the grammar given in gram.y)

What is an Automatic Parser Generator?


An automatic parser generator is a software tool that helps in the development of parsers
for programming languages or other structured data formats. It simplifies the process of
creating parsers by automatically generating code based on a given set of grammar rules.
These generated parsers can then be used to analyze and interpret input data according to
the defined grammar.
How Does an Automatic Parser Generator Work?
An automatic parser generator typically takes a formal grammar specification as input and
generates code, usually in a target programming language, that implements a parser for that
grammar. The grammar specification defines the syntax and structure of the language or data
format to be parsed.
The generated parser code can be integrated into a larger software project, allowing it to
process input data according to the grammar rules. The parser can identify and extract
specific elements or patterns from the input, enabling further analysis or manipulation of
the data.

Example: Parsing JSON with an Automatic Parser Generator


Let’s consider an example of parsing JSON (JavaScript Object Notation) using an automatic
parser generator. JSON is a popular data interchange format, commonly used in web
applications.

Suppose we have the following JSON data:

{"name": "John Doe", "age": 30, "email": "[email protected]"}

To parse this JSON data, we can define a grammar using a parser generator tool, such as
ANTLR (ANother Tool for Language Recognition). The grammar might look something like
this:
jsonObject : '{' (pair (',' pair)*)? '}' ;
pair : STRING ':' value ;
value : STRING | NUMBER | jsonObject ;
STRING : '"' ~["]* '"' ;
NUMBER : '-'? [0-9]+ ('.' [0-9]+)? ;
Using the grammar specification, the automatic parser generator can generate code in a
target programming language, such as Java or C++. This generated code will include the
necessary logic to parse JSON data according to the defined grammar rules.
With the generated parser code, we can now parse the JSON data and extract specific
elements. For example, we can extract the name, age, and email from the JSON object and
use them in our application logic.
Advantages of Using an Automatic Parser Generator
Saves Development Time and Effort
Developing a parser from scratch can be a complex and time-consuming task. Automatic parser
generators automate much of the process, generating the necessary code based on a given
grammar specification. This saves developers significant time and effort, allowing them to focus
on other aspects of their project.
Ensures Correctness and Consistency
Automatic parser generators generate code that is based on the specified grammar rules. This
ensures that the generated parser will correctly interpret input data according to the defined
syntax and structure. It helps in avoiding manual errors and inconsistencies that can arise when
implementing a parser by hand.
Flexibility and Maintainability
Automatic parser generators provide flexibility in terms of the target programming language.
Developers can choose the programming language in which they want the generated parser code
to be written. This allows them to integrate the parser into their existing codebase seamlessly.
Additionally, if the grammar specification needs to be modified or updated, the parser generator
can regenerate the code, making it easier to maintain and adapt to changing requirements.
INTERMEDIATE CODE GENERATION
In intermediate code generation we use syntax directed methods to translate into an intermediate
form the source program constructs such as declarations, assignments and flow-of-control
statements.

Intermediate code is:

 The output of the Parser and the input to the Code Generator.
 Relatively machine-independent and allows the compiler to be retargeted.
 Relatively easy to manipulate (optimize).

ABSTRACT SYNTAX TREE:

Is nothing but the condensed form of a parse tree. It is
 Useful for representing language constructs naturally.
 The production S → if B then S1 else S2 may appear as a single node labelled if-then-else
with the subtrees for B, S1 and S2 as its children.

Later we will see how abstract syntax trees can be constructed from syntax directed
definitions. Abstract syntax trees are condensed forms of parse trees. Normally operators
and keywords appear as leaves, but in an abstract syntax tree they are associated with the interior
nodes that would be the parent of those leaves in the parse tree. Chains of single productions may
be collapsed into one node, with the operators moving up to become the node.

What is Three Address Code?


Three-address code is a sequence of statements of the general form: x := y op z
where x, y, and z are names, constants, or compiler-generated temporaries; op stands for
any operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on
Boolean-valued data. Note that no built-up arithmetic expressions are permitted, as there is only
one operator on the right side of a statement. Thus a source language expression like x + y * z
might be translated into the sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names. This unraveling of
complicated arithmetic expressions and of nested flow-of-control statements makes three-address
code desirable for target code generation and optimization. The use of names for the intermediate
values computed by a program allows three-address code to be easily rearranged, unlike postfix
notation. Three-address code is a linearized representation of a syntax tree or a DAG in which
explicit names correspond to the interior nodes of the graph.
Intermediate code for the expression a := b * -c + b * -c is:
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5
The reason for the term "three-address code" is that each statement usually contains three
addresses, two for the operands and one for the result. In the implementations of three-address
code given later in this section, a programmer-defined name is replaced by a pointer to a symbol-
table entry for that name.

Types of Three-Address Statements


Three-address statements are akin to assembly code. Statements can have symbolic labels
and there are statements for flow of control. A symbolic label represents the index of a three-
address statement in the array holding intermediate code. Actual indices can be substituted for
the labels either by making a separate pass, or by using "back patching".
1. Assignment statements of the form x: = y op z, where op is a binary arithmetic or logical
operation.
2. Assignment instructions of the form x:= op y, where op is a unary operation. Essential unary
operations include unary minus, logical negation, shift operators, and conversion operators that,
for example, convert a fixed-point number to a floating-point number.
3. Copy statements of the form x: = y where the value of y is assigned to x.
4. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.
5. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed next, as in
the usual sequence.
6. param x and call p, n for procedure calls, and return y, where y represents an optional returned
value. Their typical use is as the sequence of three-address statements
param x1
param x2
...
param xn
call p, n
generated as part of a call of the procedure p(x1, x2, ..., xn). The integer n indicating the number
of actual parameters in "call p, n" is not redundant because calls can be nested. The
implementation of procedure calls is outlined in Section 8.7.
7. Indexed assignments of the form x: = y[ i ] and x [ i ]: = y. The first of these sets x to the
value in the location i memory units beyond location y. The statement x[i]:=y sets the contents of
the location i units beyond x to the value of y. In both these instructions, x, y, and i refer to data
objects.
8. Address and pointer assignments of the form x := &y, x := *y and *x := y. The first of these
sets the value of x to be the location of y. Presumably y is a name, perhaps a temporary, that
denotes an expression with an l-value such as A[i, j], and x is a pointer name or temporary. That
is, the r-value of x is the l-value (location) of some object. In the statement x := *y, presumably
y is a pointer or a temporary whose r-value is a location; the r-value of x is made equal to the
contents of that location. Finally, *x := y sets the r-value of the object pointed to by x to the r-
value of y.
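As an illustrative sketch only (the notes do not fix a representation), three-address statements
are often stored as quadruples; the C declaration below assumes the operator and field names
used here.

/* One three-address statement stored as a quadruple: result := arg1 op arg2. */
enum tac_op { TAC_ADD, TAC_MUL, TAC_UMINUS, TAC_COPY, TAC_GOTO, TAC_IFLT, TAC_PARAM, TAC_CALL };

struct quad {
    enum tac_op op;        /* the operator                                       */
    int arg1, arg2;        /* operands: indices into the symbol/temporary tables */
    int result;            /* destination, or a target label for jumps           */
};

/* Example: x + y * z becomes two quadruples,
   { TAC_MUL, y, z, t1 } and { TAC_ADD, x, t1, t2 }. */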

What is Polish Notation?


Polish notation is also known as prefix notation. Polish notation helps compilers
evaluate mathematical expressions following the order of operations using operator
precedence notation, which defines the order in which operators should be evaluated,
such as multiplication before addition.
In 1924 Jan Łukasiewicz thought of parenthesis-free notations, and this is where the
Polish Notation was invented.
Suppose you have to add 3 and 6 and multiply the result by 2. Normally we write this as
(3+6)*2, called infix notation because the operators sit between the operands; here the
parentheses are required. In prefix notation the same expression is written * + 3 6 2, and no
parentheses are needed. It is evaluated by first adding 3 and 6 and then multiplying the result
by 2, giving 18.
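Because every operator precedes its operands, a prefix expression can be evaluated with a stack by scanning it from right to left. The following C++ sketch is an illustration; it assumes single-digit operands and only the + and * operators:

#include <cctype>
#include <stack>
#include <string>
#include <iostream>

// Evaluate a prefix expression such as "*+362" (i.e. (3+6)*2) by
// scanning right to left and pushing operand values on a stack.
int evalPrefix(const std::string& expr) {
    std::stack<int> st;
    for (int i = (int)expr.size() - 1; i >= 0; --i) {
        char c = expr[i];
        if (std::isdigit(c)) {
            st.push(c - '0');
        } else {                        // operator: pop two operands
            int a = st.top(); st.pop();
            int b = st.top(); st.pop();
            st.push(c == '+' ? a + b : a * b);
        }
    }
    return st.top();
}

int main() {
    std::cout << evalPrefix("*+362") << '\n';   // prints 18
}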

Advantages
Here are some advantages of polish notation in compiler design:
1. No need for parentheses: In Polish notation there is no need for parentheses when writing
arithmetic expressions, as the operators come before the operands.
2. Efficient evaluation: Evaluating an expression is easier in Polish notation because a stack
can be used for the evaluation.
3. Easy parsing: Parsing is easier than for infix notation.
4. Less scanning: The compiler needs fewer scans since parentheses are not used and the
operators and operands do not have to be scanned differently.
Disadvantages
Here are some disadvantages of polish notation in compiler design:
1. Unfamiliar: Someone who sees Polish notation for the first time will find it hard to see how
the expression should be evaluated.
2. Not commonly used: Polish notation is not commonly used in day-to-day life; it is mostly
used for scientific purposes.
3. Difficult for programmers: Expressions are difficult to read for programmers who are not
familiar with Polish notation.

ATTRIBUTE GRAMMARS: A CFG G = (V, T, P, S) is called an attribute grammar iff
each grammar symbol X ∈ V ∪ T has an associated set of attributes, and each
production p ∈ P is associated with a set of attribute evaluation rules called semantic actions.
In an AG, the values of attributes at a parse tree node are computed by semantic rules.
→ There are two different specifications of AGs used by the semantic analyzer in
evaluating the semantics of the program constructs. They are:
- Syntax directed definitions (SDDs)
o High level specifications
o Hide implementation details
o Explicit order of evaluation is not specified
- Syntax directed translation schemes (SDTs)
 An SDD which also indicates the order in which the semantic rules are to be evaluated
 Allow some implementation details to be shown.
An attribute grammar is the formal expression of the syntax-derived semantic checks associated
with a grammar. It represents the rules of a language not explicitly imparted by the syntax.

There are two ways for writing attributes:


1) Syntax Directed Definition (SDD): a context free grammar in which a set of semantic
actions is embedded (associated) with each production of G. It is a high level specification
in which implementation details are hidden, e.g.,
S.sys = A.sys + B.sys;
/* does not give any implementation details. It just tells us what kind of attribute
equation we will be using; details such as at what point of time it is evaluated and in what
manner are hidden from the programmer. */
E → E1 + T { E.val = E1.val + T.val }
E → T { E.val = T.val }
T → T1 * F { T.val = T1.val * F.val }
T → F { T.val = F.val }
F → (E) { F.val = E.val }
F → id { F.val = id.lexval }
F → num { F.val = num.lexval }
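The val attribute above is synthesized, so it can be computed bottom-up while parsing. The recursive-descent sketch below is an illustration of how the semantic rules could be evaluated; the Parser structure is an assumption, and it handles only single-digit numbers, +, * and parentheses:

#include <iostream>
#include <string>

// Each parse function returns the synthesized attribute 'val' of its
// non-terminal, mirroring the semantic rules of the SDD above.
struct Parser {
    std::string in;
    size_t pos = 0;
    char peek() { return pos < in.size() ? in[pos] : '\0'; }

    int F() {                       // F -> (E) | num
        if (peek() == '(') { ++pos; int v = E(); ++pos; return v; }
        int v = peek() - '0'; ++pos; return v;             // F.val = num.lexval
    }
    int T() {                       // T -> T * F | F
        int v = F();
        while (peek() == '*') { ++pos; v = v * F(); }      // T.val = T1.val * F.val
        return v;
    }
    int E() {                       // E -> E + T | T
        int v = T();
        while (peek() == '+') { ++pos; v = v + T(); }      // E.val = E1.val + T.val
        return v;
    }
};

int main() {
    Parser p{"(3+6)*2"};
    std::cout << p.E() << '\n';     // prints 18
}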
2) Syntax Directed Translation (SDT) scheme: Sometimes we want to control the way the
attributes are evaluated, and the order and place where they are evaluated. This is of a slightly
lower level.
An SDT is an SDD in which semantic actions can be placed at any position in the body of
the production.
For example, the following SDT prints the prefix equivalent of an arithmetic expression
consisting of + and * operators.
L → E n { print(E.val) }
E → { print('+') } E1 + T
E → T
T → { print('*') } T1 * F
T → F
F → (E)
F → { print(id.lexval) } id
F → { print(num.lexval) } num
Each action in an SDT is executed as soon as its node in the parse tree is visited in a preorder
traversal of the tree.
Conceptually both the SDD and SDT schemes will:
 Parse input token stream
 Build parse tree
 Traverse the parse tree to evaluate the semantic rules at the parse tree nodes
Evaluation may:
 Generate code
 Save information in the symbol table
 Issue error messages
 Perform any other activity
symbol     attributes
number     value
sign       negative
list       position, value
bit        position, value
Syntax directed translation
In syntax directed translation, along with the grammar we associate some informal
notations, and these notations are called semantic rules.

So we can say that

1. Grammar + semantic rule = SDT (syntax directed translation)


→ In syntax directed translation, every non-terminal can get one, more than one, or
sometimes zero attributes, depending on the type of the attribute. The values of these
attributes are evaluated by the semantic rules associated with the production rule.

→ In the semantic rules below the attribute is val; an attribute may hold anything, such as a
string, a number, a memory location or a complex record.

→ In syntax directed translation, whenever a construct is encountered in the

programming language, it is translated according to the semantic rules defined for it
in that particular programming language.

Example
Production Semantic Rules
E→E+T E.val := E.val + T.val

E→T E.val := T.val

T→T*F T.val := T.val * F.val

T→F T.val := F.val

F → (E) F.val := E.val

F → num F.val := num.lexval

E.val is one of the attributes of E.

num.lexval is the attribute returned by the lexical analyzer.

SPECIFICATIONS OF A TYPE CHECKER: Consider a language which


consists of a sequence of declarations followed by a single expression
P → D ; E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ^ T
E → literal | num | E mod E | E [ E ] | E ^
→ A type checker is a translation scheme that synthesizes the type of each expression from the
types of its sub-expressions. Consider the above grammar, which generates programs
consisting of a sequence of declarations D followed by a single expression E.
Specifications of a type checker for the language of the above grammar: a program generated
by this grammar is
key : integer;
key mod 1999
Assumptions:
1. The language has three basic types: char , int and type-error
2. For simplicity, all arrays start at 1. For example, the declaration array[256] of char leads to the
type expression array ( 1.. 256, char).
Rules for Symbol Table entry
D → id : T                  addtype(id.entry, T.type)
T → char                    T.type = char
T → integer                 T.type = int
T → ^T1                     T.type = pointer(T1.type)
T → array [ num ] of T1     T.type = array(1..num, T1.type)
TYPE CHECKING OF FUNCTIONS:
Consider the syntax directed definition,
E → E1 ( E2 )    E.type = if E2.type == s and E1.type == s → t
                          then t
                          else type_error
TYPE CHECKING FOR EXPRESSIONS: consider the following SDD for expressions
E → literal       E.type = char
E → num           E.type = integer
E → id            E.type = lookup(id.entry)
E → E1 mod E2     E.type = if E1.type == integer and E2.type == integer then integer else type_error
E → E1 [ E2 ]     E.type = if E2.type == integer and E1.type == array(s,t) then t else type_error
E → E1 ^          E.type = if E1.type == pointer(t) then t else type_error
TYPE CHECKING OF STATEMENTS: Statements typically do not have values. Special
basic type void can be assigned to them. Consider the SDD for the grammar below which
generates Assignment statements conditional, and looping statements.
S →id := E S.Type = if id.type == E.type then void else type_error
S →if E then S1 S.Type = if E.type == boolean then S1.type else type_error
S →while E do S1 S.Type = if E.type == boolean then S1.type else type_error
S→ S1 ; S2 S.Type = if S1.type == void
and S2.type == void then void else type_error
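A type checker following these rules can be sketched as functions over a type representation. The Type/Kind definitions below are assumptions made for illustration; only the rules for mod, indexing and assignment come from the SDDs above:

#include <iostream>
#include <memory>

// Illustrative type representation: basic types plus array(t) and pointer(t).
enum class Kind { Char, Int, Bool, Void, Array, Pointer, Error };

struct Type {
    Kind kind;
    std::shared_ptr<Type> elem;   // element/target type for Array/Pointer
};

bool same(const Type& a, const Type& b) {
    if (a.kind != b.kind) return false;
    return !a.elem || same(*a.elem, *b.elem);
}

// E -> E1 mod E2 : integer if both operands are integer, else type_error
Type checkMod(const Type& e1, const Type& e2) {
    if (e1.kind == Kind::Int && e2.kind == Kind::Int) return {Kind::Int, nullptr};
    return {Kind::Error, nullptr};
}

// E -> E1[E2] : element type t if E1 is array(t) and E2 is integer
Type checkIndex(const Type& e1, const Type& e2) {
    if (e1.kind == Kind::Array && e2.kind == Kind::Int) return *e1.elem;
    return {Kind::Error, nullptr};
}

// S -> id := E : void if id and E have the same type
Type checkAssign(const Type& id, const Type& e) {
    return same(id, e) ? Type{Kind::Void, nullptr} : Type{Kind::Error, nullptr};
}

int main() {
    Type intT{Kind::Int, nullptr};
    Type arrT{Kind::Array, std::make_shared<Type>(intT)};   // array(1..n, int)
    std::cout << (checkIndex(arrT, intT).kind == Kind::Int) << '\n';    // 1
    std::cout << (checkAssign(intT, intT).kind == Kind::Void) << '\n';  // 1
}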

Symbol Table
→Symbol table is an important data structure used in a compiler.

→Symbol table is used to store information about the occurrence of various entities
such as objects, classes, variable names, interfaces, function names etc. It is used by both the
analysis and synthesis phases.

→The symbol table is used for the following purposes:

 It is used to store the name of all entities in a structured form at one place.

 It is used to verify if a variable has been declared.

 It is used to determine the scope of a name.

 It is used to implement type checking by verifying assignments and expressions in


the source code are semantically correct.

An entry in the symbol table has the format:

1. <symbol name, type, attribute>

For example, suppose the table stores information about the following variable
declaration:

1. static int salary

then it stores an entry in the following format:

<salary, int, static>

The attribute field contains the information related to the name.


Implementation: The symbol table can be implemented as an unordered list if the
compiler handles only a small amount of data.

A symbol table can be implemented in one of the following techniques:

Linear (sorted or unsorted) list

Hash table

Binary search tree


Symbol tables are mostly implemented as hash tables.

Operations
The symbol table provides the following operations:

Insert ()
The insert() operation is used more frequently in the analysis phase, when the tokens
are identified and names are stored in the table.

The insert() operation is used to insert the information in the symbol table like the
unique name occurring in the source code.

1. int x;
Should be processed by the compiler as:

1. insert (x, int)

lookup()
In the symbol table, lookup() operation is used to search a name. It is used to determine:

The existence of symbol in the table.

The declaration of the symbol before it is used.

Check whether the name is used in the scope.

Initialization of the symbol.

Checking whether the name is declared multiple times.


The basic format of lookup() function is as follows:

1. lookup (symbol)
This format varies according to the programming language.
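A hash-table based symbol table with insert() and lookup() might be sketched as follows (the Entry fields and class name are illustrative assumptions):

#include <iostream>
#include <optional>
#include <string>
#include <unordered_map>

// One symbol-table entry: <symbol name, type, attribute>.
struct Entry {
    std::string type;        // e.g. "int"
    std::string attribute;   // e.g. "static"
};

class SymbolTable {
    std::unordered_map<std::string, Entry> table;   // hash table keyed by name
public:
    // insert(x, int): record a newly declared name; false if already declared.
    bool insert(const std::string& name, const Entry& e) {
        return table.emplace(name, e).second;
    }
    // lookup(symbol): find the entry, if the name was declared.
    std::optional<Entry> lookup(const std::string& name) const {
        auto it = table.find(name);
        if (it == table.end()) return std::nullopt;
        return it->second;
    }
};

int main() {
    SymbolTable st;
    st.insert("salary", {"int", "static"});          // static int salary
    if (auto e = st.lookup("salary"))
        std::cout << "salary : " << e->type << ", " << e->attribute << '\n';
}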
ORGANIZATION FOR BLOCK STRUCTURES:
A block is any sequence of operations or instructions that is used to perform a [sub] task. In
any programming language,
 Blocks contain their own local data structures.
 Blocks can be nested, and their starts and ends are marked by a delimiter.
 They ensure that blocks are either independent of one another or nested in one another. That is,
it is not possible for two blocks B1 and B2 to overlap in such a way that first block B1
begins, then B2, but B1 ends before B2 ends.
 This nesting property is called block structure. The scope of a declaration in a block-
structured language is given by the most closely nested rule:
1. The scope of a declaration in a block B includes B.
2. If a name X is not declared in a block B, then an occurrence of X in B is in the scope
of a declaration of X in an enclosing block B ' such that. B ' has a declaration of X, and. B
' is more closely nested around B then any other block with a declaration of X.
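A sketch of nested blocks, consistent with the declaration/scope table and the print output shown below (the block names B0 to B3 are assumed):

#include <stdio.h>

int main() {                 /* B0 */
    int a = 0;
    int b = 0;
    {                        /* B1, nested in B0 */
        int b = 1;
        {                    /* B2, nested in B1 */
            int a = 2;
            printf("%d %d\n", a, b);   /* 2 1 */
        }
        {                    /* B3, nested in B1 */
            int b = 3;
            printf("%d %d\n", a, b);   /* 0 3 */
        }
        printf("%d %d\n", a, b);       /* 0 1 */
    }
    printf("%d %d\n", a, b);           /* 0 0 */
    return 0;
}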



DECLARATION SCOPE
int a=0 B0 not including B2
int b=0 B0 not including B1
int b=1 B1 not including B3
int a =2 B2 only
int b =3 B3 only
The outcome of the print statements will therefore be:
2 1
0 3
0 1
0 0
Blocks:
. Blocks are simpler to handle than procedures
. Blocks can be treated as parameterless procedures
. Use a stack for memory allocation
. Allocate space for the complete procedure body at one time

BLOCK STRUCTURES AND NON BLOCK STRUCTURE STORAGE


ALLOCATION
Storage binding and symbolic registers: Variable names are translated into addresses, and this
process must occur before or during code generation.
- Each variable is assigned an address or addressing method.
- Each variable is assigned an offset with respect to a base, which changes with every
invocation.
- Variables fall into four classes: global, global static, stack, and local (non-stack) static.
There is a base address and every name is given an offset with respect to this base which changes
with every invocation. The variables can be divided into four categories:
a) Global Variables : fixed relocatable address or offset with respect to base as global pointer
b) Global Static Variables : .Global variables, on the other hand, have static duration (hence
also called static variables): they last, and the values stored in them persist, for as long as the
program does. (Of course, the values can in general still be overwritten, so they don't necessarily
persist forever.) Therefore they have fixed relocatable address or offset with respect to base as
global pointer.
c) Stack Variables : stack/global variables may be allocated in registers, but registers are not
indexable, therefore arrays cannot be kept in registers
. Assign symbolic registers to scalar variables
. Symbolic registers are used in graph coloring for global register allocation
d) Stack Static Variables : By default, local variables (stack variables) (those declared within a
function) have automatic duration: they spring into existence when the function is called, and
they (and their values) disappear when the function returns. This is why they are stored in stacks
and have offset from stack/frame pointer.
Register allocation is usually done for global variables. Since registers are not indexable,
therefore, arrays cannot be in registers as they are indexed data structures. Graph coloring is a
simple technique for allocating register and minimizing register spills that works well in practice.
→ The contents of one of the registers must be stored in memory to free it up for immediate
use. We assign symbolic registers to scalar variables which are used in the graph coloring.

Local Variables in Frame


 Assign to consecutive locations; allow enough space for each
 May put word size object in half word boundaries
 Requires two half word loads
 Requires shift, or, and
 Align on double word boundaries
 Wastes space
 And Machine may allow small offsets

STATIC ALLOCATION: In this, A call statement is implemented by a sequence of two


instructions.
 A move instruction saves the return address
 A goto transfers control to the target code.
The instruction sequence is
MOV #here+20, callee.static-area
GOTO callee.code-area
callee.static-area and callee.code-area are constants referring to address of the activation record
and the first address of called procedure respectively.
. #here+20 in the move instruction is the return address; the address of the instruction following
the goto instruction
. A return from procedure callee is implemented by
GOTO *callee.static-area
For the call statement, we need to save the return address somewhere and then jump to
the location of the callee function. and
callee.static-area is a fixed location in memory. 20 is added to #here because the code
corresponding to the call instruction takes 20 bytes (at 4 bytes per word: 4*3 for this
instruction, and 8 for the next). Then we say GOTO callee.code-area to take us to the code of
the callee, as callee.code-area is merely the address where the code of the callee starts. A
return from the callee is then implemented by GOTO *callee.static-area. Note that this works only
because callee.static-area is a constant.
Example:
Assume each action block takes 20 bytes of space, and that the start addresses of the code for
c and p are 100 and 200. The activation records are statically allocated starting at addresses
300 and 364.
100: ACTION-1
120: MOV 140, 364
132: GOTO 200
140: ACTION-2
160: HALT
...
200: ACTION-3
220: GOTO *364
...
300:
304:
...
364:
368:
→This example corresponds to the code shown earlier. Statically we say that the code
for c starts at 100 and that for p starts at 200. At some point, c calls p. Using the strategy
discussed earlier, and assuming that callee.static-area is at memory location 364, we get the
code as given. Here we assume that a call to 'action' corresponds to a single machine instruction
which takes 20 bytes.

RUN TIME STORAGE MANAGEMENT:


To study the run-time storage management system it is sufficient to focus on the statements:
action, call, return and halt, because they by themselves give us sufficient insight into the
behavior shown by functions in calling each other and returning.
And the run-time allocation and de-allocation of activations occur on the call of functions and
when they return.
There are mainly two kinds of run-time allocation systems: Static allocation and Stack
Allocation. While static allocation is used by the FORTRAN class of languages, stack allocation
is used by the Ada class of languages.

Storage Allocation
The different ways to allocate memory are:

1.Static storage allocation

2.Stack storage allocation

3.Heap storage allocation

Static storage allocation


In static allocation, names are bound to storage locations.

If memory is created at compile time then the memory will be created in static
area and only once.

Static allocation does not support dynamic data structures: memory is created only at
compile time and deallocated only after program completion.

The drawback with static storage allocation is that the size and position of data
objects must be known at compile time.

Another drawback is restriction of the recursion procedure.

Stack Storage Allocation


In stack storage allocation, storage is organized as a stack.

An activation record is pushed onto the stack when an activation begins and it is
popped when the activation ends.

Activation record contains the locals so that they are bound to fresh storage in
each activation record. The value of locals is deleted when the activation ends.

It works on the basis of last-in-first-out (LIFO) and this allocation supports the
recursion process.

Heap Storage Allocation


Heap allocation is the most flexible allocation scheme.

Allocation and deallocation of memory can be done at any time and at any place
depending upon the user's requirement.

Heap allocation is used to allocate memory to the variables dynamically and when
the variables are no more used then claim it back.

Heap storage allocation supports the recursion process.

Example:
1. int fact (int n)
2. {
3. if (n<=1)
4. return 1;
5. else
6. return (n * fact(n-1));
7. }
8. fact (6)
The dynamic allocation is as follows:
UNIT-IV
Considerations for optimization:
→ The code produced by straightforward compiling
algorithms can often be made to run faster or take less space, or both. This
improvement is achieved by program transformations that are traditionally
called optimizations. Machine independent optimizations are program
transformations that improve the target code without taking into consideration
any properties of the target machine.
→Machine dependant optimizations are based on register allocation and
utilization of special machine-instruction sequences.
Criteria for code improvement transformations
- Simply stated, the best program transformations are those that yield the most
benefit for the least effort.
- First, the transformation must preserve the meaning of programs. That is, the
optimization must not change the output produced by a program for a given
input, or cause an error.
- Second, a transformation must, on the average, speed up programs by a
measurable amount.
- Third, the transformation must be worth the effort.
Some transformations can only be applied after detailed, often time-consuming
analysis of the source program, so there is little point in applying them to
programs that will be run only a few times

OBJECTIVES OF OPTIMIZATION:
The main objectives of the optimization techniques are
as follows
1. Exploit the fast path in case of multiple paths for a given situation.
2. Reduce redundant instructions.
3. Produce minimum code for maximum work.
4. Trade off between the size of the code and the speed with which it gets
executed.
5. Place code and data together whenever it is required to avoid unnecessary
searching of
data/code
During code transformation in the process of optimization, the basic
requirements are as follows:
1. Retain the semantics of the source code.
2. Reduce time and/ or space.
3. Reduce the overhead involved in the optimization process.

Scope of Optimization:
Control-Flow Analysis: Consider all that has happened up to this point in the
compiling process: lexical analysis, syntactic analysis, semantic analysis and
finally intermediate-code generation.
→ The compiler has done an enormous amount of analysis, but it still doesn't
really know how the program does what it does. In control-flow analysis, the
compiler figures out even more information about how the program does its
work, only now it can assume that there are no
syntactic or semantic errors in the code.
→Control-flow analysis begins by constructing a control-flow graph, which is a
graph of the different possible paths program flow could take through a
function.
→To build the graph, we first divide the code into basic blocks. A basic block is
a segment of the code that a program must enter at the beginning and exit only
at the end.
→This means that only the first statement can be reached from outside the
block (there are no branches into the middle of the block) and all statements are
executed consecutively after the first one is (no branches or halts until the exit).
→Thus a basic block has exactly one entry point and one exit point. If a
program executes the first instruction in a basic block, it must execute every
instruction in the block sequentially after it.
→A basic block begins in one of several ways:
• The entry point into the function
• The target of a branch (in our example, any label)
• The instruction immediately following a branch or a return
A basic block ends in any of the following ways:
• A jump statement
• A conditional or unconditional branch
• A return statement
Now we can construct the control-flow graph between the blocks. Each basic
block is a node in the graph, and the possible different routes a program might
take are the connections, i.e.,
if a block ends with a branch, there will be a path leading from that block to the
branch target.
→The blocks that can follow a block are called its successors. There may be
multiple successors or just one. Similarly, a block may have many, one, or no
predecessors.
Connect up the flow graph for the Fibonacci basic blocks given above. What does
an if-then-else look like in a flow graph?

LOCAL OPTIMIZATIONS
Optimizations performed exclusively within a basic block are called "local
optimizations". These are typically the easiest to perform since we do not
consider any control flow information; we just work with the statements within
the block.
Many of the local optimizations we will discuss have corresponding global
optimizations that operate on the same principle, but require additional analysis
to perform. We'll consider some of the more common local optimizations as
examples.
FUNCTION PRESERVING TRANSFORMATIONS
 Common sub expression elimination
 Constant folding
 Variable propagation
 Dead Code Elimination
 Code motion
 Strength Reduction
1. Common Sub Expression Elimination:
Two operations are common if they produce the same result. In such a case, it is
likely more efficient to compute the result once and reference it the second time
rather than re-evaluate it. An expression is alive if the operands used to compute
the expression have not been changed. An expression that is no longer alive is
dead.
Example :
a=b*c;
d=b*c+x-y;
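After eliminating the common sub-expression b*c, the code might look like this (t1 is an assumed compiler temporary):
t1=b*c;
a=t1;
d=t1+x-y;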
2. Variable Propagation:
Consider the following code:
c=a*b;
x=a;
d=x*b+4;
If we replace x by a in the last statement, we can identify a*b and x*b as the same
(common) sub-expression.
→This technique is called variable propagation: the use of one variable is
replaced by another variable if it has been assigned the same value.
Compile-time evaluation:
a= 2*(22.0/7.0)*r;
Here, we can perform the computation 2*(22.0/7.0) at compile time itself.
3. Dead Code Elimination:
→If the value contained in a variable at a point is not used anywhere in the
program subsequently, the variable is said to be dead at that place. If an
assignment is made to a dead variable, then that assignment is a dead
assignment and it can be safely removed from the program.
→Similarly, a piece of code is said to be dead if it computes values that are
never used anywhere
in the program.
c=a*b;
x=a;
d=x*b+4;
Using variable propagation, the code can be written as follows:
c=a*b;
x=a;
d=a*b+4;
Using Common Sub expression elimination, the code can be written as follows:
t1= a*b;
c=t1;
x=a;
d=t1+4;
Here, x=a will considered as dead code. Hence it is eliminated.
t1= a*b;
c=t1;
d=t1+4;
We can evaluate an expression with constants operands at compile time and
replace that expression by a single value. This is called folding. Consider the
following statement:
a= 2*(22.0/7.0)*r;
Here, we can perform the computation 2*(22.0/7.0) at compile time itself.
4. Code Motion:
The motivation for performing code movement in a program is to improve the
execution time of the program by reducing the evaluation frequency of
expressions.
This can be done by moving the evaluation of an expression to other parts of the
program. Consider the code below:
If(a<10)
{
b=x^2-y^2;
}
else
{
b=5;
a=( x^2-y^2)*10;
}
The expression x^2-y^2 appears in both branches of the conditional. So, we can optimize
the code by moving its evaluation outside the conditional as follows:
t= x^2-y^2;
If(a<10)
{
b=t;
}
else
{
b=5;
a=t*10;
}
5. Strength Reduction:
In the frequency reduction transformation we tried to reduce the execution
frequency of expressions by moving code. There is another class of
transformations which perform equivalent actions indicated in the source
program by reducing the strength of operators.
→ By strength reduction, we mean replacing a high strength operator with a
low strength operator without affecting the meaning of the program.
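For example (a sketch), a multiplication by a small constant can be replaced by an addition or a shift:
Before:
x = y * 2;
z = i * 4;
After strength reduction:
x = y + y;      /* multiplication by 2 replaced by an addition */
z = i << 2;     /* multiplication by 4 replaced by a shift */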

Flow Graph
Flow graph is a directed graph. It contains the flow of control information for a set of
basic blocks.

A control flow graph is used to depict that how the program control is being parsed
among the blocks. It is useful in the loop optimization.

Flow graph for the vector dot product is given as follows:

→Block B1 is the initial node. Block B2 immediately follows B1, so there is an edge from B1
to B2.

→The target of the jump from the last statement of B2 is the first statement of B2, so there is
an edge from B2 to itself.

→B2 is a successor of B1 and B1 is the predecessor of B2.


Loop Optimization
Loop optimization is the most valuable machine-independent optimization because a program's
inner loops take the bulk of its running time.

If we decrease the number of instructions in an inner loop then the running time of a
program may be improved even if we increase the amount of code outside that loop.

For loop optimization the following three techniques are important:

1.Code motion

2.Induction-variable elimination

3.Strength reduction

1.Code Motion:
Code motion is used to decrease the amount of code in loop. This transformation takes a
statement or expression which can be moved outside the loop body without affecting the
semantics of the program.

For example
In the while statement below, the expression limit-2 is loop invariant.

1. while (i<=limit-2) /*statement does not change limit*/


2. After code motion the result is as follows:
3. a= limit-2;
4. while(i<=a) /*statement does not change limit or a*/
2.Induction-Variable Elimination
Induction variable elimination is used to eliminate or replace induction variables in an inner loop.

It can reduce the number of additions in a loop. It improves both code space and run time
performance.

In this figure, we can replace the assignment t4:=4*j by t4:=t4-4. The only problem that
arises is that t4 does not have a value when we enter block B2 for the first time. So we
place a relation t4=4*j on entry to the block B2.

3.Reduction in Strength
Strength reduction is used to replace the expensive operation by the cheaper once
on the target machine.

Addition of a constant is cheaper than a multiplication. So we can replace


multiplication with an addition within the loop.

Multiplication is cheaper than exponentiation. So we can replace exponentiation


with multiplication within the loop.

Example:
1. while (i<10)
2. {
3. j= 3 * i+1;
4. a[j]=a[j]-2;
5. i=i+2;
6. }
After strength reduction the code will be:

1. s= 3*i+1;
2. while (i<10)
3. {
4. j=s;
5. a[j]= a[j]-2;
6. i=i+2;
7. s=s+6;
8. }
In the above code, it is cheaper to compute s=s+6 than j=3*i+1.

Frequency Reduction
Frequency reduction is a machine independent loop optimization in which code
inside a loop is optimized to improve the running time of the program. Frequency
reduction is used to decrease the amount of code in a loop. A statement or
expression which can be moved outside the loop body without affecting the
semantics of the program is moved outside the loop. Frequency reduction is
also called code motion.
Objective of Frequency Reduction:
The objective of frequency reduction is:
•To reduce the evaluation frequency of expression.
•To bring loop invariant statements out of the loop.
Below is the example of Frequency Reduction:
Program 1
// This program does not use frequency reduction.
#include <bits/stdc++.h>

using namespace std;

int main()
{
int a = 2, b = 3, c, i = 0;

while (i < 5) {
// c is calculated 5 times
c = pow(a, b) + pow(b, a);

// print the value of c 5 times


cout << c << endl;
i++;
}
return 0;
}

Output:
17
17
17
17
17
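Program 2, referred to in the explanation below, is the version with frequency reduction applied; a possible sketch of it is:
// Program 2: uses frequency reduction (the loop-invariant computation
// of c is moved outside the loop).
#include <bits/stdc++.h>

using namespace std;

int main()
{
    int a = 2, b = 3, c, i = 0;

    // c is calculated only once, before the loop
    c = pow(a, b) + pow(b, a);

    while (i < 5) {
        // print the value of c 5 times
        cout << c << endl;
        i++;
    }
    return 0;
}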
Explanation:
Program 2 is more efficient than Program 1, because in Program 1 the value of c is
calculated each time the while loop executes. In Program 2 the value of c is
calculated outside the loop only once, which reduces the amount of code in the
loop.

DAG representation for basic blocks


A DAG for basic block is a directed acyclic graph with the following labels on nodes:

1. The leaves of the graph are labeled by unique identifiers, which can be
variable names or constants.

2. Interior nodes of the graph are labeled by an operator symbol.

3. Nodes are also given a sequence of identifiers for labels to store the computed
value.

Algorithm for construction of DAG


Input:It contains a basic block

Output: It contains the following information:

Each node contains a label. For leaves, the label is an identifier.

Each node contains a list of attached identifiers to hold the computed values.

1. Case (i) x:= y OP z


2. Case (ii) x:= OP y
3. Case (iii) x:= y
Method:
Step 1:If y operand is undefined then create node(y).

If z operand is undefined then for case(i) create node(z).

Step 2:For case(i), create node(OP) whose right child is node(z) and left child is node(y).

For case(ii), check whether there is node(OP) with one child node(y).

For case(iii), node n will be node(y).

Output:For node(x) delete x from the list of identifiers. Append x to attached identifiers list
for the node n found in step 2. Finally set node(x) to n.
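A minimal node structure for such a DAG might be sketched as follows (the field names are assumptions):

#include <string>
#include <vector>

// One DAG node: interior nodes carry an operator, leaves carry an
// identifier or constant; every node keeps the list of attached
// identifiers that currently hold its value.
struct DagNode {
    std::string op;                      // operator symbol, empty for leaves
    std::string leafLabel;               // identifier/constant for leaves
    DagNode* left = nullptr;             // node(y)
    DagNode* right = nullptr;            // node(z), null for unary/copy cases
    std::vector<std::string> attached;   // identifiers labeling this node
};

int main() {
    // Case (i): x := y + z builds node(+) with children node(y), node(z)
    DagNode y{"", "y"}, z{"", "z"};
    DagNode add{"+", "", &y, &z};
    add.attached.push_back("x");         // x now labels the + node
}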

Example:Consider the following three address statement:


1. S1:= 4 * i
2. S2:= a[S1]
3. S3:= 4 * i
4. S4:= b[S3]
5. S5:= S2 * S4
6. S6:= prod + S5
7. prod:= S6
8. S7:= i+1
9. i := S7
10.if i<= 20 goto (1)

Stages in DAG Construction:


Global optimization
in compiler design refers to the process of optimizing code across an entire
program rather than focusing on individual functions or basic blocks. This
type of optimization takes advantage of the larger context of the program
to make more significant improvements in terms of performance, code size,
or both. Here are some key aspects and techniques involved in global
optimization:
1. Whole Program Analysis: Global optimization requires analyzing the entire program or a
significant portion of it to identify optimization opportunities. This can involve analyzing
control flow, data flow, and dependencies across different parts of the program.
2. Optimization Goals: The primary goals of global optimization include improving runtime
performance (speed), reducing memory usage, minimizing code size, and sometimes
improving code readability or maintainability.
3. Techniques:
• Inlining: This involves replacing function calls with the actual body of the function
to reduce overhead.
• Interprocedural Analysis and Optimization: Analyzing and optimizing across
different functions or procedures.
• Loop Optimization: Optimizing loops across the entire program to reduce overhead
and improve cache locality.
• Data Flow Analysis: Analyzing how data flows through the program to optimize
memory accesses and register allocation.
• Dead Code Elimination: Removing code that does not contribute to the final output
of the program.
• Constant Folding and Propagation: Evaluating constant expressions at compile-
time rather than runtime.
• Instruction Scheduling: Reordering instructions to minimize stalls and improve
pipelining efficiency.
4. Challenges:
• Increased Compilation Time: Global optimization often requires more
computational resources and time compared to local optimizations.
• Complexity: Analyzing and optimizing across the entire program introduces
complexity in terms of analysis algorithms and correctness guarantees.
• Maintainability: Highly optimized code can be harder to read and maintain, so
compilers often provide options to balance between optimization level and code
readability.
5. Compiler Support: Modern compilers employ sophisticated algorithms and heuristics to
perform global optimizations effectively. They often provide optimization flags or options
that allow programmers to control the level of optimization applied
Constant Folding
Constant folding refers to the evaluation at compile-time of expressions whose operands
are known to be constant.
→In its simplest form, it involves determining that all of the
operands in an expression are constant-valued, performing the evaluation
of the expression at compile-time, and then replacing the expression by its
value.
→ If an expression such as 10 + 2 * 3 is encountered, the compiler can
compute the result at compile-time (16) and emit code as if the
input contained the result rather than the original expression.
→ Similarly, constant conditions, such
as a conditional branch if a < b goto L1 else goto L2 where a and b are
constant can be replaced by a Goto L1 or Goto L2 depending on the truth
of the expression evaluated at compile-time.
→The constant expression has to be evaluated at least once, but if the
compiler does it, it means you don't have to do it again as needed during
runtime.
→ It should also respect the expected treatment of any exceptional
conditions (divide by zero, over/underflow). Consider the Decaf
code below, its unoptimized TAC translation, and the result of
constant folding:
a = 10 * 5 + 6 - b;
Unoptimized TAC:
_tmp0 = 10 ;
_tmp1 = 5 ;
_tmp2 = _tmp0 * _tmp1 ;
_tmp3 = 6 ;
_tmp4 = _tmp2 + _tmp3 ;
_tmp5 = _tmp4 - b ;
a = _tmp5 ;
After constant folding:
_tmp0 = 56 ;
_tmp1 = _tmp0 - b ;
a = _tmp1 ;

Live Variable Analysis


A variable is live at a certain point in the code if it holds a value that may
be needed in the future.
→Solve backwards:
Find a use of a variable. The variable is live between statements that have the
found use as the next statement. Proceed recursively until a definition of the
variable is found.
→Use the sets use[B] and def[B]: def[B] is the set of variables assigned
values in B prior to any use of that variable in B; use[B] is the set
of variables whose values may be used in B prior to any definition of the
variable.
→ A variable comes live into a block (is in in[B]) if it is either used before
redefinition or is live coming out of the block and is not redefined in the
block. A variable comes live out of a block (is in out[B]) if and only if it is live
coming into one of its successors:
in[B] = use[B] ∪ (out[B] − def[B])
out[B] = ∪ in[S], for S in succ[B]
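These equations can be solved by a backward iterative algorithm that repeats until no in[B] or out[B] set changes. The C++ sketch below is an illustration (the Block representation is an assumption):

#include <set>
#include <string>
#include <vector>
#include <iostream>

// Iterative live-variable analysis over a CFG of basic blocks.
struct Block {
    std::set<std::string> use, def;   // use[B], def[B]
    std::vector<int> succ;            // successor block indices
    std::set<std::string> in, out;    // in[B], out[B]
};

void liveness(std::vector<Block>& blocks) {
    bool changed = true;
    while (changed) {                 // iterate until a fixed point
        changed = false;
        for (int i = (int)blocks.size() - 1; i >= 0; --i) {
            Block& b = blocks[i];
            std::set<std::string> out;
            for (int s : b.succ)      // out[B] = union of in[S] over successors
                out.insert(blocks[s].in.begin(), blocks[s].in.end());
            std::set<std::string> in = b.use;
            for (const auto& v : out) // in[B] = use[B] U (out[B] - def[B])
                if (!b.def.count(v)) in.insert(v);
            if (in != b.in || out != b.out) { b.in = in; b.out = out; changed = true; }
        }
    }
}

int main() {
    // B0: a = 1        (def a)        successor: B1
    // B1: b = a + 1    (use a, def b) no successor
    std::vector<Block> blocks(2);
    blocks[0].def = {"a"};  blocks[0].succ = {1};
    blocks[1].use = {"a"};  blocks[1].def = {"b"};
    liveness(blocks);
    std::cout << "a live into B1: " << blocks[1].in.count("a") << '\n';   // 1
}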
Copy Propagation
This optimization is similar to constant propagation, but generalized to
non-constant values. If we have an assignment a = b in our instruction
stream, we can replace later occurrences of a with b (assuming there are no
changes to either variable in-between).
→Given the way we generate TAC code, this is a particularly valuable
optimization since it is able to eliminate a large number of instructions that
only serve to copy values from one variable to another.
→The code on the left makes a copy of tmp1 in tmp2 and a copy of tmp3 in
tmp4. In the optimized version on the right, we eliminated those
unnecessary copies and propagated the
original variable into the later uses:
Before:
tmp2 = tmp1 ;
tmp3 = tmp2 * tmp1 ;
tmp4 = tmp3 ;
tmp5 = tmp3 * tmp2 ;
c = tmp5 + tmp4 ;
After:
tmp3 = tmp1 * tmp1 ;
tmp5 = tmp3 * tmp1 ;
c = tmp5 + tmp3 ;
We can also drive this optimization "backwards", where we can recognize
that the original assignment made to a temporary can be eliminated in
favor of direct assignment to the final goal:
Before:
tmp1 = LCall _Binky ;
a = tmp1 ;
tmp2 = LCall _Winky ;
b = tmp2 ;
tmp3 = a * b ;
c = tmp3 ;
After:
a = LCall _Binky ;
b = LCall _Winky ;
c = a * b ;
OBJECT CODE FORMS
Object code is the code generated by a compiler after the syntax analysis, semantic
analysis, and optimization stages. It is a machine-readable representation of the source code,
executable directly by the computer's CPU. There are two primary types of object
code forms in compiler design:
1.Relocatable Object Code: This form of object code contains symbolic addresses,
which are replaced with absolute addresses during the linking phase. Relocatable
object code is generated by most compilers and is suitable for multi-module
programs.
2.Absolute Object Code: This form of object code contains absolute addresses,
which are fixed and do not need to be changed during linking. Absolute object code
is less common and typically used for small, standalone programs.
Additionally, object code can be further categorized based on the target architecture and operating
system:
•Binary Object Code: A binary file format specific to the target architecture and
operating system, such as ELF (Executable and Linkable Format) or PE (Portable
Executable).
•Assembly Code: A human-readable, symbolic representation of machine code,
often used for debugging and optimization purposes.
In compiler design, object code generation involves several phases:
1.Intermediate Representation (IR): The compiler generates an IR, which is a
platform-independent, symbolic representation of the source code.
2.Code Generation: The IR is transformed into machine-specific object code, taking
into account the target architecture and operating system.
3.Optimization: The generated object code may undergo optimization techniques,
such as register allocation, instruction selection, and peephole optimization, to
improve its performance.
Overall, object code forms in compiler design play a crucial role in the compilation process,
enabling the generation of executable code that can be run directly on the target machine.

What Is Machine Dependent Code Optimization?


Machine dependent code optimization is a type of code optimization that focuses
on optimizing code for a specific type of hardware. This type of optimization is
usually done with assembly language, which is a low-level programming
language that is designed to be used with a specific type of hardware. By taking
advantage of the hardware’s specific features, the code can be optimized to run
faster and more efficiently. The advantage of machine dependent code
optimization is that it can provide significant performance gains. This is because
the code is specifically designed to take advantage of the specific features of the
hardware. The downside is that the code can only run on that specific hardware,
meaning that it cannot be transferred to a different type of hardware without
significant reworking.
Advantages of machine-dependent code:
•Improved performance: Machine-dependent code is written to take advantage
of the specific hardware and software environment it will be running in. As a
result, it can be optimized for that environment, leading to improved
performance.
•Greater control: When writing machine-dependent code, you have more control
over how the code will be executed. You can make use of specific hardware
features or take advantage of system-level APIs that may not be available to
more portable code.
Disadvantages of machine-dependent code:
•Reduced portability: One of the main drawbacks of machine-dependent code is
that it is not portable. It can only be run on the specific machine or environment it
was written for, which can be a limitation if you need to run the code on multiple
platforms.
•Higher maintenance costs: Machine-dependent code can be more difficult to
maintain and update, as it may require specific knowledge of the hardware and
software environment it was written for. This can lead to higher maintenance
costs over time.

Register Allocation Algorithms

Register allocation is an important method in the final phase of the compiler. Registers are
faster to access than cache memory, but only a small number of registers is available. Thus it
is necessary to use a minimum number of registers for variable allocation. There are three
popular register allocation algorithms:
1.Naive Register Allocation
2.Linear Scan Algorithm
3.Chaitin’s Algorithm
These are explained below.
1. Naïve Register Allocation :
•Naive (no) register allocation is based on the assumption that variables are
stored in Main Memory .
•We can’t directly perform operations on variables stored in Main Memory .
•Variables are moved to registers which allows various operations to be
carried out using ALU .
•ALU contains a temporary register where variables are moved before
performing arithmetic and logic operations .
•Once operations are complete we need to store the result back to the main
memory in this method .
•Transferring variables to and from main memory reduces the overall
speed of execution.
a = b + c
d = a
c = a + d
Variables stored in Main Memory (offsets from the frame pointer):
a → _2fp    b → _4fp    c → _6fp    d → _8fp

Machine Level Instructions :


LOAD R1, _4fp
LOAD R2, _6fp
ADD R1, R2
STORE R1, _2fp
LOAD R1, _2fp
STORE R1, _8fp
LOAD R1, _2fp
LOAD R2, _8fp
ADD R1, R2
STORE R1, _6fp

Generic Code Generation Algorithm


Assume that for each operator in the statement, there is a corresponding target language
operator. The computed results can be left in registers as long as possible, storing them
only if the register is needed for another computation or just before a procedure call, jump
or labeled statement.
Register and Address Descriptors
These are the primary data structures used by the code generator. They keep track of
what values are in each register, as well as where a given value resides.

→Each register has a register descriptor containing the list of variables currently stored in
this register. At the start of the basic block, all register descriptors are empty. The register
descriptor keeps track of the recent/current variable in each register and is updated whenever
a new register is needed.

Each variable has an address descriptor containing the list of locations where this
variable is currently stored. Possibilities are its memory location and one or more registers.
The memory location might be in the static area, the stack, or presumably the heap. The
register descriptors can be computed from the address descriptors. For each name in the
block, an address descriptor is maintained that keeps track of the location where the current
value of the name is found at run time.
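The two descriptors can be sketched as simple maps from registers to variables and from variables to locations (an illustration; the names are assumptions):

#include <map>
#include <set>
#include <string>

// Register descriptor: for each register, the variables it currently holds.
// Address descriptor: for each variable, the locations (registers and/or
// its memory home) where its current value can be found.
struct Descriptors {
    std::map<std::string, std::set<std::string>> regContents;   // "R1" -> {"a"}
    std::map<std::string, std::set<std::string>> varLocations;  // "a"  -> {"R1", "mem"}

    // After generating LD R, x : R holds x, and x is also available in R.
    void load(const std::string& reg, const std::string& var) {
        regContents[reg] = {var};
        varLocations[var].insert(reg);
    }
    // After generating ST x, R : the memory home of x is up to date again.
    void store(const std::string& var, const std::string& reg) {
        varLocations[var] = {"mem", reg};
    }
};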
There are basically three aspects to be considered in code generation:
Choosing registers
Generating instructions
Managing descriptors
Minimize the number of registers used:
 →When a register holds the values of a program variable and all subsequent
uses of this value are preceded by a redefinition, we could reuse this register.
But to know about all subsequent uses, one may require live/dead-on-exit
knowledge
Assume a, b, c and d are program variables and t, u and v are compiler-generated
temporaries. The code generated for the different TACs is given below:
t = a - b
LD R1, a
LD R2, b
SUB R1, R2
u = a - c
LD R3, a
LD R2, c
SUB R3, R2
v=t+u
ADD R1, R3
a=d
LD R2, d
ST a, R2
d=v+u
ADD R1, R3
ST d, R1
Exit
DAG for Register allocation:
DAG (Directed Acyclic Graphs) are useful data structures for implementing
transformations on basic blocks. A DAG gives a picture of how the value
computed by a statement in a basic block is used in subsequent statements of
the block.
→Constructing a DAG
from three-address statements is a good way of determining common sub-
expressions within a block, determining which names are used inside
the block but evaluated outside the block, and determining which statements of
the block could have their computed value used outside the block.
→A DAG for a basic block is a directed acyclic graph with the following labels
on nodes:
1. Leaves are labeled by unique identifiers, either variable names or constants.
From the operator applied to a name we determine whether the l-value or r-
value of the name is needed; most leaves represent r-values.
2. Interior nodes are labeled by an operator symbol.
3. Nodes are also optionally given a sequence of identifiers for labels. The
intention is that interior nodes represent computed values, and the identifiers
labeling a node are deemed to have that value.
DAG representation Example:

For example, the slide shows a three-address code and the corresponding DAG. We
observe that each node of the DAG represents a formula in terms of the leaves, that is,
the values possessed by variables and constants upon entering the block. For example,
the node labeled t4 represents the formula b[4*i], that is, the value of the word whose
address is 4*i bytes offset from address b, which is the intended value of t4.
