Semester VI - Compiler Design (CS8602)
RAMAKRISHNAN COLLEGE OF
ENGINEERING, TRICHY
INSTITUTE VISION
To achieve a prominent position among the top technical
institutions.
INSTITUTE MISSION
M1: To bestow standard technical education par excellence through
state of the art infrastructure, competent faculty and high ethical
standards.
M2: To nurture research and entrepreneurial skills among students
in cutting edge technologies.
M3: To provide education for developing high-quality professionals
to transform the society.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT VISION
To create eminent professionals of Computer Science and
Engineering by imparting quality education.
DEPARTMENT MISSION
M1: To provide technical exposure in the field of Computer
Science and Engineering through state of the art
infrastructure and ethical standards.
M2: To engage the students in research and development
activities in the field of Computer Science and Engineering.
M3: To empower the learners to involve in industrial and
multi-disciplinary projects for addressing the societal needs.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PROGRAM EDUCATIONAL OBJECTIVES
Our graduates shall
PEO1: Analyse, design and create innovative products for
addressing social needs.
PEO2: Equip themselves for employability, higher studies and
research.
PEO3: Nurture the leadership qualities and entrepreneurial skills
for their successful career
PROGRAM SPECIFIC OUTCOMES
Students will be able to
PSO1: Apply the basic and advanced knowledge in developing
software, hardware and firmware solutions addressing real life
problems.
PSO2: Design, develop, test and implement product-based solutions
for their career enhancement.
PROGRAM OUTCOMES
PO1 - Engineering knowledge:
Apply the knowledge of mathematics, science, engineering fundamentals,
and an engineering specialization to the solution of complex engineering problems.
PO2- Problem analysis:
UNIT I
INTRODUCTION TO
COMPILERS
COURSE OUTCOME
S.NO | KNOWLEDGE LEVEL | DESCRIPTION
Input Buffering
Specification of Tokens
Recognition of Tokens
LEX
Finite Automata
Minimizing DFA.
A compiler is a program that reads a program written in
one language (the source language) and translates it into
an equivalent program in another language (the target
language).
To convert human readable source code into machine
executable code.
Types of Translator:
1.Interpreter
2.Compiler
3.Assembler.
INTERPRETER COMPILER
ASSEMBLER
MACHINE LANGUAGE
(SECOND GENERATION)
Portability (machine-independent)
•Assembler
Interpreters
Input: Sequence of characters
Note: the blank separating the characters of these tokens would normally be eliminated
during lexical analysis
Syntax Analysis (Hierarchical Analysis or Parsing)
position +
*
initial
rate 60
Input: Parse tree + symbol table
1. a=b-c*20
2. A=B+C*50
3. ans:= a+ - b*5.3^2^(1-f)
4. X=Y=Z
ERRORS ENCOUNTERED IN DIFFERENT
PHASES
Lexical analysis:
Faulty sequence of characters which does not form a token, e.g. 5EL, %K,
an unterminated string literal
Syntax analysis:
Semantic analysis:
Code optimization:
Code generation:
Table management:
Usually, some errors are left to run time: array index out of bounds
Pass: one complete reading of the source program.
A collection of phases is done only once (single
pass) or multiple times (multi pass).
SINGLE PASS: usually requires everything to be
defined before being used in source program.
MULTI PASS: compiler may have to keep entire
program representation in memory.
Several phases can be grouped into one single pass
and the activities of these phases are interleaved
during the pass. For example, lexical analysis,
syntax analysis, semantic analysis and
intermediate code generation might be grouped
into one pass.
Manual approach – by hand
◦ To identify the occurrence of each lexeme
◦ To return the information about the identified token
Software development tools are available to
implement one or more compiler phases
Scanner generators
Parser generators
Syntax-directed translation engines
Automatic code generators
Data-flow engines
LEXICAL ANALYSIS
THE ROLE OF THE LEXICAL ANALYZER
Read input characters
To group them into lexemes
sentinel
◦ added at each buffer end
◦ can not be part of the source program
◦ character eof is a natural choice
◦ retains its role as marker for the end of the entire input
◦ when it appears other than at the end of a buffer, it means that the
input is at an end
LOOKAHEAD CODE WITH SENTINELS
switch(*forward++)
{
case eof:
if(forward is at the end of the first buffer)
{
reload second buffer;
forward = beginning of the second buffer;
}
else if(forward is at the end of the second buffer)
{
reload first buffer;
forward = beginning of the first buffer;
}
else
/* eof within a buffer marks the end of input */
terminate lexical analysis;
break;
}
ERROR RECOVERY IN LEXICAL ANALYZER
Simplest error recovery strategy is panic mode error
recovery.
Panic mode:
Operators used
+ (union)
* (closure)
. (concatenation)
TOKENS, PATTERNS, LEXEMES
Token - pair of:
<token type, [optional value]>
keyword, identifier, …
Pattern
◦ description of the form that the lexeme of a token may take
◦ e.g.
Identifiers : Letter followed by a sequence of letters
and digits
Number : Combination of one or more digits
String literal : Anything enclosed in “ and ”
Lexeme
◦ a sequence of characters in the source program matching a
pattern for a token
EXAMPLES OF TOKENS
Token       Informal Description                    Sample Lexemes
if          characters i, f                         if
else        characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          letter followed by letters and digits   pi, score, D2
number      any numeric constant                    3.14159, 0, 02e23
literal     anything but ", surrounded by "         "core dumped"
EXAMPLES OF TOKENS
One token for each keyword
◦ Keyword pattern = keyword itself
Tokens for operators
◦ Individually or in classes
One token for all identifiers
One or more tokens for constants
◦ Numbers, literal strings
Tokens for each punctuation symbol
◦ (),;
REGULAR EXPRESSIONS FOR:
Identifiers
Keywords
Numbers/digits
An identifier is defined as a letter followed by zero or
more letters or digits.
The regular expression for an identifier is given as
letter (letter | digit)*
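As a sketch, the pattern letter (letter | digit)* can be checked directly in C; `is_identifier` is a hypothetical helper name, and note that, unlike C's own identifier rules, this pattern does not allow underscores:

```c
#include <ctype.h>

/* Sketch: recognize an identifier matching letter (letter | digit)*.
   Returns 1 if the whole string s is a valid identifier, else 0. */
int is_identifier(const char *s) {
    if (!isalpha((unsigned char)*s))      /* must start with a letter */
        return 0;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s))  /* then only letters or digits */
            return 0;
    return 1;
}
```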
RE to NFA to DFA
(Modified subset construction method)
RE to DFA (Direct method)
METHOD 1
RE to NFA to DFA
(Modified subset construction)
FINITE AUTOMATA
Finite Automata are used as a model for:
TRANSITION GRAPH
An NFA can be diagrammatically represented by
a labeled directed graph called a transition graph
An NFA for (a|b)*abb:
S = {0,1,2,3}, Σ = {a,b}, s0 = 0, F = {3}
start → 0 --a--> 1 --b--> 2 --b--> 3, with self-loops 0 --a--> 0 and 0 --b--> 0
TRANSITION TABLE
The mapping of an NFA can be represented in a
transition table
δ(0,a) = {0,1}
δ(0,b) = {0}
δ(1,b) = {2}
δ(2,b) = {3}

State |   a   |  b
  0   | {0,1} | {0}
  1   |   –   | {2}
  2   |   –   | {3}
  3   |   –   |  –
THE LANGUAGE DEFINED BY AN NFA
An NFA accepts an input string x if and
only if there is some path with edges
labeled with symbols from x in sequence
from the start state to some accepting
state in the transition graph
A state transition from one state to
another on the path is called a move
The language defined by an NFA is the set
of input strings it accepts, such as
(a|b)*abb for the example NFA
FROM REGULAR EXPRESSION TO ε-NFA (THOMPSON'S CONSTRUCTION)
For ε:        start → i --ε--> f
For a:        start → i --a--> f
For r1 | r2:  start → i --ε--> N(r1) --ε--> f  and  i --ε--> N(r2) --ε--> f
For r1 r2:    start → i → N(r1) → N(r2) → f  (N(r1)'s final state merged with N(r2)'s start state)
For r*:       start → i --ε--> N(r) --ε--> f, plus ε-edges i --ε--> f and from N(r)'s exit back to its entry
COMBINING THE NFAS OF A SET OF
REGULAR EXPRESSIONS
a    { action1 }
abb  { action2 }
a*b+ { action3 }
Each pattern gets its own NFA:
1 --a--> 2                          (for a)
3 --a--> 4 --b--> 5 --b--> 6        (for abb)
7 --a--> 7, 7 --b--> 8, 8 --b--> 8  (for a*b+)
A new start state 0 is added with ε-transitions to 1, 3 and 7.
DETERMINISTIC FINITE AUTOMATA
A deterministic finite automaton (DFA) is a special
case of an NFA in which:
No state has an ε-transition
For each state s and input symbol a there is
at most one edge labeled a leaving s
Each entry in the transition table is a single
state
At most one path exists to accept a string
Simulation algorithm is simple
EXAMPLE DFA
A DFA that accepts (a|b)*abb:
start → 0 --a--> 1 --b--> 2 --b--> 3 (accepting)
On a, states 1, 2 and 3 go to state 1; on b, state 0 loops and state 3 returns to state 0.
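A minimal sketch of simulating this DFA in C, with the transition table taken from the diagram above (`dfa_accepts` is a hypothetical helper name):

```c
/* Transition table of the DFA for (a|b)*abb:
   row = state, column 0 = input 'a', column 1 = input 'b'. State 3 accepts. */
static const int dtran[4][2] = {
    /* state 0 */ {1, 0},
    /* state 1 */ {1, 2},
    /* state 2 */ {1, 3},
    /* state 3 */ {1, 0},
};

/* Returns 1 if the input string (over {a,b}) is accepted, else 0. */
int dfa_accepts(const char *input) {
    int state = 0;                           /* start state */
    for (; *input; input++) {
        if (*input != 'a' && *input != 'b')
            return 0;                        /* symbol outside the alphabet */
        state = dtran[state][*input == 'b'];
    }
    return state == 3;                       /* accepting state */
}
```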
CONVERSION OF AN NFA INTO A DFA
The subset construction algorithm
converts an NFA into a DFA using:
ε-closure(s) = {s} ∪ {t | s can reach t using ε-transitions alone}
ε-closure(T) = ∪ s∈T ε-closure(s)
move(T,a) = {t | s --a--> t for some s ∈ T}
The algorithm produces:
Dstates is the set of states of the new DFA
consisting of sets of states of the NFA
Dtran is the transition table of the new
DFA
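A sketch of ε-closure and move in C, assuming the combined NFA for a, abb and a*b+ from the earlier slide (states 0–8, ε-edges from state 0 to 1, 3 and 7); state sets are represented as bitmasks, and the names are illustrative:

```c
/* State sets as bitmasks: bit i set  <=>  state i is in the set. */
typedef unsigned Set;

/* Edges of the combined NFA for a, abb and a*b+ (states 0..8). */
static const Set eps[9]  = { 1u<<1 | 1u<<3 | 1u<<7, 0,0,0,0,0,0,0,0 };
static const Set on_a[9] = { 0, 1u<<2, 0, 1u<<4, 0, 0, 0, 1u<<7, 0 };
static const Set on_b[9] = { 0, 0, 0, 0, 1u<<5, 1u<<6, 0, 1u<<8, 1u<<8 };

Set eps_closure(Set t) {
    Set result = t, frontier = t;
    while (frontier) {                 /* keep following epsilon edges */
        Set next = 0;
        for (int s = 0; s < 9; s++)
            if (frontier & (1u << s))
                next |= eps[s];
        frontier = next & ~result;     /* states not seen yet */
        result |= next;
    }
    return result;
}

Set move(Set t, char a) {
    const Set *delta = (a == 'a') ? on_a : on_b;
    Set result = 0;
    for (int s = 0; s < 9; s++)
        if (t & (1u << s))
            result |= delta[s];
    return result;
}
```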
ε-CLOSURE AND MOVE EXAMPLES
For the combined NFA of a, abb and a*b+ (states 0–8, ε-edges from 0 to 1, 3 and 7):
ε-closure({0}) = {0,1,3,7}
move({0,1,3,7},a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7},a) = {7}
ε-closure({7}) = {7}
move({7},b) = {8}
ε-closure({8}) = {8}
move({8},a) = ∅
The same two operations are also used to simulate NFAs directly.
SUBSET CONSTRUCTION
For the ε-NFA of (a|b)*abb (states 0–10):
Dstates
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}
Resulting DFA: start A; A --a--> B, A --b--> C; B --a--> B, B --b--> D;
C --a--> B, C --b--> C; D --a--> B, D --b--> E; E --a--> B, E --b--> C.
E is the accepting state.
SUBSET CONSTRUCTION EXAMPLE 2
For the combined NFA of a {action1}, abb {action2} and a*b+ {action3}:
Dstates
A = {0,1,3,7}
B = {2,4,7}
C = {8}
D = {7}
E = {5,8}
F = {6,8}
Transitions: A --a--> B, A --b--> C; B --a--> D, B --b--> E;
C --b--> C; D --a--> D, D --b--> C; E --b--> F; F --b--> C.
Accepting states: B (action1), C and E (action3), F (action2, which wins over action3 because it is listed first).
METHOD 2
RE to DFA (Direct method)
DIRECT METHOD
STEPS
1. Syntax tree construction
2. Compute NULLABLE, FIRSTPOS,
LASTPOS & FOLLOWPOS
3. Construct DFA from FOLLOWPOS
ANNOTATING THE TREE
nullable(n): the subtree at node n generates a language including the empty string
Leaf ε:      nullable = true;  firstpos = lastpos = ∅
Leaf pos i:  nullable = false; firstpos = lastpos = {i}
c1 | c2:     nullable = nullable(c1) or nullable(c2);
             firstpos = firstpos(c1) ∪ firstpos(c2); lastpos = lastpos(c1) ∪ lastpos(c2)
c1 c2:       nullable = nullable(c1) and nullable(c2);
             firstpos = firstpos(c1) ∪ firstpos(c2) if nullable(c1), else firstpos(c1);
             lastpos = lastpos(c1) ∪ lastpos(c2) if nullable(c2), else lastpos(c2)
c1*:         nullable = true; firstpos = firstpos(c1); lastpos = lastpos(c1)
FROM REGULAR EXPRESSION TO DFA
DIRECTLY: SYNTAX TREE OF (a|b)*abb#
Positions: a=1, b=2 (under the star), a=3, b=4, b=5, #=6.
Node                  firstpos   lastpos
a (1)                 {1}        {1}
b (2)                 {2}        {2}
a|b                   {1,2}      {1,2}
(a|b)*  (nullable)    {1,2}      {1,2}
(a|b)*a               {1,2,3}    {3}
(a|b)*ab              {1,2,3}    {4}
(a|b)*abb             {1,2,3}    {5}
(a|b)*abb#  (root)    {1,2,3}    {6}
FROM REGULAR EXPRESSION TO DFA
DIRECTLY: FOLLOWPOS
for each node n in the tree do
  if n is a cat-node with left child c1 and right child c2 then
    for each i in lastpos(c1) do
      followpos(i) := followpos(i) ∪ firstpos(c2)
    end do
  else if n is a star-node then
    for each i in lastpos(n) do
      followpos(i) := followpos(i) ∪ firstpos(n)
    end do
  end if
end do
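The two followpos rules above can be sketched in C for the running example (a|b)*abb# with positions 1–6; the node layout and helper names are illustrative, not from the slides:

```c
#include <string.h>

/* nullable/firstpos/lastpos/followpos for (a|b)*abb#, positions 1..6;
   position sets are bitmasks (bit i <=> position i). */
enum Kind { LEAF, CAT, STAR, OR };
typedef unsigned Set;

typedef struct Node {
    enum Kind kind;
    int pos;                        /* leaves only: position 1..6 */
    struct Node *l, *r;
    int nullable;
    Set first, last;
} Node;

static Set followpos[7];            /* followpos[1..6] */

static void annotate(Node *n) {
    if (n->kind == LEAF) {
        n->nullable = 0;
        n->first = n->last = 1u << n->pos;
        return;
    }
    annotate(n->l);
    if (n->r) annotate(n->r);
    switch (n->kind) {
    case OR:
        n->nullable = n->l->nullable || n->r->nullable;
        n->first = n->l->first | n->r->first;
        n->last  = n->l->last  | n->r->last;
        break;
    case CAT:
        n->nullable = n->l->nullable && n->r->nullable;
        n->first = n->l->nullable ? (n->l->first | n->r->first) : n->l->first;
        n->last  = n->r->nullable ? (n->l->last  | n->r->last)  : n->r->last;
        for (int i = 1; i <= 6; i++)      /* followpos rule for cat-nodes */
            if (n->l->last & (1u << i)) followpos[i] |= n->r->first;
        break;
    case STAR:
        n->nullable = 1;
        n->first = n->l->first;
        n->last  = n->l->last;
        for (int i = 1; i <= 6; i++)      /* followpos rule for star-nodes */
            if (n->last & (1u << i)) followpos[i] |= n->first;
        break;
    default: break;
    }
}

/* Build the syntax tree of (a|b)*abb# and return the followpos table. */
Set *compute_followpos(void) {
    static Node a1={LEAF,1}, b2={LEAF,2}, a3={LEAF,3},
                b4={LEAF,4}, b5={LEAF,5}, h6={LEAF,6};
    static Node or12={OR}, star={STAR}, c1={CAT}, c2={CAT}, root={CAT}, c3={CAT};
    or12.l=&a1; or12.r=&b2;
    star.l=&or12;
    c1.l=&star; c1.r=&a3;
    c2.l=&c1;   c2.r=&b4;
    c3.l=&c2;   c3.r=&b5;
    root.l=&c3; root.r=&h6;
    memset(followpos, 0, sizeof followpos);
    annotate(&root);
    return followpos;
}
```

Running it reproduces the followpos table of the example: positions 1 and 2 get {1,2,3}, 3 gets {4}, 4 gets {5}, 5 gets {6}, 6 gets nothing.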
FROM REGULAR EXPRESSION TO DFA
DIRECTLY: ALGORITHM
s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0} and is unmarked
while there is an unmarked state T in Dstates do
mark T
for each input symbol a do
let U be the set of positions that are in followpos(p)
for some position p in T,
such that the symbol at position p is a
if U is not empty and not in Dstates then
add U as an unmarked state to Dstates
end if
Dtran[T,a] := U
end do
end do
FROM REGULAR EXPRESSION TO DFA
DIRECTLY: EXAMPLE
Node | followpos
 1   | {1, 2, 3}
 2   | {1, 2, 3}
 3   | {4}
 4   | {5}
 5   | {6}
 6   | –
Resulting DFA (start state {1,2,3}; {1,2,3,6} accepting because it contains position 6):
{1,2,3}   --a--> {1,2,3,4},  --b--> {1,2,3}
{1,2,3,4} --a--> {1,2,3,4},  --b--> {1,2,3,5}
{1,2,3,5} --a--> {1,2,3,4},  --b--> {1,2,3,6}
{1,2,3,6} --a--> {1,2,3,4},  --b--> {1,2,3}
MINIMIZATION OF DFA
A Lex source program has three sections:
{definitions}
%%   (required)
{transition rules}
%%   (optional)
{user subroutines}
The absolute minimum Lex program is thus a single line containing only
%%
LEX SOURCE PROGRAM
digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main()
{
yylex();
}
LEX SOURCE TO C PROGRAM
The table is translated to a C program (lex.yy.c)
which
reads an input stream
partitioning the input into strings which match the
given expressions and
copying it to an output stream if necessary
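For the identifier/newline specification shown above, the generated scanner behaves roughly like this hand-written sketch (`scan` is a hypothetical stand-in for yylex; the real lex.yy.c is table-driven):

```c
#include <ctype.h>
#include <stdio.h>

/* Sketch of what the generated scanner does for the earlier spec:
   longest-match partitioning of the input into identifiers, newlines,
   and copied-through characters. Returns the number of identifiers seen. */
int scan(const char *input) {
    int ids = 0;
    const char *p = input;
    while (*p) {
        if (isalpha((unsigned char)*p)) {          /* {letter}({letter}|{digit})* */
            const char *start = p;
            while (isalnum((unsigned char)*p)) p++;
            printf("id: %.*s\n", (int)(p - start), start);
            ids++;
        } else if (*p == '\n') {                   /* the \n rule */
            printf("new line\n");
            p++;
        } else {
            putchar(*p);                           /* default: copy to output */
            p++;
        }
    }
    return ids;
}
```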
AN OVERVIEW OF LEX
Yacc
Yacc generates C code for syntax analyzer, or
parser.
Yacc uses grammar rules that allow it to
analyze tokens from Lex and create a syntax
tree.
LEX WITH YACC
Lex produces lex.yy.c; Yacc produces y.tab.c.
Input → yylex() --returns token--> yyparse() → parsed input
(yyparse() calls yylex() to get each next token)
LEX REGULAR EXPRESSIONS (EXTENDED
REGULAR EXPRESSIONS)
A regular expression matches a set of
strings
Regular expression
Operators
Character classes
Arbitrary character
Optional expressions
Alternation and grouping
Context sensitivity
Repetitions and definitions
OPERATORS
“ \ [ ] ^ - ? . * + | ( ) $ / { } % < >
E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case
letters
[a-zA-Z][a-zA-Z0-9]* => all
alphanumeric strings with a leading
alphabetic character
PRECEDENCE OF OPERATORS
Level of precedence
Kleene closure (*), ?, +
concatenation
alternation (|)
A rule section has the form:
%%
<regexp> <action>
<regexp> <action>
…
%%
For example, with the rule
"=" printf("operator: ASSIGNMENT");
the input a = b + c; produces the output a operator: ASSIGNMENT b + c;
TRANSITION RULES
regexp <one or more blanks> action (C code);
regexp <one or more blanks> { actions (C code) }
Example input: a=b+c; d=b*c;
TRANSITION RULES (CONT’D)
…
%%
pink {npink++; REJECT;}
ink {nink++; REJECT;}
pin {npin++; REJECT;}
.|
\n ;
%%
…
LEX PREDEFINED VARIABLES
yytext -- a string containing the lexeme
yyleng -- the length of the lexeme
yyin -- the input stream pointer
the default input of default main() is stdin
yyout -- the output stream pointer
the default output of default main() is stdout.
cs20: %./a.out < inputfile > outfile
E.g.
[a-z]+ printf("%s", yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
LEX LIBRARY ROUTINES
yylex()
The default main() contains a call of yylex()
yymore()
append the next matched string to the current yytext
yyless(n)
retain the first n characters in yytext
yywrap()
is called whenever Lex reaches an end-of-file
The default yywrap() always returns 1
REVIEW OF LEX PREDEFINED
VARIABLES
Name Function
char *yytext pointer to matched string
int yyleng length of matched string
FILE *yyin input stream pointer
FILE *yyout output stream pointer
int yylex(void) call to invoke lexer, returns token
char* yymore(void) append the next match to yytext
int yyless(int n) retain the first n characters in yytext
int yywrap(void) wrapup, return 1 if done, 0 if not done
ECHO write matched string
REJECT go to the next alternative rule
INITIAL initial start condition
BEGIN condition switch start condition
USER SUBROUTINES SECTION
%{
int counter = 0;
%}
letter [a-zA-Z]
%%
{letter}+ {printf("a word\n"); counter++;}
%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
USAGE
To run Lex on a source file, type
lex scanner.l
It produces a file named lex.yy.c which is a C program
for the lexical analyzer.
To compile lex.yy.c, type
cc lex.yy.c –ll
To run the lexical analyzer program, type
./a.out < inputfile
VERSIONS OF LEX
AT&T -- lex
http://www.combo.org/lex_yacc_page/lex.html
GNU -- flex
http://www.gnu.org/manual/flex-2.5.4/flex.html
a Win32 version of flex :
http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
or Cygwin :
http://sources.redhat.com/cygwin/
digit [0-9]
id {letter}({letter}|{digit})*
number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
ACTION - EVENT
%%
{ws} {/*noaction */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval=install_id(); return(ID);}
{number} {yylval=install_num(); return(NUMBER);}
Minimization of DFA
LEX
Lexical Error
Example: SELECTE instead of the SQL keyword SELECT.
•The Google Play store currently houses over one
million apps
SYNTAX ANALYZER
Syntax Analyzer creates the syntactic structure of the given
source program.
This syntactic structure is mostly a parse tree.
Syntax Analyzer is also known as parser.
The syntax of a programming language is described by a context-free
grammar (CFG). We will use BNF (Backus-Naur Form)
notation in the description of CFGs.
The syntax analyzer (parser) checks whether a given source
program satisfies the rules implied by a context-free grammar
or not.
If it satisfies, the parser creates the parse tree of that program.
Otherwise the parser gives the error messages.
A context-free grammar
gives a precise syntactic specification of a programming language.
the design of the grammar is an initial phase of the design of a compiler.
a grammar can be directly converted into a parser by some tools.
PARSER
• Parser works on a stream of tokens.
source program → Lexical Analyzer --token--> Parser → parse tree
(the parser requests the next token from the lexical analyzer)
PARSERS (CONT.)
We categorize the parsers into two groups:
1. Top-Down Parser
the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
the parse is created bottom to top; starting from the leaves
CONTEXT-FREE GRAMMARS
Inherently recursive structures of a programming
language are defined by a context-free grammar.
In a context-free grammar, we have:
A finite set of terminals (in our case, this will be the set of
tokens)
A finite set of non-terminals (syntactic-variables)
A finite set of productions rules in the following form
A → α, where A is a non-terminal and
α is a string of terminals and non-terminals
(possibly the empty string)
A start symbol (one of the non-terminal symbol)
Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id
DERIVATIONS
E ⇒ E+E
E+E derives from E:
we can replace E by E+E.
To be able to do this, we have to have a production rule E → E+E in our
grammar.
CFG - TERMINOLOGY
L(G) is the language of G (the language generated by G)
which is a set of sentences.
A sentence of L(G) is a string of terminal symbols of G.
If S is the start symbol of G then
ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G.
If S ⇒* α and α contains non-terminals, α is called a sentential form of G.
If α does not contain non-terminals, it is called a sentence of G.
DERIVATION EXAMPLE
Left-Most Derivation:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
Right-Most Derivation:
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)
The parse tree grows with each derivation step:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
The final parse tree has root E with children - and ( E ); the inner E
derives E + E, whose two E children each derive id.
AMBIGUITY
• A grammar that produces more than one parse tree for some sentence is
called an ambiguous grammar.
E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
These two derivations give two different parse trees for id+id*id:
one grouping it as id+(id*id), the other as (id+id)*id.
AMBIGUITY (CONT.)
For most parsers, the grammar must be unambiguous.
unambiguous grammar
⇒ unique selection of the parse tree for a sentence
if E1 then if E2 then S1 else S2
Two parse trees are possible: in Tree 1 the else belongs to the outer if;
in Tree 2 it belongs to the inner (closest) if.
AMBIGUITY (CONT.)
• We prefer the second parse tree (else matches with closest if).
• So, we have to disambiguate our grammar to reflect this choice.
stmt → matchedstmt | unmatchedstmt
LEFT RECURSION
A grammar is left recursive if it has a non-terminal
A such that there is a derivation
A ⇒+ Aα for some string α
Immediate left recursion A → Aα | β can be rewritten as:
A → βA’
A’ → αA’ | ε        (an equivalent grammar)
In general,
eliminate immediate left recursion:
E → T E’
E’ → +T E’ | ε
T → F T’
T’ → *F T’ | ε
F → id | (E)
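The left-recursion-free grammar above is exactly what a recursive-descent parser needs; a minimal C sketch follows, with one function per non-terminal and 'i' standing for the token id (the function names mirror the non-terminals, and `parse` is a hypothetical driver):

```c
/* Recursive-descent parser for:
     E  -> T E'      E' -> + T E' | eps
     T  -> F T'      T' -> * F T' | eps
     F  -> id | ( E )
   'i' stands for the token id. Each function returns 1 on success. */
static const char *tok;              /* next input token */

static int E(void);

static int F(void) {
    if (*tok == 'i') { tok++; return 1; }
    if (*tok == '(') {
        tok++;
        if (!E()) return 0;
        if (*tok != ')') return 0;
        tok++;
        return 1;
    }
    return 0;
}

static int Tprime(void) {
    if (*tok == '*') { tok++; return F() && Tprime(); }
    return 1;                        /* T' -> eps */
}

static int T(void) { return F() && Tprime(); }

static int Eprime(void) {
    if (*tok == '+') { tok++; return T() && Eprime(); }
    return 1;                        /* E' -> eps */
}

static int E(void) { return T() && Eprime(); }

/* Returns 1 if the whole token string derives from E. */
int parse(const char *tokens) {
    tok = tokens;
    return E() && *tok == '\0';
}
```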
LEFT-RECURSION -- PROBLEM
a grammar which is not immediately left-recursive:
S → Aa | b
A → Sc | d
This grammar is not immediately left-recursive,
but it is still left-recursive:
S ⇒ Aa ⇒ Sca, or A ⇒ Sc ⇒ Aac, causes left recursion.
ELIMINATE LEFT-RECURSION --
ALGORITHM
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
-   for j from 1 to i-1 do {
      replace each production
      Ai → Aj γ
      by
      Ai → α1 γ | ... | αk γ
      where Aj → α1 | ... | αk are all the current Aj productions
    }
-   eliminate immediate left recursion among the Ai
    productions
  }
LEFT-FACTORING
A predictive parser (a top-down parser without
backtracking) insists that the grammar must be left-
factored.
LEFT-FACTORING (CONT.)
In general,
A → αβ1 | αβ2    where α is non-empty and the first symbols
of β1 and β2 (if they have one) are different.
When processing α we cannot know whether to expand
A to αβ1 or
A to αβ2.
LEFT-FACTORING -- ALGORITHM
For each non-terminal A with two or more
alternatives (production rules) with a common
non-empty prefix α, say
A → αβ1 | ... | αβn | γ1 | ... | γm
convert it into
A → αA’ | γ1 | ... | γm
A’ → β1 | ... | βn
LEFT-FACTORING – EXAMPLE1
A → abB | aB | cdg | cdeB | cdfB
⇒
A → aA’ | cdg | cdeB | cdfB
A’ → bB | B
⇒
A → aA’ | cdA’’
A’ → bB | B
A’’ → g | eB | fB
LEFT-FACTORING – EXAMPLE2
A → ad | a | ab | abc | b
⇒
A → aA’ | b
A’ → d | ε | b | bc
⇒
A → aA’ | b
A’ → d | ε | bA’’
A’’ → ε | c
TOP-DOWN PARSING
The parse tree is created top to bottom.
Top-down parser
Recursive-Descent Parsing
Backtracking is needed (If a choice of a production rule does
not work, we backtrack to try other alternatives.)
It is a general parsing technique, but not widely used.
Not efficient
Predictive Parsing
no backtracking
efficient
needs a special form of grammars (LL(1) grammars).
Recursive Predictive Parsing is a special form of Recursive
Descent parsing without backtracking.
Non-Recursive (Table Driven) Predictive Parser is also
known as LL(1) parser.
S → aBc
B → bc | b
input: abc
Trying S → aBc with B → bc builds the tree a (b c) c, which fails
(the input has no second c), so the parser backtracks and tries
B → b, which succeeds.
PREDICTIVE PARSER
current token
proc A {
- match the current token with a, and move to
the next token;
- call ‘B’;
- match the current token with b, and move to
the next token;
}
proc A {
case of the current token {
‘a’: - match the current token with a, and move
to the next token;
- call ‘B’;
- match the current token with b, and move
to the next token;
‘b’: - match the current token with b, and move
to the next token;
- call ‘A’;
- call ‘B’;
}
}
A → aA | bB | ε
first set of C
A non-recursive predictive parser has four components:
an input buffer, a stack, a parsing table, and an output stream.
LL(1) PARSER
input buffer
◦ our string to be parsed. We will assume that its end is marked with a
special symbol $.
output
◦ a production rule representing a step of the derivation sequence (left-most
derivation) of the string in the input buffer.
stack
◦ contains the grammar symbols
◦ at the bottom of the stack, there is a special end marker symbol $.
◦ initially the stack contains only the symbol $ and the starting symbol S.
$S initial stack
◦ when the stack is emptied (ie. only $ left in the stack), the parsing is
completed.
parsing table
◦ a two-dimensional array M[A,a]
◦ each row is a non-terminal symbol
◦ each column is a terminal symbol or the special symbol $
◦ each entry holds a production rule.
3. If X is a non-terminal
parser looks at the parsing table entry M[X,a]. If M[X,a] holds a
production rule XY1Y2...Yk, it pops X from the stack and pushes
Yk,Yk-1,...,Y1 onto the stack. The parser also outputs the production
rule XY1Y2...Yk to represent a step of the derivation.
Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba
S
parse tree
a B a
b B
b B
LL(1) PARSER – EXAMPLE2
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

        id         +           *            (          )         $
E       E → TE’                             E → TE’
E’                 E’ → +TE’                           E’ → ε    E’ → ε
T       T → FT’                             T → FT’
T’                 T’ → ε      T’ → *FT’               T’ → ε    T’ → ε
F       F → id                              F → (E)
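The table above can drive a stack-based LL(1) parser directly; a C sketch with E' and T' written as 'e' and 't', and id as 'i' (`lookup` encodes the table, and empty entries are errors; names and encodings are illustrative):

```c
#include <string.h>

/* Table entry for non-terminal nt and lookahead term:
   the right-hand side to push ("" = epsilon, NULL = error entry). */
static const char *lookup(char nt, char term) {
    switch (nt) {
    case 'E': if (term=='i' || term=='(') return "Te";  break;   /* E  -> TE' */
    case 'e': if (term=='+') return "+Te";                       /* E' -> +TE' */
              if (term==')' || term=='$') return "";   break;    /* E' -> eps */
    case 'T': if (term=='i' || term=='(') return "Ft"; break;    /* T  -> FT' */
    case 't': if (term=='*') return "*Ft";                       /* T' -> *FT' */
              if (term=='+'||term==')'||term=='$') return ""; break;
    case 'F': if (term=='i') return "i";                         /* F -> id */
              if (term=='(') return "(E)"; break;                /* F -> (E) */
    }
    return NULL;
}

static int is_terminal(char c) { return strchr("i+*()$", c) != NULL; }

/* Parse a '$'-terminated token string; returns 1 on acceptance. */
int ll1_parse(const char *input) {
    char stack[128];
    int top = 0;
    stack[top++] = '$';                 /* bottom end marker */
    stack[top++] = 'E';                 /* start symbol */
    while (top > 0) {
        char X = stack[--top], a = *input;
        if (is_terminal(X)) {
            if (X != a) return 0;       /* mismatch */
            if (X == '$') return 1;     /* stack emptied: accept */
            input++;
        } else {
            const char *rhs = lookup(X, a);
            if (!rhs) return 0;         /* empty table entry: error */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[top++] = rhs[i];  /* push the RHS in reverse */
        }
    }
    return 0;
}
```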
FIRST EXAMPLE
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(F) = {(, id}      FIRST(TE’) = {(, id}
FIRST(T’) = {*, ε}      FIRST(+TE’) = {+}
FIRST(T) = {(, id}      FIRST(ε) = {ε}
FIRST(E’) = {+, ε}      FIRST(FT’) = {(, id}
FIRST(E) = {(, id}      FIRST(*FT’) = {*}
                        FIRST((E)) = {(}
                        FIRST(id) = {id}
If ( A → αB is a production rule ) or
( A → αBβ is a production rule and ε is in FIRST(β) ) then
everything in FOLLOW(A) is in FOLLOW(B).
FOLLOW EXAMPLE
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id
FOLLOW(E) = { $, ) }
FOLLOW(E’) = { $, ) }
FOLLOW(T) = { +, ), $ }
FOLLOW(T’) = { +, ), $ }
FOLLOW(F) = {+, *, ), $ }
LL(1) GRAMMARS
A grammar whose parsing table has no multiply-
defined entries is said to be LL(1) grammar.
Consider the grammar  S → iCtSE | a,  E → eS | ε,  C → b:
FIRST(iCtSE) = {i}
FIRST(a) = {a}
FIRST(eS) = {e}
FIRST(ε) = {ε}
FIRST(b) = {b}

        a        b        e         i           t        $
S       S → a                       S → iCtSE
E                         E → eS
                          E → ε                          E → ε
C                C → b
The entry M[E,e] is multiply defined, so this grammar is not LL(1).
A GRAMMAR WHICH IS NOT LL(1) (CONT.)
What do we have to do it if the resulting parsing table contains multiply defined
entries?
◦ If we didn’t eliminate left recursion, eliminate the left recursion in the
grammar.
◦ If the grammar is not left factored, we have to left factor the grammar.
◦ If its (new grammar’s) parsing table still contains multiply defined entries,
that grammar is ambiguous or it is inherently not a LL(1) grammar.
A left recursive grammar cannot be an LL(1) grammar.
◦ A → Aα | β
  any terminal that appears in FIRST(β) also appears in FIRST(Aα) because
  Aα ⇒ βα.
  If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα)
  and FOLLOW(A).
A grammar that is not left factored cannot be an LL(1) grammar:
• A → αβ1 | αβ2
  any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
If the substring is chosen correctly, the rightmost derivation of that string is
created in the reverse order.
Rightmost Derivation:
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
HANDLES
Handle of a string: Substring that matches the RHS of some
production AND whose reduction to the non-terminal on the LHS is
a step along the reverse of some rightmost derivation.
It follows that:
S → aABe is a handle of aABe in location 1.
B → d is a handle of aAde in location 3.
A → Abc is a handle of aAbcde in location 2.
A → b is a handle of abbcde in location 2.
HANDLE PRUNING
A rightmost derivation in reverse can be obtained
by “handle-pruning.”
Apply this to the previous example.
S → aABe
A → Abc | b
B → d
$E          +id*id$    shift
$E+         id*id$     shift
$E+id       *id$       reduce by F → id
$E+F        *id$       reduce by T → F
$E+T        *id$       shift
$E+T*       id$        shift
$E+T*id     $          reduce by F → id
$E+T*F      $          reduce by T → T*F
$E+T        $          reduce by E → E+T
$E          $          accept
CONFLICTS DURING SHIFT-REDUCE
PARSING
There are context-free grammars for which shift-
reduce parsers cannot be used.
Stack contents and the next input symbol may not
decide action:
◦ shift/reduce conflict: the parser cannot decide whether to
make a shift operation or a reduction.
◦ reduce/reduce conflict: the parser cannot decide which
of several reductions to make.
If a shift-reduce parser cannot be used for a
grammar, that grammar is called a non-LR(k)
grammar.
LR(k): Left-to-right scanning, Rightmost derivation,
k symbols of lookahead.
1. Operator-Precedence Parser
◦ simple, but only a small class of grammars.
LR-Parsers
◦ covers wide range of grammars.
SLR – simple LR parser
Canonical LR – most general LR parser
LALR – intermediate LR parser (lookahead LR parser)
◦ SLR, Canonical LR and LALR work the same; only their
parsing tables are different.
LR PARSERS
The most powerful shift-reduce parsing (yet efficient) is:
LR(k) parsing.
LR PARSERS
LR-Parsers
covers wide range of grammars.
SLR – simple LR parser
LR – most general LR parser
LALR – intermediate LR parser (lookahead LR
parser)
SLR, LR and LALR work the same (they use the same
algorithm); only their parsing tables are different.
LR PARSING ALGORITHM
input: a1 ... ai ... an $
stack: s0 X1 s1 ... Xm-1 sm-1 Xm sm   (sm on top)
The LR parsing algorithm consults two tables:
Action table: rows are states, columns are terminals and $;
each entry is one of four actions (shift, reduce, accept, error).
Goto table: rows are states, columns are non-terminals;
each entry is a state number.
The output is the sequence of reductions performed.
A CONFIGURATION OF LR PARSING
ALGORITHM
A configuration of an LR parser is a pair of its stack contents and
remaining input:  ( s0 X1 s1 ... Xm sm , ai ai+1 ... an $ )
ACTIONS OF A LR-PARSER
1. shift s -- shifts the next input symbol and the state s
onto the stack
( So X1 S1 ... Xm Sm, ai ai+1 ... an $ ) ( So X1 S1 ... Xm Sm ai s, ai+1 ... an
$)
REDUCE ACTION (reduce by A → β)
pop 2|β| (= 2r) items from the stack, assuming
β = Y1Y2...Yr;
then push A and s, where s = goto[sm-r, A]
1) E → E+T    2) E → T    3) T → T*F    4) T → F    5) F → (E)    6) F → id

state    id    +     *     (     )     $      E    T    F
0        s5                s4                 1    2    3
1              s6                      acc
2              r2    s7          r2    r2
3              r4    r4          r4    r4
4        s5                s4                 8    2    3
5              r6    r6          r6    r6
6        s5                s4                      9    3
7        s5                s4                           10
8              s6                s11
9              r1    s7          r1    r1
10             r3    r3          r3    r3
11             r5    r5          r5    r5
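The action/goto tables above can be encoded and driven by a small C loop; a sketch (the encodings and names are illustrative, with 'i' standing for the id token):

```c
#include <string.h>

/* LR parsing driver for the SLR(1) table above.
   Action-table encoding: +s = shift to state s, -p = reduce by
   production p, ACC = accept, 0 = error. */
enum { ACC = 99 };

static const int action[12][6] = {    /* columns: i   +   *   (   )   $  */
    /* 0*/ { 5,  0,  0,  4,  0,  0 },
    /* 1*/ { 0,  6,  0,  0,  0,  ACC },
    /* 2*/ { 0, -2,  7,  0, -2, -2 },
    /* 3*/ { 0, -4, -4,  0, -4, -4 },
    /* 4*/ { 5,  0,  0,  4,  0,  0 },
    /* 5*/ { 0, -6, -6,  0, -6, -6 },
    /* 6*/ { 5,  0,  0,  4,  0,  0 },
    /* 7*/ { 5,  0,  0,  4,  0,  0 },
    /* 8*/ { 0,  6,  0,  0, 11,  0 },
    /* 9*/ { 0, -1,  7,  0, -1, -1 },
    /*10*/ { 0, -3, -3,  0, -3, -3 },
    /*11*/ { 0, -5, -5,  0, -5, -5 },
};
static const int go[12][3] = {        /* columns: E  T  F */
    {1,2,3}, {0,0,0}, {0,0,0}, {0,0,0}, {8,2,3}, {0,0,0},
    {0,9,3}, {0,0,10}, {0,0,0}, {0,0,0}, {0,0,0}, {0,0,0},
};
static const int rhs_len[7] = {0, 3, 1, 3, 1, 3, 1};  /* |RHS| of prods 1..6 */
static const int lhs[7]     = {0, 0, 0, 1, 1, 2, 2};  /* LHS: 0=E, 1=T, 2=F */

static int col(char t) { return (int)(strchr("i+*()$", t) - "i+*()$"); }

/* Returns 1 if the '$'-terminated token string is accepted. */
int lr_parse(const char *input) {
    int states[128], top = 0;
    states[top] = 0;                          /* initial state s0 */
    for (;;) {
        int act = action[states[top]][col(*input)];
        if (act == ACC) return 1;
        if (act > 0) {                        /* shift */
            states[++top] = act;
            input++;
        } else if (act < 0) {                 /* reduce by production -act */
            top -= rhs_len[-act];             /* pop |RHS| states */
            states[top + 1] = go[states[top]][lhs[-act]];
            top++;                            /* push goto state */
        } else {
            return 0;                         /* error entry */
        }
    }
}
```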
For the production A → aBb there are four different possible LR(0) items:
A → .aBb
A → a.Bb
A → aB.b
A → aBb.
Sets of LR(0) items will be the states of action and
goto table of the SLR parser.
A collection of sets of LR(0) items (the canonical
LR(0) collection) is the basis for constructing SLR
parsers.
Augmented Grammar:
G’ is G with a new production rule S’S where S’ is
the new starting symbol.
closure(I):
1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a
production rule of G, then B → .γ will be in the
closure(I). We
apply this rule until no more new LR(0) items can
be added to closure(I).
GOTO OPERATION
If I is a set of LR(0) items and X is a grammar symbol
(terminal or non-terminal), then goto(I,X) is defined as
follows:
If A → α.Xβ is in I,
then every item in closure({A → αX.β}) will be in
goto(I,X).
Example:
I = { E’ → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,E) = { E’ → E., E → E.+T }
goto(I,T) = { E → T., T → T.*F }
goto(I,F) = { T → F. }
goto(I,() = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,id) = { F → id. }
Algorithm:
C is { closure({S’ → .S}) }
repeat the following until no more sets of LR(0) items can be
added to C:
  for each I in C and each grammar symbol X
    if goto(I,X) is not empty and not in C
      add goto(I,X) to C
I5: F → id.
TRANSITION DIAGRAM (DFA) OF
GOTO FUNCTION
I0: on E → I1, on T → I2, on F → I3, on ( → I4, on id → I5
I1: on + → I6
I2: on * → I7
I4: on E → I8, on T → I2, on F → I3, on ( → I4, on id → I5
I6: on T → I9, on F → I3, on ( → I4, on id → I5
I7: on F → I10, on ( → I4, on id → I5
I8: on + → I6, on ) → I11
I9: on * → I7
state    id    +     *     (     )     $      E    T    F
0        s5                s4                 1    2    3
1              s6                      acc
2              r2    s7          r2    r2
3              r4    r4          r4    r4
4        s5                s4                 8    2    3
5              r6    r6          r6    r6
6        s5                s4                      9    3
7        s5                s4                           10
8              s6                s11
9              r1    s7          r1    r1
10             r3    r3          r3    r3
11             r5    r5          r5    r5
SLR(1) GRAMMAR
An LR parser using SLR(1) parsing tables for a
grammar G is called the SLR(1) parser for G.
If a grammar G has an SLR(1) parsing table, it is
called an SLR(1) grammar (or SLR grammar for
short).
Every SLR grammar is unambiguous, but not every
unambiguous grammar is an SLR grammar.
SLR PARSER
CONFLICT EXAMPLE
S → L=R     I0: S’ → .S      I1: S’ → S.      I6: S → L=.R     I9: S → L=R.
S → R           S → .L=R                          R → .L
L → *R          S → .R       I2: S → L.=R         L → .*R
L → id          L → .*R          R → L.           L → .id
R → L           L → .id
                R → .L       I3: S → R.
In I2, on input =, the parser can shift (S → L.=R) or reduce (R → L.,
since = is in FOLLOW(R)): a shift/reduce conflict, so the grammar is not SLR(1).
CONFLICT EXAMPLE2
S → AaAb     I0: S’ → .S
S → BbBa         S → .AaAb
A → ε            S → .BbBa
B → ε            A → .
                 B → .
Problem:
FOLLOW(A) = {a,b}
FOLLOW(B) = {a,b}
On a: reduce by A → ε or reduce by B → ε      (reduce/reduce conflict)
On b: reduce by A → ε or reduce by B → ε      (reduce/reduce conflict)
S → AaAb:  S ⇒ AaAb ⇒ aAb ⇒ ab
S → BbBa:  S ⇒ BbBa ⇒ bBa ⇒ ba
LR(1) ITEM
To avoid some of invalid reductions, the states
need to carry more information.
Extra information is put into a state by
including a terminal symbol as a second
component in an item.
◦ If A → α.Bβ, a is in closure(I) and B → γ is a
production rule of G, then B → .γ, b will be in
the closure(I) for each terminal b in FIRST(βa).
GOTO OPERATION
A set of LR(1) items sharing the same core and differing only in
lookahead can be written as
A → α.β, a1/a2/.../an
S → AaAb     I0: S’ → .S, $          I1: S’ → S., $
S → BbBa         S → .AaAb, $
A → ε            S → .BbBa, $        I2: S → A.aAb, $
B → ε            A → ., a            I3: S → B.bBa, $
                 B → ., b
I4: S → Aa.Ab, $     I6: S → AaA.b, $     I8: S → AaAb., $
    A → ., b
I5: S → Bb.Ba, $     I7: S → BbB.a, $     I9: S → BbBa., $
    B → ., a
Now the reductions by A → ε and B → ε have different lookaheads,
so there is no conflict.
CANONICAL LR(1) COLLECTION – EXAMPLE2
S’ → S
1) S → L=R    2) S → R    3) L → *R    4) L → id    5) R → L

I0: S’ → .S,$   S → .L=R,$   S → .R,$   L → .*R,$/=   L → .id,$/=   R → .L,$
I1: S’ → S.,$
I2: S → L.=R,$   R → L.,$
I3: S → R.,$
I4: L → *.R,$/=   R → .L,$/=   L → .*R,$/=   L → .id,$/=
I5: L → id.,$/=
I6: S → L=.R,$   R → .L,$   L → .*R,$   L → .id,$
I7: L → *R.,$/=
I8: R → L.,$/=
I9: S → L=R.,$
I10: R → L.,$
I11: L → *.R,$   R → .L,$   L → .*R,$   L → .id,$
I12: L → id.,$
I13: L → *R.,$
States with the same core: I4 and I11, I5 and I12, I7 and I13, I8 and I10.
246
        id     *      =      $      S    L    R
0       s5     s4                   1    2    3
1                            acc
2                     s6     r5
3                            r2
4       s5     s4                        8    7
5                     r4     r4
6       s12    s11                       10   9
7                     r3     r3
8                     r5     r5
9                            r1
10                           r5
11      s12    s11                       10   13
12                           r4
13                           r3
There is no shift/reduce or reduce/reduce conflict, so it is an LR(1) grammar.
Ex:
S → L.=R, $        core: S → L.=R
R → L., $                R → L.
I1: L → id., =   and   I2: L → id., $   have the same core, so merge them
into a new state:   I12: L → id., =/$
We will do this for all states of a canonical LR(1) parser to get the
states of the LALR parser.
In fact, the number of the states of the LALR parser for a grammar
will be equal to the number of states of the SLR parser for that
grammar.
Find each core; find all sets having that same core; replace those sets
having the same core with a single set which is their union.
C = {I0, ..., In}  →  C’ = {J1, ..., Jm}  where m ≤ n
Create the parsing tables (action and goto tables) the same way as for
the LR(1) parser.
◦ Note that if J = I1 ∪ ... ∪ Ik, then since I1, ..., Ik have the same cores,
the cores of goto(I1,X), ..., goto(Ik,X) must also be the same.
◦ So, goto(J,X) = K, where K is the union of all sets of items having the
same core as goto(I1,X).
SHIFT/REDUCE CONFLICT
We say that we cannot introduce a shift/reduce conflict during the
shrink process for the creation of the states of a LALR parser.
Assume that we can introduce a shift/reduce conflict. In this case, a
state of the LALR parser must have:
A → α., a    and    B → β.aγ, b
This means that a state of the canonical LR(1) parser must have:
A → α., a    and    B → β.aγ, c
But this state also has a shift/reduce conflict, i.e. the original
canonical LR(1) parser has a conflict.
(The reason: the shift operation does not depend on lookaheads.)
REDUCE/REDUCE CONFLICT
But, we may introduce a reduce/reduce conflict
during the shrink process for the creation of the
states of a LALR parser:
I1: A → α., a        I2: A → α., b
    B → β., b            B → β., c
merge to
I12: A → α., a/b
     B → β., b/c     — a reduce/reduce conflict
CANONICAL LALR(1) COLLECTION – EXAMPLE2
S’ → S
1) S → L=R    2) S → R    3) L → *R    4) L → id    5) R → L
Merging the same-core states of the canonical LR(1) collection
(I4 and I11, I5 and I12, I7 and I13, I8 and I10) gives:
I0: S’ → .S,$   S → .L=R,$   S → .R,$   L → .*R,$/=   L → .id,$/=   R → .L,$
I1: S’ → S.,$
I2: S → L.=R,$   R → L.,$
I3: S → R.,$
I411: L → *.R,$/=   R → .L,$/=   L → .*R,$/=   L → .id,$/=
I512: L → id.,$/=
I6: S → L=.R,$   R → .L,$   L → .*R,$   L → .id,$
I713: L → *R.,$/=
I810: R → L.,$/=
I9: S → L=R.,$
LALR(1) PARSING TABLES – (FOR EXAMPLE 2)

state   id    *     =     $     S   L   R
0       s5    s4                1   2   3
1                   acc
2             s6    r5
3                   r2
4       s5    s4                    8   7
5             r4    r4
6       s12   s11                   10  9
7             r3    r3
8             r5    r5
9                   r1

No shift/reduce or reduce/reduce conflict, so it is an LALR(1) grammar.
APPLICATIONS
SQL QUERY SYNTAX
SELECT * FROM Book WHERE price > 100 ORDER BY title;
Source program:
{
   int i; int j;
   float[100] a; float v; float x;
   while (true) {
      do i = i + 1; while ( a[i] < v );
      do j = j - 1; while ( a[j] > v );
      if ( i >= j ) break;
      x = a[i]; a[i] = a[j]; a[j] = x;
   }
}
Intermediate code:
 1: i = i + 1
 2: t1 = a [ i ]
 3: if t1 < v goto 1
 4: j = j - 1
 5: t2 = a [ j ]
 6: if t2 > v goto 4
 7: ifFalse i >= j goto 9
 8: goto 14
 9: x = a [ i ]
10: t3 = a [ j ]
11: a [ i ] = t3
12: a [ j ] = x
13: goto 1
14:
A MODEL OF A COMPILER FRONT END
INTERMEDIATE CODE GENERATION
The translation of the parse tree into intermediate form:
Parser → syntax tree → Intermediate Code Generator → intermediate code → Code Generator
(all phases consult the Symbol Table)
NEED FOR INTERMEDIATE CODE
Without an intermediate form, one compiler is needed per source/target
pair: Java → JVM, C → Intel Pentium, ...
Suppose we have n source languages and m target languages. With a shared
intermediate code, n front ends and m back ends suffice, and optimization
is done once on the intermediate code:
Java, C, .Net, ...  →  Intermediate Code (optimization)  →  JVM, Intel Pentium, IBM Cell, ...
Example:
i) Without intermediate code:
   C Program → C compiler for 80X86 system → machine instructions for 80X86 system
   C Program → C compiler for SPARC system → machine instructions for SPARC system
ii) With intermediate code:
   C Program → compiler front end → intermediate code
      → compiler back end for 80X86 → machine instructions for 80X86 system
      → compiler back end for SPARC → machine instructions for SPARC system
Benefits of using intermediate code
1. Retargeting is facilitated; a compiler for a different machine
can be created by attaching a Back-end for the new machine to
an existing Front-end.
Syntax-directed definition
Build up a translation by attaching strings (semantic rules) as
attributes to the nodes in the parse tree.
Hides implementation details.
Example: the annotated parse tree for 9 - 5 + 2, with attributes such as
term.t = "9" at the leaves and expr.t at the root.
SYNTHESIZED ATTRIBUTES
Interior node represents an operator and its children represent its operands.
Difference between DAG and syntax tree:
1. DAG – a node for a common sub-expression has more than one parent.
2. Syntax tree – a common sub-expression is represented as a duplicated
subtree.
DAG for a + a*(b-c) + (b-c)*d
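The sharing in the DAG above can be obtained by value numbering: a node for a given (operator, children) combination is created only once. A minimal illustrative sketch (not the slides' code):

```python
# Build a DAG with value numbering: each (op, children) key maps to a
# single node, so common sub-expressions like b-c share one node.

def dag_node(op, kids, table, nodes):
    key = (op, kids)
    if key not in table:          # create the node only on first sight
        table[key] = len(nodes)
        nodes.append(key)
    return table[key]

table, nodes = {}, []
leaf = lambda name: dag_node(name, (), table, nodes)

# a + a*(b-c) + (b-c)*d
a, b, c, d = (leaf(x) for x in 'abcd')
bc    = dag_node('-', (b, c), table, nodes)
bc2   = dag_node('-', (b, c), table, nodes)    # reuses the same node
prod1 = dag_node('*', (a, bc), table, nodes)
prod2 = dag_node('*', (bc2, d), table, nodes)
total = dag_node('+', (dag_node('+', (a, prod1), table, nodes), prod2),
                 table, nodes)

print(bc == bc2)     # True: b-c is represented once
print(len(nodes))    # 9 nodes; a syntax tree would duplicate the b-c subtree
```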
INTERMEDIATE LANGUAGES
Syntax trees
Postfix notation
Three-address code
Example: for do i = i + 1; while ( a[i] < v ); the syntax tree has a
do-while node whose body assigns i + 1 to i and whose condition compares
a[i] with v; the corresponding three-address code is:
1: i = i + 1
2: t1 = a [ i ]
3: if t1 < v goto 1
Postfix notation: a*b-c  :  ab*c-
1. SYNTAX TREES
Graphical representation; depicts the natural hierarchical structure of
the program.
DAG (Directed Acyclic Graph)
Gives the same information but in a more compact way; common
sub-expressions are identified.
Graphical representation of: a := b * -c + b * -c
SYNTAX TREE and DAG for a := b * -c + b * -c:
Both are rooted at assign with children a and +. In the syntax tree the
sub-expression b * -c appears as two duplicated * subtrees; in the DAG a
single * node is shared by the +.
2. POSTFIX NOTATION
a + b                     Postfix: ab+
a := b * -c + b * -c      Postfix: ?
3. THREE ADDRESS CODE
General form: x = y op z
For a := b * -c + b * -c (the tree has assign at the root, with two *
nodes whose right children are uminus c), the TAC is:
t1 = uminus c
t2 = b * t1
t3 = uminus c
t4 = b * t3
t5 = t2 + t4
a = t5
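The translation above can be sketched as a small recursive generator over an expression tree. This is an illustrative sketch, not the slides' code; temporaries t1, t2, ... are created on demand:

```python
# Generate three-address code from an expression tree given as nested
# tuples, e.g. ('*', 'b', ('uminus', 'c')).

def gen_tac(node, code, counter):
    """Return the 'address' holding node's value, appending TAC to code."""
    if isinstance(node, str):                  # a leaf: a variable name
        return node
    op, *kids = node
    args = [gen_tac(k, code, counter) for k in kids]
    counter[0] += 1
    temp = f't{counter[0]}'                    # fresh temporary
    if len(args) == 1:                         # unary operator
        code.append(f'{temp} = {op} {args[0]}')
    else:                                      # binary operator
        code.append(f'{temp} = {args[0]} {op} {args[1]}')
    return temp

# a := b * -c + b * -c
expr = ('+', ('*', 'b', ('uminus', 'c')), ('*', 'b', ('uminus', 'c')))
code = []
result = gen_tac(expr, code, [0])
code.append(f'a = {result}')
print('\n'.join(code))   # matches the six TAC statements above
```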
Syntax trees can be built by semantic rules, e.g.:
E → id   { E.nptr := mkleaf(id, id.place) }
Method 1: each node is a record linked by pointers. The tree for
a := b * -c + b * -c contains the nodes assign, id a, +, two * nodes,
id b (twice), uminus (twice) and id c (twice).
Method 2:
Nodes are allocated from an array of records, and the index or position
of a node serves as a pointer to it. All the nodes in the syntax tree can
be visited by following pointers.
For a := b * -c + b * -c:
 0   id      b
 1   id      c
 2   uminus  1
 3   *       0   2
 4   id      b
 5   id      c
 6   uminus  5
 7   *       4   6
 8   +       3   7
 9   id      a
10   assign  9   8
THREE ADDRESS CODE
Format: x = y op z
3 addresses :: 2 operands + 1 result
t1:= y*z
t2:= x+t1
param x1
...
param xn
call p, n
where n indicates the number of actual parameters in the call of p.
7. Indexed Assignments:
x := y[i]
x[i] := y
Quadruples
Triples
Indirect Triples
QUADRUPLES
A quadruple is a record structure with 4 fields :
op, arg1, arg2, result
op represents the operator.
Eg. The 3 addr stmt x=y op z is represented as
op arg1 arg2 result
op y z x
Statements with unary operators do not use arg2
TRIPLES
A triple has only three fields (op, arg1, arg2); a result is referred to
by the position of the triple that computes it.
     op      arg1   arg2
(0)  uminus  c
(1)  *       b      (0)
(2)  uminus  c
(3)  *       b      (2)
(4)  +       (1)    (3)
(5)  assign  a      (4)
Triples for x[i] := y:
(0)  []=    x      i
(1)  =      (0)    y
Triples for x := y[i]:
(0)  =[]    y      i
(1)  =      x      (0)
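The three representations can be contrasted in a few lines of code. The tuples below are an illustrative encoding (not a fixed format) of the a := b * -c + b * -c example:

```python
# Quadruples: (op, arg1, arg2, result) -- every result has an explicit name.
quads = [
    ('uminus', 'c',  None, 't1'),
    ('*',      'b',  't1', 't2'),
    ('uminus', 'c',  None, 't3'),
    ('*',      'b',  't3', 't4'),
    ('+',      't2', 't4', 't5'),
    (':=',     't5', None, 'a'),
]

# Triples: (op, arg1, arg2) -- a result is named by its position; an
# integer argument refers to the triple at that index.
triples = [
    ('uminus', 'c',  None),
    ('*',      'b',  0),       # (0) refers to the first uminus triple
    ('uminus', 'c',  None),
    ('*',      'b',  2),
    ('+',      1,    3),
    ('assign', 'a',  4),
]

# Indirect triples: a statement list of pointers into the triple array,
# so statements can be reordered without renumbering the triples.
stmt = [0, 1, 2, 3, 4, 5]

print(len(quads), len(triples), len(stmt))   # 6 6 6
```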
INDIRECT TRIPLES
Listing pointers to triples instead of the triples themselves.
a := b * -c + b * -c
stmt             op      arg1   arg2
(0) (10)   (10)  uminus  c
(1) (11)   (11)  *       b      (10)
(2) (12)   (12)  uminus  c
(3) (13)   (13)  *       b      (12)
(4) (14)   (14)  +       (11)   (13)
(5) (15)   (15)  assign  a      (14)
COMPARISON
Quadruples have the limitation of needing more temporary variables, since
every result must be given an explicit name.
DECLARATIONS
Entries are made into symbol table for type and relative address
Declarations in a procedure
Translation scheme for declarations in a procedure
productions
P →D
D →D ;D
D→ id :T
T→ integer
T→ real
T→array[num]of T1
T→ ↑T1
DECLARATIONS
productions              Translation scheme
P → D                    { offset := 0 }
D → D ; D
D → id : T               { enter(id.name, T.type, offset);
                           offset := offset + T.width }
T → integer              { T.type := integer; T.width := 4 }
T → real                 { T.type := real; T.width := 8 }
T → array [num] of T1    { T.type := array(num.val, T1.type);
                           T.width := num.val × T1.width }
T → ↑T1                  { T.type := pointer(T1.type); T.width := 4 }
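The scheme above can be sketched as a loop over declarations. This is an illustrative sketch: process_declarations and its input format are assumed names, and the widths (integer = 4, real = 8) follow the slides:

```python
# Mimic the declaration translation scheme: enter each name with its type
# and relative address (offset), then advance offset by the type's width.

WIDTH = {'integer': 4, 'real': 8}

def process_declarations(decls):
    """decls: list of (name, type) pairs; a type is a string or
    ('array', n, element_type)."""
    symtab, offset = {}, 0                 # offset := 0 for P -> D
    for name, ty in decls:
        if isinstance(ty, tuple):          # T -> array [num] of T1
            _, n, elem = ty
            width = n * WIDTH[elem]        # num.val * T1.width
        else:
            width = WIDTH[ty]
        symtab[name] = (ty, offset)        # enter(id.name, T.type, offset)
        offset += width                    # offset := offset + T.width
    return symtab

table = process_declarations([('i', 'integer'),
                              ('x', 'real'),
                              ('a', ('array', 10, 'integer'))])
print(table)   # i at offset 0, x at offset 4, a at offset 12
```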
DECLARATIONS IN NESTED
PROCEDURES
The address of an array element a[i] is base + (i − low) × w.
The expression can be partially evaluated at compile time if it is
rewritten as
i × w + (base − low × w)
since (base − low × w) is a constant that can be computed once.
BOOLEAN EXPRESSIONS
1. Numerical representation (short circuit or jumping
code)
2. Control flow translation of boolean expressions
Operator precedence: not (highest), then and, then or (lowest).
THREE ADDRESS CODE FOR IF A < B THEN 1 ELSE 0 (numerical representation):
100: if a < b goto 103
101: t := 0
102: goto 104
103: t := 1
104: ...
Semantic rules:
S → while E do S1
{ S.begin := newlabel;
  E.true := newlabel;
  E.false := S.next;
  S1.next := S.begin;
  S.code := gen(S.begin ':') || E.code || gen(E.true ':') || S1.code
            || gen('goto' S.begin) }
SYNTAX DIRECTED DEFINITION FOR FLOW OF
CONTROL STATEMENTS
switch expression
begin
   case value : statement
   case value : statement
   case value : statement
   default : statement
end
SYNTAX DIRECTED TRANSLATION OF CASE
STATEMENTS
Source:
switch E
begin
   case V1: S1
   case V2: S2
end
Translation:
    code to evaluate E into t
    goto test
L1: code for S1
    goto next
L2: code for S2
    goto next
Value stack kept in parallel with the parser stack:
        State   Val
        ...     ...
        X       X.x
        Y       Y.y
top →   Z       Z.z
        ...     ...
MOVES MADE BY TRANSLATOR ON INPUT 3*5+4n
TYPE CHECKING
A compiler must check that the source program
follows both syntactic and semantic conventions
of the source language.
TYPE CHECKING
Semantic checks
Type checks
Flow of control checks
Uniqueness checks
Name related checks
TYPE SYSTEMS
A set of rules for associating type expressions with the various parts of
a program.
P→D;E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
TRANSLATION SCHEME FOR DECLARATIONS
P→D;E
D→D;D
D → id : T { addtype(id.entry, T.type) }
T → char { T.type := char }
T → integer { T.type := integer }
T → ↑T1 { T.type := pointer(T1.type) }
T → array [ num ] of T1
{ T.type := array(1 .. num.val, T1.type) }
E → id { E.type := lookup(id.entry) }
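The bottom-up flavor of these rules can be sketched as a recursive checker. This is an illustrative sketch covering only E → id, E mod E and E[E], with type_error as the error marker; the env dictionary stands in for the symbol-table lookup:

```python
# Minimal bottom-up type checker: each node's type is computed from its
# children's types, propagating 'type_error' on a mismatch.

def check(node, env):
    if isinstance(node, str):                      # E -> id
        return env.get(node, 'type_error')         # lookup(id.entry)
    op, left, right = node
    lt, rt = check(left, env), check(right, env)
    if op == 'mod':                                # both sides integer
        return 'integer' if lt == rt == 'integer' else 'type_error'
    if op == 'index':                              # E1[E2]: E1 an array
        if isinstance(lt, tuple) and lt[0] == 'array' and rt == 'integer':
            return lt[1]                           # the element type
        return 'type_error'
    return 'type_error'

env = {'i': 'integer', 'j': 'integer', 'a': ('array', 'char')}
print(check(('mod', 'i', 'j'), env))       # integer
print(check(('index', 'a', 'i'), env))     # char
print(check(('mod', 'a', 'i'), env))       # type_error
```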
Suppose we encounter an expression x+i where x has type float and i has type
int.
CPU instructions for addition could take EITHER float OR int as operands, but not
a mix.
TYPE COERCION
IMPLICIT (COERCION)
EXPLICIT
x = (int)y * 2;
APPLICATIONS
Java
WORKFLOW AS DAG
UNIT IV
RUN TIME ENVIRONMENTS AND CODE
GENERATION
Storage Organization
Parameter Passing
Symbol Tables
SOURCE LANGUAGE ISSUES
1. A program is made up of procedures.
Procedures
Activation trees
Control stack
Scope of declaration
Binding of names
PROCEDURE ACTIVATION FOR QUICKSORT()
ACTIVATION TREES
Assumptions about flow of control among
procedures during execution of a program:
Control flows sequentially
Each execution of a procedure starts at the beginning of the
procedure body and eventually returns to the point
immediately following the place where the procedure was
called.
Example activations in the quicksort activation tree: q(1,9), q(1,3),
q(2,3).
SCOPE OF DECLARATION
Environment State
Static Allocation
Stack Allocation
Heap Allocation
THREE KINDS OF MEMORY
Fixed memory
Stack memory
Heap memory
FIXED ADDRESS MEMORY
Executable code
Global variables
Other registers
Code:
   code for PRDUCE
   code for CNSUME
Static data:
   activation record for CNSUME
   activation record for PRDUCE
Declarations: CHARACTER*50 BUFFER, INTEGER NEXT, CHARACTER C
(Fortran identifiers such as PRDUCE and CNSUME are limited to six
characters.)
LIMITATIONS
Size of data objects must be known at compile
time
Recursive procedures are restricted
e.g., Fortran (source file extensions: .f, .FOR, .for, .f77, .f90, .f95)
Example : Quicksort()
CALLING SEQUENCE
Allocates the activation record and enters information into its fields.
RETURN SEQUENCE
Successive snapshots of the run-time stack during quicksort, with the
activation tree grown so far beside each snapshot:
1. s (a: array)
2. s, q(1,9) (i: integer)
3. s, q(1,9), q(1,3) (i: integer) — the tree also records p(1,9)
4. s, q(1,9), q(1,3), q(1,0) — the tree also records p(1,3)
LIMITATION OF STACK ALLOCATION
Dangling References
References to a storage that has been deallocated
Eg:
main()
{
   int *p;
   p = dangle();
}
int *dangle()
{
   int i = 23;     /* i lives at, say, address 2000 holding 23 */
   return &i;      /* after return, p refers to deallocated storage */
}
HEAP ALLOCATION
Allocation is done according to need
Activation records for s, q(1,9) and r(1,9) are retained on the heap,
each linked to its caller by a control link.
{
// statements
}
EXAMPLE
B0: { int a = 0;
      int b = 0;
B1:   { int b = 1;
B2:     { int a = 2;
        }
B3:     { int b = 3;
        }
      }
    }
PARAMETER PASSING
Call by value
Call by address
Call by Copy-Restore
SYMBOL TABLES
Stores the symbols of the source program as the compiler encounters them.
Intermediate representation
1. Postfix notation
2. Quadruples
3. Stack machine code
4. syntax trees
5. DAGs
Assumptions made:
* Register allocation
* Register assignment
Register pairs (even/odd numbered), e.g. for multiplication M x, y:
x – the multiplicand, in the even register of an even/odd register pair
y – in a single register
the product occupies the entire even/odd register pair
Byte addressable
4 bytes / word
op source, destination
Instruction Costs
a :=b+c
1. MOV b, R0        cost = 6
   ADD c, R0
   MOV R0, a
2. MOV b, a         cost = 6
   ADD c, a
call
return
halt
ADDRESS DESCRIPTOR:
Keeps track of location where the current value of the
name can be found at run-time.
(Register, Stack or Memory Address)
t:=a-b
u:=a-c
v:=t+u
d:=v+u
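Code generation for this sequence updates the register and address descriptors after each statement. The sketch below tracks only that bookkeeping; the register choices (R0, R1) are an assumed textbook-style trace, not taken from the slides:

```python
# Track a register descriptor (register -> name it holds) and an address
# descriptor (name -> its current location) for t:=a-b; u:=a-c; v:=t+u;
# d:=v+u.  Registers chosen here are illustrative.

reg = {}      # register descriptor
addr = {}     # address descriptor

def assign(name, register):
    """Record that `register` now holds the current value of `name`."""
    for r, n in list(reg.items()):       # a register holds one value
        if n == name:
            del reg[r]
    reg[register] = name
    addr[name] = register

assign('t', 'R0')    # t := a - b   -> result left in R0
assign('u', 'R1')    # u := a - c   -> t stays in R0, u in R1
assign('v', 'R0')    # v := t + u   -> t is dead after this, reuse R0
assign('d', 'R0')    # d := v + u   -> v is dead after this, reuse R0

print(reg)           # {'R0': 'd', 'R1': 'u'}
print(addr['d'])     # R0
```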
Constant folding
Code motion
Induction-variable elimination
Reduction in strength
Structure-Preserving Transformations
1. Common sub-expression elimination
2. Dead code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements.
Algebraic Transformations
Simplifying expressions or replacing expensive
operation by cheaper ones
The following code computes the inner product of two vectors.
begin
   prod := 0;
   i := 1;
   do begin
      prod := prod + a[i] * b[i];
      i := i + 1;
   end
   while i <= 20
end
Source code
The same program in three-address code:
 (1) prod := 0
 (2) i := 1
 (3) t1 := 4 * i
 (4) t2 := a[t1]
 (5) t3 := 4 * i
 (6) t4 := b[t3]
 (7) t5 := t2 * t4
 (8) t6 := prod + t5
 (9) prod := t6
(10) t7 := i + 1
(11) i := t7
(12) if i <= 20 goto (3)
Three-address code
Leaders in the three-address code are found as follows:
Rule (i)   – the first statement, (1), is a leader.
Rule (ii)  – statement (3), the target of the goto in (12), is a leader.
Rule (iii) – statement (13), which immediately follows the conditional
             goto in (12), is a leader.
Basic Blocks:
B1: (1)  prod := 0
    (2)  i := 1
B2: (3)  t1 := 4 * i
    (4)  t2 := a[t1]
    (5)  t3 := 4 * i
    (6)  t4 := b[t3]
    (7)  t5 := t2 * t4
    (8)  t6 := prod + t5
    (9)  prod := t6
    (10) t7 := i + 1
    (11) i := t7
    (12) if i <= 20 goto (3)
B3: (13) …
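The leader rules can be sketched directly in code. This illustrative sketch (not the slides' algorithm) works on the inner-product TAC above, with statements stored by number:

```python
# Find basic-block leaders: (i) the first statement; (ii) any target of a
# goto; (iii) any statement immediately following a jump.

import re

def find_leaders(code):
    """code: dict {stmt_no: text}. Returns the sorted leader numbers."""
    nums = sorted(code)
    leaders = {nums[0]}                          # rule (i)
    for n in nums:
        m = re.search(r'goto \((\d+)\)', code[n])
        if m:
            leaders.add(int(m.group(1)))         # rule (ii)
            if n + 1 in code:
                leaders.add(n + 1)               # rule (iii)
    return sorted(leaders)

tac = {1: 'prod := 0', 2: 'i := 1', 3: 't1 := 4 * i', 4: 't2 := a[t1]',
       5: 't3 := 4 * i', 6: 't4 := b[t3]', 7: 't5 := t2 * t4',
       8: 't6 := prod + t5', 9: 'prod := t6', 10: 't7 := i + 1',
       11: 'i := t7', 12: 'if i <= 20 goto (3)', 13: '...'}
print(find_leaders(tac))    # [1, 3, 13] -> blocks B1, B2, B3
```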
QUICKSORT ROUTINE
Basic blocks :
BB1: (1)--(4)
BB2: (5)--(8)
BB3: (9)--(12)
BB4: (13)
BB5: (14)--(22)
BB6: (23)--(30)
QUICKSORT THREE ADDRESS CODE
 (1) i := m-1
 (2) j := n
 (3) t1 := 4*n
 (4) v := a[t1]
 (5) i := i+1
 (6) t2 := 4*i
 (7) t3 := a[t2]
 (8) if t3<v goto (5)
 (9) j := j-1
(10) t4 := 4*j
(11) t5 := a[t4]
(12) if t5>v goto (9)
(13) if i>=j goto (23)
(14) t6 := 4*i
(15) x := a[t6]
(16) t7 := 4*i
(17) t8 := 4*j
(18) t9 := a[t8]
(19) a[t7] := t9
(20) t10 := 4*j
(21) a[t10] := x
(22) goto (5)
(23) t11 := 4*i
(24) x := a[t11]
(25) t12 := 4*i
(26) t13 := 4*n
(27) t14 := a[t13]
(28) a[t12] := t14
(29) t15 := 4*n
(30) a[t15] := x
(31) two recursive calls ...
Control Flow Graph:
B1: i := m-1 ; j := n ; t1 := 4*n ; v := a[t1]
B2: i := i+1 ; t2 := 4*i ; t3 := a[t2] ; if t3<v goto B2
B3: j := j-1 ; t4 := 4*j ; t5 := a[t4] ; if t5>v goto B3
B4: if i>=j goto B6
B5: t6 := 4*i ; x := a[t6] ; t7 := 4*i ; t8 := 4*j ; t9 := a[t8] ;
    a[t7] := t9 ; t10 := 4*j ; a[t10] := x ; goto B2
B6: t11 := 4*i ; x := a[t11] ; t12 := 4*i ; t13 := 4*n ; t14 := a[t13] ;
    a[t12] := t14 ; t15 := 4*n ; a[t15] := x
Function-preserving transformations:
• Common sub-expression elimination
• Copy propagation
• Dead-code elimination
• Constant folding
Loop optimizations:
• Code motion
• Induction-variable elimination
• Reduction in strength
B5 and B6 before local optimization:
B5: t6 := 4*i ; x := a[t6] ; t7 := 4*i ; t8 := 4*j ; t9 := a[t8] ;
    a[t7] := t9 ; t10 := 4*j ; a[t10] := x ; goto B2
B6: t11 := 4*i ; x := a[t11] ; t12 := 4*i ; t13 := 4*n ; t14 := a[t13] ;
    a[t12] := t14 ; t15 := 4*n ; a[t15] := x
After local common sub-expression elimination (t7 and t10 duplicate t6
and t8 in B5; t12 and t15 duplicate t11 and t13 in B6):
B5: t6 := 4*i ; x := a[t6] ; t8 := 4*j ; t9 := a[t8] ;
    a[t6] := t9 ; a[t8] := x ; goto B2
B6: t11 := 4*i ; x := a[t11] ; t13 := 4*n ; t14 := a[t13] ;
    a[t11] := t14 ; a[t13] := x
After global common sub-expression elimination (t6 and t11 duplicate
t2 = 4*i, t8 duplicates t4 = 4*j, t13 duplicates t1 = 4*n, and a[t2] = t3,
a[t4] = t5 are already available from B2 and B3):
B5: x := t3 ; a[t2] := t5 ; a[t4] := x ; goto B2
B6: x := t3 ; t14 := a[t1] ; a[t2] := t14 ; a[t1] := x
After copy propagation and dead-code elimination (the uses of x are
replaced by t3, making x := t3 dead), and after induction-variable
elimination with reduction in strength (t2 and t4 are stepped by ±4
directly instead of recomputing 4*i and 4*j):
B2: t2 := t2+4 ; t3 := a[t2] ; if t3<v goto B2
B3: t4 := t4-4 ; t5 := a[t4] ; if t5>v goto B3
B4: if i >= j goto B6
B5: a[t2] := t5 ; a[t4] := t3 ; goto B2
B6: t14 := a[t1] ; a[t2] := t14 ; a[t1] := t3
1. Common sub-expression elimination
2. Dead code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements
Common sub-expression elimination:
Before:              After:
a := b + c           a := b + c
b := a - d           b := a - d
c := b + c           c := b + c
d := a - d           d := b
Interchange of two independent adjacent statements:
t1 := b + c          t2 := x + y
t2 := x + y          t1 := b + c
2 * 3.14 = 6.28
x/1 = x
x * y = y * x (commutativity)
x*1 = x
Reduction in Strength
Expensive operators can be replaced by cheaper ones:
x ** 2   =  x * x
2.0 * x  =  x + x
x / 2    =  x * 0.5
Algebraic transformations:
x := x + 0  and  x := x * 1  can be eliminated.
x := y**2   becomes  x := y*y
z := 2*x    becomes  z := x + x
Changes such as y**2 into y*y and 2*x into x+x are also known as
strength reduction.
Dominators (for the flow graph with nodes 1–10):
1.  Dominates all
2.  Dominates itself
3.  Dominates all but 1, 2
4.  Dominates all but 1, 2, 3
5.  Dominates itself
6.  Dominates itself
7.  Dominates 7, 8, 9, 10
8.  Dominates 8, 9, 10
9.  Dominates itself
10. Dominates itself
Tree formation:
• The initial node is the root
• Each node dominates only its descendants
• Each node has a unique immediate dominator
• Each node dominates itself
DOMINATOR TREE
1 → 2, 3 ;  3 → 4 ;  4 → 5, 6, 7 ;  7 → 8 ;  8 → 9, 10
List of dominators:
1.  Dominates all
2.  Dominates itself
3.  Dominates all but 1, 2
4.  Dominates all but 1, 2, 3
5.  Dominates itself
6.  Dominates itself
7.  Dominates 7, 8, 9, 10
8.  Dominates 8, 9, 10
9.  Dominates itself
10. Dominates itself
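Dominator sets like those above can be computed with the standard iterative algorithm: DOM(s) = {s}, and DOM(n) = {n} ∪ the intersection of DOM(p) over all predecessors p. The sketch below uses a small illustrative four-node graph rather than the slides' ten-node example:

```python
# Iterative dominator computation on a CFG given as successor lists.

def dominators(succ, start):
    nodes = set(succ)
    preds = {n: [] for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            preds[t].append(n)
    dom = {n: set(nodes) for n in nodes}     # start from "everything"
    dom[start] = {start}
    changed = True
    while changed:                           # iterate to a fixed point
        changed = False
        for n in nodes - {start}:
            new = {n} | set.intersection(*(dom[p] for p in preds[n]))
            if new != dom[n]:
                dom[n], changed = new, True
    return dom

# 1 -> 2 -> 3 -> 4, with a back edge 4 -> 2
cfg = {1: [2], 2: [3], 3: [4], 4: [2]}
dom = dominators(cfg, 1)
print(dom[4])   # {1, 2, 3, 4}: every path from 1 to 4 passes 2 and 3
```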
Data-flow analysis: the process of collecting information that is useful
for the purpose of optimization.
Sk: V1 = V2 + V3
Sk is a definition of V1.
Sk is a use of V2 and V3.
A point is between every two successive statements, as well as before the
first statement and after the last statement.
B1: d1: x := y - m
    d2: y := m
B2: d3: x := x + 1
out[S] = gen[S] ∪ (in[S] − kill[S])
where
out[S]  – definitions that reach the end of statement S
gen[S]  – definitions generated by S
in[S]   – definitions reaching the start of S
kill[S] – definitions in the input to S that are killed in S
Note: the data-flow equations depend on the problem statement.
Data-flow analysis of structured programs:
S ::= id := E | S ; S | if E then S else S | do S while E
E ::= id + id | id
This restricted syntax results in the flowgraph forms depicted below.
Sequence S = S1 ; S2:
   in[S1] = in[S]
   in[S2] = out[S1]
   out[S] = out[S2]
Conditional S = if E then S1 else S2:
   in[S1] = in[S]
   in[S2] = in[S]
   out[S] = out[S1] ∪ out[S2]
Loop S = do S1 while E:
   in[S1] = in[S] ∪ out[S1]
   out[S] = out[S1]
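The sequencing rule in[S2] = out[S1] can be exercised with a small transfer function for reaching definitions. This is an illustrative sketch using the d1/d2/d3 blocks shown earlier; the in/out sets hold (definition-id, variable) pairs:

```python
# Transfer function for a straight-line block: each definition kills all
# other definitions of the same variable and generates itself.

def transfer(stmts, in_set):
    """stmts: list of (def_id, var) pairs; in_set: set of such pairs."""
    out = set(in_set)
    for d, var in stmts:
        out = {p for p in out if p[1] != var}   # kill other defs of var
        out.add((d, var))                       # gen this definition
    return out

B1 = [('d1', 'x'), ('d2', 'y')]   # d1: x := y - m ; d2: y := m
B2 = [('d3', 'x')]                # d3: x := x + 1

out1 = transfer(B1, set())        # in[B1] = {}
out2 = transfer(B2, out1)         # in[B2] = out[B1]
print(sorted(out1))   # [('d1', 'x'), ('d2', 'y')]
print(sorted(out2))   # [('d2', 'y'), ('d3', 'x')] -- d3 kills d1
```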
Goals:
- improve the quality of the target code
- reduce code size
Method:
1. Examine short sequences of target instructions.
2. Replace each sequence by a more efficient one.
Peephole optimization techniques:
• Redundant-instruction elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
• Elimination of unreachable code
• Reduction in strength
Redundant-instruction elimination:
(1) LOAD R0, a            becomes       (1) LOAD R0, a
(2) STORE a, R0
Flow-of-control optimization:
    if a < b goto L1                    if a < b goto L2
    ...                   becomes       ...
L1: goto L2                         L1: goto L2
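A peephole pass for the redundant store can be sketched in a few lines. This illustrative sketch assumes the textual instruction forms shown above ('LOAD reg, mem' / 'STORE mem, reg') and ignores the caveat that the store must not carry a label:

```python
# Drop a STORE that immediately writes back the value just LOADed.

def redundant(prev, cur):
    """True if cur stores back what prev just loaded."""
    if not (prev.startswith('LOAD') and cur.startswith('STORE')):
        return False
    reg, mem = [s.strip() for s in prev[len('LOAD'):].split(',')]
    mem2, reg2 = [s.strip() for s in cur[len('STORE'):].split(',')]
    return reg == reg2 and mem == mem2

def peephole(instrs):
    out = []
    for ins in instrs:
        if out and redundant(out[-1], ins):
            continue                     # skip the redundant store
        out.append(ins)
    return out

print(peephole(['LOAD R0, a', 'STORE a, R0', 'ADD c, R0']))
# ['LOAD R0, a', 'ADD c, R0']
```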
Algebraic simplification – remove statements like:
x := x + 0
x := x * 1
and cancelling pairs such as:
a := a + 1
a := a - 1
Elimination of unreachable code:
Before:
    debug = 0
    ...
    if debug = 1 goto L2
    print debugging information
L2: ...
After: since debug is always 0, the test never succeeds, so the print
statement is unreachable and both it and the test can be removed.
Reduction in strength: the exponentiation operator requires a function
call, so
x := y**2   is replaced by   x := y * y
A control flow graph (CFG), or simply a flow graph, is a directed
multigraph in which:
(i) the nodes are basic blocks; and
(ii) the edges represent flow of control (branches or fall-through
execution).
EXAMPLE : CONTROL FLOW GRAPH FORMATION
B1: (1)  prod := 0
    (2)  i := 1
B2: (3)  t1 := 4 * i
    (4)  t2 := a[t1]
    (5)  t3 := 4 * i
    (6)  t4 := b[t3]
    (7)  t5 := t2 * t4
    (8)  t6 := prod + t5
    (9)  prod := t6
    (10) t7 := i + 1
    (11) i := t7
    (12) if i <= 20 goto (3)
B3: (13) …
Edges: Rule (1) gives B1 → B2 (fall-through); Rule (2) gives B2 → B2
(the goto targets the leader of B2) and B2 → B3 (fall-through when the
condition fails).
Question: Given the control flow graph of a procedure,
how can we identify loops?
A node a in a CFG dominates a node b if every path from the start node
to node b goes through a. We say that node a is a dominator of node b.
Let a flow graph be given with
N: set of vertices
E: set of edges
s: starting node,
and let a ∈ N, b ∈ N.
1. a dominates b, written a dom b, if every path from s to b contains a.
Example flow graph with nodes 1–10 (its edges include (2, 10)).
Direct domination: 1 <d 2, 2 <d 3, …
Dominator sets:
DOM(1) = {1}
DOM(2) = {1, 2}
DOM(3) = {1, 2, 3}
DOM(10) = {1, 2, 10}
Motivation: Programs spend most of the execution time in loops,
therefore there is a larger payoff for optimizations that exploit loop structure.
A dominator tree is a useful way to represent the
dominance relation.
In a dominator tree the start node s is the root, and each node d
dominates only its descendants in the tree.
Depth first ordering in iterative algorithms
Structure based data flow analysis
1. https://onlinecourses.nptel.ac.in/noc21_cs07/preview
2. https://nptel.ac.in/courses/106108113
3. https://archive.nptel.ac.in/courses/106/105/106105190/