
K. RAMAKRISHNAN COLLEGE OF ENGINEERING, TRICHY
INSTITUTE VISION
 To achieve a prominent position among the top technical
institutions.
INSTITUTE MISSION
 M1: To bestow standard technical education par excellence through
state of the art infrastructure, competent faculty and high ethical
standards.
 M2: To nurture research and entrepreneurial skills among students
in cutting edge technologies.
 M3: To provide education for developing high-quality professionals
to transform the society.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
DEPARTMENT VISION
 To create eminent professionals of Computer Science and
Engineering by imparting quality education.
DEPARTMENT MISSION
 M1: To provide technical exposure in the field of Computer
Science and Engineering through state of the art
infrastructure and ethical standards.
 M2: To engage the students in research and development
activities in the field of Computer Science and Engineering.
 M3: To empower the learners to involve in industrial and
multi-disciplinary projects for addressing the societal needs.
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
PROGRAM EDUCATIONAL OBJECTIVES
 Our graduates shall
 PEO1: Analyse, design and create innovative products for
addressing social needs.
 PEO2: Equip themselves for employability, higher studies and
research.
 PEO3: Nurture the leadership qualities and entrepreneurial skills for their successful career.
PROGRAM SPECIFIC OUTCOMES
 Students will be able to
 PSO1: Apply the basic and advanced knowledge in developing
software, hardware and firmware solutions addressing real life
problems.
 PSO2: Design, develop, test and implement product-based solutions
for their career enhancement.
PROGRAM OUTCOMES
 PO1 - Engineering knowledge:
Apply the knowledge of mathematics, science, engineering fundamentals, and an engineering specialization to the solution of complex engineering problems.
 PO2 - Problem analysis:
Identify, formulate, review research literature, and analyze complex engineering problems reaching substantiated conclusions using first principles of mathematics, natural sciences, and engineering sciences.
 PO3 - Design/development of solutions:
Design solutions for complex engineering problems and design system components or processes that meet the specified needs with appropriate consideration for public health and safety, and cultural, societal, and environmental considerations.
 PO4 - Conduct investigations of complex problems:
Use research-based knowledge and research methods including design of experiments, analysis and interpretation of data, and synthesis of the information to provide valid conclusions.
 PO5 - Modern tool usage:
Create, select, and apply appropriate techniques, resources, and modern engineering and IT tools including prediction and modeling to complex engineering activities with an understanding of the limitations.
 PO6 - The engineer and society:
Apply reasoning informed by the contextual knowledge to assess societal, health, safety, legal and cultural issues and the consequent responsibilities relevant to the professional engineering practice.
 PO7 - Environment and sustainability:
Understand the impact of the professional engineering solutions in societal and environmental contexts, and demonstrate the knowledge of, and need for, sustainable development.
 PO8 - Ethics:
Apply ethical principles and commit to professional ethics and responsibilities and norms of the engineering practice.
 PO9 - Individual and team work:
Function effectively as an individual, and as a member or leader in diverse teams, and in multidisciplinary settings.
 PO10 - Communication:
Communicate effectively on complex engineering activities with the engineering community and with society at large, such as being able to comprehend and write effective reports and design documentation, make effective presentations, and give and receive clear instructions.
 PO11 - Project management and finance:
Demonstrate knowledge and understanding of the engineering and management principles and apply these to one's own work, as a member and leader in a team, to manage projects and in multidisciplinary environments.
 PO12 - Life-long learning:
Recognize the need for, and have the preparation and ability to engage in, independent and life-long learning in the broadest context of technological change.
CS8602
COMPILER DESIGN

UNIT I
INTRODUCTION TO
COMPILERS
COURSE OUTCOME
S.No     Knowledge Level    Description
C313.1   K2                 Understand fundamentals of the compiler phases and lexical analyzer.
C313.2   K3                 Analyze complex engineering problems using the principles of grammars and different parser designs.
C313.3   K6                 Design solutions for complex engineering problems using syntax-directed translation schemes and intermediate code generation.
C313.4   K3                 Analyze run-time environments and design a simple code generator.
C313.5   K2                 Understand the implementation of various code optimization techniques.
UNIT I
 Structure of a compiler
 Lexical Analysis

 Role of Lexical Analyzer

 Input Buffering

 Specification of Tokens

 Recognition of Tokens

 LEX

 Finite Automata

 Regular Expressions to Automata

 Minimizing DFA.
A compiler is a program that reads a program written in one language (the source language) and translates it into an equivalent program in another language (the target language).
 To convert human readable source code into machine
executable code.

 To run the program.

 To identify errors in the program.

 All software is written in a programming language.

 Machines understand only 1s and 0s. High-level languages make it easier for the user to program, but not for the machine to understand.

 Once the programmer has written and edited the program (in an editor), it needs to be translated into machine language (1s and 0s) before it can be executed.

 Compilers are used to do this conversion.


ALGORITHMS AND MODELS USED IN
COMPILERS

• Automata, Regular Expressions (Lexing)

• Context Free Grammars, Trees (Parsing)

• Hash tables (Symbol table)


 It is a program that translates a program written
in one language to an equivalent program in
another language.

 Types of Translator:
1. Interpreter
2. Compiler
3. Assembler
MACHINE LANGUAGE
(FIRST GENERATION)

 Machine coding (binary programming via punched holes): first generation.
 The computer's native language: binary digits (0s, 1s), e.g.
0100 0001 0110 1110 0100 0001
0001 0010 1100 0100 0000 1101

 Programming in machine code is:
• very slow,
• error prone,
• requires a detailed knowledge of the relevant computer architecture,
• difficult to understand other people’s code,
• code becomes obsolete if the machine is changed.
8B542408 83FA0077 06B80000
0000C383 FA027706 B8010000
00C353BB 01000000 B9010000
008D0419 83FA0376 078BD98B
C84AEBF1 5BC3
ASSEMBLY LANGUAGE
(SECOND GENERATION)

 Assembly language (second generation): one-to-one correspondence to machine language, e.g.

MOV AX, 5h
MOV DX, 3h
ADD AX, DX

 Assembler: translates assembly language programs into machine language.
 Languages closer to machine instructions:

 C programming supports inline assembly language programs.

 Using the inline assembly feature in C, we can directly access system registers.

 C is used to access memory directly using pointers.

 C also supports high-level language features.

 It is more user-friendly compared to the previous languages, so C is called a middle-level language.
Procedural Languages (Third Generation)

 Instructions translate into machine language instructions.
 Use common words rather than abbreviated mnemonics.
 Examples: C++, Java, Fortran.

A = 3
B = A * 2 - 1
D = A / B + A^5

 Compiler: translates the entire program at once.
 Interpreter: translates and executes one source program statement at a time.
 Expressions: such as +, -, *, /

 Data Types: simple types (e.g. Boolean, int, float) as well as composite structures (records) and arrays can be defined by the programmer

 Control Structures: allow programming of selective computation as well as iterative computation

 Declarations: introduce identifiers to denote constant values, variables, procedures, etc.
 Understandability (readability)

 Naturalness (languages for different applications)

 Portability (machine-independent)

 Efficient to use (development time)

 Extensibility (easy to extend)

 Communicability (easy to explain code to others)


 Interpreters are another class of translators.

 Compiler: translates a program once and for all into the target language, e.g. C++.

 Interpreter: effectively translates a source program every time it is run.

 Compilers and interpreters can be used together (hybrids), e.g. Java:
  Java is compiled and stored as Java bytecode.
  Bytecode is interpreted by a Java Virtual Machine (JVM).


No.  Compiler                                          Interpreter
1    Takes the entire program as input                 Takes a single instruction as input
2    Intermediate object code is generated             No intermediate object code is generated
3    Conditional control statements execute faster     Conditional control statements execute slower
4    Memory requirement: more (since object code       Memory requirement: less
     is generated)
5    The program need not be compiled every time       Every time, the higher-level program is
                                                       converted into a lower-level program
6    Errors are displayed after the entire program     Errors are displayed for every instruction
     is checked                                        interpreted (if any)
7    Example: C compiler                               Example: BASIC
 BASIC (Beginner's All-purpose Symbolic Instruction Code) is called an interpreted language.


 Single Pass Compilers
 Multi-Pass Compilers
 Load and Go Compilers
 Cross Compilers
 Optimizing Compilers
• Preprocessor
• Assembler
• Loader / Link editor


 A preprocessor is a program that processes its
input data to produce output that is used as input to
another program.

 They may perform the following functions:
1. Macro processing
2. File inclusion
3. Rational preprocessors
4. Language extension


 Assembler creates object code by translating
assembly instruction mnemonics into machine code.

 There are two types of assemblers: one-pass and two-pass.

 One-pass assemblers go through the source code once and assume that all symbols will be defined before any instruction that references them.

 Two-pass assemblers create a table with all symbols and their values in the first pass, and then use the table in a second pass to generate code.
 A linker or link editor is a program that takes
one or more objects generated by a compiler and
combines them into a single executable program.

 Three tasks of the linker are:
1. Searches the program to find library routines used by
program, e.g. printf(), math routines.
2. Determines the memory locations that code from each
module will occupy and relocates its instructions by
adjusting absolute references
3. Resolves references among files.
A loader is the part of an operating system that is responsible for loading programs into memory, one of the essential stages in the process of starting a program.

 Steps in program execution:
1. Source program → object program (compiling)
2. Linking, loading → absolute program
3. Input → output
There are two parts of compilation:

 Part 1: Analysis: breaks up the source program into constituent pieces and creates an intermediate representation of the source program.

 Part 2: Synthesis: constructs the desired target program from the intermediate representation. It requires the most specialized techniques.
Analysis consists of three phases:

 Lexical Analysis (linear analysis or scanning): the input is read from left to right and grouped into tokens, which are sequences of characters having a collective meaning.

 Syntax Analysis (hierarchical analysis or parsing): characters or tokens are grouped hierarchically into nested collections with collective meaning.

 Semantic Analysis: certain checks are performed to ensure that the components of a program fit together meaningfully.
 Structure Editors (syntax highlighting)

 Pretty printers (e.g. Doxygen)

 Static checkers (e.g. Lint and Splint)

 Interpreters

 Input: sequence of characters

 Output: tokens (groups of successive characters which belong together logically)

 Translate the input program, entered as a sequence of characters, into a sequence of words or symbols (tokens). For example, the keyword for should be treated as a single entity, not as a 3-character string.

position := initial + rate * 60

 The assignment statement would be grouped into the following tokens:
1. The identifier position
2. The assignment symbol :=
3. The identifier initial
4. The plus sign +
5. The identifier rate
6. The multiplication sign *
7. The number 60

Note: the blank separating the characters of these tokens would normally be eliminated
during lexical analysis
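
To make this concrete, here is a minimal C sketch of a hand-written scanner loop for exactly this statement; the token names (T_ID, T_ASSIGN, ...) and the next_token() helper are illustrative assumptions, not part of the original material:

#include <stdio.h>
#include <ctype.h>
#include <string.h>

enum { T_ID, T_ASSIGN, T_PLUS, T_MULT, T_NUM, T_EOF };   /* token kinds */

static const char *src = "position := initial + rate * 60";
static const char *p;                     /* scanning cursor */

/* Return the kind of the next token and copy its lexeme into buf. */
static int next_token(char *buf) {
    while (*p == ' ') p++;                /* blanks are eliminated here */
    if (*p == '\0') return T_EOF;
    if (isalpha((unsigned char)*p)) {     /* identifier: letter(letter|digit)* */
        int n = 0;
        while (isalnum((unsigned char)*p)) buf[n++] = *p++;
        buf[n] = '\0';
        return T_ID;
    }
    if (isdigit((unsigned char)*p)) {     /* number: digit+ */
        int n = 0;
        while (isdigit((unsigned char)*p)) buf[n++] = *p++;
        buf[n] = '\0';
        return T_NUM;
    }
    if (p[0] == ':' && p[1] == '=') { strcpy(buf, ":="); p += 2; return T_ASSIGN; }
    buf[0] = *p++; buf[1] = '\0';         /* only + and * remain in this input */
    return buf[0] == '+' ? T_PLUS : T_MULT;
}

int main(void) {
    char lexeme[64];
    int t;
    for (p = src; (t = next_token(lexeme)) != T_EOF; )
        printf("<%d, \"%s\">\n", t, lexeme);   /* prints the 7 tokens above */
    return 0;
}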
 Syntax Analysis (Hierarchical Analysis or Parsing)

 Input: Sequence of tokens

 Output: Parse tree, error messages

 It involves grouping the tokens of the source program into grammatical phrases that are used by the compiler to synthesize output. Usually, the grammatical phrases of the source program are represented by a parse tree such as the following:

 Determine the structure of the program; for example, identify the components of each statement and expression and check for syntax errors.
          :=
        /    \
 position     +
            /   \
     initial     *
               /   \
            rate    60
 Input: parse tree + symbol table

 Output: annotated tree (abstract tree with attributes); the symbol table holds the variables and information on their types, etc.

 Checks the source program for semantic errors and gathers type information for the subsequent code generation phase.

 It uses the hierarchical structure determined by the syntax-analysis phase.

 Checks that the program is reasonable; for example, that it does not include references to undefined variables.

 An important component of semantic analysis is type checking.
 Intermediate Code Generation: produces a program for an abstract machine. It should be easy to produce and easy to translate into the target program. [Three-address code]

 Code Optimization: attempts to improve the intermediate code. The program can be improved during the code optimization phase.

 Machine code / assembly code Generation: memory locations are selected for each of the variables used by the program. Intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers. [MOV, ADD, SUB, JMP, ...]
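
For instance, for the earlier statement position := initial + rate * 60, a plausible three-address code sequence (the temporary names t1 and t2 are illustrative) is:

t1 := rate * 60
t2 := initial + t1
position := t2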
Construct syntax trees for the following expressions:

1. a=b-c*20

2. A=B+C*50

3. ans:= a+ - b*5.3^2^(1-f)

4. X=Y=Z
ERRORS ENCOUNTERED IN DIFFERENT PHASES
 Lexical analysis:
Faulty sequence of characters which does not result in a token, e.g. Ö, 5EL, %K, 'string
 Syntax analysis:
Syntax error (e.g. missing semicolon), (4 * (y + 5) - 12))
 Semantic analysis:
Type conflict, e.g. 'HEJ' + 5
 Code optimization:
Uninitialized variables, anomaly detection.
 Code generation:
Too large integers, running out of memory.
 Table management:
Double declaration, table overflow.

A good compiler finds an error at the earliest occasion. Usually, some errors are left to run time, e.g. array index out of bounds.
 1 Pass: Reading the source program once.
 A collection of phases is done only once (single
pass) or multiple times (multi pass)
 SINGLE PASS: usually requires everything to be
defined before being used in source program.
 MULTI PASS: compiler may have to keep entire
program representation in memory.
 Several phases can be grouped into one single pass
and the activities of these phases are interleaved
during the pass. For example, lexical analysis,
syntax analysis, semantic analysis and
intermediate code generation might be grouped
into one pass.
 Manual approach – by hand
◦ To identify the occurrence of each lexeme
◦ To return the information about the identified token

 Automatic approach - lexical-analyzer generator


◦ Compiles lexeme patterns into code that functions as
a lexical analyzer
◦ e.g. Lex, Flex, …
◦ Steps
 Regular expressions - notation for lexeme patterns
 Nondeterministic automata
 Deterministic automata
 Driver - code which simulates automata
Compiler front-end and back-end phases
 Front end: analysis (machine independent)
Largely dependent on the source language; independent of the target machine.

 Back end: synthesis (machine dependent)
Largely dependent on the target machine architecture; independent of the source language.
 Software development tools are available to
implement one or more compiler phases

 Scanner generators
 Parser generators
 Syntax-directed translation engines
 Automatic code generators
 Data-flow engines

LEXICAL ANALYSIS
THE ROLE OF THE LEXICAL ANALYZER
 Read input characters
 To group them into lexemes

 Produce as output a sequence of tokens


 input for the syntactical analyzer
 Interact with the symbol table
 Insert identifiers
THE ROLE OF THE LEXICAL ANALYZER
 to strip out
 comments
 whitespaces: blank, newline, tab, …
 other separators

 to correlate error messages generated by the


compiler with the source program
 to keep track of the number of newlines seen
 to associate a line number with each error message
LEXICAL ANALYSIS VS. PARSING
 Simplicity of design
◦ Separation of lexical from syntactical analysis ->
simplify at least one of the tasks
◦ e.g. parser dealing with white spaces -> complex
◦ Cleaner overall language design
 Improved compiler efficiency
◦ Liberty to apply specialized techniques that serves
only lexical tasks, not the whole parsing
◦ Speedup reading input characters using specialized
buffering techniques
 Enhanced compiler portability
◦ Input device peculiarities are restricted to the lexical
analyzer
ATTRIBUTE-VALUE PAIRS OF TOKENS
 E = M * C ** 2
 <id, pointer to symbol table entry for E>
 <assign_op>
 <id, pointer to symbol-table entry for M>
 <mult_op>
 <id, pointer to symbol-table entry for C>
 <exp_op>
 <number, integer value 2>
INPUT BUFFERING
 How to speed the reading of source program ?
 to look one additional character ahead
 e.g.
◦ to see the end of an identifier you must see a character
 which is not a letter or a digit
 not a part of the lexeme for id
◦ in C
 - ,= , <
 ->, ==, <=
 two buffer scheme that handles large lookaheads
safely
 sentinels – improvement which saves time checking
buffer ends
BUFFER PAIRS
 Buffer size N
 N = size of a disk block (4096)
 read N characters into a buffer
◦ one system call
◦ not one call per character

 read < N characters we encounter eof


 two pointers to the input are maintained
◦ lexemeBegin – marks the beginning of the current
lexeme
◦ forward – scans ahead until a pattern match is found
SENTINELS
 forward pointer
◦ to test if it is at the end of the buffer
◦ to determine what character is read (multiway branch)

 sentinel
◦ added at each buffer end
◦ can not be part of the source program
◦ character eof is a natural choice
 retains the role of entire input end
 when appears other than at the end of a buffer it means that the
input is at an end
LOOKAHEAD CODE WITH SENTINELS
switch(*forward++)
{
case eof:
if(forward is at the end of the first buffer)
{
reload second buffer;
forward = beginning of the second buffer;
}
else if(forward is at the end of the second buffer)
{
reload first buffer;
forward = beginning the first buffer;
}
else
/* eof within a buffer marks the end of input */
terminate lexical analysis;
break;
}
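
A minimal C sketch of the buffer-pair setup that this pseudocode assumes; the buffer size, the sentinel value, and the reload() helper are illustrative assumptions:

#include <stdio.h>

#define N 4096                      /* buffer size = one disk block */
#define SENTINEL 0x1A               /* eof character; assumed not to occur in source */

char buffer[2 * N + 2];             /* two halves, plus one sentinel slot each */
char *lexemeBegin, *forward;        /* the two pointers maintained by the scanner */

/* Fill one half with up to N characters (one system call) and add the sentinel. */
static size_t reload(FILE *in, char *half) {
    size_t got = fread(half, 1, N, in);   /* got < N means real end of input */
    half[got] = SENTINEL;                 /* sentinel marks the end of this half */
    return got;
}

int main(void) {
    lexemeBegin = forward = buffer;
    reload(stdin, buffer);          /* load the first half before scanning starts */
    return 0;
}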
ERROR RECOVERY IN LEXICAL ANALYZER
 Simplest error recovery strategy is panic mode error
recovery.
 Panic mode:

 Successive characters are deleted from the remaining


input until the lexical analyzer can find a well formed
token.

 Possible error recovery actions are:
· Deleting an extraneous character
· Inserting a missing character
· Replacing incorrect character by a correct character
· Transposing two adjacent characters
EXPRESSING TOKENS BY REGULAR EXPRESSIONS

 Regular Expressions: algebraic notation to specify tokens

 Operators used:
 + (union)
 * (closure)
 . (concatenation)
TOKENS, PATTERNS, LEXEMES
 Token - pair of:
<token type, [optional value]>
keyword, identifier, …

 Pattern
◦ description of the form that the lexeme of a token may take
◦ e.g.
 Identifiers : Letter followed by a sequence of letters
and digits
 Number : Combination of one or more digits
 String literal : Anything enclosed in “ and ”
 Lexeme
◦ a sequence of characters in the source program matching a
pattern for a token
EXAMPLES OF TOKENS
Token        Informal Description                     Sample Lexemes
if           characters i, f                          if
else         characters e, l, s, e                    else
comparison   < or > or <= or >= or == or !=           <=, !=
id           letter followed by letters and digits    pi, score, D2
number       any numeric constant                     3.14159, 0, 02e23
literal      anything but ", surrounded by "          "core dumped"
EXAMPLES OF TOKENS
 One token for each keyword
◦ Keyword pattern = keyword itself
 Tokens for operators
◦ Individually or in classes
 One token for all identifiers
 One or more tokens for constants
◦ Numbers, literal strings
 Tokens for each punctuation symbol
◦ (),;
REGULAR EXPRESSIONS FOR:

 Identifiers
 Keywords

 Symbols & Operators

 Numbers/digits
An identifier is defined as a letter followed by zero or
more letters or digits.
The regular expression for an identifier is given as
letter (letter | digit)*

 A number is defined as a digit followed by a sequence of digits:
digit (digit)*
REGULAR EXPRESSIONS
 Identifiers : letter(letter|digit)*
 Keywords : if, else, do, int, float, case
 Symbols & Operators : <=, >=, <>, +, *, $, (, ), ...

 Numbers/digits : digit(digit)* or digit+


REGULAR EXPRESSION TO DFA
TWO METHODS

 RE to NFA to DFA
(Modified subset construction method)
 RE to DFA (Direct method)
METHOD 1
RE to NFA to DFA
(Modified subset construction)
FINITE AUTOMATA
 Finite Automata are used as a model for:

 Software for designing digital circuits

 Lexical analyzer of a compiler

 Searching for keywords in a file or on the web.

 Software for verifying finite state systems, such as


communication protocols.
NONDETERMINISTIC FINITE AUTOMATA
 An NFA is a 5-tuple (S, Σ, δ, s0, F) where

 S is a finite set of states
 Σ is a finite set of symbols, the alphabet
 δ is a mapping from S × Σ to a set of states
 s0 ∈ S is the start state
 F ⊆ S is the set of accepting (or final) states
TRANSITION GRAPH
 An NFA can be diagrammatically represented by a labeled directed graph called a transition graph.

Example (NFA accepting (a|b)*abb; state 0 loops to itself on a and b):

start → 0 --a--> 1 --b--> 2 --b--> 3 (final)

S = {0,1,2,3},  Σ = {a,b},  s0 = 0,  F = {3}
TRANSITION TABLE
 The mapping δ of an NFA can be represented in a transition table:

δ(0,a) = {0,1}        State    a        b
δ(0,b) = {0}          0        {0,1}    {0}
δ(1,b) = {2}          1        -        {2}
δ(2,b) = {3}          2        -        {3}
THE LANGUAGE DEFINED BY AN NFA
 An NFA accepts an input string x if and only if there is some path with edges labeled with the symbols of x in sequence from the start state to some accepting state in the transition graph.
 A state transition from one state to another on the path is called a move.
 The language defined by an NFA is the set of input strings it accepts, such as (a|b)*abb for the example NFA.
FROM REGULAR EXPRESSION TO -
NFA (THOMPSON’S CONSTRUCTION)


start
i  f

82
a start a
i f

start  N(r1) 
r 1 r 2 i f
 N(r2) 
r1r2 start
i N(r1) N(r2) f


r* start
i  N(r)  f


COMBINING THE NFAs OF A SET OF REGULAR EXPRESSIONS

Given the Lex-style specification:
a      { action1 }
abb    { action2 }
a*b+   { action3 }

each pattern gets its own NFA:
a:     1 --a--> 2
abb:   3 --a--> 4 --b--> 5 --b--> 6
a*b+:  7 --b--> 8, with an a-loop on state 7 and a b-loop on state 8

The combined NFA adds a new start state 0 with ε-transitions to states 1, 3 and 7.
DETERMINISTIC FINITE AUTOMATA
 A deterministic finite automaton (DFA) is a special case of an NFA:
 No state has an ε-transition
 For each state s and input symbol a there is at most one edge labeled a leaving s
 Each entry in the transition table is a single state
 At most one path exists to accept a string
 The simulation algorithm is simple
EXAMPLE DFA

A DFA that accepts (a|b)*abb:

start → 0 --a--> 1 --b--> 2 --b--> 3 (final)
Other transitions: 0 --b--> 0,  1 --a--> 1,  2 --a--> 1,  3 --a--> 1,  3 --b--> 0
CONVERSION OF AN NFA INTO A DFA
 The subset construction algorithm converts an NFA into a DFA using:
ε-closure(s) = {s} ∪ {t | s → ... → t via ε-transitions}
ε-closure(T) = ∪ (s ∈ T) ε-closure(s)
move(T,a) = {t | s --a--> t and s ∈ T}
 The algorithm produces:
Dstates, the set of states of the new DFA, consisting of sets of states of the NFA
Dtran, the transition table of the new DFA
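
As a sketch, ε-closure can be computed with a simple work-list over the ε-edges. The bitmask NFA representation below is an illustrative assumption; main() reproduces ε-closure({0}) = {0,1,3,7} from the combined NFA shown earlier:

#include <stdio.h>

#define MAX_STATES 32                      /* a state set fits in one 32-bit word */

unsigned eps[MAX_STATES];                  /* eps[s] = states reachable from s by one ε-move */

/* ε-closure(T): T plus every state reachable from T via ε-edges only. */
unsigned eps_closure(unsigned T, int nstates) {
    int stack[MAX_STATES], top = 0;
    unsigned closure = T;
    for (int s = 0; s < nstates; s++)
        if (T & (1u << s)) stack[top++] = s;          /* seed the work list */
    while (top > 0) {
        int s = stack[--top];
        for (int t = 0; t < nstates; t++)
            if ((eps[s] & (1u << t)) && !(closure & (1u << t))) {
                closure |= 1u << t;                   /* newly reached state */
                stack[top++] = t;
            }
    }
    return closure;
}

int main(void) {
    eps[0] = (1u << 1) | (1u << 3) | (1u << 7);       /* 0 -ε-> 1, 3, 7 */
    unsigned c = eps_closure(1u << 0, 9);
    for (int s = 0; s < 9; s++)
        if (c & (1u << s)) printf("%d ", s);          /* prints: 0 1 3 7 */
    printf("\n");
    return 0;
}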
ε-CLOSURE AND MOVE EXAMPLES
(using the combined NFA for a, abb, a*b+ shown earlier)

ε-closure({0}) = {0,1,3,7}
move({0,1,3,7}, a) = {2,4,7}
ε-closure({2,4,7}) = {2,4,7}
move({2,4,7}, a) = {7}
ε-closure({7}) = {7}
move({7}, b) = {8}
ε-closure({8}) = {8}
move({8}, a) = ∅

So on input aab the sets visited are {0,1,3,7} → {2,4,7} → {7} → {8}; a further a leads nowhere.
ε-closure and move are also used to simulate NFAs directly.
SUBSET CONSTRUCTION
Applying the subset construction to the ε-NFA for (a|b)*abb (Thompson states 0-10) gives:

Dstates:
A = {0,1,2,4,7}
B = {1,2,3,4,6,7,8}
C = {1,2,4,5,6,7}
D = {1,2,4,5,6,7,9}
E = {1,2,4,5,6,7,10}    (accepting)

Dtran:       a     b
A            B     C
B            B     D
C            B     C
D            B     E
E            B     C
SUBSET CONSTRUCTION EXAMPLE 2
Applying the subset construction to the combined NFA for a, abb, a*b+:

Dstates:
A = {0,1,3,7}
B = {2,4,7}     (accepts a)
C = {8}         (accepts a*b+)
D = {7}
E = {5,8}       (accepts a*b+)
F = {6,8}       (accepts abb, which takes priority over a*b+)

Dtran:       a     b
A            B     C
B            D     E
C            -     C
D            D     C
E            -     F
F            -     C
METHOD 2
RE to DFA (Direct method)
DIRECT METHOD

 STEPS
1. Syntax tree construction
2. Compute NULLABLE, FIRSTPOS,
LASTPOS & FOLLOWPOS
3. Construct DFA from FOLLOWPOS
ANNOTATING THE TREE
 nullable(n): the sub tree at node n generates
languages including the empty string

 firstpos(n): set of positions that can match


the first symbol of a string generated by the
sub tree at node n
 lastpos(n): the set of positions that can match
the last symbol of a string generated be the
sub tree at node n
 followpos(i): the set of positions that can
follow position i in the tree
Node n                nullable(n)           firstpos(n)                    lastpos(n)
leaf ε                true                  ∅                              ∅
leaf with position i  false                 {i}                            {i}
or-node c1 | c2       nullable(c1) or       firstpos(c1) ∪ firstpos(c2)    lastpos(c1) ∪ lastpos(c2)
                      nullable(c2)
cat-node c1 c2        nullable(c1) and      if nullable(c1) then           if nullable(c2) then
                      nullable(c2)          firstpos(c1) ∪ firstpos(c2)    lastpos(c1) ∪ lastpos(c2)
                                            else firstpos(c1)              else lastpos(c2)
star-node c1*         true                  firstpos(c1)                   lastpos(c1)
FROM REGULAR EXPRESSION TO DFA DIRECTLY: SYNTAX TREE OF (a|b)*abb#

Positions: a→1, b→2 (inside the star), a→3, b→4, b→5, #→6.
firstpos/lastpos annotations, bottom-up:
 leaf a(1): {1}/{1}, leaf b(2): {2}/{2}
 or-node a|b: {1,2}/{1,2}
 star-node (a|b)*: nullable; {1,2}/{1,2}
 cat-node with a(3): {1,2,3}/{3}
 cat-node with b(4): {1,2,3}/{4}
 cat-node with b(5): {1,2,3}/{5}
 root cat-node with #(6): {1,2,3}/{6}
FROM REGULAR EXPRESSION TO DFA DIRECTLY: FOLLOWPOS

for each node n in the tree do
  if n is a cat-node with left child c1 and right child c2 then
    for each i in lastpos(c1) do
      followpos(i) := followpos(i) ∪ firstpos(c2)
    end do
  else if n is a star-node
    for each i in lastpos(n) do
      followpos(i) := followpos(i) ∪ firstpos(n)
    end do
  end if
end do
FROM REGULAR EXPRESSION TO DFA DIRECTLY: ALGORITHM

s0 := firstpos(root) where root is the root of the syntax tree
Dstates := {s0}, unmarked
while there is an unmarked state T in Dstates do
  mark T
  for each input symbol a ∈ Σ do
    let U be the set of positions that are in followpos(p)
      for some position p in T, such that the symbol at position p is a
    if U is not empty and not in Dstates then
      add U as an unmarked state to Dstates
    end if
    Dtran[T,a] := U
  end do
end do
FROM REGULAR EXPRESSION TO DFA DIRECTLY: EXAMPLE

Node   followpos
1      {1, 2, 3}
2      {1, 2, 3}
3      {4}
4      {5}
5      {6}
6      -

Resulting DFA (start state = firstpos(root) = {1,2,3}; {1,2,3,6} is accepting since it contains the position of #):

                 a            b
{1,2,3}      {1,2,3,4}    {1,2,3}
{1,2,3,4}    {1,2,3,4}    {1,2,3,5}
{1,2,3,5}    {1,2,3,4}    {1,2,3,6}
{1,2,3,6}    {1,2,3,4}    {1,2,3}
MINIMIZATION OF DFA

 Goal: to reduce the number of states of a DFA without altering the language accepted by the DFA.

 METHOD: π-partitioning algorithm
1. Construct an initial partition π of the set of states with 2 groups:
   - accepting states
   - non-accepting states
2. Apply the splitting procedure to π and construct a new partition π-new.
3. If π-new = π, make π-final = π and continue with step 4; else repeat step 2 with π = π-new.
4. Pick a representative for each group; it will be a state of the minimized DFA M'.
5. The start state of M' is the group containing the start state of the original DFA M.
THE DFA FOR (a|b)*abb
APPLYING THE π-PARTITIONING ALGORITHM
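
The slides' figures are not reproduced here; as a worked sketch, apply the algorithm to the DFA for (a|b)*abb obtained earlier by subset construction (states A-E, with E accepting):

π = {A,B,C,D} {E}                  (non-accepting / accepting)
On b, D goes to E while A, B, C do not, so D splits off:  π-new = {A,B,C} {D} {E}
On b, B goes to D while A and C go to C, so B splits off: π-new = {A,C} {B} {D} {E}
A and C behave identically (a → B, b → C), so the partition is stable: π-final = {A,C} {B} {D} {E}

Picking A to represent {A,C} gives the minimized 4-state DFA:
A: a→B, b→A     B: a→B, b→D     D: a→B, b→E     E: a→B, b→A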
A LANGUAGE FOR SPECIFYING
LEXICAL ANALYZERS : LEX
COMPILATION SEQUENCE
WHAT IS LEX?
 The main job of a lexical analyzer (scanner) is to
break up an input stream into more usable
elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI

 Lex is a utility to help you rapidly generate your scanners.
LEX – LEXICAL ANALYZER
 Lexical analyzers tokenize input streams
 Tokens are the terminals of a language
 English
 words, punctuation marks, …
 Programming language
 Identifiers, operators, keywords, …
 Regular expressions define terminals/tokens
LEX SOURCE

 Lex source is separated into three sections by


%% delimiters
 The general format of Lex source is:

{definitions}
%%                  (required)
{transition rules}
%%                  (optional, together with the user subroutines)
{user subroutines}
 The absolute minimum Lex program is thus:
%%
LEX SOURCE PROGRAM

 Lex source is a table of


 regular expressions and
 corresponding program fragments

digit [0-9]
letter [a-zA-Z]
%%
{letter}({letter}|{digit})* printf("id: %s\n", yytext);
\n printf("new line\n");
%%
main()
{
yylex();
}
LEX SOURCE TO C PROGRAM
 The table is translated to a C program (lex.yy.c)
which
 reads an input stream
 partitioning the input into strings which match the
given expressions and
 copying it to an output stream if necessary
AN OVERVIEW OF LEX

Lex source program → [Lex] → lex.yy.c
lex.yy.c → [C compiler] → a.out
input → [a.out] → tokens
LEX V.S. YACC
 Lex
 Lex generates C code for a lexical analyzer, or
scanner
 Lex uses patterns that match strings in the
input and converts the strings to tokens

 Yacc
 Yacc generates C code for syntax analyzer, or
parser.
 Yacc uses grammar rules that allow it to
analyze tokens from Lex and create a syntax
tree.
LEX WITH YACC

Lex source (lexical rules) → [Lex] → lex.yy.c
Yacc source (grammar rules) → [Yacc] → y.tab.c

Input → yylex() --(returns token)--> yyparse() → parsed input
(yyparse() calls yylex() whenever it needs the next token)
LEX REGULAR EXPRESSIONS (EXTENDED
REGULAR EXPRESSIONS)
A regular expression matches a set of
strings
 Regular expression
 Operators
 Character classes
 Arbitrary character
 Optional expressions
 Alternation and grouping
 Context sensitivity
 Repetitions and definitions
OPERATORS
“ \ [ ] ^ - ? . * + | ( ) $ / { } % < >

 If they are to be used as text characters, an escape should be used:
\$ = "$"
\\ = "\"
 Every character but blank, tab (\t), newline (\n)
and the list above is always a text character
CHARACTER CLASSES []
 [abc] matches a single character, which may
be a, b, or c
 Every operator meaning is ignored except \ -
and ^
 e.g.
[ab] => a or b
[a-z] => a or b or c or … or z
[-+0-9] => all the digits and the two signs
[^a-zA-Z] => any character which is not a
letter
ARBITRARY CHARACTER .
 To match almost any character, use the operator . : it is the class of all characters except newline.

 [\40-\176] matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176 (tilde ~).
OPTIONAL & REPEATED
EXPRESSIONS
 a? => zero or one instance of a
 a* => zero or more instances of a
 a+ => one or more instances of a

 E.g.
ab?c => ac or abc
[a-z]+ => all strings of lower case
letters
[a-zA-Z][a-zA-Z0-9]* => all
alphanumeric strings with a leading
alphabetic character
PRECEDENCE OF OPERATORS
 Level of precedence
 Kleene closure (*), ?, +
 concatenation
 alternation (|)

 All operators are left associative.


 Ex: a*b|cd* = ((a*)b)|(c(d*))
PATTERN MATCHING PRIMITIVES
Metacharacter Matches
. any character except newline
\n newline
* zero or more copies of the preceding expression
+ one or more copies of the preceding expression
? zero or one copy of the preceding expression
^ beginning of line / complement
$ end of line
a|b a or b
(ab)+ one or more copies of ab (grouping)
[ab] a or b
a{3} 3 instances of a
“a+b” literal “a+b” (C escapes still work)
RECALL: LEX SOURCE
 Lex source is a table of
 regular expressions and
 corresponding program fragments (actions)

%%
<regexp>  <action>
<regexp>  <action>
%%

For example, with the rule
%%
"="  printf("operator: ASSIGNMENT");
the input "a = b + c;" produces the output "a operator: ASSIGNMENT b + c;" (unmatched text is echoed).
TRANSITION RULES
 regexp <one or more blanks> action (C code);
 regexp <one or more blanks> { actions (C code) }

 A null statement ; will ignore the input (no actions):
[ \t\n] ;
 causes the three spacing characters to be ignored, e.g.
a = b + c;
d = b * c;

↓↓
a=b+c;d=b*c;
TRANSITION RULES (CONT'D)

 Four special options for actions: |, ECHO;, BEGIN, and REJECT;
 | indicates that the action for this rule is taken from the action for the next rule:
 [ \t\n] ;
is equivalent to
 " "  |
 "\t" |
 "\n" ;
 An unmatched token uses a default action that ECHOes it from the input to the output.
TRANSITION RULES (CONT’D)
 REJECT
 Go do the next alternative


%%
pink {npink++; REJECT;}
ink {nink++; REJECT;}
pin {npin++; REJECT;}
.|
\n ;
%%

LEX PREDEFINED VARIABLES
 yytext -- a string containing the lexeme
 yyleng -- the length of the lexeme
 yyin -- the input stream pointer
 the default input of default main() is stdin
 yyout -- the output stream pointer
 the default output of default main() is stdout.
 cs20: %./a.out < inputfile > outfile

 E.g.
[a-z]+ printf(“%s”, yytext);
[a-z]+ ECHO;
[a-zA-Z]+ {words++; chars += yyleng;}
LEX LIBRARY ROUTINES
 yylex()
 The default main() contains a call of yylex()
 yymore()
 return the next token
 yyless(n)
 retain the first n characters in yytext
 yywrap()
 is called whenever Lex reaches an end-of-file
 The default yywrap() always returns 1
REVIEW OF LEX PREDEFINED VARIABLES
Name                Function
char *yytext        pointer to matched string
int yyleng          length of matched string
FILE *yyin          input stream pointer
FILE *yyout         output stream pointer
int yylex(void)     call to invoke lexer, returns token
yymore()            append the next match to the current yytext
yyless(int n)       retain the first n characters in yytext
int yywrap(void)    wrapup, return 1 if done, 0 if not done
ECHO                write matched string
REJECT              go to the next alternative rule
INITIAL             initial start condition
BEGIN               condition switch start condition
USER SUBROUTINES SECTION

 You can use your Lex routines in the same ways you use routines in other programming languages.
%{
void foo();
%}
letter [a-zA-Z]
%%
{letter}+ foo();
%%

void foo() {

}
USER SUBROUTINES SECTION
(CONT’D)
 The section where main() is placed

%{
int counter = 0;
%}
letter [a-zA-Z]

%%
{letter}+ {printf("a word\n"); counter++;}

%%
main() {
yylex();
printf("There are total %d words\n", counter);
}
USAGE
 To run Lex on a source file, type
lex scanner.l
 It produces a file named lex.yy.c which is a C program
for the lexical analyzer.
 To compile lex.yy.c, type
cc lex.yy.c –ll
 To run the lexical analyzer program, type
./a.out < inputfile
VERSIONS OF LEX

 AT&T -- lex
http://www.combo.org/lex_yacc_page/lex.html
 GNU -- flex
http://www.gnu.org/manual/flex-2.5.4/flex.html
 a Win32 version of flex :
http://www.monmouth.com/~wstreett/lex-yacc/lex-yacc.html
or Cygwin :
http://sources.redhat.com/cygwin/

 Lex on different machines is not created equal.


DESIGN OF LEXICAL
ANALYZERS
TO IDENTIFY TOKENS IN PROGRAMMING
LANGUAGE
 The main job of a lexical analyzer (scanner) is
to break up an input stream into more usable
elements (tokens)
a = b + c * d;
ID ASSIGN ID PLUS ID MULT ID SEMI

 Lex is a utility to help you rapidly generate your scanners.
REGULAR DEFINITIONS
 delim [\t\n]
 ws {delim}+
 letter [A-Za-z]

 digit [0-9]

 id {letter}({letter}|{digit})*
 number {digit}+(\.{digit}+)?(E[+\-]?{digit}+)?
ACTION - EVENT
%%
{ws} {/*noaction */}
if {return(IF);}
then {return(THEN);}
else {return(ELSE);}
{id} {yylval=install_id(); return(ID);}
{number} {yylval=install_num(); return(NUMBER);}

"<" {yylval=LT; return(RELOP);}


"<=" {yylval=LE; return(RELOP);}
"=" {yylval=EQ; return(RELOP);}
"<>" {yylval=NE; return(RELOP);}
">" {yylval=GT; return(RELOP);}
">=" {yylval=GE; return(RELOP);}
";" {yylval=SEMI; return(SEMI);}
%%
OVERVIEW
 Need and Role of Lexical Analyzer
 Lexical Errors

 Expressing Tokens by Regular Expressions

 Converting Regular Expression to DFA

 Minimization of DFA

 Language for Specifying Lexical Analyzers

 LEX

 Design of Lexical Analyzer for a sample


Language
SQL QUERY
The lexical analyzer identifies the keywords: SELECT is recognized as a keyword, whereas the misspelling SELECTE produces a lexical error.

• The Google Play store currently houses over one million apps, built to solve everyday issues and speed up how we do things.

SYNTAX ANALYZER
 Syntax Analyzer creates the syntactic structure of the given
source program.
 This syntactic structure is mostly a parse tree.
 Syntax Analyzer is also known as parser.
 The syntax of a programming language is described by a context-free
grammar (CFG). We will use BNF (Backus-Naur Form)
notation in the description of CFGs.
 The syntax analyzer (parser) checks whether a given source
program satisfies the rules implied by a context-free grammar
or not.
 If it satisfies, the parser creates the parse tree of that program.
 Otherwise the parser gives the error messages.
 A context-free grammar
 gives a precise syntactic specification of a programming language.
 the design of the grammar is an initial phase of the design of a compiler.
 a grammar can be directly converted into a parser by some tools.
PARSER
• The parser works on a stream of tokens.
• The smallest item is a token.

source program → [Lexical Analyzer] --token--> [Parser] → parse tree
                 (the parser calls "get next token" on the lexical analyzer)

PARSERS (CONT.)
 We categorize the parsers into two groups:

1. Top-Down Parser
 the parse tree is created top to bottom, starting from the root.
2. Bottom-Up Parser
 the parse is created bottom to top; starting from the leaves

 Both top-down and bottom-up parsers scan the input


from left to right (one symbol at a time).
 Efficient top-down and bottom-up parsers can be
implemented only for sub-classes of context-free
grammars.
 LL for top-down parsing
 LR for bottom-up parsing

CONTEXT-FREE GRAMMARS
 Inherently recursive structures of a programming
language are defined by a context-free grammar.
 In a context-free grammar, we have:
 A finite set of terminals (in our case, this will be the set of
tokens)
 A finite set of non-terminals (syntactic-variables)
 A finite set of production rules of the form
   A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
 A start symbol (one of the non-terminal symbols)

 Example:
E → E+E | E–E | E*E | E/E | -E
E → (E)
E → id

DERIVATIONS
E ⇒ E+E
 E+E derives from E
 we can replace E by E+E
 to be able to do this, we have to have a production rule E → E+E in our grammar.

E ⇒ E+E ⇒ id+E ⇒ id+id

 A sequence of replacements of non-terminal symbols is called a derivation of id+id from E.
 In general, a derivation step is
αAβ ⇒ αγβ if there is a production rule A → γ in our grammar, where α and β are arbitrary strings of terminal and non-terminal symbols.
α1 ⇒ α2 ⇒ ... ⇒ αn (αn derives from α1, or α1 derives αn)

 ⇒  : derives in one step
 ⇒* : derives in zero or more steps
 ⇒+ : derives in one or more steps

CFG - TERMINOLOGY
 L(G) is the language of G (the language generated by G), which is a set of sentences.
 A sentence of L(G) is a string of terminal symbols of G.
 If S is the start symbol of G, then ω is a sentence of L(G) iff S ⇒+ ω, where ω is a string of terminals of G.

 If G is a context-free grammar, L(G) is a context-free language.
 Two grammars are equivalent if they produce the same language.

 For S ⇒* α:
 - if α contains non-terminals, it is called a sentential form of G;
 - if α does not contain non-terminals, it is called a sentence of G.

DERIVATION EXAMPLE
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id)
OR
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)

 At each derivation step, we can choose any of the non-terminals in the sentential form of G for the replacement.

 If we always choose the left-most non-terminal in each derivation step, the derivation is called a left-most derivation.

 If we always choose the right-most non-terminal in each derivation step, the derivation is called a right-most derivation.

LEFT-MOST AND RIGHT-MOST DERIVATIONS

Left-Most Derivation:
E ⇒lm -E ⇒lm -(E) ⇒lm -(E+E) ⇒lm -(id+E) ⇒lm -(id+id)

Right-Most Derivation:
E ⇒rm -E ⇒rm -(E) ⇒rm -(E+E) ⇒rm -(E+id) ⇒rm -(id+id)

 Top-down parsers try to find the left-most derivation of the given source program.

 Bottom-up parsers try to find the right-most derivation of the given source program in reverse order.
PARSE TREE
• Inner nodes of a parse tree are non-terminal symbols.
• The leaves of a parse tree are terminal symbols.
• A parse tree can be seen as a graphical representation of a derivation.

E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(id+E) ⇒ -(id+id) builds the tree step by step; the final tree is:

       E
      / \
     -   E
        /|\
       ( E )
        /|\
       E + E
       |   |
      id   id
AMBIGUITY
• A grammar that produces more than one parse tree for some sentence is called an ambiguous grammar.

E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  +  E
     |    /|\
    id   E * E
         |   |
        id   id

E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id

        E
      / | \
     E  *  E
    /|\    |
   E + E   id
   |   |
  id   id

AMBIGUITY (CONT.)
 For most parsers, the grammar must be unambiguous.

 unambiguous grammar → unique selection of the parse tree for a sentence

 We should eliminate the ambiguity in the grammar during the design phase of the compiler.
 An unambiguous grammar should be written to eliminate the ambiguity.
 We have to prefer one of the parse trees of a sentence (generated by an ambiguous grammar) to disambiguate that grammar, and restrict the grammar to this choice.
AMBIGUITY (CONT.)

stmt → if expr then stmt |
       if expr then stmt else stmt | otherstmts

For the input: if E1 then if E2 then S1 else S2, two parse trees exist:

Tree 1: if E1 then (if E2 then S1) else S2    (the else attaches to the outer if)
Tree 2: if E1 then (if E2 then S1 else S2)    (the else attaches to the inner, closest if)
AMBIGUITY (CONT.)
• We prefer the second parse tree (the else matches the closest if).
• So, we have to disambiguate our grammar to reflect this choice.
• The unambiguous grammar will be:

stmt → matchedstmt | unmatchedstmt
matchedstmt → if expr then matchedstmt else matchedstmt | otherstmts
unmatchedstmt → if expr then stmt |
                if expr then matchedstmt else unmatchedstmt
AMBIGUITY – OPERATOR PRECEDENCE

 Ambiguous grammars (because of ambiguous operators) can be disambiguated according to the precedence and associativity rules.

E → E+E | E*E | E^E | id | (E)

Disambiguate the grammar with precedence (highest first):
 ^ (right to left)
 * (left to right)
 + (left to right)

E → E+T | T
T → T*F | F
F → G^F | G
G → id | (E)

LEFT RECURSION
 A grammar is left recursive if it has a non-terminal A such that there is a derivation
A ⇒+ Aα for some string α.

 Top-down parsing techniques cannot handle left-recursive grammars.

 So, we have to convert a left-recursive grammar into an equivalent grammar which is not left-recursive.

 The left-recursion may appear in a single step of the derivation (immediate left-recursion), or may appear in more than one step of the derivation.
IMMEDIATE LEFT-RECURSION
A → Aα | β    where β does not start with A

 Eliminate immediate left recursion:
A  → β A'
A' → α A' | ε    (an equivalent grammar)

In general,
A → Aα1 | ... | Aαm | β1 | ... | βn    where β1 ... βn do not start with A

 Eliminate immediate left recursion:
A  → β1 A' | ... | βn A'
A' → α1 A' | ... | αm A' | ε    (an equivalent grammar)
IMMEDIATE LEFT-RECURSION -- EXAMPLE
E → E+T | T
T → T*F | F
F → id | (E)

 Eliminate immediate left recursion:

E  → T E'
E' → +T E' | ε
T  → F T'
T' → *F T' | ε
F  → id | (E)
LEFT-RECURSION -- PROBLEM
• A grammar may not be immediately left-recursive, but it still can be left-recursive.
• By just eliminating the immediate left-recursion, we may not get a grammar which is not left-recursive.

S → Aa | b
A → Sc | d    This grammar is not immediately left-recursive, but it is still left-recursive.

S ⇒ Aa ⇒ Sca    or
A ⇒ Sc ⇒ Aac    causes a left-recursion

• So, we have to eliminate all left-recursions from our grammar.

ELIMINATE LEFT-RECURSION -- ALGORITHM
- Arrange non-terminals in some order: A1 ... An
- for i from 1 to n do {
-   for j from 1 to i-1 do {
      replace each production
        Ai → Aj γ
      by
        Ai → α1 γ | ... | αk γ
      where Aj → α1 | ... | αk
    }
-   eliminate immediate left-recursions among the Ai productions
  }
ELIMINATE LEFT-RECURSION -- EXAMPLE

S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: S, A
for S:
- we do not enter the inner loop.
- there is no immediate left recursion in S.
for A:
- Replace A → Sd with A → Aad | bd
  So, we will have A → Ac | Aad | bd | f
- Eliminate the immediate left-recursion in A:
  A  → bdA' | fA'
  A' → cA' | adA' | ε

So, the resulting equivalent grammar which is not left-recursive is:
S  → Aa | b
A  → bdA' | fA'
A' → cA' | adA' | ε
ELIMINATE LEFT-RECURSION – EXAMPLE 2

S → Aa | b
A → Ac | Sd | f
- Order of non-terminals: A, S
for A:
- we do not enter the inner loop.
- Eliminate the immediate left-recursion in A:
  A  → SdA' | fA'
  A' → cA' | ε
for S:
- Replace S → Aa with S → SdA'a | fA'a
  So, we will have S → SdA'a | fA'a | b
- Eliminate the immediate left-recursion in S:
  S  → fA'aS' | bS'
  S' → dA'aS' | ε
So, the resulting equivalent grammar which is not left-recursive is:
S  → fA'aS' | bS'
S' → dA'aS' | ε
A  → SdA' | fA'
A' → cA' | ε

LEFT-FACTORING
 A predictive parser (a top-down parser without backtracking) insists that the grammar must be left-factored.

grammar → a new equivalent grammar suitable for predictive parsing

stmt → if expr then stmt else stmt |
       if expr then stmt

 When we see if, we cannot know which production rule to choose to re-write stmt in the derivation.
LEFT-FACTORING (CONT.)
 In general,
A → αβ1 | αβ2    where α is non-empty and the first symbols of β1 and β2 (if they have one) are different.
 When processing α we cannot know whether to expand A to αβ1 or to αβ2.

 But, if we re-write the grammar as follows:
A  → αA'
A' → β1 | β2
we can immediately expand A to αA'.

LEFT-FACTORING -- ALGORITHM
 For each non-terminal A with two or more alternatives (production rules) with a common non-empty prefix, say

A → αβ1 | ... | αβn | γ1 | ... | γm

convert it into

A  → αA' | γ1 | ... | γm
A' → β1 | ... | βn

LEFT-FACTORING – EXAMPLE 1
A → abB | aB | cdg | cdeB | cdfB
⇓
A  → aA' | cdg | cdeB | cdfB
A' → bB | B
⇓
A   → aA' | cdA''
A'  → bB | B
A'' → g | eB | fB

LEFT-FACTORING – EXAMPLE 2
A → ad | a | ab | abc | b
⇓
A  → aA' | b
A' → d | ε | b | bc
⇓
A   → aA' | b
A'  → d | ε | bA''
A'' → ε | c

NON-CONTEXT-FREE LANGUAGE CONSTRUCTS
 There are some language constructs in programming languages which are not context-free. This means that we cannot write a context-free grammar for these constructs.

 L1 = { ωcω | ω is in (a|b)* } is not context-free
 It models declaring an identifier and later checking whether it has been declared. We cannot do this with a context-free language; we need a semantic analyzer (which is not context-free).

 L2 = { a^n b^m c^n d^m | n ≥ 1 and m ≥ 1 } is not context-free
 It models declaring two functions (one with n parameters, the other with m parameters), and then calling them with the matching numbers of actual parameters.

TOP-DOWN PARSING
 The parse tree is created top to bottom.
 Top-down parser
 Recursive-Descent Parsing
 Backtracking is needed (If a choice of a production rule does
not work, we backtrack to try other alternatives.)
 It is a general parsing technique, but not widely used.
 Not efficient
 Predictive Parsing
 no backtracking
 efficient
 needs a special form of grammars (LL(1) grammars).
 Recursive Predictive Parsing is a special form of Recursive
Descent parsing without backtracking.
 Non-Recursive (Table Driven) Predictive Parser is also
known as LL(1) parser.
RECURSIVE-DESCENT PARSING (USES BACKTRACKING)
 Backtracking is needed.
 It tries to find the left-most derivation.

S → aBc
B → bc | b

input: abc

Trying B → bc first: S ⇒ aBc ⇒ abcc   -- fails to match the input, backtrack
Trying B → b next:   S ⇒ aBc ⇒ abc    -- matches the input

PREDICTIVE PARSER

 When re-writing a non-terminal in a derivation step, a predictive parser can uniquely choose a production rule by just looking at the current symbol in the input string.

A → α1 | ... | αn        input: ... a .......
                                    ↑
                               current token

PREDICTIVE PARSER (EXAMPLE)

stmt → if ...... |
       while ...... |
       begin ...... |
       for ......

 When we are trying to re-write the non-terminal stmt, if the current token is if, we have to choose the first production rule.

 When we are trying to re-write the non-terminal stmt, we can uniquely choose the production rule by just looking at the current token.

 We eliminate the left recursion in the grammar and left factor it. But it may still not be suitable for predictive parsing (not an LL(1) grammar).

RECURSIVE PREDICTIVE PARSING

 Each non-terminal corresponds to a procedure.

Ex: A → aBb (this is the only production rule for A)

proc A {
  - match the current token with a, and move to the next token;
  - call B;
  - match the current token with b, and move to the next token;
}

RECURSIVE PREDICTIVE PARSING (CONT.)

A → aBb | bAB

proc A {
  case of the current token {
    'a': - match the current token with a, and move to the next token;
         - call B;
         - match the current token with b, and move to the next token;
    'b': - match the current token with b, and move to the next token;
         - call A;
         - call B;
  }
}
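
A minimal C sketch of the same procedure; the single-character tokens, the global lookahead, and the match()/error() helpers are illustrative assumptions (B is stubbed, since its rules are not given here):

#include <stdio.h>
#include <stdlib.h>

int lookahead;                          /* current input token (one character) */

void error(void) { printf("syntax error\n"); exit(1); }

void match(int t) {                     /* match the current token and advance */
    if (lookahead == t) lookahead = getchar();
    else error();
}

void B(void);                           /* forward declaration */

void A(void) {                          /* A -> aBb | bAB */
    switch (lookahead) {
    case 'a': match('a'); B(); match('b'); break;
    case 'b': match('b'); A(); B(); break;
    default:  error();
    }
}

void B(void) { /* body depends on the production rules for B (not given) */ }

int main(void) {
    lookahead = getchar();              /* prime the lookahead */
    A();
    printf("parsed\n");
    return 0;
}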

RECURSIVE PREDICTIVE PARSING (CONT.)

 When to apply ε-productions:

A → aA | bB | ε

 If all other productions fail, we should apply an ε-production. For example, if the current token is not a or b, we may apply the ε-production.
 Most correct choice: we should apply an ε-production for a non-terminal A when the current token is in the follow set of A (the set of terminals that can follow A in the sentential forms).

RECURSIVE PREDICTIVE PARSING (EXAMPLE)

A → aBe | cBd | C
B → bB | ε
C → f

proc A {
  case of the current token {
    a: - match the current token with a, and move to the next token;
       - call B;
       - match the current token with e, and move to the next token;
    c: - match the current token with c, and move to the next token;
       - call B;
       - match the current token with d, and move to the next token;
    f: - call C                       (f is in the first set of C)
  }
}

proc B {
  case of the current token {
    b: - match the current token with b, and move to the next token;
       - call B;
    e, d: do nothing                  (e and d are in the follow set of B)
  }
}

proc C {
  - match the current token with f, and move to the next token;
}

NON-RECURSIVE PREDICTIVE PARSING -- LL(1) PARSER
 Non-recursive predictive parsing is table-driven.
 It is a top-down parser.
 It is also known as the LL(1) parser.

input buffer → [Non-recursive Predictive Parser] → output
               (uses a stack and a parsing table)

LL(1) PARSER
input buffer
◦ our string to be parsed. We will assume that its end is marked with a special symbol $.

output
◦ a production rule representing a step of the derivation sequence (left-most derivation) of the string in the input buffer.

stack
◦ contains the grammar symbols
◦ at the bottom of the stack, there is a special end marker symbol $.
◦ initially the stack contains only the symbol $ and the starting symbol S ($S = initial stack).
◦ when the stack is emptied (i.e. only $ is left in the stack), parsing is complete.

parsing table
◦ a two-dimensional array M[A,a]
◦ each row is a non-terminal symbol
◦ each column is a terminal symbol or the special symbol $
◦ each entry holds a production rule.

LL(1) PARSER – PARSER ACTIONS

 The symbol at the top of the stack (say X) and the current symbol in the input string (say a) determine the parser action.
 There are four possible parser actions:

1. If X and a are both $ → the parser halts (successful completion).

2. If X and a are the same terminal symbol (different from $) → the parser pops X from the stack and moves to the next symbol in the input buffer.

3. If X is a non-terminal → the parser looks at the parsing table entry M[X,a]. If M[X,a] holds a production rule X→Y1Y2...Yk, it pops X from the stack and pushes Yk, Yk-1, ..., Y1 onto the stack. The parser also outputs the production rule X→Y1Y2...Yk to represent a step of the derivation.

4. None of the above → error:
◦ all empty entries in the parsing table are errors.
◦ if X is a terminal symbol different from a, this is also an error case.

LL(1) PARSER – EXAMPLE 1

S → aBa
B → bB | ε

LL(1) Parsing Table:
       a          b         $
S      S → aBa
B      B → ε      B → bB

stack    input     output
$S       abba$     S → aBa
$aBa     abba$
$aB      bba$      B → bB
$aBb     bba$
$aB      ba$       B → bB
$aBb     ba$
$aB      a$        B → ε
$a       a$
$        $         accept, successful completion
LL(1) PARSER – EXAMPLE 1 (CONT.)
Outputs: S → aBa    B → bB    B → bB    B → ε

Derivation (left-most): S ⇒ aBa ⇒ abBa ⇒ abbBa ⇒ abba

Parse tree:
      S
    / | \
   a  B  a
     / \
    b   B
       / \
      b   B
          |
          ε
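
A minimal C sketch of the table-driven loop for this grammar (S → aBa, B → bB | ε); encoding the table entries as right-hand-side strings is an illustrative assumption:

#include <stdio.h>
#include <string.h>

/* M[X,a]: replacement string to push in reverse; "" encodes B -> eps; NULL = error. */
const char *table(char X, char a) {
    if (X == 'S' && a == 'a') return "aBa";    /* S -> aBa */
    if (X == 'B' && a == 'b') return "bB";     /* B -> bB  */
    if (X == 'B' && a == 'a') return "";       /* B -> eps (a is in FOLLOW(B)) */
    return NULL;
}

int parse(const char *input) {
    char stack[100] = "$S";                    /* $ at the bottom, S on top */
    int top = 1;
    const char *ip = input;
    for (;;) {
        char X = stack[top], a = *ip ? *ip : '$';
        if (X == '$' && a == '$') return 1;    /* accept */
        if (X == a) { top--; ip++; }           /* pop matching terminal, advance */
        else if (X == 'S' || X == 'B') {       /* expand a non-terminal */
            const char *rhs = table(X, a);
            if (!rhs) return 0;                /* empty table entry: error */
            top--;                             /* pop X ... */
            for (int i = (int)strlen(rhs) - 1; i >= 0; i--)
                stack[++top] = rhs[i];         /* ... and push the RHS in reverse */
        } else return 0;                       /* terminal mismatch: error */
    }
}

int main(void) {
    printf("%s\n", parse("abba") ? "accept" : "reject");   /* abba: accept */
    return 0;
}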


LL(1) PARSER – EXAMPLE 2
E  → TE'
E' → +TE' | ε
T  → FT'
T' → *FT' | ε
F  → (E) | id

      id          +            *            (          )         $
E     E → TE'                               E → TE'
E'                E' → +TE'                            E' → ε    E' → ε
T     T → FT'                               T → FT'
T'                T' → ε       T' → *FT'               T' → ε    T' → ε
F     F → id                                F → (E)

LL(1) PARSER – EXAMPLE 2

stack      input      output
$E         id+id$     E → TE'
$E'T       id+id$     T → FT'
$E'T'F     id+id$     F → id
$E'T'id    id+id$
$E'T'      +id$       T' → ε
$E'        +id$       E' → +TE'
$E'T+      +id$
$E'T       id$        T → FT'
$E'T'F     id$        F → id
$E'T'id    id$
$E'T'      $          T' → ε
$E'        $          E' → ε
$          $          accept

CONSTRUCTING LL(1) PARSING TABLES

 Two functions are used in the construction of LL(1) parsing tables: FIRST and FOLLOW.

 FIRST(α) is the set of terminal symbols which occur as first symbols in strings derived from α, where α is any string of grammar symbols.
 If α derives ε, then ε is also in FIRST(α).

 FOLLOW(A) is the set of terminals which occur immediately after (follow) the non-terminal A in strings derived from the start symbol:
 a terminal a is in FOLLOW(A) if S ⇒* αAaβ
 $ is in FOLLOW(A) if S ⇒* αA

COMPUTE FIRST FOR ANY STRING X

 If X is a terminal symbol → FIRST(X) = {X}.
 If X is a non-terminal symbol and X → ε is a production rule → ε is in FIRST(X).
 If X is a non-terminal symbol and X → Y1Y2..Yn is a production rule:
 if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in FIRST(X);
 if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).
 If X is ε → FIRST(X) = {ε}.
 If X is a string Y1Y2..Yn:
 if a terminal a is in FIRST(Yi) and ε is in all FIRST(Yj) for j=1,...,i-1, then a is in FIRST(X);
 if ε is in all FIRST(Yj) for j=1,...,n, then ε is in FIRST(X).

FIRST EXAMPLE
E  → TE'
E' → +TE' | ε
T  → FT'
T' → *FT' | ε
F  → (E) | id

FIRST(F)  = {(, id}       FIRST(TE')  = {(, id}
FIRST(T') = {*, ε}        FIRST(+TE') = {+}
FIRST(T)  = {(, id}       FIRST(ε)    = {ε}
FIRST(E') = {+, ε}        FIRST(FT')  = {(, id}
FIRST(E)  = {(, id}       FIRST(*FT') = {*}
                          FIRST((E))  = {(}
                          FIRST(id)   = {id}

COMPUTE FOLLOW (FOR NON-TERMINALS)

 If S is the start symbol → $ is in FOLLOW(S).

 If A → αBβ is a production rule → everything in FIRST(β) is in FOLLOW(B), except ε.

 If (A → αB is a production rule) or (A → αBβ is a production rule and ε is in FIRST(β)) → everything in FOLLOW(A) is in FOLLOW(B).

 We apply these rules until nothing more can be added to any follow set.

FOLLOW EXAMPLE
E  → TE'
E' → +TE' | ε
T  → FT'
T' → *FT' | ε
F  → (E) | id

FOLLOW(E)  = { $, ) }
FOLLOW(E') = { $, ) }
FOLLOW(T)  = { +, ), $ }
FOLLOW(T') = { +, ), $ }
FOLLOW(F)  = { +, *, ), $ }

CONSTRUCTING LL(1) PARSING TABLE -- ALGORITHM

 For each production rule A → α of a grammar G:
 for each terminal a in FIRST(α), add A → α to M[A,a];
 if ε is in FIRST(α), for each terminal a in FOLLOW(A), add A → α to M[A,a];
 if ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A,$].

 All other undefined entries of the parsing table are error entries.

CONSTRUCTING LL(1) PARSING TABLE -- EXAMPLE

E → TE'      FIRST(TE') = {(, id}  →  E → TE' into M[E,(] and M[E,id]
E' → +TE'    FIRST(+TE') = {+}     →  E' → +TE' into M[E',+]
E' → ε       FIRST(ε) = {ε}        →  none,
             but since ε is in FIRST(ε) and FOLLOW(E') = {$, )}  →  E' → ε into M[E',$] and M[E',)]
T → FT'      FIRST(FT') = {(, id}  →  T → FT' into M[T,(] and M[T,id]
T' → *FT'    FIRST(*FT') = {*}     →  T' → *FT' into M[T',*]
T' → ε       FIRST(ε) = {ε}        →  none,
             but since ε is in FIRST(ε) and FOLLOW(T') = {$, ), +}  →  T' → ε into M[T',$], M[T',)] and M[T',+]
F → (E)      FIRST((E)) = {(}      →  F → (E) into M[F,(]
F → id       FIRST(id) = {id}      →  F → id into M[F,id]

LL(1) GRAMMARS
 A grammar whose parsing table has no multiply defined entries is said to be an LL(1) grammar.

LL(1):
 the first L: input scanned from Left to right
 the second L: Left-most derivation
 1: one input symbol used as a lookahead symbol to determine the parser action

 The parsing table of a grammar may contain more than one production rule in an entry. In this case, we say that it is not an LL(1) grammar.
A GRAMMAR WHICH IS NOT LL(1)

S → iCtSE | a      FOLLOW(S) = { $, e }
E → eS | ε         FOLLOW(E) = { $, e }
C → b              FOLLOW(C) = { t }

FIRST(iCtSE) = {i}
FIRST(a) = {a}
FIRST(eS) = {e}
FIRST(ε) = {ε}
FIRST(b) = {b}

        a          b          e                  i              t      $
S       S → a                                    S → iCtSE
E                             E → eS, E → ε                            E → ε
C                  C → b

two production rules for M[E,e]
Problem → ambiguity
A GRAMMAR WHICH IS NOT LL(1) (CONT.)
What do we have to do if the resulting parsing table contains multiply defined entries?
◦ If we didn't eliminate left recursion, eliminate the left recursion in the grammar.
◦ If the grammar is not left factored, we have to left factor the grammar.
◦ If its (the new grammar's) parsing table still contains multiply defined entries, that grammar is ambiguous or it is inherently not an LL(1) grammar.
• A left recursive grammar cannot be an LL(1) grammar.
◦ A → Aα | β
  – any terminal that appears in FIRST(β) also appears in FIRST(Aα) because Aα ⇒ βα.
  – If β is ε, any terminal that appears in FIRST(α) also appears in FIRST(Aα) and FOLLOW(A).
• A grammar that is not left factored cannot be an LL(1) grammar.
◦ A → αβ1 | αβ2
  – any terminal that appears in FIRST(αβ1) also appears in FIRST(αβ2).
• An ambiguous grammar cannot be an LL(1) grammar.
PROPERTIES OF LL(1) GRAMMARS

• A grammar G is LL(1) if and only if the following conditions hold for every two distinct production rules A → α and A → β:

1. Both α and β cannot derive strings starting with the same terminals.
2. At most one of α and β can derive ε.
3. If β can derive ε, then α cannot derive any string starting with a terminal in FOLLOW(A).
ERROR RECOVERY IN PREDICTIVE PARSING
• An error may occur in predictive parsing (LL(1) parsing):
  – if the terminal symbol on the top of the stack does not match the current input symbol;
  – if the top of the stack is a non-terminal A, the current input symbol is a, and the parsing table entry M[A,a] is empty.

• What should the parser do in an error case?
  – The parser should be able to give an error message (as meaningful an error message as possible).
  – It should recover from the error and be able to continue parsing the rest of the input.
ERROR RECOVERY TECHNIQUES

• Panic-Mode Error Recovery
  ◦ Skipping the input symbols until a synchronizing token is found.
• Phrase-Level Error Recovery
  ◦ Each empty entry in the parsing table is filled with a pointer to a specific error routine to take care of that error case.
• Error-Productions
  ◦ If we have a good idea of the common errors that might be encountered, we can augment the grammar with productions that generate erroneous constructs.
  ◦ When an error production is used by the parser, we can generate appropriate error diagnostics.
  ◦ Since it is almost impossible to know all the errors that can be made by the programmers, this method is not practical.
• Global-Correction
  ◦ Ideally, we would like a compiler to make as few changes as possible in processing incorrect inputs.
  ◦ We have to globally analyze the input to find the error.
  ◦ This is an expensive method, and it is not used in practice.
BOTTOM UP PARSING
PARSING TECHNIQUES
Top-down parsers (LL(1), recursive descent)
• Start at the root of the parse tree from the start symbol and grow toward leaves (similar to a derivation)
• Pick a production and try to match the input
• Bad "pick" → may need to backtrack
• Some grammars are backtrack-free (predictive parsing)

Bottom-up parsers (LR(1), operator precedence)
• Start at the leaves and grow toward the root
• We can think of the process as reducing the input string to the start symbol
• At each reduction step a particular substring matching the right side of a production is replaced by the symbol on the left side of the production
• Bottom-up parsers handle a large class of grammars
BOTTOM-UP PARSING
• A general style of bottom-up syntax analysis is known as shift-reduce parsing.
• Bottom-up parsing is also known as shift-reduce parsing because its two main actions are shift and reduce.
  – At each shift action, the current symbol in the input string is pushed onto a stack.
  – At each reduction step, the symbols at the top of the stack (this symbol sequence is the right side of a production) are replaced by the non-terminal at the left side of that production.
  – There are also two more actions: accept and error.
SHIFT-REDUCE PARSING
• A shift-reduce parser tries to reduce the given input string into the starting symbol:

      a string  --(reduced to)-->  the starting symbol

• At each reduction step, a substring of the input matching the right side of a production rule is replaced by the non-terminal at the left side of that production rule.
• If the substring is chosen correctly, the rightmost derivation of that string is created in reverse order:

      Rightmost derivation:        S ⇒* ω
      Shift-reduce parser finds:   ω ⇐ ... ⇐ S
BOTTOM UP PARSING
• "Shift-Reduce" Parsing
• Reduce a string to the start symbol of the grammar.
• At every step a particular sub-string is matched (in left-to-right fashion) to the right side of some production and then it is substituted by the non-terminal on the left hand side of the production.

Consider:            abbcde     ↑
S → aABe             aAbcde     | reverse
A → Abc | b          aAde       | order
B → d                aABe       |
                     S          |

Rightmost derivation:
S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde
HANDLES
• Handle of a string: a substring that matches the RHS of some production AND whose reduction to the non-terminal on the LHS is a step along the reverse of some rightmost derivation.

• A handle of a right sentential form γ (γ = αβω) is a production rule A → β and a position in γ where the string β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ:

      S ⇒* αAω ⇒ αβω

  i.e. A → β is a handle of γ at the location immediately after the end of α.
• If the grammar is unambiguous, then every right-sentential form of the grammar has exactly one handle.
• ω is a string of terminals.
EXAMPLE
Consider:
S → aABe
A → Abc | b
B → d

S ⇒ aABe ⇒ aAde ⇒ aAbcde ⇒ abbcde

It follows that:
S → aABe is a handle of aABe in location 1.
B → d is a handle of aAde in location 3.
A → Abc is a handle of aAbcde in location 2.
A → b is a handle of abbcde in location 2.
HANDLE PRUNING
• A rightmost derivation in reverse can be obtained by "handle-pruning."
• Apply this to the previous example:

S → aABe
A → Abc | b
B → d

abbcde      find the handle = b at loc. 2
aAbcde      (reducing b at loc. 3 instead is not a handle:
aAAcde       ... the parse is blocked)
HANDLE-PRUNING, BOTTOM-UP PARSERS

• The process of discovering a handle and reducing it to the appropriate left-hand side is called handle pruning.
• Handle pruning forms the basis for a bottom-up parsing method.
• To construct a rightmost derivation

      S = γ0 ⇒ γ1 ⇒ γ2 ⇒ ... ⇒ γn-1 ⇒ γn = ω (the input string)

  apply the following simple algorithm:
  – Start from γn; find a handle An → βn in γn, and replace βn by An to get γn-1.
  – Then find a handle An-1 → βn-1 in γn-1, and replace βn-1 by An-1 to get γn-2.
  – Repeat this until we reach S.
A SHIFT-REDUCE PARSER
E → E+T | T        Rightmost derivation of id+id*id:
T → T*F | F        E ⇒ E+T ⇒ E+T*F ⇒ E+T*id ⇒ E+F*id
F → (E) | id         ⇒ E+id*id ⇒ T+id*id ⇒ F+id*id ⇒ id+id*id

Right-Most Sentential Form     Reducing Production
id+id*id                       F → id
F+id*id                        T → F
T+id*id                        E → T
E+id*id                        F → id
E+F*id                         T → F
E+T*id                         F → id
E+T*F                          T → T*F
E+T                            E → E+T
E
(The handle in each right-sentential form is the substring reduced in the next step.)
A STACK IMPLEMENTATION OF A SHIFT-REDUCE PARSER
• There are four possible actions of a shift-reduce parser:
1. Shift: the next input symbol is shifted onto the top of the stack.
2. Reduce: replace the handle on the top of the stack by the non-terminal.
3. Accept: successful completion of parsing.
4. Error: the parser discovers a syntax error and calls an error recovery routine.
• The initial stack contains only the end-marker $.
• The end of the input string is marked by the end-marker $.
SHIFT REDUCE PARSING WITH A STACK
• Two problems:
  ◦ locate a handle, and
  ◦ decide which production to use (if there is more than one candidate production).
• General construction, using a stack:
  ◦ "shift" input symbols onto the stack until a handle is found on top of it.
  ◦ "reduce" the handle to the corresponding non-terminal.
  ◦ other operations: "accept" when the input is consumed and only the start symbol is on the stack; also: "error".
A STACK IMPLEMENTATION OF A SHIFT-REDUCE PARSER
Stack       Input        Action
$           id+id*id$    shift
$id         +id*id$      reduce by F → id
$F          +id*id$      reduce by T → F
$T          +id*id$      reduce by E → T
$E          +id*id$      shift
$E+         id*id$       shift
$E+id       *id$         reduce by F → id
$E+F        *id$         reduce by T → F
$E+T        *id$         shift
$E+T*       id$          shift
$E+T*id     $            reduce by F → id
$E+T*F      $            reduce by T → T*F
$E+T        $            reduce by E → E+T
$E          $            accept
CONFLICTS DURING SHIFT-REDUCE PARSING
• There are context-free grammars for which shift-reduce parsers cannot be used.
• The stack contents and the next input symbol may not decide the action:
  ◦ shift/reduce conflict: whether to make a shift operation or a reduction.
  ◦ reduce/reduce conflict: the parser cannot decide which of several reductions to make.
• If a shift-reduce parser cannot be used for a grammar, that grammar is called a non-LR(k) grammar.

      L : left-to-right scanning   R : rightmost derivation   k : lookahead symbols

• An ambiguous grammar can never be an LR grammar.
SHIFT-REDUCE PARSERS
• There are two main categories of shift-reduce parsers:

1. Operator-Precedence Parser
  ◦ simple, but handles only a small class of grammars.

2. LR-Parsers
  ◦ cover a wide range of grammars.
  – SLR – simple LR parser
  – Canonical LR – most general LR parser
  – LALR – intermediate LR parser (lookahead LR parser)
  ◦ SLR, Canonical LR and LALR work the same way; only their parsing tables are different.

LR PARSERS
• The most powerful (yet efficient) shift-reduce parsing is LR(k) parsing.

      L : left-to-right scanning   R : rightmost derivation (in reverse)   k : lookahead symbols (if k is omitted, it is 1)

• LR parsing is attractive because:
  ◦ LR parsing is the most general non-backtracking shift-reduce parsing, yet it is still efficient.
  ◦ The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive parsers: LL(1) grammars ⊂ LR(1) grammars.
  ◦ An LR parser can detect a syntactic error as soon as it is possible to do so on a left-to-right scan of the input.

LR PARSERS
• LR parsers cover a wide range of grammars.
  – SLR – simple LR parser
  – LR – most general LR parser
  – LALR – intermediate LR parser (lookahead LR parser)
• SLR, LR and LALR work the same way (they use the same algorithm); only their parsing tables are different.
LR PARSING ALGORITHM
(Figure: model of an LR parser. The input buffer holds a1 ... ai ... an $. The stack holds S0 X1 S1 ... Xm-1 Sm-1 Xm Sm, with state Sm on top. The driver consults an Action Table, whose rows are states and whose columns are terminals and $, each entry being one of four actions, and a Goto Table, whose rows are states and whose columns are non-terminals, each entry being a state number, and produces the output.)

A CONFIGURATION OF LR PARSING ALGORITHM
• A configuration of an LR parser is:

      ( S0 X1 S1 ... Xm Sm,   ai ai+1 ... an $ )
             Stack              Rest of Input

• Sm and ai decide the parser action by consulting the parsing action table. (The initial stack contains just S0.)
• A configuration of an LR parser represents the right sentential form:

      X1 ... Xm ai ai+1 ... an $

ACTIONS OF A LR-PARSER
1. shift s -- shifts the next input symbol and the state s onto the stack:
      ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  |--  ( S0 X1 S1 ... Xm Sm ai s, ai+1 ... an $ )

2. reduce A → β (or rn, where n is a production number)
   ◦ pop 2|β| items from the stack (r = |β| grammar symbols and r states);
   ◦ then push A and s, where s = goto[sm-r, A]:
      ( S0 X1 S1 ... Xm Sm, ai ai+1 ... an $ )  |--  ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )
   ◦ The output is the reducing production: reduce A → β.

3. Accept -- parsing successfully completed.

4. Error -- the parser detected an error (an empty entry in the action table).

REDUCE ACTION
• pop 2|β| items from the stack; let us assume that β = Y1Y2...Yr
• then push A and s, where s = goto[sm-r, A]:

      ( S0 X1 S1 ... Xm-r Sm-r Y1 Sm-r+1 ... Yr Sm, ai ai+1 ... an $ )
  |-- ( S0 X1 S1 ... Xm-r Sm-r A s, ai ... an $ )

• In fact, Y1Y2...Yr is a handle:

      X1 ... Xm-r A ai ... an $   ⇐   X1 ... Xm Y1...Yr ai ai+1 ... an $
(SLR) PARSING TABLES FOR EXPRESSION GRAMMAR

1) E → E+T           Action Table                             Goto Table
2) E → T      state   id     +     *     (     )     $       E     T     F
3) T → T*F    0       s5                 s4                   1     2     3
4) T → F      1              s6                       acc
5) F → (E)    2              r2    s7          r2     r2
6) F → id     3              r4    r4          r4     r4
              4       s5                 s4                   8     2     3
              5              r6    r6          r6     r6
              6       s5                 s4                         9     3
              7       s5                 s4                               10
              8              s6                s11
              9              r1    s7          r1     r1
              10             r3    r3          r3     r3
              11             r5    r5          r5     r5

ACTIONS OF A (S)LR-PARSER -- EXAMPLE

stack        input        action              output
0            id*id+id$    shift 5
0id5         *id+id$      reduce by F → id    F → id
0F3          *id+id$      reduce by T → F     T → F
0T2          *id+id$      shift 7
0T2*7        id+id$       shift 5
0T2*7id5     +id$         reduce by F → id    F → id
0T2*7F10     +id$         reduce by T → T*F   T → T*F
0T2          +id$         reduce by E → T     E → T
0E1          +id$         shift 6
0E1+6        id$          shift 5
0E1+6id5     $            reduce by F → id    F → id
0E1+6F3      $            reduce by T → F     T → F
0E1+6T9      $            reduce by E → E+T   E → E+T
0E1          $            accept
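The same moves can be produced by a generic driver. A compact sketch (Python, our own transcription of the SLR table above; the stack here holds only state numbers, since the grammar symbols are implicit in the states):

# Generic LR parsing driver using the SLR table above (sketch).
PRODS = [("S'", 1), ("E", 3), ("E", 1), ("T", 3), ("T", 1),
         ("F", 3), ("F", 1)]            # (head, |body|) for productions 0..6
ACTION = {
    (0, "id"): ("s", 5), (0, "("): ("s", 4),
    (1, "+"): ("s", 6), (1, "$"): ("acc", 0),
    (2, "+"): ("r", 2), (2, "*"): ("s", 7), (2, ")"): ("r", 2), (2, "$"): ("r", 2),
    (3, "+"): ("r", 4), (3, "*"): ("r", 4), (3, ")"): ("r", 4), (3, "$"): ("r", 4),
    (4, "id"): ("s", 5), (4, "("): ("s", 4),
    (5, "+"): ("r", 6), (5, "*"): ("r", 6), (5, ")"): ("r", 6), (5, "$"): ("r", 6),
    (6, "id"): ("s", 5), (6, "("): ("s", 4),
    (7, "id"): ("s", 5), (7, "("): ("s", 4),
    (8, "+"): ("s", 6), (8, ")"): ("s", 11),
    (9, "+"): ("r", 1), (9, "*"): ("s", 7), (9, ")"): ("r", 1), (9, "$"): ("r", 1),
    (10, "+"): ("r", 3), (10, "*"): ("r", 3), (10, ")"): ("r", 3), (10, "$"): ("r", 3),
    (11, "+"): ("r", 5), (11, "*"): ("r", 5), (11, ")"): ("r", 5), (11, "$"): ("r", 5),
}
GOTO = {(0, "E"): 1, (0, "T"): 2, (0, "F"): 3, (4, "E"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

def lr_parse(tokens):
    tokens = tokens + ["$"]
    stack, i = [0], 0                    # stack of states, S0 at the bottom
    while True:
        act = ACTION.get((stack[-1], tokens[i]))
        if act is None:
            raise SyntaxError("error at token %d" % i)
        kind, n = act
        if kind == "s":                  # shift: push state, advance input
            stack.append(n); i += 1
        elif kind == "r":                # reduce: pop |body| states, then goto
            head, size = PRODS[n]
            del stack[len(stack) - size:]
            stack.append(GOTO[(stack[-1], head)])
            print("reduce by production", n, "(head", head + ")")
        else:
            print("accept"); return

lr_parse(["id", "*", "id", "+", "id"])   # same reductions as the trace above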

CONSTRUCTING SLR PARSING TABLES – LR(0) ITEM

• An LR(0) item of a grammar G is a production of G with a dot at some position of the right side.
  Ex:  A → aBb
  Possible LR(0) items (four different possibilities):
      A → .aBb
      A → a.Bb
      A → aB.b
      A → aBb.
• Sets of LR(0) items will be the states of the action and goto tables of the SLR parser.
• A collection of sets of LR(0) items (the canonical LR(0) collection) is the basis for constructing SLR parsers.
• Augmented grammar: G' is G with a new production rule S' → S, where S' is the new starting symbol.

THE CLOSURE OPERATION

• If I is a set of LR(0) items for a grammar G, then closure(I) is the set of LR(0) items constructed from I by the two rules:

1. Initially, every LR(0) item in I is added to closure(I).
2. If A → α.Bβ is in closure(I) and B → γ is a production rule of G, then B → .γ will be in closure(I). We apply this rule until no more new LR(0) items can be added to closure(I).

THE CLOSURE OPERATION -- EXAMPLE

E' → E              closure({E' → .E}) =
E → E+T | T         { E' → .E        (kernel item)
T → T*F | F           E → .E+T
F → (E) | id          E → .T
                      T → .T*F
                      T → .F
                      F → .(E)
                      F → .id }

GOTO OPERATION
• If I is a set of LR(0) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
  – If A → α.Xβ is in I, then every item in closure({A → αX.β}) will be in goto(I,X).

Example:
I = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }

goto(I,E)  = { E' → E., E → E.+T }
goto(I,T)  = { E → T., T → T.*F }
goto(I,F)  = { T → F. }
goto(I,()  = { F → (.E), E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
goto(I,id) = { F → id. }

CONSTRUCTION OF THE CANONICAL LR(0) COLLECTION

• To create the SLR parsing tables for a grammar G, we create the canonical LR(0) collection of the grammar G'.

• Algorithm:
      C is { closure({S' → .S}) }
      repeat the following until no more sets of LR(0) items can be added to C:
          for each I in C and each grammar symbol X
              if goto(I,X) is not empty and not in C
                  add goto(I,X) to C

• The goto function is a DFA on the sets in C.
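The closure and goto definitions and the collection loop above translate directly into code. A sketch (Python, our own encoding: an LR(0) item is a pair of production index and dot position):

# Canonical LR(0) collection for the augmented expression grammar (sketch).
PRODS = [("E'", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
         ("T", ("T", "*", "F")), ("T", ("F",)),
         ("F", ("(", "E", ")")), ("F", ("id",))]
NONTERM = {"E'", "E", "T", "F"}
SYMBOLS = {"E", "T", "F", "+", "*", "(", ")", "id"}

def closure(items):
    items = set(items)
    while True:
        new = set()
        for p, d in items:
            body = PRODS[p][1]
            if d < len(body) and body[d] in NONTERM:   # dot before a non-terminal B
                for q in range(len(PRODS)):
                    if PRODS[q][0] == body[d]:
                        new.add((q, 0))                # add B -> .gamma
        if new <= items:
            return frozenset(items)
        items |= new

def goto(items, X):
    return closure({(p, d + 1) for p, d in items
                    if d < len(PRODS[p][1]) and PRODS[p][1][d] == X})

C = [closure({(0, 0)})]                                # closure({S' -> .S})
for I in C:                                            # C grows while we scan it
    for X in SYMBOLS:
        J = goto(I, X)
        if J and J not in C:
            C.append(J)
print(len(C), "sets of items")                         # 12: I0 .. I11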

THE CANONICAL LR(0) COLLECTION -- EXAMPLE

I0: E' → .E        I1: E' → E.        I6: E → E+.T       I9:  E → E+T.
    E → .E+T           E → E.+T           T → .T*F            T → T.*F
    E → .T                                T → .F
    T → .T*F       I2: E → T.             F → .(E)       I10: T → T*F.
    T → .F             T → T.*F           F → .id
    F → .(E)
    F → .id        I3: T → F.        I7: T → T*.F        I11: F → (E).
                                          F → .(E)
                   I4: F → (.E)           F → .id
                       E → .E+T
                       E → .T        I8: F → (E.)
                       T → .T*F          E → E.+T
                       T → .F
                       F → .(E)
                       F → .id

                   I5: F → id.
TRANSITION DIAGRAM (DFA) OF GOTO FUNCTION
(Figure: the goto function as a DFA over the item sets. From I0: E → I1, T → I2, F → I3, ( → I4, id → I5. From I1: + → I6. From I2: * → I7. From I4: E → I8, T → I2, F → I3, ( → I4, id → I5. From I6: T → I9, F → I3, ( → I4, id → I5. From I7: F → I10, ( → I4, id → I5. From I8: + → I6, ) → I11. From I9: * → I7.)

CONSTRUCTING SLR PARSING TABLE (OF AN AUGMENTED GRAMMAR G')

1. Construct the canonical collection of sets of LR(0) items for G'.  C = {I0,...,In}
2. Create the parsing action table as follows:
   • If a is a terminal, A → α.aβ is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
   • If A → α. is in Ii, then action[i,a] is reduce A → α for all a in FOLLOW(A), where A ≠ S'.
   • If S' → S. is in Ii, then action[i,$] is accept.
   • If any conflicting actions are generated by these rules, the grammar is not SLR(1).
3. Create the parsing goto table:
   • for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S.
PARSING TABLES OF EXPRESSION GRAMMAR

         Action Table                             Goto Table
state   id     +     *     (     )     $        E     T     F
0       s5                 s4                   1     2     3
1              s6                      acc
2              r2    s7          r2    r2
3              r4    r4          r4    r4
4       s5                 s4                   8     2     3
5              r6    r6          r6    r6
6       s5                 s4                         9     3
7       s5                 s4                               10
8              s6                s11
9              r1    s7          r1    r1
10             r3    r3          r3    r3
11             r5    r5          r5    r5

SLR(1) GRAMMAR
• An LR parser using SLR(1) parsing tables for a grammar G is called the SLR(1) parser for G.
• If a grammar G has an SLR(1) parsing table, it is called an SLR(1) grammar (or SLR grammar for short).
• Every SLR grammar is unambiguous, but not every unambiguous grammar is an SLR grammar.

SHIFT/REDUCE AND REDUCE/REDUCE CONFLICTS

• If a state does not know whether it will make a shift operation or a reduction for a terminal, we say that there is a shift/reduce conflict.

• If a state does not know whether it will make a reduction using production rule i or j for a terminal, we say that there is a reduce/reduce conflict.

• If the SLR parsing table of a grammar G has a conflict, we say that the grammar is not an SLR grammar.

SLR PARSER

CONFLICT EXAMPLE
S → L=R      I0: S' → .S      I1: S' → S.      I6: S → L=.R     I9: S → L=R.
S → R            S → .L=R                          R → .L
L → *R           S → .R       I2: S → L.=R         L → .*R
L → id           L → .*R          R → L.            L → .id
R → L            L → .id
                 R → .L       I3: S → R.       I7: L → *R.

                              I4: L → *.R      I8: R → L.
Problem:                          R → .L
FOLLOW(R) = {=, $}                L → .*R
on =:  shift 6                    L → .id
       reduce by R → L
→ shift/reduce conflict       I5: L → id.

CONFLICT EXAMPLE2
S → AaAb         I0: S' → .S
S → BbBa             S → .AaAb
A → ε                S → .BbBa
B → ε                A → .
                     B → .

Problem:
FOLLOW(A) = {a, b}
FOLLOW(B) = {a, b}

on a:  reduce by A → ε          on b:  reduce by A → ε
       reduce by B → ε                 reduce by B → ε
       reduce/reduce conflict          reduce/reduce conflict

CONSTRUCTING CANONICAL LR(1) PARSING TABLES
• In the SLR method, state i makes a reduction by A → α when the current token is a:
  ◦ if A → α. is in Ii and a is in FOLLOW(A).
• In some situations, however, A cannot be followed by the terminal a in a right-sentential form when α and the state i are on the top of the stack. This means that making the reduction in this case is not correct.

      S → AaAb        S ⇒ AaAb ⇒ Aab ⇒ ab
      S → BbBa        S ⇒ BbBa ⇒ Bba ⇒ ba
      A → ε
      B → ε

LR(1) ITEM
• To avoid some invalid reductions, the states need to carry more information.
• Extra information is put into a state by including a terminal symbol as a second component in an item.
• An LR(1) item is:

      A → α.β, a      where a is the lookahead of the LR(1) item
                      (a is a terminal or the end-marker $)

LR(1) ITEM (CONT.)
• When β (in the LR(1) item A → α.β, a) is not empty, the lookahead does not have any effect.
• When β is empty (A → α., a), we do the reduction by A → α only if the next input symbol is a (not for every terminal in FOLLOW(A)).
• A state will contain

      A → α., a1
      ...             where {a1,...,an} ⊆ FOLLOW(A)
      A → α., an

CANONICAL COLLECTION OF SETS OF LR(1) ITEMS
• The construction of the canonical collection of the sets of LR(1) items is similar to the construction of the canonical collection of the sets of LR(0) items, except that the closure and goto operations work a little differently.

closure(I) is (where I is a set of LR(1) items):
◦ every LR(1) item in I is in closure(I)
◦ if A → α.Bβ, a is in closure(I) and B → γ is a production rule of G, then B → .γ, b will be in closure(I) for each terminal b in FIRST(βa).

GOTO OPERATION
• If I is a set of LR(1) items and X is a grammar symbol (terminal or non-terminal), then goto(I,X) is defined as follows:
  – If A → α.Xβ, a is in I, then every item in closure({A → αX.β, a}) will be in goto(I,X).

CONSTRUCTION OF THE CANONICAL LR(1) COLLECTION
• Algorithm:
      C is { closure({S' → .S, $}) }
      repeat the following until no more sets of LR(1) items can be added to C:
          for each I in C and each grammar symbol X
              if goto(I,X) is not empty and not in C
                  add goto(I,X) to C
• The goto function is a DFA on the sets in C.

A SHORT NOTATION FOR THE SETS OF LR(1) ITEMS
• A set of LR(1) items containing the items

      A → α., a1
      ...
      A → α., an

  can be written as

      A → α., a1/a2/.../an

CANONICAL LR(1) COLLECTION -- EXAMPLE

S' → S        I0: S' → .S, $         I1: S' → S., $
S → AaAb          S → .AaAb, $
S → BbBa          S → .BbBa, $       I2: S → A.aAb, $   --a--> I4
A → ε             A → ., a
B → ε             B → ., b           I3: S → B.bBa, $   --b--> I5

I4: S → Aa.Ab, $  --A--> I6: S → AaA.b, $  --b--> I8: S → AaAb., $
    A → ., b

I5: S → Bb.Ba, $  --B--> I7: S → BbB.a, $  --a--> I9: S → BbBa., $
    B → ., a
CANONICAL LR(1) COLLECTION – EXAMPLE2

S' → S        I0: S' → .S, $          I1: S' → S., $
1) S → L=R        S → .L=R, $
2) S → R          S → .R, $           I2: S → L.=R, $   --=--> I6
3) L → *R         L → .*R, $/=            R → L., $
4) L → id         L → .id, $/=
5) R → L          R → .L, $           I3: S → R., $

I4: L → *.R, $/=       I6: S → L=.R, $        I9:  S → L=R., $
    R → .L, $/=            R → .L, $
    L → .*R, $/=           L → .*R, $         I10: R → L., $
    L → .id, $/=           L → .id, $
                                              I11: L → *.R, $
I5: L → id., $/=       I7: L → *R., $/=            R → .L, $
                                                   L → .*R, $
I8: R → L., $/=                                    L → .id, $

I12: L → id., $        I13: L → *R., $

States with the same cores: I4 and I11;  I5 and I12;  I7 and I13;  I8 and I10.

CONSTRUCTION OF LR(1) PARSING TABLES

1. Construct the canonical collection of sets of LR(1) items for G'.  C = {I0,...,In}
2. Create the parsing action table as follows:
   • If a is a terminal, A → α.aβ, b is in Ii and goto(Ii,a) = Ij, then action[i,a] is shift j.
   • If A → α., a is in Ii, then action[i,a] is reduce A → α, where A ≠ S'.
   • If S' → S., $ is in Ii, then action[i,$] is accept.
   • If any conflicting actions are generated by these rules, the grammar is not LR(1).
3. Create the parsing goto table:
   • for all non-terminals A, if goto(Ii,A) = Ij then goto[i,A] = j.
4. All entries not defined by (2) and (3) are errors.
5. The initial state of the parser is the one containing S' → .S, $.
LR(1) PARSING TABLES – (FOR EXAMPLE2)

     id     *      =      $        S    L    R
0    s5     s4                     1    2    3
1                         acc
2                  s6     r5
3                         r2
4    s5     s4                          8    7
5                  r4     r4
6    s12    s11                         10   9
7                  r3     r3
8                  r5     r5
9                         r1
10                        r5
11   s12    s11                         10   13
12                        r4
13                        r3

no shift/reduce or reduce/reduce conflict, so it is an LR(1) grammar

LALR PARSING TABLES

• LALR stands for LookAhead LR.
• LALR parsers are often used in practice because LALR parsing tables are smaller than LR(1) parsing tables.
• The numbers of states in the SLR and LALR parsing tables for a grammar G are equal.
• But LALR parsers recognize more grammars than SLR parsers.
• yacc creates an LALR parser for the given grammar.
• A state of an LALR parser will again be a set of LR(1) items.

CREATING LALR PARSING TABLES

      Canonical LR(1) Parser  --(shrink # of states)-->  LALR Parser

• This shrink process may introduce a reduce/reduce conflict in the resulting LALR parser (in which case the grammar is NOT LALR).
• But this shrink process does not produce a shift/reduce conflict.

THE CORE OF A SET OF LR(1) ITEMS

• The core of a set of LR(1) items is the set of its first components.

  Ex:   S → L.=R, $      →  core:  S → L.=R
        R → L., $                  R → L.

• We will find the states (sets of LR(1) items) in a canonical LR(1) parser with the same cores, then merge them into a single state.

      I1: L → id., =
                           →  a new state:  I12: L → id., =/$
      I2: L → id., $

  (they have the same core: merge them)

• We will do this for all states of a canonical LR(1) parser to get the states of the LALR parser.
• In fact, the number of states of the LALR parser for a grammar will be equal to the number of states of the SLR parser for that grammar.

CREATION OF LALR PARSING TABLES

• Create the canonical LR(1) collection of the sets of LR(1) items for the given grammar.
• Find each core; find all sets having that same core; replace those sets having the same core with a single set which is their union.
      C = {I0,...,In}  →  C' = {J1,...,Jm}  where m ≤ n
• Create the parsing tables (action and goto tables) in the same way as for an LR(1) parser.
  ◦ Note that if J = I1 ∪ ... ∪ Ik, since I1,...,Ik have the same core, the cores of goto(I1,X),...,goto(Ik,X) must also be the same.
  ◦ So goto(J,X) = K, where K is the union of all sets of items having the same core as goto(I1,X).
• If no conflict is introduced, the grammar is an LALR(1) grammar.
  (We may only introduce reduce/reduce conflicts; we cannot introduce a shift/reduce conflict.)
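Merging by cores is itself mechanical. A small sketch (Python; the item encoding and the two toy item sets, standing in for I4/I11 of Example 2, are ours):

# Merging LR(1) item sets that share a core (sketch).
# An LR(1) item is (production, dot position, lookahead); the core drops the lookahead.
def core(item_set):
    return frozenset((p, d) for p, d, _ in item_set)

def merge_by_cores(collection):
    merged = {}
    for I in collection:                 # union the lookaheads of equal cores
        merged.setdefault(core(I), set()).update(I)
    return [frozenset(s) for s in merged.values()]

# Toy stand-ins for I4 and I11 of Example 2: the kernel item L -> *.R with
# lookaheads $/= and $ respectively; equal cores collapse into one state.
I4  = {("L->*R", 1, "$"), ("L->*R", 1, "=")}
I11 = {("L->*R", 1, "$")}
print(len(merge_by_cores([I4, I11])))    # 1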
SHIFT/REDUCE CONFLICT
• We say that we cannot introduce a shift/reduce conflict during the shrink process for the creation of the states of an LALR parser.
• Assume that we could introduce a shift/reduce conflict. In this case, a state of the LALR parser must have:

      A → α., a     and     B → β.aγ, b

• This means that a state of the canonical LR(1) parser must have:

      A → α., a     and     B → β.aγ, c

  But this state also has a shift/reduce conflict, i.e. the original canonical LR(1) parser had the conflict.
  (The reason: the shift operation does not depend on lookaheads.)
REDUCE/REDUCE CONFLICT
• But we may introduce a reduce/reduce conflict during the shrink process for the creation of the states of an LALR parser:

      I1: A → α., a        I2: A → α., b
          B → β., b            B → β., c

      I12: A → α., a/b
           B → β., b/c     →  reduce/reduce conflict
CANONICAL LALR(1) COLLECTION – EXAMPLE2

S' → S         I0: S' → .S, $          I1: S' → S., $
1) S → L=R         S → .L=R, $
2) S → R           S → .R, $           I2: S → L.=R, $   --=--> I6
3) L → *R          S → .*R is not a kernel here; the kernel items are:
4) L → id          L → .*R, $/=            R → L., $
5) R → L           L → .id, $/=
                   R → .L, $           I3: S → R., $

I411: L → *.R, $/=      I6: S → L=.R, $       I9: S → L=R., $
      R → .L, $/=           R → .L, $
      L → .*R, $/=          L → .*R, $
      L → .id, $/=          L → .id, $

I512: L → id., $/=      I713: L → *R., $/=    I810: R → L., $/=

Same cores merged: I4 and I11 → I411;  I5 and I12 → I512;  I7 and I13 → I713;  I8 and I10 → I810.
LALR(1) PARSING TABLES – (FOR EXAMPLE2)

     id     *      =      $        S    L    R
0    s5     s4                     1    2    3
1                         acc
2                  s6     r5
3                         r2
4    s5     s4                          8    7
5                  r4     r4
6    s12    s11                         10   9
7                  r3     r3
8                  r5     r5
9                         r1

no shift/reduce or reduce/reduce conflict, so it is an LALR(1) grammar
APPLICATIONS

SQL QUERY SYNTAX
The parser checks the syntax (rules) of the programming language:

SELECT * FROM Book WHERE price > 100 ORDER BY title;     -- valid
SELECT FROM * Book WHERE ORDER BY title;                 -- invalid: tokens out of order
SELECT * FROM Book ORDER BY title WHERE price > 100;     -- invalid: clause order wrong
Do you have a need for speed? Take the hard work out of touchscreen typing with this impressive keyboard replacement. As you type, Swiftkey suggests whole words that it thinks are the most likely next words, almost like mind reading. Words get inserted effortlessly with a tap. This app will help you type faster and easier. It has a nice writing interface, is personalized, and provides an accurate autocorrect. It is also available in 61 languages!
UNIT III
INTERMEDIATE CODE GENERATION

Syntax Directed Definitions, Evaluation Orders for Syntax Directed Definitions, Intermediate Languages: Syntax Tree, Three Address Code, Types and Declarations, Translation of Expressions, Type Checking.
A CODE FRAGMENT TO BE TRANSLATED
Syntax-directed translators map code fragments into three-address code.

Source fragment:                             Three-address code:
{                                             1: i = i + 1
  int i; int j;                               2: t1 = a [ i ]
  float[100] a; float v; float x;             3: if t1 < v goto 1
  while (true) {                              4: j = j - 1
    do i = i + 1; while ( a[i] < v );         5: t2 = a [ j ]
    do j = j - 1; while ( a[j] > v );         6: if t2 > v goto 4
    if ( i >= j ) break;                      7: ifFalse i >= j goto 9
    x = a[i]; a[i] = a[j]; a[j] = x;          8: goto 14
  }                                           9: x = a [ i ]
}                                            10: t3 = a [ j ]
                                             11: a [ i ] = t3
                                             12: a [ j ] = x
                                             13: goto 1
                                             14:
A MODEL OF A COMPILER FRONT END

character stream → Lexical Analyzer → (token stream) → Parser → (syntax tree) → Intermediate Code Generator → three-address code

(A Symbol Table is shared by all of these phases.)
INTERMEDIATE CODE GENERATION
The translation of the parse tree into an intermediate form, based on the syntax of the input language, is often referred to as Syntax Directed Translation.

POSITION OF INTERMEDIATE CODE GENERATOR

Parser → (syntax tree) → Intermediate Code Generator → (intermediate code) → Code Generator
NEED FOR INTERMEDIATE CODE
MANY COMPILERS ARE NEEDED

Java     → JVM
C        → Intel Pentium
.Net     → IBM Cell
FORTRAN  → Motorola processor
...      → ...
NEED FOR INTERMEDIATE CODE
• Suppose we have n source languages and m target languages.
• For each source-target pair we will need a compiler.
• Hence we need n × m compilers.
• If we use intermediate code, we will require n compilers (front ends) to convert each source language into intermediate code and m compilers (back ends) to convert intermediate code into the m target languages. Thus we require only n + m compilers.
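For example, with n = 3 source languages and m = 4 target machines, direct translation needs 3 × 4 = 12 compilers, whereas going through an intermediate code needs only 3 + 4 = 7 front and back ends.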
USE OF INTERMEDIATE CODE

Java, C, .Net, FORTRAN, ...  →  Intermediate Code (where optimization is applied)  →  JVM, Intel Pentium, IBM Cell, Motorola processor, ...
Example:
Consider a C language compiler that generates machine instructions for an 80X86 processor. If this compiler is to be modified to generate machine instructions for a SPARC processor based system, then two approaches can be taken:

i)  Two separate compilers:
      C Program → C Compiler for 80X86 System → machine instructions for 80X86 system
      C Program → C Compiler for SPARC System → machine instructions for SPARC system

ii) A shared front end with two back ends:
      C Program → Compiler Front End → intermediate code → Compiler Back End for 80X86 → machine instructions for 80X86 system
                                       intermediate code → Compiler Back End for SPARC → machine instructions for SPARC system
Benefits of using intermediate code
1. Retargeting is facilitated; a compiler for a different machine
can be created by attaching a Back-end for the new machine to
an existing Front-end.

2. A machine Independent Code-Optimizer can be applied to the


Intermediate code.

3. Intermediate code is simple enough to be easily converted to


any target code.

SYNTAX-DIRECTED TRANSLATION
Two notations (approaches) for associating semantic rules with productions:

• Syntax-directed definitions
  – Build up a translation by attaching strings (semantic rules) as attributes to the nodes in the parse tree.
  – Hide implementation details.
  – Free the user from having to explicitly specify the order of translations.

• Syntax-directed translation schemes
  – Indicate the order in which semantic rules are to be evaluated.
  – Allow some implementation details to be shown.
SYNTAX-DIRECTED DEFINITIONS
• A syntax-directed definition associates:
  – with each grammar symbol (terminals and nonterminals), a set of attributes [e.g. if E is a non-terminal: E.val, E.type];
  – with each production, a set of semantic rules for computing the values of the attributes associated with the symbols appearing in the production.
• An attribute is said to be:
  – Synthesized, if its value at a parse-tree node is determined from attribute values at its children and at the node itself.
  – Inherited, if its value at a parse-tree node is determined from attribute values at the node itself, its parent, and its siblings in the parse tree.
EXAMPLE: SYNTHESIZED ATTRIBUTES
• An annotated parse tree:
  – Suppose a node N in a parse tree is labeled by grammar symbol X.
  – Then X.a denotes the value of attribute a of X at node N.

Annotated parse tree for 9-5+2:

      expr.t = "9-5+2"
        expr.t = "9-5"       +      term.t = "2"
          expr.t = "9"   -   term.t = "5"          2
            term.t = "9"           5
              9
SYNTHESIZED ATTRIBUTES

• An SDD that uses synthesized attributes exclusively is said to be an S-attributed definition.
• Evaluating the semantic rules for the attributes at each node is bottom-up (leaves to the root).
• Example: annotated parse tree for 3 * 5 + 4n.
• Consider the production T → T1 * F:

      PRODUCTION        Semantic rule
      T → T1 * F        T.val := T1.val × F.val
ANNOTATED PARSE TREE FOR 3*5+4N
INHERITED ATTRIBUTES
• An inherited attribute is one whose value at a node in a parse tree is defined in terms of attributes at the parent and/or siblings of that node.
• It is convenient for expressing the dependencies of a programming-language construct on the context in which it appears.
DEPENDENCY GRAPHS
 Directed graph showing the dependencies between attributes at various nodes in
the parse tree.

 If an attribute b at a node in a parse tree depends on an attribute c, then the


semantic rule for b at that node must be evaluated after the semantic rule that
defines c.

 E.Val is synthesized from E1.val and E2.val


DEPENDENCY GRAPH FOR PARSE TREE
METHODS FOR EVALUATING SEMANTIC RULES

1. Parse-tree methods – at compile time, obtain an evaluation order from a topological sort of the dependency graph (this fails only if the dependency graph has a cycle).

2. Rule-based methods – at compiler-construction time, the semantic rules are analyzed by a specialized tool.

3. Oblivious methods – the evaluation order is chosen without considering the semantic rules.
A SYNTAX-DIRECTED DEFINITION FOR CONSTRUCTING
SYNTAX TREE
DIRECTED ACYCLIC GRAPHS (DAG) FOR EXPRESSIONS
 DAG for an expression identifies the common sub-expressions in the expression.

 A DAG has a node for every sub-expression of the expression.

 Interior node represents an operator and its children represent its operands.
 Difference between DAG and syntax tree
1. DAG – node in a common sub-expressions has more than one parent
2. Syntax tree - Common sub-expression would be represented as a duplicated
subtree.
Dag for a+a*(b-c)+(b-c)*d
INTERMEDIATE LANGUAGES

• Syntax trees
• Postfix notation
• Three address code

THREE FORMS OF INTERMEDIATE CODE
• Abstract syntax trees (e.g. a do-while node whose body is assign(i, +(i,1)) and whose condition is <([](a,i), v))
• Three-address instructions:
      1: i = i + 1
      2: t1 = a [ i ]
      3: if t1 < v goto 1
• Postfix notation:   a*b-c  :  ab*c-
1. SYNTAX TREES
• Graphical representation.
• Depicts the natural hierarchical structure of the program.
• A DAG (Directed Acyclic Graph) gives the same information but in a more compact way; common sub-expressions are identified.

Graphical representation of a := b * -c + b * -c:

SYNTAX TREE                               DAG
assign(a, +( *(b, uminus(c)),             assign(a, +( t, t ))
             *(b, uminus(c)) ))             where t = *(b, uminus(c)) is a shared node
2. POSTFIX NOTATION
• Linearized representation of the syntax tree.
• a+b                       Postfix: ab+
• a := b * -c + b * -c      Postfix: ?   (exercise)
3. THREE ADDRESS CODE:  a = b * -c + b * -c

The syntax tree assign(a, +( *(b, uminus(c)), *(b, uminus(c)) )) yields the TAC:
      t1 = uminus(c)
      t2 = b * t1
      t3 = uminus(c)
      t4 = b * t3
      t5 = t2 + t4
      a  = t5

• Internal nodes are given temporary names (t1, t2, ...).
EXAMPLE THREE ADDRESS CODE

x = y op z
3 addresses :: 2 operands + 1 result
where x, y and z are names, constants or compiler-generated temporaries, and op stands for any operator.
CONSTRUCTION OF SYNTAX TREES

Three functions used


mknode()
mkunode()
mkleaf()
CONSTRUCTION OF SYNTAX TREES
Syntax directed definition for assignment stmt:
Production Semantic Rule

S → id=E S.nptr := mknode(‘assign’,


mkleaf(id,id.place),E.nptr)
E → E1+E2 E.nptr := mknode(‘+’,E1.nptr, E2.nptr)

E → E1*E2 E.nptr := mknode(‘*’,E1.nptr, E2.nptr)

E → –E1 E.nptr := mkunode(‘uminus’,E1.nptr)

E→ (E1) E.nptr := E1.nptr

E → id E.nptr := mkleaf(id,id.place)

Attribute place points to the symbol table entry for the identifier.

CONSTRUCTION OF DAGS
 The same syntax directed definition will produce DAGs

 The functions mkunode() and mknode() returns a


pointer to an existing node whenever possible instead
of constructing new nodes.
TWO REPRESENTATIONS OF SYNTAX TREES
• Method 1:
  Each node is represented as a record with a field for its operator and additional fields for pointers to its children.

  a := b * -c + b * -c
  (Figure: the record-based tree assign(id a, +( *(id b, uminus(id c)), *(id b, uminus(id c)) )).)
 Method 2:
Nodes are allocated from an array of records and the index or
position of the node serves as a pointer to the node.
All the nodes in the syntax tree can be visited by following
pointers.

0 id b
a:= b * -c + b * -c
1 id c
2 uminus 1
3 * 0 2
4 id b
5 id c
6 uminus 5
7 * 4 6
8 + 3 7
9 id a
10 assign 9 8
THREE ADDRESS CODE
Format: x = y op z
3 addresses :: 2 operands + 1 result

Eg: x+y*z must be translated into a sequence of


Three address statements

t1:= y*z
t2:= x+t1

t1 & t2 are compiler generated temporary


names

Exercise : write 3address code for : a+b*c


TYPES OF THREE ADDRESS
STATEMENTS
i) Assignment statements
ii) Assignment instructions
iii) Copy statements
iv) Unconditional jumps
v) Conditional jumps
vi) Param x, call p,n & return y
vii) Indexed Assignments
viii) Address and Pointer Assignments
1. Assignment statements:
x := y op z
op is a binary arithmetic or logical operation
2. Assignment instructions:
x := op y
op is an unary operation
- unary minus ( -)
- logical negation (!)
- shift operators ( << left shift, >> right shift)
- conversion operators ( eg. int to float)
3. Copy Statements:
x := y
value of y is assigned to x
4. Unconditional jumps:
      goto L
5. Conditional jumps:
      if x relop y goto L
      relop: <, >, <=, >=, !=, ==
   If x stands in the given relation to y, the statement with label L is executed next; otherwise the statement immediately following "if x relop y goto L" is executed.
6. param x, call p,n and return y:
   For procedure calls p(x1, x2, ..., xn):
      param x1
      param x2
      ...
      param xn
      call p,n
   n indicates the number of actual parameters in "call p,n".
7. Indexed Assignments:
x := y[i]
x[i] := y

8. Address and Pointer Assignments:


x := &y
x := *y
*x := y
IMPLEMENTATION OF THREE
ADDRESS STATEMENTS

 Quadruples
 Triples

 Indirect Triples
QUADRUPLES
 A quadruple is a record structure with 4 fields :
op, arg1, arg2, result
 op represents the operator.
 Eg. The 3 addr stmt x=y op z is represented as
op arg1 arg2 result
op y z x
 Statements with unary operators do not use arg2

 Operator param do not use arg2 and result

 Jump stmts put the target label in result

 Exercise : write the quadruple notation of a := b*-c + b*-c


QUADRUPLE NOTATION OF A:=B*-C+B*-C

op arg1 arg2 result


uminus c t1
* b t1 t2
uminus c t3
* b t3 t4
+ t2 t4 t5
assign t5 a
The temporaries are entered into the symbol table
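A sketch of how such quadruples can be emitted (Python; newtemp and emit mirror the helper names used in the translation schemes later in this unit, while the tuple encoding of the syntax tree is our own):

# Emitting quadruples for a := b * -c + b * -c (sketch).
# A node is an identifier string or a tuple (op, left, right).
quads, ntemps = [], 0

def newtemp():
    global ntemps
    ntemps += 1
    return "t%d" % ntemps

def emit(op, arg1, arg2, result):
    quads.append((op, arg1, arg2, result))

def gen(node):
    """Return the place (name or temporary) holding the node's value."""
    if isinstance(node, str):
        return node
    op, left, right = node
    if op == "uminus":                       # unary operator: arg2 unused
        l = gen(left); t = newtemp()
        emit("uminus", l, "", t); return t
    l = gen(left); r = gen(right); t = newtemp()
    emit(op, l, r, t); return t

expr = ("+", ("*", "b", ("uminus", "c", None)),
             ("*", "b", ("uminus", "c", None)))
emit("assign", gen(expr), "", "a")
for q in quads:                              # prints the six quadruples above
    print(q)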
TRIPLES
• A triple is a record structure with 3 fields: op, arg1 and arg2.
• Fields arg1 and arg2 are pointers either to symbol table entries or to triples themselves.
• Avoids entering temporaries into the symbol table.
TRIPLE NOTATION OF A:=B*-C+B*-C

op arg1 arg2
(0) uminus c
(1) * b (0)
(2) uminus c
(3) * b (2)
(4) + (1) (3)
(5) assign a (4)

Temporaries are not generated in triple notation


TRIPLE NOTATION FOR X[I] := Y
op arg1 Arg2

(0) []= x i
(1) = (0) y

Triple notation for x := y[i]


op arg1 Arg2

(0) =[] y i
(1) = x (0)
INDIRECT TRIPLES
• Listing pointers to triples instead of the triples themselves.

a := b*-c + b*-c

stmt               op       arg1    arg2
(0)  (10)   (10)   uminus   c
(1)  (11)   (11)   *        b       (10)
(2)  (12)   (12)   uminus   c
(3)  (13)   (13)   *        b       (12)
(4)  (14)   (14)   +        (11)    (13)
(5)  (15)   (15)   assign   a       (14)
COMPARISON
 Quadruples have the limitation of more
temporary variables

 Triples avoids temporaries and saves memory,


but optimization is hard

 Indirect triples are best for code movement in


optimization and also avoids temporaries
SYNTAX DIRECTED TRANSLATION SCHEME

DECLARATIONS
• Entries are made into the symbol table for type and relative address.
• Declarations in a procedure; the productions of the translation scheme are:

P → D
D → D ; D
D → id : T
T → integer
T → real
T → array [ num ] of T1
T → ↑ T1
DECLARATIONS
productions Translation scheme
P → D {offset : = 0}
D →D ;D
D→ id :T {
enter(id.name,T.type,offset);
offset := offset+T.width
}
T→ integer { T.type := integer; T.width:=4
}
T→ real { T.type := real; T.width:=8 }

T→array[num]of { T.type:=array(num.val,T1.type);
T1 T.width:=num.val X T1.width }

T→ ↑T1
{ T.type:=pointer(T1.type); T.width:=4; }
DECLARATIONS IN NESTED
PROCEDURES

 Functions enter(), enterproc() and mktable() are


used.
 Symbol table entries are maintained for nested
procedures
SYMBOL TABLE FOR NESTED PROCEDURES
EG. QUICK SORT ALGORITHM
ASSIGNMENT STATEMENTS
productions Translation scheme

S →id := E {p:= lookup(id.name);


if p≠ nil then
emit(p’:=‘E.place)
else error}
E →E1+E2 {E.place := newtemp;
emit(E.place’:=’E1.place+E2.place)}
E →E1*E2 {E.place := newtemp;
emit(E.place’:=’E1.place*E2.place)}
E →-E1 {E.place := newtemp;
emit(E.place’:=’ ’uminus’E1.place)}
E →(E1) E.place := E1.place
E →id {p:= lookup(id.name);
if p≠ nil then
E.place:=p
else error}
SYNTAX DIRECTED DEFINITION TO PRODUCE 3 ADDRESS
CODE FOR ASSIGNMENTS
ADDRESSING ARRAY ELEMENTS

 base + (i – low) x w
 The expression can be partially evaluated at
compile time if it is rewritten as
i X w + (base – low x w)
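For instance, taking illustrative values base = 1000, low = 1 and w = 4, the address of a[i] is i × 4 + (1000 − 1 × 4) = 4i + 996; the parenthesized part is a constant that can be computed once at compile time.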
BOOLEAN EXPRESSIONS
1. Numerical representation (short circuit or jumping
code)
2. Control flow translation of boolean expressions

E → E or E | E and E | not E | (E) | id relop id | true


| false
1. NUMERICAL REPRESENTATION
true : 1
false : 0
Eg 1:
a or b and not c

The 3 address sequence is


t1: =not c
t2:= b and t1
t3:= a or t2

Eg 2: a<b or c<d and e<f


PRECEDENCE OF BOOLEAN OPERATORS

highest not
and
least or
THREE ADDRESS CODE FOR
IF A<B THEN 1 ELSE 0

100 : if a<b then goto 103


101: t1=0
102: goto 104
103: t1=1
104:
A<B OR C<D AND E<F
100: if a<b goto 103
101: t1 := 0
102: goto 104
103: t1 := 1
104: if c<d goto 107
105: t2 := 0
106: goto 108
107: t2 := 1
108: if e<f goto 111
109: t3 := 0
110: goto 112
111: t3 := 1
112: t4 := t2 and t3
113: t5 := t1 or t4
TRANSLATION SCHEME USING NUMERICAL REPRESENTATION

newtemp  : returns a new temporary variable
emit()   : places a 3-address statement into an output file
nextstat : gives the address of the next 3-address statement to be generated
TRANSLATION SCHEME FOR BOOLEAN
EXPRESSION
productions Translation scheme
E →E1 or E2 {E.place = newtemp;
emit(E.place’:=’ E1.place ‘or’ E2.place)}
E →E1 and E2 {E.place = newtemp;
emit(E.place’:=’ E1.place ‘and’ E2.place)}
E →not E1 {E.place := newtemp;
emit(E.place’:=’ ‘ not‘ E1.place)}
E →(E1) E.place := E1.place

E →id relop {E.place := newtemp;


id emit (‘if’ id1.place ‘relop.op’ id2.place ‘goto’
nextstat+3)
emit(E.place ‘:= ‘ 0)
emit(‘goto’ nextstat+2)
emit(E.place ‘:=’ 1)}
E →true {E.place = newtemp;
emit(E.place’:=’ 1)}
E →false {E.place = newtemp;
emit(E.place’:=’ 0)}
FLOW OF CONTROL STATEMENTS
S → if E then S1 |
if E then S1 else S2 |
while E do S1
WHILE STATEMENT


Semantic rules:
S → while E do S1
   { S.begin = newlabel;
     E.true = newlabel;
     E.false = S.next;
     S1.next = S.begin;
     S.code = gen(S.begin ':') || E.code || gen(E.true ':') || S1.code || gen('goto' S.begin) }
SYNTAX DIRECTED DEFINITION FOR FLOW OF
CONTROL STATEMENTS
productions Syntax directed definition

S → if E then S1 E.true := newlabel


E.false:= S.next
S1.next:= S.next
S.code :=E.code || gen(E.true ‘:’ ) || S1.code
S → if E then S1 else S2
      E.true := newlabel
      E.false := newlabel
      S1.next := S.next
      S2.next := S.next
      S.code := E.code || gen(E.true ':') || S1.code || gen('goto' S.next) || gen(E.false ':') || S2.code
S→ while E do S1 S.begin := newlabel
E.true := newlabel
E.false:= S.next
S1.next:= S.begin
S.code:=gen(S.begin ‘:’ ) || E.code ||
gen(E.true ‘:’) || S1.code || gen (‘goto’
S.begin)
TRANSLATE INTO 3 ADDR CODE
while a<b
{
if c<d then
x=y+z
else x=y-z
}
USING THE ABOVE RULES AND ASSIGNMENT STATEMENTS WE GET THE
3 ADDRESS CODE AS:

L1: IF A < B GOTO L2


GOTO LNEXT
L2: IF C < D GOTO L3
GOTO L4
L3: T1 = Y +Z
X = T1
GOTO L1
L4: T2 = Y – Z
X = T2
GOTO L1
LNEXT:
CASE STATEMENTS
Selector expression
n constant values
Default value

switch expression
begin
case value :


statement


case value:
statement

case value:
statement
default : statement
end
SYNTAX DIRECTED TRANSLATION OF CASE STATEMENTS

switch E                       code to evaluate E into t
begin                          goto test
case V1: S1                L1: code for S1
case V2: S2                    goto next
...                        L2: code for S2
case Vn-1: Sn-1                goto next
default: Sn                    ...
end                        Ln: code for Sn
                         test: if t = V1 goto L1
                               if t = V2 goto L2
                               ...
                               if t = Vn-1 goto Ln-1
                               goto Ln
                         next:
BACKPATCHING
• Branch statements must include a target address.
• Problem: when there is a forward jump in the code, the target is not yet known.
• Leave the target label unspecified in the first pass.
• In a second pass, fill in the target label.
3 FUNCTIONS USED

 makelist(i) : creates a new list containing


only i, an index into the array of
quadruples and returns pointer to the list
it has made.
 merge(i,j) – concatenates the lists pointed
to by i and j ,and returns a pointer to the
concatenated list.
 backpatch(p,i) – inserts i as the target label
for each of the statements on the list
pointed to by p.
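The three functions are tiny. A sketch (Python; quadruples are held in a list, an unfilled target is None, and indices here start at 0 while the slides number quadruples from 100):

# Backpatching helpers over a quadruple array (sketch).
quads = []                          # each quad: [text, target]; None = unfilled

def nextquad():
    return len(quads)               # index of the next quadruple to be generated

def emit(text):
    quads.append([text, None])

def makelist(i):
    return [i]                      # new list containing only quad index i

def merge(l1, l2):
    return l1 + l2                  # concatenation of the two lists

def backpatch(lst, target):
    for i in lst:                   # insert target as the label of each quad on lst
        quads[i][1] = target

# a < b or c < d  (using the relop and "or" rules shown below):
t1 = makelist(nextquad()); emit("if a<b goto")
f1 = makelist(nextquad()); emit("goto")
m = nextquad()                      # marker M.quad
t2 = makelist(nextquad()); emit("if c<d goto")
f2 = makelist(nextquad()); emit("goto")
backpatch(f1, m)                    # E -> E1 or M E2
truelist, falselist = merge(t1, t2), f2
for i, (text, target) in enumerate(quads):
    print(i, text, target)          # quad 1 ("goto") now targets quad 2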
 Let’s now try to construct the translation scheme
for Boolean expression.
 Let the grammar be:
E → E1 or ME2
E → E1 and ME2
E → not E1
E → (E1)
E → id1 relop id2
E → false
E → true
M→ε
MARKER NON-TERMINAL M
 Marker non-terminal M into the grammar to
cause a semantic action to pick up, at appropriate
times the index of the next quadruple to be
generated.
 This is done by the semantic action:
{ M.Quad = nextquad } for the rule
M→ε
SYNTHESIZED ATTRIBUTES
TRUELIST AND FALSELIST
 To generate jumping code for Boolean
expressions.
 E.truelist : Contains the list of all the
jump statements left incomplete to be
filled by the label for the start of the code
for E=true.
 E.falselist : Contains the list of all the
jump statements left incomplete to be
filled by the label for the start of the code
for E=false.
 The variable nextquad holds the index of
the next quadruple to follow.
 This value will be backpatched onto
E1.truelist in case of E → E1 and ME2
where it contains the address of the first
statement of E2.code.
 This value will be backpatched onto
E1.falselist in case of E → E1 or ME2
where it contains the address of the first
statement of E2.code.
SEMANTIC ACTIONS
 1)E → E1 or M E2
backpatch(E1.falselist, M.quad)
E.truelist = merge(E1.truelist,
E2.truelist)
E.falselist = E2.falselist
2) E → E1 and M E2
backpatch(E1.truelist, M.quad)
E.truelist = E2.truelist
E.falselist = merge(E1.falselist,
E2.falselist)
3) E → not E1
E.truelist = E1.falselist
E.falselist = E1.truelist
4) E → (E1)
E.truelist = E1.truelist
E.falselist = E1.falselist
5) E → id1 relop id2
E.truelist = makelist(nextquad)
E.falselist = makelist(nextquad +1 )
emit(if id1.place relop id2.place goto __ )
emit(goto ___)
6) E → true
E.truelist = makelist(nextquad)
emit(goto ___)
7) E → false
E.falselist = makelist(nextquad)
emit(goto ___)
8) M → ε
M.Quad = nextquad
EXAMPLES
a<b or c<d and e<f

100 if a<b goto ___


101 goto ___
Marker M records next quad as 102
102 if c<d goto ___
103 goto ___
Marker M records next quad as 104
104 if e<f goto ___
105 goto ___
BOTTOM UP EVALUATION OF S-ATTRIBUTED DEFINITIONS

• A translator for an S-attributed definition uses an LR parser.
• Stack: a pair of arrays, state and val:

      state    val
       ...     ...
        X      X.x
        Y      Y.y
        Z      Z.z    <- top
       ...     ...

      (parser stack)
MOVES MADE BY TRANSLATOR ON INPUT
3*5+4N
TYPE CHECKING
 A compiler must check that the source program
follows both syntactic and semantic conventions
of the source language.
TYPE CHECKING
 Semantic checks

 Static and Dynamic

 Static : done during compilation

 Dynamic : done during run-time


STATIC CHECKS

Type checks
Flow of control checks
Uniqueness checks
Name related checks
TYPE SYSTEMS
 A set of rules for associating a type expressions to
various parts of a program

 Type checker : checks for type errors

 Strongly typed: if every program executes


without type errors

• Pascal:
      Basic types:        boolean, character, integer, real
      Constructed types:  arrays, records, sets, pointers, functions
TYPE EXPRESSIONS

 A basic type is a type expression


(integer, boolean, character, real, type_error)
 A type name is a type expression

 A type constructor is a type expression (Arrays,


products, records, pointers, functions)
• Tree and DAG for char × char → pointer(integer)
SPECIFICATION OF A SIMPLE TYPE
CHECKER

 Type of each identifier must be declared before the


identifier is used.

 To build a translation scheme to synthesize the type of


every expression from its sub-expressions.

 The type checker can handle arrays, pointers,


statements and functions.
A SIMPLE LANGUAGE
 Here is a Pascal-like grammar for a sequence of declarations (D)
followed by an expression (E)

P→D;E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
TRANSLATION SCHEME FOR DECLARATIONS
 P→D;E
 D→D;D
 D → id : T { addtype(id.entry, T.type) }
 T → char { T.type := char }
 T → integer { T.type := integer }
 T → ↑T1 { T.type := pointer(T1.type) }
 T → array [ num ] of T1
 { T.type := array(1 .. num.val, T1.type) }

Try to derive the annotated parse tree for the


declaration X: array[100] of ↑ char
TYPE CHECKING FOR EXPRESSIONS
Once the identifiers and their types have been inserted into the symbol table,
we can check the type of the elements of an expression:

 E → literal { E.type := char }

 E → num { E.type := integer }

 E → id { E.type := lookup(id.entry) }

 E → E1 mod E2 { if E1.type =integer and E2.type = integer

 then E.type := integer

 else E.type := type_error }

 E → E 1 [ E2 ] { if E2.type = integer and E1.type = array(s,t)

 then E.type := t else E.type := type_error }

 E → E1↑ { if E1.type = pointer(t)

 then E.type := t else E.type := type-error }


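These rules are a structural recursion over the expression. A sketch (Python; the encoding of types as strings/tuples and the dict standing in for lookup are ours):

# Type checking expressions (sketch); types are encoded as
# "char" / "integer" / ("array", size, t) / ("pointer", t).
symtab = {"x": ("array", 100, ("pointer", "char"))}   # X: array[100] of ^char

def typeof(e):
    kind = e[0]
    if kind == "literal": return "char"
    if kind == "num":     return "integer"
    if kind == "id":      return symtab.get(e[1], "type_error")
    if kind == "mod":     # E -> E1 mod E2
        return ("integer" if typeof(e[1]) == typeof(e[2]) == "integer"
                else "type_error")
    if kind == "index":   # E -> E1 [ E2 ]
        t1, t2 = typeof(e[1]), typeof(e[2])
        if t2 == "integer" and isinstance(t1, tuple) and t1[0] == "array":
            return t1[2]
        return "type_error"
    if kind == "deref":   # E -> E1 ^
        t1 = typeof(e[1])
        return t1[1] if isinstance(t1, tuple) and t1[0] == "pointer" else "type_error"
    return "type_error"

print(typeof(("deref", ("index", ("id", "x"), ("num", 3)))))   # char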
TYPE CONVERSION

 Suppose we encounter an expression x+i where x has type float and i has type

int.

 CPU instructions for addition could take EITHER float OR int as operands, but not
a mix.

 This means the compiler must sometimes convert the


operands of arithmetic expressions to ensure that
operands are consistent with operators

 POSTFIX : x i inttoreal real+


 where real+ is the floating point addition operation.

TYPE COERCION

 IMPLICIT (COERCION)

If type conversion is done by the compiler without the programmer


requesting it, it is called IMPLICIT conversion or type COERCION.

 EXPLICIT

Conversions are defined by the programmer.

x = (int)y * 2;

APPLICATIONS
 Java
WORKFLOW AS DAG
UNIT IV
RUN TIME ENVIRONMENTS AND CODE
GENERATION

Storage Organization, Stack Allocation Space,

Access to Non-local Data on the Stack, Heap

Management - Issues in Code Generation -

Design of a simple Code Generator.


RUN-TIME ENVIRONMENTS:

 Source Language Issues

 Storage Organization

 Storage Allocation Strategies

 Parameter Passing

 Symbol Tables
SOURCE LANGUAGE ISSUES
1. A program is made up of procedures.

2. A procedure definition is a declaration that associates an


identifier with a statement.
• Identifier is a procedure name
• The statement is the procedure body

3. Formal and actual parameters


int max(int x, int y)
{ ……..}
-------
max(m,n)
SOURCE LANGUAGE ISSUES

 Procedures
 Activation trees

 Control stack

 Scope of declaration

 Binding of names
PROCEDURE ACTIVATION FOR QUICKSORT()
ACTIVATION TREES
 Assumptions about flow of control among
procedures during execution of a program:
 Control flows sequentially
 Each execution of a procedure starts at the beginning of the
procedure body and eventually returns to the point
immediately following the place where the procedure was
called.

• Each execution of a procedure body is referred to as an activation of the procedure.

• The lifetime of an activation of a procedure P is the sequence of steps between the first and last steps in the execution of the procedure body, including time spent executing procedures called by P, the procedures called by them, and so on.
ACTIVATION TREES . . .
• If a and b are procedure activations, then their lifetimes are either non-overlapping or nested.
 A procedure is nested if a new activation can


begin before an earlier activation of the same
procedure has ended.
ACTIVATION TREE FOR QUICK SORT
CONTROL STACKS
 The flow of control in a program corresponds to a
depth-first-traversal of the activation tree that
starts at the root, visits a node before its children, and
recursively visits children at each node in a left-to-
right order.

 We can use a stack, called control stack to keep track


of live procedure activations.

 The idea is to push the node for an activation onto the


control stack as the activation begins and pop the
node when the activation ends.
• Thus the contents of the stack are related to paths to the root of the activation tree.
• When node n is at the top of the control stack, the stack contains the nodes along the path from n to the root.
CONTROL STACK
 Flow of control
 Control stack keeps track of live procedure
activations

q(1,9)

q(1,3)

q(2,3)
SCOPE OF DECLARATION

 The portion of the program to which a declaration


applies is called scope of that declaration.

 An occurrence of a name within a procedure is


said to be local

 Occurrence of a name outside a procedure is said


to be non-local.
BINDING OF NAMES
 Data object corresponds to a storage location that can hold
values.

 The term environment refers to a function that maps a name


to a storage location.

 The term state refers to a function that maps a storage


location to the value held there.

 An assignment changes the state but not the environment.

              environment            state
      name  ─────────────→  storage  ──────→  value
STORAGE ORGANIZATION
 Fixed size objects can be placed
in predefined locations

 Run time stack and heap

 The STACK is used to store:


 Procedure activations.
 The status of the machine just
before calling a procedure, so
that the status can be restored
when the called procedure
returns.
 The HEAP stores data
allocated under program
control
ACTIVATION RECORDS
 Any information needed for a single activation of
a procedure is stored in the ACTIVATION
RECORD

 Activation record gets pushed for each procedure


call and popped for each procedure return.
ACTIVATION RECORDS
STORAGE ALLOCATION

 Static Allocation

 Stack Allocation

 Heap Allocation
THREE KINDS OF MEMORY
 Fixed memory
 Stack memory

 Heap memory
FIXED ADDRESS MEMORY
 Executable code
 Global variables

 Constant structures that don’t fit inside a


machine instruction. (constant arrays, strings,
floating points, long integers etc.)
 Static variables.

 Subroutine local variable in non-recursive


languages (e.g. early FORTRAN).
STACK MEMORY
 Local variables for functions, whose size
can be determined at call time.
 Information saved at function call and
restored at function return:
 Values of callee arguments
 Register values:
 Return address (value of PC)
 Frame pointer (value of FP)

 Other registers

 Static link (to be discussed)


HEAP MEMORY
 Structures whose size varies dynamically (e.g.
variable length arrays or strings).
 Structures that are allocated dynamically (e.g.
records in a linked list).
 Structures created by a function call that must
survive after the call returns.
Issues:
 Allocation and free space management
 Deallocation / garbage collection
STATIC ALLOCATION

 Program code is statically allocated in most


implementations of imperative languages
 Statically allocated variables are history
sensitive
 Global variables keep state during entire program
lifetime
 Static local variables in C/C++ functions keep state
across function invocations
 Static data members are “shared” by objects and
keep state during program lifetime
• The advantage of a statically allocated object is fast access, due to absolute addressing of the object.
STATIC ALLOCATION
 Statically allocated names are bound to storage
at compile time

 Static data members are “shared” by objects and keep


state during program lifetime
STORAGE ALLOCATION IN FORTRAN
STATIC ALLOCATION IN FORTRAN
(Figure: the code area holds the code for PRDUCE and the code for CNSUME. The static data area holds the activation record for CNSUME (CHARACTER*50 BUFFER, INTEGER NEXT, CHARACTER C) and the activation record for PRDUCE (CHARACTER*50 BUFFER, INTEGER NEXT).)
LIMITATIONS
 Size of data objects must be known at compile
time
 Recursive procedures are restricted

 Dynamic data structures cannot be created

 eg Fortran
FORTRAN
.f, .FOR, .for, .f77, .f90 and .f95

f95 hello.f90 -o hello.exe


STACK ALLOCATION

 Idea of Control Stack

 Activation records are pushed and popped as


activation begins and ends

 Locals are bound to fresh storage locations as


new activation begins

 Values of locals are deleted as activation ends

 Example : Quicksort()
CALLING SEQUENCE
• Allocates the activation record.
• Enters information into its fields.

RETURN SEQUENCE
• Restores the state of the machine so that the calling procedure can continue execution.
ACTIVATION RECORDS ON THE STACK
(Figure: each snapshot pairs the position in the activation tree with the stack contents as quicksort runs. After s begins, the stack holds the record for s (a: array). After r is called: s (a: array), r (i: integer). After q(1,9): s (a: array), q(1,9) (i: integer). After q(1,9) calls q(1,3): s, q(1,9), q(1,3), each q with its own i: integer, and so on down to q(1,0).)
LIMITATION OF STACK ALLOCATION
 Dangling references: references to storage that has been deallocated.
Eg:
int *dangle()
{
    int i = 23;     /* i lives in dangle's activation record */
    return &i;      /* that record is popped on return       */
}

main()
{
    int *p;
    p = dangle();   /* p now refers to deallocated stack storage */
}
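One way to avoid the dangling reference (a sketch, not from the slides) is to place the value on the heap, so that it survives the return:

#include <stdlib.h>

int *no_dangle(void)
{
    int *ip = malloc(sizeof *ip);  /* heap storage outlives the call */
    if (ip != NULL)
        *ip = 23;
    return ip;                     /* caller must free(ip) later */
}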
HEAP ALLOCATION
 Allocation is done according to need.
 Storage can be de-allocated in any order.
 The record for an activation can be retained even after the activation ends.
 When released, the memory is marked as free space.

(Figure: position in the activation tree vs. activation records on the heap —
 s calls r and q(1,9); the heap holds records for s, r(1,9), and q(1,9), each with a control link.
 The activation record of r is retained even after its execution is over.)
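A minimal C sketch of program-controlled heap allocation and de-allocation (illustrative only):

#include <stdlib.h>

struct node { int key; struct node *next; };

int main(void)
{
    /* allocated according to need */
    struct node *n1 = malloc(sizeof *n1);
    struct node *n2 = malloc(sizeof *n2);
    if (n1 && n2) {
        n1->key = 1; n1->next = n2;
        n2->key = 2; n2->next = NULL;
    }
    /* de-allocated in any order; freed memory becomes free space */
    free(n1);
    free(n2);
    return 0;
}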
ACCESS TO NON-LOCAL NAMES
 A program consists of many blocks.
 Identifier names declared within a block are local to it; names used in a block but declared outside it are non-local.
 Syntax:
{
    // statements
}
EXAMPLE
{   int a = 0;            /* block B0 */
    int b = 0;
    {   int b = 1;        /* block B1 */
        {   int a = 2;    /* block B2 */
        }
        {   int b = 3;
        }
    }
}
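Under the most-closely-nested rule, each use of a name refers to the innermost enclosing declaration. A runnable variant of the example (the printf calls are added here for illustration):

#include <stdio.h>

int main(void)
{
    int a = 0;
    int b = 0;
    {
        int b = 1;
        {
            int a = 2;
            printf("%d %d\n", a, b);   /* 2 1 : inner a, middle b */
        }
        {
            int b = 3;
            printf("%d %d\n", a, b);   /* 0 3 : outer a, inner b  */
        }
    }
    printf("%d %d\n", a, b);           /* 0 0 : outer a and b     */
    return 0;
}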
PARAMETER PASSING
 Call by value
 Call by address
 Call by copy-restore
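A quick C contrast of the first two mechanisms (an illustrative sketch; the function names are invented):

#include <stdio.h>

void inc_by_value(int x)    { x = x + 1; }    /* callee works on a copy     */
void inc_by_address(int *x) { *x = *x + 1; }  /* callee works on the caller's cell */

int main(void)
{
    int a = 5;
    inc_by_value(a);    printf("%d\n", a);    /* 5 : caller's a unchanged */
    inc_by_address(&a); printf("%d\n", a);    /* 6 : caller's a updated   */
    return 0;
}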
SYMBOL TABLES
 Stores the symbols of the source program as the compiler encounters them.
 Each entry contains the symbol name plus a number of attributes describing what is known about the symbol (e.g. type, value, address).
 Reserved words (if, then, else, etc.) may be stored in the symbol table as well.
SYMBOL TABLE
 As a minimum we must be able to:
   INSERT a new symbol into the table;
   RETRIEVE a symbol so that its attributes may be read and/or modified;
   Query to find out whether a symbol is already in the table.
 Each entry can be implemented as a record.
SYMBOL TABLE DATA STRUCTURE
 One linear list:
   Easy to implement.
   Search time will be very long if the source has many symbols.
SYMBOL TABLE DATA STRUCTURE
 Hash table:
   Run the symbol name through a hash function to create an index into a table.
   If some other symbol has already claimed the slot, rehash with another hash function to get another index, and so on.
   The hash table must be large enough to accommodate the largest number of symbols.
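For illustration only, a toy hashed symbol table in C. The slide describes open addressing with rehashing; this sketch uses chaining instead, and names such as st_insert are invented:

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211                    /* a prime size spreads hash values */

struct entry {                            /* one record per symbol */
    char *name;
    char *type;                           /* attribute: type; add value, address, ... */
    struct entry *next;                   /* chain of entries sharing an index */
};

static struct entry *table[TABLE_SIZE];

static unsigned hash(const char *s)       /* simple multiplicative string hash */
{
    unsigned h = 0;
    while (*s) h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

struct entry *st_lookup(const char *name) /* query / retrieve */
{
    struct entry *e = table[hash(name)];
    while (e && strcmp(e->name, name) != 0) e = e->next;
    return e;                             /* NULL if the symbol is absent */
}

struct entry *st_insert(const char *name, const char *type)
{
    struct entry *e = st_lookup(name);
    if (e) return e;                      /* already in the table */
    e = malloc(sizeof *e);
    e->name = strdup(name);
    e->type = strdup(type);
    e->next = table[hash(name)];
    table[hash(name)] = e;
    return e;
}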
POSITION OF CODE GENERATOR
Front end → intermediate code → Code Optimizer → intermediate code → Code Generator → target program
(The symbol table is consulted by all of these phases.)
ISSUES IN THE DESIGN OF A CODE GENERATOR
1. Input to the code generator
2. Target programs
3. Memory management
4. Instruction selection
5. Register allocation
6. Choice of evaluation order
7. Target machine
INPUT TO THE CODE GENERATOR
 Intermediate representation + symbol table information.
 Intermediate representation forms:
   1. Postfix notation
   2. Quadruples
   3. Stack machine code
   4. Syntax trees
   5. DAGs
 Assumptions made about the input:
   Type conversions have been done.
   Semantic errors have been identified.
   The input is free of errors.
TARGET PROGRAMS
 Output of the code generator; possible forms:
   1. Absolute machine code
   2. Relocatable machine code
   3. Assembly language
MEMORY MANAGEMENT
 Mapping names in the source program to memory addresses.
 Done using the symbol table entries.
 Declarations give the width (amount of storage needed for the name).
 Static and stack allocation techniques are used.
 In jumps, labels will be converted to addresses.
INSTRUCTION SELECTION
 Depends on the instruction set of the target machine.
 Instruction speeds and machine idioms matter.
 A code skeleton is used per intermediate statement, e.g. for x := y + z:
MOV y, R0   /* load y into R0  */
ADD z, R0   /* add z to R0     */
MOV R0, x   /* store R0 into x */
 Limitation: statement-by-statement translation produces poor code, e.g. for
a := b + c
d := a + e
(see the sequence shown after this list).
 Quality means both speed and size.
 Cost of instructions matters: e.g. with an INC (increment) instruction, the TAC a := a + 1 can be replaced with
INC a
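For the pair a := b + c; d := a + e, translating each statement independently with the skeleton above gives the sequence below (the standard textbook illustration); the fourth instruction is redundant, since R0 already holds the value of a:

MOV b, R0
ADD c, R0
MOV R0, a
MOV a, R0    /* redundant: R0 already contains a */
ADD e, R0
MOV R0, d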
REGISTER ALLOCATION
 Register instructions are shorter and faster than memory instructions.
 Use of registers has two subproblems:
   Register allocation: select the set of variables that will reside in registers.
   Register assignment: pick the specific register each such variable resides in.
 Some instructions require register pairs, e.g. multiplication:
M x, y
where x, the multiplicand, is the even register of an even/odd register pair, y is a single register, and the product occupies the entire even/odd register pair.
CHOICE OF EVALUATION ORDER
 The order in which computations are performed can affect the efficiency of the target code.
 Picking the best order is a difficult problem in general.
THE TARGET MACHINE
 Byte-addressable, 4 bytes per word.
 n general-purpose registers R0, R1, ..., Rn-1.
 Two-address instructions of the form: op source, destination.
 Opcodes: MOV, ADD, SUB, ...
Addressing modes, with their forms and added costs:

MODE                FORM    ADDRESS                     ADDED COST
Absolute            M       M                           1
Register            R       R                           0
Indexed             c(R)    c + contents(R)             1
Indirect register   *R      contents(R)                 0
Indirect indexed    *c(R)   contents(c + contents(R))   1

(A register reference adds 0 to the cost; a memory reference adds 1.)

Instruction costs: cost = 1 + the added costs of the source and destination modes. Running example: a := b + c.
Different code sequences for a := b + c:

1. MOV b, R0        cost = 6
   ADD c, R0
   MOV R0, a

2. MOV b, a         cost = 6
   ADD c, a

3. MOV *R1, *R0     cost = 2
   ADD *R2, *R0
   (assuming R0, R1, R2 contain the addresses of a, b, c)

4. ADD R2, R1       cost = 3
   MOV R1, a
   (assuming R1, R2 contain the values of b, c)
Statements for activation of procedures:
 call
 return
 halt
 action   // placeholder for any other kind of statement
STACK ALLOCATION
 MOV: saves the return address in the callee's activation record.
 GOTO: transfers control to the target code of the called procedure.
TARGET CODE FOR STACK ALLOCATION
REGISTER DESCRIPTOR:
Keeps track of what is currently in each register.
ADDRESS DESCRIPTOR:
Keeps track of the location where the current value of a name can be found at run time (a register, a stack location, or a memory address).
Example (tracking the register and address descriptors):
t := a - b
u := a - c
v := t + u
d := v + u
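A plausible code sequence for this block with two registers, tracking both descriptors statement by statement (a sketch following the usual textbook walk-through; the register choices are illustrative):

Statement    Code generated    Register descriptor    Address descriptor
t := a - b   MOV a, R0         R0 contains t          t in R0
             SUB b, R0
u := a - c   MOV a, R1         R0: t, R1: u           t in R0; u in R1
             SUB c, R1
v := t + u   ADD R1, R0        R0: v, R1: u           v in R0; u in R1
d := v + u   ADD R1, R0        R0: d                  d in R0
             MOV R0, d         R0: d                  d in R0 and in memory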
UNIT V
CODE OPTIMIZATION
 Principal sources of optimization
 Peephole optimization
 DAG
 Optimization of basic blocks
 Global data flow analysis
 Efficient data flow algorithms
90/10 rule: programs spend 90% of their execution time in 10% of the code.

 Improve the intermediate code:
   • to get faster-running machine code;
   • to reduce the running time.
 Code improvement or code optimization:
   • identify and eliminate unnecessary instructions;
   • replace a sequence of instructions by a faster sequence.
 Optimization can be done at all levels:
   user program/algorithm level;
   intermediate code level;
   target code level.
PERFORMANCE
 If the running time can be reduced from a few hours to a few minutes, that is an increase in performance.
 Machine-independent optimizations: transformations that improve the target code without taking into consideration any properties of the target machine.
 Machine-dependent optimizations: based on register allocation and utilization of special machine-instruction sequences.
 Transformations must preserve the meaning of the program.
 A transformation must not introduce errors into the program.
 A transformation must, on average, speed up programs by a measurable amount.
 A transformation must be worth the effort.
 Transformations can be local or global.
 Two types:
   Function-preserving transformations:
     Common subexpression elimination
     Copy propagation
     Dead-code elimination
     Constant folding
   Loop optimizations:
     Code motion
     Induction-variable elimination
     Reduction in strength
 Structure-preserving transformations:
   1. Common subexpression elimination
   2. Dead code elimination
   3. Renaming of temporary variables
   4. Interchange of two independent adjacent statements
 Algebraic transformations: simplifying expressions or replacing expensive operations by cheaper ones.
BASIC BLOCKS
 A basic block is a sequence of consecutive intermediate-language statements in which flow of control can only enter at the beginning and leave at the end.
 Only the last statement of a basic block can be a branch statement, and only the first statement of a basic block can be the target of a branch.
1. Identify leader statements (i.e. the first statements of basic blocks) by using the following rules:
   (i) The first statement in the program is a leader.
   (ii) Any statement that is the target of a branch statement is a leader (for most intermediate languages these are statements with an associated label).
   (iii) Any statement that immediately follows a branch or return statement is a leader.
2. How do we form the basic blocks associated with each leader? Each basic block consists of a leader and all statements up to, but not including, the next leader (or the end of the program).
The following code computes the inner product of two vectors.

begin
    prod := 0;
    i := 1;
    do begin
        prod := prod + a[i] * b[i];
        i := i + 1;
    end
    while i <= 20
end

Source code
The three-address code for this program, with the leader statements identified by the rules above, is:

Rule (i)    (1)  prod := 0
            (2)  i := 1
Rule (ii)   (3)  t1 := 4 * i
            (4)  t2 := a[t1]
            (5)  t3 := 4 * i
            (6)  t4 := b[t3]
            (7)  t5 := t2 * t4
            (8)  t6 := prod + t5
            (9)  prod := t6
            (10) t7 := i + 1
            (11) i := t7
            (12) if i <= 20 goto (3)
Rule (iii)  (13) ...

Three-address code
Basic Blocks:

B1: (1)  prod := 0
    (2)  i := 1

B2: (3)  t1 := 4 * i
    (4)  t2 := a[t1]
    (5)  t3 := 4 * i
    (6)  t4 := b[t3]
    (7)  t5 := t2 * t4
    (8)  t6 := prod + t5
    (9)  prod := t6
    (10) t7 := i + 1
    (11) i := t7
    (12) if i <= 20 goto (3)

B3: (13) ...
QUICKSORT ROUTINE

void quicksort(int m, int n)
{
    int i, j;
    int v, x;
    if (n <= m) return;
    i := m-1;
    j := n;
    v := a[n];
    while (1)
    {
        do i := i+1; while (a[i] < v);
        do j := j-1; while (a[j] > v);
        if (i >= j) break;
        // SWAP a[i], a[j]
        x := a[i];
        a[i] := a[j];
        a[j] := x;
    } // end of while
    // SWAP a[i], a[n]
    x := a[i];
    a[i] := a[n];
    a[n] := x;
    quicksort(m, i);
    quicksort(i+1, n);
} // end of quicksort()
QUICKSORT THREE ADDRESS CODE
(1)  i := m-1
(2)  j := n
(3)  t1 := 4*n
(4)  v := a[t1]
(5)  i := i+1
(6)  t2 := 4*i
(7)  t3 := a[t2]
(8)  if t3 < v goto (5)
(9)  j := j-1
(10) t4 := 4*j
(11) t5 := a[t4]
(12) if t5 > v goto (9)
(13) if i >= j goto (23)
(14) t6 := 4*i
(15) x := a[t6]
(16) t7 := 4*i
(17) t8 := 4*j
(18) t9 := a[t8]
(19) a[t7] := t9
(20) t10 := 4*j
(21) a[t10] := x
(22) goto (5)
(23) t11 := 4*i
(24) x := a[t11]
(25) t12 := 4*i
(26) t13 := 4*n
(27) t14 := a[t13]
(28) a[t12] := t14
(29) t15 := 4*n
(30) a[t15] := x
(31) the two recursive calls ...
IDENTIFYING BASIC BLOCKS
 Leaders: (1), (5), (9), (13), (14) and (23)
 Basic blocks:
BB1: (1)-(4)
BB2: (5)-(8)
BB3: (9)-(12)
BB4: (13)
BB5: (14)-(22)
BB6: (23)-(30)
Control Flow Graph

B1: i := m-1
    j := n
    t1 := 4*n
    v := a[t1]

B2: i := i+1
    t2 := 4*i
    t3 := a[t2]
    if t3 < v goto B2

B3: j := j-1
    t4 := 4*j
    t5 := a[t4]
    if t5 > v goto B3

B4: if i >= j goto B6

B5: t6 := 4*i
    x := a[t6]
    t7 := 4*i
    t8 := 4*j
    t9 := a[t8]
    a[t7] := t9
    t10 := 4*j
    a[t10] := x
    goto B2

B6: t11 := 4*i
    x := a[t11]
    t12 := 4*i
    t13 := 4*n
    t14 := a[t13]
    a[t12] := t14
    t15 := 4*n
    a[t15] := x
Function-preserving transformations:
 Common subexpression elimination
 Copy propagation
 Dead-code elimination
 Constant folding

Loop optimizations:
• Code motion
• Induction-variable elimination
• Reduction in strength
Local common subexpression elimination in B5 and B6:

B5 before:
t6 := 4*i
x := a[t6]
t7 := 4*i
t8 := 4*j
t9 := a[t8]
a[t7] := t9
t10 := 4*j
a[t10] := x
goto B2

B5 after local common subexpression elimination (t7 and t10 are redundant):
t6 := 4*i
x := a[t6]
t8 := 4*j
t9 := a[t8]
a[t6] := t9
a[t8] := x
goto B2

B6 before:
t11 := 4*i
x := a[t11]
t12 := 4*i
t13 := 4*n
t14 := a[t13]
a[t12] := t14
t15 := 4*n
a[t15] := x

B6 after local common subexpression elimination (t12 and t15 are redundant):
t11 := 4*i
x := a[t11]
t13 := 4*n
t14 := a[t13]
a[t11] := t14
a[t13] := x
After global common subexpression elimination (reusing t2, t3 from B2, t4, t5 from B3, and t1 from B1):

B5:
x := t3
a[t2] := t5
a[t4] := x
goto B2

B6:
x := t3
t14 := a[t1]
a[t2] := t14
a[t1] := x
Copy propagation: change all occurrences of x to t3.

B5:
x := t3
a[t2] := t5
a[t4] := t3
goto B2

B6:
x := t3
t14 := a[t1]
a[t2] := t14
a[t1] := t3
Copy propagation often turns the copy statement into dead code; x := t3 can now be eliminated.

B5:
a[t2] := t5
a[t4] := t3
goto B2

B6:
t14 := a[t1]
a[t2] := t14
a[t1] := t3
LOOP OPTIMIZATIONS

• Induction-variable elimination:
  In B2, every time i increases by 1, t2 increases by 4; i and t2 are in lock step.
  In B3, every time j decreases by 1, t4 decreases by 4; j and t4 are in lock step.
  Such variables are called induction variables, and the redundant ones can be eliminated.

• Code motion: move loop-invariant code to just before the loop entry, e.g.
  while (i <= limit-2) { ... }   becomes   t = limit-2; while (i <= t) { ... }

• Reduction in strength: replace * by + (and / by -), e.g. the loop computation
  t4 := 4*j   becomes   t4 := t4 - 4
  with t4 initialized to 4*j before the loop.
Optimized Flow Graph

B1: i := m-1
    j := n
    t1 := 4*n
    v := a[t1]
    t2 := 4*i
    t4 := 4*j

B2: t2 := t2+4
    t3 := a[t2]
    if t3 < v goto B2

B3: t4 := t4-4
    t5 := a[t4]
    if t5 > v goto B3

B4: if i >= j goto B6

B5: a[t2] := t5
    a[t4] := t3
    goto B2

B6: t14 := a[t1]
    a[t2] := t14
    a[t1] := t3
OPTIMIZATION OF BASIC BLOCKS
1. Common subexpression elimination
2. Dead code elimination
3. Renaming of temporary variables
4. Interchange of two independent adjacent statements
 These transformations can be implemented using a DAG.
DAG representation of a basic block:
1. Leaf nodes contain the initial values of variables.
2. For each statement there is an interior node, labelled by the statement's operator.

DAG for the following basic block:
a := b + c
b := a - d
c := b + c
d := a - d
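Working this out (a sketch; the node names n1-n3 are illustrative):

n1: +  with children b0, c0    attached identifiers: a
n2: -  with children n1, d0    attached identifiers: b, d
n3: +  with children n2, c0    attached identifiers: c

Since b := a - d and d := a - d have the same operator and the same children (n1 and d0), both attach to the single node n2: the common subexpression a - d is computed once, and the leaf c0 is shared by n1 and n3.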
 The DAG is used to identify and eliminate common subexpressions.
 When a new node 'm' is to be added, check whether there is an existing node 'n' with the same children, in the same order, and with the same operator.
 If so, 'n' computes the same value as 'm', and 'n' can be used in place of 'm'.
Common subexpression elimination:

Before:         After:
a := b + c      a := b + c
b := a - d      b := a - d
c := b + c      c := b + c
d := a - d      d := b
Dead code elimination:
If x is never referenced after the statement x := y + z, the statement can be safely eliminated.

Example block:
a := b + c
b := b - d
c := c + d
e := b + c

• Any root node of the DAG (a node with no ancestors) whose attached variables are not live can be removed as dead code.
• Repeated removal of such roots leads to reduced/optimized code.
Interchange of statements (two independent adjacent statements may be swapped):

t1 := b + c        t2 := x + y
t2 := x + y        t1 := b + c

Renaming temporary variables:
If there is a statement t := b + c, we can change it to u := b + c and change all uses of t to u.
Arithmetic identities:
x + 0 = 0 + x = x
x * 1 = 1 * x = x
x / 1 = x
x * y = y * x   (commutativity)

Constant folding:
Evaluate constant expressions at compile time and replace them with their values, e.g. 2 * 3.14 = 6.28.

Reduction in strength:
Expensive operators can be replaced by cheaper ones:
x ** 2  =  x * x
2.0 * x =  x + x
x / 2   =  x * 0.5
Algebraic transformations:
x := x + 0 and x := x * 1 can be eliminated.
x := y**2 can be rewritten as x := y*y, and z := 2*x as z := x + x.
Changes such as y**2 into y*y and 2*x into x+x are also known as strength reduction.
Dominators:
 In a flow graph, a node d dominates a node n if every path from the initial node of the flow graph to n goes through d. This is denoted d dom n.
 The initial node dominates all the remaining nodes in the flow graph, and the entry of a loop dominates all nodes in the loop.
 Every node dominates itself.
 In the dominator tree, the initial node is the root, and each node dominates only its descendants in the tree.
 Each node n has a unique immediate dominator m, which is the last dominator of n on any path from the initial node to n.
 Natural loop: a single-entry loop; there must be at least one way to iterate the loop, i.e. at least one path back to its entry.
A DOMINATOR TREE (EXAMPLE)

(Figure: flow graph with nodes 1-10.)

List of dominators:
1. Dominates all
2. Dominates itself
3. Dominates all but 1, 2
4. Dominates all but 1, 2, 3
5. Dominates itself
6. Dominates itself
7. Dominates 7, 8, 9, 10
8. Dominates 8, 9, 10
9. Dominates itself
10. Dominates itself

Tree formation:
• The initial node is the root.
• Each node dominates only its descendants.
• Each node has a unique immediate dominator.
• Each node dominates itself.
GLOBAL DATA FLOW ANALYSIS
 The process of collecting information that is useful for the purpose of optimization.
 The basic equation says: "Information at the end of a statement is either generated within the statement, or enters at the beginning and is not killed as control flows through the statement."

1. Definition and use:
Sk: V1 := V2 + V3
Sk is a definition of V1.
Sk is a use of V2 and V3.
2. Points and paths:
 A point exists between every two successive statements, as well as before the first statement and after the last statement of a basic block.

B1: d1: x := y - m
    d2: y := m
B2: d3: x := x + 1

d1, d2, d3 are definitions.
 If p0, p1, ..., pn are points such that control can flow from each pi to pi+1, the sequence forms a path (e.g. a path from p1 to p4).
Set up dataflow equations for each basic block. For reaching definitions the equation is:

out[S] = gen[S] ∪ (in[S] - kill[S])

where
out[S]          = definitions that reach the end of statement S,
gen[S]          = definitions generated by S,
in[S] - kill[S] = input definitions of S that are not killed in S.

Note: the dataflow equations depend on the problem statement.
Data-flow analysis of structured programs:

S ::= id := E
    | S ; S
    | if E then S else S
    | do S while E
E ::= id + id
    | id

This restricted syntax results in the flowgraph forms depicted below.
(Figure: the three composite flowgraph forms —
 S1 ; S2: S1 followed by S2;
 if E then S1 else S2: a test branching to S1 or S2, which then rejoin;
 do S1 while E: S1 followed by "if E goto S1".)
For S of the form d: a := b + c:
gen[S]  = {d}
kill[S] = Da - {d}        (Da = all definitions of a in the program)

For S = S1 ; S2:
gen[S]  = gen[S2] ∪ (gen[S1] - kill[S2])
kill[S] = kill[S2] ∪ (kill[S1] - gen[S2])

For S = if E then S1 else S2:
gen[S]  = gen[S1] ∪ gen[S2]
kill[S] = kill[S1] ∩ kill[S2]

For S = do S1 while E:
gen[S]  = gen[S1]
kill[S] = kill[S1]
For S of the form d: a := b + c:
out[S] = gen[S] ∪ (in[S] - kill[S])

For S = S1 ; S2:
in[S1] = in[S]
in[S2] = out[S1]
out[S] = out[S2]

For S = if E then S1 else S2:
in[S1] = in[S]
in[S2] = in[S]
out[S] = out[S1] ∪ out[S2]

For S = do S1 while E:
in[S1] = in[S] ∪ out[S1]
out[S] = out[S1]
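As a small worked example on the earlier two-block fragment (d1: x := y - m and d2: y := m in B1; d3: x := x + 1 in B2):

gen[B1] = {d1, d2}    kill[B1] = {d3}    (d3 is the other definition of x)
gen[B2] = {d3}        kill[B2] = {d1}    (d1 is the other definition of x)

in[B1] = {}                    out[B1] = {d1, d2} ∪ ({} - {d3}) = {d1, d2}
in[B2] = out[B1] = {d1, d2}    out[B2] = {d3} ∪ ({d1, d2} - {d1}) = {d2, d3}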
PEEPHOLE OPTIMIZATION

Goals:
- improve the quality of the target code
- reduce code size

Method:
1. Examine short sequences of target instructions.
2. Replace the sequence by a more efficient one.

Techniques:
• Redundant-instruction elimination
• Flow-of-control optimizations
• Algebraic simplifications
• Use of machine idioms
• Unreachable code elimination
• Reduction in strength
Redundant-instruction elimination:
(1) LOAD R0, a
(2) STORE a, R0
If there are no changes to a between (1) and (2), the store in (2) is redundant and can be deleted, leaving just (1) LOAD R0, a.
Flow-of-control optimizations (eliminating jumps over jumps):

goto L1                    goto L2
...              becomes   ...
L1: goto L2                L1: goto L2

if a < b goto L1           if a < b goto L2
...              becomes   ...
L1: goto L2                L1: goto L2

goto L1                    if a < b goto L2
...              becomes   goto L3
L1: if a < b goto L2       ...
L3:                        L3:
Algebraic simplifications: remove statements like
x := x + 0
x := x * 1
Use of machine idioms: statements like
a := a + 1
a := a - 1
can be implemented with auto-increment and auto-decrement addressing modes.
Unreachable code elimination:

Before:
debug = 0
...
if debug ≠ 1 goto L2
print debugging information
L2:

After:
debug = 0
...
(since debug is the constant 0, the test always jumps to L2; the test and the print can both be removed)
Reduction in strength: the exponentiation operator requires a function call;
x := y**2
can be replaced by
x := y * y
A control flow graph (CFG), or simply a flow graph, is a directed multigraph in which:
(i) the nodes are basic blocks; and
(ii) the edges represent flow of control (branches or fall-through execution).

The basic block whose leader is the first intermediate-language statement is called the start node.

In a CFG we have no information about the data; therefore an edge in the CFG means only that the program may take that path.

There is a directed edge from basic block B1 to basic block B2 in the CFG if:
(1) there is a branch from the last statement of B1 to the first statement of B2, or
(2) control flow can fall through from B1 to B2 because:
    (i) B2 immediately follows B1, and
    (ii) B1 does not end with an unconditional branch.
EXAMPLE: CONTROL FLOW GRAPH FORMATION

B1: (1)  prod := 0
    (2)  i := 1
B2: (3)  t1 := 4 * i
    (4)  t2 := a[t1]
    (5)  t3 := 4 * i
    (6)  t4 := b[t3]
    (7)  t5 := t2 * t4
    (8)  t6 := prod + t5
    (9)  prod := t6
    (10) t7 := i + 1
    (11) i := t7
    (12) if i <= 20 goto (3)
B3: (13) ...

Edges:
B1 → B2 by rule (2): B2 immediately follows B1, and B1 does not end in a branch.
B2 → B2 by rule (1): the branch in (12) targets statement (3), the leader of B2.
B2 → B3 by rule (2): B3 immediately follows B2, and control can fall through.
Question: Given the control flow graph of a procedure, how can we identify loops?
Answer: We use the concept of dominance.
 A node a in a CFG dominates a node b if every path from the start node to node b goes through a. We say that node a is a dominator of node b.
 The dominator set of node b, dom(b), is formed by all nodes that dominate b.
 Note: by definition, each node dominates itself; therefore, b ∈ dom(b).
Definition: Let G = (N, E, s) denote a flowgraph, where:
N: set of vertices
E: set of edges
s: starting node
and let a ∈ N, b ∈ N.

1. a dominates b, written a dom b, if every path from s to b contains a.
2. a properly dominates b, written a < b, if a dom b and a ≠ b.
(Figure: flow graph with start node 1 and nodes 2-10.)

Domination relation:
{ (1,1), (1,2), (1,3), (1,4), ...,
  (2,3), (2,4), ..., (2,10),
  ... }

Direct domination:
1 <d 2, 2 <d 3, ...

Dominator sets:
DOM(1) = {1}
DOM(2) = {1, 2}
DOM(3) = {1, 2, 3}
DOM(10) = {1, 2, 10}
Motivation: programs spend most of their execution time in loops, so there is a larger payoff for optimizations that exploit loop structure.

A dominator tree is a useful way to represent the dominance relation.

In a dominator tree, the start node s is the root, and each node d dominates only its descendants in the tree.
DOMINATOR TREE

(Figure: dominator tree for the flow graph above, rooted at node 1.)

List of dominators:
1. Dominates all
2. Dominates itself
3. Dominates all but 1, 2
4. Dominates all but 1, 2, 3
5. Dominates itself
6. Dominates itself
7. Dominates 7, 8, 9, 10
8. Dominates 8, 9, 10
9. Dominates itself
10. Dominates itself
EFFICIENT DATA FLOW ALGORITHMS
 Depth-first ordering in iterative algorithms
 Structure-based data flow analysis
 Speed-up of structure-based algorithms
 Handling non-reducible flow graphs

APPLICATIONS
Generating optimized code for multi-core processors through parallelism
WHAT’S AHEAD OF OPTIMIZATION
 Prescriptive analytics : Finding the time to
repair.
NPTEL REFERENCES
1. https://onlinecourses.nptel.ac.in/noc21_cs07/preview
2. https://nptel.ac.in/courses/106108113
3. https://archive.nptel.ac.in/courses/106/105/106105190/