Unit 4
o Compiler Basics
o Lexical Analysis
o Syntax Analysis
o Semantic Analysis
o Runtime Environments
o Code Generation
o Code Optimization
Phases of Compilation
3
Beyond syntax analysis
• The parser cannot catch all program errors
• Some language features cannot be modeled using the context-free grammar formalism
4
Beyond syntax analysis
• Example 1
  string x; int y;
  y = x + 3
  The use of x is a type error.
• Example 2
  int a, b;
  a = b + c
  c is not declared.
• An identifier may be usable in one part of the program but not another
5
What does the compiler need to know?
• Whether a variable has been declared
• Whether an array use like A[i,j,k] is consistent with its declaration: does it have three dimensions?
6
What does the compiler need to know?
• How many arguments does a function take?
• Inheritance relationships
• Answers to these questions depend on values such as type information, the number of parameters, etc.
8
How to answer these questions?
• Use formal methods
  – Context-sensitive grammars
  – Extended attribute grammars
9
Why attributes?
• For lexical analysis and syntax analysis, formal techniques were used.
• However, we still had code in the form of actions along with the regular expressions and the context-free grammar.
10
Attribute Grammar Framework
• Generalization of CFG where each grammar symbol has an associated set of attributes
• Translation schemes
• indicate order in which semantic rules are to be evaluated
• allow some implementation details to be shown
11
Attribute Grammars
• Context-Free Grammars (CFGs) are used to specify the syntax of programming languages
  • E.g. arithmetic expressions
• How do we tie these rules to mathematical concepts?
• Attribute grammars are annotated CFGs in which annotations are used to establish meaning relationships among symbols
  – Annotations are also known as decorations
12
Attribute Grammars: Example
• Each grammar symbol has a set of attributes
  • E.g. the value of E1 is the attribute E1.val
• Each grammar rule has a set of semantic rules over the attributes of its symbols
  • Copy rules
  • Semantic function rules
    • E.g. sum, quotient
Attribute Flow : Example
❖ Context-free grammars are not tied to a specific parsing order
  ❖ E.g. recursive descent, LR parsing
❖ Attribute grammars are likewise not tied to a specific evaluation order
❖ This evaluation is known as the annotation or decoration of the parse tree
• The figure shows the result of annotating the parse tree for (1+3)*2
• Each symbol has at most one attribute, shown in the corresponding box
  • a numerical value in this example
  • operator symbols have no value
• Arrows represent attribute flow
14
Attribute Flow : Example
(1+3)*2
15
Example
• Consider a grammar for signed binary numbers
• Build attribute grammar that annotates Number with the value it represents
symbol    attributes
Number    value
sign      negative
list      position, value
bit       position, value
16
Example
production              attribute rules
number → sign list      list.position = 0
                        if sign.negative then number.value = -list.value
                        else number.value = list.value
sign → +                sign.negative = false
sign → -                sign.negative = true
list → bit              bit.position = list.position
                        list.value = bit.value
list0 → list1 bit       list1.position = list0.position + 1
                        bit.position = list0.position
                        list0.value = list1.value + bit.value
bit → 0                 bit.value = 0
bit → 1                 bit.value = 2^bit.position
• Input: - 1 0 1
18
Parse tree and the dependence graph
• Annotated parse tree for - 1 0 1: the bits at positions 2, 1 and 0 have values 4, 0 and 1; the three list nodes (positions 2, 1, 0) synthesize values 4, 4 and 5; sign.negative = true, so number.value = -5
19
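The rules above can be checked with a small, hand-written evaluator. The following is a minimal sketch (not taken from the slides) in which the inherited attribute position is passed down as a parameter and the synthesized attribute value is returned:

/* Sketch: evaluating the signed-binary attribute grammar for an input
 * such as "-101". The inherited attribute "position" becomes a parameter;
 * the synthesized attribute "value" becomes a return value. */
#include <stdio.h>
#include <string.h>

/* bit -> 0 | 1 : bit.value = 0  or  2^bit.position */
static int bit_value(char c, int position) {
    return (c == '1') ? (1 << position) : 0;
}

/* list -> list bit | bit : the i-th character from the right has position i;
 * list.value is the sum of its bit values. */
static int list_value(const char *digits) {
    int n = (int)strlen(digits), value = 0;
    for (int i = 0; i < n; i++)
        value += bit_value(digits[i], n - 1 - i);
    return value;
}

/* number -> sign list : negate if sign.negative */
static int number_value(const char *s) {
    int negative = (s[0] == '-');
    const char *digits = (s[0] == '-' || s[0] == '+') ? s + 1 : s;
    int v = list_value(digits);
    return negative ? -v : v;
}

int main(void) {
    printf("%d\n", number_value("-101"));   /* prints -5 */
    return 0;
}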
Attributes
20
Attributes
• Each grammar production A → α has associated with it a set of semantic rules of the form
  b = f(c1, c2, …, ck)
  where f is a function and either
  • b is a synthesized attribute of A
  OR
  • b is an inherited attribute of one of the grammar symbols on the right
21
Syntax Directed Definitions
22
Synthesized Attributes
23
Example of S-attributed SDD
• terminals are assumed to have only synthesized attributes, whose values are supplied by the lexical analyzer
• the start symbol does not have any inherited attributes
24
Example: Parse tree for 3 * 4 + 5 n
• Annotated parse tree: each digit token supplies its value (3, 4, 5); the F and T nodes copy it upward; the subtree for 3 * 4 synthesizes Val = 12; the root expression synthesizes Val = 12 + 5 = 17; L prints 17
25
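A minimal sketch of this bottom-up (synthesized) evaluation, assuming the usual expression grammar L → E n, E → E + T | T, T → T * F | F, F → digit; each parsing function returns its node's val attribute (illustrative only, not the slides' implementation):

/* Sketch: recursive-descent evaluator that synthesizes "val" bottom-up. */
#include <stdio.h>

static const char *p;                 /* input cursor */

static int F(void) {                  /* F -> digit : F.val = digit.lexval */
    return *p++ - '0';
}

static int T(void) {                  /* T -> T * F | F : T.val = T1.val * F.val */
    int val = F();
    while (*p == '*') { p++; val *= F(); }
    return val;
}

static int E(void) {                  /* E -> E + T | T : E.val = E1.val + T.val */
    int val = T();
    while (*p == '+') { p++; val += T(); }
    return val;
}

int main(void) {
    p = "3*4+5";
    printf("%d\n", E());              /* L -> E n : print E.val, here 17 */
    return 0;
}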
Inherited Attributes
• an inherited attribute is one whose value is defined in terms of attributes of the parent and/or siblings of the node
• it is always possible to use only S-attributes, but it is often more natural to use inherited attributes
• e.g. L → id    addtype(id.entry, L.in)
26
Example: Parse tree for real x, y, z
production       semantic rules
D → T L          L.in = T.type
T → real         T.type = real
T → int          T.type = int
L → L1 , id      L1.in = L.in; addtype(id.entry, L.in)
L → id           addtype(id.entry, L.in)
• Annotated parse tree: T.type = real is copied into the inherited attribute in of each L node, and addtype(x, real), addtype(y, real) and addtype(z, real) are called at the identifier leaves
27
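A minimal sketch (illustrative, not the slides' code) of how the inherited attribute L.in flows: the type synthesized from T is passed down the identifier list, and each identifier is entered into the symbol table via addtype, here an assumed stand-in for the slides' symbol-table routine:

/* Sketch: propagating the inherited attribute L.in while processing
 *   D -> T L ;  L -> L , id | id ;  T -> real | int   (e.g. real x, y, z). */
#include <stdio.h>

typedef enum { TYPE_INT, TYPE_REAL } Type;

static void addtype(const char *id, Type t) {          /* hypothetical helper */
    printf("addtype(%s, %s)\n", id, t == TYPE_REAL ? "real" : "int");
}

/* L -> L , id | id : every id on the list receives the inherited L.in */
static void L(const char **ids, int n, Type in) {
    for (int i = 0; i < n; i++)
        addtype(ids[i], in);
}

/* D -> T L : L.in = T.type */
static void D(Type t_type, const char **ids, int n) {
    L(ids, n, t_type);
}

int main(void) {
    const char *ids[] = { "x", "y", "z" };
    D(TYPE_REAL, ids, 3);         /* models the declaration: real x, y, z */
    return 0;
}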
Dependence Graph
• The dependencies among the nodes can be depicted by a directed graph called a dependency graph
28
Algorithm to construct dependency graph
• For each node n in the parse tree and each attribute a of the grammar symbol at n, construct a node in the dependency graph for a
• For each semantic rule b = f(c1, …, ck) associated with the production used at n, add an edge from the node for each ci to the node for b
29
Example: Dependence Graph
• If production A → X Y has the semantic rule X.x = g(A.a, Y.y), then the dependency graph has edges to X.x from A.a and from Y.y
30
Example: Dependence Graph
• Whenever the production E → E1 + E2 with the semantic rule E.val = E1.val + E2.val is used in a parse tree, we create a dependency graph with edges from E1.val and E2.val to E.val
31
Example: Dependence Graph
• Dependency graph for the declaration real x, y, z: T.type flows into L.in at each level of the list, and each addtype node for x, y and z depends on the corresponding id.entry and the L.in at that node
32
Evaluation Order
• Any topological sort of the dependency graph gives a valid order in which the semantic rules must be evaluated
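As an illustration (not from the slides), a small sketch that derives an evaluation order by topological sort (Kahn's algorithm) over a hand-built dependency graph for E.val = E1.val + E2.val:

/* Sketch: topological-sort evaluation order over a tiny dependency graph:
 * node 0 = E1.val, node 1 = E2.val, node 2 = E.val, edges 0 -> 2 and 1 -> 2. */
#include <stdio.h>

#define N 3

int main(void) {
    const char *name[N] = { "E1.val", "E2.val", "E.val" };
    int edge[N][N] = {0};
    int indeg[N] = {0};

    edge[0][2] = edge[1][2] = 1;          /* E.val depends on E1.val and E2.val */
    for (int u = 0; u < N; u++)
        for (int v = 0; v < N; v++)
            if (edge[u][v]) indeg[v]++;

    int queue[N], head = 0, tail = 0;
    for (int v = 0; v < N; v++)
        if (indeg[v] == 0) queue[tail++] = v;

    /* A node is emitted only after everything it depends on, so its
     * semantic rule can safely be evaluated at that point. */
    while (head < tail) {
        int u = queue[head++];
        printf("evaluate %s\n", name[u]);
        for (int v = 0; v < N; v++)
            if (edge[u][v] && --indeg[v] == 0) queue[tail++] = v;
    }
    return 0;
}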
Expression Tree
• Condensed form of the parse tree, useful for representing language constructs
• The production S → if B then S1 else S2 may appear as a node labeled if-then-else with the three children B, S1 and S2
33
Expression tree
• Chains of single productions may be collapsed, and operators move up to the parent nodes
• E.g. the parse tree for id1 * id2 + id3 collapses to an expression tree with + at the root: its left child is a * node over id1 and id2, and its right child is id3
34
Constructing expression tree
• Each node can be represented as a record
35
Example
P1 = mkleaf(id, entry of a)
P2 = mkleaf(num, 4)
P3 = mknode('-', P1, P2)
P4 = mkleaf(id, entry of c)
P5 = mknode('+', P3, P4)
• The resulting tree for a - 4 + c: P5 is a + node whose children are P3 (a - node over the leaves for a and 4) and P4 (the leaf for c)
36
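A minimal sketch in C of the record representation: mknode and mkleaf follow the slides' names, while the struct layout and the split into mkleaf_id/mkleaf_num are assumptions made for illustration. It builds the tree for a - 4 + c exactly as P1..P5 above:

/* Sketch: syntax-tree nodes as records, built bottom-up for a - 4 + c. */
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    char op;                  /* '+', '-', or: 'i' for id leaf, 'n' for num leaf */
    struct Node *left, *right;
    const char *entry;        /* symbol-table entry (id leaves) */
    int value;                /* numeric value (num leaves) */
} Node;

static Node *mknode(char op, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);
    n->op = op; n->left = left; n->right = right;
    n->entry = NULL; n->value = 0;
    return n;
}

static Node *mkleaf_id(const char *entry) {
    Node *n = mknode('i', NULL, NULL);
    n->entry = entry;
    return n;
}

static Node *mkleaf_num(int value) {
    Node *n = mknode('n', NULL, NULL);
    n->value = value;
    return n;
}

int main(void) {
    Node *p1 = mkleaf_id("a");        /* P1 = mkleaf(id, entry of a) */
    Node *p2 = mkleaf_num(4);         /* P2 = mkleaf(num, 4)         */
    Node *p3 = mknode('-', p1, p2);   /* P3 = mknode(-, P1, P2)      */
    Node *p4 = mkleaf_id("c");        /* P4 = mkleaf(id, entry of c) */
    Node *p5 = mknode('+', p3, p4);   /* P5 = mknode(+, P3, P4)      */
    printf("root op: %c\n", p5->op);  /* prints: root op: + */
    return 0;
}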
A syntax directed definition for constructing Expression tree
37
DAG for Expressions
39
Bottom-up evaluation of S-attributed definitions
• Suppose semantic rule A.a = f(X.x, Y.y, Z.z) is
associated with production A → XYZ
40
Type system
• A type is a set of values
• The aim of type checking is to ensure that operations are applied to variables/expressions of the correct types
41
Type system
• Languages can be divided into three categories with respect to the
type:
– “untyped”
• No type checking needs to be done
– Statically typed
• All type checking is done at compile time
• Also called strongly typed
– Dynamically typed
• Type checking is done at run time
• Mostly functional languages like Lisp, Scheme etc.
42
Type system
• Static typing
– Catches most common programming errors at
compile time
– Avoids runtime overhead
– May be restrictive in some situations
– Rapid prototyping may be difficult
• In fact, most people insist that code be strongly type checked at compile time even if the language is not strongly typed (use of Lint for C code, code compliance checkers)
43
Type expression
• The type of a language construct is denoted by a type expression
• Basic types are type expressions:
  – primitive data types defined in the language, like int, float, …
  – e.g. integer, char, float, boolean
  – sub-range types, e.g. 1 … 100, and enumerated types, e.g. (violet, indigo, red)
  – type_error: error during type checking
  – void: no type value
• Derived (constructed) types are type expressions: array, product, record, pointer, function
• Type rules are stated over type expressions, e.g.:
  – if both operands of the arithmetic operators +, -, x are integers, then the result is of type integer
  – the result of the unary & operator is a pointer to the object referred to by the operand: if the type of the operand is X, the type of the result is pointer(X)
44
Type expressions for Array and Product
• If T is a type expression then array(I, T) is a type expression denoting the
type of an array with elements of type T and index set I
– C lang:
• Declaration : int A[10]
• Type expression: array([0..9],int)
– Pascal :
• Declaration : var A:array [-10 .. 10] of integer
• Type Expression: array([-10..10],int)
46
Type expression of Pointer
• If T is a type expression, then pointer(T) is a type expression denoting the type pointer to an object of type T
• Example:
  declaration              type expression
  int *a                   pointer(int)
  struct row *a            pointer(struct row)
47
Type expression of function
• A function maps a domain set to a range set. It is denoted by the type expression D → R
  – int *f(char a, char b) has the type expression char x char → pointer(integer)
48
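Such type expressions can be represented as a tagged tree. The sketch below is an assumed representation (not from the slides) with constructors for basic, array, pointer and function types:

/* Sketch: type expressions as tagged records, e.g.
 *   array([0..9], int), pointer(int), char x char -> pointer(int). */
#include <stdio.h>
#include <stdlib.h>

typedef enum { T_INT, T_CHAR, T_ARRAY, T_POINTER, T_FUNCTION, T_ERROR } Tag;

typedef struct Type {
    Tag tag;
    int low, high;                 /* index range for T_ARRAY             */
    struct Type *elem;             /* element / referenced / result type  */
    struct Type *dom[2];           /* (at most two) domain types          */
    int ndom;
} Type;

static Type *basic(Tag t) { Type *x = calloc(1, sizeof *x); x->tag = t; return x; }

static Type *array_of(int low, int high, Type *elem) {
    Type *x = basic(T_ARRAY); x->low = low; x->high = high; x->elem = elem; return x;
}

static Type *pointer_to(Type *t) { Type *x = basic(T_POINTER); x->elem = t; return x; }

static Type *function2(Type *d1, Type *d2, Type *r) {
    Type *x = basic(T_FUNCTION);
    x->dom[0] = d1; x->dom[1] = d2; x->ndom = 2; x->elem = r;
    return x;
}

int main(void) {
    Type *a = array_of(0, 9, basic(T_INT));                 /* int A[10]          */
    Type *p = pointer_to(basic(T_INT));                     /* int *a             */
    Type *f = function2(basic(T_CHAR), basic(T_CHAR),       /* int *f(char, char) */
                        pointer_to(basic(T_INT)));
    printf("%d %d %d\n", a->tag, p->tag, f->tag);
    return 0;
}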
Specifications of a type checker
Consider a language which consists of a sequence of declarations
followed by a single expression
P → D ; E
D → D ; D | T id | T id [ num ] | T * id
T → char | int
E → id | num | E mod E | E [ E ] | * E
• A program generated by this grammar:
  int a;
  key mod 1999
• Assume the following:
  – basic types are char, int, type-error
  – all arrays start at 1
  – array[256] of char has the type expression array(1..256, char)
49
Rules for Symbol Table entry
D → id : T                  addtype(id.entry, T.type)
T → char                    T.type = char
T → integer                 T.type = int
T → *T1                     T.type = pointer(T1.type)
T → array [ num ] of T1     T.type = array(1..num, T1.type)
50
Type checking for expressions
E → num           E.type = int
E → id            E.type = lookup(id.entry)
E → E1 mod E2     E.type = if E1.type == int and E2.type == int
                           then int
                           else type_error
E → E1 [ E2 ]     E.type = if E2.type == int and E1.type == array(s, t)
                           then t
                           else type_error
E → *E1           E.type = if E1.type == pointer(t)
                           then t
                           else type_error
51
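An illustrative sketch of these checking rules; the Type representation and helper names are assumptions, but each function mirrors one rule above:

/* Sketch: type checking for  E -> E1 mod E2 | E1[E2] | *E1. */
#include <stdio.h>

typedef enum { T_INT, T_CHAR, T_ARRAY, T_POINTER, T_ERROR } Tag;

typedef struct Type { Tag tag; const struct Type *elem; } Type;

static const Type t_int   = { T_INT,   NULL };
static const Type t_error = { T_ERROR, NULL };

/* E -> E1 mod E2 : int if both operands are int, else type_error */
static const Type *check_mod(const Type *e1, const Type *e2) {
    return (e1->tag == T_INT && e2->tag == T_INT) ? &t_int : &t_error;
}

/* E -> E1[E2] : element type t if E1 is array(s, t) and E2 is int */
static const Type *check_index(const Type *e1, const Type *e2) {
    return (e2->tag == T_INT && e1->tag == T_ARRAY) ? e1->elem : &t_error;
}

/* E -> *E1 : t if E1 is pointer(t) */
static const Type *check_deref(const Type *e1) {
    return (e1->tag == T_POINTER) ? e1->elem : &t_error;
}

int main(void) {
    Type arr = { T_ARRAY,   &t_int };   /* array(1..256, int) */
    Type ptr = { T_POINTER, &t_int };   /* pointer(int)       */
    printf("%d %d %d\n",
           check_mod(&t_int, &t_int)->tag,   /* 0 = int */
           check_index(&arr, &t_int)->tag,   /* 0 = int */
           check_deref(&ptr)->tag);          /* 0 = int */
    return 0;
}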
Type checking for statements
• Statements typically do not have values; the special basic type void can be assigned to them
  – e.g. S → id := E    S.type = if id.type == E.type then void else type_error
• If the operands of an operator have different types, the compiler has to convert both operands to the same type
54
Implicit type conversion for expressions
E → num E.type = int
E → num.num E.type = real
E → id E.type = lookup( id.entry )
55
Type resolution
• Try all possible types of each overloaded function (a possible but brute-force method!)
56
Phases of Compilation
57
INTERMEDIATE CODE GENERATION
◻ Goal: Generate a Machine Independent Intermediate Form that is Suitable for Optimization and Portability.
◻ Facilitates retargeting: enables attaching a back end for the new machine to an existing front end.
◻ Enables machine-independent code optimization.
58
Motivation
◻ What we have so far...
  A parse tree
  ■ With all the program information
  ■ Known to be correct
  ■ Well-typed
  ■ Nothing missing
  ■ No ambiguities
◻ What we need...
  Something "executable"
  ■ Closer to an operations schedule
  ■ Closer to the actual machine level
◻ What we want: a representation that
  ■ Is closer to the actual machine
  ■ Is easy to manipulate
  ■ Is target neutral (hardware independent)
  ■ Can be interpreted
59
Recall ASTs and DAGs
• AST and DAG for a = b * - c + b * - c: the AST contains two copies of the subtree b * (uminus c), while the DAG shares a single copy of it
60
Representations
● Syntax Trees
  – Maintain the structure of the construct
  – Suitable for high-level representations
● Three-Address Code
  – Maximum of three addresses in an instruction
  – Suitable for both high- and low-level representations
● Two-Address Code
● …
  – e.g. C
● Example: 3 * 5 + 4
  Syntax tree:  + with children * (over 3 and 5) and 4
  3AC:   t1 = 3 * 5
         t2 = t1 + 4
  2AC:   mult 3, 5
         add 4
  1AC (or stack machine):
         push 3
         push 5
         mult
         add 4
61
We can represent three-address code in the following ways:
– Quadruples
– Triples
– Indirect Triples
62
Representation
◻ Two Different Forms:
  – Linked data structure
  – Multi-dimensional array
• The example tree (for a = b * - c + b * - c) is an assign node with children a and a + node over the two b * (uminus c) subtrees
66
Abstract Syntax Trees (ASTs) and DAG
◻ Directly Generate Code From AST or DAG as a Side Effect of Parsing
Process.
◻ Consider Code Below:
68
Types of Three Address Statements
◻ Indexed Assignments of Form:
X := Y[i] (Set X to i-th memory location of Y)
X[i] := Y (Set i-th memory location of X to Y)
Note the limit of 3 Addresses (X, Y, i)
Cannot do: x[i] := y[j]; (4 addresses!)
◻ Address and Pointer Assignments of Form:
X := & Y (X set to the Address of Y)
X := * Y (X set to the contents pointed to by Y)
* X := Y (Contents of X set to Value of Y)
69
Quadruples
◻ In the quadruple representation, there are four fields for each instruction: op, arg1,
arg2, result
Binary ops have the obvious representation
Unary ops don’t use arg2
Operators like param don’t use either arg2 or result
Jumps put the target label into result
◻ The quadruples implement the three-address code in (a) for the expression a = b * - c + b * - c
70
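A minimal sketch (the field names follow the slide; the rest is assumed for illustration) that tabulates the quadruples for a = b * - c + b * - c:

/* Sketch: quadruples (op, arg1, arg2, result) for  a = b * - c + b * - c. */
#include <stdio.h>

typedef struct { const char *op, *arg1, *arg2, *result; } Quad;

int main(void) {
    /* Unary minus leaves arg2 empty; the final copy uses op "=". */
    Quad code[] = {
        { "minus", "c",  "",   "t1" },
        { "*",     "b",  "t1", "t2" },
        { "minus", "c",  "",   "t3" },
        { "*",     "b",  "t3", "t4" },
        { "+",     "t2", "t4", "t5" },
        { "=",     "t5", "",   "a"  },
    };
    printf("%-6s %-5s %-5s %s\n", "op", "arg1", "arg2", "result");
    for (int i = 0; i < (int)(sizeof code / sizeof code[0]); i++)
        printf("%-6s %-5s %-5s %s\n",
               code[i].op, code[i].arg1, code[i].arg2, code[i].result);
    return 0;
}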
Three-address code vs. Quadruples
◻ The quadruples implement the three- address code in (a) for the expression a = b * - c + b * - c
71
Syntax tree vs. Triples
Representations of a = b * - c + b * - c
72
Indirect Triples
◻ These consist of a listing of pointers to triples, rather than a listing of
the triples themselves.
◻ An optimizing compiler can move an instruction by reordering the instruction list, without affecting the triples
themselves.
73
Control Flow
Conditionals
– if, if-else, switch
Loops
– for, while, do-while, repeat-until
We need to worry about
– Boolean expressions
– Jumps (and labels)
74
Code Generation for Boolean Expressions
◻ Implicit representation
For boolean expressions used in flow-of-control statements (such as if-statements, while-statements, etc.), the expression does not have to compute a value explicitly; it just needs to branch to the right instruction.
Generate code for boolean expressions that branches to the appropriate instruction based on the result of the expression.
75
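As an illustration (helper names are assumed), a tiny sketch of the implicit representation: a relational test is compiled into a conditional branch to a true label plus a jump to a false label, so no boolean value is ever materialized:

/* Sketch: jump code for a boolean test used in  if (a < b) S1 else S2. */
#include <stdio.h>

static int label_count = 0;
static int newlabel(void) { return ++label_count; }

/* Emit the implicit representation of "x < y": branch, don't compute 0/1. */
static void gen_relop(const char *x, const char *y, int Ltrue, int Lfalse) {
    printf("    if %s < %s goto L%d\n", x, y, Ltrue);
    printf("    goto L%d\n", Lfalse);
}

int main(void) {
    int Ltrue = newlabel(), Lfalse = newlabel(), Lnext = newlabel();
    gen_relop("a", "b", Ltrue, Lfalse);
    printf("L%d: ...code for S1...\n", Ltrue);
    printf("    goto L%d\n", Lnext);
    printf("L%d: ...code for S2...\n", Lfalse);
    printf("L%d:\n", Lnext);
    return 0;
}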
Generated Code
◻ Consider: a < b or c < d and e < f
100: if a < b goto 103
101: t1 := 0
102: goto 104
103: t1 := 1
104: if c < d goto 107
105: t2 := 0
106: goto 108
107: t2 := 1
108: if e < f goto 111
109: t3 := 0
110: goto 112
111: t3 := 1
112: t4 := t2 and t3
113: t5 := t1 or t4
76
CONTROL FLOW: IF STATEMENT Generated Code
77
Generated Code
78
Run-Time Environment
79
Run-time Environment
• Compiler must cooperate with OS and other system software to support
implementation of different abstractions (names, scopes, bindings, data
types, operators, procedures, parameters, flow-of-control) on the target
machine.
• Memory layout (low address to high): Code, Static, Heap, Free Memory, Stack
• Code: memory locations for code are determined at compile time; usually placed in the low end of memory
• Static: the size of some program data is known at compile time – such data can be placed in a statically determined area
• Dynamic space areas – their size changes during program execution:
  • Heap
    – grows towards higher addresses
    – stores data allocated under program control
  • Stack
    – grows towards lower addresses
    – stores activation records
81
Typical subdivision of run-time memory
Static vs. Dynamic Allocation
• How do we allocate the space for the generated target code and the data objects of our source programs?
• The data objects whose locations can be determined at compile time are allocated statically.
• But space for some of the data objects must be allocated at run time.
• Stack allocation is a valid allocation scheme for procedures, since procedure calls are nested.
83
Procedure Activations
• Each execution of a procedure is called an activation of that procedure.
• An execution of a procedure P starts at the beginning of the procedure
body;
• When a procedure P is completed, it returns control to the point immediately
after the place where P was called.
• Lifetime of an activation of a procedure P is the sequence of steps between
the first and the last steps in execution of P (including the other procedures
called by P).
• If A and B are procedure activations, then their lifetimes are either
non-overlapping or are nested.
• If a procedure is recursive, a new activation can begin before an earlier
activation of the same procedure has ended.
84
Call Graph
A call graph is a directed multi-graph where:
• the nodes are the procedures of the program and
• the edges represent calls between these procedures.
85
Call Graph: Example
var a: array [0 .. 10] of integer;
procedure readarray
  var i: integer
  begin … a[i] … end
function partition(y, z: integer): integer
  var i, j, x, v: integer
  begin … end
• Call graph nodes: main, readarray, quicksort and partition; the edges represent the calls among them (in the quicksort example, main calls readarray and quicksort, and quicksort calls partition and itself)
86
Activation Tree/ Call Tree
• We can use a tree (called activation tree) to show the way control
enters and leaves activations.
• In an activation tree:
– Each node represents an activation of a procedure.
– The root represents the activation of the main program.
– The node A is a parent of the node B iff the control flows from A to B.
– The node A is to the left of the node B iff the lifetime of A occurs before the lifetime of B.
87
Activation Tree (cont.)
program main;
  procedure s;
  begin ... end;
  procedure p;
    procedure q;
    begin ... end;
  begin q; s; end;
begin p; s; end;
88
Sketch of a quicksort program
89
Run-time Control Flow for Quicksort
• main activates r() and then q(1, 9)
90
Implementing Run-time control flow
91
Activation Records
92
Activation Records
• return value: the returned value of the called procedure is returned in this field to the calling procedure. In practice, we may use a machine register for the return value.
• actual parameters: this field is used by the calling procedure to supply parameters to the called procedure.
• optional control link: points to the activation record of the caller.
• optional access link: used to refer to nonlocal data held in other activation records.
• saved machine status: holds information about the state of the machine before the procedure is called.
• local data: holds data that is local to an execution of the procedure.
• temporaries: temporary values are stored in this field.
93
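A minimal sketch (an assumed layout, purely illustrative) of these fields gathered into a C record; a real compiler lays the frame out on the run-time stack rather than as a standalone structure:

/* Sketch: the fields of an activation record, modeled as a struct. */
#define MAX_PARAMS   8
#define LOCALS_BYTES 64
#define TEMPS_BYTES  32

struct activation_record {
    long return_value;                        /* returned to the caller         */
    long actual_params[MAX_PARAMS];           /* supplied by the caller         */
    struct activation_record *control_link;   /* caller's activation record     */
    struct activation_record *access_link;    /* frame holding nonlocal data    */
    void *saved_machine_status;               /* saved registers, return address*/
    unsigned char local_data[LOCALS_BYTES];   /* locals of this activation      */
    unsigned char temporaries[TEMPS_BYTES];   /* compiler-generated temporaries */
};

int main(void) {
    struct activation_record ar = {0};
    (void)ar;                                 /* placeholder use of the frame   */
    return 0;
}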
Activation Records: Example
program main;
  procedure p;
    var a: real;
    procedure q;
      var b: integer;
    begin ... end;
  begin q; end;
  procedure s;
    var c: integer;
  begin ... end;
begin p; s; end;
• Stack while q is executing: activation records for main, p (local a) and q (local b); later, while s executes: main and s (local c)
94
Activation Records for Recursive Procedures
program main;
  procedure p;
    function q(a: integer): integer;
    begin
      if (a = 1) then q := 1
      else q := a + q(a - 1);
    end;
  begin q(3); end;
begin p; end;
• Stack at the deepest point of the recursion: main, p, q(3) [a: 3], q(2) [a: 2], q(1) [a: 1]
95
Stack Allocation for quicksort 1
96
Stack Allocation for quicksort 2
97
Stack Allocation for quicksort 3
98
Layout of the stack frame
99
Creation of An Activation Record
• Who deallocates?
– Callee de-allocates the part allocated by Callee.
– Caller de-allocates the part allocated by Caller.
100
Creation of An Activation Record
• The figure shows the caller's activation record followed by the callee's activation record on the stack; each record has the fields return value, actual parameters, optional control link, optional access link, saved machine status, local data and temporaries.
• A bracket marks which portion of the callee's record is the caller's responsibility to set up and which portion is the callee's responsibility.
101
Designing Calling Sequences
• Values communicated between the caller and the callee are generally placed at the beginning of the callee's activation record.
• Fixed-length items are generally placed in the middle.
• Items whose size may not be known early enough are placed at the end of the activation record.
• We must locate the top-of-stack pointer judiciously: a common approach is to have it point to the end of the fixed-length fields.
102
Division of tasks between caller and callee
103
Calling sequence
• The caller evaluates the actual parameters.
• The caller stores a return address and the old value of top-sp into the
callee's activation record.
• The callee saves the register values and other status information.
• The callee initializes its local data and begins execution.
104
Corresponding return sequence
105
Access to dynamically allocated arrays
106
Memory Manager
Typical Memory Hierarchy Configurations
• Two basic functions:
  – Allocation
  – Deallocation
• Properties of memory managers:
  – Space efficiency
  – Program efficiency
  – Low overhead
107