Unit 3 Compiler Design
Syntax Directed Definitions, Evaluation Orders for Syntax Directed Definitions, Intermediate
Languages: Syntax Tree, Three Address Code, Types and Declarations, Translation of
Expressions, Type Checking.
Syntax-Directed Translations
• Translation of languages guided by CFGs
• Information associated with programming language constructs
– Attributes attached to grammar symbols
– Values of attributes computed by “semantic rules” associated with grammar
productions
• Two notations for associating semantic rules
– Syntax-directed definitions
– Translation schemes
Semantic Rules
• Semantic rules perform various activities:
– Generation of code
– Save information in a symbol table
– Issue error messages
– Other activities
• Output of semantic rules is the translation of the token stream
Conceptual View
Attributes
• Each grammar symbol (node in the parse tree) has attributes attached to it, e.g. a string, a
number, a type, a memory location, etc.
• The value of a synthesized attribute at a node is computed from the values of attributes at
the children of that node in the parse tree.
• The value of an inherited attribute at a node is computed from the values of attributes at the
siblings and parent of that node.
A dependency graph represents dependencies between attributes
A parse tree showing the values of attributes at each node is an annotated parse tree
• Each semantic rule for production A -> α has the form
b := f(c1, c2, …, ck)
– f is a function
– b may be a synthesized attribute of A or
– b may be an inherited attribute of one of the grammar symbols on the right side of
the production
– c1, c2, …, ck are attributes belonging to grammar symbols of the production
• An attribute grammar is a syntax-directed definition in which the functions in semantic
rules cannot have side effects
NOTE: in a general syntax-directed definition, a semantic rule may have side effects, e.g.
printing a value or updating a global variable.
S-attributed Definitions
L → E n        print(E.val)
NOTE
Inherited Attributes
• Inherited Attributes:
– Value at a node in a parse tree depends
on attributes of parent and/or siblings
– Convenient for expressing dependencies of programming language constructs on
context
• It is always possible to avoid inherited attributes, but they are often convenient
Production          Semantic rules
L → L1 , id         L1.in := L.in
                    addtype(id.entry, L.in)
L → id              addtype(id.entry, L.in)
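The way L.in travels down the identifier list can be sketched in Python. This is only an illustration: the dictionary standing in for the symbol table, the function names, and the declaration production D → T L (which does not survive in these notes) are assumptions.

```python
# Sketch: passing an inherited attribute (the declared type) down a list of ids.
# `symtab`, `visit_L` and `visit_D` are illustrative names, not part of the notes.

symtab = {}

def addtype(entry, type_):
    """Stand-in for addtype(id.entry, L.in): record the type of one identifier."""
    symtab[entry] = type_

def visit_L(ids, inherited_type):
    """L -> L1 , id | id, with the identifiers given left to right."""
    if len(ids) > 1:
        visit_L(ids[:-1], inherited_type)   # L1.in := L.in
    addtype(ids[-1], inherited_type)        # addtype(id.entry, L.in)

def visit_D(type_name, ids):
    """Assumed production D -> T L: the type found by T becomes L.in."""
    visit_L(ids, type_name)

visit_D("real", ["id1", "id2", "id3"])
print(symtab)
```

Every identifier ends up with the same type because L.in is copied unchanged from each L node to its child L1.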
Dependency Graphs
• Dependency graph:
– Depicts interdependencies among synthesized and inherited attributes
– Includes dummy nodes for procedure calls
• Numbered with a topological sort
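As a rough sketch of how a topological sort yields an evaluation order, the snippet below uses Python's standard graphlib module (Python 3.9 or later). The attribute names come from a hypothetical declaration with two identifiers; the addtype nodes play the role of the dummy nodes for procedure calls.

```python
from graphlib import TopologicalSorter

# Each node maps to the nodes it depends on (its predecessors in the graph).
deps = {
    "T.type": [],
    "L.in": ["T.type"],
    "L1.in": ["L.in"],
    "addtype(id2)": ["L.in"],    # dummy node for a procedure call
    "addtype(id1)": ["L1.in"],   # dummy node for a procedure call
}

# static_order() yields every node after all of its predecessors.
order = list(TopologicalSorter(deps).static_order())
print(order)
```

Any order produced this way is a valid order in which to evaluate the semantic rules; several different valid orders may exist for the same graph.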
Example(inherited attribute)
• We put the values of the synthesized attributes of the grammar symbols into a parallel
stack.
– When an entry of the parser stack holds a grammar symbol X (terminal or non-
terminal), the corresponding entry in the parallel stack will hold the synthesized
attribute(s) of the symbol X.
• We evaluate the values of the attributes during reductions.
(1) L → E \n        print(val[top])
(3) E → T
(5) T → F
(7) F → digit
Input      Stack    val      Production used
*5+4\n     3        3
*5+4\n     F        3        (7)
*5+4\n     T        3        (5)
5+4\n      T*       3_
+4\n       E        15       (3)
4\n        E+       15_
\n         E+4      15_4
\n         E        19       (2)
           E\n      19_
           L        19       (1)
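The parallel value stack can be sketched in Python by replaying the parser's moves for the input 3*5+4\n. The action list is written by hand here (a real LR parser would produce it), and the numbering (2) E → E + T and (4) T → T * F is assumed from the odd-numbered productions listed above.

```python
# Sketch: a value stack kept in parallel with the parser stack.
ACTIONS = [
    ("shift", "3"), ("reduce", 7), ("reduce", 5),
    ("shift", "*"), ("shift", "5"), ("reduce", 7), ("reduce", 4), ("reduce", 3),
    ("shift", "+"), ("shift", "4"), ("reduce", 7), ("reduce", 5), ("reduce", 2),
    ("shift", "\n"), ("reduce", 1),
]

def run(actions):
    val = []        # val[i] is the synthesized attribute of the i-th stack symbol
    printed = []
    for kind, arg in actions:
        if kind == "shift":
            # Digits carry their lexical value; operators and \n carry none.
            val.append(int(arg) if arg.isdigit() else None)
        elif arg in (2, 4):              # E -> E + T  or  T -> T * F
            right = val.pop()
            val.pop()                    # slot of the operator
            left = val.pop()
            val.append(left + right if arg == 2 else left * right)
        elif arg == 1:                   # L -> E \n
            val.pop()                    # slot of \n; E.val is now on top
            printed.append(val[-1])
        # Productions 3, 5 and 7 have one symbol on the right side,
        # so the value already on top of the stack is simply kept.
    return printed

print(run(ACTIONS))
```

Each reduction replaces the values of the right-side symbols by the single value of the left-side nonterminal, mirroring what happens on the parser stack.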
A top-down parser can evaluate attributes as it parses if the attribute values can be computed in a
top-down fashion. Such attribute grammars are termed L-Attributed. First, we introduce a
new type of symbol called an action symbol. Action symbols appear in the grammar in any place
a terminal or nonterminal may appear. They may also have their own attributes. They may,
however, be pushed onto their own stack, called a semantic stack or attribute stack.
We illustrate action symbols using the notation "<>" which indicates that the symbol within
the brackets is to be pushed onto the semantic stack when it appears at the top of the parse stack.
By inserting this action in appropriate places, we will create a translator which converts from
infix expressions to postfix expressions.
We parse and translate a + b * c. The top is on the left for both stacks.
abc*+
the input string translated to postfix. In Example 6, the action symbol did not have any attached
attributes.
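A minimal sketch of such a translator in Python is given below. The grammar of Example 6 is not reproduced in these notes, so the usual LL(1) expression grammar with + and * over single-character operands is assumed; the list out stands in for the semantic stack, and each append marks the position of an action symbol.

```python
def to_postfix(tokens):
    """Translate infix tokens to postfix with a top-down (recursive descent) parse."""
    out = []      # receives operands and operators where action symbols fire
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def expr():                  # E -> T { + T <+> }
        term()
        while peek() == "+":
            advance()
            term()
            out.append("+")      # action symbol <+>

    def term():                  # T -> F { * F <*> }
        factor()
        while peek() == "*":
            advance()
            factor()
            out.append("*")      # action symbol <*>

    def factor():                # F -> id <id>
        tok = peek()
        if tok is None or not tok.isalnum():
            raise SyntaxError(f"unexpected token {tok!r}")
        advance()
        out.append(tok)          # action symbol <id>

    expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return "".join(out)

print(to_postfix(list("a+b*c")))
```

The operator is appended only after both of its operands have been parsed, which is exactly what placing the action symbol at the end of the right-hand side achieves.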
The BNF in Example 6 is in LL(1) form. This is necessary for the top-down parse.
(1) says that each inherited attribute of a symbol on the right-hand side depends only on
inherited attributes of the right-hand side and arbitrary attributes of the symbols to the left of the
given right-hand side symbol.
(2) says that each synthesized attributes of the left-hand-side symbol depends only on
inherited attributes of that symbol and arbitrary attributes of right-hand-side symbols.
(2) says that the synthesized attributes of any action symbol depend only on the inherited
attributes of the action symbol.
The front end translates a source program into an intermediate representation from
which the back end generates target code.
1. Retargeting is facilitated. That is, a compiler for a different machine can be created
by attaching a back end for the new machine to an existing front end.
2. A machine-independent code optimizer can be applied to the intermediate representation.
INTERMEDIATE LANGUAGES
Syntax tree
Postfix notation
Three address code
The semantic rules for generating three-address code from common programming language
constructs are similar to those for constructing syntax trees or for generating postfix notation.
Graphical Representations:
SYNTAX TREE:
A syntax tree depicts the natural hierarchical structure of a source program. A dag
(Directed Acyclic Graph) gives the same information but in a more compact way because
common subexpressions are identified. A syntax tree and dag for the assignment statement
a := b * -c + b * -c are as follows:
[Figure: (a) syntax tree and (b) dag for a := b * -c + b * -c]
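The difference between the two structures can be sketched in Python: a dag is obtained by returning an existing node whenever a node with the same operator and children has already been built. The helper names make_builder and build are illustrative assumptions.

```python
def make_builder(share):
    """Return (nodes, node); with share=True identical nodes are reused (dag)."""
    nodes = []     # each node is a tuple (op, child...), children given by index
    index = {}     # node tuple -> position in `nodes`

    def node(op, *kids):
        key = (op, *kids)
        if share and key in index:
            return index[key]            # common subexpression: reuse the node
        nodes.append(key)
        index[key] = len(nodes) - 1
        return len(nodes) - 1

    return nodes, node

def build(node):
    """Build a := b * -c + b * -c and return the position of the root."""
    left = node("*", node("id", "b"), node("uminus", node("id", "c")))
    right = node("*", node("id", "b"), node("uminus", node("id", "c")))
    return node("assign", node("id", "a"), node("+", left, right))

tree_nodes, tree_node = make_builder(share=False)
build(tree_node)
dag_nodes, dag_node = make_builder(share=True)
build(dag_node)
print(len(tree_nodes), len(dag_nodes))
```

With sharing enabled, the second occurrence of b * -c maps onto the nodes created for the first one, so the + node has the same child twice.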
Postfix notation:
Syntax trees for assignment statements are produced by the syntax-directed definition.
Non-terminal S generates an assignment statement. The two binary operators + and * are
examples of the full operator set in a typical language. Operator associativities and precedences
are the usual ones, even though they have not been put into the grammar. This definition
constructs the tree from the input a := b * -c + b * -c.
The token id has an attribute place that points to the symbol-table entry for the identifier.
A symbol-table entry can be found from an attribute id.name, representing the lexeme
associated with that occurrence of id. If the lexical analyzer holds all lexemes in a single array of
characters, then attribute name might be the index of the first character of the lexeme.
Two representations of the syntax tree are as follows. In (a) each node is represented as a
record with a field for its operator and additional fields for pointers to its children. In (b), nodes
are allocated from an array of records and the index or position of the node serves as the pointer
to the node. All the nodes in the syntax tree can be visited by following pointers, starting from
the root at position 10.
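(a) [Figure: linked records. The root assign node points to id a and to a + node; each
operand of + is a * node whose children are id b and a uminus node over id c.]
(b) Array of records:
 0   id       b
 1   id       c
 2   uminus   1
 3   *        0    2
 4   id       b
 5   id       c
 6   uminus   5
 7   *        4    6
 8   +        3    7
 9   id       a
10   assign   9    8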
THREE-ADDRESS CODE:
Three-address code is a sequence of statements of the general form
x := y op z
where x, y and z are names, constants, or compiler-generated temporaries; op stands for any
operator, such as a fixed- or floating-point arithmetic operator, or a logical operator on
boolean-valued data. Thus a source language expression like x + y * z might be translated into
a sequence
t1 := y * z
t2 := x + t1
where t1 and t2 are compiler-generated temporary names.
Three-address code corresponding to the syntax tree and dag given above:
(a) Code for the syntax tree      (b) Code for the dag
t1 := -c                          t1 := -c
t2 := b * t1                      t2 := b * t1
t3 := -c                          t5 := t2 + t2
t4 := b * t3                      a := t5
t5 := t2 + t4
a := t5
The reason for the term “three-address code” is that each statement usually contains three
addresses, two for the operands and one for the result.
3. The unconditional jump goto L. The three-address statement with label L is the next to be
executed.
4. Conditional jumps such as if x relop y goto L. This instruction applies a relational operator
(<, =, >=, etc.) to x and y, and executes the statement with label L next if x stands in relation
relop to y. If not, the three-address statement following if x relop y goto L is executed next,
as in the usual sequence.
6. param x and call p, n for procedure calls, and return y, where y, representing a returned
value, is optional. For example, the sequence
param x1
param x2
...
param xn
call p, n
is generated as part of a call of the procedure p(x1, x2, …, xn).
Production        Semantic rules
E → E1 + E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘+’ E2.place)
E → E1 * E2       E.place := newtemp;
                  E.code := E1.code || E2.code || gen(E.place ‘:=’ E1.place ‘*’ E2.place)
E → - E1          E.place := newtemp;
                  E.code := E1.code || gen(E.place ‘:=’ ‘uminus’ E1.place)
E → id            E.place := id.place;
                  E.code := ‘ ’
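A minimal Python sketch of these rules follows. It is a simplification under stated assumptions: expressions are given as nested tuples rather than parsed, and statements are appended to a list as they are generated instead of concatenating E.code strings; translate_assign and EXPR are illustrative names.

```python
import itertools

def translate_assign(target, expr):
    """Generate three-address code for `target := expr`.

    expr is a nested tuple: ('id', name), ('uminus', e), ('+', e1, e2) or ('*', e1, e2).
    """
    counter = itertools.count(1)
    code = []

    def newtemp():
        return f"t{next(counter)}"

    def walk(e):
        """Emit code for e and return E.place, the name holding its value."""
        op = e[0]
        if op == "id":
            return e[1]                   # E.place := id.place, no code
        if op == "uminus":
            operand = walk(e[1])
            place = newtemp()
            code.append(f"{place} := uminus {operand}")
            return place
        left = walk(e[1])
        right = walk(e[2])
        place = newtemp()
        code.append(f"{place} := {left} {op} {right}")
        return place

    code.append(f"{target} := {walk(expr)}")
    return code

# b * -c + b * -c
PRODUCT = ("*", ("id", "b"), ("uminus", ("id", "c")))
EXPR = ("+", PRODUCT, PRODUCT)
for line in translate_assign("a", EXPR):
    print(line)
```

Because the code for the operands is emitted before the statement that combines them, the list has the same order as E1.code || E2.code || gen(...).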
t1=c*d
t2=e/f
t3=b+t1
t4=t3-t2
a=t4
Quadruples
Triples
Indirect triples
Quadruples:
A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result.
The op field contains an internal code for the operator. The three-address statement
x := y op z is represented by placing y in arg1, z in arg2 and x in result.
The contents of fields arg1, arg2 and result are normally pointers to the symbol-table
entries for the names represented by these fields. If so, temporary names must be entered
into the symbol table as they are created.
Triples:
To avoid entering temporary names into the symbol table, we might refer to a temporary
value by the position of the statement that computes it.
If we do so, three-address statements can be represented by records with only three
fields: op, arg1 and arg2.
The fields arg1 and arg2, for the arguments of op, are either pointers to the symbol table
or pointers into the triple structure ( for temporary values ).
Since three fields are used, this intermediate code format is known as triples.
(a) Quadruples
       op       arg1    arg2    result
(0)    uminus   c               t1
(1)    *        b       t1      t2
(2)    uminus   c               t3
(3)    *        b       t3      t4
(4)    +        t2      t4      t5
(5)    :=       t5              a
(b) Triples
       op       arg1    arg2
(0)    uminus   c
(1)    *        b       (0)
(2)    uminus   c
(3)    *        b       (2)
(4)    +        (1)     (3)
(5)    assign   a       (4)
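The relationship between the two formats can be sketched in Python: triples are obtained from quadruples by replacing every temporary with the position of the statement that computes it. The sketch assumes that each temporary is assigned exactly once and that only the := statement writes to a program variable; Quad and to_triples are illustrative names.

```python
from typing import NamedTuple, Optional

class Quad(NamedTuple):
    op: str
    arg1: str
    arg2: Optional[str]
    result: str

QUADS = [
    Quad("uminus", "c", None, "t1"),
    Quad("*", "b", "t1", "t2"),
    Quad("uminus", "c", None, "t3"),
    Quad("*", "b", "t3", "t4"),
    Quad("+", "t2", "t4", "t5"),
    Quad(":=", "t5", None, "a"),
]

def to_triples(quads):
    """Return (op, arg1, arg2) records; an int argument is a position in the list."""
    position = {}      # temporary name -> index of the triple that computes it
    triples = []

    def ref(arg):
        return position.get(arg, arg)    # temporaries become positions

    for q in quads:
        if q.op == ":=":
            triples.append(("assign", q.result, ref(q.arg1)))
        else:
            triples.append((q.op, ref(q.arg1), ref(q.arg2)))
            position[q.result] = len(triples) - 1
    return triples

for i, t in enumerate(to_triples(QUADS)):
    print(i, t)
```

No temporary names survive in the result, which is why triples avoid entering temporaries into the symbol table.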
A ternary operation like x[i] := y requires two entries in the triple structure, as shown below,
while x := y[i] is naturally represented as two operations.
Indirect Triples:
DECLARATIONS
Declarations in a Procedure:
The syntax of languages such as C, Pascal and Fortran allows all the declarations in a
single procedure to be processed as a group. In this case, a global variable, say offset, can
keep track of the next available relative address.
In the translation scheme shown below:
P → { offset := 0 } D
D → D ; D
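The role of offset can be sketched in Python. The type widths and the function name layout are assumptions made for illustration; each tuple appended to the table corresponds to what an enter(name, type, offset) action for a declaration id : T would record.

```python
# Assumed widths in bytes; the notes do not fix them.
WIDTH = {"integer": 4, "real": 8}

def layout(decls):
    """Assign relative addresses to a list of (name, type) declarations."""
    offset = 0                                 # offset := 0 before any declaration
    table = []
    for name, type_ in decls:
        table.append((name, type_, offset))    # record name, type and address
        offset += WIDTH[type_]                 # offset := offset + T.width
    return table, offset

table, total = layout([("x", "real"), ("i", "integer"), ("y", "real")])
print(table, total)
```

After the last declaration, offset equals the total storage needed for the locals of the procedure.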
D → D ; D | id : T | proc id ; D ; S
One possible implementation of a symbol table is a linked list of entries for names.
A new symbol table is created when a procedure declaration D → proc id ; D1 ; S is seen,
and entries for the declarations in D1 are created in the new table. The new table points back to
the symbol table of the enclosing procedure; the name represented by id itself is local to the
enclosing procedure. The only change from the treatment of variable declarations is that the
procedure enter is told which symbol table to make an entry in.
For example, consider the symbol tables for procedures readarray, exchange, and
quicksort pointing back to that for the containing procedure sort, consisting of the entire
program. Since partition is declared within quicksort, its table points to that of quicksort.
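[Figure: symbol tables for nested procedures. The table for sort has a header whose pointer to
the enclosing table is nil, and entries a, x, readarray, exchange and quicksort; the procedure
entries point to the tables for readarray, exchange and quicksort. The table for partition,
with entries i and j, points back to that of quicksort.]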
1. mktable(previous) creates a new symbol table and returns a pointer to the new table. The
argument previous points to a previously created symbol table, presumably that for the
enclosing procedure.
2. enter(table, name, type, offset) creates a new entry for name name in the symbol table pointed
to by table. Again, enter places type type and relative address offset in fields within the entry.
3. addwidth(table, width) records the cumulative width of all the entries in table in the header
associated with this symbol table.
4. enterproc(table, name, newtable) creates a new entry for procedure name in the symbol table
pointed to by table. The argument newtable points to the symbol table for this procedure
name.
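The four operations can be sketched in Python as follows. The class Table, its field names, and the helper lookup are assumptions for illustration; only mktable, enter, addwidth and enterproc come from the list above.

```python
class Table:
    """One symbol table; `previous` links to the table of the enclosing procedure."""
    def __init__(self, previous=None):
        self.previous = previous
        self.width = 0          # header field filled in by addwidth
        self.entries = {}

def mktable(previous):
    return Table(previous)

def enter(table, name, type_, offset):
    table.entries[name] = ("var", type_, offset)

def addwidth(table, width):
    table.width = width

def enterproc(table, name, newtable):
    table.entries[name] = ("proc", newtable)

def lookup(table, name):
    """Hypothetical helper: search the current table, then the enclosing ones."""
    while table is not None:
        if name in table.entries:
            return table.entries[name]
        table = table.previous
    return None

sort = mktable(None)
enter(sort, "a", "integer", 0)
quicksort = mktable(sort)
enter(quicksort, "k", "integer", 0)
addwidth(quicksort, 4)
enterproc(sort, "quicksort", quicksort)
print(lookup(quicksort, "a"))
```

A name that is not local to quicksort is found by following the previous pointer to the table of sort.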
D → D1 ; D2
The stack tblptr is used to contain pointers to the tables for sort, quicksort, and
partition when the declarations in partition are considered.
The top element of stack offset is the next available relative address for a local of
the current procedure.
All semantic actions in the subtrees for B and C in
A → B C { actionA }
are done before actionA at the end of the production occurs. Hence, the action associated
with the marker M is the first to be done.
The action for nonterminal M initializes stack tblptr with a symbol table for the
outermost scope, created by operation mktable(nil). The action also pushes relative
address 0 onto stack offset.
Similarly, the nonterminal N uses the operation mktable(top(tblptr)) to create a new
symbol table. The argument top(tblptr) gives the enclosing scope for the new table.
For each variable declaration id: T, an entry is created for id in the current symbol table.
The top of stack offset is incremented by T.width.
When the action on the right side of D → proc id ; N D1 ; S occurs, the width of all
declarations generated by D1 is on the top of stack offset; it is recorded using addwidth.
Stacks tblptr and offset are then popped.
At this point, the name of the enclosed procedure is entered into the symbol table of
its enclosing procedure.
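The interplay of the two stacks can be sketched in Python. Declarations are given as nested tuples instead of being parsed, tables are plain dictionaries, and the type widths are assumed; the comments name the marker or operation each step corresponds to.

```python
WIDTH = {"integer": 4, "real": 8}    # assumed widths

def new_table(previous):
    return {"previous": previous, "entries": {}, "width": 0}

def process(decls):
    """decls: list of ('var', name, type) or ('proc', name, nested_decls)."""
    tblptr = [new_table(None)]       # marker M: table for the outermost scope
    offset = [0]                     # marker M: relative address 0

    def walk(items):
        for item in items:
            if item[0] == "var":
                _, name, type_ = item
                tblptr[-1]["entries"][name] = (type_, offset[-1])   # enter
                offset[-1] += WIDTH[type_]
            else:
                _, name, nested = item
                table = new_table(tblptr[-1])    # marker N: mktable(top(tblptr))
                tblptr.append(table)
                offset.append(0)
                walk(nested)
                table["width"] = offset[-1]      # addwidth
                tblptr.pop()
                offset.pop()
                tblptr[-1]["entries"][name] = ("proc", table)       # enterproc

    walk(decls)
    tblptr[-1]["width"] = offset[-1]
    return tblptr[-1]

root = process([
    ("var", "a", "integer"),
    ("proc", "q", [("var", "k", "integer"), ("var", "v", "real")]),
    ("var", "x", "real"),
])
print(root["entries"]["x"], root["width"])
```

Because the offset of the enclosing procedure stays on the stack while the nested procedure is processed, x continues at address 4 after the declarations of q are finished.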
BOOLEAN EXPRESSIONS
Boolean expressions have two primary purposes. They are used to compute logical
values, but more often they are used as conditional expressions in statements that alter the flow
of control, such as if-then-else, or while-do statements.
Boolean expressions are composed of the boolean operators ( and, or, and not ) applied
to elements that are boolean variables or relational expressions. Relational expressions are of the
form E1 relop E2, where E1 and E2 are arithmetic expressions.
Here we consider boolean expressions generated by the following grammar:
E → E or E | E and E | not E | ( E ) | id relop id | true | false
There are two principal methods of representing the value of a boolean expression. They are:
1. To encode true and false numerically and to evaluate a boolean expression analogously
to an arithmetic expression. Often, 1 is used to denote true and 0 to denote false.
2. To implement boolean expressions by flow of control, that is, representing the value of a
boolean expression by a position reached in a program. This method is particularly
convenient for implementing the boolean expressions in flow-of-control statements, such as
if-then and while-do statements.
Numerical Representation
Here, 1 denotes true and 0 denotes false. Expressions will be evaluated completely from
left to right, in a manner similar to arithmetic expressions.
For example :
which can be translated into the three-address code sequence (again, we arbitrarily
start statement numbers at 100) :
E → E1 or E2        { E.place := newtemp;
                      emit( E.place ‘:=’ E1.place ‘or’ E2.place ) }
E → E1 and E2       { E.place := newtemp;
                      emit( E.place ‘:=’ E1.place ‘and’ E2.place ) }
E → not E1          { E.place := newtemp;
                      emit( E.place ‘:=’ ‘not’ E1.place ) }
E → ( E1 )          { E.place := E1.place }
E → id1 relop id2   { E.place := newtemp;
                      emit( ‘if’ id1.place relop.op id2.place ‘goto’ nextstat + 3 );
                      emit( E.place ‘:=’ ‘0’ );
                      emit( ‘goto’ nextstat + 2 );
                      emit( E.place ‘:=’ ‘1’ ) }
E → true            { E.place := newtemp;
                      emit( E.place ‘:=’ ‘1’ ) }
E → false           { E.place := newtemp;
                      emit( E.place ‘:=’ ‘0’ ) }
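The scheme can be tried out with a small Python sketch. Expressions are given as nested tuples rather than parsed, statement numbers start at 100, and the class Emitter and function gen are illustrative names; only the and, or, not and relop cases are covered.

```python
class Emitter:
    """Collects numbered three-address statements."""
    def __init__(self, start=100):
        self.start = start
        self.code = []
        self.temps = 0

    @property
    def nextstat(self):
        """Number of the next statement to be emitted."""
        return self.start + len(self.code)

    def emit(self, text):
        self.code.append(f"{self.nextstat}: {text}")

    def newtemp(self):
        self.temps += 1
        return f"t{self.temps}"

def gen(e, out):
    """Emit code for e and return E.place."""
    op = e[0]
    if op == "relop":
        _, rel, a, b = e
        place = out.newtemp()
        out.emit(f"if {a} {rel} {b} goto {out.nextstat + 3}")
        out.emit(f"{place} := 0")
        out.emit(f"goto {out.nextstat + 2}")
        out.emit(f"{place} := 1")
        return place
    if op == "not":
        operand = gen(e[1], out)
        place = out.newtemp()
        out.emit(f"{place} := not {operand}")
        return place
    left = gen(e[1], out)                # op is 'and' or 'or'
    right = gen(e[2], out)
    place = out.newtemp()
    out.emit(f"{place} := {left} {op} {right}")
    return place

# a < b or c < d and e < f
out = Emitter()
gen(("or", ("relop", "<", "a", "b"),
     ("and", ("relop", "<", "c", "d"), ("relop", "<", "e", "f"))), out)
print("\n".join(out.code))
```

Every relational subexpression costs four statements, and the whole expression is always evaluated, which is what distinguishes this method from short-circuit code.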
Short-Circuit Code:
We can also translate a boolean expression into three-address code without generating code
for any of the boolean operators and without having the code necessarily evaluate the entire
expression. This style of evaluation is sometimes called “short-circuit” or “jumping” code. It is
possible to evaluate boolean expressions without generating code for the boolean operators and, or,
and not if we represent the value of an expression by a position in the code sequence.
Using flow of control, the boolean operators, and the conditional statements that contain
boolean expressions, are translated into three-address code as follows.
Production        Semantic rules
E → E1 or E2      E1.true := E.true;
                  E1.false := newlabel;
                  E2.true := E.true;
                  E2.false := E.false;
                  E.code := E1.code || gen(E1.false ‘:’) || E2.code
E → ( E1 )        E1.true := E.true;
                  E1.false := E.false;
                  E.code := E1.code
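A Python sketch of jumping code follows. The rule for or mirrors the one above; the rules for and, not and relop are the standard counterparts and are assumptions here, since they do not survive in these notes. The labels Ltrue and Lfalse stand for E.true and E.false of the whole expression.

```python
import itertools

def jumping_code(expr, true="Ltrue", false="Lfalse"):
    """Return short-circuit code for expr as a list of statements."""
    labels = itertools.count(1)

    def newlabel():
        return f"L{next(labels)}"

    def gen(e, t, f):
        op = e[0]
        if op == "or":
            e1_false = newlabel()        # E1.false := newlabel
            return gen(e[1], t, e1_false) + [f"{e1_false}:"] + gen(e[2], t, f)
        if op == "and":
            e1_true = newlabel()         # E1.true := newlabel
            return gen(e[1], e1_true, f) + [f"{e1_true}:"] + gen(e[2], t, f)
        if op == "not":
            return gen(e[1], f, t)       # swap the two exits, no code of its own
        if op == "relop":
            _, rel, a, b = e
            return [f"if {a} {rel} {b} goto {t}", f"goto {f}"]
        raise ValueError(f"unknown operator {op!r}")

    return gen(expr, true, false)

# a < b or c < d and e < f
CODE = jumping_code(("or", ("relop", "<", "a", "b"),
                     ("and", ("relop", "<", "c", "d"), ("relop", "<", "e", "f"))))
print("\n".join(CODE))
```

No statement is generated for or, and or not themselves; their effect is entirely in the choice of jump targets.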
Flow-of-Control Statements
We now consider the translation of boolean expressions into three-address code in the
context of if-then, if-then-else, and while-do statements such as those generated by the following
grammar:
S → if E then S1
  | if E then S1 else S2
  | while E do S1
In each of these productions, E is the Boolean expression to be translated. In the translation, we
assume that a three-address statement can be symbolically labeled, and that the function
newlabel returns a new symbolic label each time it is called.
E.true is the label to which control flows if E is true, and E.false is the label to which
control flows if E is false.
The semantic rules for translating a flow-of-control statement S allow control to flow
from the translation S.code to the three-address instruction immediately following
S.code.
S.next is a label that is attached to the first three-address instruction to be executed after
the code for S.
S.begin:  E.code        (jumps to E.true or to E.false)
E.true:   S1.code
          goto S.begin
E.false:  ...
(c) while-do
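The while-do layout can be sketched in Python. Conditions are restricted to a single relational test and statements are given as tuples, both simplifications assumed for brevity; translate and its label names are illustrative.

```python
import itertools

def translate(stmt, next_label="Lnext"):
    """Return three-address code for stmt; control leaves to next_label."""
    labels = itertools.count(1)

    def newlabel():
        return f"L{next(labels)}"

    def gen_cond(cond, true, false):
        a, rel, b = cond
        return [f"if {a} {rel} {b} goto {true}", f"goto {false}"]

    def gen_stmt(s, nxt):
        kind = s[0]
        if kind == "assign":                     # ('assign', target, expression text)
            return [f"{s[1]} := {s[2]}"]
        if kind == "if":                         # ('if', cond, body)
            _, cond, body = s
            true = newlabel()                    # E.true; E.false := S.next
            return gen_cond(cond, true, nxt) + [f"{true}:"] + gen_stmt(body, nxt)
        if kind == "while":                      # ('while', cond, body)
            _, cond, body = s
            begin, true = newlabel(), newlabel() # S.begin and E.true
            return ([f"{begin}:"] + gen_cond(cond, true, nxt) + [f"{true}:"]
                    + gen_stmt(body, begin) + [f"goto {begin}"])
        raise ValueError(f"unknown statement {kind!r}")

    return gen_stmt(stmt, next_label)

LOOP = translate(("while", ("i", "<", "n"), ("assign", "i", "i + 1")))
print("\n".join(LOOP))
```

The false exit of the condition is S.next, so leaving the loop needs no extra jump after the body.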
TYPE CHECKING
A compiler must check that the source program follows both syntactic and semantic
conventions of the source language.
This checking, called static checking, detects and reports programming errors.
2. Flow-of-control checks – Statements that cause flow of control to leave a construct must
have some place to which to transfer the flow of control. Example: a break statement in C
causes control to leave the smallest enclosing while, for, or switch statement; an error occurs
if no such enclosing statement exists.
A type checker verifies that the type of a construct matches that expected by its context.
For example : arithmetic operator mod in Pascal requires integer operands, so a type
checker verifies that the operands of mod have type integer.
Type information gathered by a type checker may be needed when code is generated.
TYPE SYSTEMS
The design of a type checker for a language is based on information about the syntactic
constructs in the language, the notion of types, and the rules for assigning types to
language constructs.
For example: “if both operands of the arithmetic operators +, - and * are of type integer,
then the result is of type integer”
Type Expressions
1. Basic types such as boolean, char, integer, real are type expressions.
A special basic type, type_error , will signal an error during type checking; void denoting
“the absence of a value” allows statements to be checked.
2. Since type expressions may be named, a type name is a type expression.
For example:
type row = record
address: integer;
lexeme: array[1..15] of char
end;
var table: array[1..101] of row;
declares the type name row representing the type expression record((address × integer) ×
(lexeme × array(1..15, char))) and the variable table to be an array of records of this type.
Pointers : If T is a type expression, then pointer(T) is a type expression denoting the type
“pointer to an object of type T”.
For example, var p: ↑ row declares variable p to have type pointer(row).
Type systems
A type system is a collection of rules for assigning type expressions to the various parts of
a program.
A type checker implements a type system. It is specified in a syntax-directed manner.
Different type systems may be used by different compilers or processors of the same
language.
Checking done by a compiler is said to be static, while checking done when the target
program runs is termed dynamic.
Error Recovery
Since type checking has the potential for catching errors in programs, it is desirable for a
type checker to recover from errors, so it can check the rest of the input.
Error handling has to be designed into the type system right from the start; the type
checking rules must be prepared to cope with errors.
Here, we specify a type checker for a simple language in which the type of each
identifier must be declared before the identifier is used. The type checker is a translation scheme
that synthesizes the type of each expression from the types of its sub expressions. The type
checker can handle arrays, pointers, statements and functions.
A Simple Language
P→D;E
D → D ; D | id : T
T → char | integer | array [ num ] of T | ↑ T
E → literal | num | id | E mod E | E [ E ] | E ↑
Translation scheme:
P→D;E
D→D;D
D → id : T                  { addtype(id.entry, T.type) }
T → char                    { T.type := char }
T → integer                 { T.type := integer }
T → ↑ T1                    { T.type := pointer(T1.type) }
T → array [ num ] of T1     { T.type := array(1..num.val, T1.type) }
In the following rules, the attribute type for E gives the type expression assigned to the
expression generated by E.
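A type checker for the expressions of this language can be sketched in Python. Type expressions are represented as strings for basic types and tuples for pointer and array types; this encoding, the tuple form of expressions, and the treatment of undeclared identifiers as type_error are assumptions made for illustration.

```python
def check(e, symtab):
    """Return the type of expression e, or 'type_error'."""
    kind = e[0]
    if kind == "literal":                    # E -> literal
        return "char"
    if kind == "num":                        # E -> num
        return "integer"
    if kind == "id":                         # E -> id: look up the declared type
        return symtab.get(e[1], "type_error")
    if kind == "mod":                        # E -> E1 mod E2
        t1, t2 = check(e[1], symtab), check(e[2], symtab)
        return "integer" if t1 == t2 == "integer" else "type_error"
    if kind == "index":                      # E -> E1 [ E2 ]
        t1, t2 = check(e[1], symtab), check(e[2], symtab)
        if t2 == "integer" and isinstance(t1, tuple) and t1[0] == "array":
            return t1[2]                     # element type of the array
        return "type_error"
    if kind == "deref":                      # E -> E1 ↑
        t = check(e[1], symtab)
        if isinstance(t, tuple) and t[0] == "pointer":
            return t[1]                      # type of the object pointed to
        return "type_error"
    return "type_error"

SYMTAB = {
    "i": "integer",
    "a": ("array", 10, "char"),              # a : array [10] of char
    "p": ("pointer", "integer"),             # p : ↑ integer
}
print(check(("index", ("id", "a"), ("id", "i")), SYMTAB))
```

Each case synthesizes the type of an expression from the types of its subexpressions, and an error in a subexpression propagates upward as type_error.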
The postfix operator ↑ yields the object pointed to by its operand. The type of E ↑ is the type
t of the object pointed to by the pointer E.
Statements do not have values; hence the basic type void can be assigned to them. If an error
is detected within a statement, then type_error is assigned.
1. Assignment statement:
S → id := E             { S.type := if id.type = E.type then void
                                    else type_error }
2. Conditional statement:
S → if E then S1        { S.type := if E.type = boolean then S1.type
                                    else type_error }
3. While statement:
S → while E do S1       { S.type := if E.type = boolean then S1.type
                                    else type_error }
4. Sequence of statements:
S → S1 ; S2             { S.type := if S1.type = void and S2.type = void
                                    then void
                                    else type_error }
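The four statement rules translate almost directly into Python. In this sketch each statement is a tuple, and the types of identifiers and expressions are assumed to have been computed already and are passed in as strings; stmt_type is an illustrative name.

```python
def stmt_type(s):
    """Return 'void' if statement s is well typed, otherwise 'type_error'."""
    kind = s[0]
    if kind == "assign":                     # ('assign', id.type, E.type)
        _, id_type, e_type = s
        return "void" if id_type == e_type else "type_error"
    if kind in ("if", "while"):              # (kind, E.type, S1)
        _, e_type, body = s
        return stmt_type(body) if e_type == "boolean" else "type_error"
    if kind == "seq":                        # ('seq', S1, S2)
        _, s1, s2 = s
        both_void = stmt_type(s1) == "void" and stmt_type(s2) == "void"
        return "void" if both_void else "type_error"
    return "type_error"

good = ("seq", ("assign", "integer", "integer"),
        ("while", "boolean", ("assign", "char", "char")))
bad = ("seq", ("assign", "integer", "integer"),
       ("if", "integer", ("assign", "char", "char")))
print(stmt_type(good), stmt_type(bad))
```

A single ill-typed statement makes every enclosing statement type_error, so the error is reported for the construct as a whole.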