CH-6 Intermediate Code Generator

Uploaded by

vakame5133
Injibara University

Department of Computer Science


Compiler Design (CoSc3112)

Chapter 6: Intermediate Languages


Minychil F.
Contents
Chapter 6: Intermediate Languages (4hr)
6.1. Three Address Code Rules

6.2. Quadruples

6.3. Declarations

6.4. Declarations in Procedures

6.5. Flow Control Statements

6.6. Back Patching

6.7. Procedure Calls


Intermediate Language
▪ Intermediate languages (IL) are abstract programming languages used by compilers
during the translation process of a computer program.

▪ Intermediate languages, also known as intermediate representations (IR), are
programming languages or representations that are used as a bridge between the source
code of a program and its final executable form.

▪ They serve as an intermediary step between the high-level source code and the
machine code produced by the compiler
Cont…
▪ Examples of intermediate languages include:
▪ Java bytecode: Generated from Java source code and executed by the Java Virtual
Machine (JVM).
▪ Intermediate Representation (IR) in compilers: LLVM IR and GCC's GIMPLE and RTL are
examples of intermediate representations used in compiler toolchains.
▪ Common Intermediate Language (CIL): Used in the .NET framework for
languages like C# and VB.NET.
▪ Python bytecode: Generated from Python source code and executed by the Python
interpreter.
Intermediate Code Generator
▪ Intermediate Code Generator is a phase in the compilation process where the source
code of a program is translated into an intermediate representation (IR) or code. This
intermediate code serves as a bridge between the high-level source code and the target
machine code or lower-level representation.

▪ The intermediate code generator receives input from its predecessor phase, semantic
analyzer, in the form of an annotated syntax tree.

▪ The syntax tree can then be converted into a linear representation, such as postfix
notation.
Cont…
▪ The benefits of using a machine-independent intermediate form in the context of
compiler design include:
1. Retargeting Facilitation: Machine-independent intermediate code allows for easier
retargeting, enabling the creation of compilers for different machines by attaching a back
end for the new machine to an existing front end
2. Code Optimization: A machine-independent code optimizer can be applied to the
intermediate representation, enhancing the efficiency and performance of the generated
target code
3. Portability Enhancement: Because the intermediate code does not depend on any one
machine, most of the compiler can be moved to a new platform unchanged.
Cont…
• In compiler design, intermediate code serves as a bridge between the
source code and the target machine code, enabling easier optimization and
translation.

• Three commonly used intermediate code representations are


1. Postfix Notation,

2. Three Address Code, and

3. Syntax Tree.
Cont…
❖ Postfix Notation:

➢It is also known as reverse Polish notation

➢Postfix notation is a linear representation of a syntax tree where the operator follows
the operands.

➢The ordinary (infix) way of writing the sum of a and b is with operator in the middle:
a+b

➢The postfix (or postfix Polish) notation for the same expression places the operator at
the right end, as ab+.
Cont…
❖ Postfix Notation:

➢In postfix notation, the operator appears after the operands, simplifying the
expression evaluation process.

➢For example, the expression "a * d - (b + c)" can be translated into postfix form as
"ad * bc + -".

➢Postfix notation is beneficial for evaluating expressions without the need for
parentheses

➢It is commonly used in compiler design for its simplicity and efficiency
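
A minimal sketch of why postfix needs no parentheses: the expression can be evaluated left to right with a single stack. The function name eval_postfix, the space-separated token format, and the sample values are assumptions made for this illustration, not part of the slides.

```python
def eval_postfix(tokens, env):
    """Evaluate a postfix expression given as a list of tokens.

    env maps operand names to numeric values (sample data for the demo).
    """
    ops = {
        "+": lambda x, y: x + y,
        "-": lambda x, y: x - y,
        "*": lambda x, y: x * y,
        "/": lambda x, y: x / y,
    }
    stack = []
    for tok in tokens:
        if tok in ops:
            right = stack.pop()   # the second operand was pushed last
            left = stack.pop()
            stack.append(ops[tok](left, right))
        else:
            stack.append(env[tok])
    return stack.pop()


# a * d - (b + c) written in postfix is: a d * b c + -
values = {"a": 2, "b": 3, "c": 4, "d": 5}
result = eval_postfix("a d * b c + -".split(), values)
print(result)  # 2*5 - (3+4) = 3
```

Each operator simply consumes the two most recent values on the stack, so operator precedence and grouping are already encoded in the token order.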
Cont…
• Syntax tree:
➢The syntax tree represents the hierarchical structure of a source program, with
each node corresponding to an operator or operand.

➢The parse tree itself is a useful intermediate-language representation for a source
program, especially in optimizing compilers where the intermediate code needs to be
extensively restructured.

➢A dag (Directed Acyclic Graph) gives the same information but in a more compact
way because common subexpressions are identified.
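
The difference between a syntax tree and a dag can be sketched by building the same expression twice, once creating a fresh node every time and once reusing any node that already has the same operator and children. The helper names make_builder and build and the tuple encoding of nodes are assumptions for this illustration.

```python
def make_builder(share):
    """Return (node, nodes): node() creates graph nodes, nodes lists them.

    With share=True an existing node with the same operator and children
    is reused, which is what turns the syntax tree into a dag.
    """
    nodes = []   # each entry is a tuple (op, child ids...)
    index = {}   # (op, child ids...) -> node id, consulted only when sharing

    def node(op, *children):
        key = (op, *children)
        if share and key in index:
            return index[key]
        nodes.append(key)
        index[key] = len(nodes) - 1
        return index[key]

    return node, nodes


def build(node):
    """Build the graph for a := b * -c + b * -c and return the root id."""
    left = node("*", node("id b"), node("uminus", node("id c")))
    right = node("*", node("id b"), node("uminus", node("id c")))
    return node("assign", node("id a"), node("+", left, right))


tree_node, tree_nodes = make_builder(share=False)
build(tree_node)
dag_node, dag_nodes = make_builder(share=True)
build(dag_node)
print(len(tree_nodes), len(dag_nodes))  # 11 7
```

The dag needs fewer nodes because the common subexpression b * -c, and the leaves b and c, are each represented once.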
Cont…
• Figure: syntax tree and dag for the assignment statement a := b * -c + b * -c.
6.1. Three Address Code Rules
• Three address code:
➢Three-address code is a type of intermediate code used by optimizing
compilers, where a given expression is broken down into several separate
instructions that can be easily translated into assembly language

➢It is designed to represent operations and expressions in a program using a
minimal set of instructions with at most three addresses or operands.

➢A statement involving no more than three references (two for operands and
one for result) is known as three address statement
Cont…
• Three address code:
➢Three Address Code is a common representation of intermediate code where each
instruction contains at most three addresses.

➢The typical form of a three address statement is expressed as "x = y op z," where
x, y, and z are names, constants, or compiler-generated temporaries, and op is an operator.

➢Temporary variables are often used to store intermediate results.


Cont…
▪ For instance, the expression "a + b * c + d" can be represented in three
address code as:
T1 = b * c
T2 = a + T1
T3 = T2 + d

▪ Here, T1, T2, and T3 are temporary variables used to store intermediate
results
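
The way temporaries arise can be sketched with a small recursive generator that walks an expression tree and emits one instruction per operator. The function name gen_tac and the tuple encoding of the tree are assumptions for this illustration.

```python
import itertools


def gen_tac(expr, code, counter):
    """Return the name holding expr's value, appending instructions to code.

    expr is either a variable name (str) or a tuple (op, left, right).
    """
    if isinstance(expr, str):
        return expr
    op, left, right = expr
    left_name = gen_tac(left, code, counter)
    right_name = gen_tac(right, code, counter)
    temp = f"T{next(counter)}"          # fresh temporary for this operator
    code.append(f"{temp} = {left_name} {op} {right_name}")
    return temp


# a + b * c + d, parsed as (a + (b * c)) + d
tree = ("+", ("+", "a", ("*", "b", "c")), "d")
code = []
gen_tac(tree, code, itertools.count(1))
print("\n".join(code))
```

Running the sketch reproduces the three instructions listed above, one temporary per interior node of the tree.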
Cont…
▪ Advantages of Three Address Code:
➢Simplicity: Three Address Code is a simple and easy-to-understand representation
of code that can be easily parsed and manipulated.

➢Code Optimization: TAC enables the compiler to perform optimization techniques
such as constant folding, dead code elimination, and common subexpression
elimination, which help in reducing the size of the generated code and improving its
efficiency.

➢Portability: TAC is independent of any particular machine architecture, making it
portable across different platforms
Cont…
▪ Disadvantages of Three Address Code:
➢Many Temporaries: Breaking every expression into single-operator instructions
introduces a large number of compiler-generated temporary names that the compiler
must create and manage.

➢Longer Code: One source statement often expands into several three-address
instructions, so the intermediate program is longer than the source it was generated from.

➢Extra Translation Step: Three-address code is not executable as it stands; it must
still be translated into target machine code by a later phase
Cont…
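▪ Three-address code is a linearized representation of a syntax tree or a
dag in which explicit names correspond to the interior nodes of the graph.

▪ Both the syntax tree and the dag can therefore be written out as
three-address code sequences.

▪ For the assignment statement a := b * -c + b * -c:

Code for the syntax tree:
t1 := -c
t2 := b * t1
t3 := -c
t4 := b * t3
t5 := t2 + t4
a := t5

Code for the dag:
t1 := -c
t2 := b * t1
t5 := t2 + t2
a := t5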
Types of Three-Address Statements
▪ Three-address statements are a form of intermediate representation (IR) used in
compilers to aid in the implementation of code-improving transformations
▪ There are several types of three-address statements, including:
1. Assignment statements
➢ x := y op z, where op is a binary arithmetic or logical operation.

2. Assignment instructions
➢ x := op y, where op is a unary operation. Essential unary operations include unary minus, logical
negation, shift operators, and conversion operators that, for example, convert a fixed-point number to a
floating-point number.

3. Copy statements
➢ x := y where the value of y is assigned to x.
Types of Three-Address Statements
4. Unconditional jump
goto L
The three-address statement with label L is the next to be executed.

5. Indexed assignments
x : = y[i] and x[i] : = y.

7. Address and pointer assignments


x : = &y , x : = *y, and *x : = y
Implementation of Three-Address Statements
▪ There are three commonly used representations for Three Address Code:
1. Quadruples
2. Triples

3. Indirect Triples
6.2. Quadruples
▪ Quadruples
➢Quadruples are a form of 3-address code representation that consists of four fields
namely: operator, argument 1, argument 2, and result.
➢A quadruple is a record structure with four fields, which are, op, arg1, arg2 and result.
➢The op field contains an internal code for the operator.
➢The three-address statement x := y op z is represented by placing y in arg1, z in arg2 and
x in result.
➢The contents of fields arg1, arg2 and result are normally pointers to the symbol-table entries
for the names represented by these fields. If so, temporary names must be entered into the
symbol table as they are created.
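
The record structure described above can be sketched as a named tuple with the four fields op, arg1, arg2 and result. The generator gen_quads, the operator name uminus, and the use of plain strings instead of symbol-table pointers are assumptions made to keep the example short.

```python
from collections import namedtuple

Quad = namedtuple("Quad", "op arg1 arg2 result")


def gen_quads(expr, quads):
    """Append quadruples for expr and return the name holding its value."""
    if isinstance(expr, str):
        return expr
    op, *operands = expr
    args = [gen_quads(e, quads) for e in operands]
    arg2 = args[1] if len(args) == 2 else None   # unary operators leave arg2 empty
    temp = f"T{len(quads) + 1}"                  # one new temporary per quadruple
    quads.append(Quad(op, args[0], arg2, temp))
    return temp


# A := -B * (C + D)
quads = []
value = gen_quads(("*", ("uminus", "B"), ("+", "C", "D")), quads)
quads.append(Quad(":=", value, None, "A"))
for i, q in enumerate(quads):
    print(i, q.op, q.arg1, q.arg2 or "", q.result)
```

Because every result is named explicitly, a quadruple can be moved without changing the instructions that use its result.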
Cont…
▪ Example: A := -B * (C + D)

➢The 3-address code will be

T1 := -B
T2 := C + D
T3 := T1 * T2
A := T3

➢The quadruple representation will be

     op      arg1  arg2  result
(0)  uminus  B           T1
(1)  +       C     D     T2
(2)  *       T1    T2    T3
(3)  :=      T3          A
Class work
1. Give the quadruple representation for the statement a := b * -c + b * -c
2. Give the quadruple representation for the statement a := b + c * d
3. What will be the three-address code for a := b * -c + b * -c?
Cont…
• Triples
➢Triples are another form of 3-address code representation that does not make use
of extra temporary variables to represent a single operation.

➢Instead, when the result of an operation is needed as an operand, a pointer to the
triple that computes it is used.

➢This representation contains three fields, namely operator, argument-1, and
argument-2.

➢In this representation, temporary variables are not used, and instead, a number in
parentheses is used to represent a pointer to a particular triple in the triple structure
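
A minimal sketch of triple generation: each operation is stored as (operator, argument-1, argument-2), and an operand that is the result of an earlier operation is stored as the integer position of that triple. The function name gen_triples and the tuple encoding are assumptions for this illustration.

```python
def gen_triples(expr, triples):
    """Append triples for expr; return a name or the index of its triple."""
    if isinstance(expr, str):
        return expr
    op, *operands = expr
    args = [gen_triples(e, triples) for e in operands]
    arg2 = args[1] if len(args) == 2 else None
    triples.append((op, args[0], arg2))
    return len(triples) - 1      # the position in the list acts as the pointer


def show(arg):
    """Print integer references in parentheses, names as they are."""
    if arg is None:
        return ""
    return f"({arg})" if isinstance(arg, int) else arg


# a = b + c * d
triples = []
value = gen_triples(("+", "b", ("*", "c", "d")), triples)
triples.append(("=", "a", value))
for i, (op, arg1, arg2) in enumerate(triples):
    print(f"({i})", op, show(arg1), show(arg2))
```

No temporary names appear anywhere; the cost is that moving a triple changes its position and therefore every reference to it.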
Cont…
• Example of Triple:
A := -B * (C + D)

✓ Temporary variables are not used
✓ Instead of a temporary variable, a number in parentheses is used

➢The triple representation will be

     op      arg1  arg2
(0)  uminus  B
(1)  +       C     D
(2)  *       (0)   (1)
(3)  :=      A     (2)
Cont…
• For example,
1. Triple representation for the statement a = b + c * d

Triple Location  Operator  arg1  arg2
(0)              *         c     d
(1)              +         b     (0)
(2)              =         a     (1)

2. Triple representation for the statement a := b * -c + b * -c

Triple Location  Operator  arg1  arg2
(0)              uminus    c
(1)              *         b     (0)
(2)              uminus    c
(3)              *         b     (2)
(4)              +         (1)   (3)
(5)              :=        a     (4)
Cont…
▪ Indirect Triples
➢Indirect Triples are a variation of triples that make use of a separate listing of
pointers to the triples.

➢This representation uses an extra array to list the pointers to the triples in the
desired order, rather than listing the triples themselves. This implementation is known as
indirect triple representation.

➢Statements can be reordered by changing only the pointer list, and this representation
can save some space compared with quadruples when the same temporary value is used
more than once
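
The extra array can be sketched as a plain list of indices into the triple array. The names triples, statements and listing, and the particular reordering shown, are assumptions for this illustration.

```python
# Triples for A := -B * (C + D); integer arguments refer to other triples.
triples = [
    ("uminus", "B", None),   # (0)
    ("+", "C", "D"),         # (1)
    ("*", 0, 1),             # (2)
    (":=", "A", 2),          # (3)
]

# The statement list holds pointers to triples in execution order.
statements = [0, 1, 2, 3]


def listing(order, table):
    """Return the triples in the order given by the pointer list."""
    return [table[i] for i in order]


# Triples (0) and (1) are independent, so they can be swapped by editing
# only the pointer list; the triples and their references stay untouched.
reordered = [1, 0, 2, 3]
for op, arg1, arg2 in listing(reordered, triples):
    print(op, arg1, arg2)
```

With plain triples the same swap would renumber the entries and force the references in triple (2) to be rewritten.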


Cont…
• Figure: indirect triple representation of A := -B * (C + D)
Cont…
• Figure: indirect triple representation for the statement a := b * -c + b * -c
Exercise
• Give the indirect triple representation for the statement a = b + c * d.
6.3. Declarations
▪ A variable or procedure has to be declared before it can be used.

▪ Declaration involves allocation of space in memory and entry of type and name in the
symbol table.

▪ A program may be coded and designed keeping the target machine structure in mind, but it
may not always be possible to accurately convert a source code to its target language.

▪ Taking the whole program as a collection of procedures and sub-procedures, it becomes
possible to declare all the names local to the procedure.

▪ Memory allocation is done in a consecutive manner and names are allocated to memory in
the sequence they are declared in the program.
Cont…
▪ We use a variable offset, initialized to zero {offset = 0}, to denote the base
address.

▪ The source programming language and the target machine architecture
may vary in the way names are stored, so relative addressing is used.

▪ The first name is allocated memory starting from memory
location 0 {offset = 0}; each name declared later is allocated
memory immediately after the previous one.
Cont…
▪ Example: We take a C implementation in which an integer variable is
assigned 2 bytes of memory and a float variable is assigned 4 bytes of memory.
int a;
float b;

Allocation process:
{offset = 0}
int a;
id.type = int
id.width = 2
offset = offset + id.width
{offset = 2}
float b;
id.type = float
id.width = 4
offset = offset + id.width
{offset = 6}
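
The allocation process above can be sketched as a loop over the declarations. The WIDTHS table uses the 2-byte int and 4-byte float from this example; the function name layout and the tuple format are assumptions.

```python
# Widths in bytes, taken from the example above (they vary between machines).
WIDTHS = {"int": 2, "float": 4}


def layout(declarations):
    """Assign consecutive relative addresses to the declared names.

    declarations is a list of (type, name) pairs in source order.
    Returns the list of (name, type, offset) entries and the final offset.
    """
    offset = 0
    entries = []
    for type_name, name in declarations:
        entries.append((name, type_name, offset))
        offset = offset + WIDTHS[type_name]   # offset = offset + id.width
    return entries, offset


entries, total = layout([("int", "a"), ("float", "b")])
print(entries)  # [('a', 'int', 0), ('b', 'float', 2)]
print(total)    # 6
```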
Cont…
▪ To enter the details in a symbol table, a procedure enter can be used with the following
structure: enter(name, type, offset)

▪ This procedure should perform the following tasks:


1. Create an entry in the symbol table for the given variable name.
2. Set the type of the variable to the provided type.
3. Set the relative address of the variable to the given offset in the data area.

▪ By using this enter procedure, the symbol table will be populated with the necessary
information about each variable, including its name, data type, and relative address in the
data area. This information is crucial for the compiler to generate correct code and perform
various optimizations during the compilation process.
Cont…
▪ Here's an example of how the enter procedure can be implemented:

▪ In this example, symbol_table is a list of dictionaries representing the symbol table.

▪ Each dictionary contains the name, type, and offset of a variable.

▪ The enter procedure creates a new dictionary with the given name, type, and offset and
appends it to the symbol_table list.
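
A minimal Python sketch consistent with this description. The parameter name type_ (chosen to avoid shadowing the built-in type) and the two sample calls are assumptions.

```python
symbol_table = []   # list of dictionaries, one per declared name


def enter(name, type_, offset):
    """Create a symbol-table entry recording the name, type and relative address."""
    symbol_table.append({"name": name, "type": type_, "offset": offset})


# Entries for the declarations "int a; float b;" with 2-byte int, 4-byte float.
enter("a", "int", 0)
enter("b", "float", 2)
print(symbol_table)
```

A production compiler would more likely key the table by name, for example with a dictionary per scope, so that lookups do not scan the whole list.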
6.4. Declarations in Procedures
▪ The syntax of languages such as C, Pascal, and Fortran allows all the declarations in a
single procedure to be processed as a group. This means that declarations for variables
can be grouped together and processed at once.

▪ During this process, a global variable, say offset, can keep track of the next available
relative address.

▪ In the case of declarations in a procedure, as the sequence of declarations is examined,
storage for names local to the procedure can be laid out. For each local name, a
symbol-table entry is created with information like the type and the relative address of
the storage for the name. The relative address consists of an offset from the base of the
static data area or the field for local data in an activation record.
Cont…
• The procedure enter(name, type, offset) can be used to create a symbol-table entry for a name,
give its type and relative address offset in its data area. The attribute type represents a type
expression constructed from the basic types integer and real by applying the type constructors
pointer and array. If type expressions are represented by graphs, then attribute type might be a
pointer to the node representing a type expression.

• By processing all the declarations in a single procedure as a group, the compiler can efficiently
allocate memory and keep track of the relative addresses of variables in the procedure. This
can help to avoid errors and improve the performance of the resulting code.
6.5. Flow Control Statements
• We now consider the translation of boolean expressions into three-address code in
the context of if-then, if-then-else, and while-do statements such as those generated
by the following grammar:

S → if E then S1

| if E then S1 else S2

| while E do S1
Cont…
• In each of these productions, E is the Boolean expression to be translated. In the translation,
we assume that a three-address statement can be symbolically labeled, and that the function
newlabel returns a new symbolic label each time it is called.

• E.true is the label to which control flows if E is true, and E.false is the label to which
control flows if E is false.

• The semantic rules for translating a flow-of-control statement S allow control to flow from
the translation S.code to the three-address instruction immediately following S.code.

• S.next is a label that is attached to the first three-address instruction to be executed after the
code for S.
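
The roles of newlabel, E.true, E.false and S.next can be sketched with a small translator for the three productions. The tuple encoding of statements, the function name translate, and the restriction of E to a single relational test are assumptions made to keep the example short.

```python
import itertools


def translate(stmt):
    """Return three-address code (a list of strings) for one statement."""
    counter = itertools.count(1)

    def newlabel():
        return f"L{next(counter)}"

    def gen_cond(cond, true_label, false_label):
        # E.code for a simple relational test such as "a < b"
        return [f"if {cond} goto {true_label}", f"goto {false_label}"]

    def gen(s, next_label):
        kind = s[0]
        if kind == "assign":
            return [s[1]]
        if kind == "if":
            _, cond, then_part = s
            e_true = newlabel()          # E.false is S.next
            return (gen_cond(cond, e_true, next_label) + [f"{e_true}:"]
                    + gen(then_part, next_label))
        if kind == "ifelse":
            _, cond, then_part, else_part = s
            e_true, e_false = newlabel(), newlabel()
            return (gen_cond(cond, e_true, e_false) + [f"{e_true}:"]
                    + gen(then_part, next_label) + [f"goto {next_label}"]
                    + [f"{e_false}:"] + gen(else_part, next_label))
        if kind == "while":
            _, cond, body = s
            begin, e_true = newlabel(), newlabel()
            return ([f"{begin}:"] + gen_cond(cond, e_true, next_label)
                    + [f"{e_true}:"] + gen(body, begin) + [f"goto {begin}"])
        raise ValueError(f"unknown statement kind: {kind}")

    s_next = newlabel()
    return gen(stmt, s_next) + [f"{s_next}:"]


program = ("while", "a < b", ("if", "c < d", ("assign", "x := y + z")))
code = translate(program)
print("\n".join(code))
```

Note how the inner if statement receives the loop's begin label as its S.next, so a false test jumps straight back to the loop condition.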
6.6. Back Patching
• The easiest way to implement the syntax-directed definitions for boolean expressions is to
use two passes.

• First, construct a syntax tree for the input, and then walk the tree in depth-first order,
computing the translations. The main problem with generating code for boolean expressions
and flow-of-control statements in a single pass is that during one single pass we may not
know the labels that control must go to at the time the jump statements are generated. Hence,
a series of branching statements with the targets of the jumps left unspecified is generated.
Each statement will be put on a list of goto statements whose labels will be filled in when the
proper label can be determined. We call this subsequent filling in of labels backpatching.
Cont…
• The syntax-directed definitions for boolean expressions can be implemented using two passes
to ensure proper labeling and generation of code for flow-of-control statements.

• During the first pass, a syntax tree is constructed for the input. This tree represents the
structure of the boolean expression and allows for easy traversal and manipulation during the
second pass.

• In the second pass, the tree is traversed in depth-first order, and the translations for the
boolean expressions are computed. Because the whole tree is available, every jump target is
known when the jump is generated. In a single pass, by contrast, the targets of the jumps
may not be known at the time the jump statements are generated.
Cont…
• With two passes, or with backpatching in a single pass, the syntax-directed definitions
for boolean expressions can be implemented efficiently and accurately, ensuring that the
proper labels are assigned to each jump statement and that the flow-of-control is correct.

• To manipulate lists of labels, we use three functions :


1. makelist(i) creates a new list containing only i, an index into the array of quadruples; makelist
returns a pointer to the list it has made.

2. merge(p1,p2) concatenates the lists pointed to by p1 and p2, and returns a pointer to the
concatenated list.

3. backpatch(p,i) inserts i as the target label for each of the statements on the list pointed to by p.
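
The three functions can be sketched over an array of quadruples whose target field is left empty until it is known. The helper emit, the list-based quadruple layout, the operator name "if <", and the final target indices 100 and 200 are assumptions for this illustration.

```python
quads = []  # each quadruple is a list so its target field can be filled in later


def emit(op, arg1=None, arg2=None, target=None):
    """Append a quadruple and return its index in the array."""
    quads.append([op, arg1, arg2, target])
    return len(quads) - 1


def makelist(i):
    """Create a new list containing only i, an index into quads."""
    return [i]


def merge(p1, p2):
    """Concatenate two lists of quadruple indices."""
    return p1 + p2


def backpatch(p, i):
    """Insert i as the target of every quadruple whose index is on list p."""
    for index in p:
        quads[index][3] = i


# Translate "a < b or c < d" with all jump targets left unspecified.
e1_true = makelist(emit("if <", "a", "b"))
e1_false = makelist(emit("goto"))
m_quad = len(quads)                      # index of the first quadruple of E2
e2_true = makelist(emit("if <", "c", "d"))
e2_false = makelist(emit("goto"))

backpatch(e1_false, m_quad)              # if E1 is false, go on to test E2
e_true = merge(e1_true, e2_true)         # either test being true makes E true
e_false = e2_false

backpatch(e_true, 100)                   # 100 and 200 are made-up target indices
backpatch(e_false, 200)
for i, q in enumerate(quads):
    print(i, q)
```

In Python a list is already handled by reference, which plays the role of the "pointer to the list" in the description above.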
6.7. Procedure Calls
• The procedure is such an important and frequently used programming construct that it is
imperative for a compiler to generate good code for procedure calls and returns.

• The run-time routines that handle procedure argument passing, calls and returns are part
of the run-time support package.

• Let us consider a grammar for a simple procedure call statement

1. S →call id ( Elist )

2. Elist → Elist , E

3. Elist → E
Cont…
Calling Sequences:
➢The translation for a call includes a calling sequence, a sequence of actions taken on entry to and
exit from each procedure. The following are the actions that take place in a calling sequence:
➢ When a procedure call occurs, space must be allocated for the activation record of the called procedure.
➢ The arguments of the called procedure must be evaluated and made available to the called procedure in a
known place.
➢ Environment pointers must be established to enable the called procedure to access data in enclosing blocks.
➢ The state of the calling procedure must be saved so it can resume execution after the call.
➢ Also saved in a known place is the return address, the location to which the called routine must transfer after it
is finished.
➢ Finally a jump to the beginning of the code for the called procedure must be generated.
Cont…
• For example, consider the following syntax-directed translation
1. S → call id ( Elist )
{ for each item p on queue do emit (‘ param’ p );
emit (‘call’ id.place) }
2. Elist→Elist , E
{ append E.place to the end of queue }
3. Elist →E
{ initialize queue to contain only E.place }
• Here, the code for S is the code for Elist, which evaluates the arguments, followed by a
param p statement for each argument, followed by a call statement.
• In rule 3, queue is emptied and then gets a single pointer to the symbol table location for the
name that denotes the value of E.
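
The three semantic actions can be sketched as follows, assuming the arguments have already been evaluated and each E.place is available as a string. The function name translate_call and the sample names t1, b and t2 are assumptions.

```python
def translate_call(proc, arg_places):
    """Emit 'param' statements for each argument, followed by the call.

    arg_places lists E.place for each argument, left to right.
    """
    queue = []
    for position, place in enumerate(arg_places):
        if position == 0:
            queue = [place]          # Elist -> E: queue holds only E.place
        else:
            queue.append(place)      # Elist -> Elist , E: append E.place
    # S -> call id ( Elist ): one param per queued item, then the call
    code = [f"param {p}" for p in queue]
    code.append(f"call {proc}")
    return code


code = translate_call("f", ["t1", "b", "t2"])
print("\n".join(code))
```

Queuing the places first keeps all param statements together immediately before the call, after the code that evaluates every argument.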
