MODULE-5
(Figure: two basic mechanisms for interprocess communication: shared variables in a common memory accessed by several processes, and message passing between Process D and Process E over a communication channel.)
Protected Access The main problem associated with the use of a CS is avoiding race conditions, where concurrent processes executing in different orders produce different results. The granularity of a CS affects performance. If the boundary of a CS is too large, it may limit parallelism due to excessive waiting by competing processes.
When the CS is too small, it may add unnecessary code complexity or software overhead. The trick is to shorten a heavy-duty CS or to use conditional CSs to maintain balanced performance.
In Chapter 11, we will study shared variables in the form of locks for implementing mutual exclusion in CSs. Binary and counting semaphores are used to implement CSs and to avoid system deadlocks. Monitors are suitable for structured programming.
Shared-variable programming requires special atomic operations for IPC, new language constructs for expressing parallelism, compilation support for exploiting parallelism, and OS support for scheduling parallel events and avoiding resource conflicts. Of course, all of these depend on the memory consistency model used.
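As a concrete illustration of the shared-variable model, the sketch below uses Python threads as a stand-in for processors sharing a common memory; the variable and function names are illustrative, not taken from the text. A lock protects the critical section so that only one thread updates the shared variable at a time.

import threading

counter = 0                    # shared writable variable in the common memory
lock = threading.Lock()        # lock enforcing mutual exclusion on the CS
NWORKERS = 4

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:             # enter the critical section
            counter += 1       # only one thread touches the shared data at a time

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                 # 40000; without the lock the total may come out short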
r....».rr......i.r................... . — 4,.
Shared-memory multiprocessors use shared variables for interprocessor communication. Multiprocessing takes various forms, depending on the number of users and the granularity of divided computations. Four operational modes used in programming multiprocessor systems are specified below:
Multiprogramming Traditionally, multiprogramming is defined as multiple independent programs running on a single processor or on a multiprocessor by time-sharing use of the system resources. A multiprocessor can be used in solving a single large problem or in running multiple programs across the processors.
A multiprogrammed multiprocessor allows multiple programs to run concurrently through time-sharing of all the processors in the system. Multiple programs are interleaved in their CPU and I/O activities. When a program enters I/O mode, the processor switches to another program. Therefore, multiprogramming is not restricted to a multiprocessor. Even on a single processor, multiprogramming is usually implemented.
Multiprocessing When multiprogramming is implemented at the process level on a multiprocessor, it is called multiprocessing. Two types of multiprocessing are specified below. If interprocessor communications are handled at the instruction level, the multiprocessor operates in MIMD mode. If interprocessor communications are handled at the program, subroutine, or procedural level, the machine operates in MPMD (multiple programs over multiple data streams) mode.
In other words, MIMD multiprocessing exploits fine-grain instruction-level parallelism, while MPMD multiprocessing exploits coarse-grain procedure-level parallelism. In both multiprocessing modes, shared variables are used to achieve interprocessor communication. This is quite different from the operations implemented on a message-passing system.
Multitasking A single program can be partitioned into multiple interrelated tasks concurrently executed on a multiprocessor. This has been implemented as multitasking on Cray multiprocessors. Thus multitasking provides the parallel execution of two or more parts of a single program. A job efficiently multitasked requires less execution time. Multitasking is achieved with added code in the original program in order to provide proper linkage and synchronization of divided tasks.
Trade-offs do exist between multitasking and not multitasking. Only when the overhead is short should multitasking be practiced. Sometimes not all parts of a program can be divided into parallel tasks. Therefore, multitasking trade-offs must be analyzed before implementation. Section 11.2 will treat this issue.
Multithreading The traditional UNIX/OS has a single-threaded kernel in which only one process can receive OS kernel service at a time. In a multiprocessor as studied in Chapter 9, we want to extend the single kernel to be multithreaded. The purpose is to allow multiple threads of lightweight processes to share the same address space and to be executed by the same or different processors simultaneously.
The concept of multithreading is an extension of the concepts of multitasking and multiprocessing. The purpose is to exploit fine-grain parallelism in modern multiprocessors built with multiple-context processors or superscalar processors with multiple-instruction issues. Each thread will use a separate program counter. Resource conflicts are the major problem to be resolved in a multithreaded architecture.
The levels of sophistication in securing data coherence and in preserving event order increase from monoprogramming to multitasking, to multiprogramming, to multiprocessing, and to multithreading, in that order. Memory management and special protection mechanisms must be developed to ensure correctness and data integrity in parallel thread operations.
Partitioning and Replication The goal of parallel processing is to exploit parallelism as much as possible with the lowest overhead. Program partitioning is a technique for decomposing a large program and data set into many small pieces for parallel execution by multiple processors.
Program partitioning involves both programmers and the compiler. Parallelism detection by users is often explicitly expressed with parallel language constructs. Program restructuring techniques can be used to transform sequential programs into a parallel form more suitable for multiprocessors. Ideally, this transformation should be carried out automatically by a compiler.
Program replication refers to duplication of the same program code for parallel execution on multiple processors over different data sets. Partitioning is often practiced on a shared-memory multiprocessor system, while replication is more suitable for distributed-memory message-passing multicomputers.
So far, only special program constructs, such as independent loops and independent scalar operations, have been successfully parallelized. Clustering of independent scalar operations into vector or VLIW instructions is another approach toward this end.
Scheduling and Synchronization Scheduling of divided program modules on parallel processors is much more complicated than scheduling of sequential programs on a uniprocessor. Static scheduling is conducted at post-compile time. Its advantage is low overhead, but the shortcoming is a possible mismatch with the run-time profile of each task and therefore potentially poor resource utilization.
Dynamic scheduling catches the run-time conditions. However, dynamic scheduling requires fast context switching, preemption, and much more OS support. The advantages of dynamic scheduling include better resource utilization at the expense of higher scheduling overhead. Static and dynamic methods can be jointly used in a sophisticated multiprocessor system demanding higher efficiency.
In a conventional UNIX system, interprocessor communication (IPC) is conducted at the process level. Processes can be created by any processor. All processes asynchronously accessing the shared data must be protected so that only one is allowed to access the shared writable data at a time. This mutual exclusion property is enforced with the use of locks, semaphores, and monitors to be described in Chapter 11.
At the control level, virtual program counters can be assigned to different processes or threads. Counting semaphores or barrier counters can be used to indicate the completion of parallel branch activities. One can also use atomic memory operations such as Test&Set and Fetch&Add to achieve synchronization. Software-implemented synchronization may require longer overhead. Hardware barriers or combining networks can be used to reduce the synchronization time.
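A Fetch&Add-style barrier counter can be sketched in the same Python-threads setting used earlier; this is an illustrative, single-use emulation in which the atomic read-modify-write is imitated with a lock, not a hardware barrier or combining network, and all names are illustrative.

import threading

class FetchAddBarrier:
    """Single-use barrier built on an emulated atomic Fetch&Add counter."""
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.count = 0
        self.lock = threading.Lock()
        self.all_arrived = threading.Event()

    def fetch_and_add(self):
        with self.lock:              # emulates one atomic Fetch&Add on the counter
            old = self.count
            self.count += 1
        return old

    def wait(self):
        if self.fetch_and_add() == self.nprocs - 1:
            self.all_arrived.set()   # the last arriving branch releases the others
        self.all_arrived.wait()

barrier = FetchAddBarrier(4)

def branch(i):
    # ... work of one parallel branch ...
    barrier.wait()                   # completion of the branch is signalled here
    print("past barrier", i)

threads = [threading.Thread(target=branch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()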
Cache Coherence and Protection Besides maintaining data coherence in a memory hierarchy, multiprocessors must assure data consistency between private caches and the shared memory. The multicache coherence problem demands an invalidation or update after each write operation. These coherence control operations require special bus or network protocols for implementation, as noted in previous chapters. A memory system is said to be coherent if the value returned on a read instruction is always the value written by the latest write instruction on the same memory location. The access order to the caches and to the main memory makes a big difference in computational results.
The shared memory of a multiprocessor can be used under various consistency models as discussed in Chapters 4 and 9. Sequential consistency demands that all memory accesses be strongly ordered on a global basis. A processor cannot issue an access until the most recently shared writable memory access has been
globally performed. A weak consistency model enforces ordering and coherence at explicit synchronization points only. Programming with processor consistency or release consistency may be more restricted, but memory performance is expected to improve.
As explained in Chapter 9, fine-grain concurrent programming with global naming was aimed at merging the shared-variable and message-passing mechanisms for heterogeneous processing.
Distributing the Computation Program replication and data distribution are used in multicomputers. The processors in a multicomputer (or a NORMA machine) are loosely coupled in the sense that they do not share memory. Message passing in a multicomputer is handled at the subprogram level rather than at the instructional or fine-grain process level as in a tightly coupled multiprocessor. That is why explicit parallelism is more attractive for multicomputers.
Example 10.1 A concurrent program for distributed computing on a multicomputer (Justin Rattner, Intel Scientific Computers, 1990)
The computation involved is the evaluation of π as the area under the curve f(x) = 4/(1 + x²) between 0 and 1, as shown in Fig. 10.2. Using a rectangle rule, we write the integral in discrete form:

π = ∫₀¹ 4/(1 + x²) dx ≈ h [f(x₁) + f(x₂) + ... + f(xₙ)]
Fig. 10.2 Domain decomposition for concurrent programming on a multicomputer with four processors: the area under the curve f(x) = 4/(1 + x²) over [0, 1] is divided into 20 panels, assigned to processors 0, 1, 2, and 3 in round-robin fashion
where h = 1/n is the panel width, xᵢ = h(i − 0.5) are the midpoints, and n is the number of panels (rectangles) to be computed.
Assume a four-node multicomputer with four processors labeled 0, 1, 2, and 3. The rectangle rule decomposition is shown with n = 20 and h = 1/20 = 0.05. Each processor node is assigned to compute the areas of five rectangular panels. Therefore, the computational load of all four nodes is balanced.
Host program                Node program
input(n)                    p = numnodes()
send(n, allnodes)           me = mynode()
recv(pi)                    recv(n)
output(pi)                  h = 1.0/n
                            sum = 0
                            Do i = me + 1, n, p
                              x = h * (i - 0.5)
                              sum = sum + f(x)
                            End Do
                            pi = h * sum
                            gop('+', pi, host)
Each node executes a separate copy of the node program. Several system calls are used to achieve message passing between the host and the nodes. The host program sends the number of panels n as a message to all the nodes, which receive it accordingly in the node program. The commands numnodes and mynode specify how big the system is and which node it is, respectively.
The software for the iPSC system offers a global summing operation gop('+', pi, host) which iteratively pairs nodes that exchange their current partial sums. Each partial sum received from another node is added to the sum at the receiving node, and the new sum is sent out in the next round of message exchange.
Eventually, all the nodes accumulate the global sum scaled by the panel width (pi = h × sum), which is returned to the host for printout. Not all pairs of node communications need to be carried out. Only log₂ N rounds of message exchanges are required to compute the adder-tree operations, where N is the number of nodes in the system. This point will be further elaborated in Chapter 13.
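The node program above can be mimicked sequentially in Python to check the decomposition; the function and variable names below are illustrative, and the global '+' reduction performed by gop is simulated with an ordinary sum over the per-node contributions.

def f(x):
    return 4.0 / (1.0 + x * x)

def node_program(me, p, n):
    h = 1.0 / n
    s = 0.0
    for i in range(me + 1, n + 1, p):   # Do i = me + 1, n, p  (cyclic assignment)
        x = h * (i - 0.5)
        s += f(x)
    return h * s                        # this node's contribution to pi

p, n = 4, 20                            # four nodes, twenty panels
pi = sum(node_program(me, p, n) for me in range(p))   # simulated global '+' reduction
print(pi)                               # approximately 3.1418 even with only 20 panels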
Synchronization of data-parallel operations is done at compile time rather than at run time. Hardware synchronization is enforced by the control unit to carry out the lockstep execution of SIMD programs. We address below instruction/data broadcast, masking, and data-routing operations separately. Languages, compilers, and the conversion of SIMD programs to run on MIMD multicomputers are also discussed.
Data Parallelism Ever since the introduction of the Illiac IV computer, programming SIMD array processors has been a challenge for computational scientists. The main difficulty in using the Illiac IV had been to match the problem size with the fixed machine size. In other words, large arrays or matrices had to be partitioned into 64-element segments before they could be effectively processed by the 64 processing elements (PEs) in the Illiac IV machine.
A later SIMD computer, the Connection Machine CM-2, offered bit-slice fine-grain data parallelism using 16,384 PEs concurrently in a single-array configuration. This demanded a lower degree of array segmentation and thus offered higher flexibility in programming.
Synchronous SIMD programming differs from asynchronous MIMD programming in that all PEs in an SIMD computer operate in a lockstep fashion, whereas all processors in an MIMD computer execute different instructions asynchronously. As a result, SIMD computers do not have the mutual exclusion or synchronization problems associated with multiprocessors or multicomputers.
Instead, inter-PE communications are directly controlled by hardware. Besides lockstep in computing operations among all PEs, inter-PE data communication is also carried out in lockstep. These synchronized instruction executions and data-routing operations make SIMD computers rather efficient in exploiting spatial parallelism in large arrays, grids, or meshes of data.
In an SIMD program, scalar instructions are directly executed by the control unit. Vector instructions are broadcast to all processing elements. Vector operands are loaded into the PEs from local memories simultaneously using a global address with different offsets in local index registers. Vector stores can be executed in a similar manner. Constant data can be broadcast to all PEs simultaneously.
A masking pattern (binary vector) can be set under program control so that PEs can be enabled or disabled dynamically in any instruction cycle. Masking instructions are directly supported by hardware. Data-routing vector operations are supported by an inter-PE routing network, which is also under program control on a dynamic basis.
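The effect of a masking pattern can be sketched with NumPy used as a stand-in for a PE array: each array element plays the role of one PE, and a boolean mask plays the role of the PE enable bits. The array names are illustrative.

import numpy as np

a = np.arange(8, dtype=float)          # one operand element per "PE"
b = np.full(8, 10.0)                   # constant broadcast to all PEs
mask = np.array([1, 0, 1, 0, 1, 1, 0, 1], dtype=bool)   # PE enable bits

result = a.copy()
result[mask] = a[mask] + b[mask]       # only enabled PEs execute the vector add
print(result)                          # disabled PEs keep their previous values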
Array Language Extensions Array extensions in data-parallel languages are represented by high-level data types. We will specify Fortran 90 array notations in Section 10.2.2. The array syntax enables the removal of some nested loops in the code and should reflect the architecture of the array processor.
Examples of array processing languages are CFD for the Illiac IV, DAP Fortran for the AMT Distributed Array Processor, C* for the TMC Connection Machine, and MPF for the MasPar family of massively parallel computers.
An SIMD programming language should have a global address space, which obviates the need for explicit data routing between PEs. The array extensions should have the ability to make the number of PEs a function of the problem size rather than a function of the target machine.
The Connection Machine C* language satisfied these requirements nicely. A Pascal-based language, Actus, was developed by R.H. Perrott for problem-oriented SIMD programming. Actus offered hardware transparency, application flexibility, and explicit control structures in both program structuring and data typing operations.
Compiler Support To support data-parallel programming, the array language expressions and their optimizing compilers must be embedded in familiar standards such as Fortran 77, Fortran 90, and C. The idea is to unify the program execution model, facilitate precise control of massively parallel hardware, and enable incremental migration to data-parallel execution.
Compiler-optimized control of SIMD machine hardware allows the programmer to drive the PE array transparently. The compiler must separate the program into scalar and parallel components and integrate with the OS environment.
The compiler technology must allow array extensions to optimize data placement, minimize data movement, and virtualize the dimensions of the PE array. The compiler generates data-parallel machine code to perform operations on arrays.
Array sectioning allows a programmer to reference a section or a region of a multidimensional array. Array sections are designated by specifying a start index, a bound, and a stride. Vector-valued subscripts are often used to construct arrays from arbitrary permutations of another array. These expressions are vectors that map the desired elements into the target array. They facilitate the implementation of gather and scatter operations on a vector of indices.
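Gather and scatter through vector-valued subscripts can be sketched with NumPy fancy indexing; this is only an analogy for the array-language feature, and the array names below are illustrative.

import numpy as np

a = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
idx = np.array([4, 0, 3])        # vector-valued subscript (an arbitrary selection)

gathered = a[idx]                # gather: build a new array from selected elements
print(gathered)                  # [50. 10. 40.]

b = np.zeros(5)
b[idx] = gathered                # scatter: store back through the index vector
print(b)                         # [10.  0.  0. 40. 50.]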
SIMD programs can in theory be recompiled for MIMD architectures. The idea is to develop a source-to-source precompiler to convert, for example, Connection Machine C* programs to C programs running on an nCUBE message-passing multicomputer in SPMD mode.
In fact, SPMD programs are a special class of SIMD programs which emphasize medium-grain parallelism and synchronization at the subprogram level rather than at the instruction level. In this sense, the data-parallel programming model applies to both synchronous SIMD and loosely coupled MIMD computers. Program conversion between different machine architectures is needed to broaden software portability. The parallel programming paradigm based on the OpenMP standard is described in Chapter 13.
The development of concurrent object-oriented programming (COOP) provides an alternative model for concurrent computing on multiprocessors or on multicomputers. Various object models differ in the internal behavior of objects and in how they interact with each other.
An Actor Model COOP must support patterns of reuse and classification, for example, through the use of inheritance, which allows all instances of a particular class to share the same property. An actor model developed at MIT is presented as one framework for COOP.
Actors are self-contained, interactive, independent components of a computing system that communicate by asynchronous message passing. In an actor model, message passing is attached with semantics. Basic actor primitives include:
(1) Create: Creating an actor from a behavior description and a set of parameters.
(2) Send-to: Sending a message to another actor.
(3) Become: An actor replacing its own behavior by a new behavior.
State changes are specified by behavior replacement. The replacement mechanism allows one to aggregate changes and to avoid unnecessary control-flow dependences. Concurrent computations are visualized in terms of concurrent actor creations, simultaneous communication events, and behavior replacements. Each message may cause an object (actor) to modify its state, create new objects, and send new messages.
Concurrency control structures represent particular patterns of message passing. The actor primitives provide a low-level description of concurrent systems. High-level constructs are also needed for raising the granularity of descriptions and for encapsulating faults. The actor model is particularly suitable for multicomputer implementations.
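A minimal actor sketch in Python, with one thread and one mailbox queue per actor, can illustrate the three primitives; this is only an illustrative model (the class and behavior names are invented here), not the MIT actor system itself.

import queue
import threading
import time

class Actor:
    def __init__(self, behavior):            # "create": a behavior plus parameters
        self.mailbox = queue.Queue()
        self.behavior = behavior
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):                      # "send-to": asynchronous message passing
        self.mailbox.put(msg)

    def become(self, new_behavior):           # "become": replace the actor's own behavior
        self.behavior = new_behavior

    def _run(self):
        while True:
            msg = self.mailbox.get()
            self.behavior(self, msg)          # each message handled by the current behavior

def initial_behavior(actor, msg):
    print("initial behavior received:", msg)
    actor.become(lambda a, m: print("replacement behavior received:", m))

a = Actor(initial_behavior)
a.send("first")       # handled by the initial behavior
a.send("second")      # handled by the replacement behavior
time.sleep(0.2)       # give the daemon thread time to drain the mailbox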
Parallelism in COOP Three common patterns of parallelism have been found in the practice of COOP. First, pipeline concurrency involves the overlapped enumeration of successive solutions and concurrent testing of the solutions as they emerge from an evaluation pipeline.
Second, divide-and-conquer concurrency involves the concurrent elaboration of different subprograms and the combining of their solutions to produce a solution to the overall problem. In this case, there is no interaction between the procedures solving the subproblems. These two patterns are illustrated by the following examples taken from the paper by Agha (1990).
Example 10.2 Concurrency in object-oriented programming (Gul Agha, 1990)
A prime-number generation pipeline is shown in Fig. 10.3a. Integer numbers are generated and successively tested for divisibility by previously generated primes in a linear pipeline of primes. The circled numbers represent those being generated.
A number enters the pipeline from the left end and is eliminated if it is divisible by the prime number tested at a pipeline stage. All the numbers being forwarded to the right of a pipeline stage are those indivisible by all the prime numbers tested on the left of that stage.
Figure 10.3b shows the multiplication of a list of numbers (10, 7, −2, 3, 4, −11, −3) using a divide-and-conquer approach. The numbers are represented as leaves of a tree. The problem can be recursively subdivided into subproblems of multiplying two sublists, each of which is concurrently evaluated and the results multiplied at the upper node.
Fig. 10.3 (a) A prime-number generation pipeline; (b) divide-and-conquer multiplication of the list (10, 7, −2, 3, 4, −11, −3), with partial products combined up a binary tree to the final result −55440
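One level of this divide-and-conquer pattern can be sketched in Python: the two sublists are multiplied concurrently (here by a small thread pool, used only as an illustration) and the partial products are combined at the upper node.

from concurrent.futures import ThreadPoolExecutor

def product(nums):
    result = 1
    for x in nums:
        result *= x
    return result

nums = [10, 7, -2, 3, 4, -11, -3]
mid = len(nums) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    left = pool.submit(product, nums[:mid])    # left sublist evaluated concurrently
    right = pool.submit(product, nums[mid:])   # right sublist evaluated concurrently
    print(left.result() * right.result())      # combined at the upper node: -55440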
A third pattern is called cooperative problem solving. A simple example is the dynamic path evaluation (computational objects) of many physical bodies (objects) under the mutual influence of gravitational fields. In this case, all objects must interact with each other; intermediate results are stored in objects and shared by passing messages between them. Interested readers may refer to the book on actors by Agha (1986).
Today companies such as IBM and Cray produce supercomputers with thousands of processors interconnected over high-performance networks. At the same time, object-oriented programming and the message-passing model of interprocess communication have become established as standard paradigms of program design and development. Consider, for example, IBM's powerful Blue Gene line of supercomputers; the standard method of communication among node processes in these supercomputers is the Message-Passing Interface (MPI), customized for the architecture as needed. The Blue Gene line of supercomputers and MPI will both be discussed in Chapter 13.
Two language-oriented models are described next. The functional programming model emphasizes programs free of side effects, while the logic programming model is based on logic programming languages such as Concurrent Prolog and Parlog. We reveal opportunities for parallelism in these two models and discuss their potential in AI applications.
Functional Programming Model A functional programming language emphasizes the functionality of a program and should not produce side effects after execution. There is no concept of storage, assignment, or branching in functional programs. In other words, the history of any computation performed prior to the evaluation of a functional expression should be irrelevant to the meaning of the expression.
The lack of side effects opens up much more opportunity for parallelism. Precedence restrictions occur only as a result of function application. The evaluation of a function produces the same value regardless of the order in which its arguments are evaluated. This implies that all arguments in a dynamically created structure of a functional program can be evaluated in parallel. All single-assignment and dataflow languages are functional in nature. This implies that functional programming models can be easily applied to data-driven multiprocessors. The functional model emphasizes fine-grain MIMD parallelism and is referentially transparent.
The majority of parallel computers designed to support the functional model were oriented toward Lisp, such as Multilisp developed at MIT. Other dataflow computers have been used to execute functional programs, including SISAL used in the Manchester dataflow machine.
Logic Programming Model Based on predicate logic, logic programming is suitable for knowledge processing dealing with large databases. This model adopts an implicit search strategy and supports parallelism in the logic inference process. A question is answered if the matching facts are found in the database. Two facts match if their predicates and associated arguments are the same. The process of matching and unification can be parallelized under certain conditions. Clauses in logic programming can be transformed into dataflow graphs. Parallel unification has been attempted on some dataflow computers built in Japan.
Concurrent Prolog, developed by Shapiro (1986), and Parlog, introduced by Clark (1981), are two parallel logic programming languages. Both languages can implement relational language features such as AND-parallel execution of conjunctive goals, IPC by shared variables, and OR-parallel reduction.
In Parlog, the resolution tree has one chain at AND levels, and OR levels are partially or fully generated. In Concurrent Prolog, the search strategy follows multiple paths or depth first. Stream parallelism is also possible in these logic programming systems.
Both functional and logic programming models have been used in artificial intelligence applications where parallel processing is very much in demand. Japan's Fifth-Generation Computing System (FGCS) project attempted to develop parallel logic systems for problem solving, machine inference, and intelligent human-machine interfacing.
In many ways, the FGCS project was a marriage of parallel processing hardware and AI software. The Parallel Inference Machine (PIM-1) in this project was designed to perform 10 million logic inferences per second (MLIPS). However, more recent AI applications tend to be based on other techniques, such as Bayesian inference.
Availability Features These are features that enhance user-friendliness, make the language portable to a large class of parallel computers, and expand the applicability of software libraries.
• Scalability: The language is scalable to the number of processors available and independent of hardware topology.
• Compatibility: The language is compatible with an established sequential language.
• Portability: The language is portable to shared-memory multiprocessors, message-passing multicomputers, or both.
Synchronization/Communication Features Listed below are desirable language features for synchronization or for communication purposes:
• Single-assignment languages
• Shared variables (locks) for IPC
• Logically shared memory such as the tuple space in Linda
• Send/receive for message passing
• Rendezvous in Ada
• Remote procedure call
• Dataflow languages such as Id
• Barriers, mailboxes, semaphores, monitors
Control of Parallelism Listed below are features involving control constructs for specifying parallelism in various forms:
• Coarse, medium, or fine grain
where each ei is an arithmetic expression that must produce a scalar integer value. The first expression e1 is a lower bound, the second e2 an upper bound, and the third e3 an increment (stride). For example, B(1 : 4 : 3, 6 : 8 : 2, 3) represents the four elements B(1, 6, 3), B(4, 6, 3), B(1, 8, 3), and B(4, 8, 3) of a three-dimensional array.
When the third expression in a triplet is missing, a unit stride is assumed. The * notation in the second expression indicates all elements in that dimension starting from e1, or the entire dimension if e1 is also omitted. When both e2 and e3 are omitted, e1 alone represents a single element in that dimension. For example, A(5) represents the fifth element in the array A(3 : 7 : 2). This notation allows us to select array sections or particular array elements.
Array assignments are permitted under the following constraints: the array expression on the right must have the same shape and the same number of elements as the array on the left. For example, the assignment A(2 : 4, 5 : 8) = A(3 : 5, 1 : 4) is valid, but the assignment A(1 : 4, 1 : 3) = A(1 : 2, 1 : 6) is not valid, even though each side has 12 elements. When a scalar is assigned to an array, the value of the scalar is assigned to every element of the array. For instance, the statement B(3 : 4, 5) = 0 sets B(3, 5) and B(4, 5) to 0.
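The section semantics can be mirrored with NumPy slicing, bearing in mind that Python indexing is 0-based and slice upper bounds are exclusive, so the Fortran triplet e1:e2:e3 corresponds roughly to the slice (e1−1):e2:e3; the arrays below are illustrative, not taken from the text.

import numpy as np

B3 = np.zeros((4, 8, 3))
B3[0:4:3, 5:8:2, 2] = 1.0          # ~ B(1:4:3, 6:8:2, 3): exactly four elements
print(int(B3.sum()))               # 4

A = np.arange(1.0, 41.0).reshape(5, 8)   # think of it as A(1:5, 1:8)
A[1:4, 4:8] = A[2:5, 0:4]                # ~ A(2:4, 5:8) = A(3:5, 1:4): shapes conform
B = np.zeros((4, 10))
B[2:4, 4] = 7.0                          # ~ B(3:4, 5) = 7: scalar spread over a section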
Parallel Flow Control The conventional Fortran Do loop declares that all scalar instructions within the (Do, Enddo) pair are executed sequentially, and so are the successive iterations. To declare parallel activities, we use the (Doall, Endall) pair. All iterations in the Doall loop are totally independent of each other. This implies that they can be executed in parallel if there are sufficient processors to handle different iterations. However, the computations within each iteration are still executed serially in program order.
When the successive iterations of a loop depend on each other, we use the (Doacross, Endacross) pair to declare parallelism with loop-carried dependences. Synchronizations must be performed between the iterations that depend on each other. For example, dependence along the J-dimension exists in the following program. We use Doacross to declare parallelism along the I-dimension, but synchronization between iterations is required. The (Forall, Endall) and (Pardo, Parend) commands can be interpreted either as a Doall loop or as a Doacross loop.
Doacross I = 2, N
  Do J = 2, N
    S1:  A(I, J) = (A(I, J − 1) + A(I, J + 1))/2
  Enddo
Endacross
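Doall semantics can be sketched in Python by handing independent iterations to a pool of workers; the iteration body below is illustrative, and ProcessPoolExecutor is used only as a stand-in for the parallel processors executing independent iterations.

from concurrent.futures import ProcessPoolExecutor

def iteration(i):
    # body of one Doall iteration; no data is shared between iterations
    return i * i

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(iteration, range(1, 11)))   # Doall I = 1, 10
    print(results)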
Another program construct is the (Cobegin, Coend) pair. All computations specified within the block could be executed in parallel. But parallel processes may be created with a slight time difference in real implementations. This is quite different from the semantics of the Doall loop or Doacross loop structures. Synchronizations among concurrent processes created within the pair are implied. Formally, the command

Cobegin
  P1
  P2
  ...
  Pn
Coend
causes processes P1, P2, ..., Pn to start simultaneously and to proceed concurrently until they have all ended. The command (Parbegin, Parend) has equivalent meaning.
Finally, we introduce the Fork and Join commands in the following example. During the execution of a process P, we can use a Fork Q command to spawn a new process Q:

Process P        Process Q
  ...              ...
  Fork Q           ...
  ...              End
  Join Q
The Join Q command recombines the two processes into one process. Execution of Q is initiated when the Fork Q statement in P is executed. Programs P and Q are executed concurrently until either P executes the Join Q statement or Q terminates. Whichever one finishes first must wait for the other to complete execution before they can be rejoined.
In a UNIX or LINUX environment, the Fork-Join statements provide a direct mechanism for dynamic process creation, including multiple activations of the same process. The Cobegin-Coend statements provide a structured single-entry, single-exit control command which is not as dynamic as the Fork-Join. The (Parbegin, Parend) command is equivalent to the (Cobegin, Coend) command.
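The Fork/Join pattern can be sketched with Python's multiprocessing module, where start() plays the role of Fork Q and join() the role of Join Q; the process body below is illustrative.

from multiprocessing import Process

def process_q():
    print("Q: running concurrently with P")

if __name__ == "__main__":
    q = Process(target=process_q)
    q.start()                  # Fork Q: spawn a new process Q
    print("P: continues its own work")
    q.join()                   # Join Q: whichever finishes first waits here
    print("P: rejoined after Q terminated")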
(Figure: compilation phases for parallel code generation: flow analysis (data dependence, control dependence, reuse analysis) followed by program optimizations (vectorization, parallelization, locality, pipelining).)
Flow Analysis This phase reveals the program flow patterns in order to determine data and control dependences in the source code. We have discussed data dependence relations among scalar-type instructions in previous chapters. Scalar dependence analysis is extended below to structured data arrays or matrices. Depending on the machine structure, the granularities of parallelism to be exploited are quite different. Thus the flow analysis is conducted at different execution levels on different parallel computers.
Generally speaking, instruction-level parallelism is exploited in superscalar or VLIW processors; loop-level parallelism in SIMD, vector, or systolic computers; and task-level parallelism in multiprocessors, multicomputers, or a network of workstations. Of course, exceptions do exist. For example, fine-grain parallelism can in theory be pushed down to multicomputers with a globally shared address space. The flow analysis must also reveal code/data reuse and memory-access patterns.
Program Optimizations This refers to the transformation of user programs in order to explore the hardware capabilities as much as possible. Transformation can be conducted at the loop level, locality level, or prefetching level with the ultimate goal of reaching global optimization. The optimization often transforms a code into an equivalent but "better" form in the same representation language. These transformations should be machine-independent.
In reality, most transformations are constrained by the machine architecture. This is the main reason why many such compilers are machine-dependent. At the least, we want to design a compiler which can run on most machines with only minor modifications. One can also conduct certain transformations preceding the global
optimization. This may require a source-to-source optimization (sometimes carried out by a precompiler), which transforms the program from one high-level language to another before using a dedicated compiler for the second language on a target machine.
The ultimate goal of program optimization is to maximize the speed of code execution. This involves the minimization of code length and of memory accesses and the exploitation of parallelism in programs. The optimization techniques include vectorization using pipelined hardware and parallelization using multiple processors simultaneously. The compiler should be designed to reduce the running time with minimum resource binding. Other optimizations demand the expansion of routines or procedure integration with inlining. Both local and global optimizations are needed in most programs. Sometimes the optimization should be conducted at the algorithmic level and must involve the programmer.
Machine-dependent transformations are meant to achieve more efficient allocation of machine resources, such as processors, memory, registers, and functional units. Replacement of complex operations by cheaper ones is often practiced. Other optimizations include elimination of unnecessary branches or common expressions. Instruction scheduling can be used to eliminate pipeline or memory delays in executing consecutive instructions.
Parallel Code Generation Code generation usually involves transformation from one representation to another, called an intermediate form. A code model must be chosen as an intermediate form. Parallel code is even more demanding because parallel constructs must be included. Code generation is closely tied to the instruction scheduling policies used. Basic blocks linked by control-flow commands are often optimized to encourage a high degree of parallelism. Special data structures are needed to represent instruction blocks.
Parallel code generation is very different for different computer classes. For example, a superscalar processor may be software-scheduled or hardware-scheduled. How to optimize the register allocation on a RISC or superscalar processor, how to reduce the synchronization overhead when codes are partitioned for multiprocessor execution, and how to implement message-passing commands when codes/data are distributed (or replicated) on a multicomputer are added difficulties in parallel code generation. Compiler directives can be used to help generate parallel code when automated code generation cannot be implemented easily.
Two well-known exploratory optimizing compilers were developed in the mid-1980s: one was Parafrase at the University of Illinois, and the other was the PFC (Parallel Fortran Converter) at Rice University. These systems are briefly introduced below.
Parafrase and Parafrase 2 This system, developed by David Kuck and coworkers at Illinois, is a source-to-source program restructurer (or compiler preprocessor) which transforms sequential Fortran 77 programs into forms suitable for vectorization or parallelization. Parafrase contains more than 100 program transformations which are encoded as passes. A pass list is used to identify the particular sequence of transformations needed for restructuring a given sequential program. The output of Parafrase is the converted concurrent program.
Different programs use different pass lists and thus go through different sequences of transformations. The pass lists can be optimized for specific machine architectures and specific program constructs. Parafrase 2 was developed for handling programs written in C and Pascal, in addition to converting Fortran codes. Information on Parafrase can be found in [Kuck84] and on Parafrase 2 in [Polychronopoulos89].
Parafrase is retargetable to produce code for different classes of parallel/vector computers. The program transformed by Parafrase still needs a conventional optimizing compiler to produce the object code for the target machine. The Parafrase technology was later transferred to implement the KAP vectorizer by Kuck and Associates, Inc.
The PFC and ParaScope Ken Kennedy and his associates at Rice University developed PFC as an automatic source-to-source vectorizer. It translated Fortran 77 code into Fortran 90 code. A categorized dependence testing scheme was developed in PFC for revealing opportunities for loop vectorization. The PFC package was also extended to PFC+ for parallel code generation on shared-memory multiprocessors. PFC and PFC+ also supported the ParaScope programming environment.
PFC [Allen and Kennedy, 1984] performed syntax analysis, including the following four steps:
(1) Interprocedural flow analysis using call graphs.
(2) Standard transformations such as Do-loop normalization, subscript categorization, deletion of dead code, etc.
(3) Dependence analysis which applied the separability, GCD, and Banerjee tests jointly.
(4) Vector code generation. PFC+ further implemented a parallel code generation algorithm (Callahan et al, 1988).
Commercial Compilers Optimizing compilers have also been developed for a number of commercial parallel/vector computers, including the Alliant FX/Fortran compiler, the Convex parallelizing/vectorizing compiler, the Cray CFT compiler, the IBM vectorizing Fortran compiler, the VAST vectorizer by Pacific Sierra, Inc., and the Intel iPSC-VX compiler. IBM also developed the PTRAN (Parallel Fortran) system based on control dependence with interprocedural analysis.
Do i1 = L1, U1
  ...
    Do in = Ln, Un
      S1:  A(f1(i1, ..., in), ..., fm(i1, ..., in)) = ...
      S2:  ... = A(g1(i1, ..., in), ..., gm(i1, ..., in))
    Enddo
  ...
Enddo
Iteration Space The n-dimensional discrete Cartesian space for n-deep loops is called an iteration space. Each iteration is represented by its coordinates in the iteration space. The following example clarifies the concept of lexicographic order for the successive iterations in a loop nest.
Example 10.3 Lexicographic order for sequential execution of successive iterations in a loop structure (Monica Lam, 1992)
Consider a two-dimensional iteration space (Fig. 10.5) representing the following two-level loop nest with unit-increment steps:

Do i = 0, 5
  Do j = i, 7
    A(i, j) = ...
  Enddo
Enddo
Fig. 10.5 A two-dimensional iteration space for the loop nest in Example 10.3
The following sequential order of iterations is a lexicographic order:

(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7)
(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7)
(3, 3), (3, 4), (3, 5), (3, 6), (3, 7)
(4, 4), (4, 5), (4, 6), (4, 7)
(5, 5), (5, 6), (5, 7)
The lexicographic order is important to performing matrix transformations, which can be applied for loop optimization. We will apply lexicographic orders for loop parallelization in Section 10.5.
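The lexicographic order above can be regenerated by enumerating the loop nest directly; a minimal Python sketch:

order = [(i, j) for i in range(0, 6) for j in range(i, 8)]   # Do i = 0,5; Do j = i,7
print(order[:8])    # (0, 0) through (0, 7): the first row listed above
print(len(order))   # 33 iterations in total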
Dependence Equations Let α and β be vectors of n integer indices within the ranges of the upper and lower bounds of the n loops. There is a dependence from S1 to S2 if and only if there exist α and β such that α is lexicographically less than or equal to β and the following system of dependence equations is satisfied:

fi(α) = gi(β)   for all i, 1 ≤ i ≤ m

The difference β − α defines a distance vector, and the signs of its elements define the corresponding direction vector. The elements of a distance or direction vector are always displayed in order from left to right and from the outermost to the innermost loop in the nest.
For example, consider the following loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i + 1, j, k − 1) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The distance and direction vectors for the dependence between iterations along the three dimensions of the array A are (1, 0, −1) and (<, =, >), respectively. Since several different values of α and β may satisfy the dependence equations, a set of distance and direction vectors may be needed to completely describe the dependence.
Direction vectors are useful for calculating the level of loop-carried dependences. A dependence is carried by the outermost loop for which the direction in the direction vector is not "=". For instance, the direction vector (<, =, >) for the dependence above shows that the dependence is carried on the i-loop.
Carried dependences are important because they determine which loops cannot be executed in parallel without synchronization. Direction vectors are also useful in determining whether loop interchange is legal and profitable. Distance vectors are more precise versions of direction vectors that specify the actual distance in loop iterations between two accesses to the same memory location. They may be used to guide optimizations to exploit parallelism or the memory hierarchy.
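A brute-force Python sketch can recover the distance and direction vectors for the loop nest above by enumerating iteration pairs that touch the same array element; the loop bound N below is an arbitrary illustrative value.

from itertools import product

N = 4
writes = {}
for it in product(range(1, N + 1), repeat=3):            # iteration vector (i, j, k)
    i, j, k = it
    writes.setdefault((i + 1, j, k - 1), []).append(it)  # element written by A(i+1, j, k-1)

def direction(d):
    return "<" if d > 0 else ("=" if d == 0 else ">")

vectors = set()
for it in product(range(1, N + 1), repeat=3):
    i, j, k = it
    for src in writes.get((i, j, k), []):                # this iteration reads A(i, j, k)
        dist = tuple(b - a for a, b in zip(src, it))     # distance = sink - source
        vectors.add((dist, tuple(direction(d) for d in dist)))

print(vectors)    # {((1, 0, -1), ('<', '=', '>'))}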
Example 10.4 Subscript types in a loop computation
Consider the following loop nest of three levels, identified by indices i, j, and k.
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(5, i − 1, j) = A(N, i, k) + C
    Enddo
  Enddo
Enddo
When testing for a flow dependence between the two references to A in the code, the first subscript is ZIV because 5 and N are both constants, the second is SIV because only the index i appears in this dimension, and the third is MIV because both indices j and k appear in the third dimension. For simplicity, we have ignored the output dependence in this example.
Subscript Separability When testing multidimensional arrays, we say that a subscript position is separable if its indices do not occur in the other subscripts. If two different subscripts contain the same index, we say they are coupled. Separability is important because multidimensional array references can cause imprecision in dependence testing.
If all the subscripts are separable, we may compute the direction vector for each subscript independently and merge the direction vectors on a positional basis with full precision. The following examples clarify these concepts.
Example 10.5 From separability to direction vector and distance vector
Consider the following loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i, j, j) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The first subscript is separable because the index i does not appear in the other dimensions, but the second and third are coupled because they both contain the index j. ZIV subscripts are separable because they contain no indices.
Consider another loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i + 1, j, k − 1) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The leftmost direction in the direction vector is determined by testing the first subscript, the middle direction by testing the second subscript, and the rightmost direction by testing the third subscript. The resulting direction vector (<, =, >) is precise. The same approach applied to distances allows us to calculate the exact distance vector (1, 0, −1).
Subscript Partitioning We need to classify all the subscripts in a pair of array references as separable or as part of some minimal coupled group. A coupled group is minimal if it cannot be partitioned into two nonempty subgroups with distinct sets of indices. Once a partition is achieved, each separable subscript and each coupled group has a completely disjoint set of indices.
Each partition may then be tested in isolation and the resulting distance or direction vectors merged without any loss of precision. Since each separable subscript and coupled subscript group contains a unique subset of indices, a merge may be thought of as a Cartesian product.
In the following loop nest, the first subscript yields the direction vector (<) for the i-loop. The second subscript yields the direction vector (=) for the j-loop. The resulting Cartesian product is the single vector (<, =).
Do i = L1, U1
  Do j = L2, U2
    A(i − 1, j) = A(i, j) + C
  Enddo
Enddo
Consider another loop nest where the first subscript yields the direction vector (<) for the i-loop:

Do i = L1, U1
  Do j = L2, U2
    A(i + 1, 5) = A(i, N) + C
  Enddo
Enddo
Since j does not appear in any subscript, we must assume the full set of direction vectors for the j-loop: {(<), (=), (>)}. Thus a merge yields the following set of direction vectors for both dimensions:

{(<, <), (<, =), (<, >)}
10.3.3 Categorized Dependence Tests
The goal of dependence testing is to construct the complete set of distance and direction vectors representing potential dependences between an arbitrary pair of subscripted references to the same array variable. Since distance vectors may be treated as precise direction vectors, we will simply refer to direction vectors.
The Testing Algorithm The following procedure is for dependence testing based on a partitioning approach, which can isolate unrelated indices and localize the computation involved, and thus is easier to implement.
(1) Partition the subscripts into separable and minimal coupled groups using the following algorithm (see the sketch following this list):
Subscript Partitioning Algorithm (Goff, Kennedy, and Tseng, 1991)
Input:  A pair of m-dimensional array references
        containing subscripts S1 ... Sm enclosed in n loops
        with indices I1 ... In.
Output: A set of partitions P1 ... Pn', n' ≤ n, each
        containing a separable or minimal coupled group.

For each i, 1 ≤ i ≤ m Do
  Pi ← {Si}
Endfor
For each index Ii, 1 ≤ i ≤ n Do
  k ← <none>
  For each remaining partition Pj Do
    If ∃ Sl ∈ Pj such that Sl contains Ii then
      If k = <none> then
        k ← j
      else
        Pk ← Pk ∪ Pj
        Discard Pj
      Endif
    Endif
  Endfor
Endfor
(2) Label each subscript as ZIV, SIV, or MIV.
(3) For each separable subscript, apply the appropriate single-subscript test (ZIV, SIV, MIV) based on the complexity of the subscript. This will produce independence or direction vectors for the indices occurring in that subscript.
(4) For each coupled group, apply a multiple-subscript test to produce a set of direction vectors for the indices occurring within that group.
(5) If any test yields independence, no dependences exist.
(6) Otherwise merge all the direction vectors computed in the previous steps into a single set of direction vectors for the two references.
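A minimal Python sketch of the partitioning step (1): each subscript position is described here simply by the set of loop indices it contains, and subscripts sharing an index are merged into one coupled group. The representation is an assumption made for illustration.

def partition(subscript_indices, loop_indices):
    """subscript_indices: list of index sets, one per subscript position."""
    partitions = [{s} for s in range(len(subscript_indices))]   # P_i <- {S_i}
    for idx in loop_indices:
        k = None
        for p in list(partitions):
            if any(idx in subscript_indices[s] for s in p):
                if k is None:
                    k = p                    # first partition containing this index
                else:
                    k |= p                   # merge coupled partitions into P_k
                    partitions.remove(p)     # discard P_j
    return partitions

# Example 10.5: A(i, j, j) = A(i, j, k) + C  ->  positions contain {i}, {j}, {j, k}
print(partition([{"i"}, {"j"}, {"j", "k"}], ["i", "j", "k"]))
# [{0}, {1, 2}]: the first subscript is separable, the second and third are coupled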
Test Categories Dependence test results for ZIV subscripts are treated specially. If a ZIV subscript proves independence, the dependence test algorithm halts immediately. If independence is not proved, the ZIV test does not produce direction vectors, and so no merge is necessary. For the implementation of the above algorithm, we specify below how to perform the single-subscript tests (ZIV, SIV, MIV) separately. We consider the trivial case of ZIV first, then SIV, and finally MIV, which is more involved.
We first consider dependence tests for single separable subscripts. All tests presented assume that the subscript being tested contains expressions that are linear in the loop index variables. A subscript expression is linear if it has the form a1 i1 + a2 i2 + ... + an in + c, where ik is the index for the loop at nesting level k; all ak, 1 ≤ k ≤ n, are integer constants; and c is an expression possibly containing loop-invariant symbolic expressions.
The ZIV Test The ZIV test is a dependence test performed on two loop-invariant expressions. If the system determines that the two expressions cannot be equal, it has proved independence. Otherwise the subscript does not contribute any direction vectors and may be ignored. The ZIV test can be easily extended for symbolic expressions. Simply form the expression representing the difference between the two subscript expressions. If the difference simplifies to a nonzero constant, we have proved independence.
The SIV Test An SIV subscript for index i is said to be strong if it has the form (a i + c1, a i' + c2), i.e. if it is linear and the coefficients of the two occurrences of the index i are constant and equal. For strong SIV subscripts, define the dependence distance as

d = i' − i = (c1 − c2)/a        (10.4)

A dependence exists if and only if d is an integer and |d| ≤ U − L, where U and L are the loop upper and lower bounds. For dependences that do exist, the dependence direction is given by

Direction = <   if d > 0
            =   if d = 0        (10.5)
            >   if d < 0
The strong SIV test is thus an exact test that can be implemented very efficiently in a few operations. A bounded iteration space is shown in Fig. 10.6a. The case of a strong SIV test is shown in Fig. 10.6b.
Another advantage of the strong SIV test is that it can be easily extended to handle loop-invariant symbolic expressions. The trick is to first evaluate the dependence distance d symbolically. If the result is a constant, then the test may be performed as above. Otherwise calculate the difference between the loop bounds and compare the result with d symbolically.
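The strong SIV test of Eqs. 10.4 and 10.5 can be sketched directly in Python; the subscript pair and loop bounds used in the example calls are illustrative.

from fractions import Fraction

def strong_siv(a, c1, c2, L, U):
    d = Fraction(c1 - c2, a)                 # dependence distance d = (c1 - c2)/a
    if d.denominator != 1 or abs(d) > U - L:
        return None                          # independence proved
    d = int(d)
    direction = "<" if d > 0 else ("=" if d == 0 else ">")
    return d, direction

# A(i + 2) = A(i) + C inside Do i = 1, 10: subscript pair (i + 2, i), i.e. a=1, c1=2, c2=0
print(strong_siv(1, 2, 0, 1, 10))    # (2, '<'): loop-carried dependence of distance 2
print(strong_siv(1, 20, 0, 1, 10))   # None: the distance exceeds the iteration range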
A weak SIV subscript has the form (a1 i + c1, a2 i' + c2), where the coefficients of the two occurrences of index i have different constant values. As stated previously, weak SIV subscripts may be solved using the single-index exact test. However, we also find it helpful to view the problem geometrically, where the dependence equation

a1 i + c1 = a2 i' + c2

describes a line in the plane with i and i' as axes.
Weak-Zero SIV Test The case in which a1 = 0 or a2 = 0 is called a weak-zero SIV subscript, as illustrated in Fig. 10.6c. If a2 = 0, the dependence equation reduces to

i = (c2 − c1)/a1
Fig. 10.6 Geometric view of SIV tests in four cases: (a) bounded iteration space, (b) strong SIV, (c) weak-zero SIV, (d) weak-crossing SIV (Courtesy of Goff et al, 1991; reprinted from ACM SIGPLAN Conf. on Programming Language Design and Implementation, Toronto, Canada, 1991)
It is only necessary to check that the resulting value for i is an integer and within the loop bounds. A similar check applies when a1 = 0.
The weak-zero SIV test finds dependences caused by a particular iteration i. In scientific codes, i is usually the first or last iteration of the loop, eliminating one possible direction vector for the dependence.
Weak-Crossing SIV Test All subscripts where a2 = −a1 are weak-crossing SIV. These subscripts typically occur as part of Cholesky decomposition, also illustrated in Fig. 10.6d. In these cases we set i = i' and derive the dependence equation

i = (c2 − c1)/(2 a1)

This corresponds to the intersection of the dependence equation with the line i = i'. To determine whether dependences exist, we simply need to check that the resulting value i is within the loop bounds and is either an integer or has a noninteger part equal to 1/2.
Weak-crossing SIV subscripts cause crossing dependences, loop-carried dependences whose endpoints all cross iteration i. These dependences may be eliminated using a loop-splitting transformation (Kennedy et al, 1991) as described below.
Consider the following loop from the Callahan-Dongarra-Levine vector test (Callahan et al., 1988):

Do i = 1, N
  A(i) = A(N − i + 1) + C
Enddo

The weak-crossing SIV test determines that dependences exist between the definition and use of A and that they all cross iteration (N + 1)/2. Splitting the loop at that iteration results in two parallel loops:

Do i = 1, (N + 1)/2
  A(i) = A(N − i + 1) + C
Enddo
Do i = (N + 1)/2 + 1, N
  A(i) = A(N − i + 1) + C
Enddo
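A minimal Python sketch of the weak-crossing check for this loop: the subscript pair of A(i) = A(N − i + 1) + C is (i, −i + N + 1), so a1 = 1, c1 = 0, c2 = N + 1, and the crossing point is (c2 − c1)/(2 a1). The value of N below is illustrative.

from fractions import Fraction

def weak_crossing_siv(a1, c1, c2, L, U):
    i = Fraction(c2 - c1, 2 * a1)            # intersection with the line i = i'
    in_bounds = L <= i <= U
    half_or_integer = i.denominator in (1, 2)  # integer, or noninteger part equal to 1/2
    return in_bounds and half_or_integer, i

N = 9
print(weak_crossing_siv(1, 0, N + 1, 1, N))  # (True, Fraction(5, 1)): crossing at i = 5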
The MIV Test SIV tests can be extended to handle complex iteration spaces where loop bounds may be functions of other loop indices, e.g. triangular or trapezoidal loops. We need to compute the minimum and maximum loop bounds for each loop index.
Starting at the outermost loop nest and working inward, we replace each index in a loop upper bound with its maximum value (or minimum if it is a negative term). We do the opposite in the lower bound, replacing each index with its minimum value (or maximum if it is a negative term).
We evaluate the resulting expressions to calculate the minimum and maximum values for the loop index and then repeat for the next inner loop. This algorithm returns the maximal range for each index, all that is needed for SIV tests.
The Banerjee-GCD test may be employed to construct all legal direction vectors for linear subscripts containing multiple indices. In most cases the test can also determine the minimal dependence distance for the carrier loop.
A special case of MIV subscripts, called RDIV (restricted double-index variable) subscripts, has the form (a1 i + c1, a2 j + c2). They are similar to SIV subscripts except that i and j are distinct indices. By observing different loop bounds for i and j, SIV tests may also be extended to test RDIV subscripts exactly.
A large body of work was performed in the field of dependence testing at Rice University, the University of Illinois, and the Oregon Graduate Institute. What was described above is only one of the many dependence testing algorithms proposed. Experimental results are reported from these research centers. Readers are advised to read published material on Banerjee's test and the GCD test, which provide other inexact and conservative solutions to the problem.
The development of a parallelizing compiler is limited by the difficulty of having to deal with many nonperfectly nested loops. The lack of dataflow information is often the ultimate limit on automatic compilation of parallel code.
use of functional units, registers, data paths, and memory. Program constraints are caused by data and control dependences. Some processors, like those with VLIW architecture, explicitly specify ILP in their instructions. Others may use hardware interlocks, out-of-order execution, or speculative execution. Even machines with dynamic scheduling hardware can benefit from compiler scheduling techniques.
There are two alternative approaches to supporting instruction scheduling. One is to provide an additional set of nontrapping instructions so that the compiler can perform aggressive static instruction scheduling. This approach requires an extension of the instruction set of existing processors. The second approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic instruction scheduling. This approach usually does not require the instruction set to be modified but requires complex hardware support.
In general, instruction scheduling methods ensure that control dependences, data dependences, and resource limitations are properly handled during concurrent execution. The goal is to produce a schedule that minimizes the execution time or the memory demand, in addition to enforcing correctness of execution. Static scheduling at compile time requires intelligent compilation support, whereas dynamic scheduling at run time requires sophisticated hardware support. In practice, dynamic scheduling can be assisted by static scheduling in improving performance.
Precedence Constraints Speculative execution requires the use of program profiling to estimate
effectiveness. Speculative exceptions must not terminate execution. In other words, precise exception
handling is desired to alleviate the control dependence problem. The data dependence problem involves
instruction ordering and register allocation issues.
If a flow dependence is detected, the write must proceed ahead of the read operation involved. Similarly,
output dependence produces different results if two writes to the same location are executed in a different
order. Antidependence enforces a read to be ahead of the write operation involved. We need to analyze
the memory variables. Scalar data dependence is much easier to detect. Dependence among arrays of data
elements is much more involved, as shown in Section 10.3. Other difficulties lie in interprocedural analysis,
pointer analysis, and register allocation interacting with code scheduling.
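The three kinds of data dependence named above can be seen in a few lines of straight-line code. The snippet below is purely illustrative (the variable names are not from the text).

```python
# Illustrative straight-line code showing flow, anti-, and output dependence.
b, c, e = 1, 2, 3
a = b + c      # S1
d = a * 2      # S2: flow (true) dependence on S1 -- S2 reads a after S1 writes it
a = e - 1      # S3: antidependence with S2 -- S3 writes a after S2 reads it;
               #     output dependence with S1 -- both S1 and S3 write a
print(d, a)
```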
Basic Block Scheduling A basic block (or just a block) is a sequence of statements satisfying two properties:
(1) No statement but the first can be reached from outside the block; i.e. there are no branches into the middle
of the block. (2) All statements are executed consecutively if the first one is. Therefore, no branches out or
halts are allowed until the end of the block. All blocks are required to be maximal in the sense that they cannot
be extended up or down without violating these properties.
For local optimization only, an extended basic block is defined as a sequence of statements in which the
first statement is the only entry point. Thus an extended block may have branches out in the middle of the
code but no branches into it. The basic steps for constructing basic blocks are summarized below:
(1) Find the leaders, which are the first statements in a block. Leaders are identified as being one or more
of the following:
    (a) The first statement of the code.
    (b) The target of a conditional or unconditional branch.
    (c) A statement following a conditional branch.
(2) For a leader, a basic block consists of the leader and all statements following up to but excluding the
next leader. Note that the beginning of inaccessible code (dead code) is not considered a leader. In fact,
dead code should be eliminated. A small sketch of these two steps is given below.
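The following sketch applies steps (1) and (2) to a simple, hypothetical instruction representation (dictionaries with optional 'label', 'goto', and 'cond' fields); it is an illustration of the procedure above, not a production partitioner.

```python
def find_leaders(instrs):
    """instrs: list of dicts with optional 'label', 'goto' (branch target),
    and 'cond' (True for a conditional branch)."""
    leaders = {0}                                    # (a) the first statement
    labels = {ins['label']: i for i, ins in enumerate(instrs) if 'label' in ins}
    for i, ins in enumerate(instrs):
        if 'goto' in ins:
            if ins['goto'] in labels:                # (b) target of a branch
                leaders.add(labels[ins['goto']])
            if ins.get('cond') and i + 1 < len(instrs):
                leaders.add(i + 1)                   # (c) statement after a conditional branch
    return sorted(leaders)

def basic_blocks(instrs):
    leaders = find_leaders(instrs)
    ends = leaders[1:] + [len(instrs)]
    # (2) each block runs from its leader up to, but excluding, the next leader
    return [instrs[s:e] for s, e in zip(leaders, ends)]
```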
Fig. 10.7 A flow graph showing the precedence relationship among basic blocks in the bubble sort program
(Courtesy of S. Graham, J. L. Hennessy, and J. D. Ullman, Course on Code Optimization and Code
Generation, Western Institute of Computer Science, Stanford University, 1991)
While a program is being compiled, basic block records should keep pointers to predecessors and
successors. Storage reclamation techniques or linked list structures can be used to represent blocks. Sources
of program optimization include algebraic optimization to eliminate redundant operations, and other
optimizations conducted within the basic blocks locally or on a global basis.
(1) Local Common Subexpression Elimination If a subexpression is to be evaluated more than once within
a single block, it can be replaced by a single evaluation. In Example 10.6, in block B7, t9 and t15 each
compute 4 * (j - 1), and t12 and t18 each compute 4 * j. Replacing t15 by t9, and t18 by t12, we obtain the
following revised code for B7, which is shorter to execute.
t8  := j - 1
t9  := 4 * t8
temp := A[t9]
t12 := 4 * j
t13 := A[t12]
A[t9]  := t13
A[t12] := temp
(2) Local Constant Folding or Propagation Sometimes some constants used in instructions can be computed
at compile time. This often takes place in the initialization blocks. The compile-time generated constants are
then folded to eliminate unnecessary calculations at run time. In other cases, a local copy may be propagated
to eliminate unnecessary calculations.
(3) Algebraic Optimization to Simplify Expressions For example, one can replace the identity statement A :=
B + 0 or A := B * 1 by A := B and later even replace references to this A by references to B. Or one can use
the commutative law to combine expressions C := A + B and D := B + A. The associative and distributive
laws can also be applied on equal-priority operators, such as replacing (a - b) + c by a - (b - c) if (b - c) has
already been evaluated earlier.
(4) Instruction Reordering Code reordering is often practiced to maximize the pipeline utilization or to enable
overlapped memory accesses. Some orders yield better code than others. Reordered instructions lead to better
scheduling, preventing pipeline or memory delays. In the following example, instruction I3 may be delayed
in memory accesses:
(1) Global Versions of Local Optimizations These include global common subexpression elimination, global
constant propagation, dead code elimination, etc. The following example further optimizes the code in
Example 10.6 if some global optimizations are performed.
Example 10.7 Global optimizations in the bubble sort program (S. Graham, J. L. Hennessy, and
J. D. Ullman, 1992)
In Example 10.6, block B7 needs to compute A[t9] = A[4 * (j - 1)], which was computed in block B6. To reach
B7, B6 must be executed first and the value of j never changes between the two nodes. Thus the first three
statements of B7 can be replaced by temp := t3. Similarly, t9 computes the same value as t2, t12 computes the
same value as t6, and t13 computes the same value as t7. The entire block B7 may be replaced by
temp := t3
A[t2] := t7
A[t6] := temp
(2) Loop Optimizations These include various loop transformations to be described in subsequent sections
for the purpose of vectorization, parallelization, or both. Sometimes code motion and induction variable
elimination can simplify loop structures. For example, one can replace the calculation of an induction variable
involving a multiplication by an addition to its former value. The addition takes less time to perform and thus
results in a shorter execution time, as sketched in the example below.
In other cases, loop-invariant variables or code can be moved out of the loop to simplify the loop nest.
One can also lower the loop control overhead using loop unrolling to reduce iterations or loop fusion to merge
loops. Loops can be exchanged to facilitate pipelining of long vectors. Many loop optimization examples will
be given in subsequent sections.
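The sketch below shows the induction-variable replacement mentioned above in a before/after form; the base address, element size, and loop bound are illustrative assumptions.

```python
# Induction-variable strength reduction: replace a per-iteration multiply by an
# addition to the variable's previous value.
base, n = 1000, 8

before = []
for i in range(n):
    addr = base + 4 * i          # address recomputed with a multiply each time
    before.append(addr)

after = []
addr = base
for i in range(n):
    after.append(addr)
    addr += 4                    # multiply replaced by a cheaper addition

assert before == after           # both loops produce the same address sequence
```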
(3) Control-Flow Optimizations These are other global optimizations dealing with control structure but not
directly with loops. A good example is code hoisting, which eliminates copies of identical code on parallel
paths in a flow graph. This can save space significantly, but would have no impact on execution time.
Interprocedural global optimizations are much more difficult to perform due to sensitivity and global
dependence relationships. Sometimes, procedure integration can be performed to replace a call to a procedure
by an instantiation of the procedure body. This may also trigger the discovery of other optimization
opportunities. Interprocedural dependence analysis must be performed in order to reveal these opportunities.
Machine-Dependent Optimizations With a finite number of registers, memory cells, and functional
units in a machine, the efficient allocation of machine resources affects both space and time optimization
of programs. For example, strength reduction replaces complex operations by cheaper operations, such as
replacing 2n by n + n, n^2 by n * n, and length(S1 + S2) by length(S1) + length(S2). We will address register
and memory allocation problems in code generation methods in the next section.
Other flow control optimizations can be conducted at the machine level. Good examples include the
elimination of unnecessary branches, the replacement of instruction sequences by simpler equivalent code
sequences, instruction reordering, and multiple instruction issues in superscalar processors. Machine-level
parallelism is usually exploited at the fine-grain instruction level. System-level parallelism is usually
exploited in coarse-grain computations.
In order to enable pipelined execution by vector hardware, we need to introduce a temporary array
TEMP(1:N) to produce the following vector code:
TEMP(1:N) = A(2:N+1)
A(1:N) = B(1:N) + C(1:N)
B(1:N) = 2 * TEMP(1:N)
Without the TEMP array, the second array assignment B(1:N) may use the modified A(1:N) array, which
was not intended in the original code.
(2) Loop Interchanging Vectorization is often performed in the inner loop rather than in the outer loop.
Sometimes we interchange the loops to enable the vectorization. The general rules for loop interchanges are
to make the most profitable vectorizable loop the innermost loop, to make the most profitable parallelizable
loop the outermost loop, to enable memory accesses to consecutive elements in arrays, and to bring a loop
with longer vector length (iteration count) to the innermost loop for vectorization. The profitability is defined
by improvement in execution time. Consider the following loop nest with two levels:
Do 20 I = 2, N
  Do 10 J = 2, N
S1:     A(I,J) = (A(I,J-1) + A(I,J+1)) / 2                (10.6)
10    Continue
20 Continue
The statement S1 is both flow-dependent and antidependent on itself with the direction vectors (=, <) and
(=, >), or more precisely the distance vectors (0, 1) and (0, -1), respectively. This implies that the loop cannot
be vectorized along the J-dimension. Therefore, we have to interchange the two loops:
Do 20 J = 2, N
  Do 20 I = 2, N
    A(I,J) = (A(I,J-1) + A(I,J+1)) / 2
20 Continue
Now, the inner loop (I-dimension) can be vectorized with the zero distance vector. The vectorized code is
Do 20 J = 2, N
  A(2:N, J) = (A(2:N, J-1) + A(2:N, J+1)) / 2
20 Continue
In general, an innermost loop cannot be vectorized if a forward dependence direction (<) and a backward
dependence direction (>) coexist. The (=) direction does not prevent vectorization.
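The rule just stated can be encoded directly. In the sketch below, direction vectors are represented as tuples of '<', '=', '>' characters per loop level (a representation chosen for illustration).

```python
def innermost_vectorizable(direction_vectors):
    # The innermost loop cannot be vectorized if '<' and '>' coexist there.
    innermost = {dv[-1] for dv in direction_vectors}
    return not ('<' in innermost and '>' in innermost)

# Eq. 10.6 before interchange: directions (=, <) and (=, >) -> not vectorizable
print(innermost_vectorizable([('=', '<'), ('=', '>')]))   # False
# After interchange the inner-loop directions become '=' only -> vectorizable
print(innermost_vectorizable([('<', '='), ('>', '=')]))   # True
```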
(3) Loop Distribution Nested loops can be vectorized by distributing the outermost loop and vectorizing each
of the resulting loops or loop nests.
Do 10 I = 1, N
  B(I, 1) = 0
  Do 20 J = 1, M
    A(I) = A(I) + B(I, J) * C(I, J)
20 Continue
  D(I) = E(I) + A(I)
10 Continue
The I-loop is distributed into three copies, separated by the nested J-loop from the assignments to arrays B and
D, and vectorized as follows:
B(1:N, 1) = 0        (a zero vector)
Do 30 I = 1, N
  A(I) = A(I) + B(I, 1:M) * C(I, 1:M)
30 Continue
D(1:N) = E(1:N) + A(1:N)
(4) Vector Reduction We have defined vector reduction instructions in Eqs. 3.6 and 3.7. A number of
reductions are defined in Fortran 90. In general, a vector reduction produces a scalar value from one or two
data arrays. Examples include the sum, product, maximum, and minimum of all the elements in a single array.
The dot product produces a scalar S = sum of A(i) * B(i) over i = 1 to n from two arrays A(1:n) and B(1:n). The loop
Do 40 I = 1, N
S1:   A(I) = B(I) + C(I)
S2:   S = S + A(I)
S3:   AMAX = MAX(AMAX, A(I))
40 Continue
has the following dependence relations: S1(=)S2, S1(=)S3, S2(<)S2, S3(<)S3. Although statements S2 and S3
each form a dependence cycle, each statement is recognized as a reduction operation and can be vectorized
as follows:
S1: A(1:N) = B(1:N) + C(1:N)
S2: S = S + SUM(A(1:N))
S3: AMAX = MAX(AMAX, MAXVAL(A(1:N)))
where SUM, MAX, and MAXVAL are all vector operations.
(5) Node Splitting The data dependence cycle can sometimes be broken by node splitting. Consider the
following loop:
Do 50 I = 2, N
S1:   T(I) = A(I-1) + A(I+1)
S2:   A(I) = B(I) + C(I)
50 Continue
Now we have the dependence cycle S1(<)S2 and S1(>)S2, which seems to prevent vectorization. However,
we can split statement S1 into two parts and apply statement reordering:
Do 50 I = 2, N
S1a:   X(I) = A(I+1)
S2:    A(I) = B(I) + C(I)
S1b:   T(I) = A(I-1) + X(I)
50 Continue
The new loop structure has no dependence cycle and thus can be vectorized:
S1a: X(2:N) = A(3:N+1)
S2:  A(2:N) = B(2:N) + C(2:N)
S1b: T(2:N) = A(1:N-1) + X(2:N)
It should be noted that node splitting cannot resolve all dependence cycles.
(6) Other Vector Optimizations There are many other methods for vectorization, and we do not intend to
discuss them all. For example, scalar variables in a loop can sometimes be expanded into dimensioned arrays
to enable vectorization. Subexpressions in a complex expression can be vectorized separately. Loops can be
peeled, unrolled, rerolled, or tiled (blocked) for vectorization.
Some machine-dependent optimizations can also be performed, such as strip mining (loop sectioning)
and pipeline chaining, introduced in Chapter 8. Sometimes a vector register can be used as an accumulator,
making it possible for the compiler to move loads and stores of the register outside the vector loop. The
movement of an operation out of a loop to a basic block preceding the loop is called hoisting, and the
inverse is called sinking. Vector loads and stores can be hoisted or sunk only when the array reference and
assignment have the same subscripts and all the subscripts are the induction variable of a vectorized loop or
loop constants.
Vectorization Inhibitors Listed below are some conditions inhibiting or preventing vectorization:
(1) Computed conditional statements such as IF statements which depend on runtime conditions.
(2) Multiple loop entries or exits (not basic blocks).
(3) Function or subroutine calls.
(4) Input/output statements.
(5) Recurrences and their variations.
A recurrence exists when a value calculated in one iteration of a loop might be referenced in another
iteration. This occurs when dependence cycles exist between loop iterations. In other words, there must be at
least one loop-carried dependence for a recurrence to exist. Any number of loop-independent dependences can
occur in a loop, but a recurrence does not exist unless that loop contains at least one loop-carried dependence.
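A minimal example of such a recurrence is a prefix-sum style loop, shown below for illustration (the array contents are arbitrary).

```python
# Loop-carried flow dependence: the value computed in iteration i-1 is
# referenced in iteration i, so the iterations cannot all run independently.
a = [1.0] * 8
for i in range(1, len(a)):
    a[i] = a[i] + a[i - 1]     # a(i) depends on a(i-1) from the previous iteration
print(a)
```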
Code Parallelization Parallel code optimization spreads a single program into many threads for parallel
execution by multiple processors. The purpose is to reduce the total execution time. Each thread is a sequence
of instructions that must execute on a single processor. The most often parallelized code structure is performed
over the outermost loop if dependence can be properly controlled and synchronized.
Consider the two-deep loop nest in Eq. 10.6. Because all the dependence relations have a "=" direction in
the I-loop, this outer loop can be parallelized with no need for synchronization between the loop iterations.
The parallelized code is as follows:
Doall I = 2, N
  Do J = 2, N
S1:     A(I,J) = (A(I,J-1) + A(I,J+1)) / 2
  Enddo
Endall
Each of the N - 1 iterations in the outer loop can be scheduled for a single processor to execute. Each
newly created thread consists of one entire J-loop with a constant index value for I. If dependence does exist
between the iterations, the Doacross construct can be used with proper synchronization among the iterations.
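The sketch below illustrates the Doacross idea in ordinary Python threads rather than in FX/Fortran or any particular runtime: each iteration does its independent work immediately, then waits only for the value produced by the previous iteration. Using one thread per iteration is an assumption made purely for clarity; a real system would map iterations onto a fixed set of processors.

```python
import threading

N = 8
a = [float(i) for i in range(N + 1)]
done = [threading.Event() for _ in range(N + 1)]
done[0].set()                          # iteration 1 has no predecessor to wait on

def iteration(i):
    local = 2.0 * i                    # work with no loop-carried dependence
    done[i - 1].wait()                 # synchronize on the previous iteration
    a[i] = a[i - 1] + local            # loop-carried (flow) dependence
    done[i].set()                      # signal the next iteration

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a)                               # deterministic despite parallel execution
```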
The following example shows five execution modes of a serial loop with various combinations of
parallelization and vectorization of the same program.
Example 10.8 Five execution modes of a FX/Fortran loop on the Alliant FX/80 multiprocessor (Alliant
Computer Systems Corporation, 1989)
FX/Fortran generates code to execute a simple Do loop in scalar, vector, scalar-concurrent, vector-
concurrent, and concurrent-outer/vector-inner (COVI) modes. The computations involved are performed
either over a one-dimensional data array A(1:2048) or over a two-dimensional data array B(1:8, 1:256),
where A(K) = B(I, J) for K = 8(J - 1) + I.
By using array A, the computations involved are expressed by a pure scalar loop:
Do K = 1, 2048
  A(K) = A(K) + S
Enddo
where S is a scalar constant. Figure 10.8a shows the scalar (serial) execution on a single processor in 30,616
clock cycles.
The same code can be vectorized into eight vector instructions and executed serially on a single processor
equipped with vector hardware. Each vector instruction works on 256 iterations of the following loop:
A(1:2048:256) = A(1:2048:256) + S
The total execution time is reduced to 6043 clock cycles, as shown in Fig. 10.8b.
The scalar-concurrent mode is shown in Fig. 10.8c. Eight processors are used in parallel, performing the
following scalar computations:
Doall I = 1, 8
  Do J = 1, 256
    B(I, J) = B(I, J) + S
  Enddo
Endall
Now, the total execution time is further reduced to 3992 clock cycles.
Figure 10.8d shows the vector-concurrent mode on eight processors, all equipped with vector hardware.
The following vector code is executed in parallel:
Doall K = 1, 8
  A(K:2040+K:8) = A(K:2040+K:8) + S
Endall
This vectorized execution by eight processors results in a total time of 960 clock cycles.
Finally, the same program can be executed in COVI mode. The inner loop executes in vector mode, and
the outer loop is executed in parallel mode. In Fortran 90 notation, we have:
B(1:8, 1:256) = B(1:8, 1:256) + S
The total execution time is now reduced to 756 clock cycles, the shortest among the five execution modes.
(e) COVI execution on eight processors in 756 cycles
Fig. 10.8 Five execution modes of a FX/Fortran loop on the Alliant FX/80 multiprocessor (Courtesy of Alliant
Computer Systems Corporation, 1989)
Inhibitors of Parallelization Most inhibitors of vectorization also prevent parallelization. Listed below
are some inhibitors of parallelization:
(1) Multiple entries or exits.
(2) Function or subroutine calls.
Example 10.9 Construction of a DAG for the inner loop kernel of the bubble sort program (S. Graham,
J. L. Hennessy, and J. D. Ullman, 1992)
Listed below are the statements contained in the basic block B7 of the bubble sort program in Fig. 10.7.
t8  := j - 1
t9  := 4 * t8        }  temp := A[j]
temp := A[t9]
t10 := j + 1
t11 := t10 - 1
t12 := 4 * t11       }  A[j] := A[j + 1]
t13 := A[t12]
t14 := j - 1
t15 := 4 * t14
A[t15] := t13
t16 := j + 1
t17 := t16 - 1       }  A[j + 1] := temp
t18 := 4 * t17
A[t18] := temp
The corresponding DAG representation of block B7 is shown in Fig. 10.9. For nodes with the same
operator, one or more names are labeled provided they consume the same operands (although they may use
different values at different times). The initial value of any variable x is denoted by x0, such as the A0, j0, and
temp0 at the leaf nodes.
Fig. 10.9 Directed acyclic graph representation of the basic block B7, the inner loop of the bubble sort
program in Examples 10.6 and 10.7 (Courtesy of S. Graham, J. L. Hennessy, and J. D. Ullman,
Course on Code Optimization and Code Generation, Western Institute of Computer Science, Stanford
University, 1991)
In order to construct a DAG systematically, an auxiliary table can be used to keep track of variables and
temporaries. The DAG construction process automatically detects common subexpressions and eliminates
them accordingly. Copy propagation can be used to compute only one of the variables in each class. The
construction process can easily discover the variables used or assigned and the nodes whose values can be
computed at compile time. Any node whose children are constants is itself a constant. One can also label the
edges in a DAG with the delays.
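The value-numbering sketch below illustrates how such a construction detects common subexpressions automatically: identical (operator, operand-node) triples are shared. The quadruple representation and helper names are assumptions for this sketch, not the book's notation, and array assignments are not handled.

```python
def build_dag(block):
    """block: list of (dest, op, arg1, arg2) quadruples."""
    nodes = {}       # (op, id1, id2) -> node id
    current = {}     # name (variable, temporary, or constant) -> node id
    node_list = []   # node id -> description, kept for inspection

    def node_of(name):
        if name not in current:                     # first use: create a leaf
            current[name] = len(node_list)
            node_list.append(('leaf', name))        # e.g. j0, 4, temp0
        return current[name]

    for dest, op, a1, a2 in block:
        key = (op, node_of(a1), node_of(a2))
        if key not in nodes:                        # new interior node
            nodes[key] = len(node_list)
            node_list.append(key)
        current[dest] = nodes[key]                  # dest now labels this node
    return current, node_list

# t9 and t15 of block B7 both compute 4 * (j - 1) and map to the same node:
b7 = [('t8', '-', 'j', '1'), ('t9', '*', '4', 't8'),
      ('t14', '-', 'j', '1'), ('t15', '*', '4', 't14')]
cur, _ = build_dag(b7)
print(cur['t9'] == cur['t15'])   # True
```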
List Scheduling A DAG represents the flow of instructions in a basic block. A topological sort can be used
to schedule the operations. Let READY be a buffer holding all nodes which are ready to execute. Initially,
the READY buffer holds all leaf nodes with zero predecessors. Schedule each node in READY as early as
possible, until it becomes empty. After all the predecessor (children) nodes are scheduled, the successor
(parent) node should be immediately inserted into the READY buffer.
With list scheduling, each interior node is scheduled after its children. Additional ordering constraints are
needed for a procedure call or assignment through a pointer. When the root nodes are reached, the schedule
is produced. The length of the schedule equals the critical path on the DAG. To determine the critical path,
both edge delays and nodal delays must be counted.
Some priority scheme can be used in selecting instructions from the READY buffer for scheduling.
For example, the seven interior nodes of the DAG in Fig. 10.9 can be scheduled as follows, based on the
topological order. In the case of two temporaries using the same node, we select the lower-numbered one.
The following sequential code results:
t12 := 4 * j
t8  := j - 1
t13 := A[t12]
t9  := 4 * t8
temp := A[t9]
A[t9]  := t13
A[t12] := temp
List scheduling schedules operations in topological order. There are no backtracks in the schedule. It is
considered the best among critical-path and branch-and-bound methods for microinstruction scheduling (Joseph Fisher,
1979). Variations of topological list scheduling do exist, such as introducing a priority function for ready
nodes, using top-down versus bottom-up direction, and using cycle scheduling as explained below. Whenever
possible, parallel scheduling of nodes should be exploited, of course, subject to data, control, and resource
dependence constraints.
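A minimal list-scheduling sketch is given below. It assumes unit delays and unlimited functional units, and it takes the node priorities as an input; these simplifications are mine and not part of the algorithm as described above, which also accounts for edge and nodal delays.

```python
import heapq

def list_schedule(preds, succs, priority):
    """preds/succs: dicts mapping node -> set of predecessor/successor nodes."""
    remaining = {n: len(preds[n]) for n in preds}
    ready = [(-priority[n], n) for n in preds if remaining[n] == 0]   # the leaves
    heapq.heapify(ready)
    schedule = []
    while ready:
        _, n = heapq.heappop(ready)      # highest-priority ready node first
        schedule.append(n)
        for s in succs[n]:               # a parent enters READY as soon as all
            remaining[s] -= 1            # of its children have been scheduled
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return schedule

# tiny example: node c depends on a and b
preds = {'a': set(), 'b': set(), 'c': {'a', 'b'}}
succs = {'a': {'c'}, 'b': {'c'}, 'c': set()}
print(list_schedule(preds, succs, {'a': 2, 'b': 1, 'c': 0}))   # ['a', 'b', 'c']
```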
Cycle Scheduling List scheduling is operation-based, which has the advantage that the highest-priority
operation is scheduled first. Another scheduling method for instructions in basic blocks is based on a cycle
scheduling concept in which "cycles" rather than "operations" are scheduled in order. Let READY be a
buffer holding nodes with zero unscheduled predecessors ready to execute in the current cycle. Let LEADER be a
buffer holding nodes with zero unscheduled predecessors but not ready in the current cycle (e.g. due to some
latency unfulfilled). The following cycle scheduling algorithm is modified from the list scheduling algorithm:
Current-cycle = 0
Loop until READY and LEADER are empty
    For each node n in READY (in decreasing priority order)
        Try to schedule n in the current cycle
        If successful, update READY and LEADER
    Increment Current-cycle by 1
End of loop
The advantages of cycle scheduling include simplicity of implementation for single-cycle resources, such
as in a superscalar processor. There is no need to keep records of resource usage, and it is also easier to keep
track of register lifetimes. It can be considered an improvement over the list scheduling scheme, which
may result in more idle cycles. LEADER provides another level of buffering. Nodes in LEADER that have
become ready should be immediately loaded into the READY queue.
Register Allocation Traditional instruction scheduling methods minimize the number of registers used,
which also reduces the degree of parallelism exploited. To optimize the code generated from a DAG, one can
convert it to a sequence of expression trees and then study optimization for the trees. The registers can be
allocated with instructions in the scheduling scheme.
In general, more registers would allow more parallelism. The above bottom-up scheduling methods
shorten register lifetimes for expression trees. A round-robin scheme can be used to allocate registers while
the schedule is being generated. Or one can assume an infinite number of registers to produce a schedule first
and then allocate registers and add spill code later. Another approach is to integrate register allocation with
scheduling by keeping track of the liveness of registers. When the remaining registers are greater than a given
threshold, one should maximize parallelism. Otherwise, one should reduce the number of registers allocated.
Register allocation can be optimized by register descriptors (or tags) to distinguish among constant,
variable, indexed variable, frame pointer, etc. Such tagged registers may enable some additional local or global
code optimizations. Another advanced feature is special branch handling, such as delayed branches or using
shorter delay slots if possible.
Code generation can be improved with a better instruction selection scheme. We can first generate code for
expression trees needing no register spills. One can also select instructions by recursively matching templates
to parts of expression trees. A match causes code to be generated and the subtree to be rewritten with a subtree
for the result. The process ends when the subtree is reduced to a single node. When template matching fails,
heuristics can be used to generate subgoals for another matching. The key ideas of instruction selection by
pattern matching include:
(1) Convert code generation to primarily a systematic process.
(2) Use tree-structured patterns describing instructions and use a tree-structured intermediate form.
(3) Select instructions by covering the input with instruction patterns.
Major issues in pattern-based instruction selection include the development of pattern matching algorithms,
the design of intermediate forms and target machine descriptions, and the interaction with low-level optimization.
Extensive table descriptions are needed. Therefore table compression techniques are needed to handle the large
numbers of patterns to be matched.
Advanced code generation needs to be directed toward the exploitation of parallelism. Therefore special
compilation support is needed for superscalar and multithreaded processors. These are still open research
problems. Partial solutions to these problems can be found in subsequent sections. There is still a long way
to go in developing really "intelligent" compilers for parallel computers.
Fig. 10.10 Code compaction for VLIW processor based on trace scheduling developed by Joseph Fisher
(1981)
Code Compaction Independent instructions in the trace are compacted into VLIW instructions. Each
VLIW word can be packed with multiple short instructions which can be independently executed in parallel.
The decoding of these independent instructions is carried out simultaneously. Multiple function units (such
as the memory-access unit, arithmetic unit, branch unit, etc.) are employed to carry out the parallel execution.
Only independent operations are packed into a VLIW instruction.
Branch prediction is based on software heuristics or on using profiles of previous program executions.
Each trace should correspond to the most likely execution path. The first trace should be the most likely one,
the second trace should be the second most likely one, and so on. In fact, reduction in the execution time of
an earlier trace is obtained at the expense of that of later traces. In other words, the execution time of likely
traces is reduced at the expense of that of unlikely traces.
Compensation Code The effectiveness of trace scheduling depends on correct predictions at successive
branches in a program. To cope with the problem of code movement followed by incorrect prediction,
compensation codes are added to off-trace paths to provide correct linkage with the rest of the program. Because
code compaction may move short instructions up or down in the program, different compensation codes must
be inserted to restore the original code distribution in the code blocks involved.
Compensation codes (in shaded boxes in Fig. 10.11) are needed on off-trace paths. This will make
the second trace execute correctly, without being affected by the first trace due to code movement. The
compensation code added should be a small portion of the code blocks. Sometimes the Undo operation
cannot be used due to the lack of an inverse operation in the instruction set. By adding compensation code,
software is performing the function of the branch history buffer described in Chapter 6.
The efficiency of trace scheduling depends on the degree of correct prediction of branches. With accurate
branch predictions, the compensation code added may not be executed at all. The fewer the number of most
likely traces to be executed, the better the performance will be.
Trace scheduling was mainly designed for VLIW processors. For superscalar processors, similar techniques
can exploit parallelism in a program whose branch behavior is relatively easier to predict, such as in some
scientific applications.
(Fig. 10.11, referenced above: panel (a) shows the first trace on the flow graph; compensation code appears
in shaded boxes on the off-trace paths.)
10.5.1 Loop Transformation Theory
Parallelizing loop nests is one of the most fundamental program optimization techniques demanded in
a vectorizing and parallelizing compiler. In this section, we study a loop transformation theory and loop
transformation algorithms derived from this theory. The bulk of the material is based on the work by Wolf
and Lam (1991).
The theory unifies all combinations of loop interchange, skewing, reversal, and tiling as unimodular
transformations. The goal is to maximize the degree of parallelism or data locality in a loop nest. These
transformations also support efficient use of the memory hierarchy on a parallel machine.
Elementary Transformations A loop transformation rearranges the execution order of the iterations in a
loop nest. Three elementary loop transformations are introduced below.
(1) Permutation: A permutation d on a loop nest transforms iteration (p1, ..., pn) to (p_d1, ..., p_dn). This
transformation can be expressed in matrix form as I_d, the n x n identity matrix with rows permuted by
d. For n = 2, a loop interchange transformation maps iteration (i, j) to iteration (j, i). In matrix notation,
we can write this as

    | 0  1 | | i |   | j |
    | 1  0 | | j | = | i |
The following two-deep loop nest is transformed using the permutation matrix [0 1; 1 0], as
depicted in Fig. 10.12a.

Do i = 1, N                           Do j = 1, N
  Do j = 1, N                           Do i = 1, N
    A(j) = A(j) + C(i,j)       =>         A(j) = A(j) + C(i,j)
  Enddo                                 Enddo
Enddo                                 Enddo
(2) Reversal: Reversal of the ith loop is represented by the identity matrix with the ith element on the
diagonal equal to -1. The following loop nest is reversed using the transformation matrix [1 0; 0 -1],
as depicted in Fig. 10.12b.

Do i = 1, N                           Do i = 1, N
  Do j = 1, N                           Do j = -N, -1
    A(i,j) = A(i-1, j+1)       =>         A(i,-j) = A(i-1, -j+1)
  Enddo                                 Enddo
Enddo                                 Enddo
(3) Skewing: Skewing loop Ij by an integer factor f with respect to loop Ii maps iteration (p1, ..., p_i-1, p_i,
p_i+1, ..., p_j-1, p_j, p_j+1, ..., p_n) to (p1, ..., p_i-1, p_i, p_i+1, ..., p_j-1, p_j + f*p_i, p_j+1, ..., p_n).
In the following loop nest, the transformation performed is a skew of the inner loop with respect to the outer loop.
(Fig. 10.12, panel (a): Loop permutation (interchange of the i and j loops).)
Fig. 10.12 Loop transformations performed in Example 10.10 (Courtesy of Monica Lam, WTSC Tutorial Notes,
Stanford University, 1992)
Various combinations of the above elementary loop transformations can be defined. Wolf and Lam
have called these unimodular transformations. The optimization problem is thus to find the unimodular
transformation that maximizes an objective function given a set of schedule constraints.
Transformation Matrices Unimodular transformations are defined by unimodular matrices. A unimodular
matrix has three important properties. First, it is square, meaning that it maps an n-dimensional iteration
space into an n-dimensional iteration space. Second, it has all integer components, so it maps integer vectors
to integer vectors. Third, the absolute value of its determinant is 1.
Because of these properties, the product of two unimodular matrices is unimodular, and the inverse of a
unimodular matrix is unimodular, so that combinations of unimodular loop transformations and the inverses
of unimodular loop transformations are also unimodular loop transformations. A loop transformation is said
to be legal if the transformed dependence vectors are all lexicographically positive.
A compound loop transformation can be synthesized from a sequence of primitive transformations, and
the effect of the loop transformation is represented by the product of the various transformation matrices for
each primitive transformation. The major issues for loop transformation include how to apply a transform,
the correctness or legality of applying it, and the desirability or advantages of applying it. Wolf and Lam (1991) have
stated the following conditions for unimodular transformations:
(1) Let D be the set of distance vectors of a loop nest. A unimodular transformation (matrix) T is legal if
and only if, for every d in D, the transformed vector is lexicographically positive:
        T d > 0 (lexicographically)                                  (10.7)
(2) Loops i through j of a nested computation with dependence vectors D are fully permutable if, for every
d in D, either (d1, ..., d_i-1) is lexicographically positive or all of d_i, ..., d_j are nonnegative.
Proofs of these two conditions are left as exercises for the reader. The following example shows how to
determine the legality of unimodular transformations.
Do i = 1, N
  Do j = 1, N
    A(i,j) = f(A(i,j), A(i+1, j-1))
  Enddo
Enddo
This code has the dependence vector d = (1, -1). The loop interchange transformation is represented by
the matrix

    T = | 0  1 |
        | 1  0 |

Applying T to d gives

    T d = | 0  1 | |  1 |   | -1 |
          | 1  0 | | -1 | = |  1 |

which is not lexicographically positive. Hence the interchange is not a legal transformation for this loop nest.
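Condition (10.7) is easy to check mechanically. The sketch below (helper names are illustrative) applies a candidate unimodular matrix to every distance vector and tests lexicographic positivity; it reproduces the interchange result above and, for comparison, checks a reversal of the j loop under an assumed set of distance vectors.

```python
def lex_positive(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False                  # the zero vector is not positive

def matvec(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

def legal(T, D):
    # legal iff every transformed distance vector is lexicographically positive
    return all(lex_positive(matvec(T, d)) for d in D)

interchange = [[0, 1], [1, 0]]
print(legal(interchange, [(1, -1)]))   # False: T*d = (-1, 1), as shown above
reverse_j = [[1, 0], [0, -1]]
print(legal(reverse_j, [(1, -1)]))     # True: T*d = (1, 1)
```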
(1) Canonical form. Loops with distance vectors have the special property that they can always be
transformed into a fully permutable nest via skewing. It is easy to determine how much to skew an
inner loop with respect to an outer one to make these loops fully permutable. For example, if a doubly
nested loop has dependences {(0, 1), (1, -2), (1, -1)}, then skewing the inner loop by a factor of 2 with
respect to the outer loop produces {(0, 1), (1, 0), (1, 1)}.
(2) Parallelization process. Iterations of a loop can execute in parallel if and only if no dependences are
carried by that loop. Such a loop is called a Doall loop. To maximize the degree of parallelism is to
transform the loop nest to maximize the number of Doall loops.
Let (I1, ..., In) be a loop nest with lexicographically positive dependences d in D. Ii is parallelizable if and
only if, for every d in D, either (d1, ..., d_i-1) > (0, ..., 0), the zero vector, or d_i = 0. Once the loops are made fully permutable,
the steps to generate Doall parallelism are simple. In the following discussion, we show that loops in
canonical form can be trivially transformed to produce both fine- and coarse-grain parallelism.
Fine-Grain Wavefronting A nest of n fully permutable loops can be transformed into code containing at
least (n - 1) degrees of parallelism. In the degenerate case where no dependences are carried by these n loops,
the degree of parallelism is n. Otherwise, (n - 1) parallel loops can be obtained by skewing the innermost
loop in the fully permutable nest by each of the other loops and moving the innermost loop to the outermost
position.
This transformation, called a wavefront transformation, is represented by the following matrix:

        | 1  1  ...  1  1 |
        | 1  0  ...  0  0 |
    T = | 0  1  ...  0  0 |                                  (10.9)
        | .  .  ...  .  . |
        | 0  0  ...  1  0 |
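Construction of the matrix in Eq. (10.9) is mechanical, as the short sketch below shows (the helper name is illustrative): the first row is all ones, and the remaining rows are the first n - 1 rows of the identity matrix.

```python
def wavefront_matrix(n):
    T = [[1] * n]                 # first row: all ones
    for i in range(n - 1):        # remaining rows: first n-1 rows of the identity
        row = [0] * n
        row[i] = 1
        T.append(row)
    return T

for row in wavefront_matrix(4):
    print(row)
# [1, 1, 1, 1]
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
```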
Fine-grain parallelism is exploited on vector machines, superscalar processors, and systolic arrays. The
following example shows the entire process of loop parallelization exploiting fine-grain parallelism. The
process includes skewing and wavefront transformation.
Example 10.11 Loop skewing and wavefront transformation (Michael Wolf and Monica Lam, 1991)
Figure 10.13a shows the iteration space and dependences of a source loop nest. The skewed loop nest is shown
in Fig. 10.13b after applying the matrix T = [1 0; 1 1]. Figure 10.13c shows the result of applying the wavefront
transformation to the skewed loop code, which is the result of first skewing the innermost loop to make the
two-dimensional loop nest fully permutable and then applying the wavefront transformation to create one
degree of parallelism.
(Fig. 10.13, panels: (a) extract dependence information from the source loop nest, D = {(0,1), (1,0), (1,-1)};
(b) skew to make the inner loop nest fully permutable, D' = TD = {(0,1), (1,1), (1,0)}; (c) wavefront
transformation on the skewed loop nest, D' = TD = {(1,0), (2,1), (1,1)}.)
Fig. 10.13 Fine-grain parallelization by loop skewing and wavefront transformation in Example 10.11
(Courtesy of Wolf and Lam; reprinted from IEEE Trans. Parallel Distributed Systems, 1991)
There are no dependences between iterations within the innermost loop. The transform is a wavefront
transformation because it causes iterations along the diagonal of the original loop nest to execute in parallel.
This wavefront transformation automatically places the maximum number of Doall loops in the innermost loops,
maximizing fine-grain parallelism. This is the appropriate transformation for superscalar or VLIW machines.
Although these machines have a low degree of parallelism, finding multiple parallelizable loops is still useful.
Coalescing multiple Doall loops prevents the pitfall of parallelizing only a loop with a small iteration count.
Coarse-Grain Parallelism For MIMD coarse-grain multiprocessors, having as many outermost Doall
statements as possible reduces the synchronization overhead. A wavefront transformation produces the
maximum degree of parallelism but makes the outermost loop sequential if any loop must be sequential. For example, consider
the following loop nest:
Do i = 1, N
  Do j = 1, N
    A(i,j) = f(A(i-1, j-1))
  Enddo
Enddo
This loop nest has the dependence (1, 1), and so the outermost loop is sequential and the innermost
loop is a Doall. The wavefront transformation does not change this. In contrast, the unimodular
transformation

    | 1  -1 |
    | 0   1 |

transforms the dependence to (0, 1), making the outer loop a Doall and the inner
loop sequential.
In this example, the dimensionality of the iteration space is two, but the dimensionality of the space
spanned by the dependence vectors is only one. When the dependence vectors do not span the entire iteration
space, it is possible to perform a transformation that makes the outermost loops Doall loops.
A heuristic though nonoptimal approach for making loops Doall is simply to identify loops Ii such that all
d_i are zero. Those loops can be made outermost Doall loops. The remaining loops in the tile can be wavefronted to
obtain the remaining parallelism.
Loop parallelization can be achieved through unimodular transformations as well as tiling. For loops with
distance vectors, n-deep loops have at least (n - 1) degrees of parallelism. The loop parallelization algorithm
has a common step for fine- and coarse-grain parallelism in creating an n-deep fully permutable loop nest by
skewing. The algorithm can be tailored for different machines based on the following guidelines:
• Move the Doall loop innermost (if one exists) for fine-grain machines. Apply a wavefront transformation
to create up to (n - 1) Doall loops.
• Create outermost Doall loops for coarse-grain machines. Apply tiling to a fully permutable loop nest.
• Use tiling to create loops for both fine- and coarse-grain machines.
Fig. 10.14 Tiling of the skewed loops for parallel execution on a coarse-grain multiprocessor (Courtesy of
Wolf and Lam; reprinted from IEEE Trans. Parallel Distributed Systems, 1991)
Tiling can therefore increase the granularity of synchronization and data are often reused within a tile.
Without tiling, when a Doall loop is nested within a non-Doall loop, all processors must be synchronized with
a barrier at the end of each Doall loop.
Using tiling, we can reduce the synchronization cost in the following two ways. First, instead of applying
the wavefront transformation to the loops in canonical form, we first tile the loops and then apply a wavefront
transformation to the controlling loops of the tiles. In this way, the synchronization cost is reduced by the size
of the tile. Certain loops cannot be represented with distances. Direction vectors can be used to represent these
loops. The idea is to represent direction vectors as an infinite set of distance vectors.
Locality optimization in user programs is meant to reduce memory-access penalties. Software pipelining
can reduce the execution time. Both are desired improvements in the performance of parallel computers.
Program locality can be enhanced with loop interchange, reversal, tiling, and prefetching techniques. The
effort requires a reuse analysis of a "localized" iteration space. Software pipelining relies heavily on sufficient
support from a compiler working effectively with the scheduler.
The fetch of successive elements of a data array is pipelined for an interleaved memory. In order to reduce
the access latency, the loop nest can interchange its indices so that a long vector is moved into the innermost
loop, which can be more effective with pipelined loads. Loop transformations are performed to reuse the data
as soon as possible or to improve the effectiveness of data caches.
Prefetching is often practiced to hide the latency of memory operations. This includes the insertion of
special instructions to prefetch data from memory to cache. This will enhance the cache hit ratio and reduce
the register occupation rate with a small prefetch overhead. In scientific codes, data prefetching is rather
important. Instruction prefetching, as studied in Chapter 6, is often practiced in modern processors using
prefetch buffers and an instruction cache. In the following discussion, we concentrate on data prefetching
techniques.
Tiling for Locality Blocking or tiling is a well-known technique that improves the data locality of numerical
algorithms. Tiling can be used for different levels of the memory hierarchy such as physical memory, caches,
and registers; multilevel tiling can be used to achieve locality at multiple levels of the memory hierarchy
simultaneously.
To illustrate the importance of tiling, consider the example of matrix multiplication:
Do i = 1, N
  Do j = 1, N
    Do k = 1, N
      C(i, k) = C(i, k) + A(i, j) * B(j, k)
    Enddo
  Enddo
Enddo
In this code, although the same rows of C and B are reused in the next iteration of the middle and outer
loops, respectively, the large volume of data used in the intervening iterations may replace the data from the
register file or the cache before they can be reused. Tiling reorders the execution sequence such that iterations
from loops of the outer dimensions are executed before all the iterations of the inner loop are completed. The
tiled matrix multiplication is:
Do l = 1, N, s
  Do m = 1, N, s
    Do i = 1, N
      Do j = l, min(l + s - 1, N)
        Do k = m, min(m + s - 1, N)
          C(i, k) = C(i, k) + A(i, j) * B(j, k)
        Enddo
      Enddo
    Enddo
  Enddo
Enddo
Tiling reduces the number of intervening iterations and thus data fetched between data reuses. This allows
reused data to still be in the cache or register file and hence reduces memory accesses. The tile size s can be
chosen to allow the maximum reuse for a specific level of the memory hierarchy. For example, the tile size is
relevant to the cache size used or the register file size used.
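The sketch below renders the tiled loop nest above in Python so that it can be checked against the untiled triple loop; the tile size s is a tunable assumption that would normally be derived from the cache or register-file capacity.

```python
def tiled_matmul(A, B, n, s):
    C = [[0.0] * n for _ in range(n)]
    for l in range(0, n, s):                    # tile the j dimension
        for m in range(0, n, s):                # tile the k dimension
            for i in range(n):
                for j in range(l, min(l + s, n)):
                    for k in range(m, min(m + s, n)):
                        C[i][k] += A[i][j] * B[j][k]
    return C

# quick check against the untiled triple loop
n, s = 6, 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
ref = [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)] for i in range(n)]
assert tiled_matmul(A, B, n, s) == ref
```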
(Plot: Mflops versus number of processors (1 to 8) for four cases: no tiling, register tiling, cache tiling, and
best tiling.)
Fig. 10.15 Performance of a 500 x 500 double precision matrix multiplication on the SGI 4D/380. Cache
tiles are 64 x 64 iterations and register tiles are 4 x 1 (Courtesy of Wolf and Lam; reprinted from
ACM SIGPLAN Conf. Programming Language Design and Implementation, Toronto, Canada, 1991)
The improvement obtained from tiling can be far greater than that obtained from traditional compiler
optimizations. Figure 10.15 shows the performance of 500 x 500 matrix multiplication on an SGI 4D/380
machine consisting of eight MIPS R3000 processors running at 33 MHz. Each processor has a 64-Kbyte
direct-mapped first-level cache and a 256-Kbyte direct-mapped second-level cache. Results from four
different experiments are reported: without tiling, tiling to reuse data in caches, tiling to reuse data in registers,
and tiling for both registers and caches. For cache tiling, the data are copied into consecutive locations to avoid
cache interference.
Tiling improves the performance on a single processor by a factor of 2.75. The effect of tiling on multiple
processors is even more significant since it reduces not only the average data-access latency but also the
required memory bandwidth. Without cache tiling, contention over the memory bus limits the speedup to
about 4.5 times. Cache tiling permits speedups of over 7 for eight processors, achieving an overall speed of
64 Mflops when combined with register tiling.
Localized Iteration Space Reuse of vector space offers opportunities for locality optimization. However,
reuse does not imply locality. For example, if reuse does not occur soon enough, it may miss the temporal
locality. Therefore, the idea is to make reuse happen soon enough. A localized iteration space contains the
iterations that can exploit reuse. In fact, tiling increases the number of dimensions in which reuse can be
exploited.
Consider the following two-deep nest. Reuse is exploited for loops i and j only, which form the localized
iteration space:

Do i = 1, N
  Do j = 1, N
    B(i, j) = f(A(i), A(j))
  Enddo
Enddo

Reference A(j) touches different data within the inner loop but reuses the same elements across the outer
loop. More precisely, the same data A(j) is used in iterations (i, j), 1 <= i <= N. There is reuse, but the reuse is
separated by accesses to N - 1 other data. When N is large, the data is removed from the cache before it can be
reused, and there is no locality. Therefore, a reuse does not guarantee locality. The tiled code is shown below:
Do l = 1, N, s
  Do i = 1, N
    Do j = l, min(l + s - 1, N)
      B(i, j) = f(A(i), A(j))
    Enddo
  Enddo
Enddo
We choose the tile size such that the data used within the tile can be held within the cache. For this
example, as long as s is smaller than the cache size, A(j) will still be present in the cache when it is reused.
Thus, reuse is exploited for loops i and j only, and so the localized iteration space includes only these loops.
In general, if n is the first loop with a large bound, counting from innermost to outermost, then reuse
occurring within the inner n loops can be exploited. Therefore the localized vector space of a tiled loop is
simply that of the innermost tile, whether the tile is coalesced or not.
Obviously, memory optimizations are important. Locality can be upheld by intersecting the localized
vector space with the reuse vector space. In other words, reuse directs the search for unimodular and tiling
transformations. One should use locality information to eliminate unnecessary prefetches.
Cycle    Iteration 1    Iteration 2    Iteration 3    Iteration 4
  1      Read
  2      Mul
  3                     Read
  4                     Mul
  5      Add                           Read
  6                                    Mul
  7                     Add                           Read
  8      Write                                        Mul
  9                                    Add
 10                     Write
 11                                                   Add
 12                                    Write
 13
 14                                                   Write
Four iterations of the software-pipelined code are shown. Although each iteration requires 8 cycles to flow
through the pipeline, the four overlapped iterations require only 14 clock cycles to execute. Compared with
the nonpipelined execution, a speedup factor of 24/14 = 1.7 is achieved with the pipelining of four iterations.
N iterations require 2N + 6 cycles to execute with the pipeline. Thus, a speedup factor of 6N/(2N + 6) is
achieved. As N approaches infinity, a speedup factor of 3 is expected. This shows the advantage of software
pipelining, if other overhead is ignored.
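The cycle counts claimed above can be checked quickly, under the stated assumptions of 6 cycles per iteration without pipelining, 8 cycles for one iteration to flow through the software pipeline, and a new iteration initiated every 2 cycles (the helper names are illustrative).

```python
def pipelined_cycles(n, initiation=2, drain=6):
    return initiation * n + drain            # 2N + 6

def speedup(n, serial_per_iter=6):
    return serial_per_iter * n / pipelined_cycles(n)

print(pipelined_cycles(4))          # 14 cycles for four overlapped iterations
print(round(speedup(4), 2))         # 1.71, i.e. 24/14
print(round(speedup(10**6), 3))     # approaches 3 for large N
```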
Doacross Loops Unlike unrolling, software pipelining can give optimal results. Locally compacted code
may not be globally optimal. The Doall loops can fill arbitrarily long pipelines with infinite iterations. In the
following Doacross loop with dependence between iterations, software pipelining can still be done but is
harder to implement:

Doacross I = 1, N
  A(I) = A(I) * B
  Sum = Sum - A(I)
Enddo

The software-pipelined code is shown below:

Cycle    Iteration 1    Iteration 2
  1      Read
  2      Mul
  3      Add            Read
  4      Write          Mul
  5                     Add
  6                     Write
It is assumed that one memory access and one arithmetic operation can be concurrently executed on
the two-issue superscalar processor in each cycle. Thus recurrences can also be parallelized with software
pipelining.
As in the hardware pipeline scheduling in Chapter 6, the objective of software pipelining is to minimize
the interval at which iterations are initiated; i.e. the initiation latency determines the throughput for the loop.
The basic units of scheduling are minimally indivisible sequences of microinstructions. In the above Doall
loop example, the initiation latency is two cycles per iteration, and the Doacross loop is also pipelined with
an initiation latency of two cycles.
To summarize, software pipelining demands maximizing the throughput by reducing the initiation latency,
as well as producing small, compacted code. The trick is to find an identical schedule for every
iteration with a constant initiation latency. The scheduling problem is tractable if every operation takes unit
execution time and no cyclic dependences exist in the loop.
Summary
A programming model is a collection of program abstractions which present the programmer with a
well-defined view of the software and hardware system. Parallel programming models are defined for
the various types of parallel architectures which we have studied in the earlier chapters of the book.
We started this chapter with a study of the parallel programming models which have become well-
established, namely: the shared variable model, message-passing model, data parallel model, object-oriented
model, and functional and logic models.
For any given model for parallel programming, the user needs to be provided with a parallel programming
environment, which consists of parallel languages, compilers, support tools for program development, and
runtime support. The programming environment must provide specific features for parallelism aimed at
optimization, availability, synchronization and communication, control of parallelism, data parallelism, and
process management. Parallel language constructs are needed in the programming language, of which a
few examples have been presented in this chapter. And the compiler must be capable of optimizing the
machine code generated for the type of parallelism available in hardware.
Vectorizing or parallelizing compilers can, in theory, detect and exploit the potential parallelism which
is present in a sequential program. In this process, dependence analysis of data arrays can reveal the
presence or absence of dependences between successive references to array elements in a loop, or in
nested loops. In general, two operations can be carried out in parallel only if there is no data or control
dependence between them. We reviewed some specific techniques for dependence analysis of data arrays,
such as iteration space analysis, subscript separability and partitioning, and categorized dependence tests.
Optimization of the machine code generated by the compiler, and cycle-by-cycle scheduling of
machine instructions for execution on the processor, are both critical to achieving high performance
computing. Local optimization can be carried out within basic blocks, but in general both local and global
optimizations are required. We studied several vectorization and parallelization methods, such as the use
of temporary storage, loop interchanging, loop distribution, vector reduction, and node splitting.
Code generation and scheduling make use of directed acyclic graphs of operations within basic blocks,
and should utilize a register allocation strategy which does not inhibit parallel execution of instructions.
Trace-scheduling compilation makes use of program traces obtained from multiple previous executions
of the same program.
Loop transformations may in general be required prior to parallelization and/or vectorization of
program code. Permutation, reversal, skewing, and transformation matrices are some of the specific
techniques which can be applied. Wavefronting can be useful in exploiting fine-grain parallelism, while
tiling can help achieve locality and reduce synchronization costs in coarse-grain computations. Software
pipelining of loop iterations is another possible technique to parallelize a sequential program.
Exercises
Problem 10.1 Explain the following terms associated with message-passing programming of
multicomputers:
(a) Synchronous versus asynchronous message-passing schemes.
(b) Blocking versus nonblocking communications.
(c) The rendezvous concept introduced in the Ada programming system.
(d) Name-addressing versus channel-addressing schemes for message passing.
(e) Uncoupling between sender and receiver using buffers or mailboxes.
(f) Lost-message handling and interrupt-message handling.

Problem 10.2 Concurrent object-oriented programming was introduced in Section 10.1.4. Chain
multiplication of a list of numbers was illustrated in Example 10.2 based on a divide-and-conquer strategy.
A fragment of a Lisp-like code for multiplying the sequence of numbers is given below:
... the program sequentially on a uniprocessor system. The chain should be sufficiently long to see the
difference.

Problem 10.3 Gaussian elimination with partial pivoting was implemented by [Quinn90] and Hatcher
in C* code on the Connection Machine, as well as in concurrent C code on an nCUBE 3200 multicomputer.
(a) Discuss the translation/compiler effort from C* to C on the two machines after a careful reading of
the paper by Quinn and Hatcher.
(b) Comment on the SPMD (single program and multiple data streams) programming style as opposed
to the SIMD programming style, in terms of synchronization implementation and related performance
issues.
(c) Repeat the program conversion experiments for a fast Fourier transform (FFT) algorithm. Perform
the program conversion manually at the algorithm level using pseudo-codes with parallel constructs.
INTRODUCTION
The period between the 1970s and the 1990s saw a great many innovative ideas being proposed
in computer architecture. The basic hardware technology of computers had been mastered by
the 1960s, and several companies had produced successful commercial products. The time was therefore
right to generate new ideas, to reach performance levels higher than that of the original single-processor
systems. As we have seen, parallelism in its various forms has played a central role in the development of
newer architectures.
The earlier part of this book has presented a comprehensive overview of the many architectural innovations
which had been attempted until the early 1990s. Some of these were commercially successful, while many
others were not so fortunate, which is not at all surprising, given the large variety of ideas which were
proposed and the fast-paced advances taking place in the underlying technologies.
In the last two chapters of the book, we take a look at some of the recent trends and developments
in computer architecture, including, as appropriate, a brief discussion of advances in the underlying
technologies which have made these developments possible. In fact, we shall see that the recent advances
in computer architecture can be understood only when we also take a look at the underlying technologies.
System performance is the key benchmark in the study of computer architecture. A computer system
must solve the real world problem, or support the real world application, for which the user is installing
it. Therefore, in addition to the theoretical peak performance of the processor, the design objectives of any
computer architecture must also include other important criteria, which include system performance under
realistic load conditions, scalability, price, usability, and reliability. In addition, power consumption and physical size are also often important criteria.
A basic rule of system design is that there should be no performance bottlenecks in the system. Typically, a performance bottleneck arises when one part of the system—i.e. one of its subsystems—cannot keep up with the overall throughput requirements of the system. Such a performance bottleneck can occur in a production system, a distribution system, or even in a traffic system(1). If a performance bottleneck does occur in a system—i.e. if one subsystem is not able to keep up with other subsystems—then the other subsystems remain idle, waiting for response from the slower one.

In a computer system, the key subsystems are processors, memories, I/O interfaces, and the data paths connecting them. Within the processors, we have subsystems such as functional units, registers, cache memories, and internal data buses. Within the computer system as a whole—or within a single processor—designers do not wish to create bottlenecks to system performance.
Example 12.1 Performance bottleneck in a system

In Fig. 12.1 we see the schematic diagram of a simple computer system consisting of four processors, a large shared main memory, and a processor-memory bus.

Fig. 12.1 A simple computer system: four processors, shared main memory, and the processor-memory bus
(1) In common language, we say that a chain is only as strong as its weakest link.
This system exhibits a performance mismatch between the processors, main memory, and the processor-memory bus. The data transfer rates supported by the main memory and the shared processor-memory bus do not meet the aggregate requirements of the four processors in the system.

The system architect must pay careful attention to all such potential mismatches in system design. Otherwise, the sustained performance which the system can deliver can only equal the performance of the slowest part of the system—i.e. the bottleneck.

While this is a simple example, it illustrates the key challenge facing system designers. It is clear that, in the above system, if processor performance is improved by, say, 20%, we may not see a matching improvement in system performance, because the performance bottleneck in the system is the relatively slower processor-memory bus. In this particular case, a better investment for increased system performance could be (a) a faster processor-memory bus, and (b) improved cache memory with each processor, i.e. one with a better hit rate—which reduces contention for the processor-memory bus.
In fact, as we shall see, even achieving peak theoretical performance is not the final goal of system design. The system performance must be maintained for real-life applications, and that too in spite of the enormous diversity in modern applications.

In earlier chapters of the book, we have studied the many ways in which parallelism can be introduced in a computer system, for higher processing performance. The concept of instruction level parallelism and superscalar architecture has been introduced in Chapter 6. In this chapter, we take a more detailed look at instruction level parallelism.
(b) supporting multiple instruction streams on the processor in multi-core and/or multi-threading mode?

This design choice is also related to the depth of the instruction pipeline. In general, designs which aim to maximize the exploitation of instruction level parallelism need deeper pipelines; up to a point, such designs may support higher clock rates. But, beyond a point, deeper pipelines do not necessarily provide higher net throughput, while power consumption rises rapidly with clock rate, as we shall also discuss in Chapter 13.
Let us examine the trade-off involved in this context in a simplified way:

total chip area = number of cores × chip area per core

or

total transistor count = number of cores × transistor count per core

Here we have assumed for simplicity that cache and interconnect area—and transistor count—can be considered proportionately on a per core basis.
At a given time, VLSI technology limits the left hand side in the above equations, while the designer must select the two factors on the right. Aggressive exploitation of instruction level parallelism, with multiple functional units and more complex control logic, increases the chip area—and transistor count—per processor core. Alternatively, for a different category of target applications, the designer may select simpler cores, and thereby place a larger number of them on a single chip.

Of course system design would involve issues which are more complex than these, but a basic design issue is seen here: For the targeted application and performance, how should the designers divide available chip resources among processors and, within a single processor, among its various functional elements?
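To make the trade-off concrete, the following minimal sketch works through the second equation with purely illustrative numbers; the transistor budget and the per-core costs below are assumptions for the example, not data for any real processor.

```c
#include <stdio.h>

/* Illustrative use of: total transistor count = cores x transistors per core.
 * All figures are made-up round numbers chosen only to show the arithmetic. */
int main(void) {
    long long budget       = 2000000000LL;  /* assumed total transistor budget          */
    long long complex_core =  500000000LL;  /* assumed cost of an aggressive ILP core   */
    long long simple_core  =  125000000LL;  /* assumed cost of a much simpler core      */

    printf("complex cores that fit: %lld\n", budget / complex_core);  /* 4  */
    printf("simple cores that fit:  %lld\n", budget / simple_core);   /* 16 */
    return 0;
}
```

Under these assumed numbers, the same budget buys either four aggressive cores or sixteen simple ones; which choice is better depends on the targeted applications, exactly as argued above.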
Within a processor, a set of instructions are in various stages of execution at a given time—within the pipeline stages, functional units, operation buffers, reservation stations, and so on. Recall that functional units themselves may also be internally pipelined. Therefore machine instructions are not in general executed in the order in which they are stored in memory, and all instructions under execution must be seen as 'work in progress'.

As we shall see, to maintain the work flow of instructions within the processor, a superscalar processor makes use of branch prediction—i.e. the result of a conditional branch instruction is predicted even before the instruction executes—so that instructions from the predicted branch can continue to be processed, without causing pipeline stalls. The strategy works provided fairly good branch prediction accuracy is maintained.

But we shall assume that instructions are committed in order. Here committing an instruction means that the instruction is no longer 'under execution'—the processor state and program state reflect the completion of all operations specified in the instruction.

Thus we assume that, at any time, the set of committed instructions correspond with the program order of instructions and the conditional branches actually taken. Any hardware exceptions generated within the processor must reflect the processor and program state resulting from instructions which have already committed.
Parallelism which appears explicitly in the source program, which may be dubbed structured parallelism, is not directly related to instruction level parallelism. Parallelism detected and exploited by the compiler is a form of instruction level parallelism, because the compiler generates the machine instructions which result in parallel execution of multiple operations within the processor. We shall discuss in Section 12.5 some of the main issues related to this method of exploiting instruction level parallelism.
Parallelism detected and exploited by processor hardware on the fly, within the instructions which are under execution, is certainly instruction level parallelism. Much of the remaining part of this chapter discusses the basic techniques for hardware detection and exploitation of such parallelism, as well as some related design trade-offs.

While the student is expected to be familiar with the basic concepts related to instruction pipelines, the earlier discussion of these topics in Chapter 6 will serve as an introduction to the techniques discussed more fully in this chapter.

Weak memory consistency models, which are discussed elsewhere in the book, are not discussed explicitly in this chapter, since they are relevant mainly in the case of parallel threads of execution distributed over multiple processors. Similarly—since the discussion in this chapter is primarily in the context of a single processor—the issues of shared memory, cache coherence, and message-routing are also not discussed here. The student may refer to Chapters 5 and 7, respectively, for a discussion of these two topics.

With this background, let us start with a statement of the basic system design objective which is addressed in this chapter.

Our aim is to make the discussion independent of any specific instruction set, and therefore we shall use simple and self-explanatory opcodes, as needed.

Data transfer instructions have only two operands—source and destination registers; load and store instructions to/from main memory specify one operand in the form of a memory address, using an available addressing mode. Effective address for load and store is calculated at the time of instruction execution.

Conditional branch instructions need to be treated as a special category, since each such branch presents two possible continuations of the instruction stream. Branch decision is made only when the instruction executes; at that time, if instructions from the branch-not-taken path are in the pipeline, they must be flushed. But pipeline flushes are costly in terms of lost processor clock cycles. The payoff of branch prediction lies in the fact that correctly predicted branches allow the detection of parallelism to stretch across two or more basic
blocks of the program, without pipeline stalls. It is for this reason that branch prediction becomes an essential technique in exploiting instruction level parallelism.
Limits to detecting and exploiting instruction level parallelism are imposed by dependences between instructions. After all, if N instructions are completely independent of each other, they can be executed in parallel on N functional units—if N functional units are available—and they may even be executed in arbitrary order.

But in fact dependences amongst instructions are a central and essential part of program logic. A dependence specifies that instruction Ik must wait for instruction Ij to complete. Within the instruction pipeline, such a dependence may create a hazard or stall—i.e. lost processor clock cycles while Ik waits for Ij to complete.

For this reason, for a given instruction pipeline design and associated functional units, dependences amongst instructions limit the available instruction level parallelism—and therefore it is natural that the central issue in exploiting instruction level parallelism is related to the correct handling of such dependences.

We have already seen in Chapter 2 that dependences amongst instructions fall into several categories; here we shall review these basic concepts and introduce some related notation which will prove useful.
Data Dependences

Assume that instruction Ik follows instruction Ij in the program. Data dependence between Ij and Ik means that both access a common operand. For the present discussion, let us assume that the common operand of Ij and Ik is in a programmable register. Since each instruction either reads or writes an operand value, accesses by Ij and Ik to the common register can occur in one of four possible ways:

(i) Read by Ik after read by Ij
(ii) Read by Ik after write by Ij
(iii) Write by Ik after read by Ij
(iv) Write by Ik after write by Ij

Of these, the first pattern of register access does not in fact create a dependence, since the two instructions can read the common value of the operand in any order.

The other three patterns of operand access do create dependences amongst instructions. Based on the order of accesses listed above, these are known as read after write (RAW) dependence, write after read (WAR) dependence, and write after write (WAW) dependence, respectively.
Read after write (RAW) is true data dependence, in the sense that the register value written by instruction Ij is read—i.e. used—by instruction Ik. This is how computations proceed; a value produced in one step is used further in a subsequent step. Therefore RAW dependences must be respected when program instructions are executed. This type of dependence is also known as flow dependence.

Write after read (WAR) is known as anti-dependence, because in this instance instruction Ik should not overwrite the value in the common register till the previous value stored therein has been used by the prior instruction Ij which needs the value. Such a dependence can be removed from the executing program by simply assigning another register for the write instruction Ik to write into. With read and write occurring to two different registers, the dependence between instructions is removed. In fact, this is the basis of the register renaming technique which we shall discuss later in this chapter.
Write after write (WAW) is known as output dependence, since two instructions are writing to a common register. If this dependence is violated, then subsequent instructions will see a value in the register which should in fact have been overwritten—i.e. they will see the value written by Ij rather than Ik. This type of dependence can also be removed from the executing program by assigning another target register for the second write instruction, i.e. by register renaming.
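As a concrete illustration, consider the short sequence below; each assignment stands for one machine instruction and each variable for one programmable register. The opcodes and register numbers are invented for illustration, in the three-operand style used in this chapter, and the comments name the dependences the sequence contains.

```c
#include <stdio.h>

/* A hypothetical four-"instruction" sequence showing RAW, WAR and WAW. */
int main(void) {
    int r1 = 1, r2 = 2, r5 = 5, r6 = 6, r7 = 7, r8 = 8, r9 = 9;
    int r3, r4;

    r3 = r1 + r2;   /* I1: writes r3                                       */
    r4 = r3 * r5;   /* I2: reads r3 and r5 -> RAW (flow) dependence on I1  */
    r5 = r6 - r7;   /* I3: writes r5 -> WAR (anti) dependence on I2        */
    r3 = r8 + r9;   /* I4: writes r3 -> WAW (output) dependence on I1      */

    printf("%d %d %d\n", r3, r4, r5);   /* prints 17 15 -1 */
    return 0;
}
```

The RAW dependence between I1 and I2 must be respected, while the WAR and WAW dependences disappear if I3 and I4 are given different target registers, which is precisely what register renaming does.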
Sometimes we need to show dependences between instructions using graphical notation. We shall use small circles to represent instructions, and double line arrows between two circles to denote dependences. The instruction at the head of the arrow is dependent on the instruction at the tail; if necessary, the type of dependence between instructions may be shown by appropriate notation next to the arrow. A missing arrow between two instructions will mean explicit absence of dependence.

Single line arrows will be used between instructions when we wish to denote program order without any implied dependence or absence of dependence.

Figure 12.2 illustrates this notation.
Fig. 12.2 Graphical notation for dependences between instructions
When dependences between multiple instructions are thus depicted, the result is a directed graph of dependences. A node in the graph represents an instruction, while a directed edge between two nodes represents a dependence.

Often dependences are thus depicted in a basic block of instructions—i.e. a sequence of instructions with entry only at the first instruction, and exit only at the last instruction of the sequence. In such cases, the graph of dependences becomes a directed acyclic graph, and the dependences define a partial order amongst the instructions.
Part (a) of Fig. 12.3 shows a basic block of six instructions, denoted I1 through I6 in program order. Entry to the basic block may be from one of multiple points within the program; continuation after the basic block would be at one of several points, depending on the outcome of the conditional branch instruction at the end of the block.

Part (b) of the figure shows a possible pattern of dependences as they may exist amongst these six instructions. For simplicity, we have not shown the type of each dependence, e.g. RAW(R3), etc. In the partial order, we see that several pairs of instructions—such as (I1, I3) and (I3, I4)—are not related by any dependence. Therefore, amongst each of these pairs, the instructions may be executed in any order, or in parallel.
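The partial order can be captured directly in code. The sketch below builds a small dependence DAG for a hypothetical six-instruction basic block (the edge set is invented for illustration and is not the one drawn in Fig. 12.3) and computes the earliest cycle in which each instruction could complete if dependences were the only constraint.

```c
#include <stdio.h>

#define N 6   /* six instructions I1..I6 of a hypothetical basic block */

int main(void) {
    /* dep[i][j] = 1 means instruction j depends on instruction i (edge i -> j).
     * This edge set is purely illustrative.                                   */
    int dep[N][N] = {0};
    dep[0][2] = 1;   /* I1 -> I3 */
    dep[1][3] = 1;   /* I2 -> I4 */
    dep[2][4] = 1;   /* I3 -> I5 */
    dep[3][5] = 1;   /* I4 -> I6 */

    /* Earliest completion cycle of each instruction: one more than the latest
     * of its predecessors; a single forward pass works because edges only go
     * from earlier to later instructions in program order (a DAG).           */
    int cycle[N];
    for (int j = 0; j < N; j++) {
        cycle[j] = 1;
        for (int i = 0; i < j; i++)
            if (dep[i][j] && cycle[i] + 1 > cycle[j])
                cycle[j] = cycle[i] + 1;
    }
    for (int j = 0; j < N; j++)
        printf("I%d can complete in cycle %d\n", j + 1, cycle[j]);
    return 0;
}
```

With this assumed edge set, two instructions become ready in each of three consecutive cycles, which mirrors the argument made below for the pattern of Fig. 12.3(b) on a processor completing two instructions per clock cycle.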
Dependences amongst instructions are inherent in the instruction stream. For processor design, the important questions are: For a given processor architecture, what is the effect of such dependences on processor performance? Do these dependences create hazards which necessitate pipeline stalls and/or flushes? Can these dependences be removed on the fly using some design technique? Can their adverse impact be reduced?
Fig. 12.3 (a) A basic block of six instructions in program order; (b) a possible pattern of dependences amongst them
Consider once again the pattern of dependences shown in Fig. 12.3(b). If the processor is capable of completing two (or more) instructions per clock cycle, and if no pipeline stalls are caused by the dependences shown, then clearly the six instructions can be completed in three consecutive processor clock cycles. Instruction latency, from fetch to commit stage, will of course depend on the depth of the pipeline.
Control Dependences

In typical application programs, basic blocks tend to be small in length, since about 15% to 20% of instructions in programs are branch and jump instructions, with indirect jumps and returns from procedure calls also included in the latter category. Because of the typically small sizes of basic blocks in programs, the amount of instruction level parallelism which can be exploited in a single basic block is limited.

Assume that instruction Ij is a conditional branch and that whether another instruction Ik executes or not depends on the outcome of the conditional branch instruction Ij. In such a case, we say that there is a control dependence of instruction Ik on instruction Ij.

Let us assume that a processor has an instruction pipeline of depth eight, and that the designers target superscalar performance of four instructions completed in every clock cycle. Assuming no pipeline stalls, the number of instructions in the processor at any one time—in its various pipeline stages and functional units—would be 4 × 8 = 32.

If 15% to 20% of these instructions are branches and jumps, then the execution of subsequent instructions within the processor would be held up pending the resolution of conditional branches, procedure returns, and so on—causing frequent pipeline stalls.

This simple calculation shows the potential adverse impact of conditional branches on the performance of a superscalar processor. The key question here is: How can the processor designer mitigate the adverse impact of such control dependences in a program?
Answer: Using some form of branch and jump prediction—i.e. predicting early and correctly (most of the time) the results of conditional branches, indirect jumps, and procedure returns. The aim is that, for every correct prediction made, there should be no lost processor clock cycles due to the conditional branch, indirect jump, or procedure return. For every mis-prediction made, there would be the cost of flushing the pipeline of instructions from the wrong continuation after the conditional branch or jump.
Example 12.2 Impact of successful branch prediction

Assume that we have attained 93% accuracy in branch prediction in a processor with eight pipeline stages. Assume also that the mis-prediction penalty is 4 processor clock cycles to flush the instruction pipeline. What is the performance gain from such a branch prediction strategy?

Recall that the expected value of a random variable X is given by Σ xi pi, where xi are the possible values of X, and pi are the respective probabilities. In our case, the probability of a correct branch prediction is 0.93, and the corresponding cost is zero; the probability of a wrong prediction is 0.07, and the corresponding cost is 4 clock cycles. Thus the expected cost of a conditional branch instruction is 0.07 × 4 = 0.28 clock cycle, i.e. much less than one clock cycle.
As a primitive form of branch prediction, the processor designer could assume that a conditional branch is always taken, and continue processing the instructions which follow at the target address. Let us assume that this simple strategy works 80% of the time; then the expected cost of a conditional branch is 0.2 × 4 = 0.8 clock cycles.

Suppose that not even this primitive form of branch prediction is used. Then the pipeline must stall until the result of every branch condition, and the target address of every indirect jump and procedure return, is known; only then can the processor proceed with the correct continuation within the program. If we assume that in this case the pipeline stalls over half the total number of stages, then the number of lost clock cycles is 4 for every conditional branch, indirect jump and procedure return instruction.

Considering that 15% to 20% of the instructions in a program are branches and jumps, the difference in cost between 0.28 clock cycle and 4 clock cycles per branch instruction is huge, underlining the importance of branch prediction in a superscalar processor.
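The expected-cost arithmetic of this example is easy to check mechanically; the short sketch below evaluates the three scenarios discussed above, using the 4-cycle flush penalty and the prediction accuracies stated in the example as its assumptions.

```c
#include <stdio.h>

/* Expected branch cost in clock cycles:
 * (probability of mis-prediction) x (mis-prediction penalty). */
static double expected_cost(double accuracy, double penalty_cycles) {
    return (1.0 - accuracy) * penalty_cycles;
}

int main(void) {
    double penalty = 4.0;   /* clock cycles lost per pipeline flush */
    printf("93%% prediction accuracy: %.2f cycles per branch\n",
           expected_cost(0.93, penalty));   /* 0.28 */
    printf("'always taken' (80%%):    %.2f cycles per branch\n",
           expected_cost(0.80, penalty));   /* 0.80 */
    printf("no prediction at all:    %.2f cycles per branch\n",
           expected_cost(0.00, penalty));   /* 4.00 */
    return 0;
}
```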
Later in this chapter, we shall study the techniques employed for branch prediction.
Resource Dependences  This is possibly the simplest kind of dependence to understand, since it refers to a resource constraint causing dependence amongst instructions needing the resource.
Example 12.3 Resource dependence

Consider a simple pipelined processor with only one floating point multiplier, which is not internally pipelined and takes three processor clock cycles for each multiplication. Assume that several independent floating point multiply instructions follow each other in the instruction stream in a single basic block under execution.
Clearly, while the processor is executing these multiply instructions, it cannot for that duration get even one instruction completed in every clock cycle. Therefore pipeline stalls are inevitable, caused by the absence of sufficient floating point multiply capability within the processor. In fact, for the duration of these consecutive multiply operations, the processor will only complete one instruction in every three clock cycles.

We have assumed the instructions to be independent of each other, and in a single basic block—i.e. there are no conditional branches within the sequence. Thus there is no data dependence or control dependence amongst these instructions. What we have here is resource dependence, i.e. all the instructions depend on the resource which has not been provided to the extent it is needed for the given workload on the processor.

We can say that there is an imbalance in this processor between the floating point capability provided and the workload which is placed on it. Such imbalances in system resources usually have adverse performance impact. Recall that Example 12.1 above and the related discussion illustrated this same point in another context.
A resource dependence which results in a pipeline stall can arise for access to any processor resource—functional unit, data path, register bank, and so on(2). We can certainly say that such resource dependences will arise if hardware resources provided on the processor do not match the needs of the executing program.

Now that we have seen the various types of dependences which can occur between instructions in an executing program, the problem of detecting and exploiting instruction level parallelism can finally be stated in the following manner:

Problem Definition  Design a superscalar processor to detect and exploit the maximum degree of parallelism available in the instruction stream—i.e. execute the instructions in the smallest possible number of processor clock cycles—by handling correctly the data dependences, control dependences and resource dependences within the instruction stream.
Before we can make progress in that direction, however, it is necessary to keep in mind a prototype processor design on which the problem solution can be attempted.
(2) This type of dependence may also be called structural dependence, since it is related to the structure of the processor; however resource dependence is the more common term.
Let us assume that our superscalar processor is designed for k instruction issues in every processor clock cycle. Clearly then the fetch, decode and issue pipeline stages, as well as the other elements of the processor, must all be designed to process k instructions in every clock cycle.

On multiple issue pipelines, the issue stage is usually separated from the decode stage. One reason for thus increasing the number of pipeline stages is that it allows the processor to be driven by a faster clock. The decode stage must be seen as preparation for instruction issue, which—by definition—can occur only if the relevant functional unit in the processor is in a state in which it can accept one more operation for execution. As a result of the issue, the operation is handed over to the functional unit for execution.
Note 12.1

The name of the instruction decode stage is somewhat inaccurate, in the sense that the instruction is never fully decoded. If a 32-bit instruction were fully decoded, for example, the decoder would have some 4 × 10^9 outputs! This is never done; an immediate constant is never decoded, and a memory or I/O address is decoded outside the processor, in the address decoder associated with the memory or I/O module.

Register select bits in the instruction are decoded when they are used to access the register bank; similarly, ALU function bits can be decoded within the ALU. Therefore register select and ALU function bits also need not be decoded in the instruction decode stage of the processor.

What happens in the instruction decode stage of the processor is that only the relevant fields of the instruction are decoded. For example, opcode bits must be decoded to select the functional unit, and addressing mode bits must be decoded to determine the operations required to calculate the effective memory address.
When instruction scheduling is specified by the compiler in the machine code it generates, we refer to it as static scheduling. In theory, static scheduling should free up the processor hardware from the complexities of instruction scheduling; in practice, though, things do not quite turn out that way, as we shall see in the next section.

If the processor control logic schedules instructions on the fly(3)—taking into account inter-instruction dependences as well as the state of the functional units—we refer to it as dynamic scheduling. Much of the rest of this chapter is devoted to various aspects and techniques of dynamic scheduling. Of course the basic aim in both types of scheduling—static as well as dynamic—is to maximize the instruction level parallelism which is exploited in the executing sequence of instructions.
As we have seen, at one time multiple instructions are in various stages of execution within the processor. But processor state and program state need to be maintained which are consistent with the program order of completed instructions. This is important from the point of view of preserving the semantics of the program.

Therefore, even with multiple instructions executing in parallel, the processor must arrange the results of completed instructions so that their sequence reflects program order. One way to achieve this is by using a
(3) Instruction scheduling as discussed here has some similarity with other types of task or job scheduling systems. It should be noted, of course, that a typical production system requiring job scheduling does not involve conditional branches, i.e. control dependences.
reorder buffer, shown in Fig. 12.4, which allows instructions to be committed in program order, even if they execute in a different order; we shall discuss this point in some more detail in Section 12.7.

Fig. 12.4 Prototype superscalar processor: functional units, load/store unit, register bank, and branch prediction unit, with connection to cache/main memory
If instructions are executed on the basis of predicted branches, before the actual branch outcome is available, we say that the processor performs speculative execution. In such cases, the reorder buffer will need to be cleared—wholly or partly—if the actual branch result indicates that speculation has occurred on the basis of a mis-prediction.

Functional units in the processor may themselves be internally pipelined; they may also be provided with reservation stations, which accept operations from the issue stage of the instruction pipeline. A functional unit performs an operation when the required operands for it are available in the reservation station. For the purposes of our discussion, memory load-store unit(s) may also be treated as functional units, which perform their functions with respect to the cache/memory subsystem.
Figure 12.5 shows a processor design in which functional units are provided with reservation stations. Such designs usually also make use of operand forwarding over a common data bus (CDB), with tags to identify the source of data on the bus. Such a design also implies register renaming, which resolves WAR and WAW dependences. Dynamic scheduling of instructions on such a processor is discussed in some more detail in Sections 12.8 and 12.9.

A branch prediction unit has also been shown in Fig. 12.4 and Fig. 12.5 to implement some form of a branch prediction algorithm, as discussed in Section 12.10.
Data paths connecting the various elements within the processor must be provided so that no resource dependences—and consequent pipeline stalls—are created for want of a data path. If k instructions are to be completed in every processor clock cycle, the data paths within the processor must support the required data transfers in each clock cycle.
Fig. 12.5 Processor with reservation stations provided with the functional units, branch prediction unit, and connection to cache/main memory
At one extreme, a primitive arrangement would be to provide a single common bus within the processor; but such a bus would become a scarce and performance limiting resource amongst multiple instructions executing in parallel within the processor.

At the other extreme, one can envisage a complete graph of data paths amongst the various processor elements. In such a system, in each clock cycle, any processor element can transfer data to any other processor element, with no resource dependences caused on that account. But unfortunately, for a processor with n internal elements, such a system requires n - 1 data ports at every element, and is therefore not practical.

Therefore, between the two extremes outlined above, processor designers must aim for an optimum design of internal processor data paths, appropriate for the given instruction set and the targeted processor performance. This point will be discussed further in Section 12.6, when we discuss a technique known as operand forwarding.
As mentioned above, the important question of defining program (or thread) state and processor state must also be addressed. If a context switch, interrupt or exception occurs, the program/thread state and processor state must be saved, and then restored at a later time when the same program/thread resumes. From the programmer's point of view, the state should correspond to a point in the machine language program at which the previous instruction has completed execution, but the next one has not started.

In a multiple-issue processor, clearly this requires careful thought—since, at any time, as many as a couple of dozen instructions may be in various stages of execution.

A processor of the type described here is often designed with hardware support for multi-threading, which requires maintaining thread status of multiple threads, and switching between threads; this type of design is discussed further in Section 12.12.
Note also that, in Fig. 12.4 and Fig. 12.5, we have separated control elements from data flow elements and functional units in the processor—and in fact shown only the latter. Design of the control logic needed for the processor will not be discussed in this chapter in any degree of detail beyond the brief overview contained in Note 12.2.
Note 12.2

The processor designer must select the architectural components to be included in the processor—for example a reorder buffer of a particular type, a specific method of operand forwarding, a specific method of branch prediction, and so on. The designer must also specify fully the algorithms which will govern the working of the selected architectural components. These algorithms are very similar to the algorithms we write in higher level programming languages, and are written using similar languages. These algorithms specify the control logic that would be needed for the processor, which would be finally realized in the form of appropriate digital logic circuits.

Given the complexity of modern systems, the task of translating algorithmic descriptions of processor functions into digital logic circuits can only be carried out using very sophisticated VLSI design software. Such software offers a wide range of functionality: simulation software is used to verify the correctness of the selected algorithm; logical design software translates the algorithm into a digital circuit; physical design software translates the logical circuit design into a physical circuit which can be built using VLSI, while design verification software verifies that the physical design does not violate any constraints of the underlying circuit fabrication technology.

All the architectural elements and control logic which are being described in this chapter can thus be translated into a physical design and then realized in VLSI. This is how processors and other digital systems are designed and built today. For our purposes in this chapter, however, it is not necessary to go into the details of how the required circuits and control logic are to be realized in VLSI.

We take the view that the architect decides what is to be designed, and then the circuit designer designs and realizes the circuit accordingly. In other words, our subject matter is restricted to the functions of the architect, and does not extend to circuit design—i.e. to the question of how a particular function is to be realized in VLSI. We assume that any required control logic which can be clearly specified can be implemented.
Now suppose that machine code is generated by the compiler as though the original program had been
written as:
for j = 0 to 52 step 4 do
    c[j]   = a[j]*b[j]     - p*d[j];
    c[j+1] = a[j+1]*b[j+1] - p*d[j+1];
    c[j+2] = a[j+2]*b[j+2] - p*d[j+2];
    c[j+3] = a[j+3]*b[j+3] - p*d[j+3];
To discover and exploit the parallelism implicit in loops, as seen in Example 12.4, the compiler must perform the loop unrolling transformation to generate the machine code. Clearly, this strategy makes sense only if sufficient hardware resources are provided within the processor for executing instructions in parallel.
In the simple example above, the loop control variable in the original program goes from 0 to 55—i.e. its initial and final values are both known at compile time. If, on the other hand, the loop control values are not known at compile time, the compiler must generate code to calculate at run-time the control values for the unrolled loop.
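When the trip count is known only at run time, one common approach is to unroll by a fixed factor and add a short cleanup loop for the leftover iterations. The C sketch below illustrates the idea for the loop of this example; the unroll factor of 4 and the array names follow the example, while the function itself is only an illustrative sketch, not actual compiler output.

```c
/* Unrolling by 4 when the trip count n is known only at run time:
 * the main loop handles groups of four iterations, and the cleanup
 * loop handles the remaining n % 4 iterations.                     */
void compute(int n, double *c, const double *a,
             const double *b, const double *d, double p) {
    int j = 0;
    for (; j + 3 < n; j += 4) {          /* unrolled body: one longer basic block */
        c[j]     = a[j]     * b[j]     - p * d[j];
        c[j + 1] = a[j + 1] * b[j + 1] - p * d[j + 1];
        c[j + 2] = a[j + 2] * b[j + 2] - p * d[j + 2];
        c[j + 3] = a[j + 3] * b[j + 3] - p * d[j + 3];
    }
    for (; j < n; j++)                   /* cleanup for leftover iterations */
        c[j] = a[j] * b[j] - p * d[j];
}
```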
Note that loop unrolling by the compiler does not in itself involve the detection of instruction level parallelism. But loop unrolling makes it possible for the compiler or the processor hardware to exploit a greater degree of instruction level parallelism. In Example 12.4, since the basic block making up the loop body becomes longer, it becomes possible for the compiler or processor to find a greater degree of parallelism amongst the instructions across the unrolled loop iterations.
Can the compiler also do the additional work of actually scheduling machine instructions on the hardware resources available on the processor? Or must this scheduling necessarily be performed on the fly by the processor control logic?

When the compiler schedules machine instructions for execution on the processor, the form of scheduling is known as static scheduling. As against this, instruction scheduling carried out by the processor hardware on the fly is known as dynamic scheduling, which has been introduced in Chapter 6 and will be discussed further later in this chapter.

If the compiler is to schedule machine instructions, then it must perform the required dependence analysis amongst instructions. This is certainly possible, since the compiler has access to full semantic information obtained from the original source program.
Example 12.5 Dependence across loop iterations

Consider the following loop in a source program, which appears similar to the loop seen in the previous example, but has a crucial new dependence built into it:

for i = 0 to 55 do
    c[i] = a[i]*b[i] - p*c[i-1];

Now the value calculated in the i-th iteration of the loop makes use of the value c[i-1] calculated in the previous iteration. This does not mean that the modified loop cannot be unrolled, but only that extra care should be taken to account for the dependence.
Dependences amongst references to simple variables, or amongst array elements whose index values are known at compile time (as in the two examples seen above), can be analyzed relatively easily at compile time.

But when pointers are used to refer to locations in memory, or when array index values are known only at run-time, then clearly dependence analysis is not possible at compile time. Therefore processor hardware must provide support at run-time for alias analysis—i.e. based on the respective effective addresses, to determine whether two memory accesses for read or write operations refer to the same location.
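A small C fragment makes the difficulty concrete; the function and its arguments are invented for illustration. Whether the store and the load below ever touch the same memory location depends entirely on the pointer values supplied at run time, so the compiler cannot prove the accesses independent.

```c
/* Whether the store to dst[i] and the load from src[i+1] ever refer to the
 * same location depends on the pointer values passed in at run time. For
 * example, if dst == src + 2, the store in iteration i writes the element
 * that the load in iteration i+1 reads: a dependence through memory that
 * only the effective addresses, computed during execution, can reveal.    */
void scale_shift(double *dst, const double *src, double k, int n) {
    for (int i = 0; i + 1 < n; i++)
        dst[i] = k * src[i + 1];
}
```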
There is another reason why static scheduling by the compiler must be backed up by dynamic scheduling by the processor hardware. Cache misses, I/O interrupts, and hardware exceptions cannot be predicted
at compile time. Therefore, apart from alias analysis, the disruptions caused by such events in statically scheduled running code must also be handled by the dynamic scheduling hardware in the processor.

These arguments bring out a basic point—compiler detected instruction level parallelism also requires dynamic scheduling support within the processor. The fact that the compiler performs extra work does not really make the processor hardware much simpler(4).
A further step in the direction of compiler detected instruction level parallelism and static scheduling can be the following:

Suppose each machine instruction specifies multiple operations—to be carried out in parallel within the processor, on multiple functional units. The machine language program produced by the compiler then consists of such multi-operation instructions, and their scheduling takes into account all the dependences amongst instructions.

Recall that conventional machine instructions specify one operation each—e.g. load, add, multiply, and so on. As opposed to this, multi-operation instructions would require a larger number of bits to encode. Therefore processors with this type of instruction word are said to have very long instruction word (VLIW). A preliminary discussion of this concept has been included in Chapter 4 of the book.
A little further refinement of this concept brings us to the so-called explicitly parallel instruction computer (EPIC). The EPIC instruction format can be more flexible than the fixed format of a multi-operation VLIW instruction; for example, it may allow the compiler to encode explicitly the dependences between operations.
Another possibility is that of having predicated instructions in the instruction set, whereby an instruction is executed only if the hardware condition (predicate) specified with it holds true. Such instructions would result in a reduced number of conditional branch instructions in the program, and could thereby lower the number of pipeline flushes.
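As a sketch of what predication buys, compare a branchy absolute-value computation with a branch-free form; a predicated or conditional-move instruction set can express the second form without any conditional branch to predict. The C below only illustrates the idea at source level, under the assumption that the compiler maps the selection onto such an instruction.

```c
#include <stdio.h>

/* Branchy form: compiles naturally to a compare-and-branch, which costs a
 * pipeline flush whenever the branch is mis-predicted.                    */
static int abs_branchy(int x) {
    if (x < 0)
        return -x;
    return x;
}

/* Branch-free form: a predicated or conditional-move instruction can
 * evaluate both alternatives and select one, with no branch to predict. */
static int abs_predicated(int x) {
    int neg = -x;
    return (x < 0) ? neg : x;   /* candidate for a conditional select/move */
}

int main(void) {
    printf("%d %d\n", abs_branchy(-7), abs_predicated(-7));   /* 7 7 */
    return 0;
}
```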
The aim behind VLIW and EPIC processor architecture is to assign to the compiler primary responsibility for the parallel exploitation of the plentiful hardware resources of the processor. In theory, this would simplify the processor hardware, allowing for increased aggregate processor throughput. Thus this approach would, in theory, provide a third alternative to the RISC and CISC styles of processor architecture.

In general, however, it is fair to say that VLIW and EPIC concepts have not fulfilled their original promise. Intel Itanium 64-bit processors make up the most well-known processor family of this class. Experience with that processor showed, as was argued briefly above, that processor hardware does not really become simpler even when the compiler bears primary responsibility for the detection and exploitation of instruction level parallelism. Events such as interrupts and cache misses remain unpredictable, and therefore execution of operations at run-time cannot follow completely the static scheduling specified in VLIW/EPIC instructions by the compiler; dynamic scheduling is still needed.
Another practical difficulty with compiler detected instruction level parallelism is that the source program may have to be recompiled for a different processor model of the same processor family. The reason is simple: such a compiler depends not only on the instruction set architecture (ISA) of the processor family, but also on the hardware resources provided on the specific processor model for which it generates code.
(4) Recall in this context the basic argument for RISC architecture, whereby the instruction set is reduced for the sake of higher processor throughput. A similar trade-off between hardware and software complexity does not exist when the compiler performs static scheduling of instructions on a superscalar processor.
For highly compute-intensive applications which run on dedicated hardware platforms, this strategy may well be feasible and it may yield significant performance benefits. Such special-purpose applications are fine-tuned for a given hardware platform, and then run for long periods on the same dedicated platform.

But commonly used programs such as word processors, web browsers, and spreadsheets must run without recompilation on all the processors of a family. Most users of software do not have source programs to recompile, and all the processors of a family are expected to be instruction set compatible with one another. Therefore the role of compiler-detected instruction level parallelism is limited in the case of widely used general purpose application programs of the type mentioned.
Niiiiiili (JFWERJhlIIJ|Fi)FUIVlHR[)Hfll5
1 We know that a superscalar processor offers opportunities for the detection and exploitation
of instnrction level parallc|ism—i.e. potential parallelism which is present within a single
instruction stream. Exploitation ofsuch parallelism is enhanced by providing multiple functional units and
by other techniques that we shall study. Tr|.|c data dependences between instnrctions must of course be
respected, since they reflect program logic. On the other hand, two independent instructions can be executed
in parallel—or even out ofsequence—ifthat results in better utilization of processor clock cycles.
We now know that pipelinejirlslres caused by conditional branch, indirect jump, and procedure return
instructions lead to degradation in performance, and therefore attempts must be made to minimize them;
similarly pipeline .s'r.nHs caused by data dependences and cache misses also have adverse impact on processor
performance.
Therefore thc strategy should be to minimize the number of pipeline stalls and flushes encountered while
executing an instruction stream. In other words, we must minimize wasted processor clock cycles within the
pipeline and also, if possible, within the various iirnctional units ofthe processor.
In this section, we take a look at a basic technique known as operand forwarding, which helps in reducing the impact of true data dependences in the instruction stream. Consider the following simple sequence of two instructions in a running program:

ADD    R1, R2, R3
SHIFTR #4, R3, R4
The result of the ADD instruction is stored in destination register R3, and then shifted right by four bits in the second instruction, with the shifted value being placed in R4. Thus, there is a simple RAW dependence between the two instructions—the output of the first is required as an input operand of the second.

In terms of our notation, this RAW dependence appears as shown in Fig. 12.6, in the form of a graph with two nodes and one edge.

Fig. 12.6 RAW dependence between the two instructions

In a pipelined processor, ideally the second instruction should be executed one stage—and therefore one clock cycle—behind the first. However, the difficulty here is that it takes one clock cycle to transfer the ALU output to destination register R3, and then another clock cycle to transfer the contents of
register R3 to the ALU input for the right shift. Thus a total of two clock cycles are needed to bring the result of the first instruction where it is needed for the second instruction. Therefore, as things stand, the second instruction above cannot be executed just one clock cycle behind the first.

This sequence of data transfers has been illustrated in Fig. 12.7 (a). In clock cycle Tk, the ALU output is transferred to R3 over an internal data path. In the next clock cycle Tk+1, the content of R3 is transferred to the ALU input for the right shift. When carried out in this order, clearly the two data transfer operations take two clock cycles.

But note that the required two transfers of data can be achieved in only one clock cycle if the ALU output is sent to both R3 and the ALU input in the same clock cycle, as illustrated in Fig. 12.7 (b). In general, if X is to be copied to Y, and in the next clock cycle Y is to be copied to Z, then we can just as well copy X to both Y and Z in one clock cycle.

If this is done in the above sequence of instructions, the second instruction can be just one clock cycle behind the first, which is a basic requirement of an instruction pipeline.
Fig. 12.7 Two data transfers (a) in sequence and (b) in parallel
In technical terms, this type of an operation within a processor is known as operand forwarding. Basically this means that, instead of performing two or more data transfers from a common source one after the other, we perform them in parallel. This can be seen as parallelism at the level of elementary data transfer operations within the processor. To achieve this aim, the processor hardware must be designed to detect and exploit on the fly all such opportunities for saving clock cycles. We shall see later in this chapter one simple and elegant technique for achieving this aim.

The benefits of such a technique are easy to see. The wait within a functional unit for its operand becomes shorter because, as soon as it is available, the operand is sent in one clock cycle, over the common data bus, to every destination where it is needed. We saw in the above example that thereby the common data bus remained occupied for one clock cycle rather than two clock cycles. Since this bus itself is a key hardware resource, its better utilization in this way certainly contributes to better processor performance.
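The one-cycle broadcast can be pictured as a tagged bus write delivered to every waiting consumer at once. The following minimal sketch models that behaviour in C; the tag values, the fixed-size consumer table and the function names are all invented for illustration, not part of any real design.

```c
#include <stdio.h>

#define CONSUMERS 4

/* Each consumer (a register or a reservation-station operand slot) waits for
 * a value identified by a tag; one broadcast delivers the value to every
 * consumer with a matching tag in the same clock cycle.                     */
struct consumer {
    int waiting_tag;   /* tag of the value this slot is waiting for */
    int has_value;
    int value;
};

static void broadcast(struct consumer c[], int tag, int value) {
    for (int i = 0; i < CONSUMERS; i++)
        if (!c[i].has_value && c[i].waiting_tag == tag) {
            c[i].value = value;        /* delivered in the same cycle */
            c[i].has_value = 1;
        }
}

int main(void) {
    /* Register R3 and the shifter's input slot both wait for tag 7, the tag
     * assumed here for the ADD result in the running example.              */
    struct consumer slots[CONSUMERS] = {
        { .waiting_tag = 7 },   /* register R3           */
        { .waiting_tag = 7 },   /* shifter input operand */
        { .waiting_tag = 9 },   /* unrelated slots       */
        { .waiting_tag = 9 },
    };
    broadcast(slots, 7, 42);    /* ADD result appears on the common data bus */
    for (int i = 0; i < CONSUMERS; i++)
        printf("slot %d: %s\n", i,
               slots[i].has_value ? "value received" : "still waiting");
    return 0;
}
```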
The above reasoning applies even if there is an intervening instruction between ADD and SHIFTR.
Consider the following sequence of instructions:
REORDER BUFFER

The reorder buffer as a processor element was introduced and discussed briefly in Section 12.4. Since instructions execute in parallel on multiple functional units, the reorder buffer serves the function of bringing completed instructions back into an order which is consistent with program order. Note that instructions may complete in an order which is not related to program order, but must be committed in program order.

At any time, program state and processor state are defined in terms of instructions which have been committed—i.e. their results are reflected in appropriate registers and/or memory locations. The concepts of program state and processor state are important in supporting context switches and in providing precise exceptions.

Entries in the reorder buffer are completed instructions, which are queued in program order. However, since instructions do not necessarily complete in program order, we also need a flag with each reorder buffer entry to indicate whether the instruction in that position has completed.
Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the reorder buffer—instruction identifier, value computed, program-specified destination of the value computed, and a flag indicating whether the instruction has completed (i.e. the computed value is available).

In Fig. 12.9, the head of the queue of instructions is shown at the top, arbitrarily labeled as instr[i]. This is the instruction which would be committed next—if it has completed execution. When this instruction commits,
its result value is copied to its destination, and the instruction is then removed from the reorder buffer. The next instruction to be issued in the issue stage of the instruction pipeline then joins the reorder buffer at its tail.

If the instruction at the head of the queue has not completed, and the reorder buffer is full, then further issue of instructions is held up—i.e. the pipeline stalls—because there is no free space in the reorder buffer for one more entry.
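A minimal sketch of such a reorder buffer, using the four fields just described and a circular queue of eight entries, might look as follows in C. The field and function names are invented for this sketch; commit simply retires completed entries from the head, so results become architectural state strictly in program order.

```c
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

/* One reorder buffer entry: the four fields described in the text. */
struct rob_entry {
    int  instr_id;   /* instruction identifier            */
    long value;      /* value computed                    */
    int  dest_reg;   /* program-specified destination     */
    bool ready;      /* has the instruction completed?    */
};

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head, tail, count;   /* circular queue in program order */
};

/* Issue allocates an entry at the tail; returns false when the buffer is
 * full, which is exactly the condition under which issue must stall.     */
static bool rob_issue(struct rob *r, int instr_id, int dest_reg) {
    if (r->count == ROB_SIZE)
        return false;
    r->e[r->tail] = (struct rob_entry){ instr_id, 0, dest_reg, false };
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return true;
}

/* Commit retires entries from the head only while they are ready. */
static void rob_commit(struct rob *r, long reg_file[]) {
    while (r->count > 0 && r->e[r->head].ready) {
        struct rob_entry *h = &r->e[r->head];
        reg_file[h->dest_reg] = h->value;   /* copy result to its destination */
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
    }
}

int main(void) {
    struct rob r = { .head = 0, .tail = 0, .count = 0 };
    long regs[16] = { 0 };

    rob_issue(&r, 1, 3);                       /* instr 1 will write R3      */
    rob_issue(&r, 2, 4);                       /* instr 2 will write R4      */
    r.e[1].value = 99; r.e[1].ready = true;    /* instr 2 completes first    */
    rob_commit(&r, regs);                      /* nothing commits: head not ready */
    r.e[0].value = 42; r.e[0].ready = true;    /* instr 1 completes          */
    rob_commit(&r, regs);                      /* both commit, in program order   */
    printf("R3=%ld R4=%ld\n", regs[3], regs[4]);   /* R3=42 R4=99 */
    return 0;
}
```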
The result value of any other instruction lower down in the reorder buffer, say value[i+k], can also be used as an input operand for a subsequent operation—provided of course that the instruction has completed and therefore its result value is available, as indicated by the corresponding flag ready[i+k]. In this sense, we see that the technique of operand forwarding can be combined with the concept of the reorder buffer.

It should be noted here that operands at the input latches of functional units, as well as values stored in the reorder buffer on behalf of completed but uncommitted instructions, are simply 'work in progress'. These values are not reflected in the state of the program or the processor, as needed for a context switch or for exception handling.
We now take a brief look at how the use of the reorder buffer addresses the various types of dependences in the program.

(i) Data Dependences  A RAW dependence—i.e. true data dependence—will hold up the execution of the dependent instruction if the result value required as its input operand is not available. As suggested above, operand forwarding can be added to this scheme to speed up the supply of the needed input operand as soon as its value has been computed.

WAR and WAW dependences—i.e. anti-dependence and output dependence, respectively—also hold up the execution of the dependent instruction and create a possible pipeline stall. We shall see below that the technique of register renaming is needed to avoid the adverse impact of these two types of dependences.
(ii) Control Dependences  Suppose the instruction(s) in the reorder buffer belong to a branch in the program which should not have been taken—i.e. there has been a mis-predicted branch. Clearly then the
reorder buffer should be flushed along with other elements of the pipeline. Therefore the performance impact of control dependences in the running program is determined by the accuracy of the branch prediction technique employed. The reorder buffer plays no direct role in the handling of control dependences.
(iii) Resource Dependences  If an instruction needs a functional unit to execute, but the unit is not free, then the instruction must wait for the unit to become free; clearly no technique in the world can change that. In such cases, the processor designer can aim to achieve at least this: if a subsequent instruction needs to use another functional unit which is free, then the subsequent instruction can be executed out of order.

However, the reorder buffer queues and commits instructions in program order. In this sense, therefore, the technique of using a reorder buffer does not address explicitly the resource dependences existing within the instruction stream; with multiple functional units, the processor can still achieve out of order completion of instructions.
In essence, the conceptually simple technique of the reorder buffer ensures that if instructions as programmed can be carried out in parallel—i.e. if there are no dependences amongst them—then they are carried out in parallel. But nothing clever is attempted in this technique to resolve dependences. Instruction issue and commit are in program order; program state and processor state are correctly preserved.

We shall now discuss a clever technique which alleviates the adverse performance effect of WAR and WAW dependences amongst instructions.
REGISTER RENAMING

Traditional compilers allocate registers to program variables in such a way as to reduce the main memory accesses required in the running program. In the programming language C, in fact, the programmer can even pass a hint to the compiler that a variable be maintained in a processor register.

Traditional compilers and assembly language programmers work with a fairly small number of programmable registers. The number of programmable registers provided on a processor is determined by either

(i) the need to maintain backward instruction compatibility with other members of the processor family, or
(ii) the need to achieve reasonably compact instruction encoding in binary. With sixteen programmable registers, for example, four bits are needed for each register specified in a machine instruction.
Amongst the instructions in various stages of execution within the processor, there would be occurrences of RAW, WAR and WAW dependences on programmable registers. As we have seen, RAW is true data dependence—since a value written by one instruction is used as an input operand by another. But a WAR or WAW dependence can be avoided if we have more registers to work with. We can simply remove such a dependence by getting the two instructions in question to use two different registers.

But we must also assume that the instruction set architecture (ISA) of the processor is fixed—i.e. we cannot change it to allow access to a larger number of programmable registers. Rather, our aim here is to explore techniques to detect and exploit instruction level parallelism using a given instruction set architecture.
Therefore the only way to make a larger number of registers available to instructions under execution within the processor is to make the additional registers invisible to machine language instructions. Instructions under execution would use these additional registers, even if instructions making up the machine language program stored in memory cannot refer to them.

Let us suppose that we have several such additional registers available, to which machine instructions of the running program cannot make any direct reference. Of course these machine instructions do refer to programmable registers in the processor—and thereby create the WAR and WAW dependences which we are now trying to remove.
For example, let us say that the instruction:

FADD R1, R2, R5

is followed by the instruction:
(5) In fact the processor may also rename R5 in FADD to another program invisible register, say Y. But clearly the argument made here still remains valid.
A similar argument applies if Ij is reading the value in Rk, and a subsequent instruction is writing into Rk—i.e. there is a WAR dependence between them.

The technique outlined, which can resolve WAR and WAW dependences, is known as register renaming. Both these dependences are caused by a subsequent instruction writing into a register being used by a previous instruction. Such dependences do not reflect program logic, but rather the use of a limited number of registers.
Let us now consider a simple example of WAR dependence, i.e. of anti-dependence. The case of WAW dependence would be very similar.
Thus we see that register renaming removes WAR and WAW dependences from the instruction stream by re-mapping programmable registers to a larger pool of program invisible registers. For this, the processor must have extra registers to handle instructions under execution, but these registers do not appear in the instruction set.

Consider true data dependence, i.e. RAW dependence, between two instructions. Under register renaming, the write operation and the subsequent read operation both occur on the same program invisible register. Thus RAW dependence remains intact in the instruction stream—as it should, since it is true data dependence. As seen above, its impact on the pipeline operation can be reduced by operand forwarding.
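A minimal sketch of the re-mapping idea follows: a rename table maps each programmable (architectural) register to the program invisible (physical) register currently holding its latest value, and every instruction that writes a register is handed a fresh physical register. The table sizes and names below are illustrative assumptions, not those of any particular processor, and the sketch ignores reclaiming physical registers.

```c
#include <stdio.h>

#define ARCH_REGS 16    /* programmable registers visible in the ISA      */
#define PHYS_REGS 64    /* larger pool of program invisible registers     */

static int rename_table[ARCH_REGS];   /* arch register -> current phys register */
static int next_phys = ARCH_REGS;     /* next free physical register (no reuse
                                          handled in this simple sketch)        */

/* Rename one instruction "dest = src1 op src2": sources read the current
 * mapping, and the destination is given a fresh physical register, which
 * removes any WAR or WAW dependence on that architectural register.      */
static void rename(int dest, int src1, int src2) {
    int p1 = rename_table[src1];
    int p2 = rename_table[src2];
    int pd = next_phys++;
    rename_table[dest] = pd;
    printf("sources R%d,R%d -> P%d,P%d ; destination R%d now maps to P%d\n",
           src1, src2, p1, p2, dest, pd);
}

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++)
        rename_table[r] = r;   /* initial identity mapping */

    rename(5, 1, 2);   /* I1: writes R5 -> P16                                  */
    rename(6, 5, 3);   /* I2: reads R5 as P16 (RAW preserved), writes R6 -> P17 */
    rename(5, 7, 8);   /* I3: writes R5 -> P18, so the WAW with I1 and the WAR
                              with I2 both disappear                            */
    return 0;
}
```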
Dependences are also caused by reads and writes to memory locations. In general, however, whether two instructions refer to the same memory location can only be known after the two effective addresses are calculated during execution. For example, the two memory references 2000[R1] and 4000[R3] occurring in a running program may or may not refer to the same memory location—this cannot be resolved at compile time.
Resolution of whether two memory references point to the same memory location is known as alias analysis, which must be carried out on the basis of the two effective memory addresses. If a load and a store operation to memory refer to two different addresses, their order may be interchanged. Such capability can be built into the load-store unit—which in essence operates as another functional unit of the processor.
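The check described above can be sketched as follows; this is a simplified illustration, and the function name, the address values and the queue representation are assumed for the example only.

```python
# Sketch of the reordering check a load-store unit might apply once effective
# addresses are known (addresses and queue format are illustrative assumptions).

def may_reorder(load_addr, pending_store_addrs):
    """A load may be moved ahead of earlier stores only if its effective
    address differs from every pending store address (no aliasing)."""
    return all(load_addr != s for s in pending_store_addrs)

# Two references such as 2000[R1] and 4000[R3] may or may not alias;
# only the computed effective addresses can decide.
print(may_reorder(0x1200, [0x1400]))  # True  -> load can bypass the store
print(may_reorder(0x1200, [0x1200]))  # False -> same location, order must be kept
```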
An elegant implementation of register renaming and operand forwarding in a high performance processor was seen as early as in 1967—even before the term register renaming was coined. This technique—which has since become well-known as Tomasulo's algorithm—is described in the next section.
TOMASULO'S ALGORITHM
In the IBM 360 family of computer systems of the 1960s and 1970s, model 360/91 was developed as a high performance system for scientific and engineering applications, which involve intensive floating point computations. The processor in this system was designed with multiple floating point units, and it made use of an innovative algorithm for the efficient use of these units. The algorithm was based on operand forwarding over a common data bus, with tags to identify sources of data values sent over the bus.
The algorithm has since become known as Tomasulo's algorithm, after the name of its chief designer[20]; what we now understand as register renaming was also an implicit part of the original algorithm.
Recall that, for register renaming, we need a set of program invisible registers to which programmable registers are re-mapped. Tomasulo's algorithm requires these program invisible registers to be provided within the reservation stations of functional units.
Let us assume that the functional units are internally pipelined, and can complete one operation in every clock cycle. Therefore each functional unit can initiate one operation in every clock cycle—provided of course that a reservation station of the unit is ready with the required input operand value or values. Note that the exact depth of this functional unit pipeline does not concern us for the present.
[20] See An Efficient Algorithm for Exploiting Multiple Arithmetic Units, by R. M. Tomasulo, IBM Journal of Research & Development 11:1, January 1967. A preliminary discussion of Tomasulo's algorithm was included in Chapter 6.
Figure 12.12 shows such a functional unit connected to the common data bus, with three reservation stations provided on it.
Fig. 12.12 A functional unit with three reservation stations (each holding an operation code and two operand/tag slots), connected to the common data bus
When the needed operand value or values are available in a reservation station, the functional unit can initiate the required operation in the next clock cycle.
At the time of instruction issue, the reservation station is filled out with the operation code (op). If an operand value is available, for example in a programmable register, it is transferred to the corresponding source operand field in the reservation station.
However, if the operand value is not available at the time of issue, the corresponding source tag (t1 and/or t2) is copied into the reservation station. The source tag identifies the source of the required operand. As soon as the required operand value is available at its source—which would typically be the output of a functional unit—the data value is forwarded over the common data bus, along with the source tag. This value is copied into all the reservation station operand slots which have the matching tag.
Thus operand forwarding is achieved here with the use of tags. All the destinations which require a data value receive it in the same clock cycle over the common data bus, by matching their stored operand tags with the source tag sent out over the bus.
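A minimal sketch of this tag mechanism is given below, assuming a register status table and reservation station entries represented as Python dictionaries; the field names and tag format are illustrative, not those of the IBM 360/91.

```python
# Minimal sketch of tag-based operand forwarding over a common data bus
# (field names, tag format and the register-status table are assumptions).

reg_status = {}        # programmable register -> pending source tag, if any
reg_value = {}         # programmable register -> last committed value
stations = []          # reservation station entries awaiting operands

def issue(op, src1, src2, dest, my_tag):
    """Issue an instruction: copy available operand values, or else source tags."""
    entry = {"op": op, "tag": my_tag,
             "v1": None if src1 in reg_status else reg_value.get(src1),
             "t1": reg_status.get(src1),
             "v2": None if src2 in reg_status else reg_value.get(src2),
             "t2": reg_status.get(src2)}
    stations.append(entry)
    reg_status[dest] = my_tag          # dest will be produced by this station

def broadcast(tag, value):
    """Common data bus: every waiting slot with a matching tag gets the value."""
    for e in stations:
        if e["t1"] == tag: e["v1"], e["t1"] = value, None
        if e["t2"] == tag: e["v2"], e["t2"] = value, None
    for r, t in list(reg_status.items()):
        if t == tag:                   # a register receives the value only if it is
            reg_value[r] = value       # still waiting on this particular tag
            del reg_status[r]
```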
Example 12.7 Tomasulo's algorithm and RAW dependence
Assume that instruction I1 is to write its result into R4, and that two subsequent instructions I2 and I3 are to read—i.e. make use of—that result value. Thus instructions I2 and I3 are truly data dependent (RAW dependent) on instruction I1. See Fig. 12.13.
It may be noted from Example 12.7 that, in effect, programmable registers become renamed to operand registers within reservation stations, which are program invisible. As we have seen in the previous section, such renaming also resolves anti-dependences and output dependences, since the target register of the dependent instruction is renamed in these cases to a different program invisible register.
Example 12.8 Combination of RAW and WAR dependences
Let us now consider a combination of RAW and WAR dependences.
Assume that instruction I1 is to write its result into R4, a subsequent instruction I2 is to read that result value, and a later subsequent instruction I3 is then to write its result into R4. Thus instruction I2 is truly data dependent (RAW dependent) on instruction I1, but I3 is anti-dependent (WAR dependent) on I2. See Fig. 12.14.
Fig. 12.14 Instructions I1, I2 and I3, with the RAW dependence of I2 on I1 (through R4) and the WAR dependence of I3 on I2 (through R4)
As in the previous example, and keeping in mind similar possibilities, let us assume once again that the output of I1 is not available when I2 and I3 are issued; thus R4 has the source tag value corresponding to the output of I1.
When I2 is issued, it is parked in the reservation station of the appropriate functional unit. Since the required result value from I1 is not available, the reservation station entry of I2 also gets the source tag corresponding to the output of I1—i.e. the same source tag value which has been assigned to register R4, since they are both awaiting the same result.
The question now is: Can I3 be issued even before I1 completes and I2 starts execution?
The answer is that, with register renaming—carried out here using source tags—I3 can be issued even before I2 starts execution.
Recall that instruction I2 is RAW dependent on I1, and therefore it has the correct source tag for the output of I1. I2 will receive its required input operand as soon as that is available, when that value would also be copied into R4 over the common data bus. This is exactly what we observed in the previous example.
But suppose I3 is issued even before the output of I1 is available. Now R4 should receive the output of I3 rather than the output of I1. This is simply because, in register R4, the output of I1 is programmed to be overwritten by the output of I3.
Thus, when I3 is issued, R4 will receive the source tag value corresponding to the output of I3—i.e. the functional unit which performs the operation of I3. Its previous source tag value corresponding to the output of I1 will be overwritten.
When the output of I1 (finally) becomes available, it goes to the input of I2, but not to register R4, since this register's source tag now refers to I3. When the output of I3 becomes available, it goes correctly to R4 because of the matching source tag.
For simplicity of discussion, we have not tracked here the outputs of I2 and I3. But the student can verify easily that the two data transfers described above are consistent with the specified sequence of three instructions and the specified dependences.
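Using the sketch given after Fig. 12.12 (with the same assumed field names), the tag traffic of this example can be traced as follows; the opcodes, the tag names and every register other than R4 are hypothetical, chosen only to make the trace runnable.

```python
# Tracing Example 12.8 with the earlier sketch; tags T1, T2, T3 and the
# source registers R1, R2, R5, R6 are illustrative assumptions.
reg_value.update({"R1": 1.0, "R2": 2.0, "R5": 5.0, "R6": 6.0})

issue("FADD", "R1", "R2", "R4", "T1")   # I1: writes R4; R4's pending tag is now T1
issue("FMUL", "R4", "R5", "R7", "T2")   # I2: RAW on R4 -> its slot waits on tag T1
issue("FSUB", "R5", "R6", "R4", "T3")   # I3: overwrites R4's pending tag with T3

broadcast("T1", 3.0)    # output of I1: copied to I2's waiting slot, NOT to R4
broadcast("T3", -1.0)   # output of I3: copied to R4, whose tag matches T3
print(reg_value["R4"])  # -1.0, as program order requires (I2's own output is not traced)
```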
Let us assume that, without any unrolling by the compiler, this loop executes on a processor which provides branch prediction and implements Tomasulo's algorithm. If instructions from successive loop iterations are available in the processor at one time—because of successful branch prediction(s)—and if floating point units are available, then instructions from successive iterations can execute at one time, in parallel.
But if instructions from multiple iterations are thus executing in parallel within the processor—at one time—then the net effect of these hardware techniques in the processor is the same as that of an unrolled loop. In other words, the processor hardware achieves on the fly what otherwise would require unrolling assistance from the compiler!
Even the dependence shown in Example 12.5 across successive loop iterations is handled in a natural way by branch prediction and Tomasulo's algorithm. Basically this dependence across loop iterations becomes RAW dependence between instructions, and is handled in a natural way by source tags and operand forwarding.
This example brings out clearly how a particular method of exploiting parallelism—loop unrolling, in this case—can be implemented either by the compiler or, equivalently, by clever hardware techniques employed within the processor.
Example 12.9 illustrates the combined power of sophisticated hardware techniques for dynamic scheduling and branch prediction. With such efficient techniques becoming possible in hardware, the importance of compiler-detected parallelism (Section 12.5) diminishes somewhat in comparison.
Example 12.10 Calculation of processor clock cycles
Let us consider the number of clock cycles it takes to execute the following sequence of machine instructions. We shall count clock cycles starting from the last clock cycle of instruction 1, so that the answer is independent of the depth of the instruction pipeline.
We shall assume that (a) one instruction is issued per clock cycle, (b) floating point operations take two clock cycles each to execute, and (c) memory operations take one clock cycle each when there is an L1 cache hit.
If we add the number of clock cycles needed for each instruction, we get the total as 1+2+1+2+1 = 7. However, if no operand forwarding is provided, the RAW dependences on registers R4 and R7 will cost three additional clock cycles (recall Fig. 12.7), for a total of 10 clock cycles for the given sequence of instructions.
With operand forwarding—which is built into Tomasulo's algorithm—one clock cycle is saved on account of each RAW dependence—i.e. between (i) instructions 1 and 2, (ii) instructions 2 and 3, and (iii) instructions 4 and 5.
Thus the total number of clock cycles required, counting from the last clock cycle of instruction 1, is 7. With the assumptions as made here, there is no further scope to schedule these instructions in parallel.
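The counting used in this example can be reproduced with the short calculation below; the latency list and the RAW-dependent pairs follow the assumptions stated above, and the one-cycle stall per unforwarded RAW dependence is the simplification used in the example.

```python
# Reproducing the cycle counting of Example 12.10 under the stated assumptions.

latencies = [1, 2, 1, 2, 1]            # execute cycles per instruction, as assumed above
raw_pairs = [(1, 2), (2, 3), (4, 5)]   # RAW-dependent instruction pairs

base = sum(latencies)                  # 1+2+1+2+1 = 7 cycles
print(base + len(raw_pairs))           # 10 cycles without operand forwarding
print(base)                            # 7 cycles with operand forwarding
```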
In Tomasulo's algorithm, use of the common data bus and operand forwarding based on source tags results in decentralized control of the multiple instructions in execution. In the 1960s and 1970s, Control Data Corporation developed supercomputers CDC 6600 and CDC 7600 with a centralized technique to exploit instruction level parallelism.
In these supercomputers, the processor had a centralized scoreboard which maintained the status of functional units and executing instructions (see Chapter 6). Based on this status, processor control logic governed the issue and execution of instructions. One part of the scoreboard maintained the status of every instruction under execution, while another part maintained the status of every functional unit. The scoreboard itself was updated at every clock cycle of the processor, as execution progressed.
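A skeleton of the two scoreboard tables, as described above, might look like the following; the field names and entries are illustrative only and omit much of the detail kept by the real CDC scoreboards.

```python
# Illustrative skeleton of the two parts of a centralized scoreboard
# (instruction ids, stage names and unit names are assumptions).

instruction_status = {
    # instruction id -> stage reached so far
    "I1": "issued", "I2": "reading operands", "I3": "executing",
}
functional_unit_status = {
    # unit -> (busy?, operation, destination register, source registers)
    "FP-ADD": (True,  "FADD", "R4", ("R1", "R2")),
    "FP-MUL": (False, None,   None, None),
}
# In each clock cycle, centralized control logic consults and updates both
# tables to decide which instructions may issue, read operands, execute,
# or write back their results.
```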
BRANCH PREDICTION
The importance of branch prediction for multiple issue processor performance has already been discussed in Section 12.3. About 15% to 20% of instructions in a typical program are branch and jump instructions, including procedure returns. Therefore—if hardware resources are to be fully utilized in a superscalar processor—the processor must start working on instructions beyond a branch, even before the branch instruction itself has completed. This is only possible through some form of branch prediction.
What can be the logical basis for branch prediction? To understand this, we consider first the reasoning which is involved if one wishes to predict the result of a tossed coin.
[Footnote] For a detailed discussion, with applications, the reader may refer to the book Artificial Intelligence: A Modern Approach, by Russell and Norvig, Pearson Education.
Like tossed coins, outcomes of conditional branches in computer programs also have yes and no answers—i.e. a branch is either taken or not taken. But outcomes of conditional branches are in fact biased—because there is strong correlation between (a) successive branches taken at the same conditional branch instruction in a program, and (b) branches taken at two different conditional branch instructions in the same program.
This is how programs behave, i.e. such correlation is an essential property of real-life programs. And such correlation provides the logical basis for branch prediction. The issue for processor designers is how to discover and utilize this correlation on the fly—without incurring prohibitive overhead in the process.
A basic branch prediction technique uses a so-called two-bit predictor. A two-bit counter is maintained for every conditional branch instruction in the program. The two-bit counter has four possible states; these four states and the possible transitions between these states are shown in Fig. 12.15.
Fig. 12.15 State transition diagram of the two-bit branch predictor (states 0 to 3; solid arrows show transitions on correct predictions, broken arrows show transitions on mis-predictions)
When the counter state is 0 or 1, the respective branch is predicted as taken; when the counter state is 2 or 3, the branch is predicted as not taken. When the conditional branch instruction is executed and the actual branch outcome is known, the state of the respective two-bit counter is changed as shown in the figure using solid and broken line arrows.
When two successive predictions come out wrong, the prediction is changed from branch taken to branch not taken, and vice versa. In Fig. 12.15, state transitions made on mis-predictions are shown using broken line arrows, while solid line arrows show state transitions made on predictions which come out right.
[Footnote] Note that Fig. 12.15 is a slightly redrawn version of the state transition diagram shown earlier in Fig. 6.19(b).
This scheme uses a two-bit counter for every conditional branch, and there are many conditional branches in the program. Overall, therefore, this branch prediction logic needs a few kilobytes or more of fast memory. One possible organization for this branch prediction memory is in the form of an array which is indexed by low order bits of the instruction address. If twelve low order bits are used to define the array index, for example, then the number of entries in the array is 4096.
To be effective, branch prediction should be carried out as early as possible in the instruction pipeline. As soon as a conditional branch instruction is decoded, branch prediction logic should predict whether the branch is taken. Accordingly, the next instruction address should be taken either as the branch target address (i.e. branch is taken), or the sequentially next address in the program (i.e. branch is not taken).
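A sketch of such a predictor is shown below, assuming a 4096-entry table of two-bit counters indexed by twelve low order address bits; the state encoding (0 and 1 predict taken, 2 and 3 predict not taken) follows the description above, while the function names are assumptions.

```python
# Sketch of a two-bit branch predictor table indexed by low order PC bits.

INDEX_BITS = 12
counters = [0] * (1 << INDEX_BITS)   # states 0,1 -> predict taken; 2,3 -> predict not taken

def predict(pc):
    """Return True if the branch at this address is predicted taken."""
    return counters[pc & ((1 << INDEX_BITS) - 1)] < 2

def update(pc, taken):
    """Move the counter one step toward the actual outcome (saturating)."""
    i = pc & ((1 << INDEX_BITS) - 1)
    if taken:
        counters[i] = max(counters[i] - 1, 0)   # toward the "taken" states
    else:
        counters[i] = min(counters[i] + 1, 3)   # toward the "not taken" states
```

With this encoding, two successive mis-predictions are exactly what is needed to cross from the pair of "taken" states into the pair of "not taken" states, or back again.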
Can branch prediction be carried out even before the instruction is decoded—i.e. at the instruction fetch stage? Yes, if a so-called branch target buffer is provided which has a history of recently executed conditional branches. The branch target buffer is organized as an associative memory accessed by the instruction address; this memory provides quick access to the prediction and the target instruction address needed.
In some programs, whether a conditional branch is taken or not taken correlates better with other conditional branches in the program—rather than with the earlier history of outcomes of the same conditional branch. Accordingly, correlated predictors can be designed, which generate a branch prediction based on whether other conditional branches in the program were taken or not taken.
Branch prediction based on the earlier history of the same branch is known as local prediction, while prediction based on the history of other branches in the program is known as global prediction. A tournament predictor uses (i) a global predictor, (ii) a local predictor, and (iii) a selector which selects one of the two predictors for prediction at a given branch instruction. The selector uses a two-bit counter per conditional branch—as in Fig. 12.15—to choose between the global and local predictors for the branch. Two successive mis-predictions cause a switch from the local predictor to the global predictor, and vice versa; the aim is to infer which predictor works better for the particular branch.
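The selector can be sketched as another two-bit counter per branch, as below; the chooser encoding and its update rule are one plausible reading of the description above, and the local and global predictors themselves are assumed to be given.

```python
# Sketch of the per-branch selector ("chooser") in a tournament predictor.

chooser = {}   # branch address -> 0..3; states 0,1 -> use local, 2,3 -> use global

def tournament_predict(pc, local_pred, global_pred):
    return local_pred if chooser.get(pc, 0) < 2 else global_pred

def tournament_update(pc, selected_was_correct):
    c = chooser.get(pc, 0)
    using_local = c < 2
    if selected_was_correct:
        c = max(c - 1, 0) if using_local else min(c + 1, 3)  # reinforce current choice
    else:
        c = c + 1 if using_local else c - 1                  # two misses in a row flip the choice
    chooser[pc] = c
```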
The common element in all these cases is that branch prediction relies on the correlation detected between branches taken or not taken in the running program—and for this, an efficient hardware implementation of the required prediction logic is required.
Considerations outlined here apply also to jump prediction, which is applicable to indirect jumps, computed goto statements (used in FORTRAN), and switch statements (used in C and C++). Procedure returns can also benefit from a form of jump prediction. The reason is that the address associated with a procedure return is obtained from the runtime procedure stack in main memory; therefore a correct prediction of the return address can save memory access and a few processor clock cycles.
It is also possible to design the branch prediction logic to utilize information gleaned from a prior execution profile or execution trace of the program. If the same program is going to run on dedicated hardware for years—say for an application such as weather forecasting—then such special effort put into speeding up the program on that dedicated hardware can pay very good dividends over the life of the application. Suppose the execution trace informs us that a particular branch is taken 95% of the time, for example. Then it is a good idea to 'predict' the particular branch as always taken—in this case, we are assured that 95% of the predictions made will be correct!
[Footnote] Clearly, if two conditional branch instructions happen to have the same low order bits, then their predictions will become 'intermingled'. But the probability of two or more such instructions being in execution at the same time would be quite low.
As we discussed in Section 12.3, under any branch prediction scheme, a mis-predicted branch means that subsequent instructions must be flushed from the pipeline. It should of course be noted here that the actual result of a conditional branch instruction—as against its predicted result—is only known when the instruction completes execution.
Speculative Execution  Instructions executed on the basis of a predicted branch, before the actual branch result is known, are said to involve speculative execution.
If a branch prediction turns out to be correct, the corresponding speculatively executed instructions must be committed. If the prediction turns out to be wrong, the effects of corresponding speculative operations carried out within the processor must be cleaned up, and instructions from another branch of the program must instead be executed.
As we have seen in Example 12.2, the strategy results in net performance gain if branch predictions are made with sufficiently high accuracy. The performance benefit of branch prediction can only be gained if prediction is followed by speculative execution.
A conventional processor fetches one instruction after another—i.e. it does not look ahead into the forthcoming instruction stream more than one instruction at a time. To support a deeper and wider—i.e. multiple issue—instruction pipeline, it is necessary for branch prediction and dynamic scheduling logic to look further out into the forthcoming instruction stream. In other words, more of the likely future instructions need to be examined in support of multiple issue scheduling and branch prediction.
The instruction window—or simply window—is the special memory provided upstream of the fetch unit in the processor to thus look ahead into the forthcoming instruction stream. For the targeted processor performance, the processor designers must integrate and balance the hardware techniques of branch prediction, dynamic scheduling, speculative execution, internal data paths, functional units, and an instruction window of appropriate size.
[Footnote] The interested student may read Limits of Instruction-Level Parallelism, by D. W. Wall, Research Report 93/6, Western Research Laboratory, Digital Equipment Corporation, November 1993. Note 12.4 below is a brief summary of this technical report.
Similarly, multiple loads and stores would be in progress at one time. Also, dynamic scheduling would require a fairly large instruction window, to maintain the issue rate at the targeted four instructions per clock cycle.
Consider the instruction window. Instructions in the window must be checked for dependences, to support out of order issue. This requires associative memory and its control logic, which means an overhead in chip area and power consumption; such overhead would increase with window size. Similarly, any form of checking amongst executing instructions—e.g. checking addresses of main memory references, for alias analysis—would involve overhead which increases with issue multiplicity k. In turn, such increased overhead in aggressive pursuit of instruction level parallelism would adversely impact the processor clock speed which is achievable, for a given VLSI technology.
Also, with greater issue multiplicity k, there would be higher probability of fewer than k instructions being issued in some clock cycles. The reason for this can be simply that a functional unit is not available, or that true RAW dependences amongst instructions hold up instruction issue. This would result in missing the target performance of the processor in actual applications, in terms of issue multiplicity k. Let us say processor A has k = 6 but it is only 60% utilized on average in actual applications; processor B, with the same instruction set but with k = 4, might have a faster clock rate and also higher average utilization, thus giving better performance than A on actual applications.
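A back-of-the-envelope comparison along these lines is sketched below; processor A's issue width and 60% utilization are taken from the text, while the clock rates and processor B's utilization are assumed figures chosen only to illustrate the trade-off.

```python
# Rough throughput comparison of the two hypothetical processors described above
# (clock rates and processor B's utilization are assumed, not from the text).

def throughput(issue_width, utilization, clock_ghz):
    return issue_width * utilization * clock_ghz   # useful instructions per ns, roughly

a = throughput(6, 0.60, 2.0)    # processor A: k = 6, 60% utilized, 2.0 GHz (assumed clock)
b = throughput(4, 0.80, 2.5)    # processor B: k = 4, 80% utilized, 2.5 GHz (assumed figures)
print(a, b)                     # about 7.2 vs 8.0 -> the narrower but better-utilized B wins
```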
The increased overhead also necessitates a larger number of stages in the instruction pipeline, so as to limit the total delay per stage and thereby achieve a faster clock cycle; but a longer pipeline results in higher cost of flushing the pipeline. Thus the aggregate performance impact of increased overhead finally places limits on what is achievable in practice with aggressively superscalar, VLIW and EPIC architectures.
Basically, the increased overhead required within the processor implies that:
(i) To support higher multiplicity of instruction issue, the amount of control logic required in the processor increases disproportionately, and
(ii) For higher throughput, the processor must also operate at a high clock rate.
But these two design goals are often at odds, for technical reasons of circuit design, and also because there are practical limits on the amount of power the chip can dissipate.
Power consumption of a chip is roughly proportional to N × f², where N is the number of devices on the chip, and f is the clock rate. The number of devices on the chip is largely determined by the fabrication technology being used, and power consumption must be held within the limits of the heat dissipation possible.
Therefore the question for processor designers is: For a targeted processor performance, how best to select and utilize the various chip resources available, within the broad design constraints of the given circuit technology?
The student may recall that this was the introductory theme of this chapter (Section 12.1), and should note that such design trade-offs are shaping processor design today. To achieve their goals, processor designers make use of extensive software simulations of the processor, using various benchmark programs within the target range of applications. The designers' own experience and insights supplement the simulation results in the process of generating solutions to the actual problems of processor design.
[Footnote] In this connection, see also the discussion in the latter part of Section 12.5.
Emergence of hardware support for multi-threading and of multi-core chips, which we shall discuss in the next section, is due in part to the practical limits which have been encountered in exploiting the implicit parallelism within a single instruction stream.
[Footnote] In the world of ancient Greece, an oracle was a power which could predict future events; one well-known and presumably reliable oracle was at the temple of Delphi.
For example, suppose one of the programs listed in Table 12.1 executed thirty million instructions, as seen from its execution trace. Suppose further that, for a given processor configuration, these instructions could be packed into six million processor cycles. Then the average degree of parallelism obtained for this program, for this particular processor configuration, would be 30/6 = 5.
Techniques Explored  The range of techniques which was explored in the study to detect and exploit instruction level parallelism is summarized briefly below:
Register renaming—with (a) infinite number of registers as renaming targets, (b) finite number of registers, and (c) no register renaming.
Alias analysis—with (a) perfect alias analysis, (b) two intermediate levels of alias analysis, and (c) no alias analysis.
Branch prediction—with (a) perfect branch prediction, (b) three hardware-based branch prediction schemes, (c) three profile-based branch prediction schemes, and (d) no branch prediction. Hardware predictors used a combination of local and global tables, with different total table sizes. Some branch fanout limits were also applied.
Indirect jump prediction—with (a) perfect prediction, (b) intermediate level of prediction, and (c) no indirect jump prediction.
Window size—for some of the processor models, different window sizes from an upper limit of 2048 instructions down to 4 instructions were used.
Cycle width, i.e. the maximum number of instructions which can be issued in one cycle—(a) 64, (b) 128, and (c) bounded only by window size. Note that, from a practical point of view, cycle widths of both 64 and 128 are on the high side.
Latencies of processor operations—five different latency models were used, specifying latencies (in number of clock cycles) for various processor operations.
Loop unrolling—was carried out in some of the programs.
Misprediction penalty—values of 0 to 10 clock cycles were used.
Conclusions Reached  With 13 programs and more than 350 processor configurations, it should come as no surprise to the student that Wall's research generated copious results. These results are presented systematically in the full report, which is available on the web. For our purposes, we summarize below the main conclusions of the report.
For the overall degree of parallelism found, the report says:
Using nontrivial but currently known techniques, we consistently got parallelism between 4 and 10 for most of the programs in our test suite. Vectorizable, or nearly vectorizable, programs went much higher.
Branch prediction and speculative execution is identified as the major contributor in the exploitation of instruction level parallelism:
Speculative execution driven by good branch prediction is critical to the exploitation of more than modest amounts of instruction-level parallelism. If we start with the Perfect model and remove branch prediction, the median parallelism plummets from 30.6 to 2.3.
Perfect model in the above excerpt refers to a processor which performs perfect branch prediction, jump prediction, register renaming, and alias analysis. The student will appreciate readily that this is an ideal which is impossible to achieve in practice.
Overall, Wall's study reports good results for the degree of parallelism; but the report also goes on to say that the results are based on 'rather optimistic assumptions'. In the actual study, this meant: (i) as many copies of functional units as needed, (ii) a perfect memory system with no cache misses, (iii) no penalty for missed predictions, and (iv) no speed penalty of the overhead for aggressive pursuit of instruction level parallelism.
Clearly no real hardware processor can satisfy such ideal assumptions. After listing some more factors which lead to optimistic results, the report concludes:
Any one of these considerations could reduce the expected payoff of an instruction-parallel machine; together they could eliminate it completely.
The broad conclusion of Wall's research study therefore certainly seems to support the proverb quoted at the start of this section.
In Chapter 13, we shall review recent advances in technology which have had a major impact on processor design, and we shall also look at some specific commercial products introduced in recent years. We shall see that the basic techniques and trade-offs discussed in this chapter are reflected, in one form or another, in the processors and systems-on-a-chip introduced in recent years.
We have already seen that dependences amongst machine instructions limit the amount of instruction level parallelism which is available to be exploited within the processor. The dependences may be true data dependences (RAW), control dependences introduced by conditional branch instructions, or resource dependences.
One way to reduce the burden of dependences is to combine—with hardware support within the processor—instructions from multiple independent threads of execution. Such hardware support for multi-threading would provide the processor with a pool of instructions, in various stages of execution, which have a relatively smaller number of dependences amongst them, since the threads are independent of one another.
Let us consider once again the processor with instruction pipeline of depth eight, and with targeted superscalar performance of four instructions completed in every clock cycle (see Section 12.1). Now suppose that these instructions come from four independent threads of execution. Then, on average, the number of instructions in the processor at any one time from any one thread would be 4 × 8 / 4 = 8.
With the threads being independent of one another, there is a smaller total number of data dependences amongst the instructions in the processor. Further, with control dependences also being separated into four threads, less aggressive branch prediction is needed.
Another major benefit of such hardware-supported multi-threading is that pipeline stalls are very effectively utilized. If one thread runs into a pipeline stall—for access to main memory, say—then another thread makes use of the corresponding processor clock cycles, which would otherwise be wasted. Thus hardware support for multi-threading becomes an important latency hiding technique.
To provide support for multi-threading, the processor must be designed to switch between threads—either on the occurrence of a pipeline stall, or in a round robin manner. As in the case of the operating system switching between running processes, in this case the hardware context of a thread within the processor must be preserved.
But in this case what exactly is the meaning of the context of a thread?
Basically, thread context includes the full set of registers (programmable registers and those used in register renaming), PC, stack pointer, relevant memory map information, protection bits, interrupt control bits, etc. For N-way multi-threading support, the processor must store at one time the thread contexts of N executing threads. When the processor switches, say, from thread A to thread B, control logic ensures that execution of subsequent instruction(s) occurs with reference to the context of thread B.
Note that thread contexts need not be saved and later restored. As long as the processor retains within itself multiple thread contexts, all that is required is that the processor be able to switch between thread contexts from one clock cycle to the next.
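A sketch of the per-thread context such a processor must hold on chip is given below; the field list follows the description above, while the types, the class name and the switching function are illustrative assumptions.

```python
# Sketch of the per-thread context held on chip by a multi-threaded processor.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    registers: dict = field(default_factory=dict)  # programmable + rename registers
    pc: int = 0
    stack_pointer: int = 0
    memory_map: dict = field(default_factory=dict)
    protection_bits: int = 0
    interrupt_control: int = 0

contexts = {"A": ThreadContext(pc=0x1000), "B": ThreadContext(pc=0x2000)}
active = "A"

def switch_to(thread_id):
    """With all contexts kept on chip, a switch is only a change of the active
    context selector; nothing is saved to or restored from memory."""
    global active
    active = thread_id

switch_to("B")   # subsequent instructions execute with reference to thread B's context
```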
As we saw in the previous section, there are limits on the amount of instruction level parallelism which can be extracted from a single stream of executing instructions—i.e. a single thread. But, with steady advances in VLSI technology, the aggregate amount of functionality that can be built into a single chip has been growing steadily.
[Footnote] As discussed above, we assume that WAR and WAW dependences can be handled using some form of register renaming.
Therefore hardware support for multi-threading—as well as the provision of multiple processor cores on a single chip—can both be seen as natural consequences of the steady advances in VLSI technology. Both these developments address the needs of important segments of modern computer applications and workloads.
Depending on the specific strategy adopted for switching between threads, hardware support for multi-threading may be classified as one of the following:
(i) Coarse-grain multi-threading refers to switching between threads only on the occurrence of a major pipeline stall—which may be caused by, say, access to main memory, with latencies of the order of a hundred processor clock cycles.
(ii) Fine-grain multi-threading refers to switching between threads on the occurrence of any pipeline stall, which may be caused by, say, an L1 cache miss. But this term would also apply to designs in which processor clock cycles are regularly being shared amongst executing threads, even in the absence of a pipeline stall.
(iii) Simultaneous multi-threading refers to machine instructions from two (or more) threads being issued in parallel in each processor clock cycle. This would correspond to a multiple-issue processor where the multiple instructions issued in a clock cycle come from an equal number of independent execution threads.
With increasing power of VLSI technology, the development of multi-core systems-on-a-chip (SoCs) was also inevitable, since there are practical limits to the number of threads a single processor core can support. Each core on the Sun UltraSPARC T2, for example, supports eight-way fine-grain multi-threading, and the chip has eight such cores. Multi-core chips promise higher net processing performance per watt of power consumption.
Systems-on-a-chip are examples of the fascinating design trade-offs and the technical issues which have been discussed in this chapter. Of course, we have discussed here only the basic design issues and techniques. For any actual task of processor design, it is necessary to make many design choices and trade-offs, validate the design using simulations, and then finally complete the design in detail to the level of logic circuits.
Over the last couple of decades, enormous advances have taken place in various areas of computer technology; these advances have had a major impact on processor and system design. In the next chapter, we shall discuss in some detail these advances and their impact on processor and system design. We shall also study in brief several commercial products, as case studies in how actual processors and systems are designed.
Summary
Processor design—or the choice of a processor from amongst several alternatives—is the central element of computer system design. Since system design can only be carried out with specific target application loads in mind, it follows that processor design should also be tailored for target application loads. To satisfy the overall system performance criteria, various elements of the system must be balanced in terms of their performance—i.e. no element of the system should become a performance bottleneck.
One of the main processor design trade-offs faced in this context is this: Should the processor be designed to squeeze the maximum possible parallelism from a single thread, or should processor hardware support multiple independent threads, with less aggressive exploitation of instruction level parallelism within each thread? In this chapter, we studied the various standard techniques for exploiting instruction level parallelism, and also discussed some of the related design issues and trade-offs.
Dependences amongst instructions make up the main constraint in the exploitation of instruction level parallelism. Therefore, in its essence, the problem here can be defined as: executing a given sequence of machine instructions in the smallest possible number of processor clock cycles, while respecting the true dependences which exist amongst the instructions. To study possible solutions, we looked at two possible prototype processors: one provided with a reorder buffer, and the other with reservation stations associated with its various functional units.
In theory, compiler-detected instruction level parallelism should simplify greatly the issues to be addressed by processor hardware. This is because, in theory, the compiler would do the difficult work of dependence analysis and instruction scheduling. Processor hardware would then be 'dumb and fast'—it would simply execute at a high speed the machine instructions which specify parallel operations. However, many types of runtime events—such as interrupts and cache misses—cannot be predicted at compile time. Processor hardware must therefore provide for dynamic scheduling to exploit instruction level parallelism, limiting the value of what the compiler alone can achieve.
Operand forwarding is a hardware technique to transfer a required operand to multiple destinations in parallel, in one clock cycle, over the common data bus—thus avoiding sequential transfers over multiple clock cycles. To achieve this, it is necessary that processor hardware should dynamically detect and exploit such potential parallelism in data transfers. The benefits lie in reduced wait time in functional units, and better utilization of the common data bus, which is an important hardware resource.
A reorder buffer is a simple mechanism to commit instructions in program order, even if their corresponding operations complete out of order within the processor. Within the reorder buffer, instructions are queued in program order with four typical fields for each instruction—instruction identifier, value computed, program-specified destination of the value computed, and a flag indicating whether the instruction has completed. This simple technique ensures that program state and processor state are correctly preserved, but does not resolve WAR and WAW dependences within the instructions.
Register renaming is a clever technique to resolve WAR and WAW dependences within the instruction stream. This is done by re-mapping source and target programmable registers of executing machine instructions to a larger set of program-invisible registers; thereby the second instruction of a WAR or WAW dependence does not write to the same register which is used by the first instruction. The renaming is done dynamically, without any performance penalty in clock cycles.
Tomasulo's algorithm was developed originally for the IBM 360/91 processor, which was designed for intensive scientific and engineering applications. Operand forwarding is achieved using source tags, which are also sent on the common data bus along with the operand value. Use of reservation stations within functional units provides an effective register renaming mechanism which resolves WAR and WAW dependences.
About 15% to 20% of instructions in a typical machine language program are branch and jump instructions. Therefore for any pipelined processor—but especially for a superscalar processor—branch
prediction and speculative execution are critical to achieving targeted performance. A simple two-bit counter for every branch instruction can serve as a basis for branch prediction; using a combination of local and global branch prediction, more elaborate schemes can be devised.
Limitations in exploiting a greater degree of instruction level parallelism arise from the increased overhead in the required control logic. The limitations may apply to achievable clock rates, power consumption, or the actual processor utilization achieved while running application programs. Thread-level parallelism allows processor resources to be shared amongst multiple independent threads executing at one time. For the target application, processor designers must choose the right combination of instruction level parallelism, thread-level parallelism, and multiple processor cores on a chip.
Exercises
Problem 12.1 Define in brief the meaning of computer architecture; within the scope of that meaning, explain in brief the role of processor design.

Problem 12.2
(a) When can we say that a computer system is balanced with respect to its performance?
(b) In a particular computer system, the designers suspect that the read/write bandwidth of the main memory has become the performance bottleneck. Describe in brief the type of test program you would need to run on the system, and the type of measurements you would need to make, to verify whether main memory bandwidth is indeed the performance bottleneck. You may make additional assumptions about the system if you can justify the assumptions.

Problem 12.3 Recall that, in the example system shown in Fig. 12.1, the bandwidth of the shared processor-memory bus is a performance bottleneck. Assume now that this bandwidth is increased by a factor of six. Discuss in brief the likely effect of this increase on system performance. After this change is made, is there a likelihood that some other subsystem becomes the performance bottleneck?

Problem 12.4 Explain in brief some of the basic design issues and trade-offs faced in processor design, and the role of the VLSI technology selected for building the processor.

Problem 12.5 Explain in brief the significance of (i) processor state, (ii) program state, and (iii) committing an executed instruction.

Problem 12.6
(a) Explain in brief, with one example each, the various types of dependences which must be considered in the process of exploiting instruction level parallelism.
(b) Define in brief the problem of exploiting instruction level parallelism in a single sequence of executing instructions.

Problem 12.7 With static instruction scheduling by the compiler, the processor designer does not need to provide for dynamic scheduling in hardware. Is this statement true or false? Justify your answer in brief.

Problem 12.8 Describe in brief the structure of the reorder buffer, and the functions which it can and cannot perform in the process of exploiting instruction level parallelism.
Note for Exercises 9 to 15
The following three sequences of machine instructions are to be used for Exercises 9 to 15. Note that instructions other than LOAD and STORE have three operands each; from left to right they are, respectively, source 1, source 2 and destination. The '#' sign indicates an immediate operand.
Assume that (a) one instruction is issued per clock cycle, (b) no resource constraints limit instruction level parallelism, (c) floating point operations take two clock cycles each to execute, and (d) load/store memory operations take one clock cycle each when there is an L1 cache hit.
Problem 12.9 Draw dependence graphs of the above sequences of machine instructions, marking on them the type of data dependences, with the respective registers involved.

Problem 12.10 Assume that the processor has no provision for register renaming and operand forwarding, and that all memory references are satisfied from L1 cache. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.11 Now assume that register renaming is implemented to resolve WAR and WAW dependences. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.12 Comment on the scope for operand forwarding within the sequences of instructions. Assume that the load/store unit can also take part in operand forwarding.
Problem 12.13 Assume that, in addition to register renaming, operand forwarding is also implemented as discussed in Exercise 12. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.14 Consider your answers to Exercises 10, 11 and 13 above. Explain in brief how these answers would be affected if an L1 cache miss occurs in instruction 1, which takes five clock cycles to satisfy from L2 cache.

Problem 12.15 With reference to Exercise 13, describe in brief how Tomasulo's algorithm would implement register renaming and operand forwarding.

Problem 12.16 Explain in brief the meaning of alias analysis as applied to runtime memory addresses.

Problem 12.17 A particular processor makes use of a 2-bit predictor for each branch. Based on a program execution trace, the actual branch behavior at a particular conditional branch instruction is found to be as follows:
T T ... T N   T T ... T N   ...   (each run of T is of length k)
Here T stands for branch taken, and N stands for branch not taken. In other words, the actual branch behavior forms a repeating sequence, such that the branch is taken k times (T), then not taken once (N). With the 2-bit branch predictor, find the fraction of correct branch predictions made if k = 1, k = 2, k = 5 and k = 50.

Problem 12.18 Discuss in brief the difference between local and global branch prediction strategies, and how a two-bit selector may be used per branch to select between the two.

Problem 12.19
(a) Wall's study on instruction level parallelism is based on oracle-driven trace-based simulation. Explain in brief what is meant by this type of simulation.
(b) Wall's study of instruction level parallelism makes certain 'optimistic' assumptions about processor hardware. What are these assumptions? Against each of these assumptions, list the corresponding 'realistic' assumption which we should make, keeping in view the characteristics of real processors.

Problem 12.20 Discuss in brief the basic trade-off in processor design between exploiting instruction level parallelism in a single executing thread, and providing hardware support for multiple threads.

Problem 12.21 Describe in brief what is meant by the context of a thread, and what are the typical operations involved in switching between threads.

Problem 12.22 Describe in brief the different strategies which can be considered for switching between threads in a processor which provides hardware support for multi-threading.