MODULE-5
(Figure: two basic mechanisms for interprocess communication: shared variables in a common memory accessed by several processes, and message passing between Process D and Process E over a communication channel.)
Protected Access The main problem associated with the use of a CS is avoiding race conditions, where concurrent processes executing in different orders produce different results. The granularity of a CS affects performance. If the boundary of a CS is too large, it may limit parallelism due to excessive waiting by competing processes.
When the CS is too small, it may add unnecessary code complexity or software overhead. The trick is to shorten a heavy-duty CS or to use conditional CSs to maintain balanced performance.
In Chapter 11, we will study shared variables in the form of locks for implementing mutual exclusion in CSs. Binary and counting semaphores are used to implement CSs and to avoid system deadlocks. Monitors are suitable for structured programming.
Shared-variable programming requires special atomic operations for IPC, new language constructs for expressing parallelism, compilation support for exploiting parallelism, and OS support for scheduling parallel events and avoiding resource conflicts. Of course, all of these depend on the memory consistency model used.
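As a concrete illustration of the shared-variable model, the sketch below uses Python threads as a stand-in for processors sharing a common memory; the variable and function names are illustrative, not taken from the text. A lock protects the critical section so that only one thread updates the shared variable at a time.

import threading

counter = 0                    # shared writable variable in the common memory
lock = threading.Lock()        # lock enforcing mutual exclusion on the CS
NWORKERS = 4

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:             # enter the critical section
            counter += 1       # only one thread touches the shared data at a time

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(NWORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                 # 40000; without the lock the total may come out short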
r....».rr......i.r................... . — 4,.
Shared-memory multiprocessors use shared variables for interprocessor communication. Multiprocessing takes various forms, depending on the number of users and the granularity of divided computations. Four operational modes used in programming multiprocessor systems are specified below:
Multiprogramming Traditionally, multiprogramming is defined as multiple independent programs running on a single processor or on a multiprocessor by time-sharing use of the system resources. A multiprocessor can be used in solving a single large problem or in running multiple programs across the processors.
A multiprogrammed multiprocessor allows multiple programs to run concurrently through time-sharing of all the processors in the system. Multiple programs are interleaved in their CPU and I/O activities. When a program enters I/O mode, the processor switches to another program. Therefore, multiprogramming is not restricted to a multiprocessor. Even on a single processor, multiprogramming is usually implemented.
Multiprocessing When multiprogramming is implemented at the process level on a multiprocessor, it is called multiprocessing. Two types of multiprocessing are specified below. If interprocessor communications are handled at the instruction level, the multiprocessor operates in MIMD mode. If interprocessor communications are handled at the program, subroutine, or procedural level, the machine operates in MPMD (multiple programs over multiple data streams) mode.
In other words, MIMD multiprocessing exploits fine-grain instruction-level parallelism, while MPMD multiprocessing exploits coarse-grain procedure-level parallelism. In both multiprocessing modes, shared variables are used to achieve interprocessor communication. This is quite different from the operations implemented on a message-passing system.
Multitasking A single program can be partitioned into multiple interrelated tasks concurrently executed on a multiprocessor. This has been implemented as multitasking on Cray multiprocessors. Thus multitasking provides the parallel execution of two or more parts of a single program. A job efficiently multitasked requires less execution time. Multitasking is achieved with added code in the original program in order to provide proper linkage and synchronization of divided tasks.
Trade-offs do exist between multitasking and not multitasking. Only when the overhead is short should multitasking be practiced. Sometimes not all parts of a program can be divided into parallel tasks. Therefore, multitasking trade-offs must be analyzed before implementation. Section 11.2 will treat this issue.
Multithreading The traditional UNIX/OS has a single-threaded kernel in which only one process can receive OS kernel service at a time. In a multiprocessor as studied in Chapter 9, we want to extend the single kernel to be multithreaded. The purpose is to allow multiple threads of lightweight processes to share the same address space and to be executed by the same or different processors simultaneously.
The concept of multithreading is an extension of the concepts of multitasking and multiprocessing. The purpose is to exploit fine-grain parallelism in modern multiprocessors built with multiple-context processors or superscalar processors with multiple-instruction issues. Each thread will use a separate program counter. Resource conflicts are the major problem to be resolved in a multithreaded architecture.
The levels of sophistication in securing data coherence and in preserving event order increase from monoprogramming to multitasking, to multiprogramming, to multiprocessing, and to multithreading, in that order. Memory management and special protection mechanisms must be developed to ensure correctness and data integrity in parallel thread operations.
Partitioning and Replication The goal of parallel processing is to exploit parallelism as much as possible with the lowest overhead. Program partitioning is a technique for decomposing a large program and data set into many small pieces for parallel execution by multiple processors.
Program partitioning involves both programmers and the compiler. Parallelism detection by users is often explicitly expressed with parallel language constructs. Program restructuring techniques can be used to transform sequential programs into a parallel form more suitable for multiprocessors. Ideally, this transformation should be carried out automatically by a compiler.
Program replication refers to duplication of the same program code for parallel execution on multiple processors over different data sets. Partitioning is often practiced on a shared-memory multiprocessor system, while replication is more suitable for distributed-memory message-passing multicomputers.
So far, only special program constructs, such as independent loops and independent scalar operations, have been successfully parallelized. Clustering of independent scalar operations into vector or VLIW instructions is another approach toward this end.
Scheduling and Synchronization Scheduling of divided program modules on parallel processors is much more complicated than scheduling of sequential programs on a uniprocessor. Static scheduling is conducted at post-compile time. Its advantage is low overhead, but the shortcoming is a possible mismatch with the run-time profile of each task and therefore potentially poor resource utilization.
Dynamic scheduling catches the run-time conditions. However, dynamic scheduling requires fast context switching, preemption, and much more OS support. The advantages of dynamic scheduling include better resource utilization at the expense of higher scheduling overhead. Static and dynamic methods can be jointly used in a sophisticated multiprocessor system demanding higher efficiency.
In a conventional UNIX system, interprocessor communication (IPC) is conducted at the process level. Processes can be created by any processor. All processes asynchronously accessing the shared data must be protected so that only one is allowed to access the shared writable data at a time. This mutual exclusion property is enforced with the use of locks, semaphores, and monitors to be described in Chapter 11.
At the control level, virtual program counters can be assigned to different processes or threads. Counting semaphores or barrier counters can be used to indicate the completion of parallel branch activities. One can also use atomic memory operations such as Test&Set and Fetch&Add to achieve synchronization. Software-implemented synchronization may require longer overhead. Hardware barriers or combining networks can be used to reduce the synchronization time.
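A Fetch&Add-style barrier counter can be sketched in the same Python-threads setting used earlier; this is an illustrative, single-use emulation in which the atomic read-modify-write is imitated with a lock, not a hardware barrier or combining network, and all names are illustrative.

import threading

class FetchAddBarrier:
    """Single-use barrier built on an emulated atomic Fetch&Add counter."""
    def __init__(self, nprocs):
        self.nprocs = nprocs
        self.count = 0
        self.lock = threading.Lock()
        self.all_arrived = threading.Event()

    def fetch_and_add(self):
        with self.lock:              # emulates one atomic Fetch&Add on the counter
            old = self.count
            self.count += 1
        return old

    def wait(self):
        if self.fetch_and_add() == self.nprocs - 1:
            self.all_arrived.set()   # the last arriving branch releases the others
        self.all_arrived.wait()

barrier = FetchAddBarrier(4)

def branch(i):
    # ... work of one parallel branch ...
    barrier.wait()                   # completion of the branch is signalled here
    print("past barrier", i)

threads = [threading.Thread(target=branch, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()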
Cache Coherence and Protection Besides maintaining data coherence in a memory hierarchy, multiprocessors must assure data consistency between private caches and the shared memory. The multicache coherence problem demands an invalidation or update after each write operation. These coherence control operations require special bus or network protocols for implementation, as noted in previous chapters. A memory system is said to be coherent if the value returned on a read instruction is always the value written by the latest write instruction on the same memory location. The access order to the caches and to the main memory makes a big difference in computational results.
The shared memory of a multiprocessor can be used under various consistency models as discussed in Chapters 4 and 9. Sequential consistency demands that all memory accesses be strongly ordered on a global basis. A processor cannot issue an access until the most recently shared writable memory access has been
globally performed. A weak consistency model enforces ordering and coherence at explicit synchronization points only. Programming with processor consistency or release consistency may be more restricted, but memory performance is expected to improve.
As explained in Chapter 9, fine-grain concurrent programming with global naming was aimed at merging the shared-variable and message-passing mechanisms for heterogeneous processing.
Distributing the Computation Program replication and data distribution are used in multicomputers. The processors in a multicomputer (or a NORMA machine) are loosely coupled in the sense that they do not share memory. Message passing in a multicomputer is handled at the subprogram level rather than at the instructional or fine-grain process level as in a tightly coupled multiprocessor. That is why explicit parallelism is more attractive for multicomputers.
Example 10.1 A concurrent program for distributed computing on a multicomputer (Justin Rattner, Intel Scientific Computers, 1990)
The computation involved is the evaluation of π as the area under the curve f(x) = 4/(1 + x²) between 0 and 1, as shown in Fig. 10.2. Using a rectangle rule, we write the integral in discrete form:

π = ∫₀¹ 4/(1 + x²) dx ≈ h [f(x₁) + f(x₂) + ... + f(xₙ)]
Fig. 10.2 Domain decomposition for concurrent programming on a multicomputer with four processors: the area under the curve f(x) = 4/(1 + x²) over [0, 1] is divided into 20 panels, assigned to processors 0, 1, 2, and 3 in round-robin fashion
where h = 1/n is the panel width, xᵢ = h(i − 0.5) are the midpoints, and n is the number of panels (rectangles) to be computed.
Assume a four-node multicomputer with four processors labeled 0, 1, 2, and 3. The rectangle rule decomposition is shown with n = 20 and h = 1/20 = 0.05. Each processor node is assigned to compute the areas of five rectangular panels. Therefore, the computational load of all four nodes is balanced.
Host program                Node program
input(n)                    p = numnodes()
send(n, allnodes)           me = mynode()
recv(pi)                    recv(n)
output(pi)                  h = 1.0/n
                            sum = 0
                            Do i = me + 1, n, p
                              x = h * (i - 0.5)
                              sum = sum + f(x)
                            End Do
                            pi = h * sum
                            gop('+', pi, host)
Each node executes a separate copy of the node program. Several system calls are used to achieve message passing between the host and the nodes. The host program sends the number of panels n as a message to all the nodes, which receive it accordingly in the node program. The commands numnodes and mynode specify how big the system is and which node it is, respectively.
The software for the iPSC system offers a global summing operation gop('+', pi, host) which iteratively pairs nodes that exchange their current partial sums. Each partial sum received from another node is added to the sum at the receiving node, and the new sum is sent out in the next round of message exchange.
Eventually, all the nodes accumulate the global sum scaled by the panel width (pi = h × sum), which is returned to the host for printout. Not all pairs of node communications need to be carried out. Only log₂ N rounds of message exchanges are required to compute the adder-tree operations, where N is the number of nodes in the system. This point will be further elaborated in Chapter 13.
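The node program above can be mimicked sequentially in Python to check the decomposition; the function and variable names below are illustrative, and the global '+' reduction performed by gop is simulated with an ordinary sum over the per-node contributions.

def f(x):
    return 4.0 / (1.0 + x * x)

def node_program(me, p, n):
    h = 1.0 / n
    s = 0.0
    for i in range(me + 1, n + 1, p):   # Do i = me + 1, n, p  (cyclic assignment)
        x = h * (i - 0.5)
        s += f(x)
    return h * s                        # this node's contribution to pi

p, n = 4, 20                            # four nodes, twenty panels
pi = sum(node_program(me, p, n) for me in range(p))   # simulated global '+' reduction
print(pi)                               # approximately 3.1418 even with only 20 panels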
Synchronization of data-parallel operations is done at compile time rather than at run time. Hardware synchronization is enforced by the control unit to carry out the lockstep execution of SIMD programs. We address below instruction/data broadcast, masking, and data-routing operations separately. Languages, compilers, and the conversion of SIMD programs to run on MIMD multicomputers are also discussed.
Data Parallelism Ever since the introduction of the Illiac IV computer, programming SIMD array processors has been a challenge for computational scientists. The main difficulty in using the Illiac IV had been to match the problem size with the fixed machine size. In other words, large arrays or matrices had to be partitioned into 64-element segments before they could be effectively processed by the 64 processing elements (PEs) in the Illiac IV machine.
A later SIMD computer, the Connection Machine CM-2, offered bit-slice fine-grain data parallelism using 16,384 PEs concurrently in a single-array configuration. This demanded a lower degree of array segmentation and thus offered higher flexibility in programming.
Synchronous SIMD programming differs from asynchronous MIMD programming in that all PEs in an SIMD computer operate in a lockstep fashion, whereas all processors in an MIMD computer execute different instructions asynchronously. As a result, SIMD computers do not have the mutual exclusion or synchronization problems associated with multiprocessors or multicomputers.
Instead, inter-PE communications are directly controlled by hardware. Besides lockstep in computing operations among all PEs, inter-PE data communication is also carried out in lockstep. These synchronized instruction executions and data-routing operations make SIMD computers rather efficient in exploiting spatial parallelism in large arrays, grids, or meshes of data.
In an SIMD program, scalar instructions are directly executed by the control unit. Vector instructions are broadcast to all processing elements. Vector operands are loaded into the PEs from local memories simultaneously using a global address with different offsets in local index registers. Vector stores can be executed in a similar manner. Constant data can be broadcast to all PEs simultaneously.
A masking pattern (binary vector) can be set under program control so that PEs can be enabled or disabled dynamically in any instruction cycle. Masking instructions are directly supported by hardware. Data-routing vector operations are supported by an inter-PE routing network, which is also under program control on a dynamic basis.
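The effect of a masking pattern can be sketched with NumPy used as a stand-in for a PE array: each array element plays the role of one PE, and a boolean mask plays the role of the PE enable bits. The array names are illustrative.

import numpy as np

a = np.arange(8, dtype=float)          # one operand element per "PE"
b = np.full(8, 10.0)                   # constant broadcast to all PEs
mask = np.array([1, 0, 1, 0, 1, 1, 0, 1], dtype=bool)   # PE enable bits

result = a.copy()
result[mask] = a[mask] + b[mask]       # only enabled PEs execute the vector add
print(result)                          # disabled PEs keep their previous values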
Array Language Extensions Array extensions in data-parallel languages are represented by high-level data types. We will specify Fortran 90 array notations in Section 10.2.2. The array syntax enables the removal of some nested loops in the code and should reflect the architecture of the array processor.
Examples of array processing languages are CFD for the Illiac IV, DAP Fortran for the AMT Distributed Array Processor, C* for the TMC Connection Machine, and MPF for the MasPar family of massively parallel computers.
An SIMD programming language should have a global address space, which obviates the need for explicit data routing between PEs. The array extensions should have the ability to make the number of PEs a function of the problem size rather than a function of the target machine.
The Connection Machine C* language satisfied these requirements nicely. A Pascal-based language, Actus, was developed by R.H. Perrott for problem-oriented SIMD programming. Actus offered hardware transparency, application flexibility, and explicit control structures in both program structuring and data typing operations.
Compiler Support To support data-parallel programming, the array language expressions and their optimizing compilers must be embedded in familiar standards such as Fortran 77, Fortran 90, and C. The idea is to unify the program execution model, facilitate precise control of massively parallel hardware, and enable incremental migration to data-parallel execution.
Compiler-optimized control of SIMD machine hardware allows the programmer to drive the PE array transparently. The compiler must separate the program into scalar and parallel components and integrate with the OS environment.
The compiler technology must allow array extensions to optimize data placement, minimize data movement, and virtualize the dimensions of the PE array. The compiler generates data-parallel machine code to perform operations on arrays.
Array sectioning allows a programmer to reference a section or a region of a multidimensional array. Array sections are designated by specifying a start index, a bound, and a stride. Vector-valued subscripts are often used to construct arrays from arbitrary permutations of another array. These expressions are vectors that map the desired elements into the target array. They facilitate the implementation of gather and scatter operations on a vector of indices.
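Gather and scatter through vector-valued subscripts can be sketched with NumPy fancy indexing; this is only an analogy for the array-language feature, and the array names below are illustrative.

import numpy as np

a = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
idx = np.array([4, 0, 3])        # vector-valued subscript (an arbitrary selection)

gathered = a[idx]                # gather: build a new array from selected elements
print(gathered)                  # [50. 10. 40.]

b = np.zeros(5)
b[idx] = gathered                # scatter: store back through the index vector
print(b)                         # [10.  0.  0. 40. 50.]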
SIMD programs can in theory be recompiled for MIMD architectures. The idea is to develop a source-to-source precompiler to convert, for example, Connection Machine C* programs to C programs running on an nCUBE message-passing multicomputer in SPMD mode.
In fact, SPMD programs are a special class of SIMD programs which emphasize medium-grain parallelism and synchronization at the subprogram level rather than at the instruction level. In this sense, the data-parallel programming model applies to both synchronous SIMD and loosely coupled MIMD computers. Program conversion between different machine architectures is needed to broaden software portability. The parallel programming paradigm based on the OpenMP standard is described in Chapter 13.
The development of concurrent object-oriented programming (COOP) provides an alternative model for concurrent computing on multiprocessors or on multicomputers. Various object models differ in the internal behavior of objects and in how they interact with each other.
An Actor Model COOP must support patterns of reuse and classification, for example, through the use of inheritance, which allows all instances of a particular class to share the same property. An actor model developed at MIT is presented as one framework for COOP.
Actors are self-contained, interactive, independent components of a computing system that communicate by asynchronous message passing. In an actor model, message passing is attached with semantics. Basic actor primitives include:
(1) Create: Creating an actor from a behavior description and a set of parameters.
(2) Send-to: Sending a message to another actor.
(3) Become: An actor replacing its own behavior by a new behavior.
State changes are specified by behavior replacement. The replacement mechanism allows one to aggregate changes and to avoid unnecessary control-flow dependences. Concurrent computations are visualized in terms of concurrent actor creations, simultaneous communication events, and behavior replacements. Each message may cause an object (actor) to modify its state, create new objects, and send new messages.
Concurrency control structures represent particular patterns of message passing. The actor primitives provide a low-level description of concurrent systems. High-level constructs are also needed for raising the granularity of descriptions and for encapsulating faults. The actor model is particularly suitable for multicomputer implementations.
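A minimal actor sketch in Python, with one thread and one mailbox queue per actor, can illustrate the three primitives; this is only an illustrative model (the class and behavior names are invented here), not the MIT actor system itself.

import queue
import threading
import time

class Actor:
    def __init__(self, behavior):            # "create": a behavior plus parameters
        self.mailbox = queue.Queue()
        self.behavior = behavior
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, msg):                      # "send-to": asynchronous message passing
        self.mailbox.put(msg)

    def become(self, new_behavior):           # "become": replace the actor's own behavior
        self.behavior = new_behavior

    def _run(self):
        while True:
            msg = self.mailbox.get()
            self.behavior(self, msg)          # each message handled by the current behavior

def initial_behavior(actor, msg):
    print("initial behavior received:", msg)
    actor.become(lambda a, m: print("replacement behavior received:", m))

a = Actor(initial_behavior)
a.send("first")       # handled by the initial behavior
a.send("second")      # handled by the replacement behavior
time.sleep(0.2)       # give the daemon thread time to drain the mailbox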
Parallelism in COOP Three common patterns of parallelism have been found in the practice of COOP. First, pipeline concurrency involves the overlapped enumeration of successive solutions and concurrent testing of the solutions as they emerge from an evaluation pipeline.
Second, divide-and-conquer concurrency involves the concurrent elaboration of different subprograms and the combining of their solutions to produce a solution to the overall problem. In this case, there is no interaction between the procedures solving the subproblems. These two patterns are illustrated by the following examples taken from the paper by Agha (1990).
Example 10.2 Concurrency in object-oriented programming (Gul Agha, 1990)
A prime-number generation pipeline is shown in Fig. 10.3a. Integer numbers are generated and successively tested for divisibility by previously generated primes in a linear pipeline of primes. The circled numbers represent those being generated.
A number enters the pipeline from the left end and is eliminated if it is divisible by the prime number tested at a pipeline stage. All the numbers being forwarded to the right of a pipeline stage are those indivisible by all the prime numbers tested on the left of that stage.
Figure 10.3b shows the multiplication of a list of numbers (10, 7, −2, 3, 4, −11, −3) using a divide-and-conquer approach. The numbers are represented as leaves of a tree. The problem can be recursively subdivided into subproblems of multiplying two sublists, each of which is concurrently evaluated and the results multiplied at the upper node.
Fig. 10.3 (a) A prime-number generation pipeline; (b) divide-and-conquer multiplication of the list (10, 7, −2, 3, 4, −11, −3), with partial products combined up a binary tree to the final result −55440
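One level of this divide-and-conquer pattern can be sketched in Python: the two sublists are multiplied concurrently (here by a small thread pool, used only as an illustration) and the partial products are combined at the upper node.

from concurrent.futures import ThreadPoolExecutor

def product(nums):
    result = 1
    for x in nums:
        result *= x
    return result

nums = [10, 7, -2, 3, 4, -11, -3]
mid = len(nums) // 2
with ThreadPoolExecutor(max_workers=2) as pool:
    left = pool.submit(product, nums[:mid])    # left sublist evaluated concurrently
    right = pool.submit(product, nums[mid:])   # right sublist evaluated concurrently
    print(left.result() * right.result())      # combined at the upper node: -55440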
A third pattern is called cooperative problem solving. A simple example is the dynamic path evaluation (computational objects) of many physical bodies (objects) under the mutual influence of gravitational fields. In this case, all objects must interact with each other; intermediate results are stored in objects and shared by passing messages between them. Interested readers may refer to the book on actors by Agha (1986).
Today companies such as IBM and Cray produce supercomputers with thousands of processors interconnected over high-performance networks. At the same time, object-oriented programming and the message-passing model of interprocess communication have become established as standard paradigms of program design and development. Consider, for example, IBM's powerful Blue Gene line of supercomputers; the standard method of communication among node processes in these supercomputers is the Message-Passing Interface (MPI), customized for the architecture as needed. The Blue Gene line of supercomputers and MPI will both be discussed in Chapter 13.
Two language-oriented models are described next. The functional programming model emphasizes programs free of side effects, while the logic programming model is based on logic programming languages such as Concurrent Prolog and Parlog. We reveal opportunities for parallelism in these two models and discuss their potential in AI applications.
Functional Programming Model A functional programming language emphasizes the functionality of a program and should not produce side effects after execution. There is no concept of storage, assignment, or branching in functional programs. In other words, the history of any computation performed prior to the evaluation of a functional expression should be irrelevant to the meaning of the expression.
The lack of side effects opens up much more opportunity for parallelism. Precedence restrictions occur only as a result of function application. The evaluation of a function produces the same value regardless of the order in which its arguments are evaluated. This implies that all arguments in a dynamically created structure of a functional program can be evaluated in parallel. All single-assignment and dataflow languages are functional in nature. This implies that functional programming models can be easily applied to data-driven multiprocessors. The functional model emphasizes fine-grain MIMD parallelism and is referentially transparent.
The majority of parallel computers designed to support the functional model were oriented toward Lisp, such as Multilisp developed at MIT. Other dataflow computers have been used to execute functional programs, including SISAL used in the Manchester dataflow machine.
Logic Programming Model Based on predicate logic, logic programming is suitable for knowledge processing dealing with large databases. This model adopts an implicit search strategy and supports parallelism in the logic inference process. A question is answered if the matching facts are found in the database. Two facts match if their predicates and associated arguments are the same. The process of matching and unification can be parallelized under certain conditions. Clauses in logic programming can be transformed into dataflow graphs. Parallel unification has been attempted on some dataflow computers built in Japan.
Concurrent Prolog, developed by Shapiro (1986), and Parlog, introduced by Clark (1981), are two parallel logic programming languages. Both languages can implement relational language features such as AND-parallel execution of conjunctive goals, IPC by shared variables, and OR-parallel reduction.
In Parlog, the resolution tree has one chain at AND levels, and OR levels are partially or fully generated. In Concurrent Prolog, the search strategy follows multiple paths or depth first. Stream parallelism is also possible in these logic programming systems.
Both functional and logic programming models have been used in artificial intelligence applications where parallel processing is very much in demand. Japan's Fifth-Generation Computing System (FGCS) project attempted to develop parallel logic systems for problem solving, machine inference, and intelligent human-machine interfacing.
In many ways, the FGCS project was a marriage of parallel processing hardware and AI software. The Parallel Inference Machine (PIM-1) in this project was designed to perform 10 million logic inferences per second (MLIPS). However, more recent AI applications tend to be based on other techniques, such as Bayesian inference.
Availability Features These are features that enhance user-friendliness, make the language portable to a large class of parallel computers, and expand the applicability of software libraries.
• Scalability: The language is scalable to the number of processors available and independent of hardware topology.
• Compatibility: The language is compatible with an established sequential language.
• Portability: The language is portable to shared-memory multiprocessors, message-passing multicomputers, or both.
Synchronization/Communication Features Listed below are desirable language features for synchronization or for communication purposes:
• Single-assignment languages
• Shared variables (locks) for IPC
• Logically shared memory such as the tuple space in Linda
• Send/receive for message passing
• Rendezvous in Ada
• Remote procedure call
• Dataflow languages such as Id
• Barriers, mailboxes, semaphores, monitors
Control of Parallelism Listed below are features involving control constructs for specifying parallelism in various forms:
• Coarse, medium, or fine grain
where each ei is an arithmetic expression that must produce a scalar integer value. The first expression e1 is a lower bound, the second e2 an upper bound, and the third e3 an increment (stride). For example, B(1 : 4 : 3, 6 : 8 : 2, 3) represents the four elements B(1, 6, 3), B(4, 6, 3), B(1, 8, 3), and B(4, 8, 3) of a three-dimensional array.
When the third expression in a triplet is missing, a unit stride is assumed. The * notation in the second expression indicates all elements in that dimension starting from e1, or the entire dimension if e1 is also omitted. When both e2 and e3 are omitted, e1 alone represents a single element in that dimension. For example, A(5) represents the fifth element in the array A(3 : 7 : 2). This notation allows us to select array sections or particular array elements.
Array assignments are permitted under the following constraints: the array expression on the right must have the same shape and the same number of elements as the array on the left. For example, the assignment A(2 : 4, 5 : 8) = A(3 : 5, 1 : 4) is valid, but the assignment A(1 : 4, 1 : 3) = A(1 : 2, 1 : 6) is not valid, even though each side has 12 elements. When a scalar is assigned to an array, the value of the scalar is assigned to every element of the array. For instance, the statement B(3 : 4, 5) = 0 sets B(3, 5) and B(4, 5) to 0.
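The section semantics can be mirrored with NumPy slicing, bearing in mind that Python indexing is 0-based and slice upper bounds are exclusive, so the Fortran triplet e1:e2:e3 corresponds roughly to the slice (e1−1):e2:e3; the arrays below are illustrative, not taken from the text.

import numpy as np

B3 = np.zeros((4, 8, 3))
B3[0:4:3, 5:8:2, 2] = 1.0          # ~ B(1:4:3, 6:8:2, 3): exactly four elements
print(int(B3.sum()))               # 4

A = np.arange(1.0, 41.0).reshape(5, 8)   # think of it as A(1:5, 1:8)
A[1:4, 4:8] = A[2:5, 0:4]                # ~ A(2:4, 5:8) = A(3:5, 1:4): shapes conform
B = np.zeros((4, 10))
B[2:4, 4] = 7.0                          # ~ B(3:4, 5) = 7: scalar spread over a section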
Parallel Flow Control The conventional Fortran Do loop declares that all scalar instructions within the (Do, Enddo) pair are executed sequentially, and so are the successive iterations. To declare parallel activities, we use the (Doall, Endall) pair. All iterations in the Doall loop are totally independent of each other. This implies that they can be executed in parallel if there are sufficient processors to handle different iterations. However, the computations within each iteration are still executed serially in program order.
When the successive iterations of a loop depend on each other, we use the (Doacross, Endacross) pair to declare parallelism with loop-carried dependences. Synchronizations must be performed between the iterations that depend on each other. For example, dependence along the J-dimension exists in the following program. We use Doacross to declare parallelism along the I-dimension, but synchronization between iterations is required. The (Forall, Endall) and (Pardo, Parend) commands can be interpreted either as a Doall loop or as a Doacross loop.
Doacross I = 2, N
  Do J = 2, N
    S1:  A(I, J) = (A(I, J − 1) + A(I, J + 1))/2
  Enddo
Endacross
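Doall semantics can be sketched in Python by handing independent iterations to a pool of workers; the iteration body below is illustrative, and ProcessPoolExecutor is used only as a stand-in for the parallel processors executing independent iterations.

from concurrent.futures import ProcessPoolExecutor

def iteration(i):
    # body of one Doall iteration; no data is shared between iterations
    return i * i

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(iteration, range(1, 11)))   # Doall I = 1, 10
    print(results)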
Another program construct is the (Cobegin, Coend) pair. All computations specified within the block could be executed in parallel. But parallel processes may be created with a slight time difference in real implementations. This is quite different from the semantics of the Doall loop or Doacross loop structures. Synchronizations among concurrent processes created within the pair are implied. Formally, the command

Cobegin
  P1
  P2
  ...
  Pn
Coend
causes processes P1, P2, ..., Pn to start simultaneously and to proceed concurrently until they have all ended. The command (Parbegin, Parend) has equivalent meaning.
Finally, we introduce the Fork and Join commands in the following example. During the execution of a process P, we can use a Fork Q command to spawn a new process Q:

Process P        Process Q
  ...              ...
  Fork Q           ...
  ...              End
  Join Q
The Join Q command recombines the two processes into one process. Execution of Q is initiated when the Fork Q statement in P is executed. Programs P and Q are executed concurrently until either P executes the Join Q statement or Q terminates. Whichever one finishes first must wait for the other to complete execution before they can be rejoined.
In a UNIX or LINUX environment, the Fork-Join statements provide a direct mechanism for dynamic process creation, including multiple activations of the same process. The Cobegin-Coend statements provide a structured single-entry, single-exit control command which is not as dynamic as the Fork-Join. The (Parbegin, Parend) command is equivalent to the (Cobegin, Coend) command.
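The Fork/Join pattern can be sketched with Python's multiprocessing module, where start() plays the role of Fork Q and join() the role of Join Q; the process body below is illustrative.

from multiprocessing import Process

def process_q():
    print("Q: running concurrently with P")

if __name__ == "__main__":
    q = Process(target=process_q)
    q.start()                  # Fork Q: spawn a new process Q
    print("P: continues its own work")
    q.join()                   # Join Q: whichever finishes first waits here
    print("P: rejoined after Q terminated")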
(Figure: compilation phases for parallel code generation: flow analysis (data dependence, control dependence, reuse analysis) followed by program optimizations (vectorization, parallelization, locality, pipelining).)
Flow Analysis This phase reveals the program flow patterns in order to determine data and control dependences in the source code. We have discussed data dependence relations among scalar-type instructions in previous chapters. Scalar dependence analysis is extended below to structured data arrays or matrices. Depending on the machine structure, the granularities of parallelism to be exploited are quite different. Thus the flow analysis is conducted at different execution levels on different parallel computers.
Generally speaking, instruction-level parallelism is exploited in superscalar or VLIW processors; loop-level parallelism in SIMD, vector, or systolic computers; and task-level parallelism in multiprocessors, multicomputers, or a network of workstations. Of course, exceptions do exist. For example, fine-grain parallelism can in theory be pushed down to multicomputers with a globally shared address space. The flow analysis must also reveal code/data reuse and memory-access patterns.
Program Optimizations This refers to the transformation of user programs in order to explore the hardware capabilities as much as possible. Transformation can be conducted at the loop level, locality level, or prefetching level with the ultimate goal of reaching global optimization. The optimization often transforms a code into an equivalent but "better" form in the same representation language. These transformations should be machine-independent.
In reality, most transformations are constrained by the machine architecture. This is the main reason why many such compilers are machine-dependent. At the least, we want to design a compiler which can run on most machines with only minor modifications. One can also conduct certain transformations preceding the global
optimization. This may require a source-to-source optimization (sometimes carried out by a precompiler), which transforms the program from one high-level language to another before using a dedicated compiler for the second language on a target machine.
The ultimate goal of program optimization is to maximize the speed of code execution. This involves the minimization of code length and of memory accesses and the exploitation of parallelism in programs. The optimization techniques include vectorization using pipelined hardware and parallelization using multiple processors simultaneously. The compiler should be designed to reduce the running time with minimum resource binding. Other optimizations demand the expansion of routines or procedure integration with inlining. Both local and global optimizations are needed in most programs. Sometimes the optimization should be conducted at the algorithmic level and must involve the programmer.
Machine-dependent transformations are meant to achieve more efficient allocation of machine resources, such as processors, memory, registers, and functional units. Replacement of complex operations by cheaper ones is often practiced. Other optimizations include elimination of unnecessary branches or common expressions. Instruction scheduling can be used to eliminate pipeline or memory delays in executing consecutive instructions.
Parallel Code Generation Code generation usually involves transformation from one representation to another, called an intermediate form. A code model must be chosen as an intermediate form. Parallel code is even more demanding because parallel constructs must be included. Code generation is closely tied to the instruction scheduling policies used. Basic blocks linked by control-flow commands are often optimized to encourage a high degree of parallelism. Special data structures are needed to represent instruction blocks.
Parallel code generation is very different for different computer classes. For example, a superscalar processor may be software-scheduled or hardware-scheduled. How to optimize the register allocation on a RISC or superscalar processor, how to reduce the synchronization overhead when codes are partitioned for multiprocessor execution, and how to implement message-passing commands when codes/data are distributed (or replicated) on a multicomputer are added difficulties in parallel code generation. Compiler directives can be used to help generate parallel code when automated code generation cannot be implemented easily.
Two well-known exploratory optimizing compilers were developed in the mid-1980s: one was Parafrase at the University of Illinois, and the other was the PFC (Parallel Fortran Converter) at Rice University. These systems are briefly introduced below.
Parafrase and Parafrase 2 This system, developed by David Kuck and coworkers at Illinois, is a source-to-source program restructurer (or compiler preprocessor) which transforms sequential Fortran 77 programs into forms suitable for vectorization or parallelization. Parafrase contains more than 100 program transformations which are encoded as passes. A pass list is used to identify the particular sequence of transformations needed for restructuring a given sequential program. The output of Parafrase is the converted concurrent program.
Different programs use different pass lists and thus go through different sequences of transformations. The pass lists can be optimized for specific machine architectures and specific program constructs. Parafrase 2 was developed for handling programs written in C and Pascal, in addition to converting Fortran codes. Information on Parafrase can be found in [Kuck84] and on Parafrase 2 in [Polychronopoulos89].
Parafrase is retargetable to produce code for different classes of parallel/vector computers. The program transformed by Parafrase still needs a conventional optimizing compiler to produce the object code for the target machine. The Parafrase technology was later transferred to implement the KAP vectorizer by Kuck and Associates, Inc.
The PFC and ParaScope Ken Kennedy and his associates at Rice University developed PFC as an automatic source-to-source vectorizer. It translated Fortran 77 code into Fortran 90 code. A categorized dependence testing scheme was developed in PFC for revealing opportunities for loop vectorization. The PFC package was also extended to PFC+ for parallel code generation on shared-memory multiprocessors. PFC and PFC+ also supported the ParaScope programming environment.
PFC [Allen and Kennedy, 1984] performed syntax analysis, including the following four steps:
(1) Interprocedural flow analysis using call graphs.
(2) Standard transformations such as Do-loop normalization, subscript categorization, deletion of dead code, etc.
(3) Dependence analysis which applied the separability, GCD, and Banerjee tests jointly.
(4) Vector code generation. PFC+ further implemented a parallel code generation algorithm (Callahan et al, 1988).
Commercial Compilers Optimizing compilers have also been developed for a number of commercial parallel/vector computers, including the Alliant FX/Fortran compiler, the Convex parallelizing/vectorizing compiler, the Cray CFT compiler, the IBM vectorizing Fortran compiler, the VAST vectorizer by Pacific Sierra, Inc., and the Intel iPSC-VX compiler. IBM also developed the PTRAN (Parallel Fortran) system based on control dependence with interprocedural analysis.
Do i1 = L1, U1
  ...
    Do in = Ln, Un
      S1:  A(f1(i1, ..., in), ..., fm(i1, ..., in)) = ...
      S2:  ... = A(g1(i1, ..., in), ..., gm(i1, ..., in))
    Enddo
  ...
Enddo
Iteration Space The n-dimensional discrete Cartesian space for n-deep loops is called an iteration space. Each iteration is represented by its coordinates in the iteration space. The following example clarifies the concept of lexicographic order for the successive iterations in a loop nest.
Example 10.3 Lexicographic order for sequential execution of successive iterations in a loop structure (Monica Lam, 1992)
Consider a two-dimensional iteration space (Fig. 10.5) representing the following two-level loop nest with unit-increment steps:

Do i = 0, 5
  Do j = i, 7
    A(i, j) = ...
  Enddo
Enddo
Fig. 10.5 A two-dimensional iteration space for the loop nest in Example 10.3
The following sequential order of iterations is a lexicographic order:

(0, 0), (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7)
(2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7)
(3, 3), (3, 4), (3, 5), (3, 6), (3, 7)
(4, 4), (4, 5), (4, 6), (4, 7)
(5, 5), (5, 6), (5, 7)
The lexicographic order is important to performing matrix transformations, which can be applied for loop optimization. We will apply lexicographic orders for loop parallelization in Section 10.5.
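The lexicographic order above can be regenerated by enumerating the loop nest directly; a minimal Python sketch:

order = [(i, j) for i in range(0, 6) for j in range(i, 8)]   # Do i = 0,5; Do j = i,7
print(order[:8])    # (0, 0) through (0, 7): the first row listed above
print(len(order))   # 33 iterations in total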
Dependence Equations Let α and β be vectors of n integer indices within the ranges of the upper and lower bounds of the n loops. There is a dependence from S1 to S2 if and only if there exist α and β such that α is lexicographically less than or equal to β and the following system of dependence equations is satisfied:

fi(α) = gi(β)   for all i, 1 ≤ i ≤ m

The difference β − α defines a distance vector, and the signs of its elements define the corresponding direction vector. The elements of a distance or direction vector are always displayed in order from left to right and from the outermost to the innermost loop in the nest.
For example, consider the following loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i + 1, j, k − 1) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The distance and direction vectors for the dependence between iterations along the three dimensions of the array A are (1, 0, −1) and (<, =, >), respectively. Since several different values of α and β may satisfy the dependence equations, a set of distance and direction vectors may be needed to completely describe the dependence.
Direction vectors are useful for calculating the level of loop-carried dependences. A dependence is carried by the outermost loop for which the direction in the direction vector is not "=". For instance, the direction vector (<, =, >) for the dependence above shows that the dependence is carried on the i-loop.
Carried dependences are important because they determine which loops cannot be executed in parallel without synchronization. Direction vectors are also useful in determining whether loop interchange is legal and profitable. Distance vectors are more precise versions of direction vectors that specify the actual distance in loop iterations between two accesses to the same memory location. They may be used to guide optimizations to exploit parallelism or the memory hierarchy.
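A brute-force Python sketch can recover the distance and direction vectors for the loop nest above by enumerating iteration pairs that touch the same array element; the loop bound N below is an arbitrary illustrative value.

from itertools import product

N = 4
writes = {}
for it in product(range(1, N + 1), repeat=3):            # iteration vector (i, j, k)
    i, j, k = it
    writes.setdefault((i + 1, j, k - 1), []).append(it)  # element written by A(i+1, j, k-1)

def direction(d):
    return "<" if d > 0 else ("=" if d == 0 else ">")

vectors = set()
for it in product(range(1, N + 1), repeat=3):
    i, j, k = it
    for src in writes.get((i, j, k), []):                # this iteration reads A(i, j, k)
        dist = tuple(b - a for a, b in zip(src, it))     # distance = sink - source
        vectors.add((dist, tuple(direction(d) for d in dist)))

print(vectors)    # {((1, 0, -1), ('<', '=', '>'))}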
Example 10.4 Subscript types in a loop computation
Consider the following loop nest of three levels, identified by indices i, j, and k.
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(5, i − 1, j) = A(N, i, k) + C
    Enddo
  Enddo
Enddo
When testing for a flow dependence between the two references to A in the code, the first subscript is ZIV because 5 and N are both constants, the second is SIV because only the index i appears in this dimension, and the third is MIV because both indices j and k appear in the third dimension. For simplicity, we have ignored the output dependence in this example.
Subscript Separability When testing multidimensional arrays, we say that a subscript position is separable if its indices do not occur in the other subscripts. If two different subscripts contain the same index, we say they are coupled. Separability is important because multidimensional array references can cause imprecision in dependence testing.
If all the subscripts are separable, we may compute the direction vector for each subscript independently and merge the direction vectors on a positional basis with full precision. The following examples clarify these concepts.
Example 10.5 From separability to direction vector and distance vector
Consider the following loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i, j, j) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The first subscript is separable because the index i does not appear in the other dimensions, but the second and third are coupled because they both contain the index j. ZIV subscripts are separable because they contain no indices.
Consider another loop nest:
Do i = L1, U1
  Do j = L2, U2
    Do k = L3, U3
      A(i + 1, j, k − 1) = A(i, j, k) + C
    Enddo
  Enddo
Enddo
The leftmost direction in the direction vector is determined by testing the first subscript, the middle direction by testing the second subscript, and the rightmost direction by testing the third subscript. The resulting direction vector (<, =, >) is precise. The same approach applied to distances allows us to calculate the exact distance vector (1, 0, −1).
Subscript Partitioning We need to classify all the subscripts in a pair of array references as separable or as part of some minimal coupled group. A coupled group is minimal if it cannot be partitioned into two nonempty subgroups with distinct sets of indices. Once a partition is achieved, each separable subscript and each coupled group has a completely disjoint set of indices.
Each partition may then be tested in isolation and the resulting distance or direction vectors merged without any loss of precision. Since each separable subscript and coupled subscript group contains a unique subset of indices, a merge may be thought of as a Cartesian product.
In the following loop nest, the first subscript yields the direction vector (<) for the i-loop. The second subscript yields the direction vector (=) for the j-loop. The resulting Cartesian product is the single vector (<, =).
Do i = L1, U1
  Do j = L2, U2
    A(i − 1, j) = A(i, j) + C
  Enddo
Enddo
Consider another loop nest where the first subscript yields the direction vector (<) for the i-loop:

Do i = L1, U1
  Do j = L2, U2
    A(i + 1, 5) = A(i, N) + C
  Enddo
Enddo
Since j does not appear in any subscript, we must assume the full set of direction vectors for the j-loop: {(<), (=), (>)}. Thus a merge yields the following set of direction vectors for both dimensions:

{(<, <), (<, =), (<, >)}
10.3.3 Categorized Dependence Tests
The goal of dependence testing is to construct the complete set of distance and direction vectors representing potential dependences between an arbitrary pair of subscripted references to the same array variable. Since distance vectors may be treated as precise direction vectors, we will simply refer to direction vectors.
The Testing Algorithm The following procedure is for dependence testing based on a partitioning approach, which can isolate unrelated indices and localize the computation involved, and thus is easier to implement.
(1) Partition the subscripts into separable and minimal coupled groups using the following algorithm (see the sketch following this list):
Subscript Partitioning Algorithm (Goff, Kennedy, and Tseng, 1991)
Input:  A pair of m-dimensional array references
        containing subscripts S1 ... Sm enclosed in n loops
        with indices I1 ... In.
Output: A set of partitions P1 ... Pn', n' ≤ n, each
        containing a separable or minimal coupled group.

For each i, 1 ≤ i ≤ m Do
  Pi ← {Si}
Endfor
For each index Ii, 1 ≤ i ≤ n Do
  k ← <none>
  For each remaining partition Pj Do
    If ∃ Sl ∈ Pj such that Sl contains Ii then
      If k = <none> then
        k ← j
      else
        Pk ← Pk ∪ Pj
        Discard Pj
      Endif
    Endif
  Endfor
Endfor
(2) Label each subscript as ZIV, SIV, or MIV.
(3) For each separable subscript, apply the appropriate single-subscript test (ZIV, SIV, MIV) based on the complexity of the subscript. This will produce independence or direction vectors for the indices occurring in that subscript.
(4) For each coupled group, apply a multiple-subscript test to produce a set of direction vectors for the indices occurring within that group.
(5) If any test yields independence, no dependences exist.
(6) Otherwise merge all the direction vectors computed in the previous steps into a single set of direction vectors for the two references.
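A minimal Python sketch of the partitioning step (1): each subscript position is described here simply by the set of loop indices it contains, and subscripts sharing an index are merged into one coupled group. The representation is an assumption made for illustration.

def partition(subscript_indices, loop_indices):
    """subscript_indices: list of index sets, one per subscript position."""
    partitions = [{s} for s in range(len(subscript_indices))]   # P_i <- {S_i}
    for idx in loop_indices:
        k = None
        for p in list(partitions):
            if any(idx in subscript_indices[s] for s in p):
                if k is None:
                    k = p                    # first partition containing this index
                else:
                    k |= p                   # merge coupled partitions into P_k
                    partitions.remove(p)     # discard P_j
    return partitions

# Example 10.5: A(i, j, j) = A(i, j, k) + C  ->  positions contain {i}, {j}, {j, k}
print(partition([{"i"}, {"j"}, {"j", "k"}], ["i", "j", "k"]))
# [{0}, {1, 2}]: the first subscript is separable, the second and third are coupled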
Test Categories Dependence test results for ZIV subscripts are treated specially. If a ZIV subscript proves independence, the dependence test algorithm halts immediately. If independence is not proved, the ZIV test does not produce direction vectors, and so no merge is necessary. For the implementation of the above algorithm, we specify below how to perform the single-subscript tests (ZIV, SIV, MIV) separately. We consider the trivial case of ZIV first, then SIV, and finally MIV, which is more involved.
We first consider dependence tests for single separable subscripts. All tests presented assume that the subscript being tested contains expressions that are linear in the loop index variables. A subscript expression is linear if it has the form a1 i1 + a2 i2 + ... + an in + c, where ik is the index for the loop at nesting level k; all ak, 1 ≤ k ≤ n, are integer constants; and c is an expression possibly containing loop-invariant symbolic expressions.
The ZIV Test The ZIV test is a dependence test performed on two loop-invariant expressions. If the system determines that the two expressions cannot be equal, it has proved independence. Otherwise the subscript does not contribute any direction vectors and may be ignored. The ZIV test can be easily extended for symbolic expressions. Simply form the expression representing the difference between the two subscript expressions. If the difference simplifies to a nonzero constant, we have proved independence.
The SIV Test An SIV subscript for index i is said to be strong if it has the form (a i + c1, a i' + c2), i.e. if it is linear and the coefficients of the two occurrences of the index i are constant and equal. For strong SIV subscripts, define the dependence distance as

d = i' − i = (c1 − c2)/a        (10.4)

A dependence exists if and only if d is an integer and |d| ≤ U − L, where U and L are the loop upper and lower bounds. For dependences that do exist, the dependence direction is given by

Direction = <   if d > 0
            =   if d = 0        (10.5)
            >   if d < 0
The strong SIV test is thus an exact test that can be implemented very efficiently in a few operations. A bounded iteration space is shown in Fig. 10.6a. The case of a strong SIV test is shown in Fig. 10.6b.
Another advantage of the strong SIV test is that it can be easily extended to handle loop-invariant symbolic expressions. The trick is to first evaluate the dependence distance d symbolically. If the result is a constant, then the test may be performed as above. Otherwise calculate the difference between the loop bounds and compare the result with d symbolically.
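The strong SIV test of Eqs. 10.4 and 10.5 can be sketched directly in Python; the subscript pair and loop bounds used in the example calls are illustrative.

from fractions import Fraction

def strong_siv(a, c1, c2, L, U):
    d = Fraction(c1 - c2, a)                 # dependence distance d = (c1 - c2)/a
    if d.denominator != 1 or abs(d) > U - L:
        return None                          # independence proved
    d = int(d)
    direction = "<" if d > 0 else ("=" if d == 0 else ">")
    return d, direction

# A(i + 2) = A(i) + C inside Do i = 1, 10: subscript pair (i + 2, i), i.e. a=1, c1=2, c2=0
print(strong_siv(1, 2, 0, 1, 10))    # (2, '<'): loop-carried dependence of distance 2
print(strong_siv(1, 20, 0, 1, 10))   # None: the distance exceeds the iteration range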
A weak SIV subscript has the form (a1 i + c1, a2 i' + c2), where the coefficients of the two occurrences of index i have different constant values. As stated previously, weak SIV subscripts may be solved using the single-index exact test. However, we also find it helpful to view the problem geometrically, where the dependence equation

a1 i + c1 = a2 i' + c2

describes a line in the plane with i and i' as axes.
Weak-Zero SIV Test The case in which a1 = 0 or a2 = 0 is called a weak-zero SIV subscript, as illustrated in Fig. 10.6c. If a2 = 0, the dependence equation reduces to

i = (c2 − c1)/a1
Fig. 10.6 Geometric view of SIV tests in four cases: (a) bounded iteration space, (b) strong SIV, (c) weak-zero SIV, (d) weak-crossing SIV (Courtesy of Goff et al, 1991; reprinted from ACM SIGPLAN Conf. on Programming Language Design and Implementation, Toronto, Canada, 1991)
It is only necessary to check that the resulting value for i is an integer and within the loop bounds. A similar check applies when a1 = 0.
The weak-zero SIV test finds dependences caused by a particular iteration i. In scientific codes, i is usually the first or last iteration of the loop, eliminating one possible direction vector for the dependence.
Weak-Crossing SIV Test All subscripts where a2 = −a1 are weak-crossing SIV. These subscripts typically occur as part of Cholesky decomposition, also illustrated in Fig. 10.6d. In these cases we set i = i' and derive the dependence equation

i = (c2 − c1)/(2 a1)

This corresponds to the intersection of the dependence equation with the line i = i'. To determine whether dependences exist, we simply need to check that the resulting value i is within the loop bounds and is either an integer or has a noninteger part equal to 1/2.
Weak-crossing SIV subscripts cause crossing dependences, loop-carried dependences whose endpoints all cross iteration i. These dependences may be eliminated using a loop-splitting transformation (Kennedy et al, 1991) as described below.
Consider the following loop from the Callahan-Dongarra-Levine vector test (Callahan et al., 1988):

Do i = 1, N
  A(i) = A(N − i + 1) + C
Enddo

The weak-crossing SIV test determines that dependences exist between the definition and use of A and that they all cross iteration (N + 1)/2. Splitting the loop at that iteration results in two parallel loops:

Do i = 1, (N + 1)/2
  A(i) = A(N − i + 1) + C
Enddo
Do i = (N + 1)/2 + 1, N
  A(i) = A(N − i + 1) + C
Enddo
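A minimal Python sketch of the weak-crossing check for this loop: the subscript pair of A(i) = A(N − i + 1) + C is (i, −i + N + 1), so a1 = 1, c1 = 0, c2 = N + 1, and the crossing point is (c2 − c1)/(2 a1). The value of N below is illustrative.

from fractions import Fraction

def weak_crossing_siv(a1, c1, c2, L, U):
    i = Fraction(c2 - c1, 2 * a1)            # intersection with the line i = i'
    in_bounds = L <= i <= U
    half_or_integer = i.denominator in (1, 2)  # integer, or noninteger part equal to 1/2
    return in_bounds and half_or_integer, i

N = 9
print(weak_crossing_siv(1, 0, N + 1, 1, N))  # (True, Fraction(5, 1)): crossing at i = 5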
The MIV Test SIV tests can be extended to handle complex iteration spaces where loop bounds may be functions of other loop indices, e.g. triangular or trapezoidal loops. We need to compute the minimum and maximum loop bounds for each loop index.
Starting at the outermost loop nest and working inward, we replace each index in a loop upper bound with its maximum value (or minimum if it is a negative term). We do the opposite in the lower bound, replacing each index with its minimum value (or maximum if it is a negative term).
We evaluate the resulting expressions to calculate the minimum and maximum values for the loop index and then repeat for the next inner loop. This algorithm returns the maximal range for each index, all that is needed for SIV tests.
The Banerjee-GCD test may be employed to construct all legal direction vectors for linear subscripts containing multiple indices. In most cases the test can also determine the minimal dependence distance for the carrier loop.
A special case of MIV subscripts, called RDIV (restricted double-index variable) subscripts, has the form (a1 i + c1, a2 j + c2). They are similar to SIV subscripts except that i and j are distinct indices. By observing different loop bounds for i and j, SIV tests may also be extended to test RDIV subscripts exactly.
A large body of work was performed in the field of dependence testing at Rice University, the University of Illinois, and the Oregon Graduate Institute. What was described above is only one of the many dependence testing algorithms proposed. Experimental results are reported from these research centers. Readers are advised to read published material on Banerjee's test and the GCD test, which provide other inexact and conservative solutions to the problem.
The development of a parallelizing compiler is limited by the difficulty of having to deal with many nonperfectly nested loops. The lack of dataflow information is often the ultimate limit on automatic compilation of parallel code.
use of functional units, registers, data paths, and memory. Program constraints are caused by data and control dependences. Some processors, like those with VLIW architecture, explicitly specify ILP in their instructions. Others may use hardware interlocks, out-of-order execution, or speculative execution. Even machines with dynamic scheduling hardware can benefit from compiler scheduling techniques.
There are two alternative approaches to supporting instruction scheduling. One is to provide an additional set of nontrapping instructions so that the compiler can perform aggressive static instruction scheduling. This approach requires an extension of the instruction set of existing processors. The second approach is to support out-of-order execution in the microarchitecture so that the hardware can perform aggressive dynamic instruction scheduling. This approach usually does not require the instruction set to be modified but requires complex hardware support.
In general, instruction scheduling methods ensure that control dependences, data dependences, and resource limitations are properly handled during concurrent execution. The goal is to produce a schedule that minimizes the execution time or the memory demand, in addition to enforcing correctness of execution. Static scheduling at compile time requires intelligent compilation support, whereas dynamic scheduling at run time requires sophisticated hardware support. In practice, dynamic scheduling can be assisted by static scheduling in improving performance.
Precedence Constraints Speculative execution requires the use of program profiling to estimate
effectiveness. Speculative exceptions must not terminate execution. In other words, precise exception
handling is desired to alleviate the control dependence problem. The data dependence problem involves
instruction ordering and register allocation issues.
If a flow dependence is detected, the write must proceed ahead of the read operation involved. Similarly,
output dependence produces different results if two writes to the same location are executed in a different
order. Antidependence enforces a read to be ahead of the write operation involved. We need to analyze
the memory variables. Scalar data dependence is much easier to detect. Dependence among arrays of data
elements is much more involved, as shown in Section 10.3. Other difficulties lie in interprocedural analysis,
pointer analysis, and register allocation interacting with code scheduling.
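The three kinds of data dependence named above can be seen in a few lines of straight-line code. The snippet below is purely illustrative (the variable names are not from the text).

```python
# Illustrative straight-line code showing flow, anti-, and output dependence.
b, c, e = 1, 2, 3
a = b + c      # S1
d = a * 2      # S2: flow (true) dependence on S1 -- S2 reads a after S1 writes it
a = e - 1      # S3: antidependence with S2 -- S3 writes a after S2 reads it;
               #     output dependence with S1 -- both S1 and S3 write a
print(d, a)
```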
Basic Block Scheduling A basic block (or just a block) is a sequence of statements satisfying two properties:
(1) No statement but the first can be reached from outside the block; i.e. there are no branches into the middle
of the block. (2) All statements are executed consecutively if the first one is. Therefore, no branches out or
halts are allowed until the end of the block. All blocks are required to be maximal in the sense that they cannot
be extended up or down without violating these properties.
For local optimization only, an extended basic block is defined as a sequence of statements in which the
first statement is the only entry point. Thus an extended block may have branches out in the middle of the
code but no branches into it. The basic steps for constructing basic blocks are summarized below:
(1) Find the leaders, which are the first statements in a block. Leaders are identified as being one or more
of the following:
    (a) The first statement of the code.
    (b) The target of a conditional or unconditional branch.
    (c) A statement following a conditional branch.
(2) For a leader, a basic block consists of the leader and all statements following up to but excluding the
next leader. Note that the beginning of inaccessible code (dead code) is not considered a leader. In fact,
dead code should be eliminated. A small sketch of these two steps is given below.
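The following sketch applies steps (1) and (2) to a simple, hypothetical instruction representation (dictionaries with optional 'label', 'goto', and 'cond' fields); it is an illustration of the procedure above, not a production partitioner.

```python
def find_leaders(instrs):
    """instrs: list of dicts with optional 'label', 'goto' (branch target),
    and 'cond' (True for a conditional branch)."""
    leaders = {0}                                    # (a) the first statement
    labels = {ins['label']: i for i, ins in enumerate(instrs) if 'label' in ins}
    for i, ins in enumerate(instrs):
        if 'goto' in ins:
            if ins['goto'] in labels:                # (b) target of a branch
                leaders.add(labels[ins['goto']])
            if ins.get('cond') and i + 1 < len(instrs):
                leaders.add(i + 1)                   # (c) statement after a conditional branch
    return sorted(leaders)

def basic_blocks(instrs):
    leaders = find_leaders(instrs)
    ends = leaders[1:] + [len(instrs)]
    # (2) each block runs from its leader up to, but excluding, the next leader
    return [instrs[s:e] for s, e in zip(leaders, ends)]
```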
Fig. 10.7 A flow graph showing the precedence relationship among basic blocks in the bubble sort program
(Courtesy of S. Graham, J. L. Hennessy, and J. D. Ullman, Course on Code Optimization and Code
Generation, Western Institute of Computer Science, Stanford University, 1991)
While a program is being compiled, basic block records should keep pointers to predecessors and
successors. Storage reclamation techniques or linked list structures can be used to represent blocks. Sources
of program optimization include algebraic optimization to eliminate redundant operations, and other
optimizations conducted within the basic blocks locally or on a global basis.
(1) Local Common Subexpression Elimination If a subexpression is to be evaluated more than once within
a single block, it can be replaced by a single evaluation. In Example 10.6, in block B7, t9 and t15 each
compute 4 * (j - 1), and t12 and t18 each compute 4 * j. Replacing t15 by t9, and t18 by t12, we obtain the
following revised code for B7, which is shorter to execute.
t8  := j - 1
t9  := 4 * t8
temp := A[t9]
t12 := 4 * j
t13 := A[t12]
A[t9]  := t13
A[t12] := temp
(2) Local Constant Folding or Propagation Sometimes some constants used in instructions can be computed
at compile time. This often takes place in the initialization blocks. The compile-time generated constants are
then folded to eliminate unnecessary calculations at run time. In other cases, a local copy may be propagated
to eliminate unnecessary calculations.
(3) Algebraic Optimization to Simplify Expressions For example, one can replace the identity statement A :=
B + 0 or A := B * 1 by A := B and later even replace references to this A by references to B. Or one can use
the commutative law to combine expressions C := A + B and D := B + A. The associative and distributive
laws can also be applied on equal-priority operators, such as replacing (a - b) + c by a - (b - c) if (b - c) has
already been evaluated earlier.
(4) Instruction Reordering Code reordering is often practiced to maximize the pipeline utilization or to enable
overlapped memory accesses. Some orders yield better code than others. Reordered instructions lead to better
scheduling, preventing pipeline or memory delays. In the following example, instruction I3 may be delayed
in memory accesses:
(1) Global Versions of Local Optimizations These include global common subexpression elimination, global
constant propagation, dead code elimination, etc. The following example further optimizes the code in
Example 10.6 if some global optimizations are performed.
Example 10.7 Global optimizations in the bubble sort program (S. Graham, J. L. Hennessy, and
J. D. Ullman, 1992)
In Example 10.6, block B7 needs to compute A[t9] = A[4 * (j - 1)], which was computed in block B6. To reach
B7, B6 must be executed first and the value of j never changes between the two nodes. Thus the first three
statements of B7 can be replaced by temp := t3. Similarly, t9 computes the same value as t2, t12 computes the
same value as t6, and t13 computes the same value as t7. The entire block B7 may be replaced by
temp := t3
A[t2] := t7
A[t6] := temp
(2) Loop Optimizations These include various loop transformations to be described in subsequent sections
for the purpose of vectorization, parallelization, or both. Sometimes code motion and induction variable
elimination can simplify loop structures. For example, one can replace the calculation of an induction variable
involving a multiplication by an addition to its former value. The addition takes less time to perform and thus
results in a shorter execution time, as sketched in the example below.
In other cases, loop-invariant variables or code can be moved out of the loop to simplify the loop nest.
One can also lower the loop control overhead using loop unrolling to reduce iterations or loop fusion to merge
loops. Loops can be exchanged to facilitate pipelining of long vectors. Many loop optimization examples will
be given in subsequent sections.
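The sketch below shows the induction-variable replacement mentioned above in a before/after form; the base address, element size, and loop bound are illustrative assumptions.

```python
# Induction-variable strength reduction: replace a per-iteration multiply by an
# addition to the variable's previous value.
base, n = 1000, 8

before = []
for i in range(n):
    addr = base + 4 * i          # address recomputed with a multiply each time
    before.append(addr)

after = []
addr = base
for i in range(n):
    after.append(addr)
    addr += 4                    # multiply replaced by a cheaper addition

assert before == after           # both loops produce the same address sequence
```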
(3) Control-Flow Optimizations These are other global optimizations dealing with control structure but not
directly with loops. A good example is code hoisting, which eliminates copies of identical code on parallel
paths in a flow graph. This can save space significantly, but would have no impact on execution time.
Interprocedural global optimizations are much more difficult to perform due to sensitivity and global
dependence relationships. Sometimes, procedure integration can be performed to replace a call to a procedure
by an instantiation of the procedure body. This may also trigger the discovery of other optimization
opportunities. Interprocedural dependence analysis must be performed in order to reveal these opportunities.
Machine-Dependent Optimizations With a finite number of registers, memory cells, and functional
units in a machine, the efficient allocation of machine resources affects both space and time optimization
of programs. For example, strength reduction replaces complex operations by cheaper operations, such as
replacing 2n by n + n, n^2 by n * n, and length(S1 + S2) by length(S1) + length(S2). We will address register
and memory allocation problems in code generation methods in the next section.
Other flow control optimizations can be conducted at the machine level. Good examples include the
elimination of unnecessary branches, the replacement of instruction sequences by simpler equivalent code
sequences, instruction reordering, and multiple instruction issues in superscalar processors. Machine-level
parallelism is usually exploited at the fine-grain instruction level. System-level parallelism is usually
exploited in coarse-grain computations.
In order to enable pipelined execution by vector hardware, we need to introduce a temporary array
TEMP(1:N) to produce the following vector code:
TEMP(1:N) = A(2:N+1)
A(1:N) = B(1:N) + C(1:N)
B(1:N) = 2 * TEMP(1:N)
Without the TEMP array, the second array assignment B(1:N) may use the modified A(1:N) array, which
was not intended in the original code.
(2) Loop Interchanging Vectorization is often performed in the inner loop rather than in the outer loop.
Sometimes we interchange the loops to enable the vectorization. The general rules for loop interchanges are
to make the most profitable vectorizable loop the innermost loop, to make the most profitable parallelizable
loop the outermost loop, to enable memory accesses to consecutive elements in arrays, and to bring a loop
with longer vector length (iteration count) to the innermost loop for vectorization. The profitability is defined
by improvement in execution time. Consider the following loop nest with two levels:
Do 20 I = 2, N
  Do 10 J = 2, N
S1:     A(I,J) = (A(I,J-1) + A(I,J+1)) / 2                (10.6)
10    Continue
20 Continue
The statement S1 is both flow-dependent and antidependent on itself with the direction vectors (=, <) and
(=, >), or more precisely the distance vectors (0, 1) and (0, -1), respectively. This implies that the loop cannot
be vectorized along the J-dimension. Therefore, we have to interchange the two loops:
Do 20 J = 2, N
  Do 20 I = 2, N
    A(I,J) = (A(I,J-1) + A(I,J+1)) / 2
20 Continue
Now, the inner loop (I-dimension) can be vectorized with the zero distance vector. The vectorized code is
Do 20 J = 2, N
  A(2:N, J) = (A(2:N, J-1) + A(2:N, J+1)) / 2
20 Continue
In general, an innermost loop cannot be vectorized if a forward dependence direction (<) and a backward
dependence direction (>) coexist. The (=) direction does not prevent vectorization.
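The rule just stated can be encoded directly. In the sketch below, direction vectors are represented as tuples of '<', '=', '>' characters per loop level (a representation chosen for illustration).

```python
def innermost_vectorizable(direction_vectors):
    # The innermost loop cannot be vectorized if '<' and '>' coexist there.
    innermost = {dv[-1] for dv in direction_vectors}
    return not ('<' in innermost and '>' in innermost)

# Eq. 10.6 before interchange: directions (=, <) and (=, >) -> not vectorizable
print(innermost_vectorizable([('=', '<'), ('=', '>')]))   # False
# After interchange the inner-loop directions become '=' only -> vectorizable
print(innermost_vectorizable([('<', '='), ('>', '=')]))   # True
```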
(3) Loop Distribution Nested loops can be vectorized by distributing the outermost loop and vectorizing each
of the resulting loops or loop nests.
Do 10 I = 1, N
  B(I, 1) = 0
  Do 20 J = 1, M
    A(I) = A(I) + B(I, J) * C(I, J)
20 Continue
  D(I) = E(I) + A(I)
10 Continue
The I-loop is distributed into three copies, separated by the nested J-loop from the assignments to arrays B and
D, and vectorized as follows:
B(1:N, 1) = 0        (a zero vector)
Do 30 I = 1, N
  A(I) = A(I) + B(I, 1:M) * C(I, 1:M)
30 Continue
D(1:N) = E(1:N) + A(1:N)
(4) Vector Reduction We have defined vector reduction instructions in Eqs. 3.6 and 3.7. A number of
reductions are defined in Fortran 90. In general, a vector reduction produces a scalar value from one or two
data arrays. Examples include the sum, product, maximum, and minimum of all the elements in a single array.
The dot product produces a scalar S = sum of A(i) * B(i) over i = 1 to n from two arrays A(1:n) and B(1:n). The loop
Do 40 I = 1, N
S1:   A(I) = B(I) + C(I)
S2:   S = S + A(I)
S3:   AMAX = MAX(AMAX, A(I))
40 Continue
has the following dependence relations: S1(=)S2, S1(=)S3, S2(<)S2, S3(<)S3. Although statements S2 and S3
each form a dependence cycle, each statement is recognized as a reduction operation and can be vectorized
as follows:
S1: A(1:N) = B(1:N) + C(1:N)
S2: S = S + SUM(A(1:N))
S3: AMAX = MAX(AMAX, MAXVAL(A(1:N)))
where SUM, MAX, and MAXVAL are all vector operations.
(5) Node Splitting The data dependence cycle can sometimes be broken by node splitting. Consider the
following loop:
Do 50 I = 2, N
S1:   T(I) = A(I-1) + A(I+1)
S2:   A(I) = B(I) + C(I)
50 Continue
Now we have the dependence cycle S1(<)S2 and S1(>)S2, which seems to prevent vectorization. However,
we can split statement S1 into two parts and apply statement reordering:
Do 50 I = 2, N
S1a:   X(I) = A(I+1)
S2:    A(I) = B(I) + C(I)
S1b:   T(I) = A(I-1) + X(I)
50 Continue
The new loop structure has no dependence cycle and thus can be vectorized:
S1a: X(2:N) = A(3:N+1)
S2:  A(2:N) = B(2:N) + C(2:N)
S1b: T(2:N) = A(1:N-1) + X(2:N)
It should be noted that node splitting cannot resolve all dependence cycles.
(6) Other Vector Optimizations There are many other methods for vectorization, and we do not intend to
discuss them all. For example, scalar variables in a loop can sometimes be expanded into dimensioned arrays
to enable vectorization. Subexpressions in a complex expression can be vectorized separately. Loops can be
peeled, unrolled, rerolled, or tiled (blocked) for vectorization.
Some machine-dependent optimizations can also be performed, such as strip mining (loop sectioning)
and pipeline chaining, introduced in Chapter 8. Sometimes a vector register can be used as an accumulator,
making it possible for the compiler to move loads and stores of the register outside the vector loop. The
movement of an operation out of a loop to a basic block preceding the loop is called hoisting, and the
inverse is called sinking. Vector loads and stores can be hoisted or sunk only when the array reference and
assignment have the same subscripts and all the subscripts are the induction variable of a vectorized loop or
loop constants.
Vectorization Inhibitors Listed below are some conditions inhibiting or preventing vectorization:
(1) Computed conditional statements such as IF statements which depend on runtime conditions.
(2) Multiple loop entries or exits (not basic blocks).
(3) Function or subroutine calls.
(4) Input/output statements.
(5) Recurrences and their variations.
A recurrence exists when a value calculated in one iteration of a loop might be referenced in another
iteration. This occurs when dependence cycles exist between loop iterations. In other words, there must be at
least one loop-carried dependence for a recurrence to exist. Any number of loop-independent dependences can
occur in a loop, but a recurrence does not exist unless that loop contains at least one loop-carried dependence.
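A minimal example of such a recurrence is a prefix-sum style loop, shown below for illustration (the array contents are arbitrary).

```python
# Loop-carried flow dependence: the value computed in iteration i-1 is
# referenced in iteration i, so the iterations cannot all run independently.
a = [1.0] * 8
for i in range(1, len(a)):
    a[i] = a[i] + a[i - 1]     # a(i) depends on a(i-1) from the previous iteration
print(a)
```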
Code Parallelization Parallel code optimization spreads a single program into many threads for parallel
execution by multiple processors. The purpose is to reduce the total execution time. Each thread is a sequence
of instructions that must execute on a single processor. The most often parallelized code structure is performed
over the outermost loop if dependence can be properly controlled and synchronized.
Consider the two-deep loop nest in Eq. 10.6. Because all the dependence relations have a "=" direction in
the I-loop, this outer loop can be parallelized with no need for synchronization between the loop iterations.
The parallelized code is as follows:
Doall I = 2, N
  Do J = 2, N
S1:     A(I,J) = (A(I,J-1) + A(I,J+1)) / 2
  Enddo
Endall
Each of the N - 1 iterations in the outer loop can be scheduled for a single processor to execute. Each
newly created thread consists of one entire J-loop with a constant index value for I. If dependence does exist
between the iterations, the Doacross construct can be used with proper synchronization among the iterations.
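The sketch below illustrates the Doacross idea in ordinary Python threads rather than in FX/Fortran or any particular runtime: each iteration does its independent work immediately, then waits only for the value produced by the previous iteration. Using one thread per iteration is an assumption made purely for clarity; a real system would map iterations onto a fixed set of processors.

```python
import threading

N = 8
a = [float(i) for i in range(N + 1)]
done = [threading.Event() for _ in range(N + 1)]
done[0].set()                          # iteration 1 has no predecessor to wait on

def iteration(i):
    local = 2.0 * i                    # work with no loop-carried dependence
    done[i - 1].wait()                 # synchronize on the previous iteration
    a[i] = a[i - 1] + local            # loop-carried (flow) dependence
    done[i].set()                      # signal the next iteration

threads = [threading.Thread(target=iteration, args=(i,)) for i in range(1, N + 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(a)                               # deterministic despite parallel execution
```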
The following example shows five execution modes of a serial loop with various combinations of
parallelization and vectorization of the same program.
Example 10.8 Five execution modes of a FX/Fortran loop on the Alliant FX/80 multiprocessor (Alliant
Computer Systems Corporation, 1989)
FX/Fortran generates code to execute a simple Do loop in scalar, vector, scalar-concurrent, vector-
concurrent, and concurrent-outer/vector-inner (COVI) modes. The computations involved are performed
either over a one-dimensional data array A(1:2048) or over a two-dimensional data array B(1:8, 1:256),
where A(K) = B(I, J) for K = 8(J - 1) + I.
By using array A, the computations involved are expressed by a pure scalar loop:
Do K = 1, 2048
  A(K) = A(K) + S
Enddo
where S is a scalar constant. Figure 10.8a shows the scalar (serial) execution on a single processor in 30,616
clock cycles.
The same code can be vectorized into eight vector instructions and executed serially on a single processor
equipped with vector hardware. Each vector instruction works on 256 iterations of the following loop:
A(1:2048:256) = A(1:2048:256) + S
The total execution time is reduced to 6043 clock cycles, as shown in Fig. 10.8b.
The scalar-concurrent mode is shown in Fig. 10.8c. Eight processors are used in parallel, performing the
following scalar computations:
Doall I = 1, 8
  Do J = 1, 256
    B(I, J) = B(I, J) + S
  Enddo
Endall
Now, the total execution time is further reduced to 3992 clock cycles.
Figure 10.8d shows the vector-concurrent mode on eight processors, all equipped with vector hardware.
The following vector code is executed in parallel:
Doall K = 1, 8
  A(K:2040+K:8) = A(K:2040+K:8) + S
Endall
This vectorized execution by eight processors results in a total time of 960 clock cycles.
Finally, the same program can be executed in COVI mode. The inner loop executes in vector mode, and
the outer loop is executed in parallel mode. In Fortran 90 notation, we have:
B(1:8, 1:256) = B(1:8, 1:256) + S
The total execution time is now reduced to 756 clock cycles, the shortest among the five execution modes.
(e) COVI execution on eight processors in 756 cycles
Fig. 10.8 Five execution modes of a FX/Fortran loop on the Alliant FX/80 multiprocessor (Courtesy of Alliant
Computer Systems Corporation, 1989)
Inhibitors of Parallelization Most inhibitors of vectorization also prevent parallelization. Listed below
are some inhibitors of parallelization:
(1) Multiple entries or exits.
(2) Function or subroutine calls.
Example 10.9 Construction of a DAG for the inner loop kernel of the bubble sort program (S. Graham,
J. L. Hennessy, and J. D. Ullman, 1992)
Listed below are the statements contained in the basic block B7 of the bubble sort program in Fig. 10.7.
t8  := j - 1
t9  := 4 * t8        }  temp := A[j]
temp := A[t9]
t10 := j + 1
t11 := t10 - 1
t12 := 4 * t11       }  A[j] := A[j + 1]
t13 := A[t12]
t14 := j - 1
t15 := 4 * t14
A[t15] := t13
t16 := j + 1
t17 := t16 - 1       }  A[j + 1] := temp
t18 := 4 * t17
A[t18] := temp
The corresponding DAG representation of block B7 is shown in Fig. 10.9. For nodes with the same
operator, one or more names are labeled provided they consume the same operands (although they may use
different values at different times). The initial value of any variable x is denoted by x0, such as the A0, j0, and
temp0 at the leaf nodes.
Fig. 10.9 Directed acyclic graph representation of the basic block B7, the inner loop of the bubble sort
program in Examples 10.6 and 10.7 (Courtesy of S. Graham, J. L. Hennessy, and J. D. Ullman,
Course on Code Optimization and Code Generation, Western Institute of Computer Science, Stanford
University, 1991)
In order to construct a DAG systematically, an auxiliary table can be used to keep track of variables and
temporaries. The DAG construction process automatically detects common subexpressions and eliminates
them accordingly. Copy propagation can be used to compute only one of the variables in each class. The
construction process can easily discover the variables used or assigned and the nodes whose values can be
computed at compile time. Any node whose children are constants is itself a constant. One can also label the
edges in a DAG with the delays.
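The value-numbering sketch below illustrates how such a construction detects common subexpressions automatically: identical (operator, operand-node) triples are shared. The quadruple representation and helper names are assumptions for this sketch, not the book's notation, and array assignments are not handled.

```python
def build_dag(block):
    """block: list of (dest, op, arg1, arg2) quadruples."""
    nodes = {}       # (op, id1, id2) -> node id
    current = {}     # name (variable, temporary, or constant) -> node id
    node_list = []   # node id -> description, kept for inspection

    def node_of(name):
        if name not in current:                     # first use: create a leaf
            current[name] = len(node_list)
            node_list.append(('leaf', name))        # e.g. j0, 4, temp0
        return current[name]

    for dest, op, a1, a2 in block:
        key = (op, node_of(a1), node_of(a2))
        if key not in nodes:                        # new interior node
            nodes[key] = len(node_list)
            node_list.append(key)
        current[dest] = nodes[key]                  # dest now labels this node
    return current, node_list

# t9 and t15 of block B7 both compute 4 * (j - 1) and map to the same node:
b7 = [('t8', '-', 'j', '1'), ('t9', '*', '4', 't8'),
      ('t14', '-', 'j', '1'), ('t15', '*', '4', 't14')]
cur, _ = build_dag(b7)
print(cur['t9'] == cur['t15'])   # True
```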
List Scheduling A DAG represents the flow of instructions in a basic block. A topological sort can be used
to schedule the operations. Let READY be a buffer holding all nodes which are ready to execute. Initially,
the READY buffer holds all leaf nodes with zero predecessors. Schedule each node in READY as early as
possible, until it becomes empty. After all the predecessor (children) nodes are scheduled, the successor
(parent) node should be immediately inserted into the READY buffer.
With list scheduling, each interior node is scheduled after its children. Additional ordering constraints are
needed for a procedure call or assignment through a pointer. When the root nodes are reached, the schedule
is produced. The length of the schedule equals the critical path on the DAG. To determine the critical path,
both edge delays and nodal delays must be counted.
Some priority scheme can be used in selecting instructions from the READY buffer for scheduling.
For example, the seven interior nodes of the DAG in Fig. 10.9 can be scheduled as follows, based on the
topological order. In the case of two temporaries using the same node, we select the lower-numbered one.
The following sequential code results:
t12 := 4 * j
t8  := j - 1
t13 := A[t12]
t9  := 4 * t8
temp := A[t9]
A[t9]  := t13
A[t12] := temp
List scheduling schedules operations in topological order. There are no backtracks in the schedule. It is
considered the best among critical-path and branch-and-bound methods for microinstruction scheduling (Joseph Fisher,
1979). Variations of topological list scheduling do exist, such as introducing a priority function for ready
nodes, using top-down versus bottom-up direction, and using cycle scheduling as explained below. Whenever
possible, parallel scheduling of nodes should be exploited, of course, subject to data, control, and resource
dependence constraints.
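A minimal list-scheduling sketch is given below. It assumes unit delays and unlimited functional units, and it takes the node priorities as an input; these simplifications are mine and not part of the algorithm as described above, which also accounts for edge and nodal delays.

```python
import heapq

def list_schedule(preds, succs, priority):
    """preds/succs: dicts mapping node -> set of predecessor/successor nodes."""
    remaining = {n: len(preds[n]) for n in preds}
    ready = [(-priority[n], n) for n in preds if remaining[n] == 0]   # the leaves
    heapq.heapify(ready)
    schedule = []
    while ready:
        _, n = heapq.heappop(ready)      # highest-priority ready node first
        schedule.append(n)
        for s in succs[n]:               # a parent enters READY as soon as all
            remaining[s] -= 1            # of its children have been scheduled
            if remaining[s] == 0:
                heapq.heappush(ready, (-priority[s], s))
    return schedule

# tiny example: node c depends on a and b
preds = {'a': set(), 'b': set(), 'c': {'a', 'b'}}
succs = {'a': {'c'}, 'b': {'c'}, 'c': set()}
print(list_schedule(preds, succs, {'a': 2, 'b': 1, 'c': 0}))   # ['a', 'b', 'c']
```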
Cycle Scheduling List scheduling is operation-based, which has the advantage that the highest-priority
operation is scheduled first. Another scheduling method for instructions in basic blocks is based on a cycle
scheduling concept in which "cycles" rather than "operations" are scheduled in order. Let READY be a
buffer holding nodes with zero unscheduled predecessors ready to execute in the current cycle. Let LEADER be a
buffer holding nodes with zero unscheduled predecessors but not ready in the current cycle (e.g. due to some
latency unfulfilled). The following cycle scheduling algorithm is modified from the list scheduling algorithm:
Current-cycle = 0
Loop until READY and LEADER are empty
    For each node n in READY (in decreasing priority order)
        Try to schedule n in the current cycle
        If successful, update READY and LEADER
    Increment Current-cycle by 1
End of loop
The advantages of cycle scheduling include simplicity of implementation for single-cycle resources, such
as in a superscalar processor. There is no need to keep records of resource usage, and it is also easier to keep
track of register lifetimes. It can be considered an improvement over the list scheduling scheme, which
may result in more idle cycles. LEADER provides another level of buffering. Nodes in LEADER that have
become ready should be immediately loaded into the READY queue.
Register Allocation Traditional instruction scheduling methods minimize the number of registers used,
which also reduces the degree of parallelism exploited. To optimize the code generated from a DAG, one can
convert it to a sequence of expression trees and then study optimization for the trees. The registers can be
allocated with instructions in the scheduling scheme.
In general, more registers would allow more parallelism. The above bottom-up scheduling methods
shorten register lifetimes for expression trees. A round-robin scheme can be used to allocate registers while
the schedule is being generated. Or one can assume an infinite number of registers to produce a schedule first
and then allocate registers and add spill code later. Another approach is to integrate register allocation with
scheduling by keeping track of the liveness of registers. When the remaining registers are greater than a given
threshold, one should maximize parallelism. Otherwise, one should reduce the number of registers allocated.
Register allocation can be optimized by register descriptors (or tags) to distinguish among constant,
variable, indexed variable, frame pointer, etc. Such tagged registers may enable some additional local or global
code optimizations. Another advanced feature is special branch handling, such as delayed branches or using
shorter delay slots if possible.
Code generation can be improved with a better instruction selection scheme. We can first generate code for
expression trees needing no register spills. One can also select instructions by recursively matching templates
to parts of expression trees. A match causes code to be generated and the subtree to be rewritten with a subtree
for the result. The process ends when the subtree is reduced to a single node. When template matching fails,
heuristics can be used to generate subgoals for another matching. The key ideas of instruction selection by
pattern matching include:
(1) Convert code generation to primarily a systematic process.
(2) Use tree-structured patterns describing instructions and use a tree-structured intermediate form.
(3) Select instructions by covering the input with instruction patterns.
Major issues in pattern-based instruction selection include the development of pattern matching algorithms,
the design of intermediate forms and target machine descriptions, and the interaction with low-level optimization.
Extensive table descriptions are needed. Therefore table compression techniques are needed to handle the large
numbers of patterns to be matched.
Advanced code generation needs to be directed toward the exploitation of parallelism. Therefore special
compilation support is needed for superscalar and multithreaded processors. These are still open research
problems. Partial solutions to these problems can be found in subsequent sections. There is still a long way
to go in developing really "intelligent" compilers for parallel computers.
Fig. 10.10 Code compaction for VLIW processor based on trace scheduling developed by Joseph Fisher
(1981)
Code Compaction Independent instructions in the trace are compacted into VLIW instructions. Each
VLIW word can be packed with multiple short instructions which can be independently executed in parallel.
The decoding of these independent instructions is carried out simultaneously. Multiple function units (such
as the memory-access unit, arithmetic unit, branch unit, etc.) are employed to carry out the parallel execution.
Only independent operations are packed into a VLIW instruction.
Branch prediction is based on software heuristics or on using profiles of previous program executions.
Each trace should correspond to the most likely execution path. The first trace should be the most likely one,
the second trace should be the second most likely one, and so on. In fact, reduction in the execution time of
an earlier trace is obtained at the expense of that of later traces. In other words, the execution time of likely
traces is reduced at the expense of that of unlikely traces.
Compensation Code The effectiveness of trace scheduling depends on correct predictions at successive
branches in a program. To cope with the problem of code movement followed by incorrect prediction,
compensation codes are added to off-trace paths to provide correct linkage with the rest of the program. Because
code compaction may move short instructions up or down in the program, different compensation codes must
be inserted to restore the original code distribution in the code blocks involved.
Compensation codes (in shaded boxes in Fig. 10.11) are needed on off-trace paths. This will make
the second trace execute correctly, without being affected by the first trace due to code movement. The
compensation code added should be a small portion of the code blocks. Sometimes the Undo operation
cannot be used due to the lack of an inverse operation in the instruction set. By adding compensation code,
software is performing the function of the branch history buffer described in Chapter 6.
The efficiency of trace scheduling depends on the degree of correct prediction of branches. With accurate
branch predictions, the compensation code added may not be executed at all. The fewer the number of most
likely traces to be executed, the better the performance will be.
Trace scheduling was mainly designed for VLIW processors. For superscalar processors, similar techniques
can exploit parallelism in a program whose branch behavior is relatively easier to predict, such as in some
scientific applications.
(Fig. 10.11, referenced above: panel (a) shows the first trace on the flow graph; compensation code appears
in shaded boxes on the off-trace paths.)
10.5.1 Loop Transformation Theory
Parallelizing loop nests is one of the most fundamental program optimization techniques demanded in
a vectorizing and parallelizing compiler. In this section, we study a loop transformation theory and loop
transformation algorithms derived from this theory. The bulk of the material is based on the work by Wolf
and Lam (1991).
The theory unifies all combinations of loop interchange, skewing, reversal, and tiling as unimodular
transformations. The goal is to maximize the degree of parallelism or data locality in a loop nest. These
transformations also support efficient use of the memory hierarchy on a parallel machine.
Elementary Transformations A loop transformation rearranges the execution order of the iterations in a
loop nest. Three elementary loop transformations are introduced below.
(1) Permutation: A permutation d on a loop nest transforms iteration (p1, ..., pn) to (p_d1, ..., p_dn). This
transformation can be expressed in matrix form as I_d, the n x n identity matrix with rows permuted by
d. For n = 2, a loop interchange transformation maps iteration (i, j) to iteration (j, i). In matrix notation,
we can write this as

    | 0  1 | | i |   | j |
    | 1  0 | | j | = | i |
The following two-deep loop nest is transformed using the permutation matrix [0 1; 1 0], as
depicted in Fig. 10.12a.

Do i = 1, N                           Do j = 1, N
  Do j = 1, N                           Do i = 1, N
    A(j) = A(j) + C(i,j)       =>         A(j) = A(j) + C(i,j)
  Enddo                                 Enddo
Enddo                                 Enddo
(2) Reversal: Reversal of the ith loop is represented by the identity matrix with the ith element on the
diagonal equal to -1. The following loop nest is reversed using the transformation matrix [1 0; 0 -1],
as depicted in Fig. 10.12b.

Do i = 1, N                           Do i = 1, N
  Do j = 1, N                           Do j = -N, -1
    A(i,j) = A(i-1, j+1)       =>         A(i,-j) = A(i-1, -j+1)
  Enddo                                 Enddo
Enddo                                 Enddo
(3) Skewing: Skewing loop Ij by an integer factor f with respect to loop Ii maps iteration (p1, ..., p_i-1, p_i,
p_i+1, ..., p_j-1, p_j, p_j+1, ..., p_n) to (p1, ..., p_i-1, p_i, p_i+1, ..., p_j-1, p_j + f*p_i, p_j+1, ..., p_n).
In the following loop nest, the transformation performed is a skew of the inner loop with respect to the outer loop.
(Fig. 10.12, panel (a): Loop permutation (interchange of the i and j loops).)
Fig. 10.12 Loop transformations performed in Example 10.10 (Courtesy of Monica Lam, WTSC Tutorial Notes,
Stanford University, 1992)
Various combinations of the above elementary loop transformations can be defined. Wolf and Lam
have called these unimodular transformations. The optimization problem is thus to find the unimodular
transformation that maximizes an objective function given a set of schedule constraints.
Transformation Matrices Unimodular transformations are defined by unimodular matrices. A unimodular
matrix has three important properties. First, it is square, meaning that it maps an n-dimensional iteration
space into an n-dimensional iteration space. Second, it has all integer components, so it maps integer vectors
to integer vectors. Third, the absolute value of its determinant is 1.
Because of these properties, the product of two unimodular matrices is unimodular, and the inverse of a
unimodular matrix is unimodular, so that combinations of unimodular loop transformations and the inverses
of unimodular loop transformations are also unimodular loop transformations. A loop transformation is said
to be legal if the transformed dependence vectors are all lexicographically positive.
A compound loop transformation can be synthesized from a sequence of primitive transformations, and
the effect of the loop transformation is represented by the product of the various transformation matrices for
each primitive transformation. The major issues for loop transformation include how to apply a transform,
the correctness or legality of applying it, and the desirability or advantages of applying it. Wolf and Lam (1991) have
stated the following conditions for unimodular transformations:
(1) Let D be the set of distance vectors of a loop nest. A unimodular transformation (matrix) T is legal if
and only if, for every d in D, the transformed vector is lexicographically positive:
        T d > 0 (lexicographically)                                  (10.7)
(2) Loops i through j of a nested computation with dependence vectors D are fully permutable if, for every
d in D, either (d1, ..., d_i-1) is lexicographically positive or all of d_i, ..., d_j are nonnegative.
Proofs of these two conditions are left as exercises for the reader. The following example shows how to
determine the legality of unimodular transformations.
Do i = 1, N
  Do j = 1, N
    A(i,j) = f(A(i,j), A(i+1, j-1))
  Enddo
Enddo
This code has the dependence vector d = (1, -1). The loop interchange transformation is represented by
the matrix

    T = | 0  1 |
        | 1  0 |

Applying T to d gives

    T d = | 0  1 | |  1 |   | -1 |
          | 1  0 | | -1 | = |  1 |

which is not lexicographically positive. Hence the interchange is not a legal transformation for this loop nest.
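Condition (10.7) is easy to check mechanically. The sketch below (helper names are illustrative) applies a candidate unimodular matrix to every distance vector and tests lexicographic positivity; it reproduces the interchange result above and, for comparison, checks a reversal of the j loop under an assumed set of distance vectors.

```python
def lex_positive(v):
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False                  # the zero vector is not positive

def matvec(T, d):
    return [sum(T[i][j] * d[j] for j in range(len(d))) for i in range(len(T))]

def legal(T, D):
    # legal iff every transformed distance vector is lexicographically positive
    return all(lex_positive(matvec(T, d)) for d in D)

interchange = [[0, 1], [1, 0]]
print(legal(interchange, [(1, -1)]))   # False: T*d = (-1, 1), as shown above
reverse_j = [[1, 0], [0, -1]]
print(legal(reverse_j, [(1, -1)]))     # True: T*d = (1, 1)
```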
(1) Canonical form. Loops with distance vectors have the special property that they can always be
transformed into a fully permutable nest via skewing. It is easy to determine how much to skew an
inner loop with respect to an outer one to make these loops fully permutable. For example, if a doubly
nested loop has dependences {(0, 1), (1, -2), (1, -1)}, then skewing the inner loop by a factor of 2 with
respect to the outer loop produces {(0, 1), (1, 0), (1, 1)}.
(2) Parallelization process. Iterations of a loop can execute in parallel if and only if no dependences are
carried by that loop. Such a loop is called a Doall loop. To maximize the degree of parallelism is to
transform the loop nest to maximize the number of Doall loops.
Let (I1, ..., In) be a loop nest with lexicographically positive dependences d in D. Ii is parallelizable if and
only if, for every d in D, either (d1, ..., d_i-1) > (0, ..., 0), the zero vector, or d_i = 0. Once the loops are made fully permutable,
the steps to generate Doall parallelism are simple. In the following discussion, we show that loops in
canonical form can be trivially transformed to produce both fine- and coarse-grain parallelism.
Fine-Grain Wavefronting A nest of n fully permutable loops can be transformed into code containing at
least (n - 1) degrees of parallelism. In the degenerate case where no dependences are carried by these n loops,
the degree of parallelism is n. Otherwise, (n - 1) parallel loops can be obtained by skewing the innermost
loop in the fully permutable nest by each of the other loops and moving the innermost loop to the outermost
position.
This transformation, called a wavefront transformation, is represented by the following matrix:

        | 1  1  ...  1  1 |
        | 1  0  ...  0  0 |
    T = | 0  1  ...  0  0 |                                  (10.9)
        | .  .  ...  .  . |
        | 0  0  ...  1  0 |
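Construction of the matrix in Eq. (10.9) is mechanical, as the short sketch below shows (the helper name is illustrative): the first row is all ones, and the remaining rows are the first n - 1 rows of the identity matrix.

```python
def wavefront_matrix(n):
    T = [[1] * n]                 # first row: all ones
    for i in range(n - 1):        # remaining rows: first n-1 rows of the identity
        row = [0] * n
        row[i] = 1
        T.append(row)
    return T

for row in wavefront_matrix(4):
    print(row)
# [1, 1, 1, 1]
# [1, 0, 0, 0]
# [0, 1, 0, 0]
# [0, 0, 1, 0]
```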
Fine-grain parallelism is exploited on vector machines, superscalar processors, and systolic arrays. The
following example shows the entire process of loop parallelization exploiting fine-grain parallelism. The
process includes skewing and wavefront transformation.
Example 10.11 Loop skewing and wavefront transformation (Michael Wolf and Monica Lam, 1991)
Figure 10.13a shows the iteration space and dependences of a source loop nest. The skewed loop nest is shown
in Fig. 10.13b after applying the matrix T = [1 0; 1 1]. Figure 10.13c shows the result of applying the wavefront
transformation to the skewed loop code, which is the result of first skewing the innermost loop to make the
two-dimensional loop nest fully permutable and then applying the wavefront transformation to create one
degree of parallelism.
(Fig. 10.13, panels: (a) extract dependence information from the source loop nest, D = {(0,1), (1,0), (1,-1)};
(b) skew to make the inner loop nest fully permutable, D' = TD = {(0,1), (1,1), (1,0)}; (c) wavefront
transformation on the skewed loop nest, D' = TD = {(1,0), (2,1), (1,1)}.)
Fig. 10.13 Fine-grain parallelization by loop skewing and wavefront transformation in Example 10.11
(Courtesy of Wolf and Lam; reprinted from IEEE Trans. Parallel Distributed Systems, 1991)
There are no dependences between iterations within the innermost loop. The transform is a wavefront
transformation because it causes iterations along the diagonal of the original loop nest to execute in parallel.
This wavefront transformation automatically places the maximum number of Doall loops in the innermost loops,
maximizing fine-grain parallelism. This is the appropriate transformation for superscalar or VLIW machines.
Although these machines have a low degree of parallelism, finding multiple parallelizable loops is still useful.
Coalescing multiple Doall loops prevents the pitfall of parallelizing only a loop with a small iteration count.
Coarse-Grain Parallelism For MIMD coarse-grain multiprocessors, having as many outermost Doall
statements as possible reduces the synchronization overhead. A wavefront transformation produces the
maximum degree of parallelism but makes the outermost loop sequential if any loop must be sequential. For example, consider
the following loop nest:
Do i = 1, N
  Do j = 1, N
    A(i,j) = f(A(i-1, j-1))
  Enddo
Enddo
This loop nest has the dependence (1, 1), and so the outermost loop is sequential and the innermost
loop is a Doall. The wavefront transformation does not change this. In contrast, the unimodular
transformation

    | 1  -1 |
    | 0   1 |

transforms the dependence to (0, 1), making the outer loop a Doall and the inner
loop sequential.
In this example, the dimensionality of the iteration space is two, but the dimensionality of the space
spanned by the dependence vectors is only one. When the dependence vectors do not span the entire iteration
space, it is possible to perform a transformation that makes the outermost loops Doall loops.
A heuristic though nonoptimal approach for making loops Doall is simply to identify loops Ii such that all
d_i are zero. Those loops can be made outermost Doall loops. The remaining loops in the tile can be wavefronted to
obtain the remaining parallelism.
Loop parallelization can be achieved through unimodular transformations as well as tiling. For loops with
distance vectors, n-deep loops have at least (n - 1) degrees of parallelism. The loop parallelization algorithm
has a common step for fine- and coarse-grain parallelism in creating an n-deep fully permutable loop nest by
skewing. The algorithm can be tailored for different machines based on the following guidelines:
• Move the Doall loop innermost (if one exists) for fine-grain machines. Apply a wavefront transformation
to create up to (n - 1) Doall loops.
• Create outermost Doall loops for coarse-grain machines. Apply tiling to a fully permutable loop nest.
• Use tiling to create loops for both fine- and coarse-grain machines.
Fig. 10.14 Tiling of the skewed loops for parallel execution on a coarse-grain multiprocessor (Courtesy of
Wolf and Lam; reprinted from IEEE Trans. Parallel Distributed Systems, 1991)
Tiling can therefore increase the granularity of synchronization and data are often reused within a tile.
Without tiling, when a Doall loop is nested within a non-Doall loop, all processors must be synchronized with
a barrier at the end of each Doall loop.
Using tiling, we can reduce the synchronization cost in the following two ways. First, instead of applying
the wavefront transformation to the loops in canonical form, we first tile the loops and then apply a wavefront
transformation to the controlling loops of the tiles. In this way, the synchronization cost is reduced by the size
of the tile. Certain loops cannot be represented with distances. Direction vectors can be used to represent these
loops. The idea is to represent direction vectors as an infinite set of distance vectors.
Locality optimization in user programs is meant to reduce memory-access penalties. Software pipelining
can reduce the execution time. Both are desired improvements in the performance of parallel computers.
Program locality can be enhanced with loop interchange, reversal, tiling, and prefetching techniques. The
effort requires a reuse analysis of a "localized" iteration space. Software pipelining relies heavily on sufficient
support from a compiler working effectively with the scheduler.
The fetch of successive elements of a data array is pipelined for an interleaved memory. In order to reduce
the access latency, the loop nest can interchange its indices so that a long vector is moved into the innermost
loop, which can be more effective with pipelined loads. Loop transformations are performed to reuse the data
as soon as possible or to improve the effectiveness of data caches.
Prefetching is often practiced to hide the latency of memory operations. This includes the insertion of
special instructions to prefetch data from memory to cache. This will enhance the cache hit ratio and reduce
the register occupation rate with a small prefetch overhead. In scientific codes, data prefetching is rather
important. Instruction prefetching, as studied in Chapter 6, is often practiced in modern processors using
prefetch buffers and an instruction cache. In the following discussion, we concentrate on data prefetching
techniques.
Tiling for Locality Blocking or tiling is a well-known technique that improves the data locality of numerical
algorithms. Tiling can be used for different levels of the memory hierarchy such as physical memory, caches,
and registers; multilevel tiling can be used to achieve locality at multiple levels of the memory hierarchy
simultaneously.
To illustrate the importance of tiling, consider the example of matrix multiplication:
Do i = 1, N
  Do j = 1, N
    Do k = 1, N
      C(i, k) = C(i, k) + A(i, j) * B(j, k)
    Enddo
  Enddo
Enddo
In this code, although the same rows of C and B are reused in the next iteration of the middle and outer
loops, respectively, the large volume of data used in the intervening iterations may replace the data from the
register file or the cache before they can be reused. Tiling reorders the execution sequence such that iterations
from loops of the outer dimensions are executed before all the iterations of the inner loop are completed. The
tiled matrix multiplication is:
Do l = 1, N, s
  Do m = 1, N, s
    Do i = 1, N
      Do j = l, min(l + s - 1, N)
        Do k = m, min(m + s - 1, N)
          C(i, k) = C(i, k) + A(i, j) * B(j, k)
        Enddo
      Enddo
    Enddo
  Enddo
Enddo
Tiling reduces the number of intervening iterations and thus data fetched between data reuses. This allows
reused data to still be in the cache or register file and hence reduces memory accesses. The tile size s can be
chosen to allow the maximum reuse for a specific level of the memory hierarchy. For example, the tile size is
relevant to the cache size used or the register file size used.
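The sketch below renders the tiled loop nest above in Python so that it can be checked against the untiled triple loop; the tile size s is a tunable assumption that would normally be derived from the cache or register-file capacity.

```python
def tiled_matmul(A, B, n, s):
    C = [[0.0] * n for _ in range(n)]
    for l in range(0, n, s):                    # tile the j dimension
        for m in range(0, n, s):                # tile the k dimension
            for i in range(n):
                for j in range(l, min(l + s, n)):
                    for k in range(m, min(m + s, n)):
                        C[i][k] += A[i][j] * B[j][k]
    return C

# quick check against the untiled triple loop
n, s = 6, 4
A = [[i + j for j in range(n)] for i in range(n)]
B = [[i * j + 1 for j in range(n)] for i in range(n)]
ref = [[sum(A[i][j] * B[j][k] for j in range(n)) for k in range(n)] for i in range(n)]
assert tiled_matmul(A, B, n, s) == ref
```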
(Plot: Mflops versus number of processors (1 to 8) for four cases: no tiling, register tiling, cache tiling, and
best tiling.)
Fig. 10.15 Performance of a 500 x 500 double precision matrix multiplication on the SGI 4D/380. Cache
tiles are 64 x 64 iterations and register tiles are 4 x 1 (Courtesy of Wolf and Lam; reprinted from
ACM SIGPLAN Conf. Programming Language Design and Implementation, Toronto, Canada, 1991)
The improvement obtained from tiling can be far greater than that obtained from traditional compiler
optimizations. Figure 10.15 shows the performance of 500 x 500 matrix multiplication on an SGI 4D/380
machine consisting of eight MIPS R3000 processors running at 33 MHz. Each processor has a 64-Kbyte
direct-mapped first-level cache and a 256-Kbyte direct-mapped second-level cache. Results from four
different experiments are reported: without tiling, tiling to reuse data in caches, tiling to reuse data in registers,
and tiling for both registers and caches. For cache tiling, the data are copied into consecutive locations to avoid
cache interference.
Tiling improves the performance on a single processor by a factor of 2.75. The effect of tiling on multiple
processors is even more significant since it reduces not only the average data-access latency but also the
required memory bandwidth. Without cache tiling, contention over the memory bus limits the speedup to
about 4.5 times. Cache tiling permits speedups of over 7 for eight processors, achieving an overall speed of
64 Mflops when combined with register tiling.
Localized Iteration Space Reuse of vector space offers opportunities for locality optimization. However,
reuse does not imply locality. For example, if reuse does not occur soon enough, it may miss the temporal
locality. Therefore, the idea is to make reuse happen soon enough. A localized iteration space contains the
iterations that can exploit reuse. In fact, tiling increases the number of dimensions in which reuse can be
exploited.
Consider the following two-deep nest. Reuse is exploited for loops i and j only, which form the localized
iteration space:

Do i = 1, N
  Do j = 1, N
    B(i, j) = f(A(i), A(j))
  Enddo
Enddo

Reference A(j) touches different data within the inner loop but reuses the same elements across the outer
loop. More precisely, the same data A(j) is used in iterations (i, j), 1 <= i <= N. There is reuse, but the reuse is
separated by accesses to N - 1 other data. When N is large, the data is removed from the cache before it can be
reused, and there is no locality. Therefore, a reuse does not guarantee locality. The tiled code is shown below:
Do l = 1, N, s
  Do i = 1, N
    Do j = l, min(l + s - 1, N)
      B(i, j) = f(A(i), A(j))
    Enddo
  Enddo
Enddo
We choose the tile size such that the data used within the tile can be held within the cache. For this
example, as long as s is smaller than the cache size, A(j) will still be present in the cache when it is reused.
Thus, reuse is exploited for loops i and j only, and so the localized iteration space includes only these loops.
In general, if n is the first loop with a large bound, counting from innermost to outermost, then reuse
occurring within the inner n loops can be exploited. Therefore the localized vector space of a tiled loop is
simply that of the innermost tile, whether the tile is coalesced or not.
Obviously, memory optimizations are important. Locality can be upheld by intersecting the localized
vector space with the reuse vector space. In other words, reuse directs the search for unimodular and tiling
transformations. One should use locality information to eliminate unnecessary prefetches.
Cycle    Iteration 1    Iteration 2    Iteration 3    Iteration 4
  1      Read
  2      Mul
  3                     Read
  4                     Mul
  5      Add                           Read
  6                                    Mul
  7                     Add                           Read
  8      Write                                        Mul
  9                                    Add
 10                     Write
 11                                                   Add
 12                                    Write
 13
 14                                                   Write
Four iterations of the software-pipelined code are shown. Although each iteration requires 8 cycles to flow
through the pipeline, the four overlapped iterations require only 14 clock cycles to execute. Compared with
the nonpipelined execution, a speedup factor of 24/14 = 1.7 is achieved with the pipelining of four iterations.
N iterations require 2N + 6 cycles to execute with the pipeline. Thus, a speedup factor of 6N/(2N + 6) is
achieved. As N approaches infinity, a speedup factor of 3 is expected. This shows the advantage of software
pipelining, if other overhead is ignored.
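The cycle counts claimed above can be checked quickly, under the stated assumptions of 6 cycles per iteration without pipelining, 8 cycles for one iteration to flow through the software pipeline, and a new iteration initiated every 2 cycles (the helper names are illustrative).

```python
def pipelined_cycles(n, initiation=2, drain=6):
    return initiation * n + drain            # 2N + 6

def speedup(n, serial_per_iter=6):
    return serial_per_iter * n / pipelined_cycles(n)

print(pipelined_cycles(4))          # 14 cycles for four overlapped iterations
print(round(speedup(4), 2))         # 1.71, i.e. 24/14
print(round(speedup(10**6), 3))     # approaches 3 for large N
```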
Doacross Loops Unlike unrolling, software pipelining can give optimal results. Locally compacted code
may not be globally optimal. The Doall loops can fill arbitrarily long pipelines with infinite iterations. In the
following Doacross loop with dependence between iterations, software pipelining can still be done but is
harder to implement:

Doacross I = 1, N
  A(I) = A(I) * B
  Sum = Sum - A(I)
Enddo

The software-pipelined code is shown below:

Cycle    Iteration 1    Iteration 2
  1      Read
  2      Mul
  3      Add            Read
  4      Write          Mul
  5                     Add
  6                     Write
It is assumed that one memory access and one arithmetic operation can be concurrently executed on
the two-issue superscalar processor in each cycle. Thus recurrences can also be parallelized with software
pipelining.
As in the hardware pipeline scheduling in Chapter 6, the objective of software pipelining is to minimize
the interval at which iterations are initiated; i.e. the initiation latency determines the throughput for the loop.
The basic units of scheduling are minimally indivisible sequences of microinstructions. In the above Doall
loop example, the initiation latency is two cycles per iteration, and the Doacross loop is also pipelined with
an initiation latency of two cycles.
To summarize, software pipelining demands maximizing the throughput by reducing the initiation latency,
as well as producing small, compacted code. The trick is to find an identical schedule for every
iteration with a constant initiation latency. The scheduling problem is tractable if every operation takes unit
execution time and no cyclic dependences exist in the loop.
Summary
A programming model is a collection of program abstractions which present the programmer with a
well-defined view of the software and hardware system. Parallel programming models are defined for
the various types of parallel architectures which we have studied in the earlier chapters of the book.
We started this chapter with a study of the parallel programming models which have become well-
established, namely: the shared variable model, message-passing model, data parallel model, object-oriented
model, and functional and logic models.
For any given model for parallel programming, the user needs to be provided with a parallel programming
environment, which consists of parallel languages, compilers, support tools for program development, and
runtime support. The programming environment must provide specific features for parallelism aimed at
optimization, availability, synchronization and communication, control of parallelism, data parallelism, and
process management. Parallel language constructs are needed in the programming language, of which a
few examples have been presented in this chapter. And the compiler must be capable of optimizing the
machine code generated for the type of parallelism available in hardware.
Vectorizing or parallelizing compilers can, in theory, detect and exploit the potential parallelism which
is present in a sequential program. In this process, dependence analysis of data arrays can reveal the
presence or absence of dependences between successive references to array elements in a loop, or in
nested loops. In general, two operations can be carried out in parallel only if there is no data or control
dependence between them. We reviewed some specific techniques for dependence analysis of data arrays,
such as iteration space analysis, subscript separability and partitioning, and categorized dependence tests.
Optimization of the machine code generated by the compiler, and cycle-by-cycle scheduling of
machine instructions for execution on the processor, are both critical to achieving high performance
computing. Local optimization can be carried out within basic blocks, but in general both local and global
optimizations are required. We studied several vectorization and parallelization methods, such as the use
of temporary storage, loop interchanging, loop distribution, vector reduction, and node splitting.
Code generation and scheduling make use of directed acyclic graphs of operations within basic blocks,
and should utilize a register allocation strategy which does not inhibit parallel execution of instructions.
Trace-scheduling compilation makes use of program traces obtained from multiple previous executions
of the same program.
Loop transformations may in general be required prior to parallelization and/or vectorization of
program code. Permutation, reversal, skewing, and transformation matrices are some of the specific
techniques which can be applied. Wavefronting can be useful in exploiting fine-grain parallelism, while
tiling can help achieve locality and reduce synchronization costs in coarse-grain computations. Software
pipelining of loop iterations is another possible technique to parallelize a sequential program.
Exercises
Problem 10.1 Explain the following terms associated with message-passing programming of
multicomputers:
(a) Synchronous versus asynchronous message-passing schemes.
(b) Blocking versus nonblocking communications.
(c) The rendezvous concept introduced in the Ada programming system.
(d) Name-addressing versus channel-addressing schemes for message passing.
(e) Uncoupling between sender and receiver using buffers or mailboxes.
(f) Lost-message handling and interrupt-message handling.

Problem 10.2 Concurrent object-oriented programming was introduced in Section 10.1.4. Chain
multiplication of a list of numbers was illustrated in Example 10.2 based on a divide-and-conquer strategy.
A fragment of a Lisp-like code for multiplying the sequence of numbers is given below:
... the program sequentially on a uniprocessor system. The chain should be sufficiently long to see the
difference.

Problem 10.3 Gaussian elimination with partial pivoting was implemented by [Quinn90] and Hatcher
in C* code on the Connection Machine, as well as in concurrent C code on an nCUBE 3200 multicomputer.
(a) Discuss the translation/compiler effort from C* to C on the two machines after a careful reading of
the paper by Quinn and Hatcher.
(b) Comment on the SPMD (single program and multiple data streams) programming style as opposed
to the SIMD programming style, in terms of synchronization implementation and related performance
issues.
(c) Repeat the program conversion experiments for a fast Fourier transform (FFT) algorithm. Perform
the program conversion manually at the algorithm level using pseudo-codes with parallel constructs.
INTRODUCTION
The period between the 1970s and the 1990s saw a great many innovative ideas being proposed
in computer architecture. The basic hardware technology of computers had been mastered by
the 1960s, and several companies had produced successful commercial products. The time was therefore
right to generate new ideas, to reach performance levels higher than that of the original single-processor
systems. As we have seen, parallelism in its various forms has played a central role in the development of
newer architectures.
The earlier part of this book has presented a comprehensive overview of the many architectural innovations
which had been attempted until the early 1990s. Some of these were commercially successful, while many
others were not so fortunate, which is not at all surprising, given the large variety of ideas which were
proposed and the fast-paced advances taking place in the underlying technologies.
In the last two chapters of the book, we take a look at some of the recent trends and developments
in computer architecture, including, as appropriate, a brief discussion of advances in the underlying
technologies which have made these developments possible. In fact, we shall see that the recent advances
in computer architecture can be understood only when we also take a look at the underlying technologies.
System performance is the key benchmark in the study of computer architecture. A computer system
must solve the real world problem, or support the real world application, for which the user is installing
it. Therefore, in addition to the theoretical peak performance of the processor, the design objectives of any
computer architecture must also include other important criteria, which include system performance under
realistic load conditions, scalability, price, usability, and reliability. In addition, power consumption and physical size are also often important criteria.
A basic rule of system design is that there should be no performance bottlenecks in the system. Typically, a performance bottleneck arises when one part of the system—i.e. one of its subsystems—cannot keep up with the overall throughput requirements of the system. Such a performance bottleneck can occur in a production system, a distribution system, or even in a traffic system(1). If a performance bottleneck does occur in a system—i.e. if one subsystem is not able to keep up with other subsystems—then the other subsystems remain idle, waiting for response from the slower one.

In a computer system, the key subsystems are processors, memories, I/O interfaces, and the data paths connecting them. Within the processors, we have subsystems such as functional units, registers, cache memories, and internal data buses. Within the computer system as a whole—or within a single processor—designers do not wish to create bottlenecks to system performance.
Example 12.1 Performance bottleneck in a system

In Fig. 12.1 we see the schematic diagram of a simple computer system consisting of four processors, a large shared main memory, and a processor-memory bus.

Fig. 12.1 A simple computer system: four processors, shared main memory, and the processor-memory bus
(1) In common language, we say that a chain is only as strong as its weakest link.
This system exhibits a performance mismatch between the processors, main memory, and the processor-memory bus. The data transfer rates supported by the main memory and the shared processor-memory bus do not meet the aggregate requirements of the four processors in the system.

The system architect must pay careful attention to all such potential mismatches in system design. Otherwise, the sustained performance which the system can deliver can only equal the performance of the slowest part of the system—i.e. the bottleneck.

While this is a simple example, it illustrates the key challenge facing system designers. It is clear that, in the above system, if processor performance is improved by, say, 20%, we may not see a matching improvement in system performance, because the performance bottleneck in the system is the relatively slower processor-memory bus. In this particular case, a better investment for increased system performance could be (a) a faster processor-memory bus, and (b) improved cache memory with each processor, i.e. one with a better hit rate—which reduces contention for the processor-memory bus.
In fact, as we shall see, even achieving peak theoretical performance is not the final goal of system design. The system performance must be maintained for real-life applications, and that too in spite of the enormous diversity in modern applications.

In earlier chapters of the book, we have studied the many ways in which parallelism can be introduced in a computer system, for higher processing performance. The concept of instruction level parallelism and superscalar architecture has been introduced in Chapter 6. In this chapter, we take a more detailed look at instruction level parallelism.
(b) supporting multiple instruction streams on the processor in multi-core and/or multi-threading mode?

This design choice is also related to the depth of the instruction pipeline. In general, designs which aim to maximize the exploitation of instruction level parallelism need deeper pipelines; up to a point, such designs may support higher clock rates. But, beyond a point, deeper pipelines do not necessarily provide higher net throughput, while power consumption rises rapidly with clock rate, as we shall also discuss in Chapter 13.
Let us examine the trade-off involved in this context in a simplified way:

total chip area = number of cores × chip area per core

or

total transistor count = number of cores × transistor count per core

Here we have assumed for simplicity that cache and interconnect area—and transistor count—can be considered proportionately on a per core basis.
At a given time, VLSI technology limits the left hand side in the above equations, while the designer must select the two factors on the right. Aggressive exploitation of instruction level parallelism, with multiple functional units and more complex control logic, increases the chip area—and transistor count—per processor core. Alternatively, for a different category of target applications, the designer may select simpler cores, and thereby place a larger number of them on a single chip.

Of course system design would involve issues which are more complex than these, but a basic design issue is seen here: For the targeted application and performance, how should the designers divide available chip resources among processors and, within a single processor, among its various functional elements?
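To make the trade-off concrete, the following minimal sketch works through the second equation with purely illustrative numbers; the transistor budget and the per-core costs below are assumptions for the example, not data for any real processor.

```c
#include <stdio.h>

/* Illustrative use of: total transistor count = cores x transistors per core.
 * All figures are made-up round numbers chosen only to show the arithmetic. */
int main(void) {
    long long budget       = 2000000000LL;  /* assumed total transistor budget          */
    long long complex_core =  500000000LL;  /* assumed cost of an aggressive ILP core   */
    long long simple_core  =  125000000LL;  /* assumed cost of a much simpler core      */

    printf("complex cores that fit: %lld\n", budget / complex_core);  /* 4  */
    printf("simple cores that fit:  %lld\n", budget / simple_core);   /* 16 */
    return 0;
}
```

Under these assumed numbers, the same budget buys either four aggressive cores or sixteen simple ones; which choice is better depends on the targeted applications, exactly as argued above.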
Within a processor, a set of instructions are in various stages of execution at a given time—within the pipeline stages, functional units, operation buffers, reservation stations, and so on. Recall that functional units themselves may also be internally pipelined. Therefore machine instructions are not in general executed in the order in which they are stored in memory, and all instructions under execution must be seen as 'work in progress'.

As we shall see, to maintain the work flow of instructions within the processor, a superscalar processor makes use of branch prediction—i.e. the result of a conditional branch instruction is predicted even before the instruction executes—so that instructions from the predicted branch can continue to be processed, without causing pipeline stalls. The strategy works provided fairly good branch prediction accuracy is maintained.

But we shall assume that instructions are committed in order. Here committing an instruction means that the instruction is no longer 'under execution'—the processor state and program state reflect the completion of all operations specified in the instruction.

Thus we assume that, at any time, the set of committed instructions correspond with the program order of instructions and the conditional branches actually taken. Any hardware exceptions generated within the processor must reflect the processor and program state resulting from instructions which have already committed.
Parallelism which appears explicitly in the source program, which may be dubbed structured parallelism, is not directly related to instruction level parallelism. Parallelism detected and exploited by the compiler is a form of instruction level parallelism, because the compiler generates the machine instructions which result in parallel execution of multiple operations within the processor. We shall discuss in Section 12.5 some of the main issues related to this method of exploiting instruction level parallelism.
Parallelism detected and exploited by processor hardware on the fly, within the instructions which are under execution, is certainly instruction level parallelism. Much of the remaining part of this chapter discusses the basic techniques for hardware detection and exploitation of such parallelism, as well as some related design trade-offs.

While the student is expected to be familiar with the basic concepts related to instruction pipelines, the earlier discussion of these topics in Chapter 6 will serve as an introduction to the techniques discussed more fully in this chapter.

Weak memory consistency models, which are discussed elsewhere in the book, are not discussed explicitly in this chapter, since they are relevant mainly in the case of parallel threads of execution distributed over multiple processors. Similarly—since the discussion in this chapter is primarily in the context of a single processor—the issues of shared memory, cache coherence, and message-routing are also not discussed here. The student may refer to Chapters 5 and 7, respectively, for a discussion of these two topics.

With this background, let us start with a statement of the basic system design objective which is addressed in this chapter.

Our aim is to make the discussion independent of any specific instruction set, and therefore we shall use simple and self-explanatory opcodes, as needed.

Data transfer instructions have only two operands—source and destination registers; load and store instructions to/from main memory specify one operand in the form of a memory address, using an available addressing mode. Effective address for load and store is calculated at the time of instruction execution.

Conditional branch instructions need to be treated as a special category, since each such branch presents two possible continuations of the instruction stream. Branch decision is made only when the instruction executes; at that time, if instructions from the branch-not-taken path are in the pipeline, they must be flushed. But pipeline flushes are costly in terms of lost processor clock cycles. The payoff of branch prediction lies in the fact that correctly predicted branches allow the detection of parallelism to stretch across two or more basic
blocks of the program, without pipeline stalls. It is for this reason that branch prediction becomes an essential technique in exploiting instruction level parallelism.
Limits to detecting and exploiting instruction level parallelism are imposed by dependences between instructions. After all, if N instructions are completely independent of each other, they can be executed in parallel on N functional units—if N functional units are available—and they may even be executed in arbitrary order.

But in fact dependences amongst instructions are a central and essential part of program logic. A dependence specifies that instruction Ik must wait for instruction Ij to complete. Within the instruction pipeline, such a dependence may create a hazard or stall—i.e. lost processor clock cycles while Ik waits for Ij to complete.

For this reason, for a given instruction pipeline design and associated functional units, dependences amongst instructions limit the available instruction level parallelism—and therefore it is natural that the central issue in exploiting instruction level parallelism is related to the correct handling of such dependences.

We have already seen in Chapter 2 that dependences amongst instructions fall into several categories; here we shall review these basic concepts and introduce some related notation which will prove useful.
Data Dependences

Assume that instruction Ik follows instruction Ij in the program. Data dependence between Ij and Ik means that both access a common operand. For the present discussion, let us assume that the common operand of Ij and Ik is in a programmable register. Since each instruction either reads or writes an operand value, accesses by Ij and Ik to the common register can occur in one of four possible ways:

(i) Read by Ik after read by Ij
(ii) Read by Ik after write by Ij
(iii) Write by Ik after read by Ij
(iv) Write by Ik after write by Ij

Of these, the first pattern of register access does not in fact create a dependence, since the two instructions can read the common value of the operand in any order.

The other three patterns of operand access do create dependences amongst instructions. Based on the order of accesses listed above, these are known as read after write (RAW) dependence, write after read (WAR) dependence, and write after write (WAW) dependence, respectively.
Read after write (RAW) is true data dependence, in the sense that the register value written by instruction Ij is read—i.e. used—by instruction Ik. This is how computations proceed; a value produced in one step is used further in a subsequent step. Therefore RAW dependences must be respected when program instructions are executed. This type of dependence is also known as flow dependence.

Write after read (WAR) is known as anti-dependence, because in this instance instruction Ik should not overwrite the value in the common register till the previous value stored therein has been used by the prior instruction Ij which needs the value. Such a dependence can be removed from the executing program by simply assigning another register for the write instruction Ik to write into. With read and write occurring to two different registers, the dependence between instructions is removed. In fact, this is the basis of the register renaming technique which we shall discuss later in this chapter.
Write after write (WAW) is known as output dependence, since two instructions are writing to a common register. If this dependence is violated, then subsequent instructions will see a value in the register which should in fact have been overwritten—i.e. they will see the value written by Ij rather than Ik. This type of dependence can also be removed from the executing program by assigning another target register for the second write instruction, i.e. by register renaming.
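As a concrete illustration, consider the short sequence below; each assignment stands for one machine instruction and each variable for one programmable register. The opcodes and register numbers are invented for illustration, in the three-operand style used in this chapter, and the comments name the dependences the sequence contains.

```c
#include <stdio.h>

/* A hypothetical four-"instruction" sequence showing RAW, WAR and WAW. */
int main(void) {
    int r1 = 1, r2 = 2, r5 = 5, r6 = 6, r7 = 7, r8 = 8, r9 = 9;
    int r3, r4;

    r3 = r1 + r2;   /* I1: writes r3                                       */
    r4 = r3 * r5;   /* I2: reads r3 and r5 -> RAW (flow) dependence on I1  */
    r5 = r6 - r7;   /* I3: writes r5 -> WAR (anti) dependence on I2        */
    r3 = r8 + r9;   /* I4: writes r3 -> WAW (output) dependence on I1      */

    printf("%d %d %d\n", r3, r4, r5);   /* prints 17 15 -1 */
    return 0;
}
```

The RAW dependence between I1 and I2 must be respected, while the WAR and WAW dependences disappear if I3 and I4 are given different target registers, which is precisely what register renaming does.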
Sometimes we need to show dependences between instructions using graphical notation. We shall use small circles to represent instructions, and double line arrows between two circles to denote dependences. The instruction at the head of the arrow is dependent on the instruction at the tail; if necessary, the type of dependence between instructions may be shown by appropriate notation next to the arrow. A missing arrow between two instructions will mean explicit absence of dependence.

Single line arrows will be used between instructions when we wish to denote program order without any implied dependence or absence of dependence.

Figure 12.2 illustrates this notation.
Fig. 12.2 Graphical notation for dependences between instructions
When dependences between multiple instructions are thus depicted, the result is a directed graph of dependences. A node in the graph represents an instruction, while a directed edge between two nodes represents a dependence.

Often dependences are thus depicted in a basic block of instructions—i.e. a sequence of instructions with entry only at the first instruction, and exit only at the last instruction of the sequence. In such cases, the graph of dependences becomes a directed acyclic graph, and the dependences define a partial order amongst the instructions.
Part (a) of Fig. 12.3 shows a basic block of six instructions, denoted I1 through I6 in program order. Entry to the basic block may be from one of multiple points within the program; continuation after the basic block would be at one of several points, depending on the outcome of the conditional branch instruction at the end of the block.

Part (b) of the figure shows a possible pattern of dependences as they may exist amongst these six instructions. For simplicity, we have not shown the type of each dependence, e.g. RAW(R3), etc. In the partial order, we see that several pairs of instructions—such as (I1, I3) and (I3, I4)—are not related by any dependence. Therefore, amongst each of these pairs, the instructions may be executed in any order, or in parallel.
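The partial order can be captured directly in code. The sketch below builds a small dependence DAG for a hypothetical six-instruction basic block (the edge set is invented for illustration and is not the one drawn in Fig. 12.3) and computes the earliest cycle in which each instruction could complete if dependences were the only constraint.

```c
#include <stdio.h>

#define N 6   /* six instructions I1..I6 of a hypothetical basic block */

int main(void) {
    /* dep[i][j] = 1 means instruction j depends on instruction i (edge i -> j).
     * This edge set is purely illustrative.                                   */
    int dep[N][N] = {0};
    dep[0][2] = 1;   /* I1 -> I3 */
    dep[1][3] = 1;   /* I2 -> I4 */
    dep[2][4] = 1;   /* I3 -> I5 */
    dep[3][5] = 1;   /* I4 -> I6 */

    /* Earliest completion cycle of each instruction: one more than the latest
     * of its predecessors; a single forward pass works because edges only go
     * from earlier to later instructions in program order (a DAG).           */
    int cycle[N];
    for (int j = 0; j < N; j++) {
        cycle[j] = 1;
        for (int i = 0; i < j; i++)
            if (dep[i][j] && cycle[i] + 1 > cycle[j])
                cycle[j] = cycle[i] + 1;
    }
    for (int j = 0; j < N; j++)
        printf("I%d can complete in cycle %d\n", j + 1, cycle[j]);
    return 0;
}
```

With this assumed edge set, two instructions become ready in each of three consecutive cycles, which mirrors the argument made below for the pattern of Fig. 12.3(b) on a processor completing two instructions per clock cycle.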
Dependences amongst instructions are inherent in the instruction stream. For processor design, the important questions are: For a given processor architecture, what is the effect of such dependences on processor performance? Do these dependences create hazards which necessitate pipeline stalls and/or flushes? Can these dependences be removed on the fly using some design technique? Can their adverse impact be reduced?
Fig. 12.3 (a) A basic block of six instructions in program order; (b) a possible pattern of dependences amongst them
Consider once again the pattern of dependences shown in Fig. 12.3(b). If the processor is capable of completing two (or more) instructions per clock cycle, and if no pipeline stalls are caused by the dependences shown, then clearly the six instructions can be completed in three consecutive processor clock cycles. Instruction latency, from fetch to commit stage, will of course depend on the depth of the pipeline.
Control Dependences

In typical application programs, basic blocks tend to be small in length, since about 15% to 20% of instructions in programs are branch and jump instructions, with indirect jumps and returns from procedure calls also included in the latter category. Because of the typically small sizes of basic blocks in programs, the amount of instruction level parallelism which can be exploited in a single basic block is limited.

Assume that instruction Ij is a conditional branch and that whether another instruction Ik executes or not depends on the outcome of the conditional branch instruction Ij. In such a case, we say that there is a control dependence of instruction Ik on instruction Ij.

Let us assume that a processor has an instruction pipeline of depth eight, and that the designers target superscalar performance of four instructions completed in every clock cycle. Assuming no pipeline stalls, the number of instructions in the processor at any one time—in its various pipeline stages and functional units—would be 4 × 8 = 32.

If 15% to 20% of these instructions are branches and jumps, then the execution of subsequent instructions within the processor would be held up pending the resolution of conditional branches, procedure returns, and so on—causing frequent pipeline stalls.

This simple calculation shows the potential adverse impact of conditional branches on the performance of a superscalar processor. The key question here is: How can the processor designer mitigate the adverse impact of such control dependences in a program?
Answer: Using some form of branch and jump prediction—i.e. predicting early and correctly (most of the time) the results of conditional branches, indirect jumps, and procedure returns. The aim is that, for every correct prediction made, there should be no lost processor clock cycles due to the conditional branch, indirect jump, or procedure return. For every mis-prediction made, there would be the cost of flushing the pipeline of instructions from the wrong continuation after the conditional branch or jump.
Example 12.2 Impact of successful branch prediction

Assume that we have attained 93% accuracy in branch prediction in a processor with eight pipeline stages. Assume also that the mis-prediction penalty is 4 processor clock cycles to flush the instruction pipeline. What is the performance gain from such a branch prediction strategy?

Recall that the expected value of a random variable X is given by Σ xi pi, where xi are the possible values of X, and pi are the respective probabilities. In our case, the probability of a correct branch prediction is 0.93, and the corresponding cost is zero; the probability of a wrong prediction is 0.07, and the corresponding cost is 4 clock cycles. Thus the expected cost of a conditional branch instruction is 0.07 × 4 = 0.28 clock cycle, i.e. much less than one clock cycle.
As a primitive form of branch prediction, the processor designer could assume that a conditional branch is always taken, and continue processing the instructions which follow at the target address. Let us assume that this simple strategy works 80% of the time; then the expected cost of a conditional branch is 0.2 × 4 = 0.8 clock cycles.

Suppose that not even this primitive form of branch prediction is used. Then the pipeline must stall until the result of every branch condition, and the target address of every indirect jump and procedure return, is known; only then can the processor proceed with the correct continuation within the program. If we assume that in this case the pipeline stalls over half the total number of stages, then the number of lost clock cycles is 4 for every conditional branch, indirect jump and procedure return instruction.

Considering that 15% to 20% of the instructions in a program are branches and jumps, the difference in cost between 0.28 clock cycle and 4 clock cycles per branch instruction is huge, underlining the importance of branch prediction in a superscalar processor.
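The expected-cost arithmetic of this example is easy to check mechanically; the short sketch below evaluates the three scenarios discussed above, using the 4-cycle flush penalty and the prediction accuracies stated in the example as its assumptions.

```c
#include <stdio.h>

/* Expected branch cost in clock cycles:
 * (probability of mis-prediction) x (mis-prediction penalty). */
static double expected_cost(double accuracy, double penalty_cycles) {
    return (1.0 - accuracy) * penalty_cycles;
}

int main(void) {
    double penalty = 4.0;   /* clock cycles lost per pipeline flush */
    printf("93%% prediction accuracy: %.2f cycles per branch\n",
           expected_cost(0.93, penalty));   /* 0.28 */
    printf("'always taken' (80%%):    %.2f cycles per branch\n",
           expected_cost(0.80, penalty));   /* 0.80 */
    printf("no prediction at all:    %.2f cycles per branch\n",
           expected_cost(0.00, penalty));   /* 4.00 */
    return 0;
}
```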
Later in this chapter, we shall study the techniques employed for branch prediction.
Resource Dependences  This is possibly the simplest kind of dependence to understand, since it refers to a resource constraint causing dependence amongst instructions needing the resource.
Example 12.3 Resource dependence

Consider a simple pipelined processor with only one floating point multiplier, which is not internally pipelined and takes three processor clock cycles for each multiplication. Assume that several independent floating point multiply instructions follow each other in the instruction stream in a single basic block under execution.
Clearly, while the processor is executing these multiply instructions, it cannot for that duration get even one instruction completed in every clock cycle. Therefore pipeline stalls are inevitable, caused by the absence of sufficient floating point multiply capability within the processor. In fact, for the duration of these consecutive multiply operations, the processor will only complete one instruction in every three clock cycles.

We have assumed the instructions to be independent of each other, and in a single basic block—i.e. there are no conditional branches within the sequence. Thus there is no data dependence or control dependence amongst these instructions. What we have here is resource dependence, i.e. all the instructions depend on the resource which has not been provided to the extent it is needed for the given workload on the processor.

We can say that there is an imbalance in this processor between the floating point capability provided and the workload which is placed on it. Such imbalances in system resources usually have adverse performance impact. Recall that Example 12.1 above and the related discussion illustrated this same point in another context.
A resource dependence which results in a pipeline stall can arise for access to any processor resource—functional unit, data path, register bank, and so on(2). We can certainly say that such resource dependences will arise if hardware resources provided on the processor do not match the needs of the executing program.

Now that we have seen the various types of dependences which can occur between instructions in an executing program, the problem of detecting and exploiting instruction level parallelism can finally be stated in the following manner:

Problem Definition  Design a superscalar processor to detect and exploit the maximum degree of parallelism available in the instruction stream—i.e. execute the instructions in the smallest possible number of processor clock cycles—by handling correctly the data dependences, control dependences and resource dependences within the instruction stream.
Before we can make progress in that direction, however, it is necessary to keep in mind a prototype processor design on which the problem solution can be attempted.
(2) This type of dependence may also be called structural dependence, since it is related to the structure of the processor; however resource dependence is the more common term.
Let us assume that our superscalar processor is designed for k instruction issues in every processor clock cycle. Clearly then the fetch, decode and issue pipeline stages, as well as the other elements of the processor, must all be designed to process k instructions in every clock cycle.

On multiple issue pipelines, the issue stage is usually separated from the decode stage. One reason for thus increasing the number of pipeline stages is that it allows the processor to be driven by a faster clock. The decode stage must be seen as preparation for instruction issue, which—by definition—can occur only if the relevant functional unit in the processor is in a state in which it can accept one more operation for execution. As a result of the issue, the operation is handed over to the functional unit for execution.
Note 12.1

The name of the instruction decode stage is somewhat inaccurate, in the sense that the instruction is never fully decoded. If a 32-bit instruction were fully decoded, for example, the decoder would have some 4 × 10^9 outputs! This is never done; an immediate constant is never decoded, and a memory or I/O address is decoded outside the processor, in the address decoder associated with the memory or I/O module.

Register select bits in the instruction are decoded when they are used to access the register bank; similarly, ALU function bits can be decoded within the ALU. Therefore register select and ALU function bits also need not be decoded in the instruction decode stage of the processor.

What happens in the instruction decode stage of the processor is that only the relevant fields of the instruction are decoded. For example, opcode bits must be decoded to select the functional unit, and addressing mode bits must be decoded to determine the operations required to calculate the effective memory address.
When instruction scheduling is specified by the compiler in the machine code it generates, we refer to it as static scheduling. In theory, static scheduling should free up the processor hardware from the complexities of instruction scheduling; in practice, though, things do not quite turn out that way, as we shall see in the next section.

If the processor control logic schedules instructions on the fly(3)—taking into account inter-instruction dependences as well as the state of the functional units—we refer to it as dynamic scheduling. Much of the rest of this chapter is devoted to various aspects and techniques of dynamic scheduling. Of course the basic aim in both types of scheduling—static as well as dynamic—is to maximize the instruction level parallelism which is exploited in the executing sequence of instructions.
As we have seen, at one time multiple instructions are in various stages of execution within the processor. But processor state and program state need to be maintained which are consistent with the program order of completed instructions. This is important from the point of view of preserving the semantics of the program.

Therefore, even with multiple instructions executing in parallel, the processor must arrange the results of completed instructions so that their sequence reflects program order. One way to achieve this is by using a
(3) Instruction scheduling as discussed here has some similarity with other types of task or job scheduling systems. It should be noted, of course, that a typical production system requiring job scheduling does not involve conditional branches, i.e. control dependences.
reorder buffer, shown in Fig. 12.4, which allows instructions to be committed in program order, even if they execute in a different order; we shall discuss this point in some more detail in Section 12.7.

Fig. 12.4 Prototype superscalar processor: functional units, load/store unit, register bank, and branch prediction unit, with connection to cache/main memory
If instructions are executed on the basis of predicted branches, before the actual branch outcome is available, we say that the processor performs speculative execution. In such cases, the reorder buffer will need to be cleared—wholly or partly—if the actual branch result indicates that speculation has occurred on the basis of a mis-prediction.

Functional units in the processor may themselves be internally pipelined; they may also be provided with reservation stations, which accept operations from the issue stage of the instruction pipeline. A functional unit performs an operation when the required operands for it are available in the reservation station. For the purposes of our discussion, memory load-store unit(s) may also be treated as functional units, which perform their functions with respect to the cache/memory subsystem.
Figure 12.5 shows a processor design in which functional units are provided with reservation stations. Such designs usually also make use of operand forwarding over a common data bus (CDB), with tags to identify the source of data on the bus. Such a design also implies register renaming, which resolves WAR and WAW dependences. Dynamic scheduling of instructions on such a processor is discussed in some more detail in Sections 12.8 and 12.9.

A branch prediction unit has also been shown in Fig. 12.4 and Fig. 12.5 to implement some form of a branch prediction algorithm, as discussed in Section 12.10.
Data paths connecting the various elements within the processor must be provided so that no resource dependences—and consequent pipeline stalls—are created for want of a data path. If k instructions are to be completed in every processor clock cycle, the data paths within the processor must support the required data transfers in each clock cycle.
Fig. 12.5 Processor with reservation stations provided with the functional units, branch prediction unit, and connection to cache/main memory
At one extreme, a primitive arrangement would be to provide a single common bus within the processor; but such a bus would become a scarce and performance limiting resource amongst multiple instructions executing in parallel within the processor.

At the other extreme, one can envisage a complete graph of data paths amongst the various processor elements. In such a system, in each clock cycle, any processor element can transfer data to any other processor element, with no resource dependences caused on that account. But unfortunately, for a processor with n internal elements, such a system requires n - 1 data ports at every element, and is therefore not practical.

Therefore, between the two extremes outlined above, processor designers must aim for an optimum design of internal processor data paths, appropriate for the given instruction set and the targeted processor performance. This point will be discussed further in Section 12.6, when we discuss a technique known as operand forwarding.
As mentioned above, the important question of defining program (or thread) state and processor state must also be addressed. If a context switch, interrupt or exception occurs, the program/thread state and processor state must be saved, and then restored at a later time when the same program/thread resumes. From the programmer's point of view, the state should correspond to a point in the machine language program at which the previous instruction has completed execution, but the next one has not started.

In a multiple-issue processor, clearly this requires careful thought—since, at any time, as many as a couple of dozen instructions may be in various stages of execution.

A processor of the type described here is often designed with hardware support for multi-threading, which requires maintaining thread status of multiple threads, and switching between threads; this type of design is discussed further in Section 12.12.
Note also that, in Fig. 12.4 and Fig. 12.5, we have separated control elements from data flow elements and functional units in the processor—and in fact shown only the latter. Design of the control logic needed for the processor will not be discussed in this chapter in any degree of detail beyond the brief overview contained in Note 12.2.
Note 12.2

The processor designer must select the architectural components to be included in the processor—for example a reorder buffer of a particular type, a specific method of operand forwarding, a specific method of branch prediction, and so on. The designer must also specify fully the algorithms which will govern the working of the selected architectural components. These algorithms are very similar to the algorithms we write in higher level programming languages, and are written using similar languages. These algorithms specify the control logic that would be needed for the processor, which would be finally realized in the form of appropriate digital logic circuits.

Given the complexity of modern systems, the task of translating algorithmic descriptions of processor functions into digital logic circuits can only be carried out using very sophisticated VLSI design software. Such software offers a wide range of functionality: simulation software is used to verify the correctness of the selected algorithm; logical design software translates the algorithm into a digital circuit; physical design software translates the logical circuit design into a physical circuit which can be built using VLSI, while design verification software verifies that the physical design does not violate any constraints of the underlying circuit fabrication technology.

All the architectural elements and control logic which are being described in this chapter can thus be translated into a physical design and then realized in VLSI. This is how processors and other digital systems are designed and built today. For our purposes in this chapter, however, it is not necessary to go into the details of how the required circuits and control logic are to be realized in VLSI.

We take the view that the architect decides what is to be designed, and then the circuit designer designs and realizes the circuit accordingly. In other words, our subject matter is restricted to the functions of the architect, and does not extend to circuit design—i.e. to the question of how a particular function is to be realized in VLSI. We assume that any required control logic which can be clearly specified can be implemented.
Now suppose that machine code is generated by the compiler as though the original program had been
written as:
for j = 0 to 52 step 4 do
    c[j]   = a[j]*b[j]     - p*d[j];
    c[j+1] = a[j+1]*b[j+1] - p*d[j+1];
    c[j+2] = a[j+2]*b[j+2] - p*d[j+2];
    c[j+3] = a[j+3]*b[j+3] - p*d[j+3];
To discover and exploit the parallelism implicit in loops, as seen in Example 12.4, the compiler must perform the loop unrolling transformation to generate the machine code. Clearly, this strategy makes sense only if sufficient hardware resources are provided within the processor for executing instructions in parallel.
In the simple example above, the loop control variable in the original program goes from 0 to 55—i.e. its initial and final values are both known at compile time. If, on the other hand, the loop control values are not known at compile time, the compiler must generate code to calculate at run-time the control values for the unrolled loop.
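When the trip count is known only at run time, one common approach is to unroll by a fixed factor and add a short cleanup loop for the leftover iterations. The C sketch below illustrates the idea for the loop of this example; the unroll factor of 4 and the array names follow the example, while the function itself is only an illustrative sketch, not actual compiler output.

```c
/* Unrolling by 4 when the trip count n is known only at run time:
 * the main loop handles groups of four iterations, and the cleanup
 * loop handles the remaining n % 4 iterations.                     */
void compute(int n, double *c, const double *a,
             const double *b, const double *d, double p) {
    int j = 0;
    for (; j + 3 < n; j += 4) {          /* unrolled body: one longer basic block */
        c[j]     = a[j]     * b[j]     - p * d[j];
        c[j + 1] = a[j + 1] * b[j + 1] - p * d[j + 1];
        c[j + 2] = a[j + 2] * b[j + 2] - p * d[j + 2];
        c[j + 3] = a[j + 3] * b[j + 3] - p * d[j + 3];
    }
    for (; j < n; j++)                   /* cleanup for leftover iterations */
        c[j] = a[j] * b[j] - p * d[j];
}
```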
Note that loop unrolling by the compiler does not in itself involve the detection of instruction level parallelism. But loop unrolling makes it possible for the compiler or the processor hardware to exploit a greater degree of instruction level parallelism. In Example 12.4, since the basic block making up the loop body becomes longer, it becomes possible for the compiler or processor to find a greater degree of parallelism amongst the instructions across the unrolled loop iterations.
Can the compiler also do the additional work of actually scheduling machine instructions on the hardware resources available on the processor? Or must this scheduling necessarily be performed on the fly by the processor control logic?

When the compiler schedules machine instructions for execution on the processor, the form of scheduling is known as static scheduling. As against this, instruction scheduling carried out by the processor hardware on the fly is known as dynamic scheduling, which has been introduced in Chapter 6 and will be discussed further later in this chapter.

If the compiler is to schedule machine instructions, then it must perform the required dependence analysis amongst instructions. This is certainly possible, since the compiler has access to full semantic information obtained from the original source program.
Example 12.5 Dependence across loop iterations

Consider the following loop in a source program, which appears similar to the loop seen in the previous example, but has a crucial new dependence built into it:

for i = 0 to 55 do
    c[i] = a[i]*b[i] - p*c[i-1];

Now the value calculated in the i-th iteration of the loop makes use of the value c[i-1] calculated in the previous iteration. This does not mean that the modified loop cannot be unrolled, but only that extra care should be taken to account for the dependence.
Dependences amongst references to simple variables, or amongst array elements whose index values are known at compile time (as in the two examples seen above), can be analyzed relatively easily at compile time.

But when pointers are used to refer to locations in memory, or when array index values are known only at run-time, then clearly dependence analysis is not possible at compile time. Therefore processor hardware must provide support at run-time for alias analysis—i.e. based on the respective effective addresses, to determine whether two memory accesses for read or write operations refer to the same location.
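A small C fragment makes the difficulty concrete; the function and its arguments are invented for illustration. Whether the store and the load below ever touch the same memory location depends entirely on the pointer values supplied at run time, so the compiler cannot prove the accesses independent.

```c
/* Whether the store to dst[i] and the load from src[i+1] ever refer to the
 * same location depends on the pointer values passed in at run time. For
 * example, if dst == src + 2, the store in iteration i writes the element
 * that the load in iteration i+1 reads: a dependence through memory that
 * only the effective addresses, computed during execution, can reveal.    */
void scale_shift(double *dst, const double *src, double k, int n) {
    for (int i = 0; i + 1 < n; i++)
        dst[i] = k * src[i + 1];
}
```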
There is another reason why static scheduling by the compiler must be backed up by dynamic scheduling by the processor hardware. Cache misses, I/O interrupts, and hardware exceptions cannot be predicted
at compile time. Therefore, apart from alias analysis, the disruptions caused by such events in statically scheduled running code must also be handled by the dynamic scheduling hardware in the processor.

These arguments bring out a basic point—compiler detected instruction level parallelism also requires dynamic scheduling support within the processor. The fact that the compiler performs extra work does not really make the processor hardware much simpler(4).
A further step in the direction of compiler detected instruction level parallelism and static scheduling can be the following:

Suppose each machine instruction specifies multiple operations—to be carried out in parallel within the processor, on multiple functional units. The machine language program produced by the compiler then consists of such multi-operation instructions, and their scheduling takes into account all the dependences amongst instructions.

Recall that conventional machine instructions specify one operation each—e.g. load, add, multiply, and so on. As opposed to this, multi-operation instructions would require a larger number of bits to encode. Therefore processors with this type of instruction word are said to have very long instruction word (VLIW). A preliminary discussion of this concept has been included in Chapter 4 of the book.
A little further refinement of this concept brings us to the so-called explicitly parallel instruction computer (EPIC). The EPIC instruction format can be more flexible than the fixed format of a multi-operation VLIW instruction; for example, it may allow the compiler to encode explicitly the dependences between operations.
Another possibility is that of having predicated instructions in the instruction set, whereby an instruction is executed only if the hardware condition (predicate) specified with it holds true. Such instructions would result in a reduced number of conditional branch instructions in the program, and could thereby lower the number of pipeline flushes.
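As a sketch of what predication buys, compare a branchy absolute-value computation with a branch-free form; a predicated or conditional-move instruction set can express the second form without any conditional branch to predict. The C below only illustrates the idea at source level, under the assumption that the compiler maps the selection onto such an instruction.

```c
#include <stdio.h>

/* Branchy form: compiles naturally to a compare-and-branch, which costs a
 * pipeline flush whenever the branch is mis-predicted.                    */
static int abs_branchy(int x) {
    if (x < 0)
        return -x;
    return x;
}

/* Branch-free form: a predicated or conditional-move instruction can
 * evaluate both alternatives and select one, with no branch to predict. */
static int abs_predicated(int x) {
    int neg = -x;
    return (x < 0) ? neg : x;   /* candidate for a conditional select/move */
}

int main(void) {
    printf("%d %d\n", abs_branchy(-7), abs_predicated(-7));   /* 7 7 */
    return 0;
}
```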
The aim behind VLIW and EPIC processor architecture is to assign to the compiler primary responsibility for the parallel exploitation of the plentiful hardware resources of the processor. In theory, this would simplify the processor hardware, allowing for increased aggregate processor throughput. Thus this approach would, in theory, provide a third alternative to the RISC and CISC styles of processor architecture.

In general, however, it is fair to say that VLIW and EPIC concepts have not fulfilled their original promise. Intel Itanium 64-bit processors make up the most well-known processor family of this class. Experience with that processor showed, as was argued briefly above, that processor hardware does not really become simpler even when the compiler bears primary responsibility for the detection and exploitation of instruction level parallelism. Events such as interrupts and cache misses remain unpredictable, and therefore execution of operations at run-time cannot follow completely the static scheduling specified in VLIW/EPIC instructions by the compiler; dynamic scheduling is still needed.
Another practical difficulty with compiler detected instruction level parallelism is that the source program may have to be recompiled for a different processor model of the same processor family. The reason is simple: such a compiler depends not only on the instruction set architecture (ISA) of the processor family, but also on the hardware resources provided on the specific processor model for which it generates code.
(4) Recall in this context the basic argument for RISC architecture, whereby the instruction set is reduced for the sake of higher processor throughput. A similar trade-off between hardware and software complexity does not exist when the compiler performs static scheduling of instructions on a superscalar processor.
For highly compute-intensive applications which run on dedicated hardware platforms, this strategy may well be feasible and it may yield significant performance benefits. Such special-purpose applications are fine-tuned for a given hardware platform, and then run for long periods on the same dedicated platform.

But commonly used programs such as word processors, web browsers, and spreadsheets must run without recompilation on all the processors of a family. Most users of software do not have source programs to recompile, and all the processors of a family are expected to be instruction set compatible with one another. Therefore the role of compiler-detected instruction level parallelism is limited in the case of widely used general purpose application programs of the type mentioned.
Niiiiiili (JFWERJhlIIJ|Fi)FUIVlHR[)Hfll5
1 We know that a superscalar processor offers opportunities for the detection and exploitation
of instnrction level parallc|ism—i.e. potential parallelism which is present within a single
instruction stream. Exploitation ofsuch parallelism is enhanced by providing multiple functional units and
by other techniques that we shall study. Tr|.|c data dependences between instnrctions must of course be
respected, since they reflect program logic. On the other hand, two independent instructions can be executed
in parallel—or even out ofsequence—ifthat results in better utilization of processor clock cycles.
We now know that pipelinejirlslres caused by conditional branch, indirect jump, and procedure return
instructions lead to degradation in performance, and therefore attempts must be made to minimize them;
similarly pipeline .s'r.nHs caused by data dependences and cache misses also have adverse impact on processor
performance.
Therefore thc strategy should be to minimize the number of pipeline stalls and flushes encountered while
executing an instruction stream. In other words, we must minimize wasted processor clock cycles within the
pipeline and also, if possible, within the various iirnctional units ofthe processor.
In this section, we take a look at a basic technique known as operand forwarding, which helps in reducing the impact of true data dependences in the instruction stream. Consider the following simple sequence of two instructions in a running program:

ADD    R1, R2, R3
SHIFTR #4, R3, R4
The result of the ADD instruction is stored in destination register R3, and then shifted right by four bits in the second instruction, with the shifted value being placed in R4. Thus, there is a simple RAW dependence between the two instructions—the output of the first is required as an input operand of the second.

In terms of our notation, this RAW dependence appears as shown in Fig. 12.6, in the form of a graph with two nodes and one edge.

Fig. 12.6 RAW dependence between the two instructions

In a pipelined processor, ideally the second instruction should be executed one stage—and therefore one clock cycle—behind the first. However, the difficulty here is that it takes one clock cycle to transfer the ALU output to destination register R3, and then another clock cycle to transfer the contents of
register R3 to the ALU input for the right shift. Thus a total of two clock cycles are needed to bring the result of the first instruction where it is needed for the second instruction. Therefore, as things stand, the second instruction above cannot be executed just one clock cycle behind the first.

This sequence of data transfers has been illustrated in Fig. 12.7 (a). In clock cycle Tk, the ALU output is transferred to R3 over an internal data path. In the next clock cycle Tk+1, the content of R3 is transferred to the ALU input for the right shift. When carried out in this order, clearly the two data transfer operations take two clock cycles.

But note that the required two transfers of data can be achieved in only one clock cycle if the ALU output is sent to both R3 and the ALU input in the same clock cycle, as illustrated in Fig. 12.7 (b). In general, if X is to be copied to Y, and in the next clock cycle Y is to be copied to Z, then we can just as well copy X to both Y and Z in one clock cycle.

If this is done in the above sequence of instructions, the second instruction can be just one clock cycle behind the first, which is a basic requirement of an instruction pipeline.
Fig. 12.7 Two data transfers (a) in sequence and (b) in parallel
In technical terms, this type of an operation within a processor is known as operand forwarding. Basically this means that, instead of performing two or more data transfers from a common source one after the other, we perform them in parallel. This can be seen as parallelism at the level of elementary data transfer operations within the processor. To achieve this aim, the processor hardware must be designed to detect and exploit on the fly all such opportunities for saving clock cycles. We shall see later in this chapter one simple and elegant technique for achieving this aim.

The benefits of such a technique are easy to see. The wait within a functional unit for its operand becomes shorter because, as soon as it is available, the operand is sent in one clock cycle, over the common data bus, to every destination where it is needed. We saw in the above example that thereby the common data bus remained occupied for one clock cycle rather than two clock cycles. Since this bus itself is a key hardware resource, its better utilization in this way certainly contributes to better processor performance.
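The one-cycle broadcast can be pictured as a tagged bus write delivered to every waiting consumer at once. The following minimal sketch models that behaviour in C; the tag values, the fixed-size consumer table and the function names are all invented for illustration, not part of any real design.

```c
#include <stdio.h>

#define CONSUMERS 4

/* Each consumer (a register or a reservation-station operand slot) waits for
 * a value identified by a tag; one broadcast delivers the value to every
 * consumer with a matching tag in the same clock cycle.                     */
struct consumer {
    int waiting_tag;   /* tag of the value this slot is waiting for */
    int has_value;
    int value;
};

static void broadcast(struct consumer c[], int tag, int value) {
    for (int i = 0; i < CONSUMERS; i++)
        if (!c[i].has_value && c[i].waiting_tag == tag) {
            c[i].value = value;        /* delivered in the same cycle */
            c[i].has_value = 1;
        }
}

int main(void) {
    /* Register R3 and the shifter's input slot both wait for tag 7, the tag
     * assumed here for the ADD result in the running example.              */
    struct consumer slots[CONSUMERS] = {
        { .waiting_tag = 7 },   /* register R3           */
        { .waiting_tag = 7 },   /* shifter input operand */
        { .waiting_tag = 9 },   /* unrelated slots       */
        { .waiting_tag = 9 },
    };
    broadcast(slots, 7, 42);    /* ADD result appears on the common data bus */
    for (int i = 0; i < CONSUMERS; i++)
        printf("slot %d: %s\n", i,
               slots[i].has_value ? "value received" : "still waiting");
    return 0;
}
```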
The above reasoning applies even if there is an intervening instruction between ADD and SHIFTR.
Consider the following sequence of instructions:
REORDER BUFFER

The reorder buffer as a processor element was introduced and discussed briefly in Section 12.4. Since instructions execute in parallel on multiple functional units, the reorder buffer serves the function of bringing completed instructions back into an order which is consistent with program order. Note that instructions may complete in an order which is not related to program order, but must be committed in program order.

At any time, program state and processor state are defined in terms of instructions which have been committed—i.e. their results are reflected in appropriate registers and/or memory locations. The concepts of program state and processor state are important in supporting context switches and in providing precise exceptions.

Entries in the reorder buffer are completed instructions, which are queued in program order. However, since instructions do not necessarily complete in program order, we also need a flag with each reorder buffer entry to indicate whether the instruction in that position has completed.
Figure 12.9 shows a reorder buffer of size eight. Four fields are shown with each entry in the reorder buffer—instruction identifier, value computed, program-specified destination of the value computed, and a flag indicating whether the instruction has completed (i.e. the computed value is available).

In Fig. 12.9, the head of the queue of instructions is shown at the top, arbitrarily labeled as instr[i]. This is the instruction which would be committed next—if it has completed execution. When this instruction commits,
its result value is copied to its destination, and the instruction is then removed from the reorder buffer. The next instruction to be issued in the issue stage of the instruction pipeline then joins the reorder buffer at its tail.

If the instruction at the head of the queue has not completed, and the reorder buffer is full, then further issue of instructions is held up—i.e. the pipeline stalls—because there is no free space in the reorder buffer for one more entry.
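A minimal sketch of such a reorder buffer, using the four fields just described and a circular queue of eight entries, might look as follows in C. The field and function names are invented for this sketch; commit simply retires completed entries from the head, so results become architectural state strictly in program order.

```c
#include <stdbool.h>
#include <stdio.h>

#define ROB_SIZE 8

/* One reorder buffer entry: the four fields described in the text. */
struct rob_entry {
    int  instr_id;   /* instruction identifier            */
    long value;      /* value computed                    */
    int  dest_reg;   /* program-specified destination     */
    bool ready;      /* has the instruction completed?    */
};

struct rob {
    struct rob_entry e[ROB_SIZE];
    int head, tail, count;   /* circular queue in program order */
};

/* Issue allocates an entry at the tail; returns false when the buffer is
 * full, which is exactly the condition under which issue must stall.     */
static bool rob_issue(struct rob *r, int instr_id, int dest_reg) {
    if (r->count == ROB_SIZE)
        return false;
    r->e[r->tail] = (struct rob_entry){ instr_id, 0, dest_reg, false };
    r->tail = (r->tail + 1) % ROB_SIZE;
    r->count++;
    return true;
}

/* Commit retires entries from the head only while they are ready. */
static void rob_commit(struct rob *r, long reg_file[]) {
    while (r->count > 0 && r->e[r->head].ready) {
        struct rob_entry *h = &r->e[r->head];
        reg_file[h->dest_reg] = h->value;   /* copy result to its destination */
        r->head = (r->head + 1) % ROB_SIZE;
        r->count--;
    }
}

int main(void) {
    struct rob r = { .head = 0, .tail = 0, .count = 0 };
    long regs[16] = { 0 };

    rob_issue(&r, 1, 3);                       /* instr 1 will write R3      */
    rob_issue(&r, 2, 4);                       /* instr 2 will write R4      */
    r.e[1].value = 99; r.e[1].ready = true;    /* instr 2 completes first    */
    rob_commit(&r, regs);                      /* nothing commits: head not ready */
    r.e[0].value = 42; r.e[0].ready = true;    /* instr 1 completes          */
    rob_commit(&r, regs);                      /* both commit, in program order   */
    printf("R3=%ld R4=%ld\n", regs[3], regs[4]);   /* R3=42 R4=99 */
    return 0;
}
```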
The result value of any other instruction lower down in the reorder buffer, say value[i+k], can also be used as an input operand for a subsequent operation—provided of course that the instruction has completed and therefore its result value is available, as indicated by the corresponding flag ready[i+k]. In this sense, we see that the technique of operand forwarding can be combined with the concept of the reorder buffer.

It should be noted here that operands at the input latches of functional units, as well as values stored in the reorder buffer on behalf of completed but uncommitted instructions, are simply 'work in progress'. These values are not reflected in the state of the program or the processor, as needed for a context switch or for exception handling.
We now take a brief look at how the use of the reorder buffer addresses the various types of dependences in the program.

(i) Data Dependences  A RAW dependence—i.e. true data dependence—will hold up the execution of the dependent instruction if the result value required as its input operand is not available. As suggested above, operand forwarding can be added to this scheme to speed up the supply of the needed input operand as soon as its value has been computed.

WAR and WAW dependences—i.e. anti-dependence and output dependence, respectively—also hold up the execution of the dependent instruction and create a possible pipeline stall. We shall see below that the technique of register renaming is needed to avoid the adverse impact of these two types of dependences.
(ii) Control Dependences  Suppose the instruction(s) in the reorder buffer belong to a branch in the program which should not have been taken—i.e. there has been a mis-predicted branch. Clearly then the
reorder buffer should be flushed along with other elements of the pipeline. Therefore the performance impact of control dependences in the running program is determined by the accuracy of the branch prediction technique employed. The reorder buffer plays no direct role in the handling of control dependences.
(iii) Resource Dependences  If an instruction needs a functional unit to execute, but the unit is not free, then the instruction must wait for the unit to become free; clearly no technique in the world can change that. In such cases, the processor designer can aim to achieve at least this: if a subsequent instruction needs to use another functional unit which is free, then the subsequent instruction can be executed out of order.

However, the reorder buffer queues and commits instructions in program order. In this sense, therefore, the technique of using a reorder buffer does not address explicitly the resource dependences existing within the instruction stream; with multiple functional units, the processor can still achieve out of order completion of instructions.
In essence, the conceptually simple technique of the reorder buffer ensures that if instructions as programmed can be carried out in parallel—i.e. if there are no dependences amongst them—then they are carried out in parallel. But nothing clever is attempted in this technique to resolve dependences. Instruction issue and commit are in program order; program state and processor state are correctly preserved.

We shall now discuss a clever technique which alleviates the adverse performance effect of WAR and WAW dependences amongst instructions.
REGISTER RENAMING

Traditional compilers allocate registers to program variables in such a way as to reduce the main memory accesses required in the running program. In the programming language C, in fact, the programmer can even pass a hint to the compiler that a variable be maintained in a processor register.

Traditional compilers and assembly language programmers work with a fairly small number of programmable registers. The number of programmable registers provided on a processor is determined by either

(i) the need to maintain backward instruction compatibility with other members of the processor family, or
(ii) the need to achieve reasonably compact instruction encoding in binary. With sixteen programmable registers, for example, four bits are needed for each register specified in a machine instruction.
Amongst the instructions in various stages of execution within the processor, there would be occurrences of RAW, WAR and WAW dependences on programmable registers. As we have seen, RAW is true data dependence—since a value written by one instruction is used as an input operand by another. But a WAR or WAW dependence can be avoided if we have more registers to work with. We can simply remove such a dependence by getting the two instructions in question to use two different registers.

But we must also assume that the instruction set architecture (ISA) of the processor is fixed—i.e. we cannot change it to allow access to a larger number of programmable registers. Rather, our aim here is to explore techniques to detect and exploit instruction level parallelism using a given instruction set architecture.
Therefore the only way to make a larger number of registers available to instructions under execution within the processor is to make the additional registers invisible to machine language instructions. Instructions under execution would use these additional registers, even if instructions making up the machine language program stored in memory cannot refer to them.

Let us suppose that we have several such additional registers available, to which machine instructions of the running program cannot make any direct reference. Of course these machine instructions do refer to programmable registers in the processor—and thereby create the WAR and WAW dependences which we are now trying to remove.
For example, let us say that the instruction:

FADD R1, R2, R5

is followed by the instruction:
(5) In fact the processor may also rename R5 in FADD to another program invisible register, say Y. But clearly the argument made here still remains valid.
A similar argument applies if Ij is reading the value in Rk, and a subsequent instruction is writing into Rk—i.e. there is a WAR dependence between them.

The technique outlined, which can resolve WAR and WAW dependences, is known as register renaming. Both these dependences are caused by a subsequent instruction writing into a register being used by a previous instruction. Such dependences do not reflect program logic, but rather the use of a limited number of registers.
Let us now consider a simple example of WAR dependence, i.e. of anti-dependence. The case of WAW dependence would be very similar.
Thus we see that register renaming removes WAR and WAW dependences from the instruction stream by re-mapping programmable registers to a larger pool of program invisible registers. For this, the processor must have extra registers to handle instructions under execution, but these registers do not appear in the instruction set.

Consider true data dependence, i.e. RAW dependence, between two instructions. Under register renaming, the write operation and the subsequent read operation both occur on the same program invisible register. Thus RAW dependence remains intact in the instruction stream—as it should, since it is true data dependence. As seen above, its impact on the pipeline operation can be reduced by operand forwarding.
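A minimal sketch of the re-mapping idea follows: a rename table maps each programmable (architectural) register to the program invisible (physical) register currently holding its latest value, and every instruction that writes a register is handed a fresh physical register. The table sizes and names below are illustrative assumptions, not those of any particular processor, and the sketch ignores reclaiming physical registers.

```c
#include <stdio.h>

#define ARCH_REGS 16    /* programmable registers visible in the ISA      */
#define PHYS_REGS 64    /* larger pool of program invisible registers     */

static int rename_table[ARCH_REGS];   /* arch register -> current phys register */
static int next_phys = ARCH_REGS;     /* next free physical register (no reuse
                                          handled in this simple sketch)        */

/* Rename one instruction "dest = src1 op src2": sources read the current
 * mapping, and the destination is given a fresh physical register, which
 * removes any WAR or WAW dependence on that architectural register.      */
static void rename(int dest, int src1, int src2) {
    int p1 = rename_table[src1];
    int p2 = rename_table[src2];
    int pd = next_phys++;
    rename_table[dest] = pd;
    printf("sources R%d,R%d -> P%d,P%d ; destination R%d now maps to P%d\n",
           src1, src2, p1, p2, dest, pd);
}

int main(void) {
    for (int r = 0; r < ARCH_REGS; r++)
        rename_table[r] = r;   /* initial identity mapping */

    rename(5, 1, 2);   /* I1: writes R5 -> P16                                  */
    rename(6, 5, 3);   /* I2: reads R5 as P16 (RAW preserved), writes R6 -> P17 */
    rename(5, 7, 8);   /* I3: writes R5 -> P18, so the WAW with I1 and the WAR
                              with I2 both disappear                            */
    return 0;
}
```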
Dependences are also caused by reads and writes to memory locations. In general, however, whether two instructions refer to the same memory location can only be known after the two effective addresses are calculated during execution. For example, the two memory references 2000[R1] and 4000[R3] occurring in a running program may or may not refer to the same memory location—this cannot be resolved at compile time.
Resolution of whether two memory references point to the same memory location is known as alias analysis, which must be carried out on the basis of the two effective memory addresses. If a load and a store operation to memory refer to two different addresses, their order may be interchanged. Such capability can be built into the load-store unit—which in essence operates as another functional unit of the processor.
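The check described above can be sketched as follows; this is a simplified illustration, and the function name, the address values and the queue representation are assumed for the example only.

```python
# Sketch of the reordering check a load-store unit might apply once effective
# addresses are known (addresses and queue format are illustrative assumptions).

def may_reorder(load_addr, pending_store_addrs):
    """A load may be moved ahead of earlier stores only if its effective
    address differs from every pending store address (no aliasing)."""
    return all(load_addr != s for s in pending_store_addrs)

# Two references such as 2000[R1] and 4000[R3] may or may not alias;
# only the computed effective addresses can decide.
print(may_reorder(0x1200, [0x1400]))  # True  -> load can bypass the store
print(may_reorder(0x1200, [0x1200]))  # False -> same location, order must be kept
```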
An elegant implementation of register renaming and operand forwarding in a high performance processor was seen as early as in 1967—even before the term register renaming was coined. This technique—which has since become well-known as Tomasulo's algorithm—is described in the next section.
TOMASULO'S ALGORITHM
In the IBM 360 family of computer systems of the 1960s and 1970s, model 360/91 was developed as a high performance system for scientific and engineering applications, which involve intensive floating point computations. The processor in this system was designed with multiple floating point units, and it made use of an innovative algorithm for the efficient use of these units. The algorithm was based on operand forwarding over a common data bus, with tags to identify sources of data values sent over the bus.
The algorithm has since become known as Tomasulo's algorithm, after the name of its chief designer[20]; what we now understand as register renaming was also an implicit part of the original algorithm.
Recall that, for register renaming, we need a set of program invisible registers to which programmable registers are re-mapped. Tomasulo's algorithm requires these program invisible registers to be provided within the reservation stations of functional units.
Let us assume that the functional units are internally pipelined, and can complete one operation in every clock cycle. Therefore each functional unit can initiate one operation in every clock cycle—provided of course that a reservation station of the unit is ready with the required input operand value or values. Note that the exact depth of this functional unit pipeline does not concern us for the present.
[20] See An Efficient Algorithm for Exploiting Multiple Arithmetic Units, by R. M. Tomasulo, IBM Journal of Research & Development 11:1, January 1967. A preliminary discussion of Tomasulo's algorithm was included in Chapter 6.
Figure 12.12 shows such a functional unit connected to the common data bus, with three reservation stations provided on it.
Fig. 12.12 A functional unit with three reservation stations (each holding an operation code and two operand/tag slots), connected to the common data bus
When the needed operand value or values are available in a reservation station, the functional unit can initiate the required operation in the next clock cycle.
At the time of instruction issue, the reservation station is filled out with the operation code (op). If an operand value is available, for example in a programmable register, it is transferred to the corresponding source operand field in the reservation station.
However, if the operand value is not available at the time of issue, the corresponding source tag (t1 and/or t2) is copied into the reservation station. The source tag identifies the source of the required operand. As soon as the required operand value is available at its source—which would typically be the output of a functional unit—the data value is forwarded over the common data bus, along with the source tag. This value is copied into all the reservation station operand slots which have the matching tag.
Thus operand forwarding is achieved here with the use of tags. All the destinations which require a data value receive it in the same clock cycle over the common data bus, by matching their stored operand tags with the source tag sent out over the bus.
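A minimal sketch of this tag mechanism is given below, assuming a register status table and reservation station entries represented as Python dictionaries; the field names and tag format are illustrative, not those of the IBM 360/91.

```python
# Minimal sketch of tag-based operand forwarding over a common data bus
# (field names, tag format and the register-status table are assumptions).

reg_status = {}        # programmable register -> pending source tag, if any
reg_value = {}         # programmable register -> last committed value
stations = []          # reservation station entries awaiting operands

def issue(op, src1, src2, dest, my_tag):
    """Issue an instruction: copy available operand values, or else source tags."""
    entry = {"op": op, "tag": my_tag,
             "v1": None if src1 in reg_status else reg_value.get(src1),
             "t1": reg_status.get(src1),
             "v2": None if src2 in reg_status else reg_value.get(src2),
             "t2": reg_status.get(src2)}
    stations.append(entry)
    reg_status[dest] = my_tag          # dest will be produced by this station

def broadcast(tag, value):
    """Common data bus: every waiting slot with a matching tag gets the value."""
    for e in stations:
        if e["t1"] == tag: e["v1"], e["t1"] = value, None
        if e["t2"] == tag: e["v2"], e["t2"] = value, None
    for r, t in list(reg_status.items()):
        if t == tag:                   # a register receives the value only if it is
            reg_value[r] = value       # still waiting on this particular tag
            del reg_status[r]
```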
Example 12.7 Tomasulo's algorithm and RAW dependence
Assume that instruction I1 is to write its result into R4, and that two subsequent instructions I2 and I3 are to read—i.e. make use of—that result value. Thus instructions I2 and I3 are truly data dependent (RAW dependent) on instruction I1. See Fig. 12.13.
It may be noted from Example 12.7 that, in effect, programmable registers become renamed to operand registers within reservation stations, which are program invisible. As we have seen in the previous section, such renaming also resolves anti-dependences and output dependences, since the target register of the dependent instruction is renamed in these cases to a different program invisible register.
Example 12.8 Combination of RAW and WAR dependences
Let us now consider a combination of RAW and WAR dependences.
Assume that instruction I1 is to write its result into R4, a subsequent instruction I2 is to read that result value, and a later subsequent instruction I3 is then to write its result into R4. Thus instruction I2 is truly data dependent (RAW dependent) on instruction I1, but I3 is anti-dependent (WAR dependent) on I2. See Fig. 12.14.
Fig. 12.14 Instructions I1, I2 and I3, with the RAW dependence of I2 on I1 (through R4) and the WAR dependence of I3 on I2 (through R4)
As in the previous example, and keeping in mind similar possibilities, let us assume once again that the output of I1 is not available when I2 and I3 are issued; thus R4 has the source tag value corresponding to the output of I1.
When I2 is issued, it is parked in the reservation station of the appropriate functional unit. Since the required result value from I1 is not available, the reservation station entry of I2 also gets the source tag corresponding to the output of I1—i.e. the same source tag value which has been assigned to register R4, since they are both awaiting the same result.
The question now is: Can I3 be issued even before I1 completes and I2 starts execution?
The answer is that, with register renaming—carried out here using source tags—I3 can be issued even before I2 starts execution.
Recall that instruction I2 is RAW dependent on I1, and therefore it has the correct source tag for the output of I1. I2 will receive its required input operand as soon as that is available, when that value would also be copied into R4 over the common data bus. This is exactly what we observed in the previous example.
But suppose I3 is issued even before the output of I1 is available. Now R4 should receive the output of I3 rather than the output of I1. This is simply because, in register R4, the output of I1 is programmed to be overwritten by the output of I3.
Thus, when I3 is issued, R4 will receive the source tag value corresponding to the output of I3—i.e. the functional unit which performs the operation of I3. Its previous source tag value corresponding to the output of I1 will be overwritten.
When the output of I1 (finally) becomes available, it goes to the input of I2, but not to register R4, since this register's source tag now refers to I3. When the output of I3 becomes available, it goes correctly to R4 because of the matching source tag.
For simplicity of discussion, we have not tracked here the outputs of I2 and I3. But the student can verify easily that the two data transfers described above are consistent with the specified sequence of three instructions and the specified dependences.
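Using the sketch given after Fig. 12.12 (with the same assumed field names), the tag traffic of this example can be traced as follows; the opcodes, the tag names and every register other than R4 are hypothetical, chosen only to make the trace runnable.

```python
# Tracing Example 12.8 with the earlier sketch; tags T1, T2, T3 and the
# source registers R1, R2, R5, R6 are illustrative assumptions.
reg_value.update({"R1": 1.0, "R2": 2.0, "R5": 5.0, "R6": 6.0})

issue("FADD", "R1", "R2", "R4", "T1")   # I1: writes R4; R4's pending tag is now T1
issue("FMUL", "R4", "R5", "R7", "T2")   # I2: RAW on R4 -> its slot waits on tag T1
issue("FSUB", "R5", "R6", "R4", "T3")   # I3: overwrites R4's pending tag with T3

broadcast("T1", 3.0)    # output of I1: copied to I2's waiting slot, NOT to R4
broadcast("T3", -1.0)   # output of I3: copied to R4, whose tag matches T3
print(reg_value["R4"])  # -1.0, as program order requires (I2's own output is not traced)
```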
Let us assume that, without any unrolling by the compiler, this loop executes on a processor which provides branch prediction and implements Tomasulo's algorithm. If instructions from successive loop iterations are available in the processor at one time—because of successful branch prediction(s)—and if floating point units are available, then instructions from successive iterations can execute at one time, in parallel.
But if instructions from multiple iterations are thus executing in parallel within the processor—at one time—then the net effect of these hardware techniques in the processor is the same as that of an unrolled loop. In other words, the processor hardware achieves on the fly what otherwise would require unrolling assistance from the compiler!
Even the dependence shown in Example 12.5 across successive loop iterations is handled in a natural way by branch prediction and Tomasulo's algorithm. Basically this dependence across loop iterations becomes RAW dependence between instructions, and is handled in a natural way by source tags and operand forwarding.
This example brings out clearly how a particular method of exploiting parallelism—loop unrolling, in this case—can be implemented either by the compiler or, equivalently, by clever hardware techniques employed within the processor.
Example 12.9 illustrates the combined power of sophisticated hardware techniques for dynamic scheduling and branch prediction. With such efficient techniques becoming possible in hardware, the importance of compiler-detected parallelism (Section 12.5) diminishes somewhat in comparison.
Example 12.10 Calculation of processor clock cycles
Let us consider the number of clock cycles it takes to execute the following sequence of machine instructions. We shall count clock cycles starting from the last clock cycle of instruction 1, so that the answer is independent of the depth of the instruction pipeline.
We shall assume that (a) one instruction is issued per clock cycle, (b) floating point operations take two clock cycles each to execute, and (c) memory operations take one clock cycle each when there is an L1 cache hit.
If we add the number of clock cycles needed for each instruction, we get the total as 1+2+1+2+1 = 7. However, if no operand forwarding is provided, the RAW dependences on registers R4 and R7 will cost three additional clock cycles (recall Fig. 12.7), for a total of 10 clock cycles for the given sequence of instructions.
With operand forwarding—which is built into Tomasulo's algorithm—one clock cycle is saved on account of each RAW dependence—i.e. between (i) instructions 1 and 2, (ii) instructions 2 and 3, and (iii) instructions 4 and 5.
Thus the total number of clock cycles required, counting from the last clock cycle of instruction 1, is 7. With the assumptions as made here, there is no further scope to schedule these instructions in parallel.
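The counting used in this example can be reproduced with the short calculation below; the latency list and the RAW-dependent pairs follow the assumptions stated above, and the one-cycle stall per unforwarded RAW dependence is the simplification used in the example.

```python
# Reproducing the cycle counting of Example 12.10 under the stated assumptions.

latencies = [1, 2, 1, 2, 1]            # execute cycles per instruction, as assumed above
raw_pairs = [(1, 2), (2, 3), (4, 5)]   # RAW-dependent instruction pairs

base = sum(latencies)                  # 1+2+1+2+1 = 7 cycles
print(base + len(raw_pairs))           # 10 cycles without operand forwarding
print(base)                            # 7 cycles with operand forwarding
```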
In Tomasulo's algorithm, use of the common data bus and operand forwarding based on source tags results in decentralized control of the multiple instructions in execution. In the 1960s and 1970s, Control Data Corporation developed supercomputers CDC 6600 and CDC 7600 with a centralized technique to exploit instruction level parallelism.
In these supercomputers, the processor had a centralized scoreboard which maintained the status of functional units and executing instructions (see Chapter 6). Based on this status, processor control logic governed the issue and execution of instructions. One part of the scoreboard maintained the status of every instruction under execution, while another part maintained the status of every functional unit. The scoreboard itself was updated at every clock cycle of the processor, as execution progressed.
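A skeleton of the two scoreboard tables, as described above, might look like the following; the field names and entries are illustrative only and omit much of the detail kept by the real CDC scoreboards.

```python
# Illustrative skeleton of the two parts of a centralized scoreboard
# (instruction ids, stage names and unit names are assumptions).

instruction_status = {
    # instruction id -> stage reached so far
    "I1": "issued", "I2": "reading operands", "I3": "executing",
}
functional_unit_status = {
    # unit -> (busy?, operation, destination register, source registers)
    "FP-ADD": (True,  "FADD", "R4", ("R1", "R2")),
    "FP-MUL": (False, None,   None, None),
}
# In each clock cycle, centralized control logic consults and updates both
# tables to decide which instructions may issue, read operands, execute,
# or write back their results.
```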
BRANCH PREDICTION
The importance of branch prediction for multiple issue processor performance has already been discussed in Section 12.3. About 15% to 20% of instructions in a typical program are branch and jump instructions, including procedure returns. Therefore—if hardware resources are to be fully utilized in a superscalar processor—the processor must start working on instructions beyond a branch, even before the branch instruction itself has completed. This is only possible through some form of branch prediction.
What can be the logical basis for branch prediction? To understand this, we consider first the reasoning which is involved if one wishes to predict the result of a tossed coin.
[Footnote] For a detailed discussion, with applications, the reader may refer to the book Artificial Intelligence: A Modern Approach, by Russell and Norvig, Pearson Education.
Like tossed coins, outcomes of conditional branches in computer programs also have yes and no answers—i.e. a branch is either taken or not taken. But outcomes of conditional branches are in fact biased—because there is strong correlation between (a) successive branches taken at the same conditional branch instruction in a program, and (b) branches taken at two different conditional branch instructions in the same program.
This is how programs behave, i.e. such correlation is an essential property of real-life programs. And such correlation provides the logical basis for branch prediction. The issue for processor designers is how to discover and utilize this correlation on the fly—without incurring prohibitive overhead in the process.
A basic branch prediction technique uses a so-called two-bit predictor. A two-bit counter is maintained for every conditional branch instruction in the program. The two-bit counter has four possible states; these four states and the possible transitions between these states are shown in Fig. 12.15.
Fig. 12.15 State transition diagram of the two-bit branch predictor (states 0 to 3; solid arrows show transitions on correct predictions, broken arrows show transitions on mis-predictions)
When the counter state is 0 or 1, the respective branch is predicted as taken; when the counter state is 2 or 3, the branch is predicted as not taken. When the conditional branch instruction is executed and the actual branch outcome is known, the state of the respective two-bit counter is changed as shown in the figure using solid and broken line arrows.
When two successive predictions come out wrong, the prediction is changed from branch taken to branch not taken, and vice versa. In Fig. 12.15, state transitions made on mis-predictions are shown using broken line arrows, while solid line arrows show state transitions made on predictions which come out right.
[Footnote] Note that Fig. 12.15 is a slightly redrawn version of the state transition diagram shown earlier in Fig. 6.19(b).
This scheme uses a two-bit counter for every conditional branch, and there are many conditional branches in the program. Overall, therefore, this branch prediction logic needs a few kilobytes or more of fast memory. One possible organization for this branch prediction memory is in the form of an array which is indexed by low order bits of the instruction address. If twelve low order bits are used to define the array index, for example, then the number of entries in the array is 4096.
To be effective, branch prediction should be carried out as early as possible in the instruction pipeline. As soon as a conditional branch instruction is decoded, branch prediction logic should predict whether the branch is taken. Accordingly, the next instruction address should be taken either as the branch target address (i.e. branch is taken), or the sequentially next address in the program (i.e. branch is not taken).
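A sketch of such a predictor is shown below, assuming a 4096-entry table of two-bit counters indexed by twelve low order address bits; the state encoding (0 and 1 predict taken, 2 and 3 predict not taken) follows the description above, while the function names are assumptions.

```python
# Sketch of a two-bit branch predictor table indexed by low order PC bits.

INDEX_BITS = 12
counters = [0] * (1 << INDEX_BITS)   # states 0,1 -> predict taken; 2,3 -> predict not taken

def predict(pc):
    """Return True if the branch at this address is predicted taken."""
    return counters[pc & ((1 << INDEX_BITS) - 1)] < 2

def update(pc, taken):
    """Move the counter one step toward the actual outcome (saturating)."""
    i = pc & ((1 << INDEX_BITS) - 1)
    if taken:
        counters[i] = max(counters[i] - 1, 0)   # toward the "taken" states
    else:
        counters[i] = min(counters[i] + 1, 3)   # toward the "not taken" states
```

With this encoding, two successive mis-predictions are exactly what is needed to cross from the pair of "taken" states into the pair of "not taken" states, or back again.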
Can branch prediction be carried out even before the instruction is decoded—i.e. at the instruction fetch stage? Yes, if a so-called branch target buffer is provided which has a history of recently executed conditional branches. The branch target buffer is organized as an associative memory accessed by the instruction address; this memory provides quick access to the prediction and the target instruction address needed.
In some programs, whether a conditional branch is taken or not taken correlates better with other conditional branches in the program—rather than with the earlier history of outcomes of the same conditional branch. Accordingly, correlated predictors can be designed, which generate a branch prediction based on whether other conditional branches in the program were taken or not taken.
Branch prediction based on the earlier history of the same branch is known as local prediction, while prediction based on the history of other branches in the program is known as global prediction. A tournament predictor uses (i) a global predictor, (ii) a local predictor, and (iii) a selector which selects one of the two predictors for prediction at a given branch instruction. The selector uses a two-bit counter per conditional branch—as in Fig. 12.15—to choose between the global and local predictors for the branch. Two successive mis-predictions cause a switch from the local predictor to the global predictor, and vice versa; the aim is to infer which predictor works better for the particular branch.
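The selector can be sketched as another two-bit counter per branch, as below; the chooser encoding and its update rule are one plausible reading of the description above, and the local and global predictors themselves are assumed to be given.

```python
# Sketch of the per-branch selector ("chooser") in a tournament predictor.

chooser = {}   # branch address -> 0..3; states 0,1 -> use local, 2,3 -> use global

def tournament_predict(pc, local_pred, global_pred):
    return local_pred if chooser.get(pc, 0) < 2 else global_pred

def tournament_update(pc, selected_was_correct):
    c = chooser.get(pc, 0)
    using_local = c < 2
    if selected_was_correct:
        c = max(c - 1, 0) if using_local else min(c + 1, 3)  # reinforce current choice
    else:
        c = c + 1 if using_local else c - 1                  # two misses in a row flip the choice
    chooser[pc] = c
```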
The common element in all these cases is that branch prediction relies on the correlation detected between branches taken or not taken in the running program—and for this, an efficient hardware implementation of the required prediction logic is required.
Considerations outlined here apply also to jump prediction, which is applicable to indirect jumps, computed goto statements (used in FORTRAN), and switch statements (used in C and C++). Procedure returns can also benefit from a form of jump prediction. The reason is that the address associated with a procedure return is obtained from the runtime procedure stack in main memory; therefore a correct prediction of the return address can save memory access and a few processor clock cycles.
It is also possible to design the branch prediction logic to utilize information gleaned from a prior execution profile or execution trace of the program. If the same program is going to run on dedicated hardware for years—say for an application such as weather forecasting—then such special effort put into speeding up the program on that dedicated hardware can pay very good dividends over the life of the application. Suppose the execution trace informs us that a particular branch is taken 95% of the time, for example. Then it is a good idea to 'predict' the particular branch as always taken—in this case, we are assured that 95% of the predictions made will be correct!
[Footnote] Clearly, if two conditional branch instructions happen to have the same low order bits, then their predictions will become 'intermingled'. But the probability of two or more such instructions being in execution at the same time would be quite low.
As we discussed in Section 12.3, under any branch prediction scheme, a mis-predicted branch means that subsequent instructions must be flushed from the pipeline. It should of course be noted here that the actual result of a conditional branch instruction—as against its predicted result—is only known when the instruction completes execution.
Speculative Execution  Instructions executed on the basis of a predicted branch, before the actual branch result is known, are said to involve speculative execution.
If a branch prediction turns out to be correct, the corresponding speculatively executed instructions must be committed. If the prediction turns out to be wrong, the effects of corresponding speculative operations carried out within the processor must be cleaned up, and instructions from another branch of the program must instead be executed.
As we have seen in Example 12.2, the strategy results in net performance gain if branch predictions are made with sufficiently high accuracy. The performance benefit of branch prediction can only be gained if prediction is followed by speculative execution.
A conventional processor fetches one instruction after another—i.e. it does not look ahead into the forthcoming instruction stream more than one instruction at a time. To support a deeper and wider—i.e. multiple issue—instruction pipeline, it is necessary for branch prediction and dynamic scheduling logic to look further out into the forthcoming instruction stream. In other words, more of the likely future instructions need to be examined in support of multiple issue scheduling and branch prediction.
The instruction window—or simply window—is the special memory provided upstream of the fetch unit in the processor to thus look ahead into the forthcoming instruction stream. For the targeted processor performance, the processor designers must integrate and balance the hardware techniques of branch prediction, dynamic scheduling, speculative execution, internal data paths, functional units, and an instruction window of appropriate size.
[Footnote] The interested student may read Limits of Instruction-Level Parallelism, by D. W. Wall, Research Report 93/6, Western Research Laboratory, Digital Equipment Corporation, November 1993. Note 12.4 below is a brief summary of this technical report.
Similarly, multiple loads and stores would be in progress at one time. Also, dynamic scheduling would require a fairly large instruction window, to maintain the issue rate at the targeted four instructions per clock cycle.
Consider the instruction window. Instructions in the window must be checked for dependences, to support out of order issue. This requires associative memory and its control logic, which means an overhead in chip area and power consumption; such overhead would increase with window size. Similarly, any form of checking amongst executing instructions—e.g. checking addresses of main memory references, for alias analysis—would involve overhead which increases with issue multiplicity k. In turn, such increased overhead in aggressive pursuit of instruction level parallelism would adversely impact the processor clock speed which is achievable, for a given VLSI technology.
Also, with greater issue multiplicity k, there would be higher probability of fewer than k instructions being issued in some clock cycles. The reason for this can be simply that a functional unit is not available, or that true RAW dependences amongst instructions hold up instruction issue. This would result in missing the target performance of the processor in actual applications, in terms of issue multiplicity k. Let us say processor A has k = 6 but it is only 60% utilized on average in actual applications; processor B, with the same instruction set but with k = 4, might have a faster clock rate and also higher average utilization, thus giving better performance than A on actual applications.
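A back-of-the-envelope comparison along these lines is sketched below; processor A's issue width and 60% utilization are taken from the text, while the clock rates and processor B's utilization are assumed figures chosen only to illustrate the trade-off.

```python
# Rough throughput comparison of the two hypothetical processors described above
# (clock rates and processor B's utilization are assumed, not from the text).

def throughput(issue_width, utilization, clock_ghz):
    return issue_width * utilization * clock_ghz   # useful instructions per ns, roughly

a = throughput(6, 0.60, 2.0)    # processor A: k = 6, 60% utilized, 2.0 GHz (assumed clock)
b = throughput(4, 0.80, 2.5)    # processor B: k = 4, 80% utilized, 2.5 GHz (assumed figures)
print(a, b)                     # about 7.2 vs 8.0 -> the narrower but better-utilized B wins
```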
The increased overhead also necessitates a larger number of stages in the instruction pipeline, so as to limit the total delay per stage and thereby achieve a faster clock cycle; but a longer pipeline results in higher cost of flushing the pipeline. Thus the aggregate performance impact of increased overhead finally places limits on what is achievable in practice with aggressively superscalar, VLIW and EPIC architectures.
Basically, the increased overhead required within the processor implies that:
(i) To support higher multiplicity of instruction issue, the amount of control logic required in the processor increases disproportionately, and
(ii) For higher throughput, the processor must also operate at a high clock rate.
But these two design goals are often at odds, for technical reasons of circuit design, and also because there are practical limits on the amount of power the chip can dissipate.
Power consumption of a chip is roughly proportional to N × f², where N is the number of devices on the chip, and f is the clock rate. The number of devices on the chip is largely determined by the fabrication technology being used, and power consumption must be held within the limits of the heat dissipation possible.
Therefore the question for processor designers is: For a targeted processor performance, how best to select and utilize the various chip resources available, within the broad design constraints of the given circuit technology?
The student may recall that this was the introductory theme of this chapter (Section 12.1), and should note that such design trade-offs are shaping processor design today. To achieve their goals, processor designers make use of extensive software simulations of the processor, using various benchmark programs within the target range of applications. The designers' own experience and insights supplement the simulation results in the process of generating solutions to the actual problems of processor design.
[Footnote] In this connection, see also the discussion in the latter part of Section 12.5.
Emergence of hardware support for multi-threading and of multi-core chips, which we shall discuss in the next section, is due in part to the practical limits which have been encountered in exploiting the implicit parallelism within a single instruction stream.
[Footnote] In the world of ancient Greece, an oracle was a power which could predict future events; one well-known and presumably reliable oracle was at the temple of Delphi.
For example, suppose one of the programs listed in Table 12.1 executed thirty million instructions, as seen from its execution trace. Suppose further that, for a given processor configuration, these instructions could be packed into six million processor cycles. Then the average degree of parallelism obtained for this program, for this particular processor configuration, would be 30/6 = 5.
Techniques Explored  The range of techniques which was explored in the study to detect and exploit instruction level parallelism is summarized briefly below:
Register renaming—with (a) infinite number of registers as renaming targets, (b) finite number of registers, and (c) no register renaming.
Alias analysis—with (a) perfect alias analysis, (b) two intermediate levels of alias analysis, and (c) no alias analysis.
Branch prediction—with (a) perfect branch prediction, (b) three hardware-based branch prediction schemes, (c) three profile-based branch prediction schemes, and (d) no branch prediction. Hardware predictors used a combination of local and global tables, with different total table sizes. Some branch fanout limits were also applied.
Indirect jump prediction—with (a) perfect prediction, (b) intermediate level of prediction, and (c) no indirect jump prediction.
Window size—for some of the processor models, different window sizes from an upper limit of 2048 instructions down to 4 instructions were used.
Cycle width, i.e. the maximum number of instructions which can be issued in one cycle—(a) 64, (b) 128, and (c) bounded only by window size. Note that, from a practical point of view, cycle widths of both 64 and 128 are on the high side.
Latencies of processor operations—five different latency models were used, specifying latencies (in number of clock cycles) for various processor operations.
Loop unrolling—was carried out in some of the programs.
Misprediction penalty—values of 0 to 10 clock cycles were used.
Conclusions Reached  With 13 programs and more than 350 processor configurations, it should come as no surprise to the student that Wall's research generated copious results. These results are presented systematically in the full report, which is available on the web. For our purposes, we summarize below the main conclusions of the report.
For the overall degree of parallelism found, the report says:
Using nontrivial but currently known techniques, we consistently got parallelism between 4 and 10 for most of the programs in our test suite. Vectorizable, or nearly vectorizable, programs went much higher.
Branch prediction and speculative execution is identified as the major contributor in the exploitation of instruction level parallelism:
Speculative execution driven by good branch prediction is critical to the exploitation of more than modest amounts of instruction-level parallelism. If we start with the Perfect model and remove branch prediction, the median parallelism plummets from 30.6 to 2.3.
Perfect model in the above excerpt refers to a processor which performs perfect branch prediction, jump prediction, register renaming, and alias analysis. The student will appreciate readily that this is an ideal which is impossible to achieve in practice.
Overall, Wall's study reports good results for the degree of parallelism; but the report also goes on to say that the results are based on 'rather optimistic assumptions'. In the actual study, this meant: (i) as many copies of functional units as needed, (ii) a perfect memory system with no cache misses, (iii) no penalty for missed predictions, and (iv) no speed penalty of the overhead for aggressive pursuit of instruction level parallelism.
Clearly no real hardware processor can satisfy such ideal assumptions. After listing some more factors which lead to optimistic results, the report concludes:
Any one of these considerations could reduce the expected payoff of an instruction-parallel machine; together they could eliminate it completely.
The broad conclusion of Wall's research study therefore certainly seems to support the proverb quoted at the start of this section.
In Chapter 13, we shall review recent advances in technology which have had a major impact on processor design, and we shall also look at some specific commercial products introduced in recent years. We shall see that the basic techniques and trade-offs discussed in this chapter are reflected, in one form or another, in the processors and systems-on-a-chip introduced in recent years.
We have already seen that dependences amongst machine instructions limit the amount of instruction level parallelism which is available to be exploited within the processor. The dependences may be true data dependences (RAW), control dependences introduced by conditional branch instructions, or resource dependences.
One way to reduce the burden of dependences is to combine—with hardware support within the processor—instructions from multiple independent threads of execution. Such hardware support for multi-threading would provide the processor with a pool of instructions, in various stages of execution, which have a relatively smaller number of dependences amongst them, since the threads are independent of one another.
Let us consider once again the processor with instruction pipeline of depth eight, and with targeted superscalar performance of four instructions completed in every clock cycle (see Section 12.1). Now suppose that these instructions come from four independent threads of execution. Then, on average, the number of instructions in the processor at any one time from any one thread would be 4 × 8 / 4 = 8.
With the threads being independent of one another, there is a smaller total number of data dependences amongst the instructions in the processor. Further, with control dependences also being separated into four threads, less aggressive branch prediction is needed.
Another major benefit of such hardware-supported multi-threading is that pipeline stalls are very effectively utilized. If one thread runs into a pipeline stall—for access to main memory, say—then another thread makes use of the corresponding processor clock cycles, which would otherwise be wasted. Thus hardware support for multi-threading becomes an important latency hiding technique.
To provide support for multi-threading, the processor must be designed to switch between threads—either on the occurrence of a pipeline stall, or in a round robin manner. As in the case of the operating system switching between running processes, in this case the hardware context of a thread within the processor must be preserved.
But in this case what exactly is the meaning of the context of a thread?
Basically, thread context includes the full set of registers (programmable registers and those used in register renaming), PC, stack pointer, relevant memory map information, protection bits, interrupt control bits, etc. For N-way multi-threading support, the processor must store at one time the thread contexts of N executing threads. When the processor switches, say, from thread A to thread B, control logic ensures that execution of subsequent instruction(s) occurs with reference to the context of thread B.
Note that thread contexts need not be saved and later restored. As long as the processor retains within itself multiple thread contexts, all that is required is that the processor be able to switch between thread contexts from one clock cycle to the next.
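A sketch of the per-thread context such a processor must hold on chip is given below; the field list follows the description above, while the types, the class name and the switching function are illustrative assumptions.

```python
# Sketch of the per-thread context held on chip by a multi-threaded processor.

from dataclasses import dataclass, field

@dataclass
class ThreadContext:
    registers: dict = field(default_factory=dict)  # programmable + rename registers
    pc: int = 0
    stack_pointer: int = 0
    memory_map: dict = field(default_factory=dict)
    protection_bits: int = 0
    interrupt_control: int = 0

contexts = {"A": ThreadContext(pc=0x1000), "B": ThreadContext(pc=0x2000)}
active = "A"

def switch_to(thread_id):
    """With all contexts kept on chip, a switch is only a change of the active
    context selector; nothing is saved to or restored from memory."""
    global active
    active = thread_id

switch_to("B")   # subsequent instructions execute with reference to thread B's context
```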
As we saw in the previous section, there are limits on the amount of instruction level parallelism which can be extracted from a single stream of executing instructions—i.e. a single thread. But, with steady advances in VLSI technology, the aggregate amount of functionality that can be built into a single chip has been growing steadily.
[Footnote] As discussed above, we assume that WAR and WAW dependences can be handled using some form of register renaming.
Therefore hardware support for multi-threading—as well as the provision of multiple processor cores on a single chip—can both be seen as natural consequences of the steady advances in VLSI technology. Both these developments address the needs of important segments of modern computer applications and workloads.
Depending on the specific strategy adopted for switching between threads, hardware support for multi-threading may be classified as one of the following:
(i) Coarse-grain multi-threading refers to switching between threads only on the occurrence of a major pipeline stall—which may be caused by, say, access to main memory, with latencies of the order of a hundred processor clock cycles.
(ii) Fine-grain multi-threading refers to switching between threads on the occurrence of any pipeline stall, which may be caused by, say, an L1 cache miss. But this term would also apply to designs in which processor clock cycles are regularly being shared amongst executing threads, even in the absence of a pipeline stall.
(iii) Simultaneous multi-threading refers to machine instructions from two (or more) threads being issued in parallel in each processor clock cycle. This would correspond to a multiple-issue processor where the multiple instructions issued in a clock cycle come from an equal number of independent execution threads.
With increasing power of VLSI technology, the development of multi-core systems-on-a-chip (SoCs) was also inevitable, since there are practical limits to the number of threads a single processor core can support. Each core on the Sun UltraSPARC T2, for example, supports eight-way fine-grain multi-threading, and the chip has eight such cores. Multi-core chips promise higher net processing performance per watt of power consumption.
Systems-on-a-chip are examples of the fascinating design trade-offs and the technical issues which have been discussed in this chapter. Of course, we have discussed here only the basic design issues and techniques. For any actual task of processor design, it is necessary to make many design choices and trade-offs, validate the design using simulations, and then finally complete the design in detail to the level of logic circuits.
Over the last couple of decades, enormous advances have taken place in various areas of computer technology; these advances have had a major impact on processor and system design. In the next chapter, we shall discuss in some detail these advances and their impact on processor and system design. We shall also study in brief several commercial products, as case studies in how actual processors and systems are designed.
Summary
Processor design—or the choice of a processor from amongst several alternatives—is the central element of computer system design. Since system design can only be carried out with specific target application loads in mind, it follows that processor design should also be tailored for target application loads. To satisfy the overall system performance criteria, various elements of the system must be balanced in terms of their performance—i.e. no element of the system should become a performance bottleneck.
One of the main processor design trade-offs faced in this context is this: Should the processor be designed to squeeze the maximum possible parallelism from a single thread, or should processor hardware support multiple independent threads, with less aggressive exploitation of instruction level parallelism within each thread? In this chapter, we studied the various standard techniques for exploiting instruction level parallelism, and also discussed some of the related design issues and trade-offs.
Dependences amongst instructions make up the main constraint in the exploitation of instruction level parallelism. Therefore, in its essence, the problem here can be defined as: executing a given sequence of machine instructions in the smallest possible number of processor clock cycles, while respecting the true dependences which exist amongst the instructions. To study possible solutions, we looked at two possible prototype processors: one provided with a reorder buffer, and the other with reservation stations associated with its various functional units.
In theory, compiler-detected instruction level parallelism should simplify greatly the issues to be addressed by processor hardware. This is because, in theory, the compiler would do the difficult work of dependence analysis and instruction scheduling. Processor hardware would then be 'dumb and fast'—it would simply execute at a high speed the machine instructions which specify parallel operations. However, many types of runtime events—such as interrupts and cache misses—cannot be predicted at compile time. Processor hardware must therefore provide for dynamic scheduling to exploit instruction level parallelism, limiting the value of what the compiler alone can achieve.
Operand forwarding is a hardware technique to transfer a required operand to multiple destinations in parallel, in one clock cycle, over the common data bus—thus avoiding sequential transfers over multiple clock cycles. To achieve this, it is necessary that processor hardware should dynamically detect and exploit such potential parallelism in data transfers. The benefits lie in reduced wait time in functional units, and better utilization of the common data bus, which is an important hardware resource.
A reorder buffer is a simple mechanism to commit instructions in program order, even if their corresponding operations complete out of order within the processor. Within the reorder buffer, instructions are queued in program order with four typical fields for each instruction—instruction identifier, value computed, program-specified destination of the value computed, and a flag indicating whether the instruction has completed. This simple technique ensures that program state and processor state are correctly preserved, but does not resolve WAR and WAW dependences within the instructions.
Register renaming is a clever technique to resolve WAR and WAW dependences within the instruction stream. This is done by re-mapping source and target programmable registers of executing machine instructions to a larger set of program-invisible registers; thereby the second instruction of a WAR or WAW dependence does not write to the same register which is used by the first instruction. The renaming is done dynamically, without any performance penalty in clock cycles.
Tomasulo's algorithm was developed originally for the IBM 360/91 processor, which was designed for intensive scientific and engineering applications. Operand forwarding is achieved using source tags, which are also sent on the common data bus along with the operand value. Use of reservation stations within functional units provides an effective register renaming mechanism which resolves WAR and WAW dependences.
About 15% to 20% of instructions in a typical machine language program are branch and jump instructions. Therefore for any pipelined processor—but especially for a superscalar processor—branch
prediction and speculative execution are critical to achieving targeted performance. A simple two-bit counter for every branch instruction can serve as a basis for branch prediction; using a combination of local and global branch prediction, more elaborate schemes can be devised.
Limitations in exploiting a greater degree of instruction level parallelism arise from the increased overhead in the required control logic. The limitations may apply to achievable clock rates, power consumption, or the actual processor utilization achieved while running application programs. Thread-level parallelism allows processor resources to be shared amongst multiple independent threads executing at one time. For the target application, processor designers must choose the right combination of instruction level parallelism, thread-level parallelism, and multiple processor cores on a chip.
Exercises
Problem 12.1 Define in brief the meaning of computer architecture; within the scope of that meaning, explain in brief the role of processor design.

Problem 12.2
(a) When can we say that a computer system is balanced with respect to its performance?
(b) In a particular computer system, the designers suspect that the read/write bandwidth of the main memory has become the performance bottleneck. Describe in brief the type of test program you would need to run on the system, and the type of measurements you would need to make, to verify whether main memory bandwidth is indeed the performance bottleneck. You may make additional assumptions about the system if you can justify the assumptions.

Problem 12.3 Recall that, in the example system shown in Fig. 12.1, the bandwidth of the shared processor-memory bus is a performance bottleneck. Assume now that this bandwidth is increased by a factor of six. Discuss in brief the likely effect of this increase on system performance. After this change is made, is there a likelihood that some other subsystem becomes the performance bottleneck?

Problem 12.4 Explain in brief some of the basic design issues and trade-offs faced in processor design, and the role of the VLSI technology selected for building the processor.

Problem 12.5 Explain in brief the significance of (i) processor state, (ii) program state, and (iii) committing an executed instruction.

Problem 12.6
(a) Explain in brief, with one example each, the various types of dependences which must be considered in the process of exploiting instruction level parallelism.
(b) Define in brief the problem of exploiting instruction level parallelism in a single sequence of executing instructions.

Problem 12.7 With static instruction scheduling by the compiler, the processor designer does not need to provide for dynamic scheduling in hardware. Is this statement true or false? Justify your answer in brief.

Problem 12.8 Describe in brief the structure of the reorder buffer, and the functions which it can and cannot perform in the process of exploiting instruction level parallelism.
Note for Exercises 9 to 15
The following three sequences of machine instructions are to be used for Exercises 9 to 15. Note that instructions other than LOAD and STORE have three operands each; from left to right they are, respectively, source 1, source 2 and destination. The '#' sign indicates an immediate operand.
Assume that (a) one instruction is issued per clock cycle, (b) no resource constraints limit instruction level parallelism, (c) floating point operations take two clock cycles each to execute, and (d) load/store memory operations take one clock cycle each when there is an L1 cache hit.
Problem 12.9 Draw dependence graphs of the above sequences of machine instructions, marking on them the type of data dependences, with the respective registers involved.

Problem 12.10 Assume that the processor has no provision for register renaming and operand forwarding, and that all memory references are satisfied from L1 cache. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.11 Now assume that register renaming is implemented to resolve WAR and WAW dependences. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.12 Comment on the scope for operand forwarding within the sequences of instructions. Assume that the load/store unit can also take part in operand forwarding.
Problem 12.13 Assume that, in addition to register renaming, operand forwarding is also implemented as discussed in Exercise 12. Determine the number of clock cycles it takes to execute the above sequences of instructions, counting from the last clock cycle of instruction 1.

Problem 12.14 Consider your answers to Exercises 10, 11 and 13 above. Explain in brief how these answers would be affected if an L1 cache miss occurs in instruction 1, which takes five clock cycles to satisfy from L2 cache.

Problem 12.15 With reference to Exercise 13, describe in brief how Tomasulo's algorithm would implement register renaming and operand forwarding.

Problem 12.16 Explain in brief the meaning of alias analysis as applied to runtime memory addresses.

Problem 12.17 A particular processor makes use of a 2-bit predictor for each branch. Based on a program execution trace, the actual branch behavior at a particular conditional branch instruction is found to be as follows:
T T ... T N   T T ... T N   ...   (each run of T is of length k)
Here T stands for branch taken, and N stands for branch not taken. In other words, the actual branch behavior forms a repeating sequence, such that the branch is taken k times (T), then not taken once (N). With the 2-bit branch predictor, find the fraction of correct branch predictions made if k = 1, k = 2, k = 5 and k = 50.

Problem 12.18 Discuss in brief the difference between local and global branch prediction strategies, and how a two-bit selector may be used per branch to select between the two.

Problem 12.19
(a) Wall's study on instruction level parallelism is based on oracle-driven trace-based simulation. Explain in brief what is meant by this type of simulation.
(b) Wall's study of instruction level parallelism makes certain 'optimistic' assumptions about processor hardware. What are these assumptions? Against each of these assumptions, list the corresponding 'realistic' assumption which we should make, keeping in view the characteristics of real processors.

Problem 12.20 Discuss in brief the basic trade-off in processor design between exploiting instruction level parallelism in a single executing thread, and providing hardware support for multiple threads.

Problem 12.21 Describe in brief what is meant by the context of a thread, and what are the typical operations involved in switching between threads.

Problem 12.22 Describe in brief the different strategies which can be considered for switching between threads in a processor which provides hardware support for multi-threading.