Module 7 Notes Parallelizing-Vectorizing


Parallelizing and Vectorizing Compilers

Rudolf Eigenmann and Jay Hoeflinger


January 2000

1 Introduction

Programming computers is a complex and constantly evolving activity. Technological innovations have consistently increased the speed and complexity of computers, making programming them more difficult. To ease the programming burden, computer languages have been developed. A programming language employs syntactic forms that express the concepts of the language. Programming language concepts have evolved over time, as well. At first the concepts were centered on the computing machine being used. Gradually, the concepts of computer languages have evolved away from the machine, toward problem-solving abstractions. Computer languages require translators, called compilers, to translate programs into a form that the computing machine can use directly, called machine language.

The part of a computer that does arithmetical or logical operations is called the processor. Processors execute instructions, which determine the operation to perform. An instruction that does arithmetic on one or two numbers at a time is called a scalar instruction. An instruction that operates on a larger number of values at once (e.g. 32 or 64) is called a vector instruction. A processor that contains no vector instructions is called a scalar processor, and one that contains vector instructions is called a vector processor. If the machine has more than one processor of either type, it is called a multiprocessor or a parallel computer.

The first computers were actually programmed by connecting components with wires. This was a tedious and very error-prone task. If an error in the wiring was made, the programmer had to visually inspect the wiring to find the wrong connection.

The innovation of stored-program computers eliminated the physical wiring job. The stored-program computer contained a memory device that could store a series of binary digits (1s and 0s), a processing device that could carry out several operations, and devices for communicating with the outside world (input/output devices). An instruction for this kind of computer consisted of a numeric code indicating which operation to carry out, as well as an indication of the operands to which the operation was applied. Debugging such a machine language program involved inspecting memory and finding the binary digits that were loaded incorrectly. This could be tedious and time-consuming.

Then another innovation was introduced: symbolic assembly language and an assembler program to translate assembly language programs into machine language. Programs could be written using names for memory locations and registers. This eased the programming burden considerably, but people still had to manage a group of registers, check status bits within the processor, and keep track of other machine-related hardware. Though the problems changed somewhat, debugging was still tedious and time-consuming.

A new pair of innovations was required to lessen the onerous attention to machine details: a new language (Fortran), closer to the domain of the problems being solved, and a program for translating Fortran programs into machine code, a Fortran compiler. The 1954 Preliminary Report on Fortran stated that ". . . Fortran should virtually eliminate coding and debugging . . . ".

That prediction, unfortunately, did not come true, but the Fortran compiler was an undeniable step forward, in that it eliminated much complexity from the programming process. The programmer was freed from managing the way machine registers were employed and many other details of the machine architecture.

The authors of the first Fortran compiler recognized that in order for Fortran to be accepted by programmers, the code generated by the Fortran compiler had to be nearly as efficient as code written in assembly language. This required sophisticated analysis of the program, and optimizations to avoid unnecessary operations.

Programs written for execution by a single processor are referred to as serial, or sequential programs.
When the quest for increased speed produced computers with vector instructions and multi-processors, compilers were created to convert serial programs for use with these machines. Such compilers, called vectorizing and parallelizing compilers, attempt to relieve the programmer from dealing with the machine details. They allow the programmer to concentrate on solving the object problem, while the compiler concerns itself with the complexities of the machine. Much more sophisticated analysis is required of the compiler to generate efficient machine code for these types of machines.

Vector processors provide instructions that load a series of numbers for each operand of a given operation, then perform the operation on the whole series. This can be done in pipelined fashion, similar to operations done by an assembly line, which is faster than doing the operation on each item separately. Parallel processors offer the opportunity to do multiple operations at the same time on the different processors.

This article will attempt to give the reader an overview of the vast field of vectorizing and parallelizing compilers. In Section 2, we will review the architecture of high performance computers. In Section 3 we will cover the principal ways in which high-performance machines are programmed. In Section 4, we will delve into the analysis techniques that help parallelizing and vectorizing compilers optimize programs. In Section 5, we discuss techniques that transform program code in ways that can enable improved vectorization or parallelization. Sections 6 and 7 discuss the generation of vector instructions and parallel regions, respectively, and the issues surrounding them. Finally, Section 9 discusses a number of important compiler-internal issues.

2 Parallel Machines

2.1 Classifying Machines

Many different terms have been devised for classifying high performance machines. In a 1966 paper, M.J. Flynn divided computers into four classifications, based on the instruction and data streams used in the machine. These classifications have proven useful, and are still in use today:

1. Single instruction stream - single data stream (SISD) - these are single processor machines.

2. Single instruction stream - multiple data streams (SIMD) - these machines have two or more processors that all execute the same instruction at the same time, but on separate data. Typically, a SIMD machine uses a SISD machine as a host, to broadcast the instructions to be executed.

3. Multiple instruction streams - single data stream (MISD) - no whole machine of this type has ever been built, but if it were, it would have multiple processors, all operating on the same data. This is similar to the idea of pipelining, where different pipeline stages operate in sequence on a single data stream.

4. Multiple instruction streams - multiple data streams (MIMD) - these machines have two or more processors that can all execute different programs and operate on their own data.

Another way to classify multiprocessor computers is according to how the programmer can think of the memory system. Shared-memory multiprocessors (SMPs) are machines in which any processor can access the contents of any memory location by simply issuing its memory address. Shared-memory machines can be thought of as having a shared memory unit accessible to every processor. The memory unit can be connected to the machine through a bus (a set of wires and a control unit that allows only a single device to connect to a single processor at one time), or an interconnection network (a collection of wires and control units, allowing multiple data transfers at one time). If the hardware allows nearly equal access time to all of memory for each processor, these machines can be called uniform memory access (UMA) computers.

Distributed-memory multiprocessors (DMPs) use processors that each have their own local memory, inaccessible to other processors. To move data from one processor to another, a message containing the data must be sent between the processors. Distributed memory machines have frequently been called multicomputers.

Distributed shared memory (DSM) machines use a combined model, in which each processor has a separate memory, but special hardware and/or software is used to retrieve data from the memory of another processor. Since in these machines it is faster for a processor to access data in its own memory than to access data in another processor's memory, these machines are frequently called non-uniform memory access (NUMA) computers. NUMA machines may be further divided into two categories: those in which cache-coherence is maintained between processors (cc-NUMA) and those in which cache-coherence is not maintained (nc-NUMA).

2.2 Parallel Computer Architectures

People have experimented with quite a few types of architectures for high-performance computers. There has been a constantly re-adjusting balance between ease of implementation and high performance.

SIMD Machines

The earliest parallel machines were SIMD machines. SIMD machines have a large number of very simple slave processors controlled by a sequential host or master processor. The slave processors each contain a portion of the data for the program. The master processor executes a user's program until it encounters a parallel instruction. At that time, the master processor broadcasts the instruction to all the slave processors, which then execute the instruction on their data. The master processor typically applies a bit-mask to the slave processors. If the bit-mask entry for a particular slave processor is 0, then that processor does not execute the instruction on its data. The set of slave processors is also called an attached array processor because it can be built into a single unit and attached as a performance upgrade to a uniprocessor.

An early example of a SIMD machine was the Illiac IV, built at the University of Illinois during the late 1960s and early 1970s. The final configuration had 64 processors, one-quarter of the 256 originally planned. It was the world's fastest computer throughout its lifetime, from 1975 to 1981. Examples of SIMD machines from the 1980s were the Connection Machine from Thinking Machines Corporation, introduced in 1985, and its follow-on, the CM-2, which contained 64K processors, introduced in 1987.

Vector Machines

A vector machine has a specialized instruction set with vector operations and usually a set of vector registers, each of which can contain a large number of floating point values (up to 128). With a single instruction, it applies an operation to all the floating point numbers in a vector register. The processor of a vector machine is typically pipelined, so that the different stages of applying the operation to the vector of values overlap. This also avoids the overheads associated with loop constructs. A scalar processor would have to apply the operation to each data value in a loop.

The first vector machines were the Control Data Corporation (CDC) Star-100 and the Texas Instruments ASC, built in the early 1970s. These machines did not have vector registers, but rather loaded data directly from memory to the processor. The first commercially successful vector machine was the Cray Research Cray-1. It used vector registers and paired a fast vector unit with a fast scalar unit. In the 1980s, CDC built the Cyber 205 as a follow-on to the Star-100, and three Japanese companies, NEC, Hitachi and Fujitsu, built vector machines. These three companies continued manufacturing vector machines through the 1990s.

Shared Memory Machines

In a shared memory multi-processor, each processor can access the value of any shared address by simply issuing the address. Two principal hardware schemes have been used to implement this. In the first (called centralized shared memory), the processors are connected to the shared memory via either a system bus or an interconnection network. The memory bus is the cheapest way to connect processors to make a shared memory system. However, the bus becomes a bottleneck since only one device may use it at a time. An interconnection network has more inherent parallelism, but involves more expensive hardware. In the second (called distributed shared memory), each processor has a local memory, and whenever a processor issues the address of a memory location not in its local memory, special hardware is activated to fetch the value from the remote memory that contains it.

The Sperry Rand 1108 was an early centralized shared memory computer, built in the mid 1960s. It could be configured with up to three processors plus two input/output controller processors. In the 1970s, Carnegie Mellon University built the C.mmp as a research machine, connecting 16 minicomputers (PDP-11s) to 16 memory units through a crossbar interconnection network. Several companies built bus-based centralized shared memory computers during the 1980s, including Alliant, Convex, Sequent and Encore. The 1990s saw fewer machines of this type introduced. A prominent manufacturer was Silicon Graphics Inc. (SGI), which produced the Challenge and Power Challenge systems in that period.

During the 1980s and 1990s, several research machines explored the distributed shared memory architecture. The Cedar machine built at the University of Illinois in the late 1980s connected a number of bus-based multiprocessors (called clusters) with an interconnection network to a global memory. The global memory modules contained special synchronization processors which allowed clusters to synchronize.
The Stanford DASH, built in the early 1990s, also employed a two-level architecture, but added a cache-coherence mechanism. One node of the DASH was a bus-based multiprocessor with a local memory. A collection of these nodes was connected together in a mesh. When a processor referred to a memory location not contained within the local node, the node's directory was consulted to determine the remote location. The MIT Alewife project also produced a directory-based machine. A prominent directory-based commercial machine was the Origin 2000 from SGI.

Distributed Memory Multiprocessors

In a distributed memory multiprocessor, each processor has access to its local memory only. It can only access values from a different memory by receiving them in a message from the processor whose memory contains them. The other processor must be programmed to send the value at the right time.

People started extensive experimentation with DMPs in the 1980s. They were searching for ways to construct computers with large numbers of processors cheaply. Bus-based machines were easy to build, but suffered from a limitation on bus communications bandwidth, and were hampered by the serialization required to use the bus. Machines built using interconnection networks or cross-bar switches had increased bandwidth and communication parallelism, but were expensive to build. Multicomputers simply connected processor/memory nodes with communication lines in a number of configurations, from meshes to toroids to hypercubes. These turned out to be cheap to build and (usually) had sufficient memory bandwidth. The principal drawback of these machines was the difficulty of writing programs for them.

One example of a DMP in the 1980s was a hypercube research machine built at the California Institute of Technology, which had processors connected in a hypercube topology by communication links. The nCUBE company and Intel Scientific Computers (ISC) were founded to build similar machines. nCUBE built the nCUBE/1 and nCUBE/2, while ISC built the iPSC/1 and iPSC/2. In the 1990s, nCUBE followed with the nCUBE/2S, and ISC built the iPSC/860, the iWarp for Carnegie Mellon University, and the Paragon.

COMA

A machine that uses all of its memory as a cache is called a cache-only memory architecture (COMA). Typically in these machines, each processor has a local memory, and data is allowed to move from one processor's memory to another during the run of a program. The term attraction memory has been used to describe the tendency of data to migrate toward the processor that uses it the most. Theoretically, this can minimize the latency to access data, since latency increases as the data get further from the processor.

The COMA idea was introduced by a team at the Swedish Institute of Computer Science, working on the Data Diffusion Machine. The idea was commercialized by Kendall Square Research (KSR), which built the KSR1 in the early 1990s.

Multi-threaded Machines

A multi-threaded machine attempts to hide latency to memory by overlapping it with computation. As soon as the processor is forced to wait for a data access, it switches to another thread to do more computation. If there are enough threads to keep the processor busy until each piece of data arrives, then the processor is never idle.

The Denelcor HEP machine was the first multi-threaded processor in the early 1980s. In the 1990s, the Alewife machine used multi-threading in the processor to help hide some of the latency of memory accesses. Also in the 90s, the Tera Computer Company developed the MTA machine, which expanded on many of the ideas used in the HEP.

Clusters of SMPs

Another approach to building a multiprocessor is to use a small number of commodity microprocessors to make centralized shared-memory clusters, then connect large numbers of these together. The number of microprocessors used to make a single cluster would be determined by the number of processors that would saturate the bus (keep the bus constantly busy). Such machines are called clusters of SMPs. Clusters of SMPs have the advantage of being cheap to build. During the second half of the 1990s, people began building clusters out of low-cost components: Pentium processors, a fast network (such as Ethernet or Myrinet), the Linux operating system, and the Message Passing Interface (MPI) library. These machines are sometimes referred to as Beowulf clusters.

3 Programming Parallel Machines

From a user's point of view there are three different ways of creating a parallel program:

1. writing a serial program and compiling it with a parallelizing compiler

2. composing a program from modules that have already been implemented as parallel programs

3. writing a program that expresses parallel activities explicitly

Option 1 above is obviously the easiest for the programmer. It is easier to write a serial program than it is to write a parallel program. The programmer would write the program in one of the languages for which a parallelizing compiler is available (Fortran, C, and C++), then employ the compiler. The technology that supports this scenario is the main focus of this article.

Option 2 above can be easy as well, because the user does not need to deal with explicit parallelism. For many problems and computer systems there exist libraries that perform common operations in parallel. Among them, mathematical libraries for manipulating matrices are best known. One difficulty for users is that one must make sure that a large fraction of the program execution is spent inside such libraries. Otherwise, the serial part of the program may dominate the execution time when running the application on many processors.

Option 3 above is the most difficult for programmers, but gives them direct control over the performance of the parallel execution. Explicit parallel languages are also important as a target for parallelizing compilers. Many parallelizers act as source-to-source restructurers, translating the original, serial program into parallel form. The actual generation of parallel code is then performed by a "backend compiler" from this parallel language form. The remainder of this section discusses this option in more detail.

Expressing Parallel Programs

Syntactically, parallel programs can be expressed in various ways. A large number of languages offer parallel programming constructs. Examples are Prolog, Haskell, Sisal, Multilisp, Concurrent Pascal, Occam and many others. Compared to standard, sequential languages they tend to be more complex and available on fewer machines, and they lack good debugging tools, which contributes to the degree of difficulty facing the user of Option 3 above.

Parallelism can also be expressed in the form of directives, which are pseudo comments with semantics understood by the compiler. Many manufacturers have devised their own set of such directives (Cray, SGI, Convex, etc.), but during the 1990s the OpenMP standard emerged. OpenMP describes a common set of directives for implementing various types of parallel execution and synchronization. One advantage of the OpenMP directives is that they are designed to be added to a working serial code. If the compiler is told to ignore the directives, the serial program will still execute correctly. Since the serial program is unchanged, such a parallel program may be easier to debug.

A third way of expressing parallelism is to use library calls within an otherwise sequential program. The libraries perform the task of creating and terminating parallel activities, scheduling them, and supporting communication and synchronization. Examples of libraries that support this method are the POSIX threads package, which is supported by many operating systems, and the MPI libraries, which have become a standard for expressing message passing parallel applications.

Parallel Programming Models

Programming Vector Machines: Vector parallelism typically exploits operations that are performed on array data structures. This can be expressed using vector constructs that have been added to standard languages. For instance, Fortran90 uses constructs such as

A(1:n) = B(1:n) + C(1:n)

For a vector machine, this could cause a vector loop to be produced, which performs a vector add between chunks of arrays B and C, then a vector copy of the result into a chunk of array A. The size of a chunk would be determined by the number of elements that fit into a vector register in the machine.
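As an illustration of this chunking (a minimal sketch constructed for these notes, not taken from the article), a compiler targeting a machine with 64-element vector registers might strip-mine the assignment as follows; the length VL is an assumed register size and the names are arbitrary:

      INTEGER, PARAMETER :: VL = 64      ! assumed vector register length
      INTEGER :: i, len
      DO i = 1, n, VL
         len = MIN(VL, n - i + 1)        ! the last chunk may be shorter
         ! one vector load of each operand chunk, one vector add,
         ! and one vector store of the result chunk
         A(i:i+len-1) = B(i:i+len-1) + C(i:i+len-1)
      ENDDO

Each trip through the strip-mined loop corresponds to one set of vector instructions operating on at most VL elements.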
Loop Parallelism: Loops express repetitive execution patterns, which is where most of a program's work is performed. Parallelism is exploited by identifying loops that have independent iterations. That is, all iterations access separate data. Loop parallelism is often expressed through directives, which are placed before the first statement of the loop. OpenMP is an important example of a loop-oriented directive language. Typically, a single processor executes code between loops, but activates (forks) a set of processors to cooperate in executing the parallel loop. Every processor will execute a share of the loop iterations. A synchronization point (or barrier) is typically placed after the loop. When all processors arrive at the barrier, only the master processor continues. This is called a join point for the loop.
Thus the term fork/join parallelism is used for loop parallelism.

Determining which processor executes which iteration of the loop is called scheduling. Loops may be scheduled statically, which means that the assignment of processors to loop iterations is fully determined prior to the execution of the loop. Loops may also be self-scheduled, which means that whenever a given processor is ready to execute a loop iteration, it takes the next available iteration. Other scheduling techniques will be discussed in Section 7.3.
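A minimal sketch of this fork/join pattern, using OpenMP directives on a Fortran loop (an illustration added here, with arbitrary array names): SCHEDULE(STATIC) corresponds to static scheduling, while SCHEDULE(DYNAMIC) corresponds to the self-scheduled scheme described above.

!$OMP PARALLEL DO SCHEDULE(STATIC) PRIVATE(i)
      DO i = 1, n
         ! each member of the team of processors executes a share of the iterations
         A(i) = B(i) + C(i)
      ENDDO
!$OMP END PARALLEL DO
      ! the end of the parallel loop is an implicit barrier (the join point);
      ! only the master processor continues past it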
Parallel Threads Model: If the parallel activities in a program can be packaged well in the form of subroutines that can execute independently of each other, then the threads model is adequate. Threads are parallel activities that are created and terminated explicitly by the program. The code executed by a thread is a specified subroutine, and the data accessed can either be private to a thread or shared with other threads. Various synchronization constructs are usually supported for coordinating parallel threads. Using the threads model, users can implement highly dynamic and flexible parallel execution schemes. The POSIX threads package is one example of a well-known library that supports this model.

The SPMD Model: Distributed-memory parallel machines are typically programmed by using the SPMD execution model. SPMD stands for "single program, multiple data". This refers to the fact that each processor executes an identical program, but on different data. One processor cannot directly access the data of another processor, but a message containing that data can be passed from one processor to the other. The MPI standard defines an important form for passing such messages. In a DMP, a processor's access to its own data is much faster than access to data of another processor through a message, so programmers typically write SPMD programs that avoid access to the data of other processors. Programs written for a DMP can be more difficult to write than programs written for an SMP, because the programmer must be much more careful about how the data is accessed.

4 Program Analysis

Program analysis is crucial for any optimizing compiler. The compiler writer must determine the analysis techniques to use in the compiler based on the target machine and the type of optimization desired. For parallelization and vectorization, the compiler typically takes as input the serial form of a program, then determines which parts of the program can be transformed into parallel or vector form. The key constraint is that the "results" of each section of code must be the same as those of the serial program. Sometimes the compiler can parallelize a section of code in such a way that the order of operations is different than that in the serial program, causing a slightly different result. The difference may be so small as to be unimportant, or actually might alter the results in an important way. In these cases, the programmer must agree to let the compiler parallelize the code in this manner.

Some of the analysis techniques used by parallelizing compilers are also done by optimizing compilers compiling for serial machines. In this section we will generally ignore such techniques, and focus on the techniques that are unique to parallelizing and vectorizing compilers.

4.1 Dependence Analysis

A data dependence between two sections of a program indicates that during execution of the optimized program, those two sections of code must be run in the order indicated by the dependence. Data dependences between two sections of code that access the same memory location are classified based on the type of the access (read or write) and the order, so there are four classifications:

input dependence: READ before READ

anti dependence: READ before WRITE

flow dependence: WRITE before READ

output dependence: WRITE before WRITE

Flow dependences are also referred to as true dependences. If an input dependence occurs between two sections of a program, it does not prevent the sections from running at the same time (in parallel). However, the existence of any of the other types of dependences would prevent the sections from running in parallel, because the results may be different from those of the serial code. Techniques have been developed for changing the original program in many situations where dependences exist, so that the sections can run in parallel. Some of them will be described later in this article.
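The following small loop (an example constructed for these notes) shows how each of the dependence types that inhibit parallel execution can arise:

      DO i = 2, n
         A(i)   = B(i) + 1.0    ! S1 writes A(i)
         C(i)   = A(i-1)        ! S2 reads A(i-1), written by S1 in the previous
                                ! iteration: a flow (true) dependence
         B(i-1) = D(i)          ! S3 writes B(i-1), which S1 read in the previous
                                ! iteration: an anti dependence
         T      = C(i) + D(i)   ! T is overwritten by every iteration: output
                                ! dependences between the iterations
      ENDDO

Running the iterations of this loop in parallel without further transformation could therefore produce results that differ from the serial execution.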

A loop is parallelized by running its iterations in parallel, so the question must be asked whether the same storage location would be accessed in different iterations of a loop, and whether one of the accesses is a write. If so, then a data dependence exists within the loop. Data dependence within a loop is typically determined by equating the subscript expressions of each pair of references to a given array, and attempting to solve the equation (called the dependence equation), subject to constraints imposed by the loop bounds. For a multi-dimensional array, there is one dependence equation for each dimension. The dependence equations form a system of equations, the dependence system, which is solved simultaneously. If the compiler can find a solution to the system, or if it cannot prove that there is no solution, then it must conservatively assume that there is a solution, which would mean that a dependence exists. Mathematical methods for solving such systems are well known if the equations are linear. That means, the form of the subscript expressions is as follows:

\sum_{j=1}^{k} a_j i_j + a_0

where k is the number of loops nested around an array reference, i_j is the loop index in the j-th loop in the nest, and a_j is the coefficient of the j-th loop index in the expression.

The dependence equation would be of the form:

\sum_{j=1}^{k} a_j i'_j + a_0 = \sum_{j=1}^{k} b_j i''_j + b_0        (1)

or

\sum_{j=1}^{k} (a_j i'_j - b_j i''_j) = b_0 - a_0        (2)

In these equations, i'_j and i''_j represent the values of the j-th loop index of the two subscript expressions being equated. For instance, consider the loop below.

      DO i = 1, 100
         A(i) = B(i)
         C(i) = A(i-1)
      ENDDO

There are two references to array A, so we equate the subscript expressions of the two references. The equation would be:

i'_1 = i''_1 - 1

subject to the constraints:

i'_1 < i''_1
1 ≤ i'_1 ≤ 100
1 ≤ i''_1 ≤ 100

The constraint i'_1 < i''_1 comes from the idea that only dependences across iterations are important. A dependence within the same iteration (i'_1 ≡ i''_1) is never a problem, since each iteration executes on a single processor, so it can be ignored.

Of course, there are many solutions to this equation that satisfy the constraints: {i'_1 : 1, i''_1 : 2} is one; {i'_1 : 2, i''_1 : 3} is another. Therefore, the given loop contains a dependence.

A dependence test is an algorithm employed to determine if a dependence exists in a section of code. The problem of finding dependence in this way has been shown to be equivalent to the problem of finding solutions to a system of Diophantine equations, which is NP-complete, meaning that only extremely expensive algorithms can be found to solve the complete problem exactly. Therefore, a large number of dependence tests have been devised that solve the problem under simplifying conditions and in special situations.

Iteration Spaces

Looping statements with pre-evaluated loop bounds, such as the Fortran do-loop, have a predetermined set of values for their loop indices. This set of loop-index values is the iteration space of the loop. k-nested loop statements have a k-dimensional iteration space. A specific iteration within that space may be named by a k-tuple of iteration values, called an iteration vector:

{i_1, i_2, ..., i_k}

in which i_1 represents the outermost loop, i_2 the next inner, and i_k is the innermost loop.

Direction and Distance Vectors

When a dependence is found in a loop nest, it is sometimes useful to characterize it by indicating the iteration vectors of the iterations where the same location is referenced. For instance, consider the following loop.

      DO i = 2, 100
         DO j = 1, 100
S1:         A(i,j) = B(i)
S2:         C(i) = A(i-1,j+3)
         ENDDO
      ENDDO

The dependence between statements S1 and S2 happens between iterations

{2, 5} and {3, 2}, {2, 6} and {3, 3}, etc.

Since {2, 5} happens before {3, 2} in the serial execution of the loop nest, we say that the dependence source is {2, 5} and the dependence sink is {3, 2}. The dependence distance for a particular dependence is defined as the difference between the iteration vectors, the sink minus the source.

dependence distance = {3, 2} - {2, 5} = {1, -3}

Notice that in this example the dependence distance is constant, but this may not always be the case.

The dependence direction vector is also useful information, though coarser than the dependence distance. There are three directions for a dependence: {<, =, >}. The < direction corresponds to a positive dependence distance, the = direction corresponds to a distance of zero, and the > direction corresponds to a negative dependence distance. Therefore, the direction vector for the example above would be {<, >}.

Distance and direction vectors are used within parallelizing compilers to help determine the legality of various transformations, and to improve the efficiency of the compiler. Loop transformations that reorder the iteration space, or modify the subscripts of array references within loops, cannot be applied for some configurations of direction vectors. In addition, in multiply-nested loops that refer to multi-dimensional arrays, we can hierarchically test for dependence, guided by the direction vectors, and thereby make fewer dependence tests. Distance vectors can help partially parallelize loops, even in the presence of dependences.

Exact versus Inexact Tests

There are three possible answers that any dependence test can give:

1. No dependence - the compiler can prove that no dependence exists.

2. Dependence - the compiler can prove that a dependence exists.

3. Not sure - the test could neither prove nor disprove dependences. To be safe, the compiler must assume a dependence in this case. This is the conservative assumption for dependence testing, necessary to guarantee correct execution of the parallel program.

We call a dependence test exact if it only reports answers 1 or 2. Otherwise, it is inexact.

Dependence Tests

The first of the dependence tests was the GCD test, an inexact test. The GCD test finds the greatest common divisor g of the coefficients of the left-hand side of the dependence equation (Equation 2 above). If g does not divide the right-hand-side value of Equation 2, then there can be no dependence. Otherwise, a dependence is still a possibility. The GCD test is cheap compared to some other dependence tests. In practice, however, often the GCD g is 1, which will always divide the right-hand side, so the GCD test doesn't help in those cases.
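As a small worked illustration (added here, not from the article), consider:

      DO i = 1, 100
         A(2*i) = A(2*i+1) + 1.0
      ENDDO

Equating the subscripts 2*i' and 2*i'' + 1 and writing the result in the form of Equation 2 gives 2 i'_1 - 2 i''_1 = 1. Here g = gcd(2, 2) = 2, which does not divide 1, so the two references can never touch the same element and no dependence exists. If the second reference were A(2*i+2) instead, the right-hand side would be 2, which g does divide, and the GCD test would have to report a possible dependence.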
The Extreme Value test, also inexact in the general case, has proven to be one of the most useful dependence tests. It takes the dependence equation (2) and constructs both the minimum and the maximum possible values for the left-hand side. If it can show that the right-hand side is either greater than the maximum, or less than the minimum, then we know for certain that no dependence exists. Otherwise, a dependence must be assumed. A combination of the Extreme Value test and the GCD test has proved to be very valuable and fast because they complement each other very well. The GCD test does not incorporate information about the loop bounds, which the Extreme Value test provides. At the same time, the Extreme Value test does not concern itself with the structure of the subscript expressions, which the GCD test does.
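A small worked illustration of the Extreme Value test (again constructed for these notes):

      DO i = 1, 100
         A(i+100) = A(i) + 1.0
      ENDDO

Equating the subscripts i' + 100 and i'' in the form of Equation 2 gives i'_1 - i''_1 = -100. With 1 ≤ i'_1 ≤ 100 and 1 ≤ i''_1 ≤ 100, the left-hand side can be no smaller than 1 - 100 = -99 and no larger than 100 - 1 = 99. The right-hand side, -100, lies below that minimum, so no dependence exists. Note that the GCD test alone would not help here, since the GCD of the coefficients is 1.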
The Extreme Value test is exact under certain conditions. It is exact if any of the following are true:

• all loop index coefficients are ±1 or 0,

• the coefficient of one index variable is ±1 and the magnitudes of all other coefficients are less than the range of that index variable, or

• the coefficient of one index variable is ±1 and there exists a permutation of the remaining index variables such that the coefficient of each is less than the sum of the products of the coefficients and ranges for all the previous index variables.

Many other dependence tests have been devised over the years. Many deal with ways of solving the dependence system when it takes certain forms. For instance, the Two Variable Exact test can find an exact solution if the dependence system is a single equation of the form:

a i + b j = c.

The most general dependence test would be to use integer programming to solve a linear system - a set of equations (the dependence system) and a set of inequalities (constraints on variables due to the structure of the program). Integer programming conducts a search for a set of integer values for the variables that satisfy the linear system. Fourier-Motzkin elimination is one algorithm that is used to conduct the search for solutions. Its complexity is very high (exponential), so until the advent of the Omega test (discussed below), it was considered too expensive to use integer programming as a dependence test.

The Lambda test is an increased-precision form of the Extreme Value test. While the Extreme Value test basically checks to see whether the hyper-plane formed by any of the dependence equations falls completely outside the multi-dimensional volume delimited by the loop bounds of the loop nest in question, the Lambda test checks for the situation in which each hyper-plane intersects the volume, but the intersection of all hyper-planes falls outside the volume. It is especially useful for the situation in which a single loop index appears in the subscript expression for more than one dimension of an array reference (a reference referred to as having coupled subscripts). If the Lambda test can find that the intersection of any two dependence equation hyper-planes falls completely outside the volume, then it can declare that there is no solution to the dependence system.

The I Test is a combination of the GCD and Extreme Value tests, but is more precise than would be the application of the two tests individually.

The Generalized GCD test, built on Gaussian Elimination (adapted for integers), attempts to solve the system of dependence equations simultaneously. It forms a matrix representation of the dependence system, then using elementary row operations forms a solution of the dependence system, if one exists. The solution is parameterized so that all possible solutions could be generated. The dependence distance can also be determined by this method.

The Power Test first uses the Generalized GCD test. If that produces a parameterized solution to the dependence system, then it uses constraints derived from the program to determine lower and upper bounds on the free variables of the parameterized solution. Fourier-Motzkin elimination is used to combine the constraints of the program for this purpose. These extra constraints can sometimes produce an impossible result, indicating that the original parameterized solution was actually empty, disproving the dependence. The Power Test can also be used to test for dependence for specific direction vectors.

All of the preceding dependence tests are applicable when all coefficients and loop bounds are integer constants and the subscript expressions are all affine functions. The Power Test is the only test mentioned up to this point that can make use of variables as coefficients or loop bounds. A variable can simply be treated as an additional unknown. The value of the variable would simply be expressed in terms of the free variables of the solution, then Fourier-Motzkin elimination could incorporate any constraints on that variable into the constraints on the free variables of the solution. A small number of dependence tests have been devised that can make use of variables and non-affine subscript expressions.

The Omega test makes use of a fast algorithm for doing Fourier-Motzkin elimination. The original dependence problem that it tries to solve consists of a set of equalities (the dependence system), and a set of inequalities (the program constraints). First, it eliminates all equality constraints (as was done in the Generalized GCD test) by using a specially designed mod function to reduce the magnitude of the coefficients, until at least one reaches ±1, when it is possible to remove the equality. Then, the set of resulting inequalities is tested to determine whether any integer solutions can be found for them. It has been shown that, for most real program situations, the Omega test gives an answer quickly (polynomial time). In some cases, however, it cannot and it resorts to an exponential-time search.

The Range test extends the Extreme Value test to symbolic and nonlinear subscript expressions. The ranges of array locations accessed by adjacent loop iterations are symbolically compared. The Range test makes use of range information for variables within the program, obtained by symbolically analyzing the program. It is able to discern data dependences in a few important situations that other tests cannot handle.

The recent Access Region test makes use of a symbolic representation of the array elements accessed at separate sites within a loop. It uses an intersection operation to intersect two of these symbolic access regions. If the intersection can be proven empty, then the potential dependence is disproven. The Access Region test likewise can test dependence when non-affine subscript expressions are used, because in some cases it can apply simplification operators to express the regions in affine terms.

An array subscript expression classification system can assist dependence testing. Subscript expressions may be classified according to their structure, then the dependence solution technique may be chosen based on how the subscript expressions involved are classified. A useful classification of the subscript expression pairs involved in a dependence problem is as follows:

ZIV (zero index variable) The two subscript expressions contain no index variables at all, e.g. A(1) and A(2).

SIV (single index variable) The two subscript expressions contain only one loop index variable, e.g. A(i) and A(i+2).

MIV (multiple index variable) The two subscript expressions contain more than one loop index variable, e.g. A(i) and A(j) or A(i+j) and A(i).

The different classifications call for unique dependence testing methods. The SIV class is further subdivided into various special cases, each enabling a special dependence test or loop transformation.

The Delta test makes use of these subscript expression classes. It first classifies each dependence problem according to the above types. Then, it uses a specially-targeted dependence test for each case. The main insight of the Delta test is that when two array references are being tested for dependence, information derived from solving the dependence equation for one dimension may be used in solving the dependence equation for another dimension. This allows the Delta test to be useful even in the presence of coupled subscripts. The algorithm attends to the SIV and ZIV equations first, since they can be solved easily. Then, the information gained is used in the solution of the MIV equations. Since the Delta test does not attempt to use a single general technique to determine dependence, but rather special tests for each special case, it is possible for the Delta test to accommodate unknown variable values more easily.

Run-time Dependence Testing

It is very common for programs to make heavy use of variables whose values are read from input files. Unfortunately, such variables often contain crucial information about the dependence pattern of the program. In this type of situation, a perfectly parallel loop might have to be run serially, simply because the compiler lacked information about the input variables. In these cases, it is sometimes possible for the compiler to compile a test into the program that would test for certain parallelism-enabling conditions, then choose between parallel and serial code based on the result of the test. This technique of parallelization is called run-time dependence testing.

The inspector/executor model of program execution allows a compiler to run some kind of analysis of the data values in the program (the inspector), which sets up the execution, then to execute the code based on the analysis (the executor). The inspector can do anything from dependence testing to setting up a communication pattern to be carried out by the executor. The complexity of the test needed at run-time varies based on the details of the loop itself. Sometimes the test needed is very simple. For instance, in the loop

      DO i = 1, 100
         A(i+m) = B(i)
         C(i) = A(i)
      ENDDO

no dependence exists in the loop if m > 99. The compiler might generate code that executes a parallel version of the loop if m > 99, otherwise a serial version.
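The generated two-version code might take roughly the following form (a sketch assuming OpenMP is used for the parallel version; the code an actual compiler emits will differ):

      IF (m .GT. 99) THEN
         ! the writes to A(1+m:100+m) and the reads of A(1:100) cannot overlap
!$OMP PARALLEL DO PRIVATE(i)
         DO i = 1, 100
            A(i+m) = B(i)
            C(i)   = A(i)
         ENDDO
!$OMP END PARALLEL DO
      ELSE
         ! a dependence through A is possible, so run the loop serially
         DO i = 1, 100
            A(i+m) = B(i)
            C(i)   = A(i)
         ENDDO
      ENDIF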
More complicated situations might call for a more sophisticated dependence test to be performed at run-time. The compiler might be able to prove all conditions for independence except one. Proof of that condition might be attempted at run-time. For example, the compiler might determine that a loop is parallel if only a given array was proven to contain no duplicate values (i.e. be a permutation vector). If the body of the loop is large enough, then the time savings of running the loop in parallel can be substantial. It could offset the expense of checking for the permutation vector condition at runtime. In this case, the compiler might generate such a test to choose between the serial and parallel versions of the loop.

Another technique that has been employed is to attempt to run a loop in parallel despite not knowing for sure that the loop is parallel. This is called speculative parallelization. The pre-loop values of the memory locations that will be modified by the loop must be saved because it might be determined during execution that the loop contained a dependence, in which case the results of the parallel run must be discarded and the loop must be re-executed serially. During the parallel execution of the loop, extra code is executed that can be used to determine whether a dependence really did exist in the serial version.
The LRPD test is one example of such a runtime dependence test.

4.2 Interprocedural Analysis

A loop containing one or more procedure calls presents a special challenge for parallelizing compilers. The chief problem is how to compare the memory activity in different execution contexts (subroutines), for the purpose of discovering data dependences. One possibility, called subroutine inlining, is to remove all subroutine calls by directly replacing all subroutine calls with the code from the called subroutine, then parallelizing the whole program as one large routine. This is sometimes feasible, but often causes an explosion in the amount of source code that the compiler must compile. Inlining also faces obstacles in trying to represent the formal parameters of the subroutine in the context of the calling routine, since in some languages (Fortran is one example) it is legal to declare formal parameter arrays with dimensionality different from that in the calling routine.

The alternative to inlining is to keep the subroutine call structure intact and simply represent the memory access activity caused by a subroutine in some way at the call site. One method of doing this is to represent memory activity symbolically with sets of constraints. For instance, at the site of a subroutine call, it might be noted that the locations written were:

{A(i) | 0 ≤ i ≤ 100}

The advantage of using this method is that one can use the sets of constraints directly with a Fourier-Motzkin-based dependence test.
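For instance (an example constructed here, using a hypothetical subroutine UPDATE), a loop containing a call could be summarized at the call site by the set of locations the callee touches:

      DO j = 1, 100
         CALL UPDATE(A, j)
      ENDDO

      SUBROUTINE UPDATE(X, j)
      REAL X(*)
      INTEGER j
      ! each call reads and writes exactly the single element X(j)
      X(j) = X(j) + 1.0
      END

The call in iteration j can be summarized as accessing {A(j)}, so the iterations of the enclosing loop touch disjoint locations and the call does not create a cross-iteration dependence.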
Several other forms for representing memory accesses have been used in various compilers. Many are based on triplet notation, which represents a set of memory locations in the form:

lower bound : upper bound : stride

This form can represent many regular access patterns, but not all.

Another representational form consists of Regular Section Descriptors (RSDs), which use a simple form (I + α), where I is a loop index and α is a loop-invariant expression. At least three other forms based on RSDs have been used: Restricted RSDs (which can express access on the diagonal of an array), Bounded RSDs that express triplet notation with full symbolic expressions, and Guarded Array Regions, which are Bounded RSDs qualified with a predicate guard.

An alternative for representing memory activity interprocedurally is to use a representational format whose dimensionality is not tied to the program-declared dimensionality of a given array. An example of this type is the Linear Memory Access Descriptors used in the Access Region test. This form can represent most memory access patterns used in a program, and allows one to represent memory reference activity consistently across procedure boundaries.

4.3 Symbolic Analysis

Symbolic analysis refers to the use of symbolic terms within ordinary compiler analysis. The extent to which a compiler's analysis can handle expressions containing variables is a measure of how good a compiler's symbolic analysis is. Some compilers use an integrated, custom-built symbolic analysis package, which can apply algebraic simplifications to expressions. Others depend on integrated packages, such as the Omega constraint-solver, to do the symbolic manipulation that they need. Still others use links to external symbolic manipulation packages, such as Mathematica or Maple. Modern parallelizing compilers generally have sophisticated symbolic analysis.

4.4 Abstract Interpretation

When compilers need to know the result of executing a section of code, they often traverse the program in "execution order", keeping track of the effect of each statement. This process is called abstract interpretation. Since the compiler generally will not have access to the runtime values of all the variables in the program, the effect of each statement will have to be computed symbolically. The effect of a loop is easily determined when there is a fixed number of iterations (such as in a Fortran do-loop). For loops that do not explicitly state a number of iterations, the effect of the iteration may be determined by widening, in which the values changing due to the loop are made to change as though the loop had an infinite number of iterations, and then narrowing, in which an attempt is made to factor in the loop exit conditions, to limit the changes due to widening. Abstract interpretation follows all control flow paths in the program.

Range Analysis: Range analysis is an application of abstract interpretation. It gathers the range of values that each variable can assume at each point in the program. The ranges gathered have been used to support the Range test, as mentioned in Section 4.1 above.

4.5 Data Flow Analysis

Many analysis techniques need global information about the program being compiled. A general framework for gathering this information is called data flow analysis. To use data flow analysis, the compiler writer must set up and solve systems of data flow equations that relate information at various points in a program. The whole program is traversed and information is gathered from each program node, then used in the data flow equations. The traversal of the program can be either forward (in the same direction as normal execution would proceed), or backward (in the opposite direction from normal execution). At join points in the program's control flow graph, the information coming from the paths that join must be combined, so the rules which govern that combination must be specified.

The data flow process proceeds in the direction specified, gathering the information by solving the data flow equations, and combining information at control flow join points until a steady state is achieved. That is, until no more changes occur in the information being calculated. When steady state is achieved, the wanted information has been propagated to each point in the program.

An example of data flow analysis is constant propagation. By knowing the value of a variable at a certain point in the program, the precision of compiler analysis can be improved. Constant propagation is a forward data flow problem. A value is propagated for each variable. The value gets set whenever an assignment statement in the program assigns a constant to the variable. The value remains associated with the variable until another assignment statement assigns a value to the variable. At control flow join points in the program, if a value is associated with the variable on all incoming paths, and it is always the same value, then that value stays associated with the variable. Otherwise, the value is set to "unknown".
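A small illustration of this join rule (a constructed example):

      X = 3
      IF (FLAG) THEN
         Y = X + 1     ! X is known to be 3 here, so Y becomes 4
      ELSE
         X = 5
         Y = X - 1     ! X is known to be 5 here, so Y also becomes 4
      ENDIF
      Z = X + Y        ! at this join point Y is 4 on both paths, so Y stays 4;
                       ! X is 3 on one path and 5 on the other, so X is "unknown"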
Data flow analysis can be used for many purposes within a compiler. It can be used for determining which variables are aliased to which other variables, for determining which variables are potentially modified by a given section of code, for determining which variables may be pointed to by a given pointer, and many other purposes. Its use generally increases the precision of other compiler analyses.

4.6 Code Transformations to Aid Analysis

Sometimes program source code can be transformed in a way that encodes useful information about the program. The program can be translated into a restricted form that eliminates some of the complexities of the original program. Two examples of this are control-flow normalization and Static Single Assignment (SSA) form.

Control-flow normalization is applied to a program to transform it to a form that is simpler to analyze than a program with arbitrary control flow. An example of this is the removal of GOTO statements from a Fortran program, replacing them with IF statements and looping constructs. This adds information to the program structure, which can be used by the compiler to possibly do a better job of optimizing the program.

Another example is to transform a program into SSA form. In SSA form, each variable is assigned exactly once and is only read thereafter. When a variable in the original program is assigned more than once, it is broken into multiple variables, each of which is assigned once. SSA form has the advantage that whenever a given variable is used, there is only one possible place where it was assigned, so more precise information about the value of the variable is encoded directly into the program form.
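A minimal sketch of this renaming (a constructed example, written as Fortran-like pseudocode; the phi function is the usual SSA notation for merging values at a join point):

      ! original code
      X = A + 1
      IF (X .GT. 0) THEN
         X = X * 2
      ENDIF
      Y = X + 3

      ! SSA form (pseudocode)
      X1 = A + 1
      IF (X1 .GT. 0) THEN
         X2 = X1 * 2
      ENDIF
      X3 = phi(X2, X1)     ! X3 is X2 if the branch was taken, X1 otherwise
      Y1 = X3 + 3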

Gated SSA form: Gated SSA form is a variant of SSA form that includes special conditional expressions (gating expressions) that make the representation more precise. Gated SSA form has been used for flow-sensitive array privatization analysis within a loop. Array privatization analysis requires that the compiler prove that writes to a portion of an array precede all reads of the same section of the array within the loop. When conditional branching happens within the loop, then the gating expressions can help to prove the privatization condition.

5 Enabling Transformations

Like all optimizing compilers, parallelizers and vectorizers consist of a sequence of compilation passes. The program analysis techniques described so far are usually the first set of these passes. They gather information about the source program, which is then used by the transformation passes to make intelligent decisions. The compilers have to decide which transformations can legally and profitably be applied to which code sections and how to best orchestrate these transformations.

For the sake of this description we divide the transformations into two parts, those that enable other techniques and those that perform the actual vectorizing or parallelizing transformations. This division is useful for our presentation, but not strict. Some techniques will be mentioned in both places.

5.1 Dependence Elimination and Avoidance

An important class of enabling transformations deals with eliminating and avoiding data dependences. We will describe data privatization, idiom recognition, and dependence-aware work partitioning techniques.

Data Privatization and Expansion

Data privatization is one of the most important techniques because it directly enables parallelization and it applies very widely. Data privatization can remove anti and output dependences. These so-called storage-related or false dependences are not due to computation having to wait for data values produced by another computation. Instead, the computation must wait because it wants to assign a value to a variable that is still in use by a previous computation. The basic idea is to use a new storage location so that the new assignment does not overwrite the old value too soon. Data privatization does this as shown in Figure 1.

Figure 1: Data Privatization and Expansion.

In the original code, each loop iteration uses the


variable t as a temporary storage. This represents
a dependence, in that each iteration would have to Figure 2: Induction Variable Substitution.
wait until the previous iteration is done using t. In
the sequential execution of the program this order More advanced forms of induction variable substi-
is guaranteed. However, in a parallel execution we tution deal with multiply nested loops, coupled in-
would like to execute all iterations concurrently on duction variables (which are incremented by other in-
different processors. The transformed code simply duction variables), and multiplicative induction vari-
marks t as a privatizable variable. This instructs the ables. The identification of induction variables can be
code-generating compiler pass to place t into the pri- through pattern matching (e.g., the compiler finds
vate storage of each processor - essentially p times statements that modify variables in the described
replicating the variable t, where p is the number of way) or through abstract interpretation (identifying

13
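Since Figure 1 itself is not reproduced in this text version, the following Fortran sketch illustrates the kind of loop the discussion refers to, together with its privatized and expanded forms. The array names A, B, C, the bound n, and the OpenMP-style PRIVATE notation are only illustrative assumptions; an actual parallelizer may use a different private-variable marking.

      ! Original loop: the scalar T creates storage-related dependences
      DO i = 1, n
         T = A(i) + B(i)
         C(i) = T + T**2
      END DO

      ! Privatization: each processor works on its own copy of T
!$OMP PARALLEL DO PRIVATE(T)
      DO i = 1, n
         T = A(i) + B(i)
         C(i) = T + T**2
      END DO

      ! Expansion: the scalar becomes an array T(1:n), indexed by the loop variable
      DO i = 1, n
         T(i) = A(i) + B(i)
         C(i) = T(i) + T(i)**2
      END DO

The same example reappears in vector form in Section 6 under Scalar Expansion.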
Idiom Recognition: Reductions, Inductions, Recurrences

These transformations can remove true (i.e., flow) dependences. The elimination of situations where one computation has to wait for another to produce a needed data value is only possible if we can express the computation in a different way. Hence, the compiler recognizes certain idioms and rewrites them in a form that exhibits more parallelism.

Induction variables: Induction variables are variables that are modified in each loop iteration in such a way that the assumed values can be expressed as a mathematical sequence. Most common are simple, additive induction variables. They get incremented in each loop iteration by a constant value, as shown in Figure 2. In the transformed code the sequence is expressed in a closed form, in terms of the loop variable. The induction statement can then be eliminated, which removes the flow dependence.

Figure 2: Induction Variable Substitution.

More advanced forms of induction variable substitution deal with multiply nested loops, coupled induction variables (which are incremented by other induction variables), and multiplicative induction variables. The identification of induction variables can be through pattern matching (e.g., the compiler finds statements that modify variables in the described way) or through abstract interpretation (identifying the sequence of values assumed by a variable in a loop).
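Figure 2 is not reproduced here; as a sketch of the transformation just described, consider a simple additive induction variable k with an assumed initial value k0 and stride 2 (all names are made up for illustration):

      ! Original loop: k carries a flow dependence from one iteration to the next
      k = k0
      DO i = 1, n
         k = k + 2
         A(k) = B(i)
      END DO

      ! After induction variable substitution: k is expressed in closed form
      DO i = 1, n
         A(k0 + 2*i) = B(i)
      END DO
      k = k0 + 2*n      ! final value of k, in case it is used after the loop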
Reduction operations: Reduction operations abstract the values of an array into a form with lesser dimensionality. The typical example is an array being summed up into a scalar variable. The parallelizing transformation is shown in Figure 3.

Figure 3: Reduction Parallelization.

The idea for enabling parallel execution of this computation exploits mathematical commutativity. We can split the array into p parts, sum them up individually by different processors, and then combine the results. The transformed code has two additional loops, for initializing and combining the partial results. If the size of the main reduction loop (variable n) is large, then these loops are negligible. The main loop is fully parallel.

More advanced forms of this technique deal with array reductions, where the sum operation modifies several elements of an array instead of a scalar. Furthermore, the summation is not the only possible reduction operation. Another important reduction is finding the minimum or maximum value of an array.
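Figure 3 is not reproduced; the following sketch shows the flavor of the transformation for a sum reduction. The helper arrays partial, lo, and hi, and the processor count p, are invented bookkeeping: lo(j) and hi(j) stand for the block of iterations assigned to processor j.

      ! Original reduction loop
      sum = 0.0
      DO i = 1, n
         sum = sum + A(i)
      END DO

      ! Parallelized form
      DO j = 1, p                  ! additional loop: initialize partial sums
         partial(j) = 0.0
      END DO
      DO j = 1, p                  ! main loop: fully parallel across processors
         DO i = lo(j), hi(j)
            partial(j) = partial(j) + A(i)
         END DO
      END DO
      sum = 0.0
      DO j = 1, p                  ! additional loop: combine the partial sums
         sum = sum + partial(j)
      END DO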
Recurrences: Recurrences use the result of one or several previous loop iterations for computing the value of the next iteration. This usually forces a loop to be executed serially. However, for certain forms of linear recurrences, algorithms are known that can be parallelized. For example, in Figure 4 the compiler has recognized a pattern of linear recurrences for which a parallel solver is known. The compiler then replaces this code by a call to a mathematical library that contains the corresponding parallel solver algorithm. This substitution can pay off if the array is large. Many variants of linear recurrences are possible. A large number of library functions need to be made available to the compiler so that an effective substitution can be made in many situations.

Figure 4: Recurrence Substitution.

Correctness and performance of idiom recognition techniques: Idiom recognition and substitution come at a cost. Induction variable substitution replaces an operation with one of higher strength (usually, a multiplication replaces an addition). In fact, this is the reverse operation of strength reduction, an important technique in classical compilers. Parallel reduction operations introduce additional code. If the reduction size is small, the overhead may offset the benefit of parallel execution. This is true to a greater extent for parallel recurrence solvers. The overhead associated with parallel solvers is relatively high and can only be amortized if the recurrence is long. As a rule of thumb, induction and reduction substitution is usually beneficial, whereas the performance of a program with and without recurrence recognition should be more carefully evaluated.

It is also important to note that, although parallel recurrence and reduction solvers perform mathematically correct transformations, there may be round-off errors because of the limited computer representation of floating point numbers. This may lead to inaccurate results in application programs that are numerically very sensitive to reordering operations. Compilers usually provide command line options so that the user can control these transformations.

Dependence-aware Work Partitioning: Skewing, Distribution, Uniformization

The above three techniques are able to remove data dependences and, in this way, generate fully parallel loops. If dependences cannot be removed it may still be possible to generate parallel computation. The computation may be reordered so that expressions that are data dependent on each other are executed by the same processor. Figure 5 shows an example loop and its iteration dependence graph.

By regrouping the iterations of the loop as indicated by the shaded wavefronts in the iteration space graph, all dependences stay within the same processor, where they are enforced by the sequential execution of this processor. This technique is called loop skewing. The class of unimodular transformations contains more general techniques that can reorder loop iterations according to various criteria, such as dependence considerations and locality of reference (locality optimizations will be discussed in Section 7.2).

Figure 5: Partitioning the Iteration Space in “Wavefront” Manner.

Other techniques can find partial parallelism in loops that contain data dependences. For example, loop distribution may split a loop into two loops. One of them contains all dependent statements and must execute serially, while the other one is fully parallel. Another example is dependence uniformization, which tries to find minimum dependence distances. If all dependence distances are greater than a threshold t, then t consecutive iterations can be executed in parallel.

5.2 Enabling and Enhancing Other Transformations

Another class of enabling transformations contains prerequisite techniques for other transformations and techniques that allow others to be applied more effectively. Some transformations belong to both the enabling and enabled techniques. Because of this we will only give an overview. The following two sections will describe details of some of the techniques.

Various transformations require statements to be reordered. This can result in dependence distances getting shorter (the producing and consuming statements of a value are moved closer together), the points of use and reuse of a variable being moved closer together (which improves cache locality), or backwards dependences being turned into forward dependences.

Loop distribution splits loops into two or more loops that can be optimized individually. It also enables vectorization, discussed next. Interchanging two nested loops can help the vectorization techniques, which usually act on the innermost loop. It can also enhance parallelization because moving a parallel loop to an outer position increases the amount of work inside the parallel region.

Splitting a single loop into a nest of two loops is called stripmining or loop blocking. It enables the exploitation of hierarchical parallelism (e.g., the inner loop may then be executed across a multiprocessor, while the outer loop gets executed by a cluster of such multiprocessors). It is also an important cache optimization, as we will discuss.

6 Vectorization: Exploiting Vector Architectures

Vectorizing compilers exploit vector architectures by generating code that performs operations on a number of data elements in a row. This was of great interest in classical supercomputers, which were built as vector architectures. In addition, vectorization has enjoyed renewed interest in modern microprocessors, which can accommodate several short data items in one word. For example, a 64 bit word can accommodate a “vector” of 16 4-bit words. Instructions that operate on vectors of this kind are sometimes referred to as multi-media extensions (MMX).

The objective of a vectorizing compiler is to identify and express such vector operations in a form that can then be easily mapped onto the vector instructions available in these architectures. A simple example is shown in Figure 6. The following transformations aid vectorization in more complex program patterns.

Figure 6: Basic Vectorization.
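Figure 6 is not shown in this text version; the basic pattern is simply an element-wise loop rewritten as a single vector statement (array names invented):

      ! Scalar loop
      DO i = 1, n
         C(i) = A(i) + B(i)
      END DO

      ! Vector form
      C(1:n) = A(1:n) + B(1:n)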
Scalar Expansion

Private variables, introduced in Section 5.1, need to be expanded in order to allow vectorization. The following shows the privatization example of Section 5.1 transformed into vector form.

T(1:n) = A(1:n)+B(1:n)
C(1:n) = T(1:n)+T(1:n)**2

Loop Distribution

A loop containing several statements must first be distributed into several loops before each one can be turned into a vector operation. Loop distribution (also called loop splitting or loop fission) is only possible if there is no dependence in a lexically backward direction. Statements can be reordered to avoid backwards dependences, unless there is a dependence cycle (a forward and a backward dependence that form a loop). Figure 7 shows a loop that is distributed and vectorized. The original loop contains a dependence in a lexically forward direction. Such a dependence does not prevent loop distribution. That is, the execution order of the two dependent statements is maintained in the vectorized code.

Figure 7: Loop Distribution Enables Vectorization.

Handling Conditionals in a Loop

Conditional execution is an issue for vectorization because all elements in a vector are processed in the same way. Figure 8 shows how a conditional execution can be vectorized. The conditional is first evaluated for all vector elements and a vector of true/false values is formed, called the mask. The actual operation is then executed conditionally, based on the value of the mask at each vector position.

Figure 8: Vectorization in the Presence of Conditionals.
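Figure 8 is not reproduced; in Fortran 90 notation the mask idea can be sketched with a WHERE construct, where the comparison produces the vector of true/false values that guards the operation (arrays invented):

      ! Original loop with a conditional
      DO i = 1, n
         IF (A(i) > 0.0) B(i) = B(i) / A(i)
      END DO

      ! Vectorized form: the mask A(1:n) > 0.0 controls each vector position
      WHERE (A(1:n) > 0.0) B(1:n) = B(1:n) / A(1:n)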
Stripmining Vector Lengths

Vector instructions usually take operands of length 2^n, the size of the vector registers. The original loop must be divided into strips of this length. This is called stripmining. In Figure 9, the number of iterations has been broken down into strips of length 32.

Figure 9: Stripmining a Loop into Two Nested Loops.

Vector Code Generation

Finding vectorizable statements in a multiply-nested loop that contains data dependences can be quite difficult in the general case. Algorithms are known that perform this operation in a recursive manner. They move from the outermost to the innermost loop level and test at each level for code sections that can be distributed (i.e., they do not contain dependence cycles) and then vectorized, as described. Code sections with dependence cycles are inspected recursively at inner loop levels.

7 Parallelization: Exploiting Multiprocessors

Parallelizing compilers have been most successful on shared-memory multiprocessors (SMPs). Additional techniques are necessary for transforming programs for distributed-memory multiprocessors (DMPs). In this section we will first describe techniques that apply to both machine classes and then present techniques specific to DMPs.

Although very similar analysis techniques are used, parallelization differs substantially from vectorization. For example, data privatization is expressed by adding the variables to a private list, instead of applying scalar expansion. Loop distribution and stripmining are not a prerequisite, because the computation does not need to be reordered (although this can be done as an optimization, as will be discussed). Conditionals don't need special handling because different processors can directly execute different code sections.

The most important sources of parallelism for multiprocessors are iterations of loops, such as do-loops in Fortran programs and for-loops in C programs. We will present techniques for detecting that loop iterations can correctly and effectively be executed in parallel. Briefly we will also mention techniques for exploiting partial parallelism in loops and in non-loop program constructs.

All parallelizing compiler techniques have to deal with two general issues: (1) they must be provably correct and (2) they must improve the performance of the generated code, relative to a serial execution on one processor. The correctness of techniques is often stated by formally defining data-dependence patterns under which a given transformation is legal. While such correctness proofs exist for most of today's compiler capabilities, they often require the compiler to make conservative assumptions, as described above. The second issue is no less complex. Assessing performance improvement involves the assumption of a machine model. For example, one must assume that a parallel loop will incur a start/terminate overhead. Hence, it will not execute n times faster on an n-processor machine than on one processor. Its parallel execution time is no less than t1/n + toverhead, where t1 is the one-processor execution time and toverhead is the start/terminate overhead. For small loops this can be more than the serial execution time. Unfortunately, even the most advanced compilers sometimes do not have enough information to make such performance predictions. This is because they do not have sufficient information about properties of the target machine and about the program's input data.

7.1 Parallelism Recognition

Exploiting Fully Parallel Loops

Basic parallel code generation for multiprocessors entails identifying loops that have no loop-carried dependences, and then marking these loops as parallelizable. Data-dependence analysis and all its enabling techniques for program analysis and dependence removal are most important in this process. Iterations of parallelizable loops are then assigned to the different processors for execution. This second step may happen through various methods. The parallelizing compiler may be directly coupled with a code-generating compiler that issues the actual machine code. Alternatively, the parallelizer can be a pre-processor, outputting the source program annotated with information about which loops can be executed in parallel. A backend compiler then reads this program form and generates code according to the preprocessor's directives.
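As a small illustration of the pre-processor approach, a parallelizer that has proven the loop below free of loop-carried dependences might simply annotate it with a directive for the backend compiler. The OpenMP-style syntax is used here only as one example of such an annotation; the loop itself is invented.

!$OMP PARALLEL DO
      DO i = 1, n
         A(i) = B(i) + C(i)      ! no loop-carried dependences
      END DO
!$OMP END PARALLEL DO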
Exploiting Partial Loop Parallelism

Partial parallelism can also be exploited in loops with true dependences that cannot be removed. The basic idea is to enforce the original execution order of the dependent program statements. Parallelism is still exploited as described above; however, each dependent statement now waits for a go-ahead signal telling it that the needed data value has been produced by a prior iteration. The successful implementation of this scheme relies on efficient hardware synchronization mechanisms.

Compilers can reduce the waiting time of dependent statements by moving the source and sink of a dependence closer to each other. Statement reordering techniques are important to achieve this effect. In addition, because every synchronization introduces overhead, reducing the number of synchronization points is important. This can be done by eliminating redundant synchronizations (i.e., synchronizations that are covered by other synchronizations) or by serializing a code section. Note that there are many tradeoffs for the compiler to make. For example, it has to decide when it is more profitable to serialize a code section than to execute it in parallel with many synchronizations.

Non-loop Parallelism

Loops are not the only source of parallelism. Straight-line code can be broken up into independent sections, which can then be executed in parallel. For building such parallel sections, a compiler can, for example, group all statements that are mutually data dependent into one section. This results in several sections between which there are no data dependences. Applying this scheme at small basic code blocks is important for instruction-level parallelization, to be discussed later. At a larger scale, such parallel regions could include entire subroutines, which can be assigned to different processors for execution.

More complex is exploiting parallelism in the repetitive pattern of a recursion. Recursion splitting techniques can transform a recursive algorithm into a loop, which can then be analyzed with the already described means.

Non-loop parallelism is important for instruction-level parallelization. It is of lesser importance for multiprocessors because the degree of parallelism in loops is usually much higher than in straight-line code sections. Furthermore, since the most prevalent parallelization technology is found in compilers for non-recursive languages, such as Fortran, there has not been a pressing need to deal with recursive program patterns.

7.2 Parallel Loop Restructuring

Once parallel loops are detected, there are several loop transformations that can optimize the program such that it (1) exploits the available resources in an optimal way and (2) minimizes overheads.

Increasing Granularity

A parallel computation usually incurs an overhead when starting and terminating. For example, starting and ending a parallel loop comes at a runtime cost sometimes referred to as loop fork/join overhead. The larger the computation in the loop, the better this overhead can be amortized.

The techniques loop fusion, loop coalescing, and loop interchange can all increase the granularity of parallel loops by increasing the computation between the fork and join points. Each transformation comes with potential overhead, which must be considered in the profitability decision of the compiler.

Loop fusion combines two adjacent loops into a single loop. It is the reverse transformation of loop distribution and is subject to similar legality considerations. Fusion is straightforward if the loop bounds of the two candidates match, as shown in Figure 10. Several techniques can adjust these bounds if necessary. The compiler may peel iterations (split off a number of iterations into a separate loop), reverse iterations (the loop iterates from upper to lower bound), or normalize the loops (the loop iterates from 0 to some new upper bound with a stride of one). These adjustments may cause overhead because they introduce new loops (loop peeling) or may lead to more complex subscript expressions.

Figure 10: Fusing Two Loops into One.
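Figure 10 is not reproduced; with matching loop bounds, fusion looks roughly as follows (arrays invented). Note that the value of A(i) produced by the first statement is still consumed after its definition, so the dependence between the original loops is preserved.

      ! Before fusion: two fork/join points
      DO i = 1, n
         A(i) = B(i) + 1.0
      END DO
      DO i = 1, n
         C(i) = A(i) * 2.0
      END DO

      ! After fusion: one fork/join point, and A(i) is likely still in cache
      DO i = 1, n
         A(i) = B(i) + 1.0
         C(i) = A(i) * 2.0
      END DO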
Loop coalescing merges two nested loops into a single loop. Figure 11 shows an example. This transformation has additional benefits, such as increasing the number of iterations for better load balancing and exploiting two levels of loop parallelism even if the underlying machine supports only one level. Loop coalescing introduces overhead because it needs to introduce additional expressions for reconstructing the original loop index variables from the index of the combined loop. Again, benefits and overheads must be compared by the compiler.

Figure 11: Coalescing Two Nested Loops.

Loop interchanging can increase the granularity significantly by moving an inner parallel loop to an outer position in a loop nest. This technique is shown in Figure 12. As a result, the loop fork/join overhead is only incurred once overall instead of once per iteration of the outer loop. Loop interchanging is also subject to legality considerations, which are formulated as rules on data dependence patterns that permit or disallow the transformation. For example, interchange is illegal if a dependence that is carried by the outer loop goes from a later iteration of the inner loop to an earlier iteration of the same loop. The interchanged loop would reverse the order of the two dependent iterations. In data dependence terminology, one cannot interchange two loops if the dependence with respect to the outer loop has a forward direction (<) while the dependence with respect to the inner loop has a backward direction (>).

Figure 12: Moving an Inner Parallel Loop to an Outer Position.
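Figure 12 is not reproduced; the following sketch (invented arrays) conveys the idea. In the original nest only the inner j loop is parallel, because the i loop carries a dependence through A(i-1,j); after interchanging, the parallel loop is outermost and the fork/join cost is paid once rather than in every i iteration.

      ! Original: parallel loop inside a serial loop
      DO i = 2, n              ! serial (dependence on A(i-1,j))
         DO j = 1, m           ! parallel
            A(i,j) = A(i-1,j) + B(i,j)
         END DO
      END DO

      ! After interchange: parallel loop outermost
      DO j = 1, m              ! parallel
         DO i = 2, n           ! serial within each j
            A(i,j) = A(i-1,j) + B(i,j)
         END DO
      END DO

The interchange is legal here because the dependence is carried only by the i loop; it also happens to give the innermost loop stride-1 accesses to A in Fortran, which matters for the cache optimizations discussed below.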
If the granularity of a parallel loop cannot be increased above the profitability threshold, it is better to execute the loop serially. Compile-time performance estimation capabilities are critical for this purpose. They rely on all the program analysis techniques that can determine the values assumed by certain variables, such as loop bounds. They also include machine parameters, for example the profitability threshold for parallel loops, of which the loop fork/join overhead is an important factor. In general, it is not possible to evaluate the profitability at compile time. One solution to this problem is that the compiler formulates conditional expressions that will be evaluated at runtime and decide when the parallel execution is profitable. This can be implemented through two-version loops: the conditional expression forms the condition of an IF statement, which chooses between the parallel and serial versions of the loop.

Reducing Memory Latency

Techniques to reduce or hide memory access latencies are increasingly important because the speed of computation in modern processors increases more rapidly than the speed of memory accesses. The primary hardware mechanism that supports latency reductions is the cache. It keeps copies of memory cells in fast storage, close to the processor, so that repeated accesses incur much lower latencies (which is referred to as temporal locality). In addition, caches fetch multiple words from memory in one transfer, so that accesses to adjacent memory elements hit in cache as well (spatial locality). Compiler techniques try to reorder computation so that the temporal and spatial locality of the program is increased. While this is already important in compilers for single-processor machines, there are additional considerations in multiprocessors. This is because of the need to keep multiple caches coherent and because of the interaction between locality and parallelism.

Loop interchange is one of the most effective transformations for increasing spatial locality. It can change the order of the computation so that it performs stride-1 references. That is, adjacent iterations access adjacent memory cells. Note that this transformation may be different for different programming languages. For example, Fortran programs place the left-most dimension of an array in contiguous memory (referred to as column major order), whereas C programs use row major order. Therefore, Fortran compilers will try to move the loop that accesses the left-most dimension of an array to the innermost position in a loop nest. C compilers would do the same with the right-most dimension. The loop interchange example shown above achieves this effect as well. However, note that the two goals of obtaining stride-1 references and increasing granularity may conflict. In this case, the compiler will have to estimate and compare the performance of both program variants.

Another important cache optimization technique is loop blocking, which is basically the same as stripmining, introduced above. By dividing a computation into several blocks, we can reorder it so that the use and reuse of a data item are moved closer to each other. It is then more likely that the item is still in cache and has not been evicted by other computation before it is reused. Hence, loop blocking can increase temporal cache reuse. Figure 13 gives an example of this transformation.

Figure 13: Loop Blocking to Increase Cache Locality.
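Figure 13 is not reproduced; a generic sketch of blocking, with an invented block size nb, is the following. In the original nest the whole array A is swept for every j and may be evicted between sweeps; in the blocked version each strip of A is reused across all j while it is still in cache. The arithmetic is the same up to the reordering of the summation.

      ! Original
      DO j = 1, m
         DO i = 1, n
            C(j) = C(j) + A(i) * B(i,j)
         END DO
      END DO

      ! Blocked (stripmined) version
      DO ib = 1, n, nb
         DO j = 1, m
            DO i = ib, MIN(ib+nb-1, n)
               C(j) = C(j) + A(i) * B(i,j)
            END DO
         END DO
      END DO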
Loop tiling is a more general form of reordering computation that can increase cache reuse. The iteration space of a multiply-nested loop is divided into a number of tiles. Each tile is then executed by one processor. Tiling has several goals. In addition to increasing cache locality it can partition the computation according to the dependence structure in order to identify parallel loops. This was described above as dependence-aware work partitioning.

Other transformations can influence cache locality. For example, loop fusion binds pairs of iterations from adjacent loops to each other. They are then guaranteed to execute on the same processor. This can increase cache reuse across the two loops.

Loop distribution can be an enabling transformation for cache optimization. For example, the code in Figure 14 shows a non-perfectly-nested loop that is distributed into a single loop and a perfectly-nested double loop. The loop nest is then interchanged to obtain stride-1 accesses.

Figure 14: Loop Distribution Enables Interchange.

There are more advanced techniques to reduce memory latency, which are less widely used and not generally supported by today's computer architectures. They include compiler cache management and prefetch operations. Software cache management controls cache operations explicitly. The compiler inserts instructions that flush the cache content, select cache strategies, and control which data sections are cached. Prefetch operations aim at transferring data into cache or into a dedicated prefetch buffer before the executing program requests it, so that the data is available in fast storage when needed.

Multi-level Parallelization

Most multiprocessors today offer one-level parallelism. That means that, usually, a single loop out of a loop nest can be executed in parallel. Architectures that have a hierarchical structure can exploit two or more levels of parallelism. For example, a cluster of multiprocessors may be able to execute two nested parallel loops, the outer one across the clusters while the inner loop employs the processors within each cluster.

Program sections that contain singly-nested loops can be turned into two-level parallelism by stripmining them into two nested loops, as shown in Figure 15.

Figure 15: Stripmining Enables 2-level Parallelism.

7.3 Scheduling

After parallelism is detected and loops are restructured for optimal performance, there is still the issue of defining an execution order and assigning parallel activities to processors. This must be done in a way that (1) balances the load, (2) performs computation where the necessary resources are, and (3) considers the environment. Scheduling decisions can be made at compile time (statically) or at runtime (dynamically). Both methods have advantages and disadvantages.

Load balancing is the primary reason for dynamic scheduling. Static scheduling methods typically split up the number of loop iterations into equal chunks and assign them to the different processors. This works well if the loop iterations are equal in size. However, this is not the case in loops that contain conditional statements or in an inner loop of a nest whose number of iterations depends on an outer loop variable. Dynamic scheduling methods assign loop iterations to processors, a chunk at a time. The chunk can contain one or more iterations. Of special interest are also scheduling schemes that vary the chunk size, such as trapezoidal scheduling and guided self-scheduling methods. Dynamic scheduling methods come with some runtime overhead for performing the scheduling action. However, this is often negligible compared to the gain from load balancing.

The goal of computing where the resources are, on the other hand, favors static scheduling methods. In heterogeneous systems it is mandatory to perform the computation where the necessary processor capabilities or input/output devices are. In multiprocessors, data are critical resources, whose location may determine the best executing processor. For example, if a data item is known to be in the cache of a specific processor, then it is best to execute other computations accessing the same data item on the same processor as well. The compiler has knowledge of which computation accesses which data in the future. Hence, such scheduling decisions are good to make at compile time. In distributed-memory systems this situation is even more pronounced. Accessing a data item on a processor other than its owner involves communication with high latencies.

Scheduling decisions also depend on the environment of the machine. For example, in a single-user environment it may be best to statically schedule computation that is known to execute evenly. However, if the same program is executed in a multi-user environment, the load of the processors can be very uneven, making dynamic scheduling methods the better option.

7.4 Techniques Specific to DMPs

Distributed-memory multiprocessors do not provide a shared address space. Many of the techniques described so far assume that all computation can see the necessary data, no matter where it is performed. In DMPs this is no longer the case. The compiler needs to distribute the program's data onto several compute nodes, and data items that are needed by nodes other than their home node need to be communicated by sending and receiving messages. This creates two major new tasks for the compiler: (1) finding a good data partitioning and distribution scheme, and (2) orchestrating the communication between the nodes in an optimal way.

Data Partitioning and Distribution

The goal of data partitioning and distribution is to place each data item on the compute node that accesses it most frequently. Data partitioning and distribution are often performed as two or more steps in a compiler. For simplicity we describe them here as one data distribution step. Several issues need to be resolved. First, the proper units of data distribution need to be determined. Typically these are sections of arrays. Second, data access costs and frequencies need to be estimated to compare distribution alternatives. Third, if there are program sections with different access patterns, redistribution may need to be considered at runtime.

The simplest data distribution scheme is block distribution, which divides an array into p sections, where p is the number of processors. Block distribution creates contiguous array sections on each processor. If adjacent processors access adjacent array elements, then cyclic distribution is appropriate. Block-cyclic distribution is a combination of both schemes. For irregular access patterns, indexed distribution may be most appropriate. Figure 16 illustrates these distribution schemes.

Figure 16: Data Distribution Schemes. Numbers indicate the node of a 4-processor DMP on which the array section is placed.
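Figure 16 is not reproduced; for a 16-element array distributed over 4 nodes (node numbering is arbitrary here, and a block size of 2 is assumed for the block-cyclic case), the schemes assign elements roughly as follows:

      Element:        1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
      Block:          1  1  1  1  2  2  2  2  3  3  3  3  4  4  4  4
      Cyclic:         1  2  3  4  1  2  3  4  1  2  3  4  1  2  3  4
      Block-cyclic:   1  1  2  2  3  3  4  4  1  1  2  2  3  3  4  4
      Indexed:        whichever node a user-supplied index array names for each element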
A major difficulty is that data distribution decisions affect all program sections accessing a given data element. Hence, global optimizations need to be performed, which can be algorithmically complex. Furthermore, compile-time information about array accesses is often incomplete. Better global optimization would require knowledge of the program input data. Also, distribution decisions cannot be made in isolation. They need to factor in available parallelism and the cost of messages, described next. Finally, indexed distribution, although most flexible, may incur additional overhead because the index array may need to be distributed itself, causing double latencies for each array access. Because of all this, developing compiler techniques for automatic data partitioning and distribution is still an active research topic. Many current parallel programming approaches assume that the user assists the compiler in this process.

Message Generation

Once the owner processor for each data item is identified, the compiler needs to determine which accesses are from remote processors and then insert communication to and from these processors. The most common form of communication is to send and receive messages, which can communicate one or several data elements between two specific processors. More complex communication primitives are also known, such as broadcasts (send to all processors) and reductions (receive from all processors and combine the results in some form).

The basic idea of message generation is simple. Each statement needs to communicate to/from remote processors the accessed data elements that are not local. Often, the owner-computes principle is assumed. For example, assignment statements are executed on the processor that owns the left-hand-side element. (Note that this execution scheme is different from the one assumed for SMPs, where an entire loop iteration is executed by one and the same processor.) Data partitioning information hence supplies both the information about which processor executes which statement and what data is local/remote. Figure 17 shows the basic messages generated. It assumes a block partitioning of both arrays, A and B.

Figure 17: Generating Messages for Data Exchange in a Distributed-Memory Machine.

Although generating messages in this scheme would lead to a functionally correct program, this program may be inefficient. To increase the efficiency, the compiler needs to aggregate communication. That is, messages generated for individual statements need to be combined into a larger message. Also, messages may be moved to an earlier point in the instruction stream, so that communication latencies can be overlapped with computation. Message aggregation for the general block-cyclic distribution is already rather complex. It is made even more difficult because message sizes may only be known in the form of symbolic expressions. For indexed distributions, support through “communication libraries for irregular computation” has been an active research topic. In general, compilers that deal with the issues of message generation for DMPs are highly sophisticated and complex.
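Figure 17 is not reproduced. The following pseudo-Fortran sketch conveys the flavor of owner-computes message generation for a statement A(i) = B(i-1) under block distribution. Everything machine-specific is invented: send and recv stand for whatever message-passing primitives the target system offers, lo(k) and hi(k) denote the block owned by node k, my_node is the executing node, and p is the number of nodes.

      ! Global loop:  DO i = 2, n
      !                  A(i) = B(i-1)
      !               END DO
      ! Owner-computes: node my_node executes the iterations it owns.
      ! The only non-local operand is B(lo(my_node)-1), owned by the left neighbor.

      IF (my_node < p) CALL send(B(hi(my_node)), my_node+1)   ! to right neighbor
      IF (my_node > 1) CALL recv(b_left, my_node-1)           ! from left neighbor

      DO i = MAX(2, lo(my_node)), hi(my_node)
         IF (i == lo(my_node)) THEN
            A(i) = b_left          ! value received from the neighbor
         ELSE
            A(i) = B(i-1)          ! locally owned value
         END IF
      END DO

In this simple form the communication has already been hoisted out of the loop; the aggregation and hoisting issues discussed next become much harder for block-cyclic and indexed distributions.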
8 Exploiting Parallelism at the Instruction Level

Instruction-level parallelism (ILP) refers to the processor's capability to execute several instructions at the same time. Instruction-level parallelism can be exploited implicitly by the processor without the compiler issuing special directives or instructions to the hardware. Or, the compiler can extract parallelism explicitly and express it in the generated code. Examples of the latter type of generated code are Very-Long Instruction Word (VLIW) or Explicitly Parallel Instruction Computing (EPIC) architectures. In addition to the techniques presented here, all or most techniques known from classical compilers are important and are often applied as a first set of transformations and analysis passes in ILP compilers (see “Program Compilers”).

8.1 Implicit Instruction-level Parallelism

For implicit instruction-level parallelism, the compiler-generated code can be the same as for single-issue machines. However, knowing the processor's ILP mechanisms, the compiler can change the code so that the processor can do a more effective job. Three categories of techniques are important for this objective: (1) scheduling instructions, (2) removing dependences, and (3) increasing the window size within which the processor can exploit parallel instructions.

Instruction Scheduling

Modern processors exploit ILP by starting a new instruction before an earlier instruction has completed. All instructions begin their execution in the order defined by the program. This order can have a significant performance impact. Hence, an important task of the compiler is to define a good order. It does this by moving instructions that incur long latencies prior to those with short latencies. Such instruction scheduling is subject to data dependence constraints. For example, an instruction consuming a value in a register or memory must not be moved before an instruction producing this value. Instruction scheduling is a well-established technology, discussed in standard compiler textbooks. We refer the reader to such literature for further details.

Removing Dependences

The basic patterns for removing dependences are similar to the ones discussed for loop-level parallelism. Anti dependences can be removed through variable renaming techniques. In addition, register renaming becomes important. It avoids conflicts between potentially parallel sets of instructions that make use of the same register for storing temporary values. Such techniques can be opposite from good register allocation in sequential instruction streams, where non-overlapping lifetimes of different variables are assigned to the same register. Because of this, the compiler may rely on hardware register-renaming capabilities available in the processor.

Similar to the induction variable recognition technique discussed above, the compiler can replace incremental operations through operations that use operands available at the beginning of a parallel code section. Likewise, a sequence of sum operations may be replaced by sum operations into a temporary variable, followed by an update step at the end of the parallel region. Figure 18 illustrates these transformations.

Figure 18: Dependence-Removing Transformation for Instruction-Level Parallelism. Shaded blocks of instructions are independent of each other and can be executed in parallel.

Increasing the Window Size

A large window of instructions within which the processor can discover and exploit ILP is important for two reasons. First, it leads to more opportunities for parallelism and, second, it reduces the relative cost of starting the instruction pipeline, which usually happens at window boundaries.

Window boundaries are typically branch instructions. Instruction analysis and parallel execution cannot easily cross branch instructions because the processor does not know what instructions will execute after the branch until it is reached. For example, an instruction may only execute on the true branch of a conditional jump. Hence, although the instruction could be executed in parallel with another instruction before the branch, this is not feasible. If the false branch is taken, the instruction might have written an undesired value to memory. Even if all side effects of such instructions were kept in temporary storage, they may raise exceptions that could incorrectly abort program execution.

There are several techniques for increasing the window size. Code motion techniques can move instructions across branches under certain conditions. For example, if the same instruction appears on both branches it can be moved before the branch, subject to data dependence considerations. In this way, the basic block on the critical path can be increased or another basic block can be completely removed.

Instruction predication can also remove a branch by assigning the branch condition to a mask, which then guards the merged statements of both branches. Figure 19 shows an example. Predicated execution needs hardware support, and the compiler must trade off the benefits of enhanced ILP against the overhead of more executed instructions.

Figure 19: Generating Predicated Code.
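Figure 19 is not reproduced; at the source level, the effect of predication can be sketched with Fortran's MERGE intrinsic, which computes both alternatives and lets the condition act as the predicate that selects the result (variable names invented). On a real machine this corresponds to predicated or conditional-move instructions rather than source code.

      ! Branching version
      IF (A(i) > 0.0) THEN
         X(i) = Y(i) + 1.0
      ELSE
         X(i) = Y(i) - 1.0
      END IF

      ! Branch-free, predicated version
      X(i) = MERGE(Y(i) + 1.0, Y(i) - 1.0, A(i) > 0.0)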
Basically, ILP is exploited in straight-line code sections. Additional performance can be gained when exploiting parallelism across loop iterations. A straightforward way to achieve this is to unroll loop iterations, that is, to replicate the loop body by a factor n (and divide the number of iterations by the same factor). Many techniques that were discussed under loop optimizations for multiprocessors are applicable to this case as well.

8.2 Explicit Instruction-level Parallelism

The basic techniques for removing dependences and increasing the window size are important for explicit ILP as well. However, the goal is no longer to expose more parallelism to the hardware detection mechanisms, but to make parallelism more analyzable by the compiler itself. As an example of explicit ILP we choose a VLIW architecture and a software pipelining technique. The goal is to find a repetitive instruction pattern in a loop that can be mapped efficiently to a sequence of VLIW instructions. It is called pipelining because the execution may “borrow” instructions from earlier or later iterations to fill an efficient schedule. Hence the execution of parts of different iterations overlap with each other. This is illustrated in Figure 20. The reader may notice that there would be a conflict in register use between the overlapping loop iterations. For example, in the same VLIW instruction s4 uses R0's value of one loop iteration while s2 uses R0's value belonging to the next iteration. Both software and hardware register renaming techniques are known to resolve this problem.

Figure 20: Translating a Loop in a Software Pipelining Scheme.

Efficient parallelism at the instruction level depends on the compiler's ability to identify code sequences that are executed with high probability. The instructions of these code sequences must then be reordered so that the processor's functional units are exploited in an optimal way. The two tasks are sometimes called trace selection and trace compaction, respectively. Trace selection is a global optimization in that it looks for code sequences across multiple branches, factoring in estimated branch frequencies. Trace compaction can move instructions across branches, whereby it may add bookkeeping code to ensure correct execution if a predicted branch is not taken. The two techniques together are referred to as trace scheduling.

9 Compiler-internal Concerns

Compiler developers have to resolve a number of issues other than designing analysis and transformation techniques. These issues become important when creating a complete compiler implementation, in which the described techniques are integrated into a user-friendly tool. Several compiler research infrastructures have played pioneering roles in this regard. Among them are the Parafrase [1, 2], PFC [3], PTRAN [4], ParaScope [5], Polaris [6], and SUIF [7] compilers. The following paragraphs describe a number of issues that have to be addressed by such infrastructures. An adequate compiler-internal representation of the program must be chosen, the large number of transformation passes need to be put in the proper order, and decisions have to be made about where to apply which transformation, so as to maximize the benefits but keep the compile time within bounds. The user interface of the compiler is important as well. Optimizing compilers typically come with a large set of command-line flags, which should be presented in a form that makes them as easy to use as possible.

Internal Representation

A large variety of compiler-internal program representations (IRs) are in use. They differ with respect to the level of program translation and the type of program analysis information that is implicitly represented. Several IRs may be used for several phases of the compilation. The syntax tree IR represents the program at a level that is close to the original program. At the other end of the spectrum are representations close to the generated machine code. An example of an IR in between these extremes is the register transfer language, which is used by the widely-available GNU C compiler. Source-level transformations, such as loop analysis and transformations, are usually applied on an IR at the level of the syntax tree, whereas instruction-level transformations are applied on an IR that is closer to the generated machine code. Examples of representations that include analysis information are the static single assignment form (SSA) and the program dependence graph (PDG). SSA was introduced in Section 4. The PDG includes information about both data dependences and control dependences in a common representation. It facilitates transformations that need to deal with both types of dependences at the same time.

Phase Ordering

Many compiler techniques are applied in an obvious order. Data dependence analysis needs to come before parallel loop recognition, and so do dependence-removing transformations. Other techniques mutually influence each other. We have introduced several techniques where this situation occurs. For example, loop blocking for cache locality also involved loop interchanging, and loop interchanging was made possible through loop distribution. There are many situations where the order of transformations is not easy to determine. One possible solution is for the compiler to generate internally a large number of program variants and then estimate their performance. We have already described the difficulty of performance estimation. In addition, generating a large number of program variants may get prohibitively expensive in terms of both compiler execution time and space need. Practical solutions to the phase ordering problem are based on heuristics and ad-hoc strategies. Finding better solutions is still a research issue. One approach is for the compiler to decide on the applicability of several transformations at once. This is the goal of unimodular transformations, which can determine a best combination of iteration-reordering techniques, subject to data dependence constraints.

Applying Transformations at the Right Place

One of the most difficult problems for compilers is to decide when and where to apply a specific technique. In addition to the phase ordering problem, there is the issue that most transformations can have a negative performance impact if applied to the wrong program section. For example, a very small loop may run slower in parallel than serially. Interchanging two loops may increase the parallel granularity but reduce data locality. Stripmining for multi-level parallelism may introduce more overhead than benefit if the loop has a small number of iterations. This difficulty is increased by the fact that machine architectures are getting more complex, requiring specialized compiler transformations in many situations. Furthermore, an increasing number of compiler techniques are being developed that apply to a specific program pattern, but not in general. For reasons discussed before, the compiler does not always have sufficient information about the program input data and machine parameters to make optimal decisions.

Speed versus Degree of Optimization

Ordinary compilers transform medium-size programs in a few seconds. This is not so for parallelizing compilers. Advanced program analysis methods, such as data dependence analysis and symbolic range analysis, may take significantly longer. In addition, as mentioned above, compilers may need to create several optimization variants of a program and then pick the one with the best estimated performance. This can further multiply the compilation time. It raises a new issue in that the compiler now needs to make decisions about which program sections to optimize to the fullest of its capabilities and where to save compilation time. One way of resolving this issue is to pass the decision on to the user, in the form of command line flags.

Compiler Command Line Flags

Ideally, from the user's point of view, and as an ultimate research goal, a compiler would not need any command line flags. It would make all decisions about where to apply which optimization technique fully automatically. Today's compilers are far from this goal. Compiler flags can be seen as one way for the compiler to gather additional knowledge that is unavailable in the program source code. They may supply information that otherwise would come from program input data (e.g., the most frequently executed program sections), from the machine environment (e.g., the cache size), or from application needs (e.g., the degree of permitted roundoff error). They may also express user preferences (e.g., compilation speed versus degree of optimization). A parallelizing compiler can include several tens of command line options. Reducing this number can be seen as an important goal for the future generation of vectorizing and parallelizing compilers.

References

[1] D. J. Kuck, R. H. Kuhn, B. Leasure, and M. Wolfe, “The structure of an advanced vectorizer for pipelined processors,” Proc. of COMPSAC 80, The 4th Int'l. Computer Software and Applications Conference, 1980, pages 709–715.

[2] C. Polychronopoulos, M. Girkar, M. R. Haghighat, C.-L. Lee, B. Leung, and D. Schouten, “Parafrase-2: a new generation parallelizing compiler,” Proc. of the 1989 Int'l. Conference on Parallel Processing, St. Charles, Ill., 1989, Volume II, pages 39–48.

[3] J. R. Allen and K. Kennedy, “PFC: a program to convert Fortran to parallel form,” in K. Hwang (ed.), Supercomputers: Design and Applications, IEEE Computer Society Press, 1985, pages 186–205.

[4] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante, “An overview of the PTRAN analysis system for multiprocessing,” Proc. of the Int'l Conf. on Supercomputing, 1987, pages 194–211.

[5] V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok, “The ParaScope editor: an interactive parallel programming tool,” Proc. of the Int'l Conf. on Supercomputing, 1989, pages 540–550.

[6] W. Blume, R. Doallo, R. Eigenmann, J. Grout, J. Hoeflinger, T. Lawrence, J. Lee, D. Padua, Y. Paek, B. Pottenger, L. Rauchwerger, and Peng Tu, “Parallel Programming with Polaris,” IEEE Computer, Volume 29, Number 12, pages 78–82, December 1996.

[7] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam, “Maximizing multiprocessor performance with the SUIF compiler,” IEEE Computer, Volume 29, Number 12, pages 84–89, December 1996.
Further Reading

• U. Banerjee, R. Eigenmann, A. Nicolau, and D. Padua, “Automatic Program Parallelization,” Proceedings of the IEEE, 81(2), pages 211–243, February 1993.

• B. R. Rau and J. A. Fisher, “Instruction-Level Parallel Processing: History, Overview, and Perspective,” The Journal of Supercomputing, 7, 9–50, 1993.

• G. Almasi and A. Gottlieb, “Highly Parallel Computing,” The Benjamin/Cummings Publishing Company, Inc., 1994.

• U. Banerjee, “Dependence Analysis,” Kluwer Academic Publishers, Boston, Mass., 1997.

• J. Hennessy and D. Patterson, “Computer Architecture: A Quantitative Approach,” Morgan Kaufmann Publishers, 1996.

• K. Kennedy, “Advanced Compiling for High Performance,” Morgan Kaufmann Publishers, 2001.

• D. J. Kuck, “High Performance Computing, Challenges for Future Systems,” Oxford University Press, New York, 1996.

• M. Wolfe, “High-Performance Compilers for Parallel Computing,” Addison-Wesley, 1996.

• H. Zima, “Supercompilers for Parallel and Vector Computers,” Addison-Wesley, 1991.