Module 7 Notes Parallelizing-Vectorizing
…puters with vector instructions and multi-processors, compilers were created to convert serial programs for use with these machines. Such compilers, called vectorizing and parallelizing compilers, attempt to relieve the programmer from dealing with the machine details. They allow the programmer to concentrate on solving the object problem, while the compiler concerns itself with the complexities of the machine. Much more sophisticated analysis is required of the compiler to generate efficient machine code for these types of machines.

Vector processors provide instructions that load a series of numbers for each operand of a given operation, then perform the operation on the whole series. This can be done in pipelined fashion, similar to operations done by an assembly line, which is faster than doing the operation on each item separately. Parallel processors offer the opportunity to do multiple operations at the same time on the different processors.

This article will attempt to give the reader an overview of the vast field of vectorizing and parallelizing compilers. In Section 2, we will review the architecture of high performance computers. In Section 3 we will cover the principal ways in which high-performance machines are programmed. In Section 4, we will delve into the analysis techniques that help parallelizing and vectorizing compilers optimize programs. In Section 5, we discuss techniques that transform program code in ways that can enable improved vectorization or parallelization. Sections 6 and 7 discuss the generation of vector instructions and parallel regions, respectively, and the issues surrounding them. Finally, Section 9 discusses a number of important compiler-internal issues.

2 Parallel Machines

2.1 Classifying Machines

Many different terms have been devised for classifying high performance machines. In a 1966 paper, M.J. Flynn divided computers into four classifications, based on the instruction and data streams used in the machine. These classifications have proven useful, and are still in use today:

1. Single instruction stream - single data stream (SISD) - these are single processor machines.

2. Single instruction stream - multiple data streams (SIMD) - these machines have two or more processors that all execute the same instruction at the same time, but on separate data. Typically, a SIMD machine uses a SISD machine as a host, to broadcast the instructions to be executed.

3. Multiple instruction streams - single data stream (MISD) - no whole machine of this type has ever been built, but if it were, it would have multiple processors, all operating on the same data. This is similar to the idea of pipelining, where different pipeline stages operate in sequence on a single data stream.

4. Multiple instruction streams - multiple data streams (MIMD) - these machines have two or more processors that can all execute different programs and operate on their own data.

Another way to classify multiprocessor computers is according to how the programmer can think of the memory system. Shared-memory multiprocessors (SMPs) are machines in which any processor can access the contents of any memory location by simply issuing its memory address. Shared-memory machines can be thought of as having a shared memory unit accessible to every processor. The memory unit can be connected to the machine through a bus (a set of wires and a control unit that allows only a single device to connect to a single processor at one time), or an interconnection network (a collection of wires and control units, allowing multiple data transfers at one time). If the hardware allows nearly equal access time to all of memory for each processor, these machines can be called uniform memory access (UMA) computers.

Distributed-memory multiprocessors (DMPs) use processors that each have their own local memory, inaccessible to other processors. To move data from one processor to another, a message containing the data must be sent between the processors. Distributed memory machines have frequently been called multicomputers.

Distributed shared memory (DSM) machines use a combined model, in which each processor has a separate memory, but special hardware and/or software is used to retrieve data from the memory of another processor. Since in these machines it is faster for a processor to access data in its own memory than to access data in another processor's memory, these machines are frequently called non-uniform memory access (NUMA) computers. NUMA machines may be further divided into two categories - those in which cache-coherence is maintained between processors (cc-NUMA) and those in which cache-coherence is not maintained (nc-NUMA).
2.2 Parallel Computer Architectures

People have experimented with quite a few types of architectures for high-performance computers. There has been a constantly re-adjusting balance between ease of implementation and high performance.

SIMD Machines

The earliest parallel machines were SIMD machines. SIMD machines have a large number of very simple slave processors controlled by a sequential host or master processor. The slave processors each contain a portion of the data for the program. The master processor executes a user's program until it encounters a parallel instruction. At that time, the master processor broadcasts the instruction to all the slave processors, which then execute the instruction on their data. The master processor typically applies a bit-mask to the slave processors. If the bit-mask entry for a particular slave processor is 0, then that processor does not execute the instruction on its data. The set of slave processors is also called an attached array processor because it can be built into a single unit and attached as a performance upgrade to a uniprocessor.

An early example of a SIMD machine was the Illiac IV, built at the University of Illinois during the late 1960s and early 1970s. The final configuration had 64 processors, one-quarter of the 256 originally planned. It was the world's fastest computer throughout its lifetime, from 1975 to 1981. Examples of SIMD machines from the 1980s were the Connection Machine from Thinking Machines Corporation, introduced in 1985, and its follow-on, the CM-2, which contained 64K processors, introduced in 1987.

Vector Machines

A vector machine has a specialized instruction set with vector operations and usually a set of vector registers, each of which can contain a large number of floating point values (up to 128). With a single instruction, it applies an operation to all the floating point numbers in a vector register. The processor of a vector machine is typically pipelined, so that the different stages of applying the operation to the vector of values overlap. This also avoids the overheads associated with loop constructs. A scalar processor would have to apply the operation to each data value in a loop.

The first vector machines were the Control Data Corporation (CDC) Star-100, and the Texas Instruments ASC, built in the early 1970s. These machines did not have vector registers, but rather loaded data directly from memory to the processor. The first commercially successful vector machine was the Cray Research Cray-1. It used vector registers and paired a fast vector unit with a fast scalar unit. In the 1980s, CDC built the Cyber 205 as a follow-on to the Star-100, and three Japanese companies, NEC, Hitachi and Fujitsu, built vector machines. These three companies continued manufacturing vector machines through the 1990s.

Shared Memory Machines

In a shared memory multi-processor, each processor can access the value of any shared address by simply issuing the address. Two principal hardware schemes have been used to implement this. In the first (called centralized shared memory), the processors are connected to the shared memory via either a system bus or an interconnection network. The memory bus is the cheapest way to connect processors to make a shared memory system. However, the bus becomes a bottleneck since only one device may use it at a time. An interconnection network has more inherent parallelism, but involves more expensive hardware. In the second (called distributed shared-memory), each processor has a local memory, and whenever a processor issues the address of a memory location not in its local memory, special hardware is activated to fetch the value from the remote memory that contains it.

The Sperry Rand 1108 was an early centralized shared memory computer, built in the mid 1960s. It could be configured with up to three processors plus two input/output controller processors. In the 1970s, Carnegie Mellon University built the C.mmp as a research machine, connecting 16 minicomputers (PDP-11s) to 16 memory units through a crossbar interconnection network. Several companies built bus-based centralized shared memory computers during the 1980s, including Alliant, Convex, Sequent and Encore. The 1990s saw fewer machines of this type introduced. A prominent manufacturer was Silicon Graphics Inc. (SGI), which produced the Challenge and Power Challenge systems in that period.

During the 1980s and 1990s, several research machines explored the distributed shared memory architecture. The Cedar machine built at the University of Illinois in the late 1980s connected a number of bus-based multiprocessors (called clusters) with an interconnection network to a global memory. The global memory modules contained special synchronization processors which allowed clusters to synchronize. The Stanford DASH, built in the early 1990s, also employed a two-level architecture, but
added a cache-coherence mechanism. One node of the DASH was a bus-based multiprocessor with a local memory. A collection of these nodes were connected together in a mesh. When a processor referred to a memory location not contained within the local node, the node's directory was consulted to determine the remote location. The MIT Alewife project also produced a directory-based machine. A prominent directory-based commercial machine was the Origin 2000 from SGI.

Distributed Memory Multiprocessors

…processor's memory to another during the run of a program. The term attraction memory has been used to describe the tendency of data to migrate toward the processor that uses it the most. Theoretically, this can minimize the latency to access data, since latency increases as the data get further from the processor.

The COMA idea was introduced by a team at the Swedish Institute of Computer Science, working on the Data Diffusion Machine. The idea was commercialized by Kendall Square Research (KSR), which built the KSR1 in the early 1990s.
1. writing a serial program and compiling it with a parallelizing compiler

2. composing a program from modules that have already been implemented as parallel programs

3. writing a program that expresses parallel activities explicitly

Option 1 above is obviously the easiest for the programmer. It is easier to write a serial program than it is to write a parallel program. The programmer would write the program in one of the languages for which a parallelizing compiler is available (Fortran, C, and C++), then employ the compiler. The technology that supports this scenario is the main focus of this article.

Option 2 above can be easy as well, because the user does not need to deal with explicit parallelism. For many problems and computer systems there exist libraries that perform common operations in parallel. Among them, mathematical libraries for manipulating matrices are best known. One difficulty for users is that one must make sure that a large fraction of the program execution is spent inside such libraries. Otherwise, the serial part of the program may dominate the execution time when running the application on many processors.

Option 3 above is the most difficult for programmers, but gives them direct control over the performance of the parallel execution. Explicit parallel languages are also important as a target for parallelizing compilers. Many parallelizers act as source-to-source restructurers, translating the original, serial program into parallel form. The actual generation of parallel code is then performed by a "backend compiler" from this parallel language form. The remainder of this section discusses this option in more detail.

…the OpenMP standard emerged. OpenMP describes a common set of directives for implementing various types of parallel execution and synchronization. One advantage of the OpenMP directives is that they are designed to be added to a working serial code. If the compiler is told to ignore the directives, the serial program will still execute correctly. Since the serial program is unchanged, such a parallel program may be easier to debug.

A third way of expressing parallelism is to use library calls within an otherwise sequential program. The libraries perform the task of creating and terminating parallel activities, scheduling them, and supporting communication and synchronization. Examples of libraries that support this method are the POSIX threads package, which is supported by many operating systems, and the MPI libraries, which have become a standard for expressing message passing parallel applications.

Parallel Programming Models

Programming Vector Machines: Vector parallelism typically exploits operations that are performed on array data structures. This can be expressed using vector constructs that have been added to standard languages. For instance, Fortran90 uses constructs such as

A(1:n) = B(1:n) + C(1:n)

For a vector machine, this could cause a vector loop to be produced, which performs a vector add between chunks of arrays B and C, then a vector copy of the result into a chunk of array A. The size of a chunk would be determined by the number of elements that fit into a vector register in the machine.
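As a concrete illustration of this chunking, the following is a hedged sketch (mine, not the article's) of the loop a compiler might generate for the array assignment above, assuming a vector register that holds 128 elements; n, i and m are assumed to be declared integers:

! Process the array assignment in register-sized chunks.
DO i = 1, n, 128
   m = MIN(i+127, n)
   A(i:m) = B(i:m) + C(i:m)   ! one vector add and one vector store per chunk
ENDDO

Each trip of this loop corresponds to one vector load of a chunk of B and of C, one vector add, and one vector copy of the result into the matching chunk of A.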
… Thus the term fork/join parallelism is used for loop parallelism.

Determining which processor executes which iteration of the loop is called scheduling. Loops may be scheduled statically, which means that the assignment of processors to loop iterations is fully determined prior to the execution of the loop. Loops may also be self-scheduled, which means that whenever a given processor is ready to execute a loop iteration, it takes the next available iteration. Other scheduling techniques will be discussed in Section 7.3.

Parallel Threads Model: If the parallel activities in a program can be packaged well in the form of subroutines that can execute independently of each other, then the threads model is adequate. Threads are parallel activities that are created and terminated explicitly by the program. The code executed by a thread is a specified subroutine, and the data accessed can either be private to a thread or shared with other threads. Various synchronization constructs are usually supported for coordinating parallel threads. Using the threads model, users can implement highly dynamic and flexible parallel execution schemes. The POSIX threads package is one example of a well-known library that supports this model.

The SPMD Model: Distributed-memory parallel machines are typically programmed by using the SPMD execution model. SPMD stands for "single program, multiple data". This refers to the fact that each processor executes an identical program, but on different data. One processor cannot directly access the data of another processor, but a message containing that data can be passed from one processor to the other. The MPI standard defines an important form for passing such messages. In a DMP, a processor's access to its own data is much faster than access to data of another processor through a message, so programmers typically write SPMD programs that avoid access to the data of other processors. Programs written for a DMP can be more difficult to write than programs written for an SMP, because the programmer must be much more careful about how the data is accessed.

4 Program Analysis

Program analysis is crucial for any optimizing compiler. The compiler writer must determine the analysis techniques to use in the compiler based on the target machine and the type of optimization desired. For parallelization and vectorization, the compiler typically takes as input the serial form of a program, then determines which parts of the program can be transformed into parallel or vector form. The key constraint is that the "results" of each section of code must be the same as those of the serial program. Sometimes the compiler can parallelize a section of code in such a way that the order of operations is different from that in the serial program, causing a slightly different result. The difference may be so small as to be unimportant, or actually might alter the results in an important way. In these cases, the programmer must agree to let the compiler parallelize the code in this manner.

Some of the analysis techniques used by parallelizing compilers are also done by optimizing compilers compiling for serial machines. In this section we will generally ignore such techniques, and focus on the techniques that are unique to parallelizing and vectorizing compilers.

4.1 Dependence Analysis

A data dependence between two sections of a program indicates that during execution of the optimized program, those two sections of code must be run in the order indicated by the dependence. Data dependences between two sections of code that access the same memory location are classified based on the type of the access (read or write) and the order, so there are four classifications:

input dependence: READ before READ
anti dependence: READ before WRITE
flow dependence: WRITE before READ
output dependence: WRITE before WRITE

Flow dependences are also referred to as true dependences. If an input dependence occurs between two sections of a program, it does not prevent the sections from running at the same time (in parallel). However, the existence of any of the other types of dependences would prevent the sections from running in parallel, because the results may be different from those of the serial code. Techniques have been developed for changing the original program in many situations where dependences exist, so that the sections can run in parallel. Some of them will be described later in this article.

A loop is parallelized by running its iterations in parallel, so the question must be asked whether the same storage location would be accessed in different
iterations of a loop, and whether one of the accesses is a write. If so, then a data dependence exists within the loop. Data dependence within a loop is typically determined by equating the subscript expressions of each pair of references to a given array, and attempting to solve the equation (called the dependence equation), subject to constraints imposed by the loop bounds. For a multi-dimensional array, there is one dependence equation for each dimension. The dependence equations form a system of equations, the dependence system, which is solved simultaneously. If the compiler can find a solution to the system, or if it cannot prove that there is no solution, then it must conservatively assume that there is a solution, which would mean that a dependence exists. Mathematical methods for solving such systems are well known if the equations are linear. That means, the form of the subscript expressions is as follows:

\sum_{j=1}^{k} a_j i_j + a_0

where k is the number of loops nested around an array reference, i_j is the loop index in the j-th loop in the nest, and a_j is the coefficient of the j-th loop index in the expression.

The dependence equation would be of the form:

\sum_{j=1}^{k} a_j i'_j + a_0 = \sum_{j=1}^{k} b_j i''_j + b_0     (1)

or

\sum_{j=1}^{k} (a_j i'_j - b_j i''_j) = b_0 - a_0     (2)

In these equations, i'_j and i''_j represent the values of the j-th loop index of the two subscript expressions being equated. For instance, consider the loop below.

DO i = 1, 100
  A(i) = B(i)
  C(i) = A(i-1)
ENDDO

There are two references to array A, so we equate the subscript expressions of the two references. The equation would be:

i'_1 = i''_1 - 1

subject to the constraints:

i'_1 < i''_1
1 ≤ i'_1 ≤ 100
1 ≤ i''_1 ≤ 100

The constraint i'_1 < i''_1 comes from the idea that only dependences across iterations are important. A dependence within the same iteration (i'_1 = i''_1) is never a problem, since each iteration executes on a single processor, so it can be ignored.

Of course, there are many solutions to this equation that satisfy the constraints: {i'_1: 1, i''_1: 2} is one; {i'_1: 2, i''_1: 3} is another. Therefore, the given loop contains a dependence.

A dependence test is an algorithm employed to determine if a dependence exists in a section of code. The problem of finding dependence in this way has been shown to be equivalent to the problem of finding solutions to a system of Diophantine equations, which is NP-complete, meaning that only extremely expensive algorithms can be found to solve the complete problem exactly. Therefore, a large number of dependence tests have been devised that solve the problem under simplifying conditions and in special situations.

Iteration Spaces

Looping statements with pre-evaluated loop bounds, such as the Fortran do-loop, have a predetermined set of values for their loop indices. This set of loop-index values is the iteration space of the loop. k-nested loop statements have a k-dimensional iteration space. A specific iteration within that space may be named by a k-tuple of iteration values, called an iteration vector:

{i_1, i_2, ..., i_k}

in which i_1 represents the outermost loop, i_2 the next inner, and i_k is the innermost loop.

Direction and Distance Vectors

When a dependence is found in a loop nest, it is sometimes useful to characterize it by indicating the iteration vectors of the iterations where the same location is referenced. For instance, consider the following loop.

DO i = 2, 100
  DO j = 1, 100
S1:   A(i,j) = B(i)
S2:   C(i) = A(i-1,j+3)
  ENDDO
ENDDO
The dependence between statements S1 and S2 happens between iterations

{2, 5} and {3, 2}, {2, 6} and {3, 3}, etc.

Since {2, 5} happens before {3, 2} in the serial execution of the loop nest, we say that the dependence source is {2, 5} and the dependence sink is {3, 2}. The dependence distance for a particular dependence is defined as the difference between the iteration vectors, the sink minus the source.

dependence distance = {3, 2} - {2, 5} = {1, -3}

Notice that in this example the dependence distance is constant, but this may not always be the case.

The dependence direction vector is also useful information, though coarser than the dependence distance. There are three directions for a dependence: {<, =, >}. The < direction corresponds to a positive dependence distance, the = direction corresponds to a distance of zero, and the > direction corresponds to a negative dependence distance. Therefore, the direction vector for the example above would be {<, >}.

Distance and direction vectors are used within parallelizing compilers to help determine the legality of various transformations, and to improve the efficiency of the compiler. Loop transformations that reorder the iteration space, or modify the subscripts of array references within loops, cannot be applied for some configurations of direction vectors. In addition, in multiply-nested loops that refer to multi-dimensional arrays, we can hierarchically test for dependence, guided by the direction vectors, and thereby make fewer dependence tests. Distance vectors can help partially parallelize loops, even in the presence of dependences.

Exact versus Inexact Tests

There are three possible answers that any dependence test can give:

1. No dependence - the compiler can prove that no dependence exists.

2. Dependence - the compiler can prove that a dependence exists.

3. Not sure - the test could neither prove nor disprove dependences. To be safe, the compiler must assume a dependence in this case. This is the conservative assumption for dependence testing, necessary to guarantee correct execution of the parallel program.

We call a dependence test exact if it only reports answers 1 or 2. Otherwise, it is inexact.

Dependence Tests

The first of the dependence tests was the GCD test, an inexact test. The GCD test finds the greatest common divisor g of the coefficients of the left-hand-side of the dependence equation (Equation 2 above). If g does not divide the right-hand-side value of Equation 2, then there can be no dependence. Otherwise, a dependence is still a possibility. The GCD test is cheap compared to some other dependence tests. In practice, however, often the GCD g is 1, which will always divide the right-hand-side, so the GCD test doesn't help in those cases.

The Extreme Value test, also inexact in the general case, has proven to be one of the most useful dependence tests. It takes the dependence equation (2) and constructs both the minimum and the maximum possible values for the left-hand-side. If it can show that the right-hand-side is either greater than the maximum, or less than the minimum, then we know for certain that no dependence exists. Otherwise, a dependence must be assumed. A combination of the Extreme Value test and the GCD test has proved to be very valuable and fast because they complement each other very well. The GCD test does not incorporate information about the loop bounds, which the Extreme Value test provides. At the same time, the Extreme Value test does not concern itself with the structure of the subscript expressions, which the GCD test does.

The Extreme Value test is exact under certain conditions. It is exact if any of the following are true:

• all loop index coefficients are ±1 or 0,

• the coefficient of one index variable is ±1 and the magnitudes of all other coefficients are less than the range of that index variable, or

• the coefficient of one index variable is ±1 and there exists a permutation of the remaining index variables such that the coefficient of each is less than the sum of the products of the coefficients and ranges for all the previous index variables.

Many other dependence tests have been devised over the years. Many deal with ways of solving the dependence system when it takes certain forms. For instance, the Two Variable Exact test can find an exact solution if the dependence system is a single equation of the form:
a i + b j = c.

The most general dependence test would be to use integer programming to solve a linear system - a set of equations (the dependence system) and a set of inequalities (constraints on variables due to the structure of the program). Integer programming conducts a search for a set of integer values for the variables that satisfy the linear system. Fourier-Motzkin elimination is one algorithm that is used to conduct the search for solutions. Its complexity is very high (exponential), so until the advent of the Omega test (discussed below), it was considered too expensive to use integer programming as a dependence test.

The Lambda test is an increased-precision form of the Extreme Value test. While the Extreme Value test basically checks to see whether the hyper-plane formed by any of the dependence equations falls completely outside the multi-dimensional volume delimited by the loop bounds of the loop nest in question, the Lambda test checks for the situation in which each hyper-plane intersects the volume, but the intersection of all hyper-planes falls outside the volume. It is especially useful for the situation in which a single loop index appears in the subscript expression for more than one dimension of an array reference (a reference referred to as having coupled subscripts). If the Lambda test can find that the intersection of any two dependence equation hyper-planes falls completely outside the volume, then it can declare that there is no solution to the dependence system.

The I Test is a combination of the GCD and Extreme Value tests, but is more precise than would be the application of the two tests individually.

The Generalized GCD test, built on Gaussian Elimination (adapted for integers), attempts to solve the system of dependence equations simultaneously. It forms a matrix representation of the dependence system, then using elementary row operations forms a solution of the dependence system, if one exists. The solution is parameterized so that all possible solutions could be generated. The dependence distance can also be determined by this method.

The Power Test first uses the Generalized GCD test. If that produces a parameterized solution to the dependence system, then it uses constraints derived from the program to determine lower and upper bounds on the free variables of the parameterized solution. Fourier-Motzkin elimination is used to combine the constraints of the program for this purpose. These extra constraints can sometimes produce an impossible result, indicating that the original parameterized solution was actually empty, disproving the dependence. The Power Test can also be used to test for dependence for specific direction vectors.

All of the preceding dependence tests are applicable when all coefficients and loop bounds are integer constants and the subscript expressions are all affine functions. The Power Test is the only test mentioned up to this point that can make use of variables as coefficients or loop bounds. A variable can simply be treated as an additional unknown. The value of the variable would simply be expressed in terms of the free variables of the solution, then Fourier-Motzkin elimination could incorporate any constraints on that variable into the constraints on the free variables of the solution. A small number of dependence tests have been devised that can make use of variables and non-affine subscript expressions.

The Omega test makes use of a fast algorithm for doing Fourier-Motzkin elimination. The original dependence problem that it tries to solve consists of a set of equalities (the dependence system), and a set of inequalities (the program constraints). First, it eliminates all equality constraints (as was done in the Generalized GCD test) by using a specially designed mod function to reduce the magnitude of the coefficients, until at least one reaches ±1, when it is possible to remove the equality. Then, the set of resulting inequalities is tested to determine whether any integer solutions can be found for them. It has been shown that, for most real program situations, the Omega test gives an answer quickly (polynomial time). In some cases, however, it cannot, and it resorts to an exponential-time search.

The Range test extends the Extreme Value test to symbolic and nonlinear subscript expressions. The ranges of array locations accessed by adjacent loop iterations are symbolically compared. The Range test makes use of range information for variables within the program, obtained by symbolically analyzing the program. It is able to discern data dependences in a few, important situations that other tests cannot handle.

The recent Access Region test makes use of a symbolic representation of the array elements accessed at separate sites within a loop. It uses an intersection operation to intersect two of these symbolic access regions. If the intersection can be proven empty, then the potential dependence is disproven. The Access Region test likewise can test dependence when non-affine subscript expressions are used, because in some cases it can apply simplification operators to express the regions in affine terms.
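To make the two basic tests concrete, here is a small, self-contained sketch (mine, not the article's) that applies the GCD test and then the Extreme Value test to a single dependence equation a1*i' + a2*i'' = c with loop bounds lo..hi. The coefficients below model the reference pair A(2*i) and A(2*i+1) inside DO i = 1, 100; all names are illustrative:

PROGRAM dependence_test_sketch
  IMPLICIT NONE
  INTEGER :: a1, a2, c, lo, hi, g, lwb, upb

  a1 = 2          ! coefficient of i'  (from the write A(2*i'))
  a2 = -2         ! coefficient of i'' (from the read A(2*i''+1), moved to the left-hand side)
  c  = 1          ! right-hand side
  lo = 1          ! loop lower bound
  hi = 100        ! loop upper bound

  ! GCD test: if gcd(|a1|,|a2|) does not divide c, the equation has no integer solution.
  g = gcd(ABS(a1), ABS(a2))
  IF (MOD(c, g) /= 0) THEN
     PRINT *, 'GCD test: no dependence'
  ELSE
     ! Extreme Value test: bound the left-hand side over the loop bounds.
     lwb = MIN(a1*lo, a1*hi) + MIN(a2*lo, a2*hi)
     upb = MAX(a1*lo, a1*hi) + MAX(a2*lo, a2*hi)
     IF (c < lwb .OR. c > upb) THEN
        PRINT *, 'Extreme Value test: no dependence'
     ELSE
        PRINT *, 'a dependence must be assumed'
     END IF
  END IF

CONTAINS
  INTEGER FUNCTION gcd(x, y)
    INTEGER, INTENT(IN) :: x, y
    INTEGER :: p, q, r
    p = x; q = y
    DO WHILE (q /= 0)
       r = MOD(p, q); p = q; q = r
    END DO
    gcd = p
  END FUNCTION gcd
END PROGRAM dependence_test_sketch

For this equation, gcd(2,2) = 2 does not divide 1, so the GCD test alone disproves the dependence; the Extreme Value branch shows how loop-bound information would be used when the GCD test is inconclusive.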
An array subscript expression classification system can assist dependence testing. Subscript expressions may be classified according to their structure, then the dependence solution technique may be chosen based on how the subscript expressions involved are classified. A useful classification of the subscript expression pairs involved in a dependence problem is as follows:

ZIV (zero index variable) The two subscript expressions contain no index variables at all, e.g. A(1) and A(2).

SIV (single index variable) The two subscript expressions contain only one loop index variable, e.g. A(i) and A(i+2).

MIV (multiple index variable) The two subscript expressions contain more than one loop index variable, e.g. A(i) and A(j) or A(i+j) and A(i).

The different classifications call for unique dependence testing methods. The SIV class is further subdivided into various special cases, each enabling a special dependence test or loop transformation.

The Delta test makes use of these subscript expression classes. It first classifies each dependence problem according to the above types. Then, it uses a specially-targeted dependence test for each case. The main insight of the Delta test is that when two array references are being tested for dependence, information derived from solving the dependence equation for one dimension may be used in solving the dependence equation for another dimension. This allows the Delta test to be useful even in the presence of coupled subscripts. The algorithm attends to the SIV and ZIV equations first, since they can be solved easily. Then, the information gained is used in the solution of the MIV equations. Since the Delta test does not attempt to use a single general technique to determine dependence, but rather special tests for each special case, it is possible for the Delta test to accommodate unknown variable values more easily.

Run-time Dependence Testing

It is very common for programs to make heavy use of variables whose values are read from input files. Unfortunately, such variables often contain crucial information about the dependence pattern of the program. In this type of situation, a perfectly parallel loop might have to be run serially, simply because the compiler lacked information about the input variables. In these cases, it is sometimes possible for the compiler to compile a test into the program that would test for certain parallelism-enabling conditions, then choose between parallel and serial code based on the result of the test. This technique of parallelization is called run-time dependence testing.

The inspector/executor model of program execution allows a compiler to run some kind of analysis of the data values in the program (the inspector), which sets up the execution, then to execute the code based on the analysis (the executor). The inspector can do anything from dependence testing to setting up a communication pattern to be carried out by the executor. The complexity of the test needed at run-time varies based on the details of the loop itself. Sometimes the test needed is very simple. For instance, in the loop

DO i=1,100
  A(i+m) = B(i)
  C(i) = A(i)
ENDDO

no dependence exists in the loop if m > 99. The compiler might generate code that executes a parallel version of the loop if m > 99, otherwise a serial version.

More complicated situations might call for a more sophisticated dependence test to be performed at run-time. The compiler might be able to prove all conditions for independence except one. Proof of that condition might be attempted at run-time. For example, the compiler might determine that a loop is parallel provided a given array can be proven to contain no duplicate values (i.e. to be a permutation vector). If the body of the loop is large enough, then the time savings of running the loop in parallel can be substantial. It could offset the expense of checking for the permutation vector condition at runtime. In this case, the compiler might generate such a test to choose between the serial and parallel versions of the loop.

Another technique that has been employed is to attempt to run a loop in parallel despite not knowing for sure that the loop is parallel. This is called speculative parallelization. The pre-loop values of the memory locations that will be modified by the loop must be saved, because it might be determined during execution that the loop contained a dependence, in which case the results of the parallel run must be discarded and the loop must be re-executed serially. During the parallel execution of the loop, extra code is executed that can be used to determine whether a dependence really did exist in the serial version. The
LRPD test is one example of such a runtime dependence test.

4.2 Interprocedural Analysis

A loop containing one or more procedure calls presents a special challenge for parallelizing compilers. The chief problem is how to compare the memory activity in different execution contexts (subroutines), for the purpose of discovering data dependences. One possibility, called subroutine inlining, is to remove all subroutine calls by directly replacing all subroutine calls with the code from the called subroutine, then parallelizing the whole program as one large routine. This is sometimes feasible, but often causes an explosion in the amount of source code that the compiler must compile. Inlining also faces obstacles in trying to represent the formal parameters of the subroutine in the context of the calling routine, since in some languages (Fortran is one example) it is legal to declare formal parameter arrays with dimensionality different from that in the calling routine.

The alternative to inlining is to keep the subroutine call structure intact and simply represent the memory access activity caused by a subroutine in some way at the call site. One method of doing this is to represent memory activity symbolically with sets of constraints. For instance, at the site of a subroutine call, it might be noted that the locations written were:

{A(i) | 0 ≤ i ≤ 100}

The advantage of using this method is that one can use the sets of constraints directly with a Fourier-Motzkin-based dependence test.

Several other forms for representing memory accesses have been used in various compilers. Many are based on triplet notation, which represents a set of memory locations in the form:

lower bound : upper bound : stride

This form can represent many regular access patterns, but not all.

Another representational form consists of Regular Section Descriptors (RSDs), which use a simple form (I + α), where I is a loop index and α is a loop invariant expression. At least three other forms based on RSDs have been used: Restricted RSDs (which can express access on the diagonal of an array), Bounded RSDs that express triplet notation with full symbolic expressions, and Guarded Array Regions, which are Bounded RSDs qualified with a predicate guard.

An alternative for representing memory activity interprocedurally is to use a representational format whose dimensionality is not tied to the program-declared dimensionality of a given array. An example of this type is the Linear Memory Access Descriptors used in the Access Region test. This form can represent most memory access patterns used in a program, and allows one to represent memory reference activity consistently across procedure boundaries.

4.3 Symbolic Analysis

Symbolic analysis refers to the use of symbolic terms within ordinary compiler analysis. The extent to which a compiler's analysis can handle expressions containing variables is a measure of how good a compiler's symbolic analysis is. Some compilers use an integrated, custom-built symbolic analysis package, which can apply algebraic simplifications to expressions. Others depend on integrated packages, such as the Omega constraint-solver, to do the symbolic manipulation that they need. Still others use links to external symbolic manipulation packages, such as Mathematica or Maple. Modern parallelizing compilers generally have sophisticated symbolic analysis.

4.4 Abstract Interpretation

When compilers need to know the result of executing a section of code, they often traverse the program in "execution order", keeping track of the effect of each statement. This process is called abstract interpretation. Since the compiler generally will not have access to the runtime values of all the variables in the program, the effect of each statement will have to be computed symbolically. The effect of a loop is easily determined when there is a fixed number of iterations (such as in a Fortran do-loop). For loops that do not explicitly state a number of iterations, the effect of the iteration may be determined by widening, in which the values changing due to the loop are made to change as though the loop had an infinite number of iterations, and then narrowing, in which an attempt is made to factor in the loop exit conditions, to limit the changes due to widening. Abstract interpretation follows all control flow paths in the program.

Range Analysis  Range analysis is an application of abstract interpretation. It gathers the range of values that each variable can assume at each point in the program. The ranges gathered have been used to support the Range test, as mentioned in Section 4.1 above.
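As a hedged example of the kind of fact range analysis can supply (the loop and the guard are mine, not the article's): if the analysis establishes that m >= n holds inside the IF branch below, a bounds-based test can conclude that the write to A(i+m) never touches the elements read as A(i), so the loop can be parallelized:

SUBROUTINE range_info_example(A, B, C, m, n)
  IMPLICIT NONE
  INTEGER, INTENT(IN) :: m, n
  REAL :: A(*), B(*), C(*)
  INTEGER :: i
  ! Inside this branch the ranges give i+m >= 1+m >= 1+n > n >= i,
  ! so the written and read sections of A are disjoint.
  IF (m >= n) THEN
     DO i = 1, n
        A(i+m) = B(i)
        C(i)   = A(i)
     ENDDO
  ENDIF
END SUBROUTINE range_info_example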
4.5 Data Flow Analysis

Many analysis techniques need global information about the program being compiled. A general framework for gathering this information is called data flow analysis. To use data flow analysis, the compiler writer must set up and solve systems of data flow equations that relate information at various points in a program. The whole program is traversed and information is gathered from each program node, then used in the data flow equations. The traversal of the program can be either forward (in the same direction as normal execution would proceed), or backward (in the opposite direction from normal execution). At join points in the program's control flow graph, the information coming from the paths that join must be combined, so the rules which govern that combination must be specified.

The data flow process proceeds in the direction specified, gathering the information by solving the data flow equations, and combining information at control flow join points until a steady-state is achieved. That is, until no more changes occur in the information being calculated. When steady state is achieved, the wanted information has been propagated to each point in the program.

An example of data flow analysis is constant propagation. By knowing the value of a variable at a certain point in the program, the precision of compiler analysis can be improved. Constant propagation is a forward data flow problem. A value is propagated for each variable. The value gets set whenever an assignment statement in the program assigns a constant to the variable. The value remains associated with the variable until another assignment statement assigns a value to the variable. At control flow join points in the program, if a value is associated with the variable on all incoming paths, and it is always the same value, then that value stays associated with the variable. Otherwise, the value is set to "unknown".

Data flow analysis can be used for many purposes within a compiler. It can be used for determining which variables are aliased to which other variables, for determining which variables are potentially modified by a given section of code, for determining which variables may be pointed to by a given pointer, and many other purposes. Its use generally increases the precision of other compiler analyses.

4.6 Code Transformations to Aid Analysis

Sometimes program source code can be transformed in a way that encodes useful information about the program. The program can be translated into a restricted form that eliminates some of the complexities of the original program. Two examples of this are control-flow normalization and Static Single Assignment (SSA) form.

Control-flow normalization is applied to a program to transform it to a form that is simpler to analyze than a program with arbitrary control flow. An example of this is the removal of GOTO statements from a Fortran program, replacing them with IF statements and looping constructs. This adds information to the program structure, which can be used by the compiler to possibly do a better job of optimizing the program.

Another example is to transform a program into SSA form. In SSA form, each variable is assigned exactly once and is only read thereafter. When a variable in the original program is assigned more than once, it is broken into multiple variables, each of which is assigned once. SSA form has the advantage that whenever a given variable is used, there is only one possible place where it was assigned, so more precise information about the value of the variable is encoded directly into the program form.

Gated SSA form  Gated SSA form is a variant of SSA form that includes special conditional expressions (gating expressions) that make the representation more precise. Gated SSA form has been used for flow-sensitive array privatization analysis within a loop. Array privatization analysis requires that the compiler prove that writes to a portion of an array be shown to precede all reads to the same section of the array within the loop. When conditional branching happens within the loop, then the gating expressions can help to prove the privatization condition.

5 Enabling Transformations

Like all optimizing compilers, parallelizers and vectorizers consist of a sequence of compilation passes. The program analysis techniques described so far are usually the first set of these passes. They gather information about the source program, which is then used by the transformation passes to make intelligent decisions. The compilers have to decide which transformations can legally and profitably be applied to which code sections and how to best orchestrate these transformations.

For the sake of this description we divide the transformations into two parts, those that enable other techniques and those that perform the actual vectorizing or parallelizing transformations. This division
is useful for our presentation, but not strict. Some techniques will be mentioned in both places.

5.1 Dependence Elimination and Avoidance

An important class of enabling transformations deals with eliminating and avoiding data dependences. We will describe data privatization, idiom recognition, and dependence-aware work partitioning techniques.

Data Privatization and Expansion

Data privatization is one of the most important techniques because it directly enables parallelization and it applies very widely. Data privatization can remove anti and output dependences. These so-called storage-related or false dependences are not due to computation having to wait for data values produced by another computation. Instead, the computation must wait because it wants to assign a value to a variable that is still in use by a previous computation. The basic idea is to use a new storage location so that the new assignment does not overwrite the old value too soon. Data privatization does this as shown in Figure 1.

…processors. Data expansion is an alternative implementation to privatization. Instead of marking t private, the compiler expands the scalar variable into an array and uses the loop variable as an array index.

The main difficulty of data privatization and expansion is to recognize eligible variables. A variable is privatizable in a loop iteration if it is assigned before it is used. This is relatively simple to detect for scalar variables. However, the transformation is important for arrays as well. The analysis of array sections that are provably assigned before used can be very involved and requires symbolic analysis, mentioned in Section 4.

Idiom Recognition: Reductions, Inductions, Recurrences

These transformations can remove true (i.e., flow) dependences. The elimination of situations where one computation has to wait for another to produce a needed data value is only possible if we can express the computation in a different way. Hence, the compiler recognizes certain idioms and rewrites them in a form that exhibits more parallelism.

Induction variables: Induction variables are variables that are modified in each loop iteration in such a way that the assumed values can be expressed as a mathematical sequence. Most common are simple, additive induction variables. They get incremented in each loop iteration by a constant value, as shown in Figure 2. In the transformed code the sequence is expressed in a closed form, in terms of the loop variable. The induction statement can then be eliminated.
(… the sequence of values assumed by a variable in a loop.)

Reduction operations: Reduction operations abstract the values of an array into a form with lesser dimensionality. The typical example is an array being summed up into a scalar variable. The parallelizing transformation is shown in Figure 3.

Figure 3: Reduction Parallelization.

The idea for enabling parallel execution of this computation exploits mathematical commutativity. We can split the array into p parts, sum them up individually by different processors, and then combine the results. The transformed code has two additional loops, for initializing and combining the partial results. If the size of the main reduction loop (variable n) is large, then these loops are negligible. The main loop is fully parallel.

More advanced forms of this technique deal with array reductions, where the sum operation modifies several elements of an array instead of a scalar. Furthermore, the summation is not the only possible reduction operation. Another important reduction is finding the minimum or maximum value of an array.

Recurrences: Recurrences use the result of one or several previous loop iterations for computing the value of the next iteration. This usually forces a loop to be executed serially. However, for certain forms of linear recurrences, algorithms are known that can be parallelized. For example, in Figure 4

Figure 4: Recurrence Substitution.

the compiler has recognized a pattern of linear recurrences for which a parallel solver is known. The compiler then replaces this code by a call to a mathematical library that contains the corresponding parallel solver algorithm. This substitution can pay off if the array is large. Many variants of linear recurrences are possible. A large number of library functions need to be made available to the compiler so that an effective substitution can be made in many situations.

Correctness and performance of idiom recognition techniques: Idiom recognition and substitution come at a cost. Induction variable substitution replaces an operation with one of higher strength (usually, a multiplication replaces an addition). In fact, this is the reverse operation of strength reduction - an important technique in classical compilers. Parallel reduction operations introduce additional code. If the reduction size is small, the overhead may offset the benefit of parallel execution. This is true to a greater extent for parallel recurrence solvers. The overhead associated with parallel solvers is relatively high and can only be amortized if the recurrence is long. As a rule of thumb, induction and reduction substitution is usually beneficial, whereas the performance of a program with and without recurrence recognition should be more carefully evaluated.

It is also important to note that, although parallel recurrence and reduction solvers perform mathematically correct transformations, there may be round-off errors because of the limited computer representation of floating point numbers. This may lead to inaccurate results in application programs that are numerically very sensitive to reordering operations. Compilers usually provide command line options so that the user can control these transformations.

Dependence-aware Work Partitioning: Skewing, Distribution, Uniformization

The above three techniques are able to remove data dependences and, in this way, generate fully parallel loops. If dependences cannot be removed it may still be possible to generate parallel computation. The computation may be reordered so that expressions that are data dependent on each other are executed by the same processor. Figure 5 shows an example loop and its iteration dependence graph.

By regrouping the iterations of the loop as indicated by the shaded wavefronts in the iteration space graph, all dependences stay within the same processor, where they are enforced by the sequential execution of this processor. This technique is called loop skewing. The class of unimodular transformations contains more general techniques that can re…
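Figure 5 is likewise not reproduced here, so the following is a hedged sketch of the regrouping idea just described (bounds and array names are illustrative). The only dependence in the loop below has distance (1,-1), so the iterations with the same value of d = i+j form one dependence chain; giving each such anti-diagonal to one processor keeps every dependence inside that processor, while the d loop itself is fully parallel:

! Original nest: A(i,j) depends on A(i-1,j+1), i.e. distance (1,-1).
DO i = 2, n
   DO j = 1, m-1
      A(i,j) = A(i-1,j+1) + B(i,j)
   ENDDO
ENDDO

! Regrouped form: the outer loop over anti-diagonals d = i+j is parallel;
! within one d the chain executes in its original order.
DO d = 3, n+m-1                       ! each d can go to a different processor
   DO i = MAX(2, d-m+1), MIN(n, d-1)
      A(i,d-i) = A(i-1,d-i+1) + B(i,d-i)
   ENDDO
ENDDO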
a parallel loop to an outer position increases the
amount of work inside the parallel region.
Splitting a single loop into a nest of two loops is
called stripmining or loop blocking. It enables the
exploitation of hierarchical parallelism (e.g., the inner
loop may then be executed across a multiprocessor,
while the outer loop gets executed by a cluster of
such multiprocessors.) It is also an important cache
optimization, as we will discuss.
Loop Distribution

A loop containing several statements must first be distributed into several loops before each one can be turned into a vector operation. Loop distribution (also called loop splitting or loop fission) is only possible if there is no dependence in a lexically backward direction. Statements can be reordered to avoid backwards dependences, unless there is a dependence cycle (a forward and a backward dependence that form a loop). Figure 7 shows a loop that is distributed and vectorized. The original loop contains a dependence in a lexically forward direction. Such a dependence does not prevent loop distribution. That is, the execution order of the two dependent statements is maintained in the vectorized code.

Stripmining Vector Lengths

Vector instructions usually take operands of length 2^n - the size of the vector registers. The original loop must be divided into strips of this length. This is called stripmining. In Figure 9, the number of iterations has been broken down into strips of length 32.

Figure 9: Stripmining a Loop into Two Nested Loops.
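Figure 9 is not reproduced in these notes; the following is a hedged sketch of the stripmined form it describes, with strips of length 32 (the MIN expression handles a final, shorter strip; declarations are assumed):

DO is = 1, n, 32
   DO i = is, MIN(is+31, n)   ! at most 32 iterations: one vector operation
      A(i) = B(i) + C(i)
   ENDDO
ENDDO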
The most important sources of parallelism for multiprocessors are iterations of loops, such as do-loops in Fortran programs and for-loops in C programs. We will present techniques for detecting that loop iterations can correctly and effectively be executed in parallel. Briefly we will also mention techniques for exploiting partial parallelism in loops and in non-loop program constructs.

All parallelizing compiler techniques have to deal with two general issues: (1) they must be provably correct, and (2) they must improve the performance of the generated code, relative to a serial execution on one processor. The correctness of techniques is often stated by formally defining data-dependence patterns under which a given transformation is legal. While such correctness proofs exist for most of today's compiler capabilities, they often require the compiler to make conservative assumptions, as described above. The second issue is no less complex. Assessing performance improvement involves the assumption of a machine model. For example, one must assume that a parallel loop will incur a start/terminate overhead. Hence, it will not execute n times faster on an n-processor machine than on one processor. Its parallel execution time is no less than t_{1 processor}/n + t_{overhead}. For small loops this can be more than the serial execution time. Unfortunately, even the most advanced compilers sometimes do not have enough information to make such performance predictions. This is because they do not have sufficient information about properties of the target machine and about the program's input data.

7.1 Parallelism Recognition

Exploiting Fully Parallel Loops

Basic parallel code generation for multiprocessors entails identifying loops that have no loop-carried dependences, and then marking these loops as parallelizable. Data-dependence analysis and all its enabling techniques for program analysis and dependence removal are most important in this process. Iterations of parallelizable loops are then assigned to the different processors for execution. This second step may happen through various methods. The parallelizing compiler may be directly coupled with a code-generating compiler that issues the actual machine code. Alternatively, the parallelizer can be a pre-processor, outputting the source program annotated with information about which loops can be executed in parallel. A backend compiler then reads this program form and generates code according to the preprocessor's directives.

Exploiting Partial Loop Parallelism

Partial parallelism can also be exploited in loops with true dependences that cannot be removed. The basic idea is to enforce the original execution order of the dependent program statements. Parallelism is still exploited as described above, however each dependent statement now waits for a go-ahead signal telling it that the needed data value has been produced by a prior iteration. The successful implementation of this scheme relies on efficient hardware synchronization mechanisms.

Compilers can reduce the waiting time of dependent statements by moving the source and sink of a dependence closer to each other. Statement reordering techniques are important to achieve this effect. In addition, because every synchronization introduces overhead, reducing the number of synchronization points is important. This can be done by eliminating redundant synchronizations (i.e., synchronizations that are covered by other synchronizations) or by serializing a code section. Note that there are many tradeoffs for the compiler to make. For example, it has to decide when it is more profitable to serialize a code section than to execute it in parallel with many synchronizations.

Non-loop Parallelism

Loops are not the only source of parallelism. Straight-line code can be broken up into independent sections, which can then be executed in parallel. For building such parallel sections, a compiler can, for example, group all statements that are mutually data dependent into one section. This results in several sections between which there are no data dependences. Applying this scheme at small basic code blocks is important for instruction-level parallelization, to be discussed later. At a larger scale, such parallel regions could include entire subroutines, which can be assigned to different processors for execution.

More complex is exploiting parallelism in the repetitive pattern of a recursion. Recursion splitting techniques can transform a recursive algorithm into a loop, which can then be analyzed with the already described means.

Non-loop parallelism is important for instruction-level parallelization. It is of lesser importance for multiprocessors because the degree of parallelism in loops is usually much higher than in straight-line code sections. Furthermore, since the most prevalent parallelization technology is found in compilers for non-recursive languages, such as Fortran, there has not
been a pressing need to deal with recursive program patterns.

7.2 Parallel Loop Restructuring

Once parallel loops are detected, there are several loop transformations that can optimize the program such that it (1) exploits the available resources in an optimal way and (2) minimizes overheads.

One such transformation is loop coalescing, which combines the iterations of nested parallel loops into a single parallel loop. Coalescing introduces overhead because it needs to introduce additional expressions for reconstructing the original loop index variables from the index of the combined loop. Again, benefits and overheads must be compared by the compiler.
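A minimal C sketch of loop coalescing, under the assumption of a rectangular iteration space and an illustrative array a, shows where the index-reconstruction overhead comes from:

    void coalesce_example(int n, int m, double a[n][m]) {
        /* Original doubly nested loop: n*m independent iterations. */
        for (int i = 0; i < n; i++)
            for (int j = 0; j < m; j++)
                a[i][j] = 0.0;

        /* Coalesced loop: one parallel loop over n*m iterations gives the
           scheduler more, and more evenly sized, units of work.  The divide
           and modulo below reconstruct the original indices and are the
           overhead mentioned in the text. */
        for (int k = 0; k < n * m; k++) {
            int i = k / m;          /* reconstructed outer index */
            int j = k % m;          /* reconstructed inner index */
            a[i][j] = 0.0;
        }
    }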
When the relevant loop bounds and costs are known only at runtime, the compiler can generate a test of whether parallel execution is profitable. This can be implemented through two-version loops: the conditional expression forms the condition of an IF statement, which chooses between the parallel and serial versions of the loop.

Loop restructuring can also improve temporal cache reuse; Figure 13 gives an example of this transformation.
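A C sketch of the two-version loop scheme described above follows; the trip-count test, the threshold of 1000, and the OpenMP directive used to mark the parallel version are illustrative assumptions rather than details taken from the text.

    void scale(double *a, double s, int n) {
        if (n > 1000) {             /* illustrative profitability test */
            /* Parallel version: chosen when the loop is large enough for
               parallel execution to pay off. */
            #pragma omp parallel for
            for (int i = 0; i < n; i++)
                a[i] = s * a[i];
        } else {
            /* Serial version: avoids the start/terminate overhead for
               small trip counts. */
            for (int i = 0; i < n; i++)
                a[i] = s * a[i];
        }
    }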
Multi-level Parallelization

Most multiprocessors today offer one-level parallelism. That means that, usually, a single loop out of a loop nest can be executed in parallel. Architectures that have a hierarchical structure can exploit two or more levels of parallelism. For example, a cluster of multiprocessors may be able to execute two nested parallel loops, the outer one across the clusters while the inner loop employs the processors within each cluster. Program sections that contain singly-nested loops can be turned into two-level parallelism by stripmining them into two nested loops, as shown in Figure 15.

Figure 15: Stripmining Enables 2-level Parallelism.

Dynamic scheduling incurs the overhead of a scheduling action for each chunk of iterations it assigns; however, this is often negligible compared to the gain from load balancing.

The goal of computing where the resources are, on the other hand, favors static scheduling methods. In heterogeneous systems it is mandatory to perform the computation where the necessary processor capabilities or input/output devices are. In multiprocessors, data are critical resources, whose location may determine the best executing processor. For example, if a data item is known to be in the cache of a specific processor, then it is best to execute other computations accessing the same data item on that processor as well. The compiler has knowledge of which computation accesses which data in the future. Hence, such scheduling decisions are good to make at compile time. In distributed-memory systems this situation is even more pronounced: accessing a data item on a processor other than its owner involves communication with high latencies.

Scheduling decisions also depend on the environment of the machine. For example, in a single-user environment it may be best to statically schedule computation that is known to execute evenly. However, if the same program is executed in a multi-user environment, the load of the processors can be very uneven, making dynamic scheduling methods the better option.
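As a sketch of these tradeoffs, the following C fragment expresses static and dynamic scheduling with OpenMP directives; the chunk size of 16 and the loop body are arbitrary choices, and OpenMP merely stands in for whatever scheduling mechanism the compiler's runtime provides.

    /* Static (block) scheduling: iterations are divided among the
       processors once, at loop entry; there is no per-chunk overhead,
       which suits evenly sized iterations on a dedicated machine. */
    void relax_static(double *a, int n) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            a[i] = 0.5 * (a[i] + 1.0);
    }

    /* Dynamic scheduling: an idle processor fetches the next chunk of 16
       iterations; every fetch is a scheduling action, but the load stays
       balanced when iteration times or processor loads vary. */
    void relax_dynamic(double *a, int n) {
        #pragma omp parallel for schedule(dynamic, 16)
        for (int i = 0; i < n; i++)
            a[i] = 0.5 * (a[i] + 1.0);
    }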
First, the compiler must identify possible distributions of arrays. Second, data access costs and frequencies need to be estimated to compare distribution alternatives. Third, if there are program sections with different access patterns, redistribution may need to be considered at runtime.

The simplest data distribution scheme is block distribution, which divides an array into p sections, where p is the number of processors. Block distribution creates contiguous array sections on each processor. If adjacent processors access adjacent array elements, then cyclic distribution is appropriate. Block-cyclic distribution is a combination of both schemes. For irregular access patterns, indexed distribution may be most appropriate. Figure 16 illustrates these distribution schemes.

The most common form of communication is to send and receive messages, which can communicate one or several data elements between two specific processors. More complex communication primitives are also known, such as broadcasts (send to all processors) and reductions (receive from all processors and combine the results in some form).

The basic idea of message generation is simple. Each statement needs to communicate to/from remote processors the accessed data elements that are not local. Often, the owner-computes principle is assumed. For example, assignment statements are executed on the processor that owns the left-hand-side element. (Note that this execution scheme is different from the one assumed for SMPs, where an entire loop iteration is executed by one and the same processor.) Data partitioning information hence supplies both the information about which processor executes which statement and which data is local or remote. Figure 17 shows the basic messages generated. It assumes a block partitioning of both arrays, A and B.
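The sketch below makes the owner-computes rule concrete for a block-distributed one-dimensional array and the illustrative statement A[i] = B[i-1]; the functions send_msg and recv_msg are hypothetical placeholders for the machine's message-passing primitives, and global indices are used instead of the local index spaces a real compiler would generate.

    /* Assumed message-passing primitives (placeholders, not a real API). */
    void   send_msg(int dest, double value);
    double recv_msg(int src);

    /* Block distribution: processor q owns the contiguous elements
       [q*b, (q+1)*b), where b is the block size. */
    static int block_size(int n, int p) { return (n + p - 1) / p; }
    static int owner(int i, int n, int p) { return i / block_size(n, p); }

    /* Owner-computes execution of  A[i] = B[i-1]  for i = 1..n-1.
       Every processor runs this code; my_id identifies the local one. */
    void update(double *A, const double *B, int n, int p, int my_id) {
        for (int i = 1; i < n; i++) {
            int lhs_owner = owner(i, n, p);      /* executes the statement  */
            int rhs_owner = owner(i - 1, n, p);  /* owns the needed element */
            if (rhs_owner == my_id && lhs_owner != my_id)
                send_msg(lhs_owner, B[i - 1]);   /* ship the non-local datum */
            if (lhs_owner == my_id) {
                double b = (rhs_owner == my_id) ? B[i - 1]
                                                : recv_msg(rhs_owner);
                A[i] = b;                        /* owner of A[i] computes */
            }
        }
    }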
8 Exploiting Parallelism at the Instruction Level

Instruction-level parallelism (ILP) refers to the processor's capability to execute several instructions at the same time. Instruction-level parallelism can be exploited implicitly by the processor, without the compiler issuing special directives or instructions to the hardware. Or, the compiler can extract parallelism explicitly and express it in the generated code; examples of the latter type of generated code are Very-Long Instruction Word (VLIW) or Explicitly Parallel Instruction Computing (EPIC) architectures. In addition to the techniques presented here, all or most techniques known from classical compilers are important and are often applied as a first set of transformation and analysis passes in ILP compilers (see “Program Compilers”).

8.1 Implicit Instruction-level Parallelism

Instruction Scheduling

Modern processors exploit ILP by starting a new instruction before an earlier instruction has completed. All instructions begin their execution in the order defined by the program. This order can have a significant performance impact. Hence, an important task of the compiler is to define a good order. It does this by moving instructions that incur long latencies before those with short latencies. Such instruction scheduling is subject to data-dependence constraints. For example, an instruction consuming a value in a register or memory must not be moved before the instruction producing this value. Instruction scheduling is a well-established technology, discussed in standard compiler textbooks; we refer the reader to such literature for further details.

Removing Dependences

The basic patterns for removing dependences are similar to the ones discussed for loop-level parallelism. Anti dependences can be removed through variable-renaming techniques. In addition, register renaming becomes important: it avoids conflicts between potentially parallel sets of instructions that use the same register for storing temporary values. Such techniques can run counter to good register allocation in sequential instruction streams, where variables with non-overlapping lifetimes are assigned to the same register. Because of this, the compiler may rely on hardware register-renaming capabilities available in the processor.

Similar to the induction-variable recognition technique discussed above, the compiler can replace incremental operations with operations that use operands available at the beginning of a parallel code section. Likewise, a sequence of sum operations may be replaced by sum operations into a temporary variable, followed by an update step at the end of the parallel region. Figure 18 illustrates these transformations.

Figure 18: Dependence-Removing Transformation for Instruction-Level Parallelism. Shaded blocks of instructions are independent of each other and can be executed in parallel.

Increasing the Window Size

A large window of instructions within which the processor can discover and exploit ILP is important for two reasons. First, it leads to more opportunities for parallelism and, second, it reduces the relative cost of starting the instruction pipeline, which usually happens at window boundaries.

Window boundaries are typically branch instructions. Instruction analysis and parallel execution cannot
easily cross branch instructions because the processor does not know what instructions will execute after the branch until it is reached. For example, an instruction may only execute on the true branch of a conditional jump. Hence, although the instruction could be executed in parallel with another instruction before the branch, this is not feasible: if the false branch is taken, the instruction might have written an undesired value to memory. Even if all side effects of such instructions were kept in temporary storage, they might raise exceptions that could incorrectly abort program execution.

There are several techniques for increasing the window size. Code-motion techniques can move instructions across branches under certain conditions. For example, if the same instruction appears on both branches, it can be moved before the branch, subject to data-dependence considerations. In this way, the basic block on the critical path can be enlarged, or another basic block can be removed completely. Instruction predication can also remove a branch by assigning the branch condition to a mask, which then guards the merged statements of both branches. Figure 19 shows an example. Predicated execution needs hardware support, and the compiler must trade off the benefits of enhanced ILP against the overhead of more executed instructions.

Compilers that extract ILP explicitly and express it in the generated code typically choose a VLIW architecture and a software pipelining technique. The goal is to find a repetitive instruction pattern in a loop that can be mapped efficiently to a sequence of VLIW instructions. It is called pipelining because the execution may “borrow” instructions from earlier or later iterations to fill an efficient schedule. Hence, the execution of parts of different iterations overlaps. This is illustrated in Figure 20. The reader may notice that there would be a conflict in register use between the overlapping loop iterations: for example, in the same VLIW instruction, s4 uses R0's value of one loop iteration while s2 uses R0's value belonging to the next iteration. Both software and hardware register-renaming techniques are known to resolve this problem.
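Because Figure 20 is not reproduced, the following C sketch conveys the effect of software pipelining on a simple accumulation loop; the prologue/epilogue structure and the explicit temporary t mimic, in source form, the overlap and register rotation that a VLIW schedule would express.

    /* Original loop: each iteration multiplies, then accumulates. */
    double dot(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++) {
            double t = a[i] * b[i];
            s += t;
        }
        return s;
    }

    /* Software-pipelined sketch (assumes n >= 1): the multiply of
       iteration i+1 is overlapped with the accumulation of iteration i,
       so a VLIW processor could issue both in one long instruction word. */
    double dot_pipelined(const double *a, const double *b, int n) {
        double s = 0.0;
        double t = a[0] * b[0];                  /* prologue: start iteration 0 */
        for (int i = 0; i < n - 1; i++) {
            double t_next = a[i + 1] * b[i + 1]; /* piece of iteration i+1 */
            s += t;                              /* piece of iteration i   */
            t = t_next;                          /* software register rotation */
        }
        s += t;                                  /* epilogue: last iteration */
        return s;
    }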
A number of issues go beyond the individual analysis and transformation techniques. These issues become important when creating a complete compiler implementation, in which the described techniques are integrated into a user-friendly tool. Several compiler research infrastructures have played pioneering roles in this regard; among them are the Parafrase [1, 2], PFC [3], PTRAN [4], ParaScope [5], Polaris [6], and SUIF [7] compilers. The following paragraphs describe a number of issues that have to be addressed by such infrastructures. An adequate compiler-internal representation of the program must be chosen, the large number of transformation passes needs to be put in the proper order, and decisions have to be made about where to apply which transformation, so as to maximize the benefits but keep the compile time within bounds. The user interface of the compiler is important as well: optimizing compilers typically come with a large set of command-line flags, which should be presented in a form that makes them as easy to use as possible.

Internal Representation

A large variety of compiler-internal program representations (IRs) are in use. They differ with respect to the level of program translation and the type of program analysis information that is implicitly represented. Several IRs may be used for the several phases of the compilation. The syntax-tree IR represents the program at a level that is close to the original program. At the other end of the spectrum are representations close to the generated machine code. An example of an IR in between these extremes is the register transfer language used by the widely available GNU C compiler. Source-level transformations, such as loop analysis and transformations, are usually applied on an IR at the level of the syntax tree, whereas instruction-level transformations are applied on an IR that is closer to the generated machine code. Examples of representations that include analysis information are the static single assignment form (SSA) and the program dependence graph (PDG). SSA was introduced in Section 4. The PDG includes information about both data dependences and control dependences in a common representation; it facilitates transformations that need to deal with both types of dependences at the same time.

Phase Ordering

Many compiler techniques are applied in an obvious order. Data dependence analysis needs to come before parallel loop recognition, and so do dependence-removing transformations. Other techniques mutually influence each other; we have introduced several techniques where this situation occurs. For example, loop blocking for cache locality also involved loop interchanging, and loop interchanging was made possible through loop distribution. There are many situations where the order of transformations is not easy to determine. One possible solution is for the compiler to generate internally a large number of program variants and then estimate their performance. We have already described the difficulty of performance estimation. In addition, generating a large number of program variants may get prohibitively expensive in terms of both compiler execution time and space. Practical solutions to the phase-ordering problem are based on heuristics and ad-hoc strategies; finding better solutions is still a research issue. One approach is for the compiler to decide on the applicability of several transformations at once. This is the goal of unimodular transformations, which can determine a best combination of iteration-reordering techniques, subject to data-dependence constraints.

Applying Transformations at the Right Place

One of the most difficult problems for compilers is to decide when and where to apply a specific technique. In addition to the phase-ordering problem, there is the issue that most transformations can have a negative performance impact if applied to the wrong program section. For example, a very small loop may run slower in parallel than serially. Interchanging two loops may increase the parallel granularity but reduce data locality. Stripmining for multi-level parallelism may introduce more overhead than benefit if the loop has a small number of iterations. This difficulty is increased by the fact that machine architectures are getting more complex, requiring specialized compiler transformations in many situations. Furthermore, an increasing number of compiler techniques are being developed that apply to a specific program pattern, but not in general. For reasons discussed before, the compiler does not always have sufficient information about the program input data and machine parameters to make optimal decisions.

Speed versus Degree of Optimization

Ordinary compilers transform medium-size programs in a few seconds. This is not so for parallelizing compilers. Advanced program analysis methods, such as data dependence analysis and symbolic range analysis, may take significantly longer. In addition, as mentioned above, compilers may need to create
several optimization variants of a program and then pick the one with the best estimated performance. This can further multiply the compilation time. It also raises a new issue: the compiler now needs to decide which program sections to optimize to the fullest of its capabilities and where to save compilation time. One way of resolving this issue is to pass the decision on to the user, in the form of command-line flags.

[3] J. R. Allen and K. Kennedy, “PFC: a program to convert Fortran to parallel form,” in K. Hwang (ed.), Supercomputers: Design and Applications, IEEE Computer Society Press, 1985, pages 186–205.

[4] F. Allen, M. Burke, P. Charles, R. Cytron, and J. Ferrante, “An overview of the PTRAN analysis system for multiprocessing,” Proc. of the Int’l Conf. on Supercomputing, 1987, pages 194–211.

[5] V. Balasundaram, K. Kennedy, U. Kremer, K. McKinley, and J. Subhlok, “The ParaScope editor: an interactive parallel programming tool,” Proc. of the Int’l Conf. on Supercomputing, 1989, pages 540–550.

• D. J. Kuck, “High Performance Computing: Challenges for Future Systems,” Oxford University Press, New York, 1996.

• M. Wolfe, “High-Performance Compilers for Parallel Computing,” Addison-Wesley, 1996.
• H. Zima, “Supercompilers for Parallel and
Vector Computers,” Addison-Wesley, 1991.