An Introduction to
Parallel Algorithms
Joseph JaJa
UNIVERSITY OF MARYLAND
Chapter 1
Introduction
way, and can provide a simple parallel model that does not include any architecture-related features. The shared-memory model, where a number of processors communicate through a common global memory, offers an attractive framework for the development of algorithmic techniques for parallel computations. Unlike the two other models, the network model captures communication by incorporating the topology of the interconnections into the model itself. We show several parallel algorithms on these models, followed by a brief comparison.
The shared memory model serves as our vehicle for designing and
analyzing parallel algorithms in this book and has been a fertile ground for
theoretical research into both the power and limitations of parallelism. We
shall describe a general framework for presenting and analyzing parallel
algorithms in this model.
1.1 Parallel Processing
within a small distance of one another, and are primarily used to solve a given problem jointly. Contrast such computers with distributed systems, where a set of possibly many different types of processors are distributed over a large geographic area, and where the primary goals are to use the available distributed resources, and to collect information and transmit it over a network connecting the various processors.
Parallel computers can be classified according to a variety of architectural features and modes of operations. In particular, these criteria include the type and the number of processors, the interconnections among the processors and the corresponding communication schemes, the overall control and synchronization, and the input/output operations. These considerations are outside the scope of this book.
Our main goal is to present algorithms that are suitable for implementation on parallel computers. We emphasize techniques, paradigms, and methods, rather than detailed algorithms for specific applications. An immediate question comes to mind: How should an algorithm be evaluated for its suitability for parallel processing? As in the case of sequential algorithms, there are several important criteria, such as time performance, space utilization, and programmability. The situation for parallel algorithms is more complicated due to the presence of additional parameters, such as the number of processors, the capacities of the local memories, the communication scheme, and the synchronization protocols. To get started, we introduce two general measures commonly used for evaluating the performance of a parallel algorithm.
Let P be a given computational problem and let n be its input size. Denote the sequential complexity of P by T*(n). That is, there is a sequential algorithm that solves P within this time bound, and, in addition, we can prove that no sequential algorithm can solve P faster. Let A be a parallel algorithm that solves P in time T_p(n) on a parallel computer with p processors. Then, the speedup achieved by A is defined to be

$$S_p(n) = \frac{T^*(n)}{T_p(n)}.$$
A second measure is the efficiency of the algorithm, defined by

$$E_p(n) = \frac{T_1(n)}{p\,T_p(n)},$$

where T_1(n) is the running time of the parallel algorithm on a single processor.
1.2 Background
Readers should have an understanding of elementary data structures and
basic techniques for designing and analyzing sequential algorithms. Such
material is usually covered at the undergraduate level in computer science
and computer engineering curricula. Our terminology and notation are standard; they are described in several of the references given at the end of this
chapter.
Algorithms are expressed in a high-level language in common use. Each
algorithm begins with a description of its input and its output, followed by a
statement (which consists of a sequence of one or more statements). We next
give a list of the statements most frequently used in our algorithms. We shall
augment this list later with constructs needed for expressing parallelism.
1. Assignment statement:
variable := expression
The expression on the right is evaluated and assigned to the variable on
the left.
2. Begin/end statement:
begin
statement
statement
statement
end
$T(n) = \Omega(f(n))$ if there exist positive constants $c$ and $n_0$ such that $T(n) \ge c f(n)$, for all $n \ge n_0$.
$T(n) = \Theta(f(n))$ if $T(n) = O(f(n))$ and $T(n) = \Omega(f(n))$.
The running time of a sequential algorithm is estimated by the number
of basic operations required by the algorithm as a function of the input size.
This definition naturally leads to the questions of what constitutes a basic
operation, and whether the cost of an operation should be a function of the
word size of the data involved. These issues depend on the specific problem
at hand and the model of computation used. Briefly, we charge a unit of time
to the operations of reading from and writing into the memory, and to basic
arithmetic and logic operations (such as adding, subtracting, comparing, or
multiplying two numbers, and computing the bitwise logic OR or AND of two
words). The cost of an operation does not depend on the word size; hence, we
are using what is called the uniform cost criterion. A formal computational
model suitable for our purposes is the Random Access Machine (RAM),
which assumes the presence of a central processing unit with a random-access
memory attached to it, and some way to handle the input and the output
operations. A knowledge of this model beyond our informal description is not
necessary for understanding the material covered in this book. For more
details concerning the analysis of algorithms, refer to the bibliographic notes
at the end of this chapter.
Finally, all logarithms used in this book are to the base 2 unless otherwise stated. A logarithm used in an asymptotic expression will always have a
minimum value of 1.
1.3 Parallel Models
algorithm in Fig. 1.1(b) proceeds in a complete binary tree fashion that begins by computing the sums A(1) + A(2), A(3) + A(4), ..., A(n - 1) + A(n) at the lowest level, and repeats the process at the next level with n/2 elements, and so on until the sum is computed at the root.
FIGURE 1.1
The dags of two possible algorithms for Example 1.1. (a) A dag for computing the sum of eight elements. (b) An alternate dag for computing the sum of eight elements based on a balanced binary tree scheme.
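To make the two dags of Figure 1.1 concrete, here is a small Python sketch (not part of the original text; the function names are ours). The first routine follows the chain-like dag of Fig. 1.1(a), in which every addition depends on the previous one; the second follows the balanced-binary-tree dag of Fig. 1.1(b), in which all additions on a level are independent and could be executed concurrently.

def linear_sum(a):
    # Dag of Fig. 1.1(a): each addition depends on the previous one (depth n - 1).
    s = a[0]
    for x in a[1:]:
        s = s + x
    return s

def tree_sum(a):
    # Dag of Fig. 1.1(b): pair up elements level by level (depth log n, assuming n is a power of 2).
    level = list(a)
    while len(level) > 1:
        # All additions on one level are independent of one another.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

data = [3, 1, 4, 1, 5, 9, 2, 6]      # n = 8, as in Figure 1.1
assert linear_sum(data) == tree_sum(data) == sum(data)

Both routines perform n - 1 additions; they differ only in how the additions can be scheduled.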
1. If t_i = t_k for some i ≠ k, then j_i ≠ j_k. That is, each processor can perform a single operation during each unit of time.
2. If (i, k) is an arc in the graph, then t_k ≥ t_i + 1. That is, the operation represented by node k should be scheduled after the operation represented by node i has been completed.
The time t_i of an input node i is assumed to be 0, and no processor is allocated to the node i. We call the sequence {(j_i, t_i) | i ∈ N} a schedule for the parallel execution of the dag by p processors, where N is the set of nodes in the dag.
For any given schedule, the corresponding time for executing the algorithm is given by $\max_{i \in N} t_i$. The parallel complexity of the dag is defined by $T_p(n) = \min\{\max_{i \in N} t_i\}$, where the minimum is taken over all schedules that use p processors. Clearly, the depth of the dag, which is the length of the longest path between an input and an output node, is a lower bound on T_p(n), for any number p of processors.
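As a concrete illustration of these two conditions, the following Python sketch (ours, not from the book) assigns the nodes of a dag to levels, schedules each level greedily on p processors, and reports the depth, which is the lower bound on T_p(n) just mentioned. Input nodes are assumed to have been stripped out, in keeping with the convention that they take zero time.

from collections import defaultdict

def schedule_dag(nodes, arcs, p):
    # nodes: internal nodes listed in topological order; arcs: (i, k) pairs, k needs i's result.
    preds = defaultdict(list)
    for i, k in arcs:
        preds[k].append(i)
    # level[v] = length of the longest chain of operations ending at v.
    level = {}
    for v in nodes:
        level[v] = 1 + max((level[u] for u in preds[v]), default=0)
    depth = max(level.values())
    # Nodes on the same level are independent; p processors need ceil(width / p) steps per level.
    width = defaultdict(int)
    for v in nodes:
        width[level[v]] += 1
    finish_time = sum(-(-width[h] // p) for h in range(1, depth + 1))
    return finish_time, depth

# The tree dag of Fig. 1.1(b) for n = 8: nodes 1..7 are the seven additions.
arcs = [(1, 5), (2, 5), (3, 6), (4, 6), (5, 7), (6, 7)]
print(schedule_dag([1, 2, 3, 4, 5, 6, 7], arcs, p=4))   # (3, 3): the schedule meets the depth bound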
EXAMPLE 1.2:
Consider the two sum algorithms presented in Example 1.1 for an arbitrary number n of elements. It is clear that the best schedule of the algorithm represented in Fig. 1.1(a) takes O(n) time, regardless of the number of processors available, whereas the best schedule of the dag of Fig. 1.1(b) takes O(log n) time with ⌈n/2⌉ processors. In either case, the scheduling algorithm is straightforward and proceeds bottom up, level by level, where all the nodes at the same level have the same execution time.
EXAMPLE 1.3:
(Matrix Multiplication)
FIGURE 1.2
A dag for computing an entry C(i, j) of the matrix product C = AB for the case of 4 x 4 matrices.
shared memory. The initialized local variables are (1) the order n, (2) the processor number i, and (3) the number p ≤ n of processors such that r = n/p is an integer.
Output: The components (i - 1)r + 1, ..., ir of the vector y = Ax stored in the shared variable y.
begin
1. global read(x, z)
2. global read(A((i - 1)r + 1 : ir, 1 : n), B)
3. Compute w := Bz.
4. global write(w, y((i - 1)r + 1 : ir))
end
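A sequential simulation of this scheme may help to fix the memory-access pattern; the sketch below (ours, using numpy and 0-based indexing) mimics p processors, each of which reads its r = n/p rows of A and a copy of x into local variables, multiplies locally, and writes its r components of y back to shared memory.

import numpy as np

def shared_memory_matvec(A, x, p):
    # Simulate Algorithm 1.1: processor i (1 <= i <= p) handles rows (i-1)r+1 .. ir of A.
    n = len(x)
    assert n % p == 0, "the algorithm assumes r = n/p is an integer"
    r = n // p
    y = np.zeros(n)                        # shared output vector
    for i in range(1, p + 1):              # each pass plays the role of processor P_i
        z = x.copy()                       # global read(x, z)
        B = A[(i - 1) * r : i * r, :]      # global read of r rows of A into local memory
        w = B @ z                          # local computation: w = Bz
        y[(i - 1) * r : i * r] = w         # global write(w, y((i-1)r+1 : ir))
    return y

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
assert np.allclose(shared_memory_matvec(A, x, p=2), A @ x)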
Set z := x + y
global write(z, B(i))
end
Figure 1.4 illustrates the algorithm for the case when n = 8. During steps 1
and 2, a copy B of A is created and is stored in the shared memory. The
computation scheme (step 3) is based on a balanced binary tree whose leaves
correspond to the elements of A. The processor responsible for performing
FIGURE 1.4
The exclusive read exclusive write (EREW) PRAM does not allow any simultaneous access to a single memory location. The concurrent read exclusive write (CREW) PRAM allows simultaneous access for a read instruction only. Access to a location for a read or a write instruction is allowed in the concurrent read concurrent write (CRCW) PRAM. The three principal varieties of CRCW PRAMs are differentiated by how concurrent writes are handled. The common CRCW PRAM allows concurrent writes only when all processors are attempting to write the same value. The arbitrary CRCW PRAM allows an arbitrary processor to succeed. The priority CRCW PRAM assumes that the indices of the processors are linearly ordered, and allows the one with the minimum index to succeed. Other variations of the CRCW PRAM model exist. It turns out that these three models (EREW, CREW, CRCW) do not differ substantially in their computational powers, although the CREW is more powerful than the EREW, and the CRCW is most powerful. We discuss their relative powers in Chapter 10.
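The operational difference between the three CRCW variants is easiest to see in code. The sketch below (ours, not from the book) resolves a set of simultaneous write requests aimed at a single memory cell under each of the three rules.

def resolve_concurrent_write(requests, policy):
    # requests: list of (processor_index, value) pairs aimed at the same cell.
    if policy == "common":
        # Legal only if all processors attempt to write the same value.
        values = {v for _, v in requests}
        if len(values) > 1:
            raise ValueError("common CRCW requires identical values")
        return values.pop()
    if policy == "arbitrary":
        # Any single request may succeed; which one is left unspecified.
        return requests[0][1]
    if policy == "priority":
        # The processor with the smallest index succeeds.
        return min(requests, key=lambda pv: pv[0])[1]
    raise ValueError("unknown policy")

writes = [(3, 7), (1, 9), (5, 7)]
print(resolve_concurrent_write(writes, "priority"))   # 9: processor 1 has the minimum index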
Remark 1.1: To simplify the presentation of PRAM algorithms, we omit the details concerning the memory-access operations. An instruction of the form Set A := B + C, where A, B, and C are shared variables, should be interpreted as the following sequence of instructions:
global read(B, x)
global read(C, y)
Set z := x + y
global write(z, A)
In the remainder of this book, no PRAM algorithm will contain explicit
memory-access instructions.
EXAMPLE 1.6:
where n = 2^k. The initialized local variables are n, and the triple of indices (i, j, l) identifying the processor.
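Most of this example is lost to the page breaks in this scan. The surviving fragment (n = 2^k, processors identified by a triple of indices) is consistent with the standard PRAM scheme for multiplying two n x n matrices with n^3 processors, in which processor P_{i,j,l} computes the product A(i, l)B(l, j) and the n products contributing to each C(i, j) are then summed in a balanced binary tree. The Python sketch below is written under that assumption and is ours, not the book's algorithm.

import math

def pram_matrix_multiply(A, B):
    # n^3 "processors": one product per (i, j, l), then log n rounds of pairwise additions.
    n = len(A)
    k = int(math.log2(n))
    # Step 1: C'(i, j, l) = A(i, l) * B(l, j), all of which could be computed concurrently.
    Cp = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)] for i in range(n)]
    # Step 2: for each (i, j), halve the list of partial sums in each of the log n rounds.
    for _ in range(k):
        for i in range(n):
            for j in range(n):
                m = len(Cp[i][j]) // 2
                Cp[i][j] = [Cp[i][j][2 * t] + Cp[i][j][2 * t + 1] for t in range(m)]
    return [[Cp[i][j][0] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(pram_matrix_multiply(A, B))   # [[19, 22], [43, 50]]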
FIGURE 1.6
A 4 x 4 mesh.
FIGURE 1.7
The staggered input of the entries of A and B into a 4 x 4 mesh for the systolic matrix-multiplication algorithm.
After O(n) steps, each processor P_{i,j} will have the correct value of C(i, j). Hence, the algorithm achieves an optimal speedup for n^2 processors relative to the standard matrix-multiplication algorithm, which requires O(n^3) operations.
Systolic algorithms operate in a fully synchronous fashion, where, at each
time unit, a processor receives data from some neighbors, then performs some
local computation, and finally sends data to some of its neighbors.
You should not conclude from Example 1.8 that mesh algorithms are
typically synchronous. Many of the algorithms developed for the mesh have
been asynchronous.
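A sequential simulation can make the systolic data movement explicit. In the sketch below (ours; the exact input arrangement of the original figure is not fully legible in this scan, so the standard skewing is assumed), row i of A enters the mesh from the left and column j of B from the top, each delayed by i and j steps respectively; at every step a cell passes its A-value right and its B-value down, and accumulates one product.

def systolic_matmul(A, B):
    # Simulate an n x n synchronous mesh computing C = AB in O(n) steps.
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]    # A-value currently held by each cell
    b_reg = [[0] * n for _ in range(n)]    # B-value currently held by each cell
    for step in range(3 * n - 2):          # enough steps for every operand to cross the mesh
        # Shift A-values one cell to the right and B-values one cell down.
        for i in range(n):
            for j in reversed(range(n)):
                a_reg[i][j] = a_reg[i][j - 1] if j > 0 else 0
        for j in range(n):
            for i in reversed(range(n)):
                b_reg[i][j] = b_reg[i - 1][j] if i > 0 else 0
        # Boundary cells receive the next (skewed) inputs.
        for i in range(n):
            k = step - i
            if 0 <= k < n:
                a_reg[i][0] = A[i][k]
        for j in range(n):
            k = step - j
            if 0 <= k < n:
                b_reg[0][j] = B[k][j]
        # Every cell multiplies the pair it currently holds and accumulates.
        for i in range(n):
            for j in range(n):
                C[i][j] += a_reg[i][j] * b_reg[i][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(systolic_matmul(A, B))   # [[19, 22], [43, 50]]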
The Hypercube. A hypercube consists of p = 2^d processors interconnected into a d-dimensional Boolean cube that can be defined as follows. Let the binary representation of i be $i_{d-1} i_{d-2} \cdots i_0$, where 0 ≤ i ≤ p - 1. Then processor P_i is connected to processors $P_{i^{(j)}}$, where $i^{(j)} = i_{d-1} \cdots i_{j+1} \bar{i}_j i_{j-1} \cdots i_0$
and $\bar{i}_j = 1 - i_j$, for 0 ≤ j ≤ d - 1. In other words, two processors are connected if and only if their indices differ in only one bit position. Notice that our processors are indexed from 0 to p - 1.
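In code, the neighbor relation is a single bit flip of the processor index; the short sketch below (ours) lists the d neighbors of a processor and tests adjacency by checking that two indices differ in exactly one bit.

def hypercube_neighbors(i, d):
    # Neighbors of P_i in a d-dimensional hypercube: flip each of the d bits of i.
    return [i ^ (1 << j) for j in range(d)]

def are_connected(i, k):
    # Connected iff the indices differ in exactly one bit position.
    diff = i ^ k
    return diff != 0 and diff & (diff - 1) == 0

print(hypercube_neighbors(0b0101, 4))     # [4, 7, 1, 13], i.e. 0100, 0111, 0001, 1101
print(are_connected(0b0101, 0b1101))      # True: the indices differ only in the leading bit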
The hypercube has a recursive structure. We can extend a d-dimensional cube to a (d + 1)-dimensional cube by connecting corresponding processors of two d-dimensional cubes. One cube has the most significant address bit equal to 0; the other cube has the most significant address bit equal to 1. Figure 1.8 shows a four-dimensional hypercube.
The diameter of a d-dimensional hypercube is d = log p, since the distance between any two processors P_i and P_j is equal to the number of bit positions in which i and j differ; hence, it is less than or equal to d, and the distance between, say, P_0 and P_{p-1} is d. Each node is of degree d = log p.
The hypercube is popular because of its regularity, its small diameter,
its many interesting graph-theoretic properties, and its ability to handle many
computations quickly and simply.
We next develop synchronous hypercube algorithms for several simple
problems, including matrix multiplication.
EXAMPLE
Each entry A(i) of an array A of size n is stored initially in the local memory of processor P_i of an (n = 2^d)-processor synchronous hypercube. The goal is
FIGURE 1.8
A four-dimensional hypercube.
to compute the sum $S = \sum_{i=0}^{n-1} A(i)$, and to store it in processor P_0. Notice that the indices of the array elements begin with 0.
The algorithm to compute S is straightforward. It consists of d iterations. The first iteration computes sums of pairs of elements between processors whose indices differ in the most significant bit position. These sums are stored in the (d - 1)-dimensional subcube whose most significant address bit is equal to 0. The remaining iterations continue in a similar fashion.
In the algorithm that follows, the hypercube operates synchronously, and i^(l) denotes the index i whose lth bit has been complemented. The instruction A(i) := A(i) + A(i^(l)) involves two substeps. In the first substep, P_i copies A(i^(l)) from processor P_{i^(l)} along the link connecting P_i and P_{i^(l)}; in the second substep, P_i performs the addition A(i) + A(i^(l)), storing the result in A(i).
ALGORITHM 1.5
(Sum on the Hypercube)
Input: An array A of n = 2^d elements such that A(i) is stored in the local memory of processor P_i, for 0 ≤ i ≤ n - 1.
Output: The sum S stored in A(0) of processor P_0.
begin
for l := d - 1 downto 0 do
if 0 ≤ i ≤ 2^l - 1 then
Set A(i) := A(i) + A(i^(l))
end
Consider, for example, the case when n = 8. Then, during the first iteration of the for loop, the sums A(0) = A(0) + A(4), A(1) = A(1) + A(5), A(2) = A(2) + A(6), and A(3) = A(3) + A(7) are computed and stored in the processors P_0, P_1, P_2, and P_3, respectively. At the completion of the second iteration, we obtain A(0) = (A(0) + A(4)) + (A(2) + A(6)) and A(1) = (A(1) + A(5)) + (A(3) + A(7)). The third iteration clearly sets A(0) to the sum S. Algorithm 1.5 terminates after d = log n parallel steps. Compare this algorithm with the PRAM algorithm (Algorithm 1.2) that computes the sum in O(log n) steps as well.
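The following Python sketch (ours) simulates Algorithm 1.5 sequentially: the list A stands for the collection of local memories, and in iteration l each active processor adds in the value held by its dimension-l neighbor, exactly as in the trace above.

def hypercube_sum(A):
    # n = 2^d values, one per processor; after d iterations the sum sits in A[0].
    A = list(A)                        # local memories A(0), ..., A(n-1)
    n = len(A)
    d = n.bit_length() - 1             # n is assumed to be a power of two
    for l in range(d - 1, -1, -1):     # l = d-1, d-2, ..., 0
        for i in range(2 ** l):        # active processors P_0, ..., P_{2^l - 1}
            # P_i reads A(i^(l)) over the link to its dimension-l neighbor, then adds locally.
            A[i] = A[i] + A[i ^ (1 << l)]
    return A[0]

data = list(range(8))                  # n = 8, d = 3, as in the worked example above
assert hypercube_sum(data) == sum(data)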
EXAMPLE
the index i, and finally the q least significant bits correspond to the index j. In particular, if we fix any pair of the indices l, i, and j, and vary the remaining index over all its possible values, we obtain a subcube of dimension q.
The input array A is stored in the subcube determined by the processors P_{l,i,0}, where 0 ≤ l, i ≤ n - 1, such that A(i, l) is stored in processor P_{l,i,0}. Similarly, the input array B is stored in the subcube formed by the processors P_{l,0,j}, where processor P_{l,0,j} holds the entry B(l, j).
The goal is to compute $C(i, j) = \sum_{l=0}^{n-1} A(i, l) B(l, j)$, for 0 ≤ i, j ≤ n - 1.
The overall algorithm consists of three stages.
1. The input data are distributed such that processor P_{l,i,j} will hold the two entries A(i, l) and B(l, j), for 0 ≤ l, i, j ≤ n - 1.
2. Processor P_{l,i,j} computes the product C'(l, i, j) = A(i, l)B(l, j), for all 0 ≤ i, j, l ≤ n - 1.
3. For each 0 ≤ i, j ≤ n - 1, the processors P_{l,i,j}, where 0 ≤ l ≤ n - 1, compute the sum $C(i, j) = \sum_{l=0}^{n-1} C'(l, i, j)$.
The implementation of the first stage consists of two substages. In the first substage, we broadcast, for each i and l, A(i, l) from processor P_{l,i,0} to P_{l,i,j}, for 0 ≤ j ≤ n - 1. Since the set of processors {P_{l,i,j} | 0 ≤ j ≤ n - 1} forms a q-dimensional cube for each pair i and l, we can use the previous broadcasting algorithm (Algorithm 1.6) to broadcast A(i, l) from P_{l,i,0} to all the processors P_{l,i,j}. In the second substage, each element B(l, j) held in processor P_{l,0,j} is broadcast to processors P_{l,i,j}, for all 0 ≤ i ≤ n - 1. At the end of the second substage, processor P_{l,i,j} will hold the two entries A(i, l) and B(l, j). Using our broadcasting algorithm (Algorithm 1.6), we can complete the first stage in 2q = O(log n) parallel steps.
The second stage consists of performing one multiplication in each processor P_{l,i,j}. Hence, this stage requires one parallel step. At the end of this stage, processor P_{l,i,j} holds C'(l, i, j).
The third stage consists of computing the n^2 sums C(i, j); the terms C'(l, i, j) of each sum reside in a q-dimensional hypercube {P_{l,i,j} | 0 ≤ l ≤ n - 1}. As we have seen before (Algorithm 1.5), each such sum can be computed in q = O(log n) parallel steps. Processor P_{0,i,j} will hold the entry C(i, j) of the product. Therefore, the product of two n x n matrices can be computed in O(log n) time on an n^3-processor hypercube.
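The three stages translate directly into code. The sketch below (ours, not the book's program) keeps one dictionary entry per processor P_{l,i,j}; the broadcasts of stage 1 are simulated by direct assignments rather than by Algorithm 1.6, so it reproduces the data layout and the results, not the communication pattern.

def hypercube_matmul(A, B):
    # Simulate the three-stage n^3-processor scheme; n must be a power of two.
    n = len(A)
    a, b = {}, {}
    # Stage 1: every processor P_{l,i,j} obtains A(i, l) and B(l, j) (broadcasts simulated directly).
    for l in range(n):
        for i in range(n):
            for j in range(n):
                a[(l, i, j)] = A[i][l]
                b[(l, i, j)] = B[l][j]
    # Stage 2: each processor performs its single multiplication C'(l, i, j) = A(i, l)B(l, j).
    c = {key: a[key] * b[key] for key in a}
    # Stage 3: sum over l by pairwise halving, as in the hypercube sum algorithm.
    m = n
    while m > 1:
        m //= 2
        for l in range(m):
            for i in range(n):
                for j in range(n):
                    c[(l, i, j)] += c[(l + m, i, j)]
    return [[c[(0, i, j)] for j in range(n)] for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(hypercube_matmul(A, B))   # [[19, 22], [43, 50]]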
1.3.4 COMPARISON
Although, for a given situation, each of the parallel models introduced could
be clearly advantageous, we believe that the shared memory model is most
suited for the general presentation of parallel algorithms. Our choice for the remainder of this book is the PRAM model, a choice justified by the discussion that follows.
In spite of its simplicity, the dag model applies to a specialized class of
problems and suffers from several deficiencies. Unless the algorithm is fairly
regular, the dag could be quite complicated and very difficult to analyze. The
dag model presents only partial information about a parallel algorithm, since
a scheduling problem and a processor allocation problem will still have to be
resolved. In addition, it has no natural mechanisms to handle communication
among the processors or to handle memory allocations and memory accesses.
Although the network model seems to be considerably better suited to resolving both computation and communication issues than is the dag model, its comparison with the shared-memory model is more subtle. For our purposes, the network model has two main drawbacks. First, it is significantly more difficult to describe and analyze algorithms for the network model. Second, the network model depends heavily on the particular topology under consideration. Different topologies may require completely different algorithms to solve the same problem, as we have already seen with the parallel implementation of the standard matrix-multiplication algorithm. These arguments clearly tip the balance in favor of the shared-memory model as a more suitable algorithmic model.
The PRAM model, which is the synchronous version of the shared-memory model, draws its power from the following facts:
There exists a well-developed body of techniques and methods to handle many different classes of computational problems on the PRAM model.
The PRAM model removes algorithmic details concerning synchronization and communication, and thereby allows the algorithm designer to focus on the structural properties of the problem.
The PRAM model captures several important parameters of parallel computations. A PRAM algorithm includes an explicit understanding of the operations to be performed at each time unit, and explicit allocation of processors to jobs at each time unit.
The PRAM design paradigms have turned out to be robust. Many of the network algorithms can be directly derived from PRAM algorithms. In addition, recent research advances have shown that PRAM algorithms can be mapped efficiently on several bounded-degree networks (see the bibliographic notes at the end of this chapter).
It is possible to incorporate issues such as synchronization and communication into the shared-memory model; hence, PRAM algorithms can be analyzed within this more general framework.
For the remainder of this book, we use the PRAM model as our formal
model to design and analyze parallel algorithms. Sections 1.4 through 1.6
3. Set S := B(1)
end
This version of the parallel algorithm contains no mention of how many processors there are, or how the operations will be allocated to processors. It is stated only in terms of time units, where each time unit may include any number of concurrent operations. In particular, we have log n + 2 time units, where n operations are performed within the first time unit (step 1); the jth time unit (iteration h = j - 1 of step 2) includes n/2^{j-1} operations, for 2 ≤ j ≤ log n + 1; and only one operation takes place at the last time unit (step 3). Therefore, the work performed by this algorithm is $W(n) = n + \sum_{h=1}^{\log n} (n/2^h) + 1 = O(n)$. The running time is clearly T(n) = O(log n).
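A direct transcription of this description is given below (ours; the surviving scan omits the body of Algorithm 1.7 itself). It executes the same sequence of time units and counts the operations performed in each, so that W(n) = O(n) and T(n) = O(log n) can be checked on small inputs.

import math

def wt_sum(A):
    # Sum of n = 2^k numbers in the WT framework; returns (S, time_units, total_work).
    n = len(A)
    k = int(math.log2(n))
    work = 0
    B = list(A)                    # time unit 1 (step 1): n concurrent copies B(i) := A(i)
    work += n
    for h in range(1, k + 1):      # time units 2 .. log n + 1 (step 2)
        for i in range(1, n // 2 ** h + 1):
            B[i - 1] = B[2 * i - 2] + B[2 * i - 1]    # Set B(i) := B(2i - 1) + B(2i)
        work += n // 2 ** h        # n / 2^h concurrent additions in this time unit
    S = B[0]                       # last time unit (step 3): Set S := B(1)
    work += 1
    return S, k + 2, work

print(wt_sum(list(range(8))))      # (28, 5, 16): log n + 2 = 5 time units, W(n) = 2n = 16 operations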
FIGURE 1.9
The WT scheduling principle. During each time unit i, the W_i(n) operations are scheduled as evenly as possible among the available p processors. For example, during time units 1 and 2, each processor is scheduled to execute the same number of operations; during time unit 3, the pth processor executes one less operation than are executed by the remaining processors; and during time unit 4, there are only k possible concurrent operations, which are distributed to the k smallest-indexed processors.
EXAMPLE 1.13:
FIGURE 1.10
Processor allocation for computing the sum of eight elements on the PRAM.
The operation represented by a node is executed by the processor indicated
below the node.
and so on. Therefore, even though our parallel algorithm requires only a total of O(n) operations, the algorithm cannot efficiently utilize the n processors to do useful work.
For a parallel algorithm that performs W(n) = O(T*(n)) operations, the WT scheduling principle implies that the algorithm can be simulated on p processors in time $T_p(n) = O(T^*(n)/p + T(n))$. Hence, the speedup satisfies

$$S_p(n) = \frac{T^*(n)}{T_p(n)} = \Omega\!\left(\frac{p\,T^*(n)}{T^*(n) + p\,T(n)}\right).$$

It follows that the algorithm achieves an optimal speedup (that is, $S_p(n) = \Theta(p)$) whenever $p = O(T^*(n)/T(n))$. Therefore, the faster the parallel algorithm, the larger the range of p for which the algorithm achieves an optimal speedup.
We have not yet factored the running time T(n) of the parallel algorithm into our notion of optimality. An optimal parallel algorithm is work-time (WT) optimal, or optimal in the strong sense, if it can be shown that T(n) cannot be improved by any other optimal parallel algorithm. Therefore, the running time of a WT optimal algorithm represents the ultimate speed that can be achieved without sacrificing optimality in the total number of operations.
EXAMPLE 1.14:
Consider the PRAM algorithm to compute the sum given in the WT framework (Algorithm 1.7). We have already noticed that T(n) = O(log n) and
1.7 Communication Complexity
Consider the adaptation of this algorithm to the case where there are n processors available. In particular, the corresponding running time must be O(n^2). We examine the communication complexity of Algorithm 1.9 relative to a particular processor allocation scheme.
A straightforward scheme proceeds by allocating the operations included in each time unit to the available processors (as in the statement of the WT scheduling principle). In particular, the n^3 concurrent operations of step 1 can be allocated equally among the n processors as follows. For each 1 ≤ i ≤ n, processor P_i computes C'(i, j, l) = A(i, l)B(l, j), where 1 ≤ j, l ≤ n; hence, P_i has to read the ith row of A and all of matrix B from the shared memory. A traffic of O(n^2) numbers is created between the shared memory and each of the local memories of the processors.
The hth iteration of the loop at step 2 requires n^3/2^h concurrent operations, which can be allocated as follows. Processor P_i's task is to update the values C'(i, j, l), for all 1 ≤ j, l ≤ n; hence, P_i can read all the necessary values C'(i, j, l) for all indices 1 ≤ j, l ≤ n, and can then perform the operations required on this set of values. Again, O(n^2) entries get swapped between the shared memory and the local memory of each processor.
Finally, step 3 can be implemented easily with O(n) communication, since processor P_i has to store the ith row of the product matrix in the shared memory, for 1 ≤ i ≤ n.
Therefore, we have a processor allocation scheme that adapts the WT scheduling principle successfully and that has a communication requirement of O(n^2).
We now develop another parallel implementation of the standard matrix-multiplication algorithm that will result in many fewer data elements being transferred between the shared memory and each of the local memories of the n processors. In addition, the computation time remains the same.
Assume, without loss of generality, that $\alpha = n^{1/3}$ is an integer. Partition matrix A into $\alpha \times \alpha$ blocks of submatrices, each of size $n^{2/3} \times n^{2/3}$, as follows:

$$A = \begin{pmatrix} A_{1,1} & A_{1,2} & \cdots & A_{1,\alpha} \\ A_{2,1} & A_{2,2} & \cdots & A_{2,\alpha} \\ \vdots & \vdots & & \vdots \\ A_{\alpha,1} & A_{\alpha,2} & \cdots & A_{\alpha,\alpha} \end{pmatrix}$$
We partition B and C in the same way. Notice that there are exactly n pairs $(A_{i,l}, B_{l,j})$, over all i, j, and l.
The new processor allocation follows a strategy different from the one outlined in the WT scheduling principle. Each processor reads a unique pair $(A_{i,l}, B_{l,j})$ of blocks from A and B, respectively, and computes the product $D_{l,i,j} = A_{i,l} B_{l,j}$, which is then stored in the shared memory. The amount of communication needed is $O(n^{4/3})$, which accounts for the cost of transferring a pair of blocks from the shared memory into each local memory
and the cost of storing a new block from each local memory into the shared memory. On the other hand, the amount of computation required for performing $D_{l,i,j} = A_{i,l} B_{l,j}$ is $O(n^2)$, since each block is of size $n^{2/3} \times n^{2/3}$.
Next, each block $C_{i,j}$ of the product matrix C is given by $C_{i,j} = \sum_{l=1}^{\alpha} D_{l,i,j}$, and there are $n^{2/3}$ such blocks. We can now allocate $n^{1/3}$ processors to compute each block $C_{i,j}$ such that the computation proceeds in the fashion of a balanced binary tree whose $n^{1/3}$ leaves contain the blocks $D_{l,i,j}$, where $1 \le l \le n^{1/3}$. Each level of the tree requires the concurrent access of a set of blocks, each of size $n^{2/3} \times n^{2/3}$. Hence, the execution of the operations represented by each level of a tree requires $O(n^{4/3})$ communication. Therefore, the total amount of communication required for computing all the $C_{i,j}$'s is $O(n^{4/3} \log n)$, which is substantially smaller than the $O(n^2)$ required by the previous processor allocation scheme.
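The following numpy sketch (ours) mimics the block-partitioned allocation: each of the n virtual processors reads one pair of n^{2/3} x n^{2/3} blocks and writes their product back, so its shared-memory traffic is about 3n^{4/3} numbers instead of the Theta(n^2) of the row-oriented scheme; for simplicity the final reduction over l is done with an ordinary sum rather than the balanced trees described above.

import numpy as np

def block_matmul(A, B):
    # Block-partitioned product: alpha^3 = n block products, each touching O(n^{4/3}) data.
    n = A.shape[0]
    alpha = round(n ** (1 / 3))                 # assumes alpha = n^{1/3} is an integer
    s = n // alpha                              # block side length n^{2/3}
    traffic_per_processor = 3 * s * s           # read two blocks, write one block back
    D = {}
    for l in range(alpha):                      # one (l, i, j) triple per virtual processor
        for i in range(alpha):
            for j in range(alpha):
                A_block = A[i * s:(i + 1) * s, l * s:(l + 1) * s]
                B_block = B[l * s:(l + 1) * s, j * s:(j + 1) * s]
                D[(l, i, j)] = A_block @ B_block
    C = np.zeros_like(A)
    for i in range(alpha):                      # C_{i,j} = sum over l of D_{l,i,j}
        for j in range(alpha):
            C[i * s:(i + 1) * s, j * s:(j + 1) * s] = sum(D[(l, i, j)] for l in range(alpha))
    return C, traffic_per_processor

n = 8                                           # alpha = 2, block side s = 4
A = np.random.rand(n, n)
B = np.random.rand(n, n)
C, traffic = block_matmul(A, B)
assert np.allclose(C, A @ B)
print(traffic)                                  # 48 numbers per processor for the product stage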
Remark 1.3: We can reduce to O(n^{4/3}) the communication cost of the second
1.8 Summary
The design and analysis of parallel algorithms involve a complex set of interrelated issues that is difficult to model appropriately. These issues include computational concurrency, processor allocation and scheduling, communication, synchronization, and granularity (granularity is a measure of the
amount of computation that can be performed by the processors between
synchronization points). An attempt to capture most of the related parameters
makes the process of designing parallel algorithms a challenging task. We have
opted for simplicity and elegance, while attempting to shed light on some of
the important performance issues arising in the design of parallel algorithms.
Exercises
1.1. We have seen how to schedule the dag corresponding to the standard algorithm for multiplying two n x n matrices in O(log n) time using n^3 processors. What is the optimal schedule for an arbitrary number p of processors, where 1 ≤ p ≤ n^3? What is the corresponding parallel complexity?
1.2. Consider the problem of computing X^n, where n = 2^k for some integer k. The repeated-squaring algorithm consists of computing X^2 = X x X, X^4 = X^2 x X^2, X^8 = X^4 x X^4, and so on.
a. Draw the dag corresponding to this algorithm. What is the optimal schedule for p processors, where 1 ≤ p ≤ n?
b. Draw the dag and give the optimal schedule for the case when X is an m x m matrix.
1.3. Let A be an n x n lower triangular matrix such that $a_{ii} \neq 0$, for 1 ≤ i ≤ n, and let b be an n-dimensional vector. The back-substitution method to solve the linear system of equations Ax = b begins by determining $x_1$ using the first equation ($a_{11}x_1 = b_1$), then determining $x_2$ using the second equation ($a_{21}x_1 + a_{22}x_2 = b_2$), and so on.
Bibliographic Notes
The three parallel models introduced in this chapter have received considerable attention in the literature. Dags have been widely used to model algorithms, especially for numerical computations (an early reference is [6]). More advanced parallel algorithms for this model and a discussion of related issues can be found in [5]. Some of the early algorithms for shared-memory models have appeared in [4, 9, 10, 16, 17, 19, 24, 26]. Rigorous descriptions of shared-memory models were introduced later in [11, 12]. The WT scheduling principle is derived from a theorem in [7]. In the literature, this principle is commonly referred to as Brent's theorem or Brent's scheduling principle. The relevance of this theorem to the design of PRAM algorithms was initially pointed out in [28]. The mesh is perhaps one of the earliest parallel models studied in some detail. Since then, many networks have been proposed for parallel processing. The recent books [2, 21, 23] give more advanced parallel algorithms on various networks and additional references. Recent work on the mapping of PRAM algorithms on bounded-degree networks is described in [3, 13, 14, 20, 25]. Our presentation on the communication complexity of the matrix-multiplication problem in the shared-memory model is taken from [1]. Data-parallel algorithms are described in [15]. Parallel architectures have been described in several books (see, for example, [18, 29]). The necessary background material for this book can be found in many textbooks, including [8, 22, 27].
References
1. Aggarwal, A., A. K. Chandra, and M. Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71(1):3-28, 1990.
2. Akl, S. G. The Design and Analysis of Parallel Algorithms. Prentice-Hall, Englewood Cliffs, NJ, 1989.
3. Alt, H., T. Hagerup, K. Mehlhorn, and F. P. Preparata. Simulation of idealized parallel computers on more realistic ones. SIAM J. Computing, 16(5):808-835, 1987.
4. Arjomandi, E. A Study of Parallelism in Graph Theory. PhD thesis, Computer
Science Department, University of Toronto, Toronto, Canada, 1975.
5. Bertsekas, D. P., and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice-Hall, Englewood Cliffs, NJ, 1989.
6. Borodin, A., and I. Munro. The Computational Complexity of Algebraic and
Numeric Problems. American Elsevier, New York, 1975.
7. Brent, R. P. The parallel evaluation of general arithmetic expressions. JACM,
21(2):201-208,1974.
8. Cormen, T. H., C. E. Leiserson, and R. L. Rivest. Introduction to Algorithms. MIT
Press, Cambridge, MA, and McGraw-Hill, New York, 1990.
9. Csanky, L. Fast parallel matrix inversion algorithms. SIAM J. Computing, 5(4):618-623, 1976.
10. Eckstein, D. M. Parallel Processing Using Depth-First Search and Breadth-First
Search. PhD thesis, Computer Science Department, University of Iowa, Iowa
City, IA, 1977.
References
41
11. Fortune, S., and J. Wyllie. Parallelism in random access machines. In Proceedings
Tenth Annual ACM Symposium on Theory of Computing, San Diego, CA, 1978,
pages 114-118. ACM Press, New York.
12. Goldschlager, L. M. A unified approach to models of synchronous parallel machines. In Proceedings Tenth Annual ACM Symposium on Theory of Computing, San Diego, CA, 1978, pages 89-94. ACM Press, New York.
13. Herley, K. T. Efficient simulations of small shared memories on bounded degree networks. In Proceedings Thirtieth Annual Symposium on Foundations of Computer Science, Research Triangle Park, NC, 1989, pages 390-395. IEEE Computer Society Press, Los Alamitos, CA.
14. Herley, K. T., and G. Bilardi. Deterministic simulations of PRAMs on bounded-degree networks. In Proceedings Twenty-Sixth Annual Allerton Conference on Communication, Control and Computation, Monticello, IL, 1988, pages 1084-1093.
15. Hillis, W. D., and G. L. Steele. Data parallel algorithms. Communications of the ACM, 29(12):1170-1183, 1986.
16. Hirschberg, D. S. Parallel algorithms for the transitive closure and the connected
components problems. In Proceedings Eighth Annual ACM Symposium on Theory
of Computing, Hershey, PA, 1976, pages 55-57. ACM Press, New York.
17. Hirschberg, D. S. Fast parallel sorting algorithms. Communications of the ACM, 21(8):657-661, 1978.
18. Hwang, K., and F. Briggs. Computer Architecture and Parallel Processing. McGraw-Hill, New York, 1984.
19. JaJa, J. Graph connectivity problems on parallel computers. Technical Report
CS-78-05, Pennsylvania State University, University Park, PA, 1978.
20. Karlin, A., and E. Upfal. Parallel hashing: An efficient implementation of shared memory. SIAM J. Computing, 35(4):876-892, 1988.
21. Leighton, T. Introduction to Parallel Algorithms and Architectures: Arrays, Trees,
and Hypercubes. Morgan Kaufmann, San Mateo, CA, 1991.
22. Manber, U. Introduction to Algorithms: A Creative Approach. Addison-Wesley,
Reading, MA, 1989.
23. Miller, R., and Q. F. Stout. Parallel Algorithms for Regular Architectures. MIT
Press, Cambridge, MA, 1992.
24. Preparata, F. P. New parallel sorting schemes. IEEE Transactions on Computers, C-27(7):669-673, 1978.
25. Ranade, A. G. How to emulate shared memory. In Proceedings Twenty-Eighth
Annual Symposium on the Foundations of Computer Science, Los Angeles, CA,
1987, pages 185-192. IEEE Press, Piscataway, NJ.
26. Savage, C. Parallel Algorithms for Graph Theoretic Problems. PhD thesis, Computer Science Department, University of Illinois, Urbana, IL, 1978.
27. Sedgewick, R. Algorithms. Addison-Wesley, Reading, MA, 1983.
28. Shiloach, Y., and U. Vishkin. An O(n^2 log n) parallel max-flow algorithm. Journal of Algorithms, 3(2):128-146, 1982.
29. Stone, H. S. High-Performance Computer Architecture. Addison-Wesley, Reading,
MA, 1987.