
1973 A Parallel Algorithm For The Efficient Solution of A General Class of Recurrence Equations

This document describes a parallel algorithm for efficiently solving a general class of recurrence equations on parallel computers. The algorithm uses a technique called recursive doubling, which splits the computation of a function into two equally complex subfunctions that can be evaluated simultaneously on separate processors. Successive splitting of each subfunction spreads the computation across more processors. The algorithm can solve any recurrence equation of the form x_i = f(b_i, g(a_i, x_{i-1})), where f and g satisfy certain associative and distributive properties, in time proportional to the logarithm of N, where N is the number of elements to be computed, on a computer with N-fold parallelism. Solving the same recurrence sequentially takes time proportional to N.


IEEE TRANSACTIONS ON COMPUTERS, VOL. C-22, NO. 8, AUGUST 1973


A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations
PETER M. KOGGE AND HAROLD S. STONE

Abstract-An mth-order recurrence problem is defined as the computation of the series $x_1, x_2, \ldots, x_N$, where $x_i = f_i(x_{i-1}, \ldots, x_{i-m})$ for some function $f_i$. This paper uses a technique called recursive doubling in an algorithm for solving a large class of recurrence problems on parallel computers such as the Illiac IV.

Recursive doubling involves the splitting of the computation of a function into two equally complex subfunctions whose evaluation can be performed simultaneously in two separate processors. Successive splitting of each of these subfunctions spreads the computation over more processors.

This algorithm can be applied to any recurrence equation of the form $x_i = f(b_i, g(a_i, x_{i-1}))$ where $f$ and $g$ are functions that satisfy certain distributive and associative-like properties. Although this recurrence is first order, all linear mth-order recurrence equations can be cast into this form. Suitable applications include linear recurrence equations, polynomial evaluation, several nonlinear problems, the determination of the maximum or minimum of $N$ numbers, and the solution of tridiagonal linear equations. The resulting algorithm computes the entire series $x_1, \ldots, x_N$ in time proportional to $\lceil \log_2 N \rceil$ on a computer with N-fold parallelism. On a serial computer, computation time is proportional to $N$.

Index Terms-Parallel algorithms, parallel computation, recurrence problems, recursive doubling.
Manuscript received October 7, 1972; revised March 21, 1973. This work was supported in part by NSF Grant GJ 1180 and in part by an IBM Corporation fellowship.
P. M. Kogge was with the Department of Electrical Engineering, Digital Systems Laboratory, Stanford University, Stanford, Calif. He is now with the Systems Architecture Department, IBM Corporation, Owego, N.Y. 13827.
H. S. Stone is with the Department of Electrical Engineering and the Department of Computer Science, Digital Systems Laboratory, Stanford University, Stanford, Calif.

I. INTRODUCTION

A. Definition of Problem

It frequently occurs in applied mathematics that the solution to some problem is a sequence $x_1, x_2, \ldots, x_N$, where each $x_i$ is a function of the previous $m$ $x$'s, namely $x_{i-1}, \ldots, x_{i-m}$. A common example of such a problem is a time-varying linear system, where the state of the system at time $i$ is $x_i$, and can be computed from the equations

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on January 30,2023 at 11:34:18 UTC from IEEE Xplore. Restrictions apply.
KOGGE AND STONE: PARALLEL ALGORITHM 787

$$
\begin{aligned}
x_1 &= B_1 \\
x_2 &= A_2 x_1 + B_2 \\
x_3 &= A_3 x_2 + B_3 \\
&\ \,\vdots \\
x_N &= A_N x_{N-1} + B_N
\end{aligned} \tag{1}
$$

where $A_i$ and $B_i$ represent the internal dynamics of the system. $A_i$ and $B_i$ can be real or complex numbers, constant or time-varying matrices, etc., depending on the problem.

The equation used to compute $x_i$ is called a recurrence equation and, together with some initial values for some of the $x_i$, represents a complete problem description. Formally, a recurrence problem consists of a set of recurrence equations:

$$x_i = f_i(x_{i-1}, \ldots, x_{i-m}), \qquad i = m+1, \ldots, N \tag{2}$$

and some boundary values, which may consist of the following.

1) $x_1, \ldots, x_m$. This is an initial value problem.
2) $x_{N-m+1}, \ldots, x_N$. This is a final value problem.
3) A mixture of $m$ initial and final values.

This paper discusses an algorithm for solving a particular class of initial value recurrence problems on parallel computing systems such as the Illiac IV. This class of problems includes the computation of the sequence $x_1, \ldots, x_N$ when the expression for $x_i$ is a linear recurrence equation of the form of (1), the calculation of the maximum or minimum of $N$ numbers, the evaluation of Nth-degree polynomials, and several nonlinear problems. Such problems as these can be solved in a very straightforward manner on serial processors in time proportional to $N$. Some have also been solved on parallel computers with special-purpose algorithms tailored to those problems, e.g., polynomial evaluation (Munro and Paterson [3]). With a computer having N-fold parallelism, the algorithm in this paper solves all these problems and others in time proportional to $\lceil \log_2 N \rceil$.¹

¹ $\lceil x \rceil$ is the ceiling function and represents the smallest integer not smaller than $x$.

B. Computer Model

The algorithm to be described in this paper is designed for a computer of the Illiac IV class. The major assumptions about the computer's architecture are as follows.

Assumption 1: There are $p$ identical processors, each able to execute the usual arithmetic and logical operations, and each with its own memory.

Assumption 2: Each processor can communicate with every other processor. The exact method of data exchange between processors can affect the algorithm's computational complexity and will be discussed in a future report.

Assumption 3: Each processor has a distinct index by which it is referenced.

Assumption 4: All processors obtain their instructions simultaneously from a single instruction stream. Thus all processors execute the same instruction, but they operate on data stored in their own memories.

Assumption 5: Any processor may be "blocked" or "masked" from performing some instruction. This mask may be set by an explicit instruction directed to that processor via its index, or by the result of some test instruction such as "set mask if accumulator = 0."

Assumption 6: Elementary arithmetic operations have two [...]

It is assumed throughout this paper that the number of processors $p$ is greater than $N$, the maximum number of elements to be computed. In reality when $p$ is less than $N$, this algorithm can be used $\lceil N/p \rceil$ times to calculate $p$ elements of the series at a time.

II. GENERAL FIRST-ORDER RECURRENCE EQUATION

A. Example

In this section we develop a parallel solution to a simple first-order recurrence problem. The solution is a special case of the general algorithm, but its development is not obscured by the notation needed to describe the general algorithm.

Given $x_1 = b_1$, find $x_2, \ldots, x_N$, where

$$x_i = a_i x_{i-1} + b_i. \tag{3}$$

Definition:

$$\hat{Q}(m, n) = \sum_{j=n}^{m} \Bigl( \prod_{r=j+1}^{m} a_r \Bigr) b_j$$

where the vacuous product $\prod_{r=m+1}^{m} a_r$ is given the value 1.

Stone [4] first used this notation in the derivation of this algorithm. The basic algorithm involves a concept called recursive doubling, which consists of breaking the calculation of one term into two equally complex subterms.

Now we can write the solution to (3) as follows:

$$
\begin{aligned}
x_1 &= b_1 = \hat{Q}(1, 1) \\
x_2 &= a_2 x_1 + b_2 = a_2 b_1 + b_2 = \hat{Q}(2, 1) \\
x_3 &= a_3 x_2 + b_3 = a_3 a_2 b_1 + a_3 b_2 + b_3 = \hat{Q}(3, 1) \\
&\ \,\vdots \\
x_i &= a_i x_{i-1} + b_i = \hat{Q}(i, 1) \\
&\ \,\vdots \\
x_N &= a_N x_{N-1} + b_N = \hat{Q}(N, 1).
\end{aligned} \tag{4}
$$

We can also write this solution as

$$\hat{Q}(1, 1) = x_1 = b_1$$


$$\hat{Q}(2, 1) = x_2 = a_2 x_1 + b_2 = a_2 \hat{Q}(1, 1) + \hat{Q}(2, 2)$$

$$
\begin{aligned}
\hat{Q}(4, 1) = x_4 &= a_4 x_3 + b_4 \\
&= a_4 a_3 x_2 + a_4 b_3 + b_4 \\
&= a_4 a_3 (a_2 b_1 + b_2) + (a_4 b_3 + b_4) \\
&= a_4 a_3 \hat{Q}(2, 1) + \hat{Q}(4, 3).
\end{aligned}
$$

In general

$$\hat{Q}(2i, 1) = x_{2i} = \Bigl( \prod_{r=i+1}^{2i} a_r \Bigr) \hat{Q}(i, 1) + \hat{Q}(2i, i+1). \tag{5}$$

Equation (5) gives us our recursive doubling. Both $\hat{Q}(i, 1)$ and $\hat{Q}(2i, i+1)$ are identical in structure since they both require the same number and sequence of multiplications and additions. Also, each of these terms involves $i$ $a$'s and $i$ $b$'s, exactly one-half the number of $a$'s and $b$'s used in $\hat{Q}(2i, 1)$. Thus if at the $k$th step we want to compute $x_{2i}$, then at the $(k-1)$st step we should have one processor compute $\hat{Q}(i, 1)$ and another compute $\hat{Q}(2i, i+1)$. We then continue this splitting operation recursively. The resulting computation graph for the case $N = 8$ is given in Fig. 1.

[Fig. 1. Parallel computation of $x_8$ in the sequence $x_i = a_i x_{i-1} + b_i$.]

Note that when we compute $\hat{Q}(2i, 1)$ from the two equally complex subterms $\hat{Q}(i, 1)$ and $\hat{Q}(2i, i+1)$, we also need the additional product $\prod_{r=i+1}^{2i} a_r$. This is not a serious hindrance since we can compute the products using the scheme shown in Fig. 2. We see that in all cases the correction products needed at one level of the tree in Fig. 1 are always available just after the previous level in Fig. 2. Figs. 1 and 2 show the computation of $\hat{Q}(8, 1)$. However, it is straightforward to extend the computation to eight processors, and compute $\hat{Q}(i, 1)$ for $1 \le i \le 8$ in parallel. The algorithm solves (3) in a time proportional to $\lceil \log_2 N \rceil$.

[Fig. 2. Parallel computation of $X_8 = \prod_{i=1}^{8} a_i$.]

An example of the complete solution of (3) for the case $N = 8$ is given in detail in Table I.

TABLE I

| Processor | T=0: A(i) | T=0: B(i) | T=1: A(i) | T=1: B(i) | T=2: A(i) | T=2: B(i) | T=3: A(i) | T=3: B(i) |
|---|---|---|---|---|---|---|---|---|
| 1 | ** | $b_1 = x_1 = \hat{Q}(1,1)$ | ** | $x_1 = \hat{Q}(1,1)$ | ** | $x_1 = \hat{Q}(1,1)$ | ** | $x_1$ |
| 2 | $a_2$ | $b_2 = \hat{Q}(2,2)$ | $a_2$ | $a_2 x_1 + b_2 = x_2 = \hat{Q}(2,1)$ | $a_2$ | $x_2 = \hat{Q}(2,1)$ | $a_2$ | $x_2$ |
| 3 | $a_3$ | $b_3 = \hat{Q}(3,3)$ | $a_3 a_2$ | $a_3 b_2 + b_3 = \hat{Q}(3,2)$ | $a_3 a_2$ | $(a_3 a_2) x_1 + (a_3 b_2 + b_3) = x_3 = \hat{Q}(3,1)$ | $a_3 a_2$ | $x_3$ |
| 4 | $a_4$ | $b_4 = \hat{Q}(4,4)$ | $a_4 a_3$ | $a_4 b_3 + b_4 = \hat{Q}(4,3)$ | $a_4 a_3 a_2$ | $(a_4 a_3) x_2 + (a_4 b_3 + b_4) = x_4 = \hat{Q}(4,1)$ | $a_4 a_3 a_2$ | $x_4$ |
| 5 | $a_5$ | $b_5 = \hat{Q}(5,5)$ | $a_5 a_4$ | $a_5 b_4 + b_5 = \hat{Q}(5,4)$ | $a_5 a_4 a_3 a_2$ | $a_5 a_4 (a_3 b_2 + b_3) + a_5 b_4 + b_5 = \hat{Q}(5,2)$ | $\prod_{m=2}^{5} a_m$ | $x_5$ |
| 6 | $a_6$ | $b_6 = \hat{Q}(6,6)$ | $a_6 a_5$ | $a_6 b_5 + b_6 = \hat{Q}(6,5)$ | $a_6 a_5 a_4 a_3$ | $\sum_{w=3}^{6} \bigl( \prod_{m=w+1}^{6} a_m \bigr) b_w = \hat{Q}(6,3)$ | $\prod_{m=2}^{6} a_m$ | $x_6$ |
| 7 | $a_7$ | $b_7 = \hat{Q}(7,7)$ | $a_7 a_6$ | $a_7 b_6 + b_7 = \hat{Q}(7,6)$ | $a_7 a_6 a_5 a_4$ | $\sum_{w=4}^{7} \bigl( \prod_{m=w+1}^{7} a_m \bigr) b_w = \hat{Q}(7,4)$ | $\prod_{m=2}^{7} a_m$ | $x_7$ |
| 8 | $a_8$ | $b_8 = \hat{Q}(8,8)$ | $a_8 a_7$ | $a_8 b_7 + b_8 = \hat{Q}(8,7)$ | $a_8 a_7 a_6 a_5$ | $\sum_{w=5}^{8} \bigl( \prod_{m=w+1}^{8} a_m \bigr) b_w = \hat{Q}(8,5)$ | $\prod_{m=2}^{8} a_m$ | $x_8$ |

* Not really needed to compute $x_1, \ldots, x_N$.
** Arbitrary.

B. A General Class of First-Order Recurrence Equations

In this section we define a general class of first-order recurrence equations for which we develop a parallel algorithm. The limitation to first-order equations is not as restrictive as it might first appear, since it is often the case that we can very easily reformulate a more general mth-order problem as a first-order problem. Section III-B describes such a reformulation.

The general parallel algorithm developed in Section II-C solves all recurrence equations that can be placed in the following form:

$$
\begin{aligned}
x_1 &= b_1 \\
x_i &= f(b_i, g(a_i, x_{i-1})), \qquad 2 \le i \le N
\end{aligned} \tag{6}
$$

where $b_i$ and $a_i$ are arbitrary constants and $f$ and $g$ are index-independent functions that satisfy the following restrictions.

Restriction 1: $f$ is associative. $f(x, f(y, z)) = f(f(x, y), z)$.

Restriction 2: $g$ distributes over $f$. $g(x, f(y, z)) = f(g(x, y), g(x, z))$.

Restriction 3: $g$ is semiassociative, that is, there exists some function $h$ such that $g(x, g(y, z)) = g(h(x, y), z)$.

The previous restrictions on $f$ and $g$ are the only ones necessary to prove the correctness of the general parallel algorithm. However, these restrictions may also limit the domains from which $a_i$ and $b_i$ and the variables $x_i$ can be chosen. For most normal arithmetic operations like $+$ or $\times$, this is no problem, but more exotic operations such as floor, ceiling, modulo division, etc., should be checked carefully.

The semiassociative property of $g$ forces $h$ to behave as if it were associative. In particular, we have

$$
\begin{aligned}
g(h(h(a, b), c), d) &= g(h(a, b), g(c, d)) \\
&= g(a, g(b, g(c, d))) \\
&= g(h(a, h(b, c)), d).
\end{aligned}
$$

Hence, iterated compositions of $h$ when used as the first argument of the function $g$ can be evaluated as if $h$ were associative without altering the output value of $g$. In all interesting practical problems discovered thus far, the function $h$ is associative.

C. Parallel Algorithm

The principle of recursive doubling can be applied in a natural way to any recurrence equation that satisfies the restrictions of Section II-B. In fact, the resulting general algorithm bears a very strong resemblance to the example of Section II-A. Before giving the algorithm, however, we first give two definitions.

Definition: For any function $q$ of two arguments define the generalized composition of $q$ as $q_{j=n}^{(m)}(a_j)$, where

$$
\begin{aligned}
q_{j=n}^{(n)}(a_j) &= a_n, && \text{for } n \ge 1 \\
q_{j=n}^{(m)}(a_j) &= q\bigl(a_m, q_{j=n}^{(m-1)}(a_j)\bigr), && \text{for } m > n \ge 1 \\
&= q(a_m, q(a_{m-1}, \ldots q(a_{n+2}, q(a_{n+1}, a_n)) \ldots )).
\end{aligned}
$$

If we let $q(a, b) = a + b$ (scalar addition), then

$$q_{j=n}^{(m)}(a_j) = (a_m + (a_{m-1} + \cdots + (a_{n+2} + (a_{n+1} + a_n)) \cdots )) = \sum_{j=n}^{m} a_j.$$

Likewise, if $q(a, b) = a \cdot b$ (scalar multiplication), then

$$q_{j=n}^{(m)}(a_j) = \prod_{j=n}^{m} a_j.$$

Definition: Define $Q(m, n)$ as

$$Q(m, n) = f_{j=n}^{(m)}\Bigl( g\bigl( h_{r=j+1}^{(m)}(a_r), b_j \bigr) \Bigr)$$

where we define

$$g\bigl( h_{r=m+1}^{(m)}(a_r), b_m \bigr) = b_m.$$

If we consider the case where $f$ is scalar addition, and $g$ and $h$ are scalar multiplication, then the $Q(m, n)$ defined previously is exactly the same as the $\hat{Q}(m, n)$ defined for the example in Section II-A.

The similarities between $Q$ and $\hat{Q}$ carry even further. The function $Q(i, 1)$ is the solution of the general recurrence equation (6), that is,

$$x_i = Q(i, 1), \qquad 1 \le i \le N. \tag{7}$$

Also, as in the example, we can derive a formula computing $Q(2i, 1)$ strictly in terms of two equally complex subterms, namely,

$$Q(2i, 1) = f\Bigl( Q(2i, i+1), g\bigl( h_{r=i+1}^{(2i)}(a_r), Q(i, 1) \bigr) \Bigr). \tag{8}$$

Both (7) and a more general version of (8) are proved in the Appendix.

Equation (8) is a perfect candidate for recursive doubling. $Q(2i, i+1)$ and $Q(i, 1)$ are identical in terms of the number of unique $a$'s and $b$'s referenced and require the same sequence of $f$, $g$, and $h$ function calls to evaluate them. As with the second example, the only hindrance in implementing (8) directly as a recursive doubling algorithm is the correction term, the $h$ composition. However, since $h$ can be treated as an associative function, we can use a scheme similar to Fig. 2 to compute these correction terms exactly as they are needed.

Fig. 3 is a computation graph using (8) and the $h$ composition algorithm to compute $x_8$. Despite its increased complexity, the general structure of this graph is identical to Figs. 1 and 2 and can be extended to solve for all elements of the sequence $x_1, \ldots, x_N$ in parallel.

We can now state the complete algorithm for solving our general recurrence equations. The detailed proof of the correctness of this algorithm is given in the Appendix.

Algorithm A - General Algorithm: This algorithm solves for $x_1, x_2, \ldots, x_N$ where $x_i = f(b_i, g(a_i, x_{i-1}))$ and $f$ and $g$ satisfy the restrictions of Section II-B.
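The restrictions of Section II-B and the definitions of Section II-C lend themselves to a quick numerical sanity check. The following is a minimal sketch in Python, not part of the original paper: the names `satisfies_restrictions`, `compose`, and `Q` are hypothetical, the sample values are arbitrary, and testing on a finite set of integers illustrates the restrictions without proving them.

```python
from itertools import product
from operator import add, mul


def satisfies_restrictions(f, g, h, samples):
    """Test Restrictions 1-3 on every triple drawn from `samples`."""
    for x, y, z in product(samples, repeat=3):
        if f(x, f(y, z)) != f(f(x, y), z):          # Restriction 1: f associative
            return False
        if g(x, f(y, z)) != f(g(x, y), g(x, z)):    # Restriction 2: g distributes over f
            return False
        if g(x, g(y, z)) != g(h(x, y), z):          # Restriction 3: g semiassociative via h
            return False
    return True


def compose(q, terms):
    """Generalized composition of q over terms listed from index n up to m:
    q(t_m, q(t_{m-1}, ... q(t_{n+1}, t_n) ...))."""
    acc = terms[0]
    for t in terms[1:]:
        acc = q(t, acc)
    return acc


def Q(m, n, a, b, f, g, h):
    """Q(m, n) as defined in Section II-C; a and b are dicts keyed by 1-based index."""
    terms = []
    for j in range(n, m + 1):
        if j == m:
            terms.append(b[m])                       # vacuous h-composition: term is b_m
        else:
            coeff = compose(h, [a[r] for r in range(j + 1, m + 1)])
            terms.append(g(coeff, b[j]))
    return compose(f, terms)


samples = range(-2, 3)
linear_ok = satisfies_restrictions(add, mul, mul, samples)      # f = +, g = h = *
minimum_ok = satisfies_restrictions(lambda p, q: q, min, min, samples)
additive_ok = satisfies_restrictions(add, add, add, samples)    # g = + does not distribute over +

# Arbitrary illustrative coefficients for N = 8 (a_1 is never needed).
N = 8
a = {i: i + 1 for i in range(2, N + 1)}
b = {i: 2 * i - 1 for i in range(1, N + 1)}

# Serial solution of x_i = b_i + a_i * x_{i-1}.
x = {1: b[1]}
for i in range(2, N + 1):
    x[i] = add(b[i], mul(a[i], x[i - 1]))

# Equation (7): x_i = Q(i, 1).
eq7_holds = all(Q(i, 1, a, b, add, mul, mul) == x[i] for i in range(1, N + 1))

# Equation (8): Q(2i, 1) = f(Q(2i, i+1), g(h-composition of a_{i+1..2i}, Q(i, 1))).
eq8_holds = all(
    Q(2 * i, 1, a, b, add, mul, mul)
    == add(Q(2 * i, i + 1, a, b, add, mul, mul),
           mul(compose(mul, [a[r] for r in range(i + 1, 2 * i + 1)]),
               Q(i, 1, a, b, add, mul, mul)))
    for i in range(1, N // 2 + 1)
)
```

Here `additive_ok` comes out false because choosing addition for both $f$ and $g$ violates Restriction 2, which shows the kind of check suggested above for less familiar operations.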


The algorithm requires two vectors $A$ and $B$ of $N$ elements. The $i$th component of each vector, namely $A(i)$ and $B(i)$, is stored in the memory of processor $(i)$. The actual data structure required to represent $A(i)$ and $B(i)$ depends on the definition of the domain of the entities $a_i$ and $b_i$ in the basic equation (6) and may be scalars, matrices, lists, etc., depending on the problem.

Let $A^{(k)}(i)$ and $B^{(k)}(i)$ represent, respectively, the contents of $A(i)$ and $B(i)$ after the $k$th step of the following algorithm.

Initialization Step ($k = 0$):

$$B^{(0)}(i) = b_i \quad \text{for } 1 \le i \le N.$$
$$A^{(0)}(i) = a_i \quad \text{for } 1 < i \le N.$$

$A(1)$ is never referenced and may be initialized arbitrarily.

Recursion Steps: For $k = 1, 2, \ldots, \lceil \log_2 N \rceil$ do each of the following assignment statements:

$$B^{(k)}(i) = f\bigl( B^{(k-1)}(i), g( A^{(k-1)}(i), B^{(k-1)}(i - 2^{k-1}) ) \bigr) \quad \text{for } 2^{k-1} < i \le N. \tag{9}$$

$$A^{(k)}(i) = h\bigl( A^{(k-1)}(i), A^{(k-1)}(i - 2^{k-1}) \bigr) \quad \text{for } 2^{k-1} + 1 < i \le N. \tag{10}$$

Each statement is assumed to be evaluated simultaneously by all processors whose indices lie in the specified interval. After the $\lceil \log_2 N \rceil$th step, $B(i)$ contains $x_i$ for $1 \le i \le N$.

End of Algorithm A.

[Fig. 3. Parallel computation of $x_8$ from the general recurrence equation.]

Several things should be noted about any implementation of Algorithm A. First, when the $i$th processor executes (9) and (10) in that order, it must have the old values of $B(i - 2^{k-1})$ and $A(i - 2^{k-1})$, which can only be obtained from processor $(i - 2^{k-1})$. Thus at the beginning of the $k$th recursion step, all processors must shift their values of $A$ and $B$ to the processors with index $2^{k-1}$ greater than their own. Exactly how this data routing is performed depends on the processor interconnection pattern available in a given computer system.

Another problem with implementing Algorithm A lies in limiting the processors that execute (9) and (10) to just those with the proper indices. The masking feature (Section I-B) is the most direct way. This, however, requires executing explicit mask instructions during each recurrence step. If the number of available processors is greater than about $3N/2$, another method can avoid these extra instructions. The $N$ processors with the highest indices are allocated to the solving of $x_1, \ldots, x_N$, and the next $N/2$ processors are initialized so that when one of the top $N$ processors references their data, the values returned cause no change in the higher processor's values for $A(i)$ and $B(i)$. These bottom $N/2$ processors are completely masked off initially so that these initial values never change. These initial values are

$$A(i) = I, \quad \text{for } -N/2 \le i \le 1$$
$$B(i) = Z, \quad \text{for } -N/2 \le i \le 0$$

where for all $a$ and $b$, $h(a, I) = a$ and $f(b, g(a, Z)) = b$. For the example of Section II-A, $I$ is simply 1, and $Z$ is 0.

III. APPLICATIONS

A. Various First-Order Problems

As has been mentioned before, Algorithm A is applicable to a rather wide class of problems. Table II gives a collection of such problems that satisfy the functional constraints stated in earlier sections.

An interesting case occurs when we constrain all the $a_i$ of Example 1 in Table II to be the same number $z$ as indicated in Example 6 in Table II. We then get the recursion

$$x_i = z x_{i-1} + b_i$$

which, if we solve for $x_N$, yields

$$x_N = b_1 z^{N-1} + b_2 z^{N-2} + \cdots + b_{N-1} z + b_N.$$

But this is simply the evaluation of the polynomial $b_1 x^{N-1} + \cdots + b_N$ at $x = z$. In fact, Algorithm A in this case is simply the parallel evaluation of polynomials (Munro and Paterson [3]).

B. Extension to mth-Order Equations

The algorithm given in the previous sections is applicable to a class of first-order recurrence equations. However, a little manipulation of the description of a problem can often convert an mth-order recurrence equation into a first-order equation with a slightly more complicated data structure. The clue to how this is done can be found in the third example in Table II, a matrix or "state variable" problem.

As an example, consider the problem

$$x_i = a_{i,1} x_{i-1} + \cdots + a_{i,m} x_{i-m} + b_i. \tag{11}$$

We wish to reformulate it in a form amenable to Algorithm A.

The first step is to see that we can collapse the $m$ $x$'s that are needed in (11) into a single new "variable" by using state variables as follows.

Let

$$Z_i = \begin{bmatrix} x_i \\ x_{i-1} \\ \vdots \\ x_{i-m+1} \end{bmatrix}. \tag{12}$$

Now we can rewrite (11) as


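TABLE II
APPLICATIONS OF ALGORITHM A

| Example | Domain of b | Domain of a | Domain of x | f(a, b) | g(a, b) | h(a, b) | Comments |
|---|---|---|---|---|---|---|---|
| 1 | real numbers | real numbers | real numbers | $a + b$ | $a \cdot b$ | $a \cdot b$ | $x_{i+1} = b_{i+1} + a_{i+1} x_i$ |
| 2 | real numbers | real numbers | real numbers | $a \cdot b$ | $b \uparrow a$ | $a \cdot b$ | $x_{i+1} = b_{i+1} (x_i \uparrow a_{i+1})$, "$\uparrow$" is exponentiation |
| 3 | $m \times 1$ matrix | $m \times m$ matrix | $m \times 1$ matrix | vector addition | mult. of matrix by vector | matrix mult. | $x_{i+1} = B_{i+1} + A_{i+1} x_i$ where $A$ is $m \times m$ and $x$, $B$ are $m \times 1$ |
| 4 | real numbers | real numbers | real numbers | $b$ | $\min(a, b)$ | $\min(a, b)$ | $x_i$ is the smallest of $a_1, \ldots, a_i$ |
| 5 | real numbers | real numbers | real numbers | $b$ | $\max(a, b)$ | $\max(a, b)$ | $x_i$ is the largest of $a_1, \ldots, a_i$ |
| 6 | real number | any real number $z$ | real number | $a + b$ | $a \cdot b$ | $a \cdot b$ | $x_i = x_{i-1} \cdot z + b_i$, polynomial evaluation: $x_N = P(z) = b_1 z^{N-1} + b_2 z^{N-2} + \cdots + b_{N-1} z + b_N$ |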
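The rows of Table II can be tried out with a small sequential simulation of recursion steps (9) and (10). The sketch below is only an illustration under stated assumptions: the names `algorithm_a`, `serial_solve`, `second`, and `power` are hypothetical, the inner loop over `i` stands in for processors that would run simultaneously on a real machine, and the data values are arbitrary.

```python
from itertools import accumulate
from operator import add, mul


def serial_solve(a, b, f, g):
    """Reference solution of x_1 = b_1, x_i = f(b_i, g(a_i, x_{i-1}))."""
    x = [b[0]]
    for i in range(1, len(b)):
        x.append(f(b[i], g(a[i], x[-1])))
    return x


def algorithm_a(a, b, f, g, h):
    """Simulate steps (9) and (10); list position i - 1 holds processor i's data."""
    n = len(b)
    A, B = list(a), list(b)
    steps = (n - 1).bit_length()          # equals ceil(log2(n)) for n >= 1
    for k in range(1, steps + 1):
        d = 2 ** (k - 1)
        old_A, old_B = A[:], B[:]         # every processor reads step k-1 values
        for i in range(1, n + 1):         # 1-based processor index
            if i > d:                     # statement (9)
                B[i - 1] = f(old_B[i - 1], g(old_A[i - 1], old_B[i - 1 - d]))
            if i > d + 1:                 # statement (10)
                A[i - 1] = h(old_A[i - 1], old_A[i - 1 - d])
    return B


def second(a, b):
    """f(a, b) = b, as in Examples 4 and 5."""
    return b


def power(a, b):
    """g(a, b) = b raised to the power a, as in Example 2."""
    return b ** a


# Example 1: x_i = b_i + a_i * x_{i-1}. a[0] stands for A(1), which is never referenced.
a = [None, 3, 1, 2, 5, 2, 4, 1]
b = [2, 1, 4, 3, 1, 2, 5, 1]
linear = algorithm_a(a, b, add, mul, mul)

# Example 2: x_i = b_i * (x_{i-1} ** a_i), using small positive integers.
a_pow = [None, 2, 1, 3]
b_pow = [2, 3, 1, 2]
powers = algorithm_a(a_pow, b_pow, mul, power, mul)

# Example 4: running minimum of `values`. Setting b_1 = a_1 is an assumption made here
# so that x_1 is the first value; the other b_i are ignored by f.
values = [7, 5, 9, 3, 8, 3, 1, 6]
minima = algorithm_a([None] + values[1:], [values[0]] + [0] * 7, second, min, min)

# Example 6: evaluate P(z) = 2 z^3 + 0 z^2 + z + 5 at z = 3.
z = 3
coeffs = [2, 0, 1, 5]
poly_value = algorithm_a([None] + [z] * 3, coeffs, add, mul, mul)[-1]
```

Copying `A` and `B` before each step mirrors the requirement that processor $i$ sees the old values held by processor $i - 2^{k-1}$; updating in place instead would mix values from steps $k-1$ and $k$.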

$$
Z_i = \begin{bmatrix} x_i \\ x_{i-1} \\ \vdots \\ x_{i-m+1} \end{bmatrix}
= \begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,m-1} & a_{i,m} \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & \ddots & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{bmatrix}
\begin{bmatrix} x_{i-1} \\ x_{i-2} \\ \vdots \\ x_{i-m} \end{bmatrix}
+ \begin{bmatrix} b_i \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{13}
$$

$$= \hat{A}_i Z_{i-1} + \hat{B}_i \tag{14}$$

where $\hat{A}_i$ and $\hat{B}_i$ are the $m \times m$ matrix and $m \times 1$ vector respectively. The first row of $\hat{A}_i$ represents the original (11) and the remaining rows simply select the proper $x_j$ to make $Z_i$ be consistent.

Equation (14), however, is in exactly the right format for Example 3 of Table II to be applied. The variables in the recursion are $m$-element vectors, the $\hat{A}_i$ are $m \times m$ matrices, and the $\hat{B}_i$ are $m$-element vectors. The function $f$ is vector addition, $g$ is multiplication of a matrix by a vector, and $h$ is matrix multiplication. Thus if we rewrite (11) into (14) we can apply Algorithm A to get a parallel solution to the original problem (11).

This particular formulation, however, is not very efficient in its use of the parallel processors. At the end of the calculation we have $N$ $m$-element vectors $Z_1, \ldots, Z_N$. Only one $m$th of each $Z_i$, namely its first component $x_i$, represents new calculations not available from previous $Z$'s. Most of the matrix calculations done in the recurrence steps are redundant.

We can increase the amount of parallelism in the problem by propagating (14) forward $m$ steps before using Algorithm A. This results in a new formulation of the problem, which yields $Z_{(k+1)m} = (x_{(k+1)m}, \ldots, x_{km+1})'$ directly from $Z_{km} = (x_{km}, \ldots, x_{(k-1)m+1})'$.

It is easy to show by induction that $Z_{km+m}$ can be computed as follows:

$$Z_{km+m} = \Bigl( \prod_{j=km+1}^{km+m} \hat{A}_j \Bigr) Z_{km} + \sum_{r=km+1}^{km+m} \Bigl( \prod_{j=r+1}^{km+m} \hat{A}_j \Bigr) \hat{B}_r, \qquad k = 1, \ldots, N/m - 1. \tag{15}$$

This equation can be restated in a form directly usable by Algorithm A as follows.

Let

$$
X_{k+1} = Z_{(k+1)m} = \begin{bmatrix} x_{(k+1)m} \\ \vdots \\ x_{km+1} \end{bmatrix}, \qquad
A_{k+1} = \prod_{j=km+1}^{(k+1)m} \hat{A}_j, \qquad
B_{k+1} = \sum_{r=km+1}^{(k+1)m} \Bigl( \prod_{j=r+1}^{(k+1)m} \hat{A}_j \Bigr) \hat{B}_r. \tag{16}
$$

Now (15) becomes

$$X_{k+1} = A_{k+1} X_k + B_{k+1}, \qquad k = 1, \ldots, N/m \tag{17}$$

which again is our familiar first-order linear matrix recurrence equation.

Now to compute all $N$ elements of (11), we need only compute $N/m$ elements of the series $X_1, \ldots, X_{N/m}$ using (17). Using Algorithm A we can compute these $N/m$ elements with $\lceil \log_2 N/m \rceil$ applications of the recurrence step, plus some initial time to compute the initial $A$'s and $B$'s given by (16). Further, since there are only $N/m$ elements to compute, Algorithm A also calls for only $N/m$ processors.

The important aspect of this reformulation is not that the number of steps has been reduced, but that the number of processors has dropped. Equation (17) takes $\log_2 m$ fewer recurrence iterations to evaluate than does (14), but about $\log_2 m$ additional iterations are required to set up (17) from (14) with $N$ processors. Thus we have not reduced the time to solve the problem, but we have reduced redundant computations to the point where we need only $N/m$ processors after the initial setup.

IV. SUMMARY AND CONCLUSION

Various researchers have developed parallel algorithms for specific problems, such as polynomial evaluation (Munro and Paterson [3]), and the solution of tridiagonal systems of equations (Buneman [1], Buzbee et al. [2], and Stone [4]). As with Algorithm A, these algorithms typically require execution times proportional to $\lceil \log_2 N \rceil$. None of them, however, is applicable to any wider class of problems than the particular ones they were designed to solve. Algorithm A, on the other hand, solves any problem for which the solution can be stated in terms of a recurrence equation satisfying a few simple restrictions. It is worthwhile mentioning that the running time for Algorithm A can vary widely from problem to problem


even though the time is always proportional to $\lceil \log_2 N \rceil$. The constant of proportionality depends on the time it takes to evaluate $f$, $g$, and $h$. These functions can be as simple as a magnitude comparison or floating-point addition and can be as complex as a matrix multiplication, with very large differences in their respective constants of proportionality.

The power of Algorithm A comes from the generalization of the technique of recursive doubling. This technique seems to hold an important key to understanding exactly how parallelism can be extracted from what appear to be highly serial problems. The major results of this paper indicate that the class of serially stated problems that are amenable to parallel solutions is a large one, and includes some problems that have been thought to be poorly suited to parallel processors.

APPENDIX
VALIDITY OF ALGORITHM A

This Appendix contains some basic theorems that establish the validity of Algorithm A. We assume we are solving equations of the form of (6), where the functions $f$, $g$, and $h$ all satisfy the restrictions of Section II-B. We also assume that the concept of generalized composition and the definition of the function $Q(m, n)$ carry over from Section II-C.

Theorem 1: For any $i$, $k$ such that $1 \le k < i \le N$ then for any $j$ such that $1 \le j \le k$

$$Q(i, i-k) = f\Bigl( Q(i, i-j+1), g\bigl( h_{r=i-j+1}^{(i)}(a_r), Q(i-j, i-k) \bigr) \Bigr).$$

Proof: Assume $1 \le j \le k$. Then

$$
\begin{aligned}
& f\Bigl( Q(i, i-j+1), g\bigl( h_{r=i-j+1}^{(i)}(a_r), Q(i-j, i-k) \bigr) \Bigr) \\
&= f\Bigl( Q(i, i-j+1), g\bigl( h_{r=i-j+1}^{(i)}(a_r), f_{m=i-k}^{(i-j)}( g( h_{r=m+1}^{(i-j)}(a_r), b_m ) ) \bigr) \Bigr) && \text{[by definition of } Q(i-j, i-k)\text{]} \\
&= f\Bigl( Q(i, i-j+1), f_{m=i-k}^{(i-j)}\bigl( g( h_{r=i-j+1}^{(i)}(a_r), g( h_{r=m+1}^{(i-j)}(a_r), b_m ) ) \bigr) \Bigr) && \text{[} g \text{ distributes over } f\text{]} \\
&= f\Bigl( Q(i, i-j+1), f_{m=i-k}^{(i-j)}\bigl( g( h_{r=m+1}^{(i)}(a_r), b_m ) \bigr) \Bigr) && \text{[} g \text{ is semiassociative]} \\
&= f\Bigl( f_{m=i-j+1}^{(i)}\bigl( g( h_{r=m+1}^{(i)}(a_r), b_m ) \bigr), f_{m=i-k}^{(i-j)}\bigl( g( h_{r=m+1}^{(i)}(a_r), b_m ) \bigr) \Bigr) && \text{[definition of } Q(i, i-j+1)\text{]} \\
&= f_{m=i-k}^{(i)}\bigl( g( h_{r=m+1}^{(i)}(a_r), b_m ) \bigr) && \text{[associativity of } f\text{]} \\
&= Q(i, i-k).
\end{aligned}
$$

End of proof.

Theorem 2: For $1 \le i \le N$, $x_i = Q(i, 1)$.

Proof: By induction on $i$.

Basis Step: $i = 1$.

$$Q(1, 1) = b_1 = x_1 \quad \text{[by definition]}.$$

Induction Step: Assume $Q(j, 1) = x_j$ for $j < i$. Then

$$
\begin{aligned}
x_i &= f(b_i, g(a_i, x_{i-1})) && \text{[recurrence equation (6)]} \\
&= f\bigl( Q(i, i), g( h_{r=i}^{(i)}(a_r), Q(i-1, 1) ) \bigr) && \text{[inductive hypothesis, definition of } Q \text{ and } h\text{]} \\
&= Q(i, 1) && \text{[Theorem 1 with } j = 1,\ k = i - 1\text{]}.
\end{aligned}
$$

End of proof.

We can now state a theorem that demonstrates the validity of Algorithm A.

Theorem 3: For all $1 \le i \le N$, $0 \le k \le \lceil \log_2 N \rceil$,

$$
\text{a)} \quad A^{(k)}(i) =
\begin{cases}
h_{r=2}^{(i)}(a_r), & 1 < i \le 2^k + 1 \\
h_{r=i-2^k+1}^{(i)}(a_r), & 2^k + 1 < i \le N
\end{cases}
$$

$$
\text{b)} \quad B^{(k)}(i) =
\begin{cases}
Q(i, 1), & 1 \le i \le 2^k \\
Q(i, i - 2^k + 1), & 2^k < i \le N.
\end{cases}
$$

Proof: Directly by induction and Theorems 1 and 2. The proof of Theorem 3 is direct but tedious; we omit it here.

Using part b) of Theorem 3 we get the immediate result.

Corollary: After the $\lceil \log_2 N \rceil$th iteration of Algorithm A, $B(i)$ contains $x_i$ for $1 \le i \le N$.

Thus we have shown that not only does Algorithm A compute the solution $x_1, \ldots, x_N$ to (6), but also that it terminates in exactly $\lceil \log_2 N \rceil$ iterations.

ACKNOWLEDGMENT

Recursive doubling solutions to the first-order problem of Section II-A were discovered independently by H. R. Downs of Systems Control, Inc., and H. Lomax of NASA Ames Research Center. Recursive doubling solutions to second-order linear recurrences have been known to J. J. Sylvester as early as 1853. The authors wish to thank D. Knuth for pointing out Sylvester's work and for several stimulating suggestions while this research was in progress and the referees for pointing out that the $h$ function need not be associative.

After this paper had been reviewed and accepted for publication, the authors encountered the report by Trout [5], which has several similar results. His work was done independently of the work reported here and carries the research beyond the limits of this paper.

REFERENCES

[1] O. Buneman, "A compact non-iterative Poisson solver," Stanford Univ. Inst. Plasma Res., Stanford, Calif., Rep. 294, 1969.
[2] B. L. Buzbee, G. H. Golub, and C. W. Nelson, "On direct methods for solving Poisson's equations," SIAM J. Numer. Anal., vol. 7, pp. 627-656, Dec. 1970.
[3] I. Munro and M. Paterson, "Optimal algorithms for parallel polynomial evaluation," in Conf. Rec., 1971 12th Annu. Symp. Switching and Automata Theory, IEEE Publ. 71 C 45-C, pp. 132-139.
[4] H. S. Stone, "An efficient parallel algorithm for the solution of a tridiagonal linear system of equations," J. Ass. Comput. Mach., vol. 20, pp. 27-38, Jan. 1973.
[5] H. R. G. Trout, "Parallel techniques," Dep. Comput. Sci., Univ. Illinois, Urbana, Rep. UIUCDCS-R-72-549, Oct. 1972.

Peter M. Kogge (S'65-M'68) was born in Washington, D.C., on December 3, 1946. He received the B.S.E.E. degree from the University of Notre Dame, Notre Dame, Ind., in 1968, the M.S. degree in systems and information sciences from Syracuse University, Syracuse, N.Y., in [...] electrical engineering [...]
1970, degree
and Ph.D.

from Stanford University, Stanford,


composition] Calif., in 1973.
Since 1968 he has been with the Federal Sys-
= Q(i, 1) [by Theorem I]. tems Division, IBM Corporation, and is pres-

Authorized licensed use limited to: Uskudar Universitesi. Downloaded on January 30,2023 at 11:34:18 UTC from IEEE Xplore. Restrictions apply.
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-22, NO. 8, AUGUST 1973 793

ently in the Systems Architecture Department, Owego, N.Y., where he sign of advanced computer systems, with particular emphasis on the
is involved with the definition of advanced computer architecture and introduction of parallelism into computer design and problem solutions.
organizations. From 1970 to 1972 he was at Stanford University under
an IBM Resident Fellowship. Prior to that time he was involved with
the design and organization of a large multiprocessor system for aero- Harold S. Stone (S'61-M'63), for a photograph and biography, see this
space applications. His present interests include the definition and de- issue, p. 710.
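As an illustration of the Appendix's corollary for the Kogge and Stone paper, the following is a minimal Python sketch of a recursive doubling loop in the spirit of Algorithm A. It is not the authors' listing. The array names A and B follow the B(i) of the corollary; the function names `solve_sequential` and `solve_recursive_doubling`, the 0-based indexing, and the serial simulation of the parallel step are assumptions made for this example. The functions f, g, and h are supplied by the caller and are assumed to satisfy the restrictions of Section II-B.

```python
import operator


def solve_sequential(a, b, f, g):
    """Reference solution: x_1 = b_1, x_i = f(b_i, g(a_i, x_{i-1}))."""
    x = [b[0]]                       # a[0] is never used
    for i in range(1, len(b)):
        x.append(f(b[i], g(a[i], x[i - 1])))
    return x


def solve_recursive_doubling(a, b, f, g, h):
    """Simulate recursive doubling serially; returns (solution, iterations).

    Assumes f is associative, g distributes over f, and
    g(x, g(y, z)) == g(h(x, y), z) (semiassociativity).
    """
    n = len(b)
    A, B = list(a), list(b)
    step, iterations = 1, 0
    while step < n:
        # Every update reads the old arrays and writes new ones, so the
        # inner loop could run on all i at once on a parallel machine.
        new_A, new_B = list(A), list(B)
        for i in range(step, n):
            new_B[i] = f(B[i], g(A[i], B[i - step]))
            if i - step >= 1:        # skip the unused coefficient a[0]
                new_A[i] = h(A[i], A[i - step])
        A, B = new_A, new_B
        step *= 2
        iterations += 1
    return B, iterations


if __name__ == "__main__":
    # First-order linear recurrence x_i = a_i * x_{i-1} + b_i.
    a = [0, 2, 3, 1, 2]
    b = [1, 1, 0, 4, 5]
    x, its = solve_recursive_doubling(a, b, operator.add, operator.mul, operator.mul)
    assert x == solve_sequential(a, b, operator.add, operator.mul)
    print(x, its)
```

With f as addition and g, h as multiplication this solves the first-order linear recurrence; passing f = max with g = h = addition gives another instance that meets the same restrictions. The loop count equals the ceiling of log2 N, in agreement with the corollary.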

A Fast Poisson Solver Amenable to Parallel Computation

BILLY L. BUZBEE

Abstract-The matrix decomposition Poisson solver is developed for the five-point difference approximation to Poisson's equation on a rectangle. This algorithm's suitability for parallel computation, its simplicity, its performance relative to successive overrelaxation, and its generality are then discussed.

Index Terms-Linear algebra, numerical solution of PDE's, Poisson equation.

Manuscript received October 7, 1972; revised March 21, 1973. This work was done under the auspices of the U.S. Atomic Energy Commission.
The author is with the Los Alamos Scientific Laboratory, University of California, Los Alamos, N. Mex. 87544.

FAST Poisson solvers have evolved during the last ten years [1]-[3], and they consist of noniterative techniques for solving finite difference approximations to Poisson's equation on a rectangle. These techniques are significant because of their efficiency. For example, the Buneman-Poisson solver will usually solve the discrete equation in [...]th to [...]th of the time required by successive overrelaxation (SOR). In this paper we will show that one of these techniques, the matrix decomposition Poisson solver (MD), offers tremendous opportunity for parallel computation. Although the MD algorithm is quite general, we will only develop it for the five-point difference approximation to Poisson's equation on a rectangle with a uniform mesh. This approach will exhibit the programming details of the algorithm and emphasize its simplicity. However, a summary of the various generalizations of MD and a machine comparison of MD with SOR are included.

Consider a finite difference approximation to

$$-\nabla^2 u = f \tag{1}$$

in a rectangle R with u = g(x, y) on the boundary $\partial R$. We will assume N discretizations in the horizontal direction and M discretizations in the vertical direction. See Fig. 1.

Fig. 1. N × M uniform rectangular mesh (axis marks: h, 2h, ..., Nh horizontally; 2k, ..., Mk vertically).

Let $u_{ij} = u(ih, jk)$ and approximate (1) by the five-point difference equation, that is,

$$(-\nabla^2 u)_{ij} \approx \frac{1}{h^2}\bigl[2u_{ij} - (u_{i+1,j} + u_{i-1,j})\bigr] + \frac{1}{k^2}\bigl[2u_{ij} - (u_{i,j+1} + u_{i,j-1})\bigr]. \tag{2}$$

The difference equations for the points along the line x = h may be written

$$
\begin{bmatrix}
\delta u_{11} - \rho^2 u_{12} \\
-\rho^2 u_{11} + \delta u_{12} - \rho^2 u_{13} \\
\vdots \\
-\rho^2 u_{1,M-1} + \delta u_{1,M}
\end{bmatrix}
+
\begin{bmatrix}
-u_{21} \\
-u_{22} \\
\vdots \\
-u_{2M}
\end{bmatrix}
=
\begin{bmatrix}
h^2 f_{11} + \rho^2 g_{10} + g_{01} \\
h^2 f_{12} + g_{02} \\
\vdots \\
h^2 f_{1M} + g_{0M} + \rho^2 g_{1,M+1}
\end{bmatrix}
\tag{3}
$$

where $\rho = h/k$ and $\delta = 2(1 + \rho^2)$. If we let $u_i$ be the M-dimensional vector
