1973 A Parallel Algorithm For The Efficient Solution of A General Class of Recurrence Equations
Abstract-An $m$th-order recurrence problem is defined as the computation of the series $x_1, x_2, \ldots, x_N$, where $x_i = f_i(x_{i-1}, \ldots, x_{i-m})$ for some function $f_i$. This paper uses a technique called recursive doubling in an algorithm for solving a large class of recurrence problems on parallel computers such as the Illiac IV.
Recursive doubling involves the splitting of the computation of a function into two equally complex subfunctions whose evaluation can be performed simultaneously in two separate processors. Successive splitting of each of these subfunctions spreads the computation over more processors.
This algorithm can be applied to any recurrence equation of the form $x_i = f(b_i, g(a_i, x_{i-1}))$ where $f$ and $g$ are functions that satisfy certain distributive and associative-like properties. Although this recurrence is first order, all linear $m$th-order recurrence equations can be cast into this form. Suitable applications include linear recurrence equations, polynomial evaluation, several nonlinear problems, the determination of the maximum or minimum of $N$ numbers, and the solution of tridiagonal linear equations. The resulting algorithm computes the entire series $x_1, \ldots, x_N$ in time proportional to $\lceil \log_2 N \rceil$ on a computer with $N$-fold parallelism. On a serial computer, computation time is proportional to $N$.
Index Terms-Parallel algorithms, parallel computation, recurrence problems, recursive doubling.
Manuscript received October 7, 1972; revised March 21, 1973. This work was supported in part by NSF Grant GJ 1180 and in part by an IBM Corporation fellowship.
P. M. Kogge was with the Department of Electrical Engineering, Digital Systems Laboratory, Stanford University, Stanford, Calif. He is now with the Systems Architecture Department, IBM Corporation, Owego, N.Y. 13827.
H. S. Stone is with the Department of Electrical Engineering and the Department of Computer Science, Digital Systems Laboratory, Stanford University, Stanford, Calif.
INTRODUCTION
A. Definition of Problem
It frequently occurs in applied mathematics that the solution to some problem is a sequence $x_1, x_2, \ldots, x_N$, where each $x_i$ is a function of the previous $m$ $x$'s, namely $x_{i-1}, \ldots, x_{i-m}$. A common example of such a problem is a time-varying linear system, where the state of the system at time $i$ is $x_i$, and can be computed from the equations
Authorized licensed use limited to: Uskudar Universitesi. Downloaded on January 30,2023 at 11:34:18 UTC from IEEE Xplore. Restrictions apply.
KOGGE AND STONE: PARALLEL ALGORITHM 787
788 IEEE TRANSACTIONS ON COMPUTERS, AUGUST 1973
In general

$$Q(2i, 1) = \left( \prod_{r=i+1}^{2i} a_r \right) Q(i, 1) + Q(2i, i+1).$$
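The doubling identity can be checked numerically. The following is a minimal sketch, assuming the linear example $x_i = a_i x_{i-1} + b_i$ with $x_1 = b_1$ and taking $Q(m, n) = \sum_{j=n}^{m} \left( \prod_{r=j+1}^{m} a_r \right) b_j$, which is what the general definition of $Q$ reduces to when $f$ is addition and $g$ and $h$ are multiplication. The coefficient values and the helper name `Q` are illustrative choices, not taken from the paper.

```python
from math import prod

def Q(a, b, m, n):
    """Q(m, n) for x_i = a_i * x_{i-1} + b_i, using 1-based indices."""
    return sum(prod(a[r] for r in range(j + 1, m + 1)) * b[j]
               for j in range(n, m + 1))

# Hypothetical coefficients; index 0 is padding so a[i], b[i] match a_i, b_i.
# a[1] is never referenced because x_1 = b_1.
a = [0, 0, 2, 3, 4, 5, 6, 7, 8]
b = [0, 1, 1, 2, 3, 5, 8, 13, 21]

# Serial reference solution x_1, ..., x_8.
x = [None, b[1]]
for i in range(2, 9):
    x.append(a[i] * x[i - 1] + b[i])

# Q(2i, 1) splits into two subterms of equal size plus a product of a's.
for i in (1, 2, 4):
    correction = prod(a[r] for r in range(i + 1, 2 * i + 1))
    doubled = correction * Q(a, b, i, 1) + Q(a, b, 2 * i, i + 1)
    assert doubled == Q(a, b, 2 * i, 1) == x[2 * i]
```

The two subterms $Q(i, 1)$ and $Q(2i, i+1)$ each touch $i$ consecutive coefficients, so they can be evaluated at the same time on separate processors.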
where $b_i$ and $a_i$ are arbitrary constants and $f$ and $g$ are index-independent functions that satisfy the following restrictions.
Restriction 1: $f$ is associative: $f(x, f(y, z)) = f(f(x, y), z)$.

$$= g(h(a, h(b, c)), d).$$

Hence, iterated compositions of $h$ when used as the first argument of the function $g$ can be evaluated as if $h$ were associative without altering the output value of $g$. In all interesting practical problems discovered thus far, the function $h$ is associative.

TABLE I

| Processor | T=0: A(i) | T=0: B(i) | T=1: A(i) | T=1: B(i) | T=2: A(i) | T=2: B(i) | T=3: A(i) | T=3: B(i) |
|---|---|---|---|---|---|---|---|---|
| 1 | ** | $b_1 = x_1 = Q(1,1)$ | ** | $x_1 = Q(1,1)$ | ** | $x_1 = Q(1,1)$ | ** | $x_1$ |
| 2 | $a_2$ | $b_2 = Q(2,2)$ | $a_2$ | $a_2 x_1 + b_2 = x_2 = Q(2,1)$ | $a_2$ | $x_2 = Q(2,1)$ | $a_2$ | $x_2$ |
| 3 | $a_3$ | $b_3 = Q(3,3)$ | $a_3 a_2$ | $a_3 b_2 + b_3 = Q(3,2)$ | $a_3 a_2$ | $(a_3 a_2) x_1 + (a_3 b_2 + b_3) = x_3 = Q(3,1)$ | $a_3 a_2$ | $x_3$ |
| 4 | $a_4$ | $b_4 = Q(4,4)$ | $a_4 a_3$ | $a_4 b_3 + b_4 = Q(4,3)$ | $a_4 a_3 a_2$ | $(a_4 a_3) x_2 + (a_4 b_3 + b_4) = x_4 = Q(4,1)$ | $a_4 a_3 a_2$ | $x_4$ |
| 5 | $a_5$ | $b_5 = Q(5,5)$ | $a_5 a_4$ | $a_5 b_4 + b_5 = Q(5,4)$ | $a_5 a_4 a_3 a_2$ | $a_5 a_4 (a_3 b_2 + b_3) + a_5 b_4 + b_5 = Q(5,2)$ | $a_5 a_4 a_3 a_2$ | $x_5$ |

C. Parallel Algorithm
The principle of recursive doubling can be applied in a natural way to any recurrence equation that satisfies the restrictions of Section II-B. In fact, the resulting general algorithm bears a very strong resemblance to the example of Section II-A. Before giving the algorithm, however, we first give two definitions.
Definition: For any function $q$ of two arguments define the generalized composition of $q$ as $q^{(m)}_{j=n}(a_j)$, where

$$q^{(n)}_{j=n}(a_j) = a_n, \quad \text{for } n \ge 1$$

$$q^{(m)}_{j=n}(a_j) = q\left(a_m, q^{(m-1)}_{j=n}(a_j)\right), \quad \text{for } m > n \ge 1$$

$$= q(a_m, q(a_{m-1}, \ldots q(a_{n+2}, q(a_{n+1}, a_n)) \ldots)).$$

If we let $q(a, b) = a + b$ (scalar addition), then

$$q^{(m)}_{j=n}(a_j) = (a_m + (a_{m-1} + \cdots + (a_{n+2} + (a_{n+1} + a_n)) \cdots)) = \sum_{j=n}^{m} a_j.$$

Likewise, if $q(a, b) = a \cdot b$ (scalar multiplication), then

$$q^{(m)}_{j=n}(a_j) = \prod_{j=n}^{m} a_j.$$

Definition: Define $Q(m, n)$ as

$$Q(m, n) = f^{(m)}_{j=n}\left( g\left( h^{(m)}_{r=j+1}(a_r), b_j \right) \right)$$

where we define

$$g\left( h^{(m)}_{r=m+1}(a_r), b_m \right) = b_m.$$

If we consider the case where $f$ is scalar addition, and $g$ and $h$ are scalar multiplication, then the $Q(m, n)$ defined previously is exactly the same as the $Q(m, n)$ defined for the example in Section II-A.
The similarities between the two definitions of $Q$ carry even further. The function $Q(i, 1)$ is the solution of the general recurrence equation (6), that is,

$$x_i = Q(i, 1), \quad 1 \le i \le N. \tag{7}$$

Also, as in the example, we can derive a formula computing $Q(2i, 1)$ strictly in terms of two equally complex subterms, namely,

$$Q(2i, 1) = f\left( Q(2i, i+1), g\left( h^{(2i)}_{r=i+1}(a_r), Q(i, 1) \right) \right). \tag{8}$$

Both (7) and a more general version of (8) are proved in the Appendix.
Equation (8) is a perfect candidate for recursive doubling. $Q(2i, i+1)$ and $Q(i, 1)$ are identical in terms of the number of unique $a$'s and $b$'s referenced and require the same sequence of $f$, $g$, and $h$ function calls to evaluate them. As with the second example, the only hindrance in implementing (8) directly as a recursive doubling algorithm is the correction term, the $h$ composition. However, since $h$ can be treated as an associative function, we can use a scheme similar to Fig. 2 to compute these correction terms exactly as they are needed.
Fig. 3 is a computation graph using (8) and the $h$ composition algorithm to compute $x_8$. Despite its increased complexity, the general structure of this graph is identical to Figs. 1 and 2 and can be extended to solve for all elements of the sequence $x_1, \ldots, x_N$ in parallel.
We can now state the complete algorithm for solving our general recurrence equations. The detailed proof of the correctness of this algorithm is given in the Appendix.
Algorithm A - General Algorithm: This algorithm solves for $x_1, x_2, \ldots, x_N$ where $x_i = f(b_i, g(a_i, x_{i-1}))$ and $f$ and $g$ satisfy the restrictions of Section II-B.
Fig. 3. Parallel computation of $x_8$ from the general recurrence equation.

The algorithm requires two vectors $A$ and $B$ of $N$ elements. The $i$th component of each vector, namely $A(i)$ and $B(i)$, is stored in the memory of processor $(i)$. The actual data structure required to represent $A(i)$ and $B(i)$ depends on the definition of the domain of the entities $a_i$ and $b_i$ in the basic equation (8) and may be scalars, matrices, lists, etc., depending on the problem.
Let $A^{(k)}(i)$ and $B^{(k)}(i)$ represent, respectively, the contents of $A(i)$ and $B(i)$ after the $k$th step of the following algorithm.
Initialization Step ($k = 0$):

$$B^{(0)}(i) = b_i \quad \text{for } 1 \le i \le N.$$

$$A^{(0)}(i) = a_i \quad \text{for } 1 < i \le N.$$

$A(1)$ is never referenced and may be initialized arbitrarily.
Recursion Steps: For $k = 1, 2, \ldots, \lceil \log_2 N \rceil$ do each of the following assignment statements:

$$B^{(k)}(i) = f\left( B^{(k-1)}(i), g\left( A^{(k-1)}(i), B^{(k-1)}(i - 2^{k-1}) \right) \right) \quad \text{for } 2^{k-1} < i \le N. \tag{9}$$

$$A^{(k)}(i) = h\left( A^{(k-1)}(i), A^{(k-1)}(i - 2^{k-1}) \right) \quad \text{for } 2^{k-1} + 1 < i \le N. \tag{10}$$

Each statement is assumed to be evaluated simultaneously by all processors whose indices lie in the specified interval. After the $\lceil \log_2 N \rceil$th step, $B(i)$ contains $x_i$ for $1 \le i \le N$.
End of Algorithm A.
Several things should be noted about any implementation of Algorithm A. First, when the $i$th processor executes (9) and (10) in that order, it must have the old values of $B(i - 2^{k-1})$ and $A(i - 2^{k-1})$, which can only be obtained from processor $(i - 2^{k-1})$. Thus at the beginning of the $k$th recursion step, all processors must shift their values of $A$ and $B$ to the processors with index $2^{k-1}$ greater than their own. Exactly how this data routing is performed depends on the processor interconnection pattern available in a given computer system.
Another problem with implementing Algorithm A lies in limiting the processors that execute (9) and (10) to just those with the proper indices. The masking feature (Section I-B) is the most direct way. This, however, requires executing explicit mask instructions during each recurrence step. If the number of available processors is greater than about $3N/2$, another method can avoid these extra instructions. The $N$ processors with the highest indices are allocated to the solving of $x_1, \ldots, x_N$, and the next $N/2$ processors are initialized so that when one of the top $N$ processors references their data, the values returned cause no change in the higher processor's values for $A(i)$ and $B(i)$. These bottom $N/2$ processors are completely masked off initially so that these initial values never change. These initial values are

$$A(i) = I, \quad \text{for } -N/2 \le i \le 1$$

$$B(i) = Z, \quad \text{for } -N/2 \le i \le 0$$

where for all $a$ and $b$, $h(a, I) = a$ and $f(b, g(a, Z)) = b$. For the example of Section II-A, $I$ is simply 1, and $Z$ is 0.

III. APPLICATIONS
A. Various First-Order Problems
As has been mentioned before, Algorithm A is applicable to a rather wide class of problems. Table II gives a collection of such problems that satisfy the functional constraints stated in earlier sections.
An interesting case occurs when we constrain all the $a_i$'s of Example 1 in Table II to be the same number $z$ as indicated in Example 5 in Table II. We then get the recursion

$$x_i = z x_{i-1} + b_i$$

which, if we solve for $x_N$, yields

$$x_N = b_1 z^{N-1} + b_2 z^{N-2} + \cdots + b_{N-1} z + b_N.$$

But this is simply the evaluation of the polynomial $b_1 x^{N-1} + \cdots + b_N$ at $x = z$. In fact, Algorithm A in this case is simply the parallel evaluation of polynomials (Munro and Paterson [3]).

B. Extension to mth-Order Equations
The algorithm given in the previous sections is applicable to a class of first-order recurrence equations. However, a little manipulation of the description of a problem can often convert an $m$th-order recurrence equation into a first-order equation with a slightly more complicated data structure. The clue to how this is done can be found in the third example in Table II, a matrix or "state variable" problem.
As an example, consider the problem

$$x_i = a_{i,1} x_{i-1} + \cdots + a_{i,m} x_{i-m} + b_i. \tag{11}$$

We wish to reformulate it in a form amenable to Algorithm A.
The first step is to see that we can collapse the $m$ $x$'s that are needed in (11) into a single new "variable" by using state variable notation as follows. Let

$$z_i = \begin{bmatrix} x_i \\ \vdots \\ x_{i-m+1} \end{bmatrix}. \tag{12}$$

Now we can rewrite (11) as
TABLE II
APPLICATIONS OF ALGORITHM A
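$$z_i = \begin{bmatrix} a_{i,1} & a_{i,2} & \cdots & a_{i,m-1} & a_{i,m} \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} z_{i-1} + \begin{bmatrix} b_i \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{13}$$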
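Written this way, the $m$th-order problem has the first-order form $x_i = f(b_i, g(a_i, x_{i-1}))$ that Algorithm A accepts, with $f$ as vector addition, $g$ as matrix-vector multiplication, and $h$ as matrix multiplication. The following is a minimal serial simulation of recursion steps (9) and (10), offered as a sketch rather than as the authors' implementation. The function names, the tuple-based matrices, and the sample coefficients are assumptions made for illustration; the inner loops stand in for what the parallel machine would do in a single step.

```python
from operator import add, mul

def algorithm_a(a, b, f, g, h):
    """Simulate Algorithm A; a and b are 1-based (index 0 is padding)."""
    n = len(b) - 1
    A, B = list(a), list(b)
    steps = (n - 1).bit_length()          # equals ceil(log2 n) for n >= 1
    for k in range(1, steps + 1):
        d = 2 ** (k - 1)
        old_a, old_b = list(A), list(B)   # every processor reads old values
        for i in range(d + 1, n + 1):     # (9): 2^(k-1) < i <= N
            B[i] = f(old_b[i], g(old_a[i], old_b[i - d]))
        for i in range(d + 2, n + 1):     # (10): 2^(k-1) + 1 < i <= N
            A[i] = h(old_a[i], old_a[i - d])
    return B[1:]

def vec_add(u, v):
    return tuple(p + q for p, q in zip(u, v))

def mat_vec(m, v):
    return tuple(sum(p * q for p, q in zip(row, v)) for row in m)

def mat_mul(m, p):
    cols = list(zip(*p))
    return tuple(tuple(sum(s * t for s, t in zip(row, col)) for col in cols)
                 for row in m)

# Scalar first-order case: x_i = a_i * x_{i-1} + b_i with x_1 = b_1.
scalar = algorithm_a([None, None, 2, 3, 4, 5], [None, 1, 1, 1, 1, 1],
                     add, mul, mul)       # -> [1, 3, 10, 41, 206]

# Second-order case x_i = x_{i-1} + x_{i-2}, cast as z_i = C z_{i-1} + 0
# with z_i = (x_i, x_{i-1}) and a hypothetical starting state z_1 = (1, 1).
C = ((1, 1), (1, 0))
a = [None, None] + [C] * 7
b = [None, (1, 1)] + [(0, 0)] * 7
fib = [z[0] for z in algorithm_a(a, b, vec_add, mat_vec, mat_mul)]
# fib -> [1, 2, 3, 5, 8, 13, 21, 34]
```

Copying `A` and `B` before each step mirrors the requirement that a processor must see the old values held by processor $i - 2^{k-1}$. Only the three functions passed in change between the scalar and the matrix case.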
even though the time is always proportional to $\lceil \log_2 N \rceil$. The constant of proportionality depends on the time it takes to evaluate $f$, $g$, and $h$. These functions can be as simple as a magnitude comparison or floating-point addition and can be as complex as a matrix multiplication, with very large differences in their respective constants of proportionality.
The power of Algorithm A comes from the generalization of the technique of recursive doubling. This technique seems to hold an important key to understanding exactly how parallelism can be extracted from what appear to be highly serial problems. The major results of this paper indicate that the class of serially stated problems that are amenable to parallel solutions is a large one, and includes some problems that have been thought to be poorly suited to parallel processors.

APPENDIX
VALIDITY OF ALGORITHM A
This Appendix contains some basic theorems that establish the validity of Algorithm A. We assume we are solving equations of the form of (6), where the functions $f$, $g$, and $h$ all satisfy the restrictions of Section II-B. We also assume that the concept of generalized composition and the definition of the function $Q(m, n)$ carry over from Section II-C.
Theorem 1: For any $i$, $k$ such that $1 \le k < i \le N$ then for any $j$ such that $1 \le j \le k$

$$Q(i, i-k) = f\left( Q(i, i-j+1), g\left( h^{(i)}_{r=i-j+1}(a_r), Q(i-j, i-k) \right) \right).$$

Proof: Assume $1 \le j \le k$. Then

$$f\left( Q(i, i-j+1), g\left( h^{(i)}_{r=i-j+1}(a_r), Q(i-j, i-k) \right) \right)$$

$$= f\left( Q(i, i-j+1), g\left( h^{(i)}_{r=i-j+1}(a_r), f^{(i-j)}_{m=i-k}\left( g\left( h^{(i-j)}_{r=m+1}(a_r), b_m \right) \right) \right) \right)$$
[by definition of $Q(i-j, i-k)$]

$$= f\left( Q(i, i-j+1), f^{(i-j)}_{m=i-k}\left( g\left( h^{(i)}_{r=i-j+1}(a_r), g\left( h^{(i-j)}_{r=m+1}(a_r), b_m \right) \right) \right) \right)$$
[$g$ distributes over $f$]

$$= f\left( Q(i, i-j+1), f^{(i-j)}_{m=i-k}\left( g\left( h^{(i)}_{r=m+1}(a_r), b_m \right) \right) \right)$$
[$g$ is semiassociative]

$$= f\left( f^{(i)}_{m=i-j+1}\left( g\left( h^{(i)}_{r=m+1}(a_r), b_m \right) \right), f^{(i-j)}_{m=i-k}\left( g\left( h^{(i)}_{r=m+1}(a_r), b_m \right) \right) \right)$$
[definition of $Q(i, i-j+1)$]

$$= f^{(i)}_{m=i-k}\left( g\left( h^{(i)}_{r=m+1}(a_r), b_m \right) \right)$$
[associativity of $f$]

$$= Q(i, i-k).$$

End of proof.
Theorem 2: For $1 \le i \le N$, $x_i = Q(i, 1)$.
Proof: By induction on $i$.
Basis Step: $i = 1$.

$$Q(1, 1) = b_1 = x_1 \quad \text{[by definition]}.$$

Induction Step: Assume $Q(j, 1) = x_j$ for $j < i$. Then

$$x_i = f(b_i, g(a_i, x_{i-1})) \quad \text{[recurrence equation (6)]}$$

$$= f\left( Q(i, i), g\left( h^{(i)}_{r=i}(a_r), Q(i-1, 1) \right) \right)$$

$$= Q(i, 1) \quad \text{[Theorem 1]}.$$

End of proof.
We can now state a theorem that demonstrates the validity of Algorithm A.
Theorem 3: For all $1 \le i \le N$, $0 \le k \le \lceil \log_2 N \rceil$,

$$\text{a)} \quad A^{(k)}(i) = \begin{cases} h^{(i)}_{r=2}(a_r), & 1 < i \le 2^k + 1 \\ h^{(i)}_{r=i-2^k+1}(a_r), & 2^k + 1 < i \le N \end{cases}$$

$$\text{b)} \quad B^{(k)}(i) = \begin{cases} Q(i, 1), & 1 \le i \le 2^k \\ Q(i, i - 2^k + 1), & 2^k < i \le N. \end{cases}$$

Proof: Directly by induction and Theorems 1 and 2. The proof of Theorem 3 is direct but tedious; we omit it here.
Using part b) of Theorem 3 we get the immediate result.
Corollary: After the $\lceil \log_2 N \rceil$th iteration of Algorithm A, $B(i)$ contains $x_i$ for $1 \le i \le N$.
Thus we have shown that not only does Algorithm A compute the solution $x_1, \ldots, x_N$ to (6), but also that it terminates in exactly $\lceil \log_2 N \rceil$ iterations.

ACKNOWLEDGMENT
Recursive doubling solutions to the first-order problem of Section II-A were discovered independently by H. R. Downs of Systems Control, Inc., and H. Lomax of NASA Ames Research Center. Recursive doubling solutions to second-order linear recurrences have been known to J. J. Sylvester as early as 1853. The authors wish to thank D. Knuth for pointing out Sylvester's work and for several stimulating suggestions while this research was in progress and the referees for pointing out that the $h$ function need not be associative.
After this paper had been reviewed and accepted for publication, the authors encountered the report by Trout [5], which has several similar results. His work was done independently of the work reported here and carries the research beyond the limits of this paper.

REFERENCES
[1] O. Buneman, "A compact non-iterative Poisson solver," Stanford Univ. Inst. Plasma Res., Stanford, Calif., Rep. 294, 1969.
[2] B. L. Buzbee, G. H. Golub, and C. W. Nielson, "On direct methods for solving Poisson's equations," SIAM J. Numer. Anal., vol. 7, pp. 627-656, Dec. 1970.
[3] I. Munro and M. Paterson, "Optimal algorithms for parallel polynomial evaluation," in Conf. Rec., 1971 12th Annu. Symp. Switching and Automata Theory, IEEE Publ. 71 C 45-C, pp. 132-139.
[4] H. S. Stone, "An efficient parallel algorithm for the solution of a tridiagonal linear system of equations," J. Ass. Comput. Mach., vol. 20, pp. 27-38, Jan. 1973.
[5] H. R. G. Trout, "Parallel techniques," Dep. Comput. Sci., Univ. Illinois, Urbana, Rep. UIUCDCS-R-72-549, Oct. 1972.

Peter M. Kogge (S'65-M'68) was born in Washington, D.C., on December 3, 1946. He received the B.S.E.E. degree from the University of Notre Dame, Notre Dame, Ind., in 1968, the M.S. degree in systems and information sciences
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-22, NO. 8, AUGUST 1973 793
ently in the Systems Architecture Department, Owego, N.Y., where he is involved with the definition of advanced computer architecture and organizations. From 1970 to 1972 he was at Stanford University under an IBM Resident Fellowship. Prior to that time he was involved with the design and organization of a large multiprocessor system for aerospace applications. His present interests include the definition and design of advanced computer systems, with particular emphasis on the introduction of parallelism into computer design and problem solutions.

Harold S. Stone (S'61-M'63), for a photograph and biography, see this issue, p. 710.