
Performance Analysis of Partition Algorithms for Parallel Solution of
Nonlinear Systems of Equations

Geng Yang(1), Chunming Rong(2)
(1) Department of Computer Science and Technology,
Nanjing University of Posts and Telecommunications, Nanjing, 210003, China
Email: [email protected]
(2) Department of Computer Science, Stavanger University College, N-4068 Stavanger, Norway
Email: chunming.rong@tn.his.no, Web: http://www.ux.his.no/~chun

Abstract: In this paper, we discuss the performance of partition algorithms for the parallel solution of large-scale nonlinear systems of equations. We first describe a block Broyden algorithm for solving a nonlinear system, in which a diagonal matrix is used as the iterative matrix. Then, we analyze the parallelism of the algorithm and discuss different partitioning schemes in detail. Finally, we give some numerical results and analyze the performance of the partitioning schemes. The numerical results show that the algorithms combining the block Broyden method with partitioning techniques are effective, and that they can be used in the large-scale problems arising from scientific and engineering computing.

Keywords: parallel computation, block partitioning, supercomputing, nonlinear systems

I. Introduction

For algorithms solving large-scale nonlinear systems we face two challenges. One is to reduce the storage requirement. Because the dimension of the system is usually very large, we have to develop an algorithm with a low memory requirement, making it usable in practice. The other is parallel performance. The algorithm should have high parallel performance in order to obtain numerical solutions in less CPU time or in real time. In some practical applications, such as computational fluid dynamics or numerical weather forecasting, it makes no sense for an algorithm to take a lot of CPU time to achieve solutions.

Several algorithms have been developed up to now, such as Newton methods and quasi-Newton methods [1-3]. But because of storage requirements or parallel performance, only a few of them are used in practice for large-scale systems. A nonlinear GMRES(m) (Generalized Minimal RESidual) algorithm was developed in [1], which has been widely used recently. The algorithm has some advantages in both memory need and parallelization. After that, a block Broyden (BB) algorithm was proposed in [4]. This simple algorithm needs little storage and has good parallel performance. The complexity and performance of the algorithm were discussed in [5] and some alternative versions of the algorithm were described in [6,7]. The BB algorithm is easy to implement as a parallel program, because it contains only matrix-vector products. Moreover, for a nonlinear system of dimension N, the BB algorithm needs only N^2/q memory storage, instead of N^2, where q is the number of blocks. In practice, an optimal choice for N/q is about 25 [4]. Therefore, the BB algorithm yields significant storage savings. This issue plays a very important role in the scientific computing of large-scale systems. However, the convergence performance is not as good as that of the GMRES(m) method, and the block partitioning algorithm influences the convergence performance. Therefore, in order to obtain an optimal BB algorithm and balance the tasks among processors, it is necessary to study partitioning algorithms and analyze their influence on the BB algorithm.

This paper is organized as follows. In Section 2, a general parallelizable BB method is described, including some specific remarks concerning the BB-GMRES(m) method. Section 3 is devoted to block partition strategies, where several partitioning schemes are discussed. Numerical experiments are presented in Section 4. In particular, the relative importance of several parameters on the convergence of the BB-GMRES(m) method is numerically studied. Finally, some concluding remarks are given in Section 5.

II. Block Broyden Algorithm

The BB algorithms meet the two criteria mentioned in Section 1 [4-5]. We describe only the algorithm here (see [4-5] for details).

For a nonlinear system F(x) = 0, where F(x) is a function from R^N to R^N, the BB algorithm is defined as follows:
1. For some given x^0 and B^0, calculate the residual r^0 = F(x^0).
2. For k = 0, 1, ..., until convergence, do:
   2.1 Solve the linear system B^k s^k = -r^k in parallel.
   2.2 Calculate in parallel x^(k+1) = x^k + s^k.
   2.3 Calculate the residual r^(k+1) = F(x^(k+1)); if it is accurate enough, stop. Otherwise, update B^(k+1) using (1)-(4) in parallel, set k = k+1, and go to step 2.1.

This work was supported by the National Supercomputing Foundation of China under grant No. 9927.
0-7803-7840-7/03/$17.00 (c) 2003 IEEE.
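The N^2/q storage claim can be checked with a little arithmetic. A minimal sketch, using illustrative values N = 2500 and q = 100 (chosen so that N/q = 25, the optimum suggested in [4]; they are not tied to a specific experiment):

```python
# Quick check of the N^2/q storage claim for the block-diagonal matrix B.
# N and q are illustrative values, not taken from a specific experiment.
N = 2500           # dimension of the nonlinear system
q = 100            # number of diagonal blocks, chosen so that N/q = 25
block = N // q     # unknowns per block

full_storage = N * N                # a full N x N matrix: 6,250,000 entries
block_storage = q * block * block   # q blocks of size (N/q) x (N/q): N^2/q entries

print(full_storage, block_storage)  # 6250000 62500
```

Storing only the q diagonal blocks thus divides the matrix storage by exactly q.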

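Steps 1 through 2.3 above can be sketched in a few lines. This is a minimal illustration, not the authors' FORTRAN code: the classical Broyden rank-one update per block stands in for formulas (1)-(4), which are detailed in [4-5], and a dense direct solve per block stands in for the parallel block solves.

```python
import numpy as np

def block_broyden(F, x0, blocks, B0, tol=1e-6, max_iter=100):
    """Sketch of the block Broyden (BB) iteration of Section II.
    F: residual function R^N -> R^N; blocks: list of index arrays
    partitioning {0,...,N-1}; B0: list of initial diagonal blocks B_i."""
    x = x0.copy()
    B = [Bi.copy() for Bi in B0]
    r = F(x)                               # step 1: initial residual
    for _ in range(max_iter):
        if np.linalg.norm(r) < tol:
            break
        s = np.empty_like(x)
        # Step 2.1: solve B^k s^k = -r^k; the blocks are independent,
        # so this loop is the part that parallelizes.
        for idx, Bi in zip(blocks, B):
            s[idx] = np.linalg.solve(Bi, -r[idx])
        x = x + s                          # step 2.2
        r_new = F(x)                       # step 2.3: new residual
        # Per-block Broyden rank-one update (stand-in for (1)-(4)).
        for idx, Bi in zip(blocks, B):
            si, yi = s[idx], r_new[idx] - r[idx]
            Bi += np.outer(yi - Bi @ si, si) / (si @ si)
        r = r_new
    return x
```

On a small separable test system such as F(x) = x*x - a, with the diagonal of the Jacobian as B^0 (as in Section IV), the sketch converges to the componentwise roots.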
Here x and F(x) are divided into q blocks: F(x) = (F_1, ..., F_q)^T, x = (x_1, ..., x_q)^T. Let n_i be the number of equations in the ith block. The updating formulas (1)-(3) for the blocks B_i^k are the per-block Broyden updates detailed in [4-5], and the iterative matrix keeps the block-diagonal form

    B^k = diag(B_1^k, ..., B_q^k).    (4)

III. Some Block Partitioning Strategies

Given an N x N matrix A = {A_ij} with a symmetric nonzero structure, it is possible to define an induced undirected graph G(A) = <V, E> without loops or multiple edges, or more simply, a graph. The set V has N vertices (a_1, ..., a_N) and E is a collection of unordered pairs of elements of V such that <a_i, a_j> is in E if i != j and A_ij != 0. Let q be a given positive integer. Using a graph partitioning algorithm, V can be decomposed into q disjoint subsets V_i of comparable size N_i, i.e. such that each V_i has about N/q vertices (i = 1, ..., q).

For completeness and to facilitate comparisons, we will study five partitioning schemes: linear partitioning, random partitioning, scattered partitioning, MPABLO partitioning (a Modification of the PArameterized BLock Ordering), and multilevel partitioning.

A. Linear Partitioning

In the linear scheme, vertices are assigned in order to blocks in accordance with their numbering in the original graph, i.e. the first N/q vertices are assigned to block V_1, the next N/q vertices to block V_2, and so on. This algorithm is simple and often produces surprisingly good results, because data locality is implicit in the numbering of the graph. For a graph with N vertices and NE edges, the algorithm has O(N) complexity.

B. Scattered Partitioning

In the scattered scheme, vertices are dealt out in card fashion to the q sets in the order they are numbered. Note that neighboring vertices are assigned to different blocks. Hence, it often gives bad performance in parallel computing for systems arising from real problems, such as fluid dynamic computing by a finite element method. The scattered scheme also runs in time O(N).

C. Random Partitioning

In the random scheme, vertices are assigned randomly to the q sets. Because of the randomness it is difficult to control the matrix B; in other words, the matrix B may be different each time. Usually the random ordering produces partitions with a quality between those of the linear and scattered partitioning algorithms. The complexity of the random scheme is O(N).

D. MPABLO Partitioning

A modification of the PABLO algorithm (PArameterized BLock Ordering) was used here [8], in order to obtain diagonal blocks of comparable size. The crux of this method is to choose groups of nodes in G(A), the graph induced by the matrix A, so that the corresponding diagonal blocks are either full or very dense. The algorithm runs in time O(N + NE).

E. Multilevel Partitioning

In the multilevel algorithm used in this work [9], the graph is approximated by a sequence of increasingly smaller graphs. The smallest graph is then partitioned into power-of-two sets, using a spectral method, and this partition is propagated back through the hierarchy of graphs. A variant of the Kernighan-Lin algorithm is applied periodically to refine the partition. A C version of this partitioning scheme is included in a software package called Chaco, which was kindly offered to us by the authors for research purposes. The cost of constructing the coarse graphs and of the local improvement algorithms is proportional to the number of edges in the graph.

IV. Numerical Results

All numerical tests in this section were run on a shared-memory Power Challenge XL with 8 R8000 CPUs, 1.5 GB RAM, 64 bits. All computations used 2 processors, except where mentioned. The programs were written in FORTRAN 77 using double precision floating point numbers with an automatic parallel environment. In the following discussion, the CPU time for a parallel case is the maximum CPU time among the processors over all parallel computations, including communication time. It does not include the computational cost of the partitioning strategies. This preliminary step is performed only once, and its influence on the total computational work is insignificant, particularly for large-scale scientific and engineering computing.

The nonlinear partial differential equation for the Bratu problem can be written as [5]:

    -Laplacian(u) + u_x + lambda * e^u = f,    u = 0 on the boundary of Omega,    (x, y) in Omega = [0,1] x [0,1].

We take f = e, lambda = 1, and divide the two sides of the domain Omega into N-1 subintervals. We use five-point and forward difference schemes to discretize the second- and first-order derivatives. We then obtain a nonlinear system of dimension N^2. We will consider two grids, M1 and M2, with N = 30 and 50 respectively; the dimensions of the nonlinear system are 900 and 2500. The iteration stops when the residual norm |r| is small enough. We use the LU elimination method for the linear systems [5], which gives the nonlinear solver BB-LU. The initial approximation x^0 is zero and B^0 is the Jacobian matrix.
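Before comparing the schemes experimentally, the three simplest ones from Section III can be sketched directly; MPABLO [8] and multilevel [9] partitioning need the graph G(A) and full implementations, so they are omitted. A minimal illustration (vertex labels 0..N-1 and the helper names are ours, not from the paper):

```python
import random

# Sketches of the linear, scattered and random partitioning schemes of
# Section III. Each returns a list of blocks of about N/q vertices.

def linear_partition(N, q):
    """Consecutive runs of about N/q vertices each; O(N)."""
    size = (N + q - 1) // q
    return [list(range(i, min(i + size, N))) for i in range(0, N, size)]

def scattered_partition(N, q):
    """Deal vertices out in card fashion: vertex v goes to set v mod q,
    so neighboring vertices land in different blocks."""
    return [list(range(i, N, q)) for i in range(q)]

def random_partition(N, q, seed=0):
    """Shuffle, then cut into consecutive runs; the resulting matrix B
    differs from run to run unless the seed is fixed."""
    rng = random.Random(seed)
    v = list(range(N))
    rng.shuffle(v)
    size = (N + q - 1) // q
    return [v[i:i + size] for i in range(0, N, size)]

print(linear_partition(8, 4))     # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(scattered_partition(8, 4))  # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

The tiny example already shows why the scattered scheme loses coupling information: on a chain graph 0-1-2-...-7, every scattered block contains no adjacent pair of vertices, while every linear block does.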
Five block partitioning schemes are considered: multilevel, MPABLO, linear, random and scattered. Figures 1 and 2 show convergence histories (residual norm versus number of nonlinear iterations) of BB-LU for the block structures generated by the different partitioning schemes, on grids M1 and M2. The number of blocks is q = 64 on M1 and q = 128 on M2.

Fig. 2 Convergence behavior for partitioning schemes with M2

The figures show that the MPABLO and multilevel schemes perform very well. BB-LU takes about 1.13 times more iterations with the MPABLO scheme than with the multilevel scheme on both grids, because those two schemes make partitions without losing the coupling information between unknowns. The linear scheme gives a satisfactory partition in this case, since the nodes in a block are neighboring nodes in the geometry and have a close relationship. The behavior of the random scheme lies between those of the linear and the scattered ones. The scattered scheme is the worst; we observe some oscillation phenomena, because it destroys the coupling information between unknowns. Thus, it can be worthwhile to select an appropriate partitioning scheme (which is used only once, as a preliminary step of the computations), in order to make significant CPU time savings. Moreover, because of the power-of-two partitioning limitation of the multilevel scheme, it is not convenient to use in practice. We propose to use the MPABLO scheme, since it gives almost the same performance as the multilevel scheme.

Figure 3 presents the influence of the number q of diagonal blocks on the convergence of BB-LU; the partitioning scheme used is MPABLO. Figure 3 shows that BB-LU converges even for large q. For example, for grid M1 and q = 300, each block contains only about 3 unknowns. Therefore, the memory requirement for storing the BB matrix can be adjusted according to the resources of the computer. For large-scale scientific computation, such as biological or fluid dynamic computation, the dimension N of the system is very large, and it is impossible to store an entire matrix B. That is also the reason to seek an optimal partition scheme and to use parallel computing. Figure 4 shows the CPU time used by BB-LU, with 2 processors, for different choices of q. In this case, the CPU time increases with the number of blocks, because the number of iterations increases too.

Fig. 4 Influence of block numbers on CPU times

Table 1 gives the parallel performance of the BB-LU algorithm. The CPU times are normalized by those obtained with only one processor. The accelerate rate is the sequential CPU time divided by the parallel CPU time, and the efficiency is the accelerate rate divided by the number of processors. Table 1 shows that the BB-LU algorithm is effective. With 8 processors, it gives a speedup of more than 5 over one processor, and the efficiency values are about 0.625 and 0.658 for grids M1 and M2 respectively. This is an important advantage for large-scale scientific computing with parallel computers.

Table 1. Parallel performance of BB-LU (CPU times normalized by the one-processor time)

                       Grid M1                          Grid M2
  Processors   CPU time  Acc. rate  Effici.    CPU time  Acc. rate  Effici.
  1            1.00      1.00       1.000      1.00      1.00       1.000
  2            0.66      1.52       0.758      0.64      1.56       0.781
  4            0.35      2.86       0.714      0.35      2.86       0.715
  8            0.20      5.00       0.625      0.19      5.26       0.658
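The accelerate rate and efficiency defined above are simple ratios; a quick check against the reported 8-processor values (normalized CPU times 0.20 and 0.19 for grids M1 and M2):

```python
# Accelerate rate (speedup) = sequential CPU time / parallel CPU time;
# efficiency = accelerate rate / number of processors (Section IV, Table 1).
def accelerate_rate(t_seq, t_par):
    return t_seq / t_par

def efficiency(t_seq, t_par, p):
    return accelerate_rate(t_seq, t_par) / p

# 8-processor values reported for grids M1 and M2 (times normalized to 1 CPU).
print(round(accelerate_rate(1.00, 0.20), 2))  # 5.0
print(round(efficiency(1.00, 0.20, 8), 3))    # 0.625
print(round(efficiency(1.00, 0.19, 8), 3))    # 0.658
```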
V. Conclusion

In this paper a parallelizable block Broyden method for solving nonlinear systems is presented. By combining it with iterative or direct linear solvers, it is possible to obtain a family of nonlinear solvers. The algorithm parallelizes well, allowing a speedup of more than 5 on an 8-processor system. It was successfully used to solve nonlinear systems arising from the Bratu problem.

Several partitioning algorithms have been evaluated. The partitions of the blocks influence the convergence performance of the BB methods. The linear scheme is very simple and sometimes gives good performance. The scattered scheme is the worst among the five schemes, because the nodes in a block have little relative information about each other. The multilevel scheme and the MPABLO scheme give almost the same performance; however, because of the power-of-two partitioning limitation of the multilevel scheme, we prefer the MPABLO scheme for combination with the BB algorithm. It is clear that the BB method needs less memory storage and has good parallel convergence performance. Therefore, it can be used to solve large nonlinear systems of equations, in particular systems arising from real engineering problems.

For future work, it is intended to apply parallelizable preconditioning techniques to the BB methods, in order to accelerate their convergence and increase their robustness. Moreover, it would be interesting to apply the BB method to real problems arising from engineering domains.

References
[1] P. Brown and Y. Saad, Hybrid Krylov Methods for Nonlinear Systems of Equations, SIAM J. Scientific and Statistical Computing, Vol. 11, No. 3, pp. 450-481, May 1990.
[2] Z. Z. Bai, A Class of Asynchronous Parallel Iterations for the Systems of Nonlinear Algebraic Equations, Computers & Mathematics with Applications, Vol. 39, No. 7, pp. 81-94, April 2000.
[3] E. Babolian and J. Biazar, Solution of Nonlinear Equations by Modified Adomian Decomposition Method, Applied Mathematics and Computation, Vol. 132, No. 1, pp. 167-172, October 2002.
[4] G. Yang, Analysis of Parallel Algorithms for Solving Nonlinear Systems of Equations, Chinese Journal of Computers, Vol. 23, No. 10, pp. 1035-1039, October 2000 (in Chinese).
[5] G. Yang, S. D. Wang and R. C. Wang, Analysis of Parallel Algorithms for Solving Nonlinear Systems of Equations Based on Dawning 1000A Computers, Chinese Journal of Computers, Vol. 25, No. 4, pp. 397-402, April 2002 (in Chinese).
[6] J. J. Xu, Convergence of Partially Asynchronous Block Quasi-Newton Methods for Nonlinear Systems of Equations, Journal of Computational and Applied Mathematics, Vol. 103, No. 2, pp. 307-321, March 1999.
[7] Y. R. Chen and D. Y. Cai, Inexact Overlapped Block Broyden Methods for Solving Nonlinear Equations, Applied Mathematics and Computation, Vol. 136, No. 2, pp. 215-228, April 2002.
[8] J. O'Neil and D. B. Szyld, A Block Ordering Method for Sparse Matrices, SIAM J. Sci. Stat. Comput., Vol. 11, pp. 811-823, July 1990.
[9] B. Hendrickson and T. Kolda, Partitioning Rectangular and Structurally Unsymmetric Sparse Matrices for Parallel Processing, SIAM J. Scientific Computing, Vol. 21, No. 6, pp. 2048-2072, June 2000.
