
MULTILEVEL GRAPH PARTITIONING SCHEMES ∗

George Karypis and Vipin Kumar


Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
{karypis, kumar}@cs.umn.edu

Abstract – In this paper we present experiments with a class of graph partitioning algorithms that reduce the size of the graph by collapsing vertices and edges, partition the smaller graph, and then uncoarsen it to construct a partition for the original graph. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a new scheme for refining during uncoarsening that is much faster than the Kernighan-Lin refinement. We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, VLSI, and transportation. Our experiments show that our scheme consistently produces partitions that are better than those produced by spectral partitioning schemes in substantially less time (10 to 35 times faster than multilevel spectral bisection). Also, when our scheme is used to compute fill reducing orderings for sparse matrices, it substantially outperforms the widely used multiple minimum degree algorithm.

∗ This work is sponsored by the AHPCRC under the auspices of the DoA, ARL cooperative agreement number DAAH04-95-2-0003/contract number DAAH04-95-C-0008, the content of which does not necessarily reflect the position or the policy of the government, and no official endorsement should be inferred. Access to computing facilities was provided by Cray Research Inc. Related papers are available via WWW at URL: http://www.cs.umn.edu/users/kumar/papers.html

1 Introduction

Graph partitioning is an important problem that has extensive applications in many areas, including scientific computing and VLSI design. The problem is to partition the vertices of a graph into p roughly equal parts, such that the number of edges connecting vertices in different parts is minimized. For example, the solution of a sparse system of linear equations Ax = b via iterative methods on a parallel computer gives rise to a graph partitioning problem. A key step in each iteration of these methods is the multiplication of a sparse matrix and a (dense) vector. The problem of minimizing communication in this step is identical to the problem of partitioning the graph corresponding to the matrix A [26]. If parallel direct methods are used to solve a sparse system of equations, then a graph partitioning algorithm can be used to compute a fill reducing ordering that leads to a high degree of concurrency in the factorization phase [26, 9]. The multiple minimum degree ordering used almost exclusively in serial direct methods is not suitable for parallel direct methods, as it provides very little concurrency in the parallel factorization phase.

The graph partitioning problem is NP-complete. However, many algorithms have been developed that find a reasonably good partition. Spectral partitioning methods are known to produce excellent partitions for a wide class of problems, and they are used quite extensively [33, 20]. However, these methods are very expensive, since they require the computation of the eigenvector corresponding to the second smallest eigenvalue (the Fiedler vector). Execution of the spectral methods can be sped up if the Fiedler vector is computed using a multilevel algorithm [2]. This multilevel spectral bisection algorithm (MSB) usually speeds up the spectral partitioning methods by an order of magnitude without any loss in the quality of the edge-cut. However, even MSB can take a large amount of time. In particular, in parallel direct solvers, the time for computing an ordering using MSB can be several orders of magnitude higher than the time taken by the parallel factorization algorithm, and thus the ordering time can dominate the overall time to solve the problem [14].

The execution time of MSB can be further reduced by computing the Fiedler vector in parallel. The algorithm for computing the Fiedler vector is iterative, and each iteration performs a matrix-vector multiplication with a matrix whose graph is identical to the one we are trying to partition. These matrix-vector products can be performed efficiently on a parallel computer only if a good partition of the graph is available, which is precisely the problem that MSB is trying to solve in the first place. As a result, parallel implementations of spectral methods exhibit poor efficiency, since most of the time is spent performing communication [21, 1].

Another class of graph partitioning techniques uses the geometric information of the graph to find a good partition. Geometric partitioning algorithms [17, 28, 29] tend to be fast but often yield partitions that are worse than those obtained by spectral methods. Among the most prominent of these schemes is the algorithm described in [28]. This algorithm produces partitions that are provably within the bounds that exist for some special classes of graphs. However, due to the randomized nature of these algorithms, multiple trials are often required to obtain solutions that are comparable in quality to spectral methods. Multiple trials do increase the time [13], but the overall runtime is still substantially lower than the time required by the spectral methods. However, geometric graph partitioning algorithms have limited applicability because the geometric information is often not available, and in certain problem areas (e.g., linear programming) there is no geometry associated with the graph. Recently, an algorithm has been proposed to compute geometric information for graphs [4]. However, this algorithm is based on computing spectral information, which is expensive and dominates the overall time taken by the graph partitioning algorithm.

Another class of graph partitioning algorithms reduces the size of the graph (i.e., coarsens the graph) by collapsing vertices and edges, partitions the smaller graph, and then uncoarsens it to construct a partition for the original graph. These are called multilevel graph partitioning schemes [3, 5, 15, 20, 7, 30]. Some researchers investigated multilevel schemes primarily to decrease the partitioning time, at the cost of somewhat worse partition quality [30]. Recently, a number of multilevel algorithms have been proposed [3, 20, 5, 15, 7] that further refine the partition during the uncoarsening phase. These schemes tend to give good partitions at reasonable cost. In particular, the work of Hendrickson and Leland [20] showed that multilevel schemes can provide better partitions than the spectral methods at lower cost for a variety of finite element problems. Their scheme uses random maximal matching to successively coarsen the graph until it has only a few hundred vertices, partitions this small graph using spectral methods, and then uncoarsens the graph level by level, applying Kernighan-Lin refinement periodically. However, even though multilevel algorithms have been shown to be good alternatives to both spectral and geometric algorithms, there is no comprehensive study today of their effectiveness on a wide range of problems.

In this paper we experiment with various parameters of multilevel algorithms and their effect on the quality of partition and ordering. We investigate the effectiveness of many different choices for all three phases: coarsening, partition of the coarsest graph, and refinement. In particular, we present a new coarsening heuristic (called the heavy-edge heuristic) for which the size of the partition of the coarse graph is within a small factor of the size of the final partition obtained after multilevel refinement. We also present a new scheme for refining during uncoarsening that is much faster than the Kernighan-Lin refinement used in [20].

We test our scheme on a large number of graphs arising in various domains including finite element methods, linear programming, and VLSI. Our experiments show that our scheme consistently produces partitions that are better than those produced by spectral partitioning schemes in substantially less time (10 to 35 times faster than multilevel spectral bisection). Compared with the scheme of [20], our scheme is about twice as fast, and is consistently better in terms of cut size. Much of the improvement in run time comes from our faster refinement heuristic. We also used our graph partitioning scheme to compute fill reducing orderings for sparse matrices. Surprisingly, our scheme substantially outperforms the multiple minimum degree algorithm [27], which is the most commonly used method for computing fill reducing orderings of a sparse matrix.

Even though multilevel algorithms are quite fast compared with spectral methods, they can still be the bottleneck if the sparse system of equations is being solved in parallel [26, 14]. The coarsening phase of these methods is easy to parallelize [23], but the Kernighan-Lin heuristic used in the refinement phase is very difficult to speed up on parallel computers [12]. Since both the coarsening phase and the refinement phase with the Kernighan-Lin heuristic take roughly the same amount of time, the overall scheme cannot be sped up significantly. Our new faster methods for refinement reduce this bottleneck substantially. In fact, our parallel implementation [23] of this multilevel partitioning is able to achieve a speedup of as much as 56 on a 128-processor Cray T3D for moderate size problems.

2 Graph Partitioning

The k-way graph partitioning problem is defined as follows: Given a graph G = (V, E) with |V| = n, partition V into k subsets V1, V2, ..., Vk such that Vi ∩ Vj = ∅ for i ≠ j, |Vi| = n/k, and ∪i Vi = V, and the number of edges of E whose incident vertices belong to different subsets is minimized. A k-way partition of V is commonly represented by a partition vector P of length n, such that for every vertex v ∈ V, P[v] is an integer between 1 and k indicating the partition to which vertex v belongs. Given a partition P, the number of edges whose incident vertices belong to different subsets is called the edge-cut of the partition.

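As an illustration (not code from the paper), the edge-cut of a partition can be computed with a single scan of the adjacency structure. The following minimal Python sketch assumes an unweighted graph stored as a dictionary of neighbor sets and a partition vector P indexed by vertex; both representations are conveniences chosen for this example.

    # Minimal sketch: computing the edge-cut of a partition.
    def edge_cut(adj, P):
        """Count edges whose endpoints lie in different subsets."""
        cut = 0
        for u, neighbors in adj.items():
            for v in neighbors:
                if u < v and P[u] != P[v]:   # count each undirected edge once
                    cut += 1
        return cut

    # Example: a path 0-1-2-3 split into {0, 1} and {2, 3} cuts one edge.
    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(edge_cut(adj, [1, 1, 2, 2]))   # -> 1
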
The efficient implementation of many parallel algorithms usually requires the solution to a graph partitioning problem, where vertices represent computational tasks and edges represent data exchanges. A k-way partition of the computation graph can be used to assign tasks to k processors. Because the partition assigns an equal number of computational tasks to each processor, the work is balanced among the k processors, and because it minimizes the edge-cut, the communication overhead is also minimized.

Another important application of recursive bisection is to find a fill reducing ordering for sparse matrix factorization [9, 26, 16]. Algorithms of this type are generally referred to as nested dissection ordering algorithms. Nested dissection recursively splits a graph into almost equal halves by selecting a vertex separator until the desired number of partitions is obtained. The vertex separator is determined by first bisecting the graph and then computing a vertex separator from the edge separator. The vertices of the graph are numbered such that at each level of recursion, the separator vertices are numbered after the vertices in the partitions. The effectiveness and the complexity of a nested dissection scheme depend on the separator computing algorithm. In general, small separators result in low fill-in.

The k-way partition problem is most frequently solved by recursive bisection. That is, we first obtain a 2-way partition of V, and then we further subdivide each part using 2-way partitions. After log k phases, graph G is partitioned into k parts. Thus, the problem of performing a k-way partition is reduced to that of performing a sequence of 2-way partitions or bisections. Even though this scheme does not necessarily lead to an optimal partition, it is used extensively due to its simplicity [9, 16].

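The recursive scheme can be sketched as follows. This outline is illustrative only: it assumes k is a power of two and a hypothetical bisect_fn routine (such as the multilevel bisection developed in this paper) that returns the two halves of a part.

    # Illustrative sketch of k-way partitioning by recursive bisection.
    def recursive_bisection(vertices, k, bisect_fn):
        if k == 1:
            return [vertices]
        left, right = bisect_fn(vertices)        # 2-way partition of this part
        return (recursive_bisection(left, k // 2, bisect_fn) +
                recursive_bisection(right, k // 2, bisect_fn))

    # After log2(k) levels of recursion the vertex set is split into k parts.
    parts = recursive_bisection(list(range(8)), 4,
                                lambda vs: (vs[:len(vs) // 2], vs[len(vs) // 2:]))
    print(parts)   # -> [[0, 1], [2, 3], [4, 5], [6, 7]]
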
3 Multilevel Graph Bisection

The graph G can be bisected using a multilevel algorithm. The basic structure of a multilevel algorithm is very simple. The graph G is first coarsened down to a few hundred vertices, a bisection of this much smaller graph is computed, and then this partition is projected back towards the original graph (the finer graph), periodically refining the partition. Since the finer graph has more degrees of freedom, such refinements usually decrease the edge-cut.

Formally, a multilevel graph bisection algorithm works as follows: Consider a weighted graph G0 = (V0, E0), with weights both on vertices and edges. A multilevel graph bisection algorithm consists of the following three phases.

Coarsening Phase The graph G0 is transformed into a sequence of smaller graphs G1, G2, ..., Gm such that |V0| > |V1| > |V2| > ... > |Vm|.

Partitioning Phase A 2-way partition Pm of the graph Gm = (Vm, Em) is computed that partitions Vm into two parts, each containing half the vertices of G0.

Uncoarsening Phase The partition Pm of Gm is projected back to G0 by going through the intermediate partitions Pm−1, Pm−2, ..., P1, P0.

3.1 Coarsening Phase

During the coarsening phase, a sequence of smaller graphs, each with fewer vertices, is constructed. Graph coarsening can be achieved in various ways. In most coarsening schemes, a set of vertices of Gi is combined to form a single vertex of the next level coarser graph Gi+1. Let Vi^v be the set of vertices of Gi combined to form vertex v of Gi+1. We will refer to vertex v as a multinode. In order for a bisection of a coarser graph to be good with respect to the original graph, the weight of vertex v is set equal to the sum of the weights of the vertices in Vi^v. Also, in order to preserve the connectivity information in the coarser graph, the edges of v are the union of the edges of the vertices in Vi^v. In the case where more than one vertex of Vi^v contains edges to the same vertex u, the weight of the edge of v is equal to the sum of the weights of these edges. This is useful when we evaluate the quality of a partition at a coarser graph: the edge-cut of the partition in a coarser graph is equal to the edge-cut of the same partition in the finer graph.

Two main approaches have been proposed for obtaining coarser graphs. The first approach is based on finding a random matching and collapsing the matched vertices into a multinode [3, 20, 2], while the second approach is based on creating multinodes that are made of groups of vertices that are highly connected [5, 15, 7]. The latter approach is suited for graphs arising in VLSI applications, since these graphs have highly connected components. However, for graphs arising in finite element applications, most vertices have similar connectivity patterns (i.e., the degree of each vertex is fairly close to the average degree of the graph). In the rest of this section we describe the basic ideas behind coarsening using matchings.

Given a graph Gi = (Vi, Ei), a coarser graph can be obtained by collapsing adjacent vertices. Thus, the edge between two vertices is collapsed and a multinode consisting of these two vertices is created. This edge collapsing idea can be formally defined in terms of matchings. A matching of a graph is a set of edges, no two of which are incident on the same vertex. Thus, the next level coarser graph Gi+1 is constructed from Gi by finding a matching of Gi and collapsing the matched vertices into multinodes. The unmatched vertices are simply copied over to Gi+1. Since the goal of collapsing vertices using matchings is to decrease the size of the graph Gi, the matching should be of maximal size; that is, no further edge can be added to it without two of its edges sharing a vertex. Such a matching is called a maximal matching. Note that depending on how matchings are computed, the size of the maximal matching may be different.

In the remaining sections we describe four ways that we used to select maximal matchings for coarsening. The complexity of all these schemes is O(|E|).

Random Matching (RM) A maximal matching can be generated efficiently using a randomized algorithm. In our experiments we used a randomized algorithm similar to that described in [3, 20]. The random maximal matching algorithm is the following. The vertices are visited in random order. If a vertex u has not been matched yet, we randomly select one of its unmatched adjacent vertices. If such a vertex v exists, we include the edge (u, v) in the matching and mark vertices u and v as matched. If there is no unmatched adjacent vertex v, then vertex u remains unmatched in the random matching.

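A minimal sketch of this randomized procedure is shown below. The adjacency-dictionary representation and the function name random_matching are illustrative assumptions, not code from the paper.

    import random

    # Sketch of RM: visit vertices in random order and match each
    # unmatched vertex with a random unmatched neighbor.
    def random_matching(adj):
        matched = set()
        matching = []
        order = list(adj)
        random.shuffle(order)                    # random visit order
        for u in order:
            if u in matched:
                continue
            candidates = [v for v in adj[u] if v not in matched]
            if candidates:
                v = random.choice(candidates)    # random unmatched neighbor
                matching.append((u, v))
                matched.update((u, v))
            # otherwise u stays unmatched in this maximal matching
        return matching

    adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
    print(random_matching(adj))   # e.g. [(2, 3), (0, 1)]
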
Heavy Edge Matching (HEM) While performing the coarsening using random matchings, we try to minimize the number of coarsening levels in a greedy fashion. However, our overall goal is to find a bisection that minimizes the edge-cut. Consider a graph Gi = (Vi, Ei), a matching Mi that is used to coarsen Gi, and its coarser graph Gi+1 = (Vi+1, Ei+1) induced by Mi. If A is a set of edges, define W(A) to be the sum of the weights of the edges in A. It can be shown that W(Ei+1) = W(Ei) − W(Mi). Thus, the total edge weight of the coarser graph is reduced by the weight of the matching. Hence, by selecting a matching Mi that has a maximal weight, we can maximize the decrease in the edge weight of the coarser graph. Now, since the coarser graph has smaller edge weight, it is more likely to have a smaller edge-cut.

Finding a matching with maximal weight is the idea behind the heavy-edge matching. A maximal weight matching is computed using a randomized algorithm similar to that for computing a random matching described in Section 3.1. The vertices are again visited in random order. However, instead of randomly matching a vertex u with one of its adjacent unmatched vertices, we match u with the vertex v such that the weight of the edge (u, v) is maximum over all valid incident edges (the heavier edge). Note that this algorithm does not guarantee that the matching obtained has maximum weight, but our experiments have shown that it works very well.

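For illustration, the random matching sketch above adapts directly to the heavy-edge rule; the only assumed change in representation is that edge weights are stored as adj[u][v].

    import random

    # Sketch of HEM: match each unmatched vertex u with the unmatched
    # neighbor v that maximizes the weight of edge (u, v). Collapsing a
    # matching M removes exactly W(M) of total edge weight, per the
    # relation W(E_{i+1}) = W(E_i) - W(M_i) discussed above.
    def heavy_edge_matching(adj):
        matched = set()
        matching = []
        order = list(adj)
        random.shuffle(order)
        for u in order:
            if u in matched:
                continue
            candidates = {v: w for v, w in adj[u].items() if v not in matched}
            if candidates:
                v = max(candidates, key=candidates.get)   # heaviest valid edge
                matching.append((u, v))
                matched.update((u, v))
        return matching

    adj = {0: {1: 5, 2: 1}, 1: {0: 5, 2: 2}, 2: {0: 1, 1: 2, 3: 4}, 3: {2: 4}}
    print(heavy_edge_matching(adj))
    # here the weight-5 edge {0, 1} and weight-4 edge {2, 3} are always chosen
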
Light Edge Matching (LEM) Instead of minimizing the total edge weight of the coarser graph, one might try to maximize it. This is achieved by finding a matching Mi that has the smallest weight, leading to a small reduction in the edge weight of Gi+1. This is the idea behind the light-edge matching. It may seem that the light-edge matching does not perform any useful transformation during coarsening. However, the average degree of Gi+1 produced by LEM is significantly higher than that of Gi. Graphs with high average degree are easier to partition using certain heuristics such as Kernighan-Lin [3].

Heavy Clique Matching (HCM) A clique of an unweighted graph G = (V, E) is a fully connected subgraph of G. Consider a set of vertices U of V (U ⊂ V). The subgraph of G induced by U is defined as GU = (U, EU), such that EU consists of all edges (v1, v2) ∈ E such that both v1 and v2 belong to U. Looking at the cardinalities of U and EU we can determine how close U is to a clique. In particular, the ratio 2|EU|/(|U|(|U| − 1)) goes to one if U is a clique, and is small if U is far from being a clique. We refer to this ratio as edge density.

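A direct transcription of the edge-density ratio into code, for illustration (the function name and graph representation are assumptions of this sketch):

    # Sketch: edge density of a vertex subset U, using the ratio above.
    def edge_density(adj, U):
        U = set(U)
        if len(U) < 2:
            return 0.0
        internal = sum(1 for u in U for v in adj[u] if v in U and u < v)
        return 2.0 * internal / (len(U) * (len(U) - 1))

    # A triangle is a clique, so its density is 1.0.
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1}}
    print(edge_density(adj, {0, 1, 2}))   # -> 1.0
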
The heavy clique matching scheme computes a matching by collapsing vertices that have high edge density. Thus, this scheme computes a matching whose edge density is maximal. The motivation behind this scheme is that subgraphs of G0 that are cliques or almost cliques will most likely not be cut by the bisection. So, by creating multinodes that contain these subgraphs, we make it easier for the partitioning algorithm to find a good bisection. Note that this scheme tries to approximate the graph coarsening schemes that are based on finding highly connected components [5, 15, 7].

As in the previous schemes for computing the matching, we compute the heavy clique matching using a randomized algorithm. Note that HCM is very similar to the HEM scheme. The only difference is that HEM matches vertices that are connected with a heavy edge irrespective of the contracted edge-weight of the vertices, whereas HCM matches a pair of vertices if they are both connected using a heavy edge and if each of these two vertices has high contracted edge-weight.

3.2 Partitioning Phase

The second phase of a multilevel algorithm is to compute a minimum edge-cut bisection Pm of the coarse graph Gm = (Vm, Em) such that each part contains roughly half of the vertex weight of the original graph.

A partition of Gm can be obtained using various algorithms such as (a) spectral bisection [33, 2, 18], (b) geometric bisection [28] (if coordinates are available), and (c) combinatorial methods [25, 8, 9]. Since the size of the coarser graph Gm is small (i.e., |Vm| < 100), this step takes a small amount of time.

We implemented three different algorithms for partitioning the coarse graph. The first algorithm uses spectral bisection [33], and the other two use graph growing heuristics. The first graph-growing heuristic (GGP) randomly selects a vertex v and grows a region around it in a breadth-first fashion until half of the vertex-weight has been included. The second graph-growing heuristic (GGGP) also starts from a randomly selected vertex v, but it includes the vertices that lead to the smallest increase in the edge-cut. Since the quality of the partitions obtained by GGP and GGGP depends on the choice of v, a number of different partitions are computed starting from different randomly selected vertices, and the best is used as the initial partition. In the experiments in Section 4.1 we selected 10 vertices for GGP and 5 for GGGP. We found all of these partitioning schemes to produce similar partitions, with GGGP consistently performing better.

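An illustrative sketch of GGGP under simplifying assumptions (unit vertex weights, a connected graph stored as neighbor sets); the gain rule below is one straightforward reading of "smallest increase in the edge-cut":

    import random

    # Sketch of greedy graph growing: grow a region from a seed, always
    # absorbing the frontier vertex whose move increases the cut the least.
    def gggp_bisect(adj, seed=None):
        region = {seed if seed is not None else random.choice(list(adj))}
        while 2 * len(region) < len(adj):
            frontier = {v for u in region for v in adj[u] if v not in region}
            def cut_increase(v):
                ext = sum(1 for w in adj[v] if w not in region)  # new cut edges
                internal = sum(1 for w in adj[v] if w in region) # cut edges removed
                return ext - internal
            region.add(min(frontier, key=cut_increase))
        return region, set(adj) - region

    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
    print(gggp_bisect(adj, seed=0))   # -> ({0, 1, 2}, {3, 4, 5}), cutting one edge
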
3.3 Uncoarsening Phase

During the uncoarsening phase, the partition Pm of the coarser graph Gm is projected back to the original graph by going through the graphs Gm−1, Gm−2, ..., G1. Since each vertex of Gi+1 contains a distinct subset of vertices of Gi, obtaining Pi from Pi+1 is done by simply assigning the vertices collapsed to v ∈ Gi+1 the partition Pi+1[v].

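Projection is therefore a single sweep over the multinodes, as in this illustrative sketch (collapsed_to is an assumed mapping from each multinode to its constituent finer-graph vertices):

    # Sketch: projecting a partition from the coarse graph to the finer one.
    def project_partition(P_coarse, collapsed_to, n_fine):
        P_fine = [None] * n_fine
        for v, part in enumerate(P_coarse):
            for u in collapsed_to[v]:     # every constituent vertex inherits
                P_fine[u] = part          # the part of its multinode
        return P_fine

    # Multinode 0 = {0, 3}, multinode 1 = {1}, multinode 2 = {2, 4}.
    print(project_partition([1, 1, 2], [[0, 3], [1], [2, 4]], 5))
    # -> [1, 1, 2, 1, 2]
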
Even though Pi+1 is a local minimum partition of Gi+1, the projected partition Pi may not be at a local minimum with respect to Gi. Since Gi is finer, it has more degrees of freedom that can be used to improve Pi and decrease the edge-cut. Hence, it may still be possible to improve the projected partition by local refinement heuristics. For this reason, after projecting a partition, a partition refinement algorithm is used. The basic purpose of a partition refinement algorithm is to select two subsets of vertices, one from each part, such that when swapped the resulting partition has a smaller edge-cut. Specifically, if A and B are the two parts of the bisection, a refinement algorithm selects A′ ⊂ A and B′ ⊂ B such that (A \ A′) ∪ B′ and (B \ B′) ∪ A′ is a bisection with a smaller edge-cut.

A class of algorithms that tend to produce very good results are those based on the Kernighan-Lin (KL) partition algorithm [25, 6, 20]. The KL algorithm is iterative in nature. It starts with an initial partition, and in each iteration it finds subsets A′ and B′ with the above properties. If such subsets exist, it moves them to the other part, and this becomes the partition for the next iteration. The algorithm continues by repeating the entire process. If it cannot find two such subsets, the algorithm terminates.

The KL algorithm we implemented is similar to that described in [6], with certain modifications that significantly reduce the run time. The KL algorithm computes for each vertex v a quantity called the gain, which is the decrease (or increase) in the edge-cut if v is moved to the other part. The algorithm then proceeds by repeatedly selecting the vertex v with the largest gain from the larger part and moving it to the other part. After moving v, v is marked so it will not be considered again in the same iteration, and the gains of the vertices adjacent to v are updated to reflect the change in the partition. The algorithm terminates when the edge-cut does not decrease after x vertex moves. Since the last x vertex moves did not decrease the edge-cut, they are undone. The choice of x = 50 works quite well for all our graphs.

The efficient implementation of the above algorithm relies on the method used to compute the gains of successive finer graphs and on the use of an appropriate data structure to store these gains. Our algorithm computes the gains of the vertices during the projection of the partition. In doing so, it utilizes the computed gains for the vertices of the coarser graph, and it only needs to compute the gains of the vertices that are along the boundary of the partition. The data structure used to store the gains is a hash table that allows insertions, updates, and extraction of the vertex with maximum gain in constant time. Details about the implementation of the KL algorithm can be found in [22].

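For illustration, the gain of a single vertex can be computed as follows (weighted adjacency dictionary assumed; the paper's actual implementation maintains gains incrementally in a hash table rather than recomputing them from scratch):

    # Sketch: the KL gain of vertex v, i.e. the decrease in edge-cut if v
    # moves to the other part: external connectivity minus internal.
    def gain(adj, P, v):
        ext = sum(w for u, w in adj[v].items() if P[u] != P[v])
        internal = sum(w for u, w in adj[v].items() if P[u] == P[v])
        return ext - internal

    adj = {0: {1: 1, 2: 1}, 1: {0: 1, 2: 1}, 2: {0: 1, 1: 1, 3: 1}, 3: {2: 1}}
    P = [0, 0, 1, 1]
    print(gain(adj, P, 2))   # -> 1: moving vertex 2 reduces the cut by 1
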

We now describe three different refinement algorithms that are based on the KL algorithm but differ in the time they require to do the refinement.

Kernighan-Lin Refinement The idea of Kernighan-Lin refinement (KLR) is to use the projected partition of Gi+1 onto Gi as the initial partition for the Kernighan-Lin algorithm. The KL algorithm has been found to be effective in finding locally optimal partitions when it starts with a fairly good initial partition [3]. Since the projected partition is already a good partition, KL substantially decreases the edge-cut within a small number of iterations. Furthermore, since a single iteration of the KL algorithm stops as soon as x swaps are performed that do not decrease the edge-cut, the number of vertices swapped in each iteration is very small. Our experimental results show that a single iteration of KL terminates after only a small percentage of the vertices have been swapped (less than 5%), which results in significant savings in the total execution time of this refinement algorithm.

Greedy Refinement Since we terminate each pass of the KL algorithm as soon as no further improvement can be made in the edge-cut, the complexity of the KLR scheme described in the previous section is dominated by the time required to insert the vertices into the appropriate data structures. Thus, even though we significantly reduced the number of vertices that are swapped, the overall complexity does not change in asymptotic terms. Furthermore, our experience shows that the largest decrease in the edge-cut is obtained during the first pass. In the greedy refinement algorithm (GR), we take advantage of that by running only a single iteration of the KL algorithm [3]. This usually reduces the total time taken by refinement by a factor of two to four (Section 4.1).

Boundary Refinement In both the KLR and GR algorithms, we have to insert the gains of all the vertices into the data structures. However, since we terminate both algorithms as soon as we cannot further reduce the edge-cut, most of this computation is wasted. Furthermore, due to the nature of the refinement algorithms, most of the vertices swapped by either the KLR or the GR algorithm lie along the boundary of the cut, which is defined to be the set of vertices that have edges that are cut by the partition.

In the boundary refinement algorithm, we initially insert into the data structures the gains for only the boundary vertices. As in the KLR algorithm, after we swap a vertex v, we update the gains of the adjacent vertices of v not yet swapped. If any of these adjacent vertices becomes a boundary vertex due to the swap of v, we insert it into the data structures if it has positive gain. Notice that the boundary refinement algorithm is quite similar to the KLR algorithm, with the added advantage that vertices are inserted into the data structures only as needed and no work is wasted.

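For illustration, the boundary can be extracted in one pass (with the same assumed representations as in the earlier sketches):

    # Sketch: the boundary of the cut, i.e. the vertices with at least one
    # incident edge crossing the partition; only these receive gains
    # initially in the boundary refinement variants.
    def boundary_vertices(adj, P):
        return {v for v in adj if any(P[u] != P[v] for u in adj[v])}

    adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
    print(boundary_vertices(adj, [0, 0, 1, 1]))   # -> {1, 2}
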
As with KLR, we have a choice of performing a single pass (boundary greedy refinement (BGR)) or multiple passes (boundary Kernighan-Lin refinement (BKLR)) until the refinement algorithm converges. As opposed to the non-boundary refinement algorithms, the cost of performing multiple passes of the boundary algorithms is small, since only the boundary vertices are examined.

To further reduce the execution time of the boundary refinement while maintaining the refinement capabilities of BKLR and the speed of BGR, one can combine these schemes into a hybrid scheme that we refer to as BKLGR. The idea behind the BKLGR policy is to use BKLR as long as the graph is small, and to switch to BGR when the graph is large. The motivation for this scheme is that single vertex swaps in the coarser graphs lead to a larger decrease in the edge-cut than in the finer graphs, so by using BKLR on these coarser graphs better refinement is achieved, and because these graphs are very small (compared to the size of the original graph), the BKLR algorithm does not require a lot of time. For all the experiments presented in this paper, if the number of vertices in the boundary of the coarse graph is less than 2% of the number of vertices in the original graph, refinement is performed using BKLR; otherwise BGR is used.

4 Experimental Results

We evaluated the performance of the multilevel graph partitioning algorithm on a wide range of matrices arising in different application domains. The characteristics of these matrices are described in Table 1. All the experiments were performed on an SGI Challenge with 1.2 GBytes of memory and a 200MHz MIPS R4400. All times reported are in seconds. Since the nature of the multilevel algorithm discussed is randomized, we performed all experiments with a fixed seed.

Matrix Name          Order    Nonzeros   Description
BCSSTK28 (BC28)      4410     107307     Solid element model
BCSSTK29 (BC29)      13992    302748     3D Stiffness matrix
BCSSTK30 (BC30)      28294    1007284    3D Stiffness matrix
BCSSTK31 (BC31)      35588    572914     3D Stiffness matrix
BCSSTK32 (BC32)      44609    985046     3D Stiffness matrix
BCSSTK33 (BC33)      8738     291583     3D Stiffness matrix
BCSPWR10 (BSP10)     5300     8271       Eastern US power network
BRACK2 (BRCK)        62631    366559     3D Finite element mesh
CANT (CANT)          54195    1960797    3D Stiffness matrix
COPTER2 (COPT)       55476    352238     3D Finite element mesh
CYLINDER93 (CY93)    45594    1786726    3D Stiffness matrix
FINAN512 (FINC)      74752    335872     Linear programming
4ELT (4ELT)          15606    45878      2D Finite element mesh
INPRO1 (INPR)        46949    1117809    3D Stiffness matrix
LHR71 (LHR)          70304    1528092    3D Coefficient matrix
LSHP3466 (LS34)      3466     10215      Graded L-shape pattern
MAP (MAP)            267241   937103     Highway network
MEMPLUS (MEM)        17758    126150     Memory circuit
ROTOR (ROTR)         99617    662431     3D Finite element mesh
S38584.1 (S33)       22143    93359      Sequential circuit
SHELL93 (SHEL)       181200   2313765    3D Stiffness matrix
SHYY161 (SHYY)       76480    329762     CFD/Navier-Stokes
TROLL (TROL)         213453   5885829    3D Stiffness matrix
WAVE (WAVE)          156317   1059331    3D Finite element mesh

Table 1: Various matrices used in evaluating the multilevel graph partitioning and sparse matrix ordering algorithm.

4.1 Graph Partitioning

As discussed in Sections 3.1, 3.2, and 3.3, there are many alternatives for each of the three different phases of a multilevel algorithm. It is not possible to provide an exhaustive comparison of all these possible combinations without making this paper unduly large. Instead, we provide a comparison of different alternatives for each phase after making a reasonable choice for the other two phases.

Matching Schemes We implemented the four matching schemes described in Section 3.1, and the results for a 32-way partition for some matrices are shown in Table 2. These schemes are (a) random matching (RM), (b) heavy edge matching (HEM), (c) light edge matching (LEM), and (d) heavy clique matching (HCM). For all the experiments, we used the GGGP algorithm for the initial partition phase and BKLGR as the refinement policy during the uncoarsening phase. For each matching scheme, Table 2 shows the edge-cut, the time required by the coarsening phase (CTime), and the time required by the uncoarsening phase (UTime). UTime is the sum of the time spent in partitioning the coarse graph (ITime), the time spent in refinement (RTime), and the time spent in projecting the partition of a coarse graph to the next level finer graph (PTime).

RM HEM LEM HCM
32EC CTime UTime 32EC CTime UTime 32EC CTime UTime 32EC CTime UTime
BCSSTK31 44810 5.93 2.46 45991 6.55 1.95 42261 7.65 4.90 44491 7.48 1.92
BCSSTK32 71416 9.21 2.91 69361 10.26 2.34 69616 12.13 6.84 71939 12.06 2.36
BRACK2 20693 6.86 3.41 21152 7.54 3.33 20477 7.90 4.40 19785 8.07 3.42
CANT 323.0K 20.34 8.99 323.0K 22.39 5.74 325.0K 27.14 23.64 323.0K 26.19 5.85
COPTER2 32330 5.18 2.95 30938 6.39 2.68 32309 6.94 5.05 31439 7.25 2.73
CYLINDER93 198.0K 16.49 5.25 198.0K 18.65 3.22 199.0K 21.72 14.83 204.0K 21.61 3.24
4ELT 1826 0.82 0.76 1894 0.91 0.78 1992 0.92 0.95 1879 1.08 0.74
INPRO1 78375 10.40 2.90 75203 11.56 2.30 76583 13.46 6.25 78272 13.34 2.30
ROTOR 38723 12.94 5.60 36512 14.31 4.90 37287 15.51 8.30 37816 16.59 5.10
SHELL93 84523 36.18 10.24 81756 40.59 8.94 82063 46.02 16.22 83363 48.29 8.54
TROLL 317.4K 67.75 14.16 307.0K 74.21 10.38 305.0K 93.44 70.20 312.8K 89.14 10.81
WAVE 73364 20.87 8.24 72034 22.96 7.24 70821 25.60 15.90 71100 26.98 7.20

Table 2: Performance of the various matching algorithms. 32EC is the edge-cut of a 32-way partition, CTime is the time spent in coarsening, and UTime is the time spent in uncoarsening.

In terms of the size of the edge-cut, there is no clear-cut winner among the various matching schemes. The values of 32EC for all schemes are within 10% of each other. Out of these schemes, RM does better for two matrices, HEM does better for six matrices, LEM for three, and HCM for one. The time spent in coarsening does not vary significantly across different schemes, but RM requires the least amount of time for coarsening, while LEM and HCM require the most (up to 38% more time than RM). This is not surprising, since RM looks for the first unmatched neighbor of a vertex (the adjacency lists are randomly permuted). On the other hand, HCM needs to find the edge with the maximum edge density, and LEM produces coarser graphs that have vertices with higher degree than the other three schemes; hence, LEM requires more time both to find a matching and to create the next level coarser graph. The coarsening time required by HEM is only slightly higher (up to 10% more) than the time required by RM.

Comparing the time spent during uncoarsening, we see that both HEM and HCM require the least amount of time, while LEM requires the most. In some cases, LEM requires as much as 7 times more time than either HEM or HCM. This can be explained by the results shown in Table 3, which shows the edge-cut of a 32-way partition when no refinement is performed (i.e., the final edge-cut is exactly the same as that found in the initial partition of the coarsest graph).

              RM        HEM      LEM       HCM
BCSSTK31      14489     84024    412361    115471
BCSSTK32      184236    148637   680637    153945
BRACK2        75832     53115    187688    69370
CANT          817500    487543   1633878   521417
COPTER2       69184     59135    208318    59631
CYLINDER93    522619    286901   1473731   354154
4ELT          3874      3036     4410      4025
INPRO1        205525    187482   821233    141398
ROTOR         147971    110988   424359    98530
SHELL93       373028    237212   1443868   258689
TROLL         1095607   806810   4941507   883002
WAVE          239090    212742   745495    192729

Table 3: The edge-cut for a 32-way partition when no refinement was performed, for the various matching schemes.

Table 3 shows that the edge-cut of LEM on the coarser graphs is significantly higher than that for either HEM or HCM. Because of this, all three components of UTime increase for LEM relative to those of the other schemes. ITime is higher because the coarser graph has more edges, RTime increases because a larger number of vertices need to be swapped to reduce the edge-cut, and PTime increases because more vertices lie along the boundary, which requires more computation [22]. The time spent during uncoarsening for RM is also higher than the time required by the HEM scheme, by up to 50% for some matrices, for somewhat similar reasons.

From the discussion in the previous paragraphs we see that UTime is much smaller than CTime for HEM and HCM, while UTime is comparable to CTime for RM and LEM. Furthermore, for HEM and HCM, as the problem size increases UTime becomes an even smaller fraction of CTime. As discussed in the introduction, this is of particular importance when the parallel formulation of the multilevel algorithm is considered.

As the experiments show, HEM is a good matching scheme that results in good initial partitions and requires little refinement. Even though it requires slightly more time than RM, it produces consistently smaller edge-cuts. We selected HEM as our matching scheme of choice because of its consistently good behavior.

Initial Partition Algorithms As described in Section 3.2, a number of algorithms can be used to partition the coarse graph. We have implemented the following algorithms: (a) spectral bisection (SBP), (b) graph growing (GGP), and (c) greedy graph growing (GGGP). Due to space limitations we do not report the results here, but they can be found in [22]. In summary, the results in [22] show that GGGP consistently finds smaller edge-cuts than the other schemes at slightly better run time. Furthermore, there is no advantage in choosing spectral bisection for partitioning the coarse graph.

Refinement Policies As described in Section 3.3, there are different ways that a partition can be refined during the uncoarsening phase. We evaluated the performance of five refinement policies, both in terms of how good the partitions they produce are and how much time they require. The refinement policies that we evaluate are (a) greedy refinement (GR), (b) Kernighan-Lin refinement (KLR), (c) boundary greedy refinement (BGR), (d) boundary Kernighan-Lin refinement (BKLR), and (e) the combination of BKLR and BGR (BKLGR).

The results of these refinement policies for partitioning the graphs corresponding to some of the matrices in Table 1 into 32 parts are shown in Table 4. These partitions were produced by using heavy-edge matching (HEM) during coarsening and the GGGP algorithm for initially partitioning the coarser graph.

A number of interesting conclusions can be drawn from Table 4. First, for each of the matrices, the size of the edge-cut does not vary significantly across refinement policies. For each matrix, the edge-cut of every refinement policy is within 15% of the best refinement policy for that particular matrix. On the other hand, the time required by some refinement policies does vary significantly. Some policies require up to 20 times more time than others. KLR requires the most time, while BGR requires the least.

GR KLR BGR BKLR BKLGR
32EC RTime 32EC RTime 32EC RTime 32EC RTime 32EC RTime
BCSSTK31 45267 1.05 46852 2.33 46281 0.76 45047 1.91 45991 1.27
BCSSTK32 66336 1.39 71091 2.89 72048 0.96 68342 2.27 69361 1.47
BRACK2 22451 2.04 20720 4.92 20786 1.16 19785 3.21 21152 2.36
CANT 323.4K 3.30 320.5K 6.82 325.0K 2.43 319.5K 5.49 323.0K 3.16
COPTER2 31338 2.24 31215 5.42 32064 1.12 30517 3.11 30938 1.83
CYLINDER93 201.0K 1.95 200.0K 4.32 199.0K 1.40 199.0K 2.98 198.0K 1.88
4ELT 1834 0.44 1833 0.96 2028 0.29 1894 0.66 1894 0.66
INPRO1 75676 1.28 75911 3.41 76315 0.96 74314 2.17 75203 1.48
ROTOR 38214 4.98 38312 13.09 36834 1.93 36498 5.71 36512 3.20
SHELL93 91723 9.27 79523 52.40 84123 2.72 80842 10.05 81756 6.01
TROLL 317.5K 9.55 309.7K 27.4 314.2K 4.14 300.8K 13.12 307.0K 5.84
WAVE 74486 8.72 72343 19.36 71941 3.08 71648 10.90 72034 4.50

Table 4: Performance of five different refinement policies. All matrices have been partitioned into 32 parts. 32EC is the number of edges crossing partitions, and RTime is the time required to perform the refinement.

Comparing GR with KLR, we see that KLR performs better than GR for 8 out of the 12 matrices. For these 8 matrices, the improvement is less than 5% on the average; however, the time required by KLR is significantly higher than that of GR. Usually, KLR requires two to three times more time than GR.

Comparing the GR and KLR refinement schemes against their boundary variants, we see that the time required by the boundary policies is significantly less than that required by their non-boundary counterparts. The time of BGR ranges from 29% to 75% of the time of GR, while the time of BKLR ranges from 19% to 80% of the time of KLR. This seems quite reasonable, given that BGR and BKLR are simpler versions of GR and KLR, respectively. But surprisingly, BGR and BKLR lead to better edge-cuts (than GR and KLR, respectively) in many cases. BGR does better than GR in 6 out of the 12 matrices, and BKLR does better than KLR in 10 out of the 12 matrices. Thus, the quality of the boundary refinement policies is similar to, if not better than, that of their non-boundary counterparts.

Even though BKLR appears to be just a simplified version of KLR, in fact they are two distinct schemes. In each scheme, a set of vertices from the two parts of the partition is swapped in each iteration. In BKLR, the set of vertices to be swapped from either part is restricted to lie along the boundary, whereas in KLR it can potentially be any subset. BKLR performs better in conjunction with the HEM coarsening scheme because, for HEM, the first partition of the coarsest graph is quite good (consistently better than the partition that can be obtained for other coarsening schemes such as RM and LEM), and it does not change significantly with each uncoarsening phase. Note that by restricting each iteration of KL to the boundary vertices, more iterations are needed for the algorithm to converge to a local minimum. However, these iterations take very little time. Thus, BKLR provides the very precise refinement that is needed by HEM.

For the other matching schemes, and for LEM in particular, the partition of the coarse graph is far from a local minimum when it is projected to the next level finer graph, and there is room for significant improvement not just along the boundary. This is the reason why LEM requires the largest refinement time among all the matching schemes, irrespective of the refinement policy. Since boundary refinement schemes consider only boundary vertices, they may miss sequences of vertex swaps that involve non-boundary vertices and lead to a better partition. To compare the performance of the boundary refinement policies against their non-boundary counterparts for both RM and LEM, we performed another set of experiments similar to those shown in Table 4. For the RM coarsening scheme, BGR outperformed GR in only 5 matrices, and BKLR outperformed KLR in only 5 matrices. For the LEM coarsening scheme, BGR outperformed GR in only 4 matrices and BKLR outperformed KLR in only 3 matrices.

Comparing BGR with BKLR, we see that the edge-cut is better for BKLR for 11 matrices, and they perform similarly for the remaining matrix. Note that the improvement of BKLR over BGR is relatively small (less than 4% on the average). However, the time required by BKLR is always higher than that of BGR (in some cases up to four times higher). Again we see here that marginal improvements in the partition quality come at a significant increase in the refinement time. Comparing BKLGR against BKLR, we see that its edge-cut is on the average within 2% of that of BKLR, while its runtime is significantly smaller than that of BKLR and somewhat higher than that of BGR.

In summary, when it comes to refinement policies, a relatively small decrease in the edge-cut usually comes at a significant increase in the time required to perform the refinement. Both the BGR and the BKLGR refinement policies require little time and produce edge-cuts that are fairly good when coupled with the heavy-edge matching scheme. We believe that the BKLGR refinement policy strikes a good balance between small edge-cut and fast execution.

4.2 Comparison with Other Partitioning Schemes

The multilevel spectral bisection (MSB) [2] has been shown to be an effective method for partitioning unstructured problems in a variety of applications. The MSB algorithm coarsens the graph down to a few hundred vertices using random matching. It partitions the coarse graph using spectral bisection and obtains the Fiedler vector of the coarser graph. During uncoarsening, it obtains an approximate Fiedler vector of the next level finer graph by interpolating the Fiedler vector of the coarser graph, and computes a more accurate Fiedler vector using SYMMLQ. The MSB algorithm computes the Fiedler vector of the graph using this multilevel approach. This method is much faster than computing the Fiedler vector of the original graph directly. Note that MSB is a significantly different scheme from the multilevel scheme that uses spectral bisection to partition the graph at the coarsest level. We used the MSB algorithm in the Chaco [19] graph partitioning package to produce partitions for some of the matrices in Table 1 and compared them against the partitions produced by our multilevel algorithm, which uses HEM during the coarsening phase, GGGP during the partitioning phase, and BKLGR during the uncoarsening phase.

Figure 1 shows the relative performance of our multilevel algorithm compared to MSB. For each matrix we plot the ratio of the edge-cut of our multilevel algorithm to the edge-cut of the MSB algorithm. Ratios that are less than one indicate that our multilevel algorithm produces better partitions than MSB. From this figure we can see that for almost all the problems, our algorithm produces partitions that have smaller edge-cuts than those produced by MSB. In some cases, the improvement is as high as 60%. For the cases where MSB does better, the difference is very small (less than 1%). However, the time required by our multilevel algorithm is significantly smaller than that required by MSB. Figure 4 shows the time required by the MSB algorithm relative to that required by our multilevel algorithm. Our algorithm is usually 10 times faster for small problems, and 15 to 35 times faster for larger problems. The relative difference in edge-cut between MSB and our multilevel algorithm decreases as the number of partitions increases. This is a general trend, since as the number of partitions increases both schemes cut more edges, up to the limiting case in which |V| partitions are used and all |E| edges are cut.

Figure 1: Quality of our multilevel algorithm compared to the multilevel spectral bisection algorithm. For each matrix, the ratio of the cut-size of our multilevel algorithm to that of the MSB algorithm is plotted for 64-, 128- and 256-way partitions. Bars under the baseline indicate that the multilevel algorithm performs better.

One way of improving the quality of the MSB algorithm is to use the Kernighan-Lin algorithm to refine the partitions (MSB-KL). Figure 2 shows the relative performance of our multilevel algorithm compared against the MSB-KL algorithm. Comparing Figures 1 and 2, we see that the Kernighan-Lin algorithm does improve the quality of the MSB algorithm. Nevertheless, our multilevel algorithm still produces better partitions than MSB-KL for many problems. However, KL refinement further increases the run time of the overall scheme, as shown in Figure 4, and thus increases the gap in the run time of MSB-KL and our multilevel algorithm.

Figure 2: Quality of our multilevel algorithm compared to the multilevel spectral bisection algorithm with Kernighan-Lin refinement. For each matrix, the ratio of the cut-size of our multilevel algorithm to that of the MSB-KL algorithm is plotted for 64-, 128- and 256-way partitions. Bars under the baseline indicate that our multilevel algorithm performs better.

The graph partitioning package Chaco implements its own multilevel graph partitioning algorithm that is modeled after the algorithm by Hendrickson and Leland [20, 19]. This algorithm, which we refer to as Chaco-ML, uses random matching during coarsening, spectral bisection for partitioning the coarse graph, and Kernighan-Lin refinement every other coarsening level during the uncoarsening phase. Figure 3 shows the relative performance of our multilevel algorithm compared to Chaco-ML. From this figure we can see that our multilevel algorithm usually produces partitions with smaller edge-cuts than Chaco-ML. For some problems, the improvement of our algorithm is between 10% and 50%. Again, for the cases where Chaco-ML does better, it is only marginally better (less than 2%).

Our algorithm is usually two to six times faster than Chaco-ML (Figure 4). Most of the savings come from the choice of refinement policy (we use BKLGR), which is usually four to six times faster than the Kernighan-Lin refinement implemented by Chaco-ML. Note that we are able to use BKLGR without much quality penalty only because we use the HEM coarsening scheme. In addition, the GGGP used in our method for partitioning the coarser graph requires much less time than the spectral bisection which is used in Chaco-ML.

Figure 3: Quality of our multilevel algorithm compared to the multilevel Chaco-ML algorithm. For each matrix, the ratio of the cut-size of our multilevel algorithm to that of the Chaco-ML algorithm is plotted for 64-, 128- and 256-way partitions. Bars under the baseline indicate that our multilevel algorithm performs better.

Figure 4: The time required to find a 256-way partition for Chaco-ML, MSB, and MSB-KL relative to the time required by our multilevel algorithm.

4.3 Sparse Matrix Ordering

The multilevel graph partitioning algorithm can be used to find a fill reducing ordering for a symmetric sparse matrix via recursive nested dissection. Let S be the vertex separator and let A and B be the two parts of the vertex set of G that are separated by S. In the nested dissection ordering, A is ordered first and B second, while the vertices in S are numbered last. Both A and B are ordered by recursively applying nested dissection ordering. In our multilevel nested dissection algorithm (MLND) a vertex separator is computed from an edge separator by finding the minimum vertex cover [31]. The minimum vertex cover has been found to produce very small vertex separators.

S are numbered last. Both A and B are ordered by recur- matrices can be factored roughly 2.4 times faster if ordered
sively applying nested dissection ordering. In our multilevel with MLND.
nested dissection algorithm (MLND) a vertex separator is However, another, even more important, advantage of
computed from an edge separator by finding the minimum MLND over MMD, is that it produces orderings that ex-
vertex cover [31]. The minimum vertex cover has been hibit significantly more concurrency than MMD. The elim-
found to produce very small vertex separators. ination trees produced by MMD (a) exhibit little concur-
Overall quality of a fill reducing ordering depends on rency (long and slender), and (b) are unbalanced so that
whether or not the matrix is factored on a serial or parallel subtree-to-subcube mappings lead to significant load im-
computer. On a serial computer, a good ordering is the balances [26, 9, 14]. One the other hand, orderings based
one that requires the smaller number of operations during on nested dissection produce orderings that have both more

10
Multilevel Nested Disection vs Multiple Minimum Degree and Spectral Nested Disection

3.8 MMD
SND
[3] T. Bui and C. Jones. A heuristic for reducing fill in sparse matrix
3.6
3.4
Baseline: MLND factorization. In 6th SIAM Conf. Parallel Processing for Scientific
3.2 Computing, pages 445–452, 1993.
3
2.8
2.6
[4] Tony F. Chan, John R. Gilbert, and Shang-Hua Teng. Geometric
2.4 spectral partitioning (draft). Technical Report In Preparation, 1994.
2.2
2 [5] Chung-Kuan Cheng and Yen-Chuen A. Wei. An improved two-way
1.8
1.6 partitioning algorithm with stable performance. IEEE Transactions
1.4
1.2
on Computer Aided Design, 10(12):1502–1511, December 1991.
1
0.8
[6] C. M. Fiduccia and R. M. Mattheyses. A linear time heuristic for
0.6
0.4
improving network partitions. In In Proc. 19th IEEE Design Au-
0.2 tomation Conference, pages 175–181, 1982.
0
LS34 BC28 BSP10 BC33 BC29 4ELT BC30 BC31 BC32 CY93 INPR CANT COPT BRCK ROTR WAVE SHEL TROLL [7] J. Garbers, H. J. Promel, and A. Steger. Finding clusters in VLSI cir-
cuits. In Proceedingsof IEEE InternationalConference on Computer
Figure 5: Quality of our multilevel nested dissection relative to the mul- Aided Design, pages 520–523, 1990.
tiple minimum degree, and the spectral nested dissection algorithm. The [8] A. George. Nested dissection of a regular finite-element mesh. SIAM
Journal on Numerical Ananlysis, 10:345–363, 1973.
matrices are displayed in increasing number of equations. Bars above
[9] A. George and J. W.-H. Liu. Computer Solution of Large Sparse
the baseline indicate that the MLND algorithm performs better. Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[10] A. George and J. W.-H. Liu. The evolution of the minimum degree
ordering algorithm. SIAM Review, 31(1):1–19, March 1989.
The MMD algorithm usually takes two to three times less time than MLND to order the matrices in Table 1. However, efforts to parallelize the MMD algorithm have had no success [11]. In fact, the MMD algorithm appears to be inherently serial in nature. On the other hand, the MLND algorithm is amenable to parallelization. In [23] we present a parallel formulation of our MLND algorithm that achieves a speedup of 57 on a 128-processor Cray T3D.

Spectral nested dissection (SND) [32] is a widely used algorithm for ordering matrices for parallel factorization. As in the case of MLND, the minimum vertex cover algorithm was used to compute a vertex separator from the edge separator. The quality of the orderings produced by our multilevel nested dissection algorithm compared to that of the spectral nested dissection algorithm is also shown in Figure 5. From this figure we can see that MLND produces orderings that are better than SND for 17 out of the 18 test matrices. The total number of operations required to factor the matrices ordered using SND is 378 billion, which is 30% more than that of MLND. Furthermore, as discussed in Section 4.2, the runtime of SND is substantially higher than that of MLND. Also, SND cannot be parallelized any better than MLND; therefore, it will always be slower than MLND.

References

[1] Stephen T. Barnard and Horst D. Simon. A parallel implementation of multilevel recursive spectral bisection for application to adaptive unstructured meshes. In Proceedings of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, pages 627–632, 1995.
[2] Stephen T. Barnard and Horst D. Simon. A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. In Proceedings of the Sixth SIAM Conference on Parallel Processing for Scientific Computing, pages 711–718, 1993.
[3] T. Bui and C. Jones. A heuristic for reducing fill in sparse matrix factorization. In 6th SIAM Conference on Parallel Processing for Scientific Computing, pages 445–452, 1993.
[4] Tony F. Chan, John R. Gilbert, and Shang-Hua Teng. Geometric spectral partitioning (draft). Technical Report In Preparation, 1994.
[5] Chung-Kuan Cheng and Yen-Chuen A. Wei. An improved two-way partitioning algorithm with stable performance. IEEE Transactions on Computer Aided Design, 10(12):1502–1511, December 1991.
[6] C. M. Fiduccia and R. M. Mattheyses. A linear time heuristic for improving network partitions. In Proc. 19th IEEE Design Automation Conference, pages 175–181, 1982.
[7] J. Garbers, H. J. Promel, and A. Steger. Finding clusters in VLSI circuits. In Proceedings of IEEE International Conference on Computer Aided Design, pages 520–523, 1990.
[8] A. George. Nested dissection of a regular finite-element mesh. SIAM Journal on Numerical Analysis, 10:345–363, 1973.
[9] A. George and J. W.-H. Liu. Computer Solution of Large Sparse Positive Definite Systems. Prentice-Hall, Englewood Cliffs, NJ, 1981.
[10] A. George and J. W.-H. Liu. The evolution of the minimum degree ordering algorithm. SIAM Review, 31(1):1–19, March 1989.
[11] Madhurima Ghose and Edward Rothberg. A parallel implementation of the multiple minimum degree ordering heuristic. Technical report, Old Dominion University, Norfolk, VA, 1994.
[12] J. R. Gilbert and E. Zmijewski. A parallel graph partitioning algorithm for a message-passing multiprocessor. International Journal of Parallel Programming, (16):498–513, 1987.
[13] John R. Gilbert, Gary L. Miller, and Shang-Hua Teng. Geometric mesh partitioning: Implementation and experiments. In Proceedings of International Parallel Processing Symposium, 1995.
[14] Anshul Gupta, George Karypis, and Vipin Kumar. Highly scalable parallel algorithms for sparse matrix factorization. Technical Report 94-63, Department of Computer Science, University of Minnesota, Minneapolis, MN, 1994. Submitted for publication in IEEE Transactions on Parallel and Distributed Computing. Available on WWW at URL ftp://ftp.cs.umn.edu/users/kumar/sparse-cholesky.ps.
[15] Lars Hagen and Andrew Kahng. A new approach to effective circuit clustering. In Proceedings of IEEE International Conference on Computer Aided Design, pages 422–427, 1992.
[16] M. T. Heath, E. G.-Y. Ng, and Barry W. Peyton. Parallel algorithms for sparse linear systems. SIAM Review, 33:420–460, 1991. Also appears in K. A. Gallivan et al., Parallel Algorithms for Matrix Computations, SIAM, Philadelphia, PA, 1990.
[17] M. T. Heath and P. Raghavan. A Cartesian nested dissection algorithm. Technical Report UIUCDCS-R-92-1772, Department of Computer Science, University of Illinois, Urbana, IL 61801, 1992. To appear in SIAM Journal on Matrix Analysis and Applications, 1994.
[18] Bruce Hendrickson and Robert Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. Technical Report SAND92-1460, Sandia National Laboratories, 1992.
[19] Bruce Hendrickson and Robert Leland. The Chaco user's guide, version 1.0. Technical Report SAND93-2339, Sandia National Laboratories, 1993.
[20] Bruce Hendrickson and Robert Leland. A multilevel algorithm for partitioning graphs. Technical Report SAND93-1301, Sandia National Laboratories, 1993.
[21] Zdenek Johan, Kapil K. Mathur, S. Lennart Johnsson, and Thomas J. R. Hughes. Finite element methods on the Connection Machine CM-5 system. Technical report, Thinking Machines Corporation, 1993.
[22] G. Karypis and V. Kumar. Multilevel graph partitioning schemes. Technical report, Department of Computer Science, University of Minnesota, 1995. Available on WWW at URL ftp://ftp.cs.umn.edu/users/kumar/mlevel serial.ps.
[23] G. Karypis and V. Kumar. Parallel multilevel graph partitioning. Technical report, Department of Computer Science, University of Minnesota, 1995. Available on WWW at URL ftp://ftp.cs.umn.edu/users/kumar/mlevel parallel.ps.
[24] George Karypis, Anshul Gupta, and Vipin Kumar. A parallel formulation of interior point algorithms. In Supercomputing 94, 1994. Available on WWW at URL ftp://ftp.cs.umn.edu/users/kumar/interior-point.ps.
[25] B. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 1970.
[26] Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis.
Introduction to Parallel Computing: Design and Analysis of Algo-
rithms. Benjamin/Cummings Publishing Company, Redwood City,
CA, 1994.
[27] J. W.-H. Liu. Modification of the minimum degree algorithm by
multiple elimination. ACM Transactions on Mathematical Software,
11:141–153, 1985.
[28] Gary L. Miller, Shang-Hua Teng, and Stephen A. Vavasis. A unified geometric approach to graph separators. In Proceedings of 31st Annual Symposium on Foundations of Computer Science, pages 538–547, 1991.
[29] B. Nour-Omid, A. Raefsky, and G. Lyzenga. Solving finite element
equations on concurrent computers. In A. K. Noor, editor, American
Soc. Mech. Eng, pages 291–307, 1986.
[30] R. Ponnusamy, N. Mansour, A. Choudhary, and G. C. Fox. Graph
contraction and physical optimization methods: a quality-cost trade-
off for mapping data on parallel computers. In International Conference on Supercomputing, 1993.
[31] A. Pothen and C-J. Fan. Computing the block triangular form of a
sparse matrix. ACM Transactions on Mathematical Software, 1990.
[32] Alex Pothen, H. D. Simon, and Lie Wang. Spectral nested dissection.
Technical Report 92-01, Computer Science Department, Pennsylva-
nia State University, University Park, PA, 1992.
[33] Alex Pothen, Horst D. Simon, and Kang-Pu Liou. Partitioning sparse
matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11(3):430–452, 1990.