This tutorial introduces partitioning with applications to VLSI circuit design. The
problem formulations include two-way, multiway, and multi-level partitioning,
partitioning with replication, and performance driven partitioning. We depict the
models of multiple pin nets for the partitioning processes. To derive optimum
solutions, we describe the branch and bound method and, for a special case of
circuits, the dynamic programming method. We also explain several heuristics, including
group migration algorithms, network flow approaches, programming methods,
Lagrange multiplier methods, and clustering methods. We conclude the tutorial with
research directions.
system performance. With the advance of fabrication technologies, the cost of a transistor drops while the cost of input/output pads remains fairly constant. Consequently, the size of the interface between partitions, e.g., between chips, determines a significant portion of the manufacturing expenses, and the quality of the partitioning has a strong effect on production cost. Furthermore, in submicron designs, interconnection delays tend to dominate gate delays [8]; therefore system performance is greatly influenced by the partitions.

Partitioning has been applied to solve various aspects of VLSI design problems [5, 36]:

Physical packaging Partitioning decomposes the system in order to satisfy the physical packaging constraints. The partitioning conforms to a physical hierarchy ranging from cabinets, cases, boards, and chips to modular blocks.

Divide and conquer strategy Partitioning is used to tackle design complexity with a divide and conquer strategy [21]. This strategy is adopted to decompose the project between team members, to construct a logic hierarchy for logic synthesis, to transform the netlist into a physical hierarchy for floorplanning, to allocate cells into regions for placement and RLC extraction, and to manipulate hierarchies between logic and layout for simulation.

System emulation and rapid prototyping One approach for system emulation and prototyping is to construct the hardware with field programmable gate arrays. Usually, the capacity of these field programmable gate arrays is smaller than current VLSI designs. Thus, these prototyping machines are composed of a hierarchical structure of field programmable gate arrays. A partitioning tool is needed to map the netlist into the hardware [110].

Hardware and software codesign For hardware and software codesign, partitioning is used to decompose the designs into hardware and software.

Management of design reuse For huge designs, especially system-on-a-chip, we have to manage design reuse. Partitioning can identify clusters of the netlist and construct functional modules out of the clusters.

While partitioning is a tool required to manage huge systems in many fields, such as efficient storage of large databases on disks and data mining, in this tutorial we focus our efforts on partitioning with applications to VLSI circuit designs. In the next section, we describe the notations for the tutorial. In Section 3, the formulations of the partitioning problems are stated. Section 4 covers the models for multiple pin nets. Section 5 depicts the partitioning algorithms. The tutorial is concluded with research directions.

2. PRELIMINARIES

In this section, we establish the notations used and formulate the partitioning problems addressed in our approaches. A circuit is represented by a hypergraph $H(V, E)$, where the vertex set $V = \{v_i \mid i = 1, 2, \ldots, n\}$ denotes the set of modules and the hyperedge set $E = \{e_j \mid j = 1, 2, \ldots, m\}$ denotes the set of nets. Each net $e_j$ is a subset of $V$ with cardinality $|e_j| \ge 2$. The modules in $e_j$ are called the pins of $e_j$.

The hypergraph representation for a circuit with 9 modules and 6 signal nets is shown in Figure 1, where nets $e_1$, $e_3$ and $e_5$ are two-pin nets, net $e_6$ is a three-pin net, and nets $e_2$ and $e_4$ are four-pin nets.

When the circuit has only two pin nets, we can simplify the representation to a graph $G(V, E)$. A net connecting modules $v_i$ and $v_j$ is represented by $e_{ij}$ with a connectivity $c_{ij}$. We set $c_{ij} = 0$ if there is no net connecting modules $v_i$ and $v_j$. We shall show later that for certain formulations we replace multiple pin nets with models of two pin nets. The replacement is performed when the partitioning algorithm is devised for graph models.

(i) Module Size and Net Connectivity Each module $v_i$ is attached with a size $s_i$ in $\mathbb{R}^+$, the positive real numbers. We define $S(V_j) = \sum_{v_i \in V_j} s_i$ to be the size of a partition $V_j$. Each net $e_i$ is attached with a
VLSI PARTITIONING 3
connectivity $c_i$ in $\mathbb{R}^+$. By default, $c_i = 1$. For a bus of multiple signal lines, we can represent the bus with a net $e_i$ of connectivity $c_i$ equal to the number of lines. We can also assign higher weights to some important nets; this enables us to keep the modules of these nets in the same partition.

In this tutorial, we assume that circuits are represented as hypergraphs except when stated otherwise. Hence, the terms circuit, netlist, and hypergraph are used interchangeably throughout the tutorial.

(ii) Partitions and Cuts The set of hyperedges connecting any two-way partition $(V_1, V_2)$ of two disjoint vertex sets $V_1$ and $V_2$ is denoted by a cut $E(V_1, V_2) = \{e_j \in E \mid 0 < |e_j \cap V_1| \text{ and } 0 < |e_j \cap V_2|\}$, i.e., $e_j \in E(V_1, V_2)$ if there exist some pins of $e_j$ in $V_1$ and some different pins of $e_j$ in $V_2$. We define $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$ to be the cut count of the partition $(V_1, V_2)$.

For a multiway partition $(V_1, V_2, \ldots, V_k)$ where $k > 2$, the cut is $E(V_1, V_2, \ldots, V_k) = \{e_j \in E \mid \exists\, i \text{ s.t. } 0 < |e_j \cap V_i| < |e_j|\}$. For each subset $V_i$, we denote its external cut set by $E(V_i) = \{e_j \in E \mid 0 < |e_j \cap V_i| < |e_j|\}$. We denote its adjacent net set to be the nets with some pin contained in $V_i$, i.e., $I(V_i) = \{e_i \mid |e_i \cap V_i| > 0\}$.

(iii) Replication Cuts and Directed Cuts For replication cuts and performance driven partitioning, the direction of the nets makes a difference in the process. We characterize the pins of each net into two types: source and sink. A directed net $e_i$ is denoted by $(a_i, b_i)$, where $a_i \subseteq V$ are the source pins of the net and $b_i \subseteq V$ are the sink pins of the net. We assume that $|a_i \cup b_i| \ge 2$, $|a_i| \ge 1$ and $|b_i| \ge 1$. Usually, each net has one source pin and multiple sink pins. However, some nets may have multiple sources which share the same interconnect line. Furthermore, one pin can be both a source pin and a sink pin of the same net. Therefore, $a_i$ and $b_i$ may have a nonempty intersection.

For two disjoint vertex sets $X$ and $Y$, we shall use $E(X \to Y)$ to denote the directed cut set from $X$ to $Y$. Net set $E(X \to Y)$ contains all the nets $e_i = (a_i, b_i)$ such that $X$ intersects the source pin set $a_i$ and $Y$ intersects the sink pin set $b_i$, i.e., $E(X \to Y) = \{e_i \mid e_i = (a_i, b_i),\ a_i \cap X \ne \emptyset,\ b_i \cap Y \ne \emptyset\}$. We use the function $C(X \to Y)$ to denote the total cut count of the nets in $E(X \to Y)$, i.e., $C(X \to Y) = \sum_{e_i \in E(X \to Y)} c_i$.

(iv) Performance Driven Partitioning In performance driven partitioning [106], modules are distinguished into two types: combinational elements and globally clocked registers. In illustrations, we shall use circles to represent the combinational elements and rectangles to represent the registers (Fig. 13). Each module $v_i$ has an associated delay $d_i$.

A path of length $k$ from a module $v_i$ to a module $v_j$ is a sequence $\langle v_{i_0}, v_{i_1}, \ldots, v_{i_k} \rangle$ of modules such that $v_i = v_{i_0}$, $v_j = v_{i_k}$ and, for each $l \in \{1, 2, \ldots, k\}$, modules $v_{i_{l-1}}$ and $v_{i_l}$ are a source pin and a sink pin of a net in $E$, respectively.

(v) Clustering Given a hypergraph $H(V, E)$, highly connected modules in $V$ can be grouped
together to form some single supermodules called clusters. After this process, a clustering $\Gamma = \{V_1, V_2, \ldots, V_k\}$ of the original hypergraph $H$ is obtained and a contracted (i.e., coarser) hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ is induced, where $V_\Gamma = \{v_1^\Gamma, v_2^\Gamma, \ldots, v_k^\Gamma\}$. For every $e_j \in E$, the contracted net $e_j^\Gamma \in E_\Gamma$ if $|e_j^\Gamma| \ge 2$, where $e_j^\Gamma = \{v_i^\Gamma \mid e_j \cap V_i \ne \emptyset\}$, that is, $e_j^\Gamma$ spans the set of clusters containing modules of $e_j$. A contracted hypergraph, of course, can be used to induce another coarser contracted hypergraph based on the same clustering process. On the other hand, a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$ can be unclustered to return to a finer hypergraph $H(V, E)$.

3. PROBLEM FORMULATIONS

In this section, we describe different formulations of the partitioning problems addressed in this tutorial. We will cover two-way partitioning, multiway partitioning, multiple level partitioning, partitioning with replication, and performance driven partitioning.

3.1. Two-way Partitioning or Bipartitioning

We consider several possible variations on the size constraints and cost functions in the formulation. Additionally, in certain formulations, we fix two modules $v_s$ and $v_t$ to be on the opposite sides of the cut as two seeds.

$|V|$ equally spaced slots on a straight line (Fig. 2). Modules $v_s$ and $v_t$ are fixed at the two extreme ends, i.e., $v_s$ on the first slot (left end) and $v_t$ on the last slot (right end). The goal is to assign all modules to distinct slots to minimize the total wire length. Let us use $x_i$ to denote the coordinate of module $v_i$ after it is assigned to the slot. The length of a net $e_i$ can be expressed as the difference of the maximum coordinate and the minimum coordinate of the modules in the net, i.e., $\max_{v_j \in e_i} x_j - \min_{v_k \in e_i} x_k$. The total wire length can be expressed as follows.

$$\sum_{e_i \in E} \Big( \max_{v_j \in e_i} x_j - \min_{v_j \in e_i} x_j \Big) \tag{2}$$

The relation between partitioning and placement can be derived under the assumption that all nets are two pin nets [50].

THEOREM 3.1 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be a min-cut partition separating modules $v_s$ and $v_t$. Let $v_s$ and $v_t$ be the two modules located at the two extreme ends of a linear placement. Then, there exists an optimal linear placement solution such that all modules in $V_2$ are on the slots right of all modules in $V_1$ (Fig. 2).

Thus, we can use the min-cut to partition a linear
placement into two smaller problems and still maintain optimality. Conceptually, we can conceive that modules in $V_1$ or $V_2$ have a stronger internal connection within the set than their mutual connection to the other set. Thus, if the spans of modules in $V_1$ and in $V_2$ are mixed in a linear placement, we can slide all modules in $V_1$ to the left and all modules in $V_2$ to the right to reduce the total wire length. In fact, this is the procedure to prove the theorem.

The min-cut with no size constraints can be found in polynomial time using classical maximum flow techniques [1]. However, it may happen that the optimal solution separates only $v_s$ or $v_t$ from the rest of the modules, i.e., $V_1 = \{v_s\}$ or $V_2 = \{v_t\}$. This result is very likely to happen because most VLSI basic modules have very small degrees of connecting nets (e.g., the degree of a 3-input NAND gate is 4).

3.1.2. Minimum Cost Ratio Cut

The cost ratio cut formulation supplies a partition different from the min-cut that separates two fixed modules. Thus, if the min-cut cannot provide any nontrivial solution, we may adopt the cost ratio cut to perform another trial.

In cost ratio cut, we fix two modules $v_s$ and $v_t$ at two different sides. Our objective is to find a vertex set $A$ to minimize a cost ratio function:

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} \tag{3}$$

where vertex set $A$ does not contain $v_s$ and $v_t$. Vertex set $A$ is non-empty, i.e., $S(A) > 0$.

Cost ratio cut is also strongly related to a linear placement. Assuming that all nets are two pin nets, we can derive the following theorem [22]:

THEOREM 3.2 Given a graph $G(V, E)$ with modules $v_s$ and $v_t$ in $V$, let $(V_1, V_2)$ be an optimal cost ratio cut partition. There exists an optimal linear placement solution such that all modules in $A$ are on the slots left of all modules in $V - A - \{v_s\}$.

Conceptually, we can conceive that $C(A, V - A - \{v_s\})$ is the force to pull $A$ to the right and $C(A, \{v_s\})$ is the force to push $A$ to the left. The denominator $S(A)$ is the inertia of the set $A$. A set $A$ with the minimum cost ratio moves with the fastest acceleration toward the left end of the slots.

Example In Figure 3, the circuit contains six modules. The optimum cost ratio cut solution has $A = \{v_1, v_2, v_3\}$. The cost ratio value is

$$\frac{C(A, V - A - \{v_s\}) - C(A, \{v_s\})}{S(A)} = \frac{4 - 3}{3} = \frac{1}{3}. \tag{4}$$

The cost ratio value of any other choice of set $A$ is larger than expression (4).
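As a hedged illustration of the cost ratio function in expression (3), the sketch below evaluates it on a small two-pin-net graph and finds the minimizing set $A$ by exhaustive enumeration. The graph `EDGES`, the sizes `SIZES`, and all function names are hypothetical and are not taken from Figure 3; integer sizes and connectivities are assumed so that exact fractions can be used. Enumeration is exponential and only meant to make the definition concrete.

```python
from fractions import Fraction
from itertools import combinations

def cut(edges, X, Y):
    """C(X, Y): total connectivity of two-pin nets with one pin in X and one in Y."""
    return sum(c for (u, v), c in edges.items()
               if (u in X and v in Y) or (u in Y and v in X))

def cost_ratio(edges, sizes, A, vs):
    """Expression (3): [C(A, V - A - {vs}) - C(A, {vs})] / S(A)."""
    rest = set(sizes) - set(A) - {vs}      # V - A - {vs}; still contains vt
    pull = cut(edges, A, rest)             # pulls A to the right
    push = cut(edges, A, {vs})             # draws A toward vs on the left
    return Fraction(pull - push, sum(sizes[v] for v in A))

def best_cost_ratio_cut(edges, sizes, vs, vt):
    """Try every non-empty A that avoids vs and vt; return (ratio, sorted A)."""
    free = sorted(set(sizes) - {vs, vt})
    candidates = (set(c) for r in range(1, len(free) + 1)
                  for c in combinations(free, r))
    return min((cost_ratio(edges, sizes, A, vs), sorted(A)) for A in candidates)

# Hypothetical example: five unit-size modules, integer connectivities.
SIZES = {"s": 1, "a": 1, "b": 1, "c": 1, "t": 1}
EDGES = {("s", "a"): 2, ("s", "b"): 1, ("a", "b"): 1,
         ("b", "c"): 1, ("a", "c"): 1, ("c", "t"): 2}

best = best_cost_ratio_cut(EDGES, SIZES, "s", "t")
```

In this made-up graph the minimizing set is the pair of modules tied most strongly to $v_s$, which matches the "force" reading of the numerator given above.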
There are different ways to formulate the cut cost because of the different criteria used to count the cost of multiple pin nets. In the following we list a few possible objective functions.

(i) Minimize the cut count,

$$C(V_1, V_2, \ldots, V_k) = \sum_{e_i \in E(V_1, V_2, \ldots, V_k)} c_i \tag{10}$$

(ii) Minimize the sum of cut counts of all vertex sets. Let us denote the cut count of vertex set $V_i$ to be $C(V_i) = \sum_{e_i \in E(V_i)} c_i$. The sum of cut counts of all subsets can be expressed as

$$\sum_{i=1}^{k} C(V_i) = \sum_{i=1}^{k} \sum_{e_j \in E(V_i)} c_j \tag{11}$$

Thus, a net connecting three subsets costs more than the same net connecting two subsets.

(iii) Minimize the maximum cut count of all subsets, i.e.,

$$\max_{1 \le i \le k} C(V_i) \tag{12}$$

3.2.2. Cluster Ratio Cut

Cluster ratio cut is an extension of ratio cut from two-way partition to multiway partition. There is no bound on the size of each subset. Furthermore, the number of partitions, $k$, is not fixed, and instead is part of the objective function.

$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{1 \le i \le k-1} \sum_{j > i} S(V_i)\, S(V_j)} \tag{13}$$

Note that we can rewrite the denominator to reduce the complexity of the derivation.

$$R_C = \min_{k > 1} \frac{C(V_1, V_2, \ldots, V_k)}{\frac{1}{2} \sum_{1 \le i \le k} S(V_i)\,[S(V) - S(V_i)]} \tag{14}$$

If the number of partitions is one, the denominator becomes zero. Thus, $k$ is restricted to be larger than one.

Example Figure 6 shows a fifteen module circuit. The modules are of unit size and the nets are of unit connectivity. The square dot in the figure represents a hypernet. The partition shown by the dashed line is a minimum cluster ratio cut. The cost of the cut is

$$\frac{C(V_1, V_2, \ldots, V_4)}{\frac{1}{2} \sum_{1 \le i \le 4} S(V_i)\,[S(V) - S(V_i)]} = \frac{4}{\frac{1}{2}[4(15-4) + 3(15-3) + 4(15-4) + 4(15-4)]} = \frac{1}{21} \tag{15}$$
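The arithmetic of expression (15), and the equivalence of the two denominators in (13) and (14), can be checked with a short sketch. The helper names and the small net list used for `cut_count` are assumptions made for illustration; only the subset sizes 4, 3, 4, 4 and the cut count 4 come from the example above.

```python
from fractions import Fraction

def cut_count(nets, parts):
    """C(V1..Vk) as in (10): connectivity of nets with pins in more than one subset."""
    block = {v: idx for idx, part in enumerate(parts) for v in part}
    return sum(c for pins, c in nets if len({block[v] for v in pins}) > 1)

def pairwise_denominator(sizes):
    """Denominator of (13): sum over i < j of S(Vi) * S(Vj)."""
    return sum(sizes[i] * sizes[j]
               for i in range(len(sizes)) for j in range(i + 1, len(sizes)))

def half_sum_denominator(sizes):
    """Denominator of (14): 1/2 * sum over i of S(Vi) * (S(V) - S(Vi))."""
    total = sum(sizes)
    return Fraction(sum(s * (total - s) for s in sizes), 2)

def cluster_ratio(cut, sizes):
    """Cluster ratio value of one given partition (no minimization over k)."""
    return Fraction(cut) / half_sum_denominator(sizes)

# Subset sizes and cut count quoted in the example of expression (15).
example_sizes = [4, 3, 4, 4]
example_ratio = cluster_ratio(4, example_sizes)

# Hypothetical hypergraph: (pins, connectivity) pairs and a three-way partition.
toy_nets = [({1, 2}, 1), ({2, 3, 4}, 2), ({4, 5}, 1)]
toy_parts = [{1, 2}, {3}, {4, 5}]
toy_cut = cut_count(toy_nets, toy_parts)
```

Both denominators evaluate to 84 for the sizes in the example, so the ratio is 4/84.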
The physical intuition of cluster ratio can be explained using a random graph model [10]. Let $G$ be a uniformly distributed random graph. We construct the nets connecting each pair of modules with identical independent probability $f$. Since the nets are uniformly distributed, the probability of finding a subgraph which is significantly denser than the rest of the graph is very small, meaning that there is no distinct cluster structure in $G$. Consider a cut $E(V_1, V_2, \ldots, V_k)$. The expected value of $C(V_1, V_2, \ldots, V_k)$ equals

$$\mathrm{Expec}(C(V_1, V_2, \ldots, V_k)) = f \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i|\,|V_j| \tag{16}$$

and the expected value of cluster ratio equals

$$\mathrm{Expec}(R_C) = \mathrm{Expec}\!\left( \frac{C(V_1, V_2, \ldots, V_k)}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i|\,|V_j|} \right) = \frac{f \sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i|\,|V_j|}{\sum_{j=1}^{k-1} \sum_{i=j+1}^{k} |V_i|\,|V_j|} = f \tag{17}$$

Since $f$ is a constant, all cuts have the same expected cluster ratio value. Therefore, if we use cluster ratio as the metric, all cuts would be equally favored, which is consistent with the fact that $G$ has no distinct clusters. However, in a general circuit, different cuts generate different ratio values. Cuts that go through weakly connected groups correspond to smaller ratio values. The minimum of all cuts according to their cluster ratio values defines the cluster structure of the circuit, since this cut deviates the most from the cuts of a uniformly distributed graph.

reach the leaves. Thus, the leaves are ranked level zero. Each node is one level above the maximum level of its children. When the level of the root is only one, the problem degenerates to two-way or multiway partitioning.

Each net $e_i$ spans a set of leaves. Given a set of leaves, there is a unique lowest common ancestor. The level of the lowest common ancestor is defined to be the level $l(e_i)$ of the net.

The cost of a net $e_i$ is defined to be the product of its connectivity $c_i$ and the weight $w(l(e_i))$ of level $l(e_i)$ for net $e_i$ to communicate, i.e., $c_i\, w(l(e_i))$. The cost of the multi-level partition is the sum of the cost of all nets, i.e., $\sum_{e_i \in E} c_i\, w(l(e_i))$.

3.3.1. J-level K-way Partitioning

When the root of the partitioning tree is at level $j$ and the number of branches of each node is no more than $k$, we call it a $j$-level $k$-way partition. We can set different communication weights for each level. Usually, the function is monotone, i.e., $w(l)$ is larger when level $l$ increases. The vertex set $V_i$ of each leaf $i$ has its size bounded by $S_l \le S(V_i) \le S_u$.

For electronic packaging, the tree is bounded by the number of external connections. We say that a leaf is covered by a node if there is a directed path from the node to the leaf in the tree representation. For each node $n_i$, we define $T_i$ to be the union of the modules in the leaves covered by node $n_i$. Let $E(T_i)$ be the external nets of $T_i$, i.e., $E(T_i) = \{e_i \mid 0 < |e_i \cap T_i| < |e_i|\}$. The cut count of each node should not exceed the capacity of the external connection of the packaging, i.e.,

$$C(T_i) = \sum_{e_j \in E(T_i)} c_j \le \mathrm{Cap}(l(n_i)) \tag{18}$$

3.3.2. Generic Binary Tree

A generic binary tree structure [110] is proposed to simplify the multi-level partitioning. There is only one constant $S_u$ to set in the binary tree. Thus, it is much easier to make a fair comparison between different algorithms.

In a generic binary tree, each internal node has exactly two children. The weight of each level is defined to be $w(l) = 2l$. Thus, we have the objective function

$$\min \sum_{e_i \in E} c_i \cdot 2\, l(e_i)$$

subject to the constraint on the capacity of the leaves, i.e., $S(V_i) \le S_u$ where $V_i$ is the vertex set of leaf $i$. The level of the root is adjusted according to the minimization of the objective function.

Example Figure 8 illustrates a generic binary tree for partitioning. In this figure, the root is at level three. Each node has at most two children.

3.4. Replication Cut

In the replication cut problem, a subset of the circuit may be replicated to reduce the cut count of a partition [54, 64, 82]. In this section, we use a two-way partition to illustrate the problem. We fix two modules $v_s$ and $v_t$ at two sides of the cut. We use three vertex sets to represent the partition, $V_1$, $V_2$, and $R$, where $V_1$, $V_2$, and $R$ are disjoint and $V_1 \cup V_2 \cup R = V$, $v_s \in V_1$, $v_t \in V_2$. Subsets $V_1$ and $V_2$ are separated by the cut and subset $R$ is to be replicated at both sides (Fig. 9).

Each copy of $R$ needs to collect a complete set of input signals in order to compute the function
FIGURE 9 Replication cut problem: (a) the three sets of nodes V1, R and V2; (b) the duplicated circuit with R being replicated.
Let $S_l$ and $S_u$ denote the size limits on the two partitioned subsets. We state the Replication Cut Problem as follows:

Given a directed circuit $G$, we want to find a replication cut $R(V_1, V_2)$ with an objective

$$\min C_R(V_1, V_2) = \sum_{e_i \in R(V_1, V_2)} c_i \tag{19}$$

subject to the size constraints

$$S_l \le S(V_1 \cup R) \le S_u \quad \text{and} \quad S_l \le S(V_2 \cup R) \le S_u,$$

and the feasible condition

FIGURE 10 An interpretation of the replication cut, $R(V_1, V_2) = E(V_1 \to \bar{V}_1) \cup E(V_2 \to \bar{V}_2)$.

3.5. Performance Driven Partitioning

The goal of performance driven partitioning is to generate a partition that satisfies some timing constraints. Due to the physical geometric distance and interface technology limitations, inter-partition delay contributes the dominant portion of signal propagation delay. Consequently, instead of minimizing the number of crossing nets as the only objective during partitioning, we should take into account the interpartition delay to satisfy the timing constraints.

Clock period is a major measurement for circuit performance. It is determined by the longest signal propagation delay between registers. Each crossing net is associated with an interpartition delay $\delta$ determined by VLSI technologies. Given a path $p$ from one register to another register with no intervening registers, let $d_p$ be the sum of combinational block delays and $\hat{d}_p$ be the sum of interpartition delays along path $p$. The longest delay $d_p + \hat{d}_p$ among all paths $p$ should not exceed the clock period $T$, i.e.:

$$\max_p \,(d_p + \hat{d}_p) \le T. \tag{20}$$

Now we state the performance-driven partitioning problem as follows:

Given hypergraph $H(V, E)$, clock period $T$, two bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, and $\max_p (d_p + \hat{d}_p) \le T$.

Example In Figure 11, path $p$ starts at register $v_i$ and ends at register $v_j$. The path crosses between the partition $(V_1, V_2)$ three times. Thus, the interpartition delay is $\hat{d}_p = 3\delta$.

Replication can improve the performance of the partitioned results [83]. In Figure 12(a), vertex set $R$ is located at the side of $V_2$. Path $p$ crosses between the partition $(V_1, R \cup V_2)$ three times. By replicat-
latency bounds can achieve better clock period and system latency by using retiming. Therefore, we want to generate a partition with small iteration and latency bounds.

Statement of the Problem Now we state the performance-driven partitioning problem as follows:

Given hypergraph $H(V, E)$, two numbers $\tilde{J}$ and $\tilde{M}$, bounds of sizes $S_l$ and $S_u$, and interpartition delay $\delta$, find a partition $(V_1, V_2)$ with the minimum cut count, subject to $S_l \le S(V_1) \le S_u$, $S_l \le S(V_2) \le S_u$, $J(V_1, V_2) \le \tilde{J}$, and $M(V_1, V_2) \le \tilde{M}$.

Example Figure 13 illustrates the effect of replication on the iteration bound. Let us assume that the interpartition delay is $\delta = 4$. Before replication, the iteration bound is dominated by loop $l_1$. The bound is equal to

which is smaller than the iteration bound before replication.

3.6. Clustering

Clustering [6] is similar to multiway partitioning in that the process groups modules into $k$ subsets. However, for clustering the number of subsets is usually much greater than for a typical multiway partitioning problem, e.g., $k \gg 10$.

Often, a clustering process is used as part of a divide and conquer approach. Thus, it is important to choose an objective function that fits the target application. If the goal is to reduce problem complexity, we set the objective function to be:

$$\min \sum_{i=1}^{k} \frac{C(V_i)}{C(I(V_i))}, \tag{26}$$
4.1. Shift Model

The shift model [101] for multiple pin nets is useful when we perturb the partition by shifting one module to a different vertex set or by swapping two modules between different vertex sets. Let us simplify the description by assuming only one module is shifted to a different vertex set. A swap of a pair of modules can be treated as two steps of module shifting.

For each shift, we want to update the cut count. We also want to update the potential change in cost for each module if it were to be shifted, so that we can rank the modules for the next move. Such cost revision can be expensive if the circuit has large nets which contain huge numbers of pins, e.g., hundreds of thousands of pins.

The shift model reduces the complexity of the cost revision by utilizing the property that, for huge nets, most shifts of their pins do not change the cost of the other pins in the net.

Let us simplify the description by considering a two way partitioning. The model can be extended to multiple way partitioning according to the choice of objective functions. Let module $v_j$ be shifted from vertex set $V_1$ to $V_2$. The configuration of nets $e_i \in E(\{v_j\})$ connecting module $v_j$ is revised. For each net $e_i$, we denote $k_i$ to be the number of pins of $e_i$ in $V_1$ and $|e_i| - k_i$ the number of pins of $e_i$ in $V_2$ (Fig. 14). With respect to net $e_i$, we update the pin numbers $k_i$ and $|e_i| - k_i$ after module $v_j$ is shifted. We also update the cost of modules in nets $e_i$.

1. If the revised $k_i \ge 2$, the potential cost of pins due to net $e_i$ is zero. For the case that $|e_i| - k_i = 1$, we increase the cut count by $c_i$ and set the potential cost of pins in $e_i$. Otherwise, the move has no effect on the cut count and potential cost.
2. If the revised pin count $k_i = 1$, the shift of the last pin of $e_i$ in $V_1$ will decrease the cut count by $c_i$. We then update the potential cost of this last pin.
3. If $k_i = 0$, the cut count reduces by $c_i$. However, the shift of any pin $v_k \in e_i$ from $V_2$ to $V_1$ will increase the cut count. Thus, in this case, we reflect the cost of potential shift on the pins of $e_i$, which takes $O(|e_i|)$ operations.

4.2. Clique of Two Pin Nets

Some researchers use cliques of two pin nets to model multiple pin nets. Given a multiple pin net $e_i$, we construct a clique of $\frac{1}{2}|e_i|(|e_i| - 1)$ two pin nets to connect all pairs of pins in the net. The clique model maintains the symmetric relation of the modules of the same net in the sense that the order of the pins in the net has no effect on the cost.

The weight of the two pin nets in the clique model is adjusted by some factor. One approach is to use $2/|e_i|$ to scale down the connectivity. The total weight of all the nets in the clique is then $\frac{2}{|e_i|} \cdot \frac{1}{2}|e_i|(|e_i| - 1)\, c_i = (|e_i| - 1)\, c_i$. Note that it takes $|e_i| - 1$ two pin nets to form a spanning tree of $|e_i|$ modules.

Other factors have been proposed, such as $1/(|e_i| - 1)$, which is based on a different probability model. However, no factor can exactly reflect the cost of a multiple pin net model.

Complexity of the Clique Model The complexity of the clique model is high. There are $O(|e_i|^2)$ two pin nets in a clique model. Suppose the
process of each two pin net takes a constant time. It takes $O(|e_i|^2)$ operations to process a multiple pin net $e_i$. Therefore, in practice, if the pin number is larger than a threshold, the net is ignored in the process.

4.3. Star of Two Pin Nets

A star model introduces less complexity than a clique model. Given a net $e_i$, we create a dummy module $\tilde{v}_i$. The dummy module $\tilde{v}_i$ connects to every pin in $e_i$ with a two pin net. This module maintains the symmetry of the net. However, we need only $|e_i|$ two pin nets.

For the clique and star models, the cost of the partition depends on the number of pins on the two sides of the partition. The cost is higher when the pins are distributed more evenly on the two sides of the cut. Thus, these models discourage even partitioning of the pins in the nets.

4.4. Loop Model of Two Pin Nets

to the sequence is one. The model remains correct even if any two consecutive modules in the sequence swap their order.

4.5. Flow Model

For the network flow approach, we consider each net $e_i$ as a pipe. A set of saturated pipes forms a bottleneck of the flow. The union of the saturated pipes becomes the cut of the circuit. In such a model, we set the capacity of the pipe equal to the corresponding connectivity $c_i$ [52].

Let $x_{iu}$ be the amount of flow from pin $v_i$ to net $e_u$ and $x_{uj}$ be the amount of flow from net $e_u$ to pin $v_j$ (Fig. 16). The total flow injected into the net should be smaller than or equal to its capacity, and the incoming flow is equal to the outgoing flow, i.e.,

$$\sum_{v_i \in e_u} x_{iu} \le c_u, \tag{27}$$

$$\sum_{v_i \in e_u} x_{iu} - \sum_{v_i \in e_u} x_{ui} = 0. \tag{28}$$
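Returning to the clique and star models of Sections 4.2 and 4.3, the following sketch expands one multiple pin net into two pin nets and evaluates the resulting cut cost for different pin splits. It is an illustration under stated assumptions: the clique uses the $2/|e_i|$ scaling mentioned in the text, while the star edge weight (taken here to be $c_i$) and all function names are choices made for this example, since the text does not fix them.

```python
from fractions import Fraction
from itertools import combinations

def clique_edges(pins, c):
    """Clique model: one two-pin net per pair of pins, each weighted 2c/|e|."""
    w = Fraction(2 * c, len(pins))
    return {frozenset(pair): w for pair in combinations(pins, 2)}

def star_edges(pins, c, dummy):
    """Star model: a dummy module joined to every pin (edge weight c is assumed)."""
    return {frozenset((dummy, v)): c for v in pins}

def graph_cut(edges, left):
    """Total weight of two-pin nets with exactly one endpoint in `left`."""
    return sum(w for e, w in edges.items() if len(e & left) == 1)

def star_cut(pins, c, left_pins, dummy="dummy"):
    """Star cost when the dummy module sits on whichever side is cheaper."""
    edges = star_edges(pins, c, dummy)
    return min(graph_cut(edges, set(left_pins)),
               graph_cut(edges, set(left_pins) | {dummy}))

# A hypothetical four-pin net of unit connectivity.
PINS = [1, 2, 3, 4]
clique = clique_edges(PINS, 1)
total_clique_weight = sum(clique.values())      # (|e| - 1) * c
uneven_clique = graph_cut(clique, {1})          # 1 pin versus 3 pins
even_clique = graph_cut(clique, {1, 2})         # 2 pins versus 2 pins
uneven_star = star_cut(PINS, 1, {1})
even_star = star_cut(PINS, 1, {1, 2})
```

In both models the even split is charged more than the uneven one, whereas the hypergraph cut count would charge $c_i$ in either case. This is the bias against even partitioning of pins described above.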
group migration, network flow, nonlinear programming, Lagrangian, and clustering methods. The group-migration approach is a popular method in practice due to its flexibility and effectiveness. The network flow method gives us a different view of the partitioning problem by transforming the minimization of the cut count into the maximization of the flow via a duality in linear programming. This approach derives excellent results with respect to certain objective functions. The nonlinear programming method provides a global view of the whole problem. The Lagrangian method is a useful approach for performance driven problems. Finally, we depict a clustering method for the partitioning.

In most cases, we illustrate the method in question using two-way partitioning as the target problem. However, many methods can be extended to other problems or different objective functions. For example, we can apply group migration to multiway [98, 99] or multiple level partitioning problems [68, 67] with modification to the cost of the moves. Furthermore, some methods may be combined to solve a problem. For example, we can use clustering to reduce the size of an input circuit and then use group migration to find a partition of the reduced circuit with much greater efficiency [24, 59]. In fact, this strategy derives the best results in terms of CPU time and cut count in a recent benchmark [2].

5.1. Branch and Bound Method

The branch and bound method is an exhaustive search technique that may be effectively applied to the min-cut problem with size constraints for small cases. In the branch and bound process, the modules are first ordered in a sequence. For each module, we try placing it on either side of the cut. The process can be represented by a complete binary tree with $|V|$ levels. The root of the tree is the first module in the sequence. The nodes in the $k$th level of the tree correspond to the $k$th module in the sequence. The two branches at each node represent the two trials where the $k$th module is placed on each of the two different sides. A path in the tree from the root to a leaf corresponds to one assignment for the partition.

We use a depth first search approach to traverse the binary tree. We prune the search space according to the size constraint and a partial cut count. In the binary tree, a node at level $k$ along with the path from the root to the node represents a partition assignment of the first $k$ modules. Let $V_1$ and $V_2$ be the two vertex sets of the partitions of the first $k$ modules. If $S(V_i) > S_u$ for $i = 1$ or $2$, the size constraint is violated, and there is no need to proceed. Thus, we prune the branches below.

We also use a partial cut count to prune the binary tree. The cut of the partial partition is expressed as $E(V_1, V_2) = \{e_i \mid |e_i \cap V_1| > 0 \text{ and } |e_i \cap V_2| > 0\}$. The partial cut count is described as $C(V_1, V_2) = \sum_{e_i \in E(V_1, V_2)} c_i$. If the partial cut count $C(V_1, V_2)$ is larger than the cut count of a known solution, the partition results below this node are going to be worse than the existing solution. We prune the branches of such a node.

Complexity of the Method Suppose the circuit has unit size $s_i = 1$ on each module and the constraint requires an even size $S_l = S_u = |V|/2$ (assuming that $|V|$ is even). Applying Stirling's approximation [63], we have the number of possible partitions:

$$\frac{|V|!}{\left((|V|/2)!\right)^2} \approx \sqrt{\frac{2}{\pi |V|}}\; 2^{|V|}. \tag{29}$$

Although the number of combinations is huge, we have found that the application to small circuits is practical. We improve the efficiency of the pruning by ordering the modules according to their degrees, i.e., the number of nets connecting to the modules, in descending order. With an elegant implementation, we can find optimal solutions when the number of modules is small, e.g., $|V| \le 60$.

5.2. Dynamic Programming for a Serial and Parallel Graph

For the special case where the circuit can be
represented by a serial and parallel graph of unit module size, we can find a minimum two-way partition (V1, V2) with size constraints in polynomial time. In this section, we first describe the serial and parallel graph. We then depict a dynamic programming algorithm that solves the partitioning problem on this class of graphs. We assume that all modules are of unit size, i.e., si=1.

A serial and parallel graph can be constructed from smaller serial and parallel graphs by a serial or parallel process. Each serial and parallel graph has a source module vs and a sink module vt. A graph G(V, E) with two modules, V={vs, vt}, and one edge, E={e}, e={vs, vt}, is a basic serial and parallel graph. A serial and parallel graph is constructed from the basic graph by a series of serial and parallel processes.

Serial Process Given two serial and parallel graphs, G1(V1, E1) and G2(V2, E2), we construct a serial and parallel graph G(V, E) by merging the sink module vt1 of G1 and the source module vs2 of G2 (Fig. 17(a)). The source module vs1 of graph G1 becomes the source module of graph G, i.e., vs=vs1. The sink module vt2 of graph G2 becomes the sink module of graph G, i.e., vt=vt2.

Parallel Process Given two serial and parallel graphs, G1(V1, E1) and G2(V2, E2), we construct a serial and parallel graph G(V, E) by merging the source module vs1 of G1 and the source module vs2 of G2 and by merging the sink module vt1 of G1 and the sink module vt2 of G2 (Fig. 17(b)). The merged source module and merged sink module become the source module vs and the sink module vt of graph G, respectively.

Dynamic Programming The dynamic programming algorithm performs a bottom up process according to the construction of the serial and parallel graph. It starts from the basic serial and parallel graph. For each graph G(V, E), we derive two tables.

a(i, j): the minimum cut count with i modules on the left hand side and j modules on the right hand side under the condition that source module vs is on the left hand side and sink module vt is on the right hand side.

b(i, j): the minimum cut count with i modules on the left hand side and j modules on the right hand side under the condition that both source module vs and sink module vt are on the left hand side.

Let graph G(V, E) be constructed with G1(V1, E1) and G2(V2, E2) by one of the serial and parallel processes. Let a1, b1 be the tables of graph G1 and a2, b2 be the tables of graph G2. We construct the tables a, b of graph G(V, E) as follows.

Table Formulas for Parallel Process

\[ a(i,j) = \min_{k+m=|V_2|} \; a_1(i+1-k,\; j+1-m) + a_2(k,m), \quad \forall\, i+j=|V|, \tag{30} \]

\[ b(i,j) = \min_{k+m=|V_2|} \; b_1(i+2-k,\; j-m) + b_2(k,m), \quad \forall\, i+j=|V|. \tag{31} \]
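The parallel-process formulas (30) and (31) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `parallel_tables`, the dictionary representation of the tables (a missing key means the split is infeasible), and the explicit size arguments `n1` and `n2` are all choices made for this example.

```python
from math import inf

# Tables of the basic serial and parallel graph (two modules, one edge).
# Keys are (modules on the left, modules on the right); values are cut counts.
BASIC_A = {(1, 1): 1}   # source left, sink right: the single edge is cut
BASIC_B = {(2, 0): 0}   # source and sink both left: nothing is cut


def parallel_tables(a1, b1, n1, a2, b2, n2):
    """Combine the tables of G1 and G2 for the parallel process.

    n1 and n2 are |V1| and |V2|. The merged graph has n1 + n2 - 2 modules
    because the two sources are merged and the two sinks are merged.
    """
    n = n1 + n2 - 2
    a, b = {}, {}
    for i in range(n + 1):
        j = n - i                       # i + j = |V|
        best_a = best_b = inf
        for k in range(n2 + 1):
            m = n2 - k                  # k + m = |V2|
            # Eq. (30): the shared source is counted once on the left and
            # the shared sink once on the right.
            best_a = min(best_a,
                         a1.get((i + 1 - k, j + 1 - m), inf) + a2.get((k, m), inf))
            # Eq. (31): both shared terminals are on the left.
            best_b = min(best_b,
                         b1.get((i + 2 - k, j - m), inf) + b2.get((k, m), inf))
        if best_a < inf:
            a[(i, j)] = best_a
        if best_b < inf:
            b[(i, j)] = best_b
    return a, b


# Two basic graphs in parallel: two modules joined by two parallel edges.
a, b = parallel_tables(BASIC_A, BASIC_B, 2, BASIC_A, BASIC_B, 2)
print(a, b)
```

For two basic graphs in parallel, separating the source from the sink cuts both edges, while keeping them together cuts none, which is what the resulting tables record.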
Table Formula for Serial Process

\[ a(i,j) = \min\Big( \min_{k+m=|V_2|} a_1(i-k,\; j+1-m) + b_2(k,m),\;\; \min_{k+m=|V_2|} b_1(i+1-k,\; j-m) + a_2(k,m) \Big), \quad \forall\, i+j=|V|, \tag{32} \]

\[ b(i,j) = \min\Big( \min_{k+m=|V_2|} a_1(i-k,\; j+1-m) + a_2(m,k),\;\; \min_{k+m=|V_2|} b_1(i+1-k,\; j-m) + b_2(k,m) \Big), \quad \forall\, i+j=|V|. \tag{33} \]

For table a(i, j), we try all combinations of tables a1 and b2 and all combinations of tables b1 and a2. For the combinations of tables a1 and b2, the merged module (by merging vt1 and vs2) is on the right hand side. For the combinations of tables b1 and a2, the merged module is on the left hand side. For table b(i, j), we try all combinations of tables a1 and a2 and all combinations of tables b1 and b2. For the combinations of tables a1 and a2, the merged module is on the right hand side. In terms of G2, its source module vs2 is on the right hand side and its sink module vt2 is on the left hand side. Thus, the indices of table a2 are reversed, i.e., a2(m, k) instead of a2(k, m). For the combinations of tables b1 and b2, the merged module is on the left hand side.

5.3. Group Migration Algorithms

The group migration algorithm was first proposed by Kernighan and Lin [60] in 1970. Since then, many variations [15, 26, 27, 33, 39, 45, 49, 84, 97–99, 108, 111, 116] have been reported to improve the efficiency and effectiveness of the method. Today, it is still a popular method in practice.

The probability of finding the optimum solution in a single trial drops exponentially as the size of the circuit increases [60]. Using the original version, Kernighan and Lin showed that the probability of obtaining an optimal solution is a function of the problem size, $p(|V|) = 2^{-n/30}$. In other words, if the circuit size is large, then the heuristic Kernighan–Lin algorithm is unlikely to jump out of local minima, and so the optimum solution will not be found. The progress of the method has definitely pushed the envelope further.

In this section, we concentrate on two-way min-cut with size constraints. The method is flexible and can be extended to other partitioning problems with modifications of the moves and the cost function.

The algorithm performs a series of passes. At the beginning of a pass, each module is labeled unlocked. Once a module is shifted, it becomes locked in this pass. The group migration algorithm iteratively interchanges a pair of unlocked modules or shifts a single module to a different side with the largest reduction (gain) of the cost function. This continues until all modules are locked. The lowest cost along the whole sequence of swapping is recorded. The group migration takes the subsequence that produces the lowest cut count and undoes the moves after the point of the lowest cost. This partitioning result is then used as the initial solution for the next pass. The algorithm terminates when a pass fails to find a result with a cost lower than the cost of the previous pass.

Group Migration Algorithm Input: Hypergraph H(V, E) and an initial partition. Cost function and size constraints.

1. One pass of moves.
   1.1. Choose and perform the best move.
   1.2. Lock the moved modules.
   1.3. Update the gain of unlocked modules.
   1.4. Repeat Steps 1.1–1.3 until all modules are locked or no move is feasible.
   1.5. Find and execute the best subsequence of the move. Undo the rest of the sequence.
2. Use the previous result as an initial partition.
3. Repeat the pass (Steps 1 and 2) until there is no more improvement.

Figure 18 illustrates the cost of a sequence of moves. This algorithm escapes from local optima by a whole sequence of the moves even when a single move may produce a negative gain.
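The pass structure of the group migration algorithm can be sketched as follows. This is a minimal illustration under stated assumptions rather than any published implementation: it uses only single-module shifts, unit module sizes, cut count as the cost, and a hypothetical `min_size` parameter as the size constraint. It also recomputes the cut count from scratch for every candidate move, which keeps the code short but is far slower than a practical implementation. The names `cut_count`, `one_pass` and `group_migration` are chosen for this example.

```python
def cut_count(nets, side):
    """Number of nets with pins on both sides; side[v] is 0 or 1."""
    return sum(1 for net in nets if len({side[v] for v in net}) == 2)


def one_pass(nets, side, min_size):
    """Step 1: shift unlocked modules one by one, then keep the best prefix."""
    side = dict(side)
    locked = set()
    best_cost, best_side = cut_count(nets, side), dict(side)
    while True:
        base = cut_count(nets, side)
        values = list(side.values())
        counts = [values.count(0), values.count(1)]
        best_move, best_gain = None, None
        for v in sorted(side):
            # Skip locked modules and shifts that violate the size constraint.
            if v in locked or counts[side[v]] - 1 < min_size:
                continue
            side[v] ^= 1                      # try the shift
            gain = base - cut_count(nets, side)
            side[v] ^= 1                      # undo the trial
            if best_gain is None or gain > best_gain:
                best_move, best_gain = v, gain
        if best_move is None:
            break                             # Step 1.4: nothing left to move
        side[best_move] ^= 1                  # Step 1.1: gain may be negative
        locked.add(best_move)                 # Step 1.2
        if base - best_gain < best_cost:      # Step 1.5: remember best prefix
            best_cost, best_side = base - best_gain, dict(side)
    return best_side, best_cost


def group_migration(nets, side, min_size):
    """Steps 2 and 3: repeat passes until a pass brings no improvement."""
    side = dict(side)
    cost = cut_count(nets, side)
    while True:
        new_side, new_cost = one_pass(nets, side, min_size)
        if new_cost >= cost:
            return side, cost
        side, cost = new_side, new_cost


# Two triangles joined by one net, starting from a poor partition.
nets = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
start = {0: 0, 1: 1, 2: 1, 3: 0, 4: 0, 5: 1}
print(group_migration(nets, start, min_size=2))
```

Returning a copy of the partition at the lowest-cost point is equivalent to undoing the moves made after that point.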
In the following, we discuss variations of several parts in the process: basic moves (Step 1.1), data structure, gains (Steps 1.1 and 1.3). At the end of this subsection, we introduce a net based move and a simulated annealing approach.

5.3.1. Basic Moves

Basic moves cover the shifting of a single module and the swapping of a pair of modules. A swapping can be conceived as two consecutive shifts, however, with consideration of the mutual effect between the two shifts.

(i) Module Shifting For each unlocked module, we check its gain: the cost function reduction by shifting the module to a different side assuming that the rest of the modules are fixed. To select the best module to shift, we order on each side the modules according to their shift gains. If the size constraints are violated after the shift, the move is not feasible. We search for the best feasible module to move [40].

(ii) Pairwise Swapping We exchange two modules in two vertex sets of the partition. Note that the gain of the swap is not equal to the sum of the gains of two shifts. The mutual effect between the two modules needs to be included when we derive the gain. Thus, the best pair may not be the two modules on the

The choice of data structure strongly depends on the cost functions, gains, and the characteristics of VLSI circuitry. A sorting structure such as a heap or AVL tree is a natural choice to sort for the top modules. However, when the gains take only a very limited number of distinct values, an array structure can simplify the coding and reduce the complexity.

(i) Heap or AVL Tree We can use a heap or AVL tree to sort the modules according to their shift gain. Each side of the partition keeps a heap. The top of the heap is the module of the maximum gain. The sorting of the modules takes O(|V| log |V|) operations.

(ii) Array (Bucket) of Linked Lists Figure 19 illustrates a bucket list data structure. The gain is transformed to the index of the bucket [40]. Modules of the same gain are stored in the same bucket by a linked list. A bucket is an effective data structure when the objective function is the cut count. The gain of cut count is limited by the maximum degree of the modules, i.e.,

\[ deg_{max} = \max_{v_i \in V} \sum_{e \in E(\{v_i\})} c_e. \]

Thus, the dimension of the bucket is set to be $2\,deg_{max}$.

For VLSI applications, the degree of modules is much smaller than the number of modules. Thus, the dimension of the bucket is small. It is very efficient to search and revise the module order in the bucket structure. In fact, it is proven that using the bucket structure and cut count as the objective
function, it takes linear time proportional to the total number of pins to perform each pass [40].

5.3.3. Gains

In this subsection, we use cut count as the objective function. The extension to other cost functions is possible. However, we may lose efficiency.

(i) Shift Gain We use shift model for multiple pin nets. Given a module vi, we check the set E({vi}) of nets connecting to this module. The contribution of each net e ∈ E({vi}) by shifting module vi is the gain ge(vi) of the net with respect to module vi. The gain g(vi) of module vi is the total gains of all its adjacent nets, i.e.,

\[ g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i). \]

(ii) Swap Gain The swap gain is the sum of the gains of two modules vi and vj, deducting the effect on common nets, i.e.,

\[ g(v_i) + g(v_j) - \sum_{e \in E(\{v_i\}) \cap E(\{v_j\})} \big( g_e(v_i) + g_e(v_j) \big). \]

(iii) Weights of Multipin Nets The sequence of the move depends much on the gain calculation. For a circuit of 1,000,000 modules, suppose the degree of most modules is less than 100 and each net is of unit weight. We have roughly 1,000,000 modules/200 gain levels = 5,000 modules per gain level. To differentiate these 5,000 modules, we have to adjust the weight of multiple pin nets.

(iii) (a) Levels with Priority The first level gain is identical to the shift gain of cut count. The second level gain is equal to the number of nets that have one more pin on the same side. Thus, the kth level gain is equal to the number of nets that have k more pins on the same side [65]. The pins on the other side will increase by one after the module is shifted. Thus, the negative gain of level k is contributed by the nets with k − 1 pins on the other side.

Let us assume that module vi is in vertex set V1 to simplify the notation. For each net ej ∈ E({vi}), we denote by kj=|ej ∩ V1| the number of pins in V1. Let us define E(+, i, k) to be the set of nets ej ∈ E({vi}) with kj=k+1 pins in V1 (the extra one is used to count module vi itself) and nonzero pins in V2, i.e., |ej| > kj. And E(−, i, k) to be the set of nets ej ∈ E({vi}) with no other pins in V1 and k − 1 pins in V2, i.e., |ej|=k and kj=1. Then, the kth level gain of module vi, gi(k), is the weight difference of the two sets, E(+, i, k) and E(−, i, k).

\[ g_i(k) = \sum_{e \in E(+,i,k)} c_e - \sum_{e \in E(-,i,k)} c_e \tag{34} \]

\[ E(+,i,k) = \{ e_j \mid e_j \in E(\{v_i\}),\; k_j = k+1,\; |e_j| > k_j \} \tag{35} \]

\[ E(-,i,k) = \{ e_j \mid e_j \in E(\{v_i\}),\; k_j = 1,\; |e_j| = k \} \tag{36} \]
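The shift gain (i) and swap gain (ii) defined in this subsection can be checked with a short sketch. It is illustrative only and rests on assumptions made for the example: nets have unit weight, the cost is the cut count, the two swapped modules start on different sides, and the helper names `net_gain`, `shift_gain`, `swap_gain` and `cut_count` are hypothetical.

```python
def cut_count(nets, side):
    """Number of nets with pins on both sides; side[v] is 0 or 1."""
    return sum(1 for net in nets if len({side[v] for v in net}) == 2)


def net_gain(net, v, side):
    """Gain g_e(v) contributed by one net when module v alone is shifted."""
    if len(net) < 2:
        return 0
    same = sum(1 for u in net if side[u] == side[v])  # pins on v's side, v included
    if same == 1:
        return 1       # v is the only pin on its side: the net becomes uncut
    if same == len(net):
        return -1      # the whole net is on v's side: the net becomes cut
    return 0


def shift_gain(nets, v, side):
    """g(v): total gain over all nets connected to v."""
    return sum(net_gain(e, v, side) for e in nets if v in e)


def swap_gain(nets, vi, vj, side):
    """g(vi) + g(vj) minus the per-net gains on the nets shared by vi and vj."""
    common = [e for e in nets if vi in e and vj in e]
    correction = sum(net_gain(e, vi, side) + net_gain(e, vj, side) for e in common)
    return shift_gain(nets, vi, side) + shift_gain(nets, vj, side) - correction


nets = [(0, 1), (0, 2, 3), (1, 3), (2, 3), (0, 3)]
side = {0: 0, 1: 0, 2: 1, 3: 1}
print(shift_gain(nets, 0, side), shift_gain(nets, 3, side), swap_gain(nets, 0, 3, side))
```

In this example each of the two shifts has a positive gain on its own, yet swapping the pair increases the cut count. A net shared by the two modules stays cut after the swap, so its per-net gains have to be removed from the sum.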
We compare the modules with a priority on the lower level gain. In other words, we compare the first level first. If the modules are equal at the first level gain, we then compare the second level and so on. In practice, we limit the number of levels by a threshold, e.g., l ≤ 3.

(iii) (b) Probabilistic Gain In probabilistic gain model [37], each module vi is assigned a weight p(vi). The weight p(vi) is a function of the gain g(vi) of module vi to reflect the belief level (potential) that the shift of module vi will be executed at the end of the pass. Thus, if module vi is unlocked,

\[ p(v_i) = f(g(v_i)). \tag{37} \]

Otherwise, p(vi)=0. Figure 20 illustrates function f, which increases monotonically. The slope within g0 and gup amplifies the difference of gains. The slope is clamped at two ends pmax and pmin (0 ≤ pmin < pmax ≤ 1) which represent the maximum potential that the module will shift or stay.

FIGURE 20 Function of probabilistic gain.

For each net e ∈ E({vi}), its contribution ge(vi) to the gain of module vi is the tendency that the whole net will shift with module vi to the other side. To simplify the notation, let us assume that module vi is in V1. Thus, we have the following expression.

\[ g_e(v_i) = c_e \Big( \prod_{j \ne i,\; v_j \in e \cap V_1} p(v_j) \;-\; \prod_{v_j \in e \cap V_2} p(v_j) \Big) \tag{38} \]

where $\prod_{v_j \in S} p(v_j) = 1$ if S is an empty set. The first term $\prod_{j \ne i,\, v_j \in e \cap V_1} p(v_j)$ in the parentheses is the potential that all the pins will shift with module vi to V2. Hence, $c_e \prod_{j \ne i,\, v_j \in e \cap V_1} p(v_j)$ is the expected gain if module vi is shifted. The second term $\prod_{v_j \in e \cap V_2} p(v_j)$ is the potential that the pins in V2 will shift to V1. Thus, $c_e \prod_{v_j \in e \cap V_2} p(v_j)$ is the expected loss if module vi is shifted.

The gain of a module vi is the total gains of the adjacent nets with respect to this module, i.e.,

\[ g(v_i) = \sum_{e \in E(\{v_i\})} g_e(v_i). \tag{39} \]

Net gain ge(vi) and module potential p(vi) are mutually dependent. We derive the values via iterations. Initially, we use the plain shift gain (by cut count) to derive the potential p(vi)=f(g(vi)). From these initial potentials, we derive the probabilistic net gain. The net gain is then used to derive the module gain. In practice, we stop after a limited number of cycles, e.g., two iterations ([37]). Note that there is no guarantee that the iteration will converge.

After each move, the associated module potential and probabilistic net gains are updated and the plain cut count is recorded. Exact cut count is used when we select the subsequence of move to execute.

It has been shown via benchmarks released by ACM/SIGDA, the probabilistic gain model produces excellent partitioning results; it outperforms the other gain models by wide margins.

5.3.4. Net-based Move

The net based process [115, 32] is similar to the module based approach except that all operations are based on the concept of the critical and complementary critical sets. The main differences are (1) Instead of a single module, each move now shifts one critical or complementary critical set, depending on the type of objective function. For convenience, we say a move is initiated by a net eu if this move is composed of shifting the critical or complementary critical set associated with eu. (2) The locking mechanism is operated on a net, that is, if the critical or complementary critical set of a
net has been moved then all the moves initiated by this net will be prohibited thereafter.

Given a net eu and a vertex set Vb, let us define the critical set of net eu with respect to set Vb as

to the annealing temperature. As temperature drops, we gradually increase to enforce the size balance.
\[ \sum_{p=1}^{|V|} x_{ij}^{p} + \sum_{p=1}^{|V|} x_{ji}^{p} \le c_{ij}, \quad 1 \le i, j \le |V|. \tag{54} \]

We transform the above linear programming problem to its dual expression by assigning dual variables $\lambda_i^{p}$ to module vi with respect to commodity p Eq. (53), and distance dij to net eij Eq. (54), then we have:

\[ \text{Obj}: \; \min \sum_{e_{ij} \in E} c_{ij} d_{ij} \tag{55} \]

subject to

\[ d_{ij} \ge \lambda_i^{p} - \lambda_j^{p}, \quad 1 \le i, j, p \le |V| \tag{56} \]

\[ \frac{1}{2} \sum_{p=1}^{|V|} \sum_{i=1,\, i \ne p}^{|V|} \big( \lambda_i^{p} - \lambda_p^{p} \big) \ge 1 \tag{57} \]

The Properties of Shadow Prices The shadow price dij can be viewed as bidirectional, i.e., dij=dji. It represents the distance of net eij, which corresponds to the cost to transmit flow through eij. Variable $\lambda_i^{p}$ is the potential of module vi with respect to commodity p.

From constraints (56), (57), we can derive two properties for distance function dij and potential $\lambda_i^{p}$ [71].

Property I: Triangular Inequality The distance metric dij satisfies the triangular inequality:

\[ d_{ij} + d_{jk} \ge d_{ik}, \quad \forall\, v_i, v_j, v_k \in V \tag{58} \]

Property II: Potential Function The term $\lambda_i^{p} - \lambda_p^{p}$ in expression (56) is equal to the shortest distance between modules vi and vp based on net distances dij. In fact, from triangular inequality, we obtain $\lambda_i^{p} - \lambda_p^{p} = d_{ip}$.

We normalize the objective function (55) with the left hand side terms of inequality (57). The objective function can be expressed as:

\[ \text{Obj}: \; \min \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{\frac{1}{2} \sum_{p=1}^{|V|} \sum_{i=1,\, i \ne p}^{|V|} \big( \lambda_i^{p} - \lambda_p^{p} \big)} = \frac{\sum_{e_{ij} \in E} c_{ij} d_{ij}}{\frac{1}{2} \sum_{p=1}^{|V|} \sum_{i=1,\, i \ne p}^{|V|} d_{ip}} \tag{59} \]

In the solution of linear programming problem (52)–(56), the nets with positive dij values partition V into vertex sets V1, V2, ..., Vk. More specifically, nets connecting modules in different sets, Vi, Vj, i ≠ j, have the same distance dij values (we use dij to denote the distance between vertex sets Vi and Vj when this does not cause confusion), while nets connecting only modules in the same subgraph have zero distance, dij=0 (Fig. 22). We can rewrite the denominator of the objective function and state the problem as follows.

Statement of Weighted Cluster Ratio Cut [103] Find the distance dij and the number of partition k with an objective function of weighted cluster ratio:

\[ \min_{d_{ij},\,k} WC(V_1, V_2, \ldots, V_k) = \min_{d_{ij},\,k} \frac{\sum_{i=j+1}^{k} \sum_{j=1}^{k-1} d_{ij}\, C(V_i, V_j)}{\sum_{i=j+1}^{k} \sum_{j=1}^{k-1} d_{ij}\, S(V_i)\, S(V_j)} \tag{60} \]

where distance dij is subject to the property of triangular inequality.

FIGURE 22 Distance between clusters.

According to the mechanism of the duality, the objective functions of the primal and dual formulations are equal when the solution is optimal [25].

THEOREM 5.1 For feasible solutions, we have the inequality f ≤ WC(V1, V2, ..., Vk). The equality holds when the solution is optimal, i.e., the maximum uniform multicommodity flow equals the
minimum weighted cluster ratio of any cut,

\[ \max_{x_{ij}} f = \min_{d_{ij},\,k} WC(V_1, V_2, \ldots, V_k). \]

Expression (60), weighted cluster ratio [103], is similar to cluster ratio with a weighted metric dij. In general, the solution for the minimum weighted cluster ratio does not directly correspond to the partition of optimum cluster ratio. However, if distance dij is a constant value between all pairs of vertex sets Vi and Vj then the weighted cluster ratio provides the solution for cluster ratio.

When the nets with positive distance dij form a two-way partition, we can show that the partition defines the ratio cut. When the nets with positive distances form a k-way partition with k ≤ 4, we also find that there exists a two-way partition that again defines the ratio cut [28].

THEOREM 5.2 Let net set D={eij | dij > 0} define a cut that separates the circuit into k disconnected subsets. If k ≤ 4, then there exists a ratio cut that is a subset of D.

difference between the cut from module vi ∈ V1 to module vj ∉ V1. The potential of each module vi is denoted by pi. For module vi in V1, pi=1, and for modules vi in $\bar{V}_1$, pi=0. Thus all nets $e_{ij} \in E(V_1 \to \bar{V}_1)$ have wij=1. The remaining nets have wij=0.

With respect to the directed cut $E(V_2 \to \bar{V}_2)$, we use uji with a reversed subscript ji to denote the potential difference between the cut from module vi ∈ V2 to module vj ∉ V2 (Fig. 23). The potential of each module vi is denoted by qi. For modules vi in $\bar{V}_2$, qi=1, and for modules vi in V2, qi=0. The potential difference uji has a reverse direction with net eij because we set the potential on the $\bar{V}_2$ side high and the potential on the V2 side low. All nets $e_{ij} \in E(V_2 \to \bar{V}_2)$ have uji=1. The remaining nets have uji=0.

Primal Linear Programming Formulation The problem is to minimize the total weight of crossing nets:

\[ \text{Obj}: \; \min \sum_{e_{ij} \in E} c_{ij} w_{ij} + \sum_{e_{ij} \in E} c_{ji} u_{ij} \tag{61} \]

\[ p_t = 0 \tag{67} \]

\[ q_t = 0 \tag{68} \]
it. We then apply the maximum flow algorithm on the constructed replication graph to derive an optimum replication cut. The optimality of the derived replication cut is proved by using a network flow approach.

Construction of Replication Graph Given a circuit G(V, E) and modules vs and vt, we construct another circuit G′(V′, E′) where |V′|=|V| with each module v′i in V′ corresponding to a module vi in V, and |E′|=|E| with each directed net in E′ in the reverse direction of the corresponding net eij in E. We create a super source module and a super sink module, with nets from the super source to vs and v′s and nets from vt and v′t to the super sink, all of infinite capacity, as shown in Figure 24. From every module vi in V except vs and vt, we add a directed net of infinite capacity to the corresponding module v′i in V′. We refer to the combined circuit as the replication graph.

Polynomial-time Algorithm The optimum replication cut problem with respect to module pair vs and vt and without size constraints can be solved by a maximum-flow minimum-cut solution of the replication graph with the super source as the source and the super sink as the sink of the flow (Fig. 24). Suppose the maximum-flow minimum-cut finds partition $(X, \bar{X})$ of V with $v_s \in X$ and $v_t \in \bar{X}$ and partition $(X', \bar{X}')$ of V′ with $v'_s \in X'$ and $v'_t \in \bar{X}'$. Then a replication cut (V1, V2) of the original circuit with V1=X, $V_2 = \{ i \mid i' \in \bar{X}' \}$ and R=V − V1 − V2 is an optimum solution. Note that V2 is derived from the cut in vertex set V′. To simplify the notation, we shall use $(X, \bar{X}')$ to denote the derived replication cut of G.

Example Given a circuit in Figure 25, its replication graph is constructed as shown in Figure 26. The maximum-flow minimum-cut of the replication graph derives $(X, \bar{X}) = (\{v_s, v_a\}, \{v_b, v_c, v_t\})$ and $(X', \bar{X}') = (\{v'_s, v'_a, v'_b, v'_c\}, \{v'_t\})$ with a flow amount of 5 (Fig. 26). Thus the sets V1={vs, va} and V2={vt} define an optimum replication cut R(V1, V2) with R={vb, vc} and a cut cost equal to 5 (Fig. 27).

The network flow approach leads to the optimality of the solution as stated in the following theorem.

THEOREM 5.3 The replication cut $R(X, \bar{X}')$ derived from the transformed circuit generates the minimum replication cut count $C_R(X, \bar{X}')$ (expression (19)).

5.4.5. Heuristic Flow Algorithms

We introduce the heuristic approaches that accelerate the flow calculation and take advantage of the optimality properties of the flow methods. We first introduce an approach that utilizes the maximum flow minimum cut method for the min cut with
FIGURE 26 The constructed replication graph of the circuit shown in Figure 25.
ity to the vertex set in the other side. The result is sensitive to the choice of the seeds. We can make multiple trials and choose the best results. Other methods such as programming approach can serve as a guideline on the choice of the seeds [79, 80]. The method has been shown to derive excellent results with reasonable running time.

(ii) Approximation of Multiple Commodity Flow Based on the multicommodity flow formulation [103], we try to solve a multiple way partitioning by deriving approximate multiple commodity flow with a stochastic process [13, 55, 114, 117].

Given a circuit H(V, E), the flow increment Δ, and the distance coefficient α, the algorithm starts with procedure Saturate-Network to saturate the circuit with flows. A stochastic flow injection algorithm is adopted to reduce the computational complexity. Then, Select-Cut is activated to select a set of nets by the flow values to constitute a cut. The conversion from weighted ratio cut to cluster ratio cut is performed by a Select-Cut routine which selects the subset of the cut derived from Saturate-Network with a greedy approach.

Multiple Commodity Flow Approximation (H, Δ, α)

1. Iterate the following procedures
   1.1. Saturate-Network (H, Δ, α).
   1.2. Select-Cut (H)
   until the clustering results are satisfactory.
2. Output clustering result.

Procedure Saturate-Network (H, Δ, α)

1. Set the distance of each net e to be one.
2. While (H is connected) do 2.1 to 2.3.
   2.1. Randomly pick two distinct modules vs and vt.
   2.2. Find the shortest path between vs and vt.
   2.3. For each net e on the shortest path, let f(e) and de be the flow and distance of net e.
      2.3.1. If e is not saturated, increase f(e) by Δ and set de=exp(α f(e)/ce).
      2.3.2. If e is saturated, set de to be ∞.
3. Output E with flow information.

The initial distance of each net is one since there is no flow being injected (see the distance formulation in Step 2.3.1). Step 2.1 uses a random process with even distribution over all modules to pick two distinct modules, and Steps 2.2–2.3 inject Δ amount of flows along the shortest path between the modules. In Steps 2.3.1–2.3.2, the distances of the nets whose flow has been increased are recomputed using an exponential function de=exp
(α f(e)/ce) to penalize the congested nets, where de and f(e) are the distance and flow of net e, respectively. Steps 2.1–2.3 are iteratively executed until a pair of modules are chosen where all possible paths between them are saturated by flows. These saturated nets identify a partition of the circuit.

Figure 28 shows a sample circuit saturated by flows after executing Saturate-Network with Δ=0.01 and α=10. The flow values are shown by the numbers right beside each net. The dashed lines indicate the cut lines along the set of saturated nets to form the three clusters. These saturated nets define an approximate weighted cluster ratio cut, which is a potential set of nets for a selection of cluster ratio cut.

5.5. Programming Approaches

For programming approaches [7, 18, 35, 41, 46, 44], we adopt two way minimum cut with size constraints as the target problem. We assume that the nets are two pin nets and thus, the circuit can be described as a graph G(V, E). We also assume the modules are of unit size, i.e., si=1.

The two way partition (V1, V2) is represented by a linear placement with only two slots at coordinates −1 and 1. For an even sized partition, half of the modules are assigned to each slot. Let xi denote the coordinate of module vi. If vi ∈ V1, xi=1, else xi=−1 for vi ∈ V2. The cut count can be expressed as follows.

\[ C(V_1, V_2) = \frac{1}{4} \sum_{e_{ij} \in E} c_{ij} (x_i - x_j)^2 = \frac{1}{4} X^{\top} B X \tag{82} \]

where X is a vector of xi, and $X^{\top}$ is the transpose of vector X. Matrix B has its entry bij=−cij if i ≠ j, else $b_{ii} = \sum_{1 \le j \le |V|} c_{ij}$. Suppose we relax the slot constraint by enforcing only the rules of the gravity center and the norm. The constraint of vector X can be expressed as:

\[ \mathbf{1}^{\top} X = 0, \tag{83} \]

\[ X^{\top} X = |V|. \tag{84} \]

Matrix B is symmetric and diagonally semidominant. Thus, it is positive semidefinite, i.e., all eigenvalues are nonnegative. And its eigenvectors are orthogonal. Let us order its eigenvalues from
small to large, i.e., $\lambda_0 \le \lambda_1 \le \cdots \le \lambda_{|V|-1}$. The smallest eigenvalue $\lambda_0 = 0$ with its eigenvector $X_0 = \mathbf{1}$. The second eigenvalue $\lambda_1$ is nonnegative with its eigenvector orthogonal to the first eigenvector, i.e., $X_0^{\top} X_1 = \mathbf{1}^{\top} X_1 = 0$. Therefore, the second eigenvector $X_1$ is an optimal solution to objective function (82) with constraints (83) [46]. Since $X^{\top} X = |V|$ (Eq. (84)), the solution is

\[ \frac{1}{4} X_1^{\top} B X_1 = \frac{1}{4} \lambda_1 X_1^{\top} X_1 = \frac{1}{4} \lambda_1 |V|, \tag{85} \]

which is a lower bound of the min-cut problem.

To push for a higher lower bound, we can adjust the diagonal term of matrix B by adding constants di. Let

\[ \tilde{C}(V_1, V_2) = C(V_1, V_2) + \frac{1}{4} \sum_{1 \le i \le |V|} d_i x_i^2 - \frac{1}{4} \sum_{1 \le i \le |V|} d_i = \frac{1}{4} \Big( X^{\top} \tilde{B} X - \sum_{1 \le i \le |V|} d_i \Big), \tag{86} \]

where matrix $\tilde{B}$ has its entry $\tilde{b}_{ij} = b_{ij}$ if i ≠ j, else $\tilde{b}_{ii} = b_{ii} + d_i$. Either xi=1 or xi=−1, the last two terms cancel each other. The modification thus does not alter the optimal partition solution.

The new nonlinear programming problem is to find the assignment of di to maximize the objective function [11]:

\[ \frac{1}{4} \Big( \tilde{\lambda}_1 |V| - \sum_{1 \le i \le |V|} d_i \Big) \tag{87} \]

where $\tilde{\lambda}_1$ is the second smallest eigenvalue of matrix $\tilde{B}$. The solution is an upper bound of the partition. It is larger than $\lambda_1$ in the sense that $\lambda_1$ can serve as an initial feasible solution to maximize expression (87).

Remarks The programming approach finds a global view of the problem [9, 79, 80, 118]. However, the formulation is very restricted. The extension to multiple pin nets and the incorporation of fixed modules will destroy the nice structure based on which we have the eigenvalue and eigenvector as optimal solutions. Therefore, it is difficult to utilize the approach recursively.

For a general case, we can view the problem as nonlinear programming with Boolean quadratic objective function. Nonlinear programming techniques are adopted to derive the results [16, 107].

5.6. A Lagrange Multiplier Approach for Performance Driven Partitioning

Lagrange multiplier is one useful tool for performance optimization. In this section, we demonstrate the usage of Lagrange multiplier for performance driven partitioning. The problem is to optimize the performance of a two-way partition (V1, V2) with retiming [86].

We first introduce a vector of binary variables to represent a partition. The performance-driven partitioning problem is thus represented by a Boolean quadratic programming formulation with nonlinear constraints. We then absorb the nonlinear constraints into the objective function as a Lagrangian. We use primal and dual subproblems to decompose the Lagrangian and derive the partitions. Lagrange multiplier is adjusted in each iteration via a subgradient method to monitor the timing criticality and improve the performance.

5.6.1. Programming Formulation with Lagrange Multiplier

We assume that the circuit can be represented by a graph G(V, E) with two pin nets and unit module size. The two-way partition is described by a vector $x = (x_{1,1}, \ldots, x_{1,n}, x_{2,1}, \ldots, x_{2,n})$, where $x_{b,i}$ is 1 if module vi is assigned to vertex set Vb, otherwise $x_{b,i}$ is 0. If modules vi and vj are in different vertex sets, the value of the term $x_{1,i} x_{2,j} + x_{2,i} x_{1,j}$ is equal to 1. This contributes one interpartition delay into the delay of the net eij. Let $g_l(x)$ denote the delay to register ratio of loop l. Delay ratio $g_l(x)$ can be written as the following formula:
\[ g_l(x) = \frac{d_l + \sum_{e_{ij} \in l} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j})}{r_l} \tag{88} \]

Given a path p, the total delay $h_p(x)$ of p is as follows:

\[ h_p(x) = d_p + \sum_{e_{ij} \in p} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) \tag{89} \]

To formulate the problem, we use an objective function of cut count:

\[ \min \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}), \tag{90} \]

subject to the following constraints:

C1 (Size Constraints)

\[ \sum_{i=1}^{|V|} x_{b,i}\, s_i \le S_u \quad \forall\, b \in \{1, 2\}. \tag{91} \]

C2 (Variable Assignment Constraints)

\[ \sum_{b=1}^{2} x_{b,i} = 1 \quad \forall\, v_i \in V. \tag{92} \]

C3 (Iteration Bound Constraints)

\[ g_l(x) \le \tilde{J} \quad \forall \text{ loop } l. \tag{93} \]

C4 (Latency Bound Constraints)

\[ h_p(x) \le \tilde{M} \quad \forall \text{ IO-critical path } p. \tag{94} \]

Actually, we don't need to consider all loops in C3. Because all loops are composed of simple loops, we have the following lemma:

LEMMA 1 Given a number $\tilde{J}$, if $g_l(x)$ is less than or equal to $\tilde{J}$ for any simple loop l, then $g_l(x)$ is less than or equal to $\tilde{J}$ for all loops l.

Let c and p represent the number of the simple loops and the number of IO-critical paths, respectively. Let $\lambda$ denote the vector $(\lambda_{g_1}, \ldots, \lambda_{g_c}, \lambda_{h_1}, \ldots, \lambda_{h_p})$. Using Lagrangian Relaxation [104], we absorb the constraints (93) and (94) into the objective function (90). The Lagrangian-relaxed problem is as follows.

\[ \max_{\lambda \ge 0} \; \min_{x} \; L(x, \lambda) \tag{95} \]

subject to constraints C1 and C2, where

\[ L(x, \lambda) = \sum_{e_{ij} \in E} c_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + \sum_{\forall \text{ simple loop } l} \lambda_{g_l} \big( g_l(x) - \tilde{J} \big) + \sum_{\forall \text{ IO-critical path } p} \lambda_{h_p} \big( h_p(x) - \tilde{M} \big) \tag{96} \]

(i) The Dual Problem Given vector x, we can represent (96) as a function of variable $\lambda$, i.e., $L_x(\lambda)$. Thus, the dual problem can be written as:

\[ \max_{\lambda \ge 0} L_x(\lambda) \tag{97} \]

(ii) The Primal Problem Let Fij and Qij denote the sets of the simple loops and IO-critical paths passing the net eij. The cost aij of net eij is composed of connectivity cij and the penalty of the timing constraints.

\[ a_{ij} = c_{ij} + \sum_{l \in F_{ij}} \frac{\lambda_{g_l}}{r_l} + \sum_{p \in Q_{ij}} \lambda_{h_p} \tag{98} \]

Given vector $\lambda$, we can represent (96) as a function of vector x, i.e., $L_{\lambda}(x)$. Thus, the primal problem can be rewritten as:

\[ \min_{x} L_{\lambda}(x) = \min_{x} \sum_{e_{ij} \in E} a_{ij} (x_{1,i} x_{2,j} + x_{2,i} x_{1,j}) + \beta \tag{99} \]

subject to C1 and C2, where $\beta$ represents the constant contributed by $\lambda$.

5.6.2. Subgradient Method using Cycle Mean Method

We solve the partitioning problem through primal and dual iterations on the Lagrangian. A Quadratic Boolean Programming, QBP, [16] is used to
solve the primal problem and generate a solution x (Step 2).

For the dual problem based on x, we select the set of loops and paths that violate the timing constraints as active loops and paths. The nets contained in the active loops or paths are termed active nets.

Active Loops and Paths Given a solution x, a loop l is called active if $g_l(x)$ is not less than $\tilde{J}$. A path p is called active if $h_p(x)$ is not less than $\tilde{M}$.

Active Nets Given a net e, we define e to be an active net if net e is covered by an active loop or an active path.

We call a minimum cycle mean algorithm [57] and an all-pairs shortest-paths algorithm to mark all the nets on active loops and paths, respectively (Step 3). For every net $e_{ij}$ on active paths, we record $q_{ij}$: the maximum path delay among all paths passing through $e_{ij}$. For every net $e_{ij}$ on active loops, we record $p_{ij}$: the maximum delay-to-register ratio among all loops passing through $e_{ij}$. We then calculate the subgradient on the marked nets and update the constants $a_{ij}$ for the next primal dual iteration (Steps 4-5). We increase the costs of active nets using the subgradient approach [104]. The iteration proceeds until the bounds of all loops and paths are within the given limits.

Algorithm using Lagrange Multiplier
Input: Constants $\tilde{J}$, $\tilde{M}$, 1.3, and an initial partition $(V_1^0, V_2^0)$.

1. Initialize $k = 1$; $a_{ij}^0 = c_{ij}$.
2. Run QBP [16] to find a partition $(V_1^k, V_2^k)$ with an objective to minimize the cut count $C(V_1^k, V_2^k) = \sum_{e \in E(V_1^k, V_2^k)} a_{ij}$.
3. Calculate the iteration and latency bounds of the partition $(V_1^k, V_2^k)$, respectively. Stop if the timing constraints are satisfied. Otherwise, revise $p_{ij}$ and $q_{ij}$ for all nets $e_{ij}$.
4. Compute
   $$t^k = \frac{C(V_1^k, V_2^k) - C(V_1^0, V_2^0)}{\sum_{e_{ij} \in E} (p_{ij} - \tilde{J})^2 + \sum_{e_{ij} \in E} (q_{ij} - \tilde{M})^2}.$$
5. Revise the shadow price $a_{ij}$ for all nets $e_{ij} \in E$:
   $a_{ij}^{k+1} = a_{ij}^k$;
   if net $e_{ij}$ is in an active loop, then $a_{ij}^{k+1} = a_{ij}^k + t^k (p_{ij} - \tilde{J})$;
   if net $e_{ij}$ is in an active path, then $a_{ij}^{k+1} = a_{ij}^k + t^k (q_{ij} - \tilde{M})$.
6. While $k \le$ MaxNumIter, set $k = k + 1$ and go to Step 2.

5.7. Clustering Heuristics

We first discuss the usage of clustering heuristics. We then discuss top down clustering and bottom up clustering approaches. Finally, we discuss some variations of clustering metrics.

5.7.1. Usage of Clustering Heuristics

The usage of clustering heuristics plays an important role in determining the quality of the final results. In the following, we discuss the issues under several topics. We use a two-way partitioning with size constraints as the target problem.

1. Top Down Clustering versus Bottom Up Clustering: The top down clustering approach provides a global view of the solution. The operations are consistent with the target problem. However, it is more time consuming because the clustering operates on the whole circuit [29]. Bottom up clustering is efficient. However, because the process operates locally, the target solution is sensitive to the clustering heuristics [59].

2. The Level of the Clustering: Suppose we represent the clustering results with a hierarchical tree structure. Let the root correspond to the whole circuit, the leaves correspond to the smallest clusters, and the internal nodes correspond to the intermediate clusters. Hence, the size of the clusters grows with the level of the nodes. Top down clustering creates clusters corresponding to nodes in high levels, while bottom up clustering creates clusters corresponding to nodes in low levels.
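The hierarchical tree view of clustering levels can be made concrete with a small sketch. The Python below is illustrative only: the class `ClusterNode`, the helper `clusters_up_to_level`, and the eight-module example are assumptions introduced here and are not taken from any cited algorithm. Levels are counted upward from the leaves, so a leaf is at level 0 and the root is at the highest level.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ClusterNode:
    """A node of a hypothetical clustering hierarchy (names are illustrative)."""
    modules: List[str] = field(default_factory=list)      # used only by leaves
    children: List["ClusterNode"] = field(default_factory=list)

    def size(self) -> int:
        """Number of modules covered by this cluster."""
        if not self.children:
            return len(self.modules)
        return sum(child.size() for child in self.children)

    def level(self) -> int:
        """Height above the leaves: a leaf is level 0, the root is highest."""
        if not self.children:
            return 0
        return 1 + max(child.level() for child in self.children)


def clusters_up_to_level(node: ClusterNode, target: int) -> List[ClusterNode]:
    """Cut the tree at `target`: return the largest clusters whose level <= target."""
    if node.level() <= target:
        return [node]
    found: List[ClusterNode] = []
    for child in node.children:
        found.extend(clusters_up_to_level(child, target))
    return found


def build_example() -> ClusterNode:
    """Eight modules: pairs at the leaves, two intermediate clusters, one root."""
    leaves = [ClusterNode(modules=list(pair)) for pair in ("ab", "cd", "ef", "gh")]
    left = ClusterNode(children=leaves[:2])
    right = ClusterNode(children=leaves[2:])
    return ClusterNode(children=[left, right])


if __name__ == "__main__":
    root = build_example()
    for lvl in range(root.level() + 1):
        sizes = [c.size() for c in clusters_up_to_level(root, lvl)]
        print(f"level {lvl}: cluster sizes {sizes}")
```

In this picture, a bottom up method builds the tree from level 0 upward by merging small clusters, while a top down method starts at the root and stops after a few splits, so it only ever produces the large clusters near the top.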
For example, in [60], Kernighan and Lin proposed a top down clustering approach, which divides the whole circuit into four clusters only. In [59], Karypis et al. used a bottom up clustering which starts with clusters of two modules or a net. If we continue the application of bottom up clustering on intermediate clusters, the quality of the clusters degenerates as the size of the clusters grows bigger.

3. Iteration of Clustering and Unclustering: We go through the iterations of clustering and unclustering to improve the quality of the results. At each level of the hierarchical tree, we derive an intermediate target solution, e.g., a two-way partition. In unclustering, we go down the level of the tree hierarchy to find an expanded circuit with more modules. In clustering, we go up the level of the tree hierarchy with a circuit of a smaller number of modules. The previous partitioning result becomes the initial solution of the new partitioning problem. Note that the hierarchical tree is constructed dynamically. For each clustering, the modules can be grouped based on the current partitioning configuration.

4. The Clustering Operations and the Target Solution: The clustering operation has to be consistent with the target solution. For example, suppose the target is finding a two-way min-cut with size constraints. Then, it is natural to cluster modules based on net connectivity because the probability that a net is in an optimal cut set is small (see the subsection of min-cut with size constraints in problem formulations). Moreover, it is important that the clustering follows the current partitioning results, i.e., only modules in the same partition are clustered.

5.7.2. Top Down Clustering Approach for Partitioning

We use an application to two-way cut with size constraints to illustrate the top down clustering approach [24, 29]. The partitioning of huge designs is complicated and the results can be erratic. Our strategy (Fig. 29) is to reduce the circuit complexity by constructing a contracted hypergraph. The clusters for the contracted hypergraph are searched via a recursive top down partitioning method. The number of modules is much reduced after we contract the clusters. Hence, a group
migration approach can derive excellent two-way cut results on the contracted hypergraph with much efficiency. Furthermore, since the clusters are grouped via a top down partitioning, conceptually a minimum cut on the hypergraph can take advantage of the previous results and generate better solutions.

In this section, we describe a top down clustering algorithm. A ratio cut is adopted to perform the top down clustering process. Other partition approaches can also be used to replace the ratio cut. A group migration method is used to find a minimum cut of the contracted hypergraph with size constraint. Finally, we apply a last run of the group migration algorithm to the original circuit to fine tune the result.

Input a hypergraph H(V, E), an integer k for the number of expected clusters, an integer num_of_reps for repetition, and $S_l$, $S_u$ for the size constraints of two resultant subsets.

1. Initialize $\Gamma = \{V\}$ and $V^{*} = V$.
2. Apply ratio cut [109] to obtain a partition $(A, A')$ of $V^{*} = A \cup A'$.
3. Set $\Gamma = (\Gamma - \{V^{*}\}) \cup \{A, A'\}$. Set $V^{*}$ to be a vertex set in $\Gamma$ such that $S(V^{*}) = \max_{V_i \in \Gamma} S(V_i)$.
4. While $S(V^{*}) > S(V)/k$, repeat Steps 2, 3.
5. Construct a contracted hypergraph $H_\Gamma(V_\Gamma, E_\Gamma)$.
6. Apply a group migration algorithm num_of_reps times to $H_\Gamma$ with the size constraints $S_l$, $S_u$.
7. Use the best result from Step 6 as an initial partition of the circuit H. Apply a group migration algorithm once to H with the size constraints $S_l$, $S_u$.

The choice of cluster number k It was shown [24] that the cut count versus cluster number k is a concave curve. When k is small, the quality is not as good because the clusters are too coarse. When k is large, there are too many clusters and we lose the benefit of the clustering.

For the case that the circuit is large, we may need to adopt multiple levels of clustering to push for the performance and efficiency [58, 66].

5.7.3. Bottom Up Clustering Approaches

In this section, we discuss bottom up clustering [90] with two applications: linear placement and performance driven designs. We then show two strategies to perform the clustering: maximum matching and maximum pairing. We will demonstrate via examples the advantage of maximum pairing over maximum matching.

(i) Linear Placement For linear placement, we reduce the complexity of the problem by a bottom up clustering approach [96, 100, 53]. The clustering is based on the result of a tentative placement. We adopt a heuristic approach to generate tentative placements throughout iterations. In each iteration, we cluster modules only when they are in consecutive order of the placement. We then construct a contracted hypergraph. In the next iteration, the heuristic approach generates the placement of the contracted hypergraph. For each iteration, we either grow the size of the clusters or construct new clusters adaptively.

Inspired by the property of the minimum cut separating two modules (Theorem 3.1), we use a density as a measure to find the cluster. A density d(i) at a slot i of a linear placement is the total connectivity of nets connecting modules on the different sides of the slot. The following algorithm describes the clustering using a given placement. Each cluster size is between L and U.

Input placement P, two parameters L and U.

1. Initialize the cluster boundary at slot $p = 1$.
2. Scan placement P from slot p toward the right end. Find slot i such that $p + L \le i \le p + U$ and density $d(i)$ is minimum among $d(p + L), \ldots, d(p + U)$.
3. Cluster modules between slots p and i. Set $p = i + 1$.
4. Repeat Steps 2, 3 until the scan reaches the right end.

Remark The proposed clustering process and the criteria are consistent with the target linear
placement application. The whole process depends on an efficient and effective linear placement.

(ii) Performance Driven Clustering For performance driven clustering [31, 112], nets which contribute to the longest delay are termed critical nets. Pins of the critical net are merged to form clusters.

For a special case that the circuit is a directed tree, we can find an optimal solution in polynomial time. Let us assume the tree has its leaves at the input and its root at the output. We use a dynamic programming approach to trace from the leaves toward the root. Each module is not traced until all its input modules are processed. For each module, we treat it as a root of a subtree and find the optimal clustering of the subtree. Since all the modules in the subtree except its root have been processed, we can derive an optimal solution of the root in polynomial time.

(iii) Maximum Matching The maximum matching pairs all modules into $|V|/2$ groups simultaneously. Given a measurement of pairing modules, we can find a matching that maximizes the total pairing measurement in polynomial time.

We can call maximum matching recursively to create clusters of equal sizes. However, this strategy may force unrelated pairs to merge. The enforcement will sacrifice the quality of the final clustering results.

Example Figure 30 illustrates the clustering behavior of maximum matching. The circuit contains twelve modules of equal size. The first level maximum matching pairs modules (a, b), (d, e), (g, h), (j, k), (c, l), and (f, i). Modules in the first four pairs are strongly connected with their partners. However, the last two are not. Modules c and l have no common nets but are merged because their choices are taken by others.

Furthermore, as we proceed to the next level maximum matching, the merge of pairs (c, l) and (f, i) will enforce grouping modules into cluster {a, b, c, j, k, l} and cluster {d, e, f, g, h, i}. If we measure the quality of the results with cluster cost (expression (26)), the cost of the two clusters is $\sum_i C(V_i)/C_I(V_i) = 4/12 + 4/12 = 2/3$. For this case, we can find a better solution of clusters {a, b, c, d, e, f} and {g, h, i, j, k, l}, of which the cluster cost is equal to zero.

Figure 31 shows another example of twelve modules with connectivities attached to the nets. The connectivity is 1 if not specified. Figure 31(a) shows an optimum cut with cut count 6.6. If a maximum matching [61] criterion is adopted in the bottom up clustering approach, then modules with a net of weight 1.1 between them will be merged. A minimum cut on the merged modules yields a cut count of 18 (Fig. 31(b)). In general, a 2n module circuit having a symmetric configuration as in Figure 31 will have a cut count of $n^2/2$ if the maximum matching criterion is applied to perform the clustering, while the optimum solution will have a cut weight of $1.1n$. From this extreme case, we can claim the following theorem:

THEOREM 5.4 There is no constant factor of error bound of the cut count generated by the maximum matching approach, from the cut count of a minimum cut.

Proof As shown in the above example, the factor of error bound is $(n^2/2)/(1.1n) = n/2.2$, which is not a constant. Q.E.D.

(iv) Maximum Pairing The maximum pairing is
$$h_{ij} = h_{ji} = \frac{2 \sum_{e \in |E|} c_e}{\ldots_{ij}}, \qquad (100)$$

Modules C and D become similar because module C obtains signal from A, module D obtains signal from B, and modules A and B are similar.

It is desired to correlate the logic hierarchy with the physical design hierarchy. The main reason is
the control of timing for huge designs. Currently, the design turnaround takes 2 to 8 months for ASIC and much longer for custom designs. Throughout the design process, designs keep on changing. We don't want to lose control of timing as the design changes. A tight correlation of logic and physical hierarchies makes timing predictable. Without this kind of mechanism, the timing characteristics of a floorplan may become erratic after iterations of design changes.
we properly exploit the structure of the design hierarchy. The generic binary tree is a good formulation to start with.

The handling of a hierarchy tree gives rise to many fundamental research problems. For example, finding k shortest-paths or exploring the maximum-flow minimum-cut of the whole circuit [51] embedded in a hierarchical tree can be useful for interconnect analysis and optimization. Such research can also benefit many different fields which have to handle huge hierarchical systems.

6.3. Performance Driven Partitioning

For performance driven partitioning, we need a fast evaluation on the hierarchical tree structure. The analysis needs to be incremental with incorporation of signal integrity.

The network flow method is a potential approach for the partitioning with timing constraints. More efforts are needed to improve the speed and derive desired results.

Acknowledgements

The authors thank the editor for the encouragement of preparing this manuscript. The authors would also like to thank Ted Carson, Lung-Tien Liu, and John Lillis for helpful discussions.

References

[1] Ahuja, R. K., Magnanti, T. L. and Orlin, J. B., Network Flows, Prentice Hall, 1993.
[2] Alpert, C. J., "The ISPD98 circuit benchmark suite", Int. Symp. on Physical Design, pp. 80-85, April, 1998.
[3] Alpert, C. J., Caldwell, A. E., Kahng, A. B. and Markov, I. L., "Partitioning with Terminals: a 'New' Problem and New Benchmarks", Int. Symp. on Physical Design, pp. 151-157, April, 1999.
[4] Alpert, C. J., Huang, J. H. and Kahng, A. B., "Multilevel circuit partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 530-533.
[5] Alpert, C. J. and Kahng, A. B., "Recent directions in netlist partitioning: a survey", Integration: The VLSI J., 19(1), 1-81, August, 1995.
[6] Alpert, C. J. and Kahng, A. B., "A general framework for vertex orderings with applications to circuit clustering", IEEE Trans. VLSI Syst., 4(2), 240-246, June, 1996.
[7] Alpert, C. J. and Yao, S. Z., "Spectral partitioning: the more eigenvectors, the better", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 195-200.
[8] Bakoglu, H. B., Circuits, Interconnections, and Packaging for VLSI, MA: Addison-Wesley, 1990.
[9] Blanks, J. (1989). "Partitioning by Probability Condensation", ACM/IEEE 26th Design Automation Conf., pp. 758-761.
[10] Bollobas, B. (1985). Random Graphs, Academic Press Inc., pp. 31-53.
[11] Boppana, R. B. (1987). "Eigenvalues and Graph Bisection: An Average Case Analysis", Annual Symp. on Foundations in Computer Science, pp. 280-285.
[12] Breuer, M. A., Design Automation of Digital Systems, Prentice-Hall, NY, 1972.
[13] Bui, T., Chaudhuri, S., Jones, C., Leighton, T. and Sipser, M. (1987). "Graph bisection algorithms with good average case behavior", Combinatorica, 7(2), 171-191.
[14] Bui, T., Heigham, C., Jones, C. and Leighton, T., "Improving the performance of the Kernighan-Lin and simulated annealing graph bisection algorithms", In: Proc. ACM/IEEE Design Automation Conf., June, 1989, pp. 775-778.
[15] Buntine, W. L., Su, L., Newton, A. R. and Mayer, A., "Adaptive methods for netlist partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 356-363.
[16] Burkard, R. E. and Bonniger, T. (1983). "A Heuristic for Quadratic Boolean Programs with Applications to Quadratic Assignment Problems", European Journal of Operational Research, 13, 372-386.
[17] Camposano, R. and Brayton, R. K. (1987). "Partitioning Before Logic Synthesis", Int. Conf. on Computer-Aided Design, pp. 324-326.
[18] Chan, P. K., Schlag, D. F. and Zien, J. Y., "Spectral k-way ratio-cut partitioning and clustering", IEEE Trans. Computer-Aided Design, 13(9), 1088-1096, September, 1994.
[19] Charney, H. R. and Plato, D. L., "Efficient Partitioning of Components", IEEE Design Automation, July, 1968, pp. 16.0-16.21.
[20] Chatterjee, A. C. and Hartley, R., "A new Simultaneous Circuit Partitioning and Chip Placement Approach based on Simulated Annealing", In: Proc. ACM/IEEE Design Automation Conf., June, 1990, pp. 36-39.
[21] Cheng, C. K. and Kuh, E. S., "Module Placement Based on Resistive Network Optimization", IEEE Trans. on Computer-Aided Design, CAD-3, 218-225, July, 1984.
[22] Cheng, C. K., "Linear Placement Algorithms and Applications to VLSI Design", Networks, 17, 439-464, Winter, 1987.
[23] Cheng, C. K. and Hu, T. C., "Ancestor Tree for Arbitrary Multi-Terminal Cut Functions", Proc. Integer Programming/Combinatorial Optimization Conf., Univ. of Waterloo, May, 1990, pp. 115-127.
[24] Cheng, C. K. and Wei, Y. C. (1991). "An Improved Two-Way Partitioning Algorithm with Stable Performance", IEEE Trans. on Computer Aided Design, 10(12), 1502-1511.
[25] Cheng, C. K. (1992). "The Optimal Partitioning of Networks", Networks, 22, 297-315.
[26] Cherng, J. S. and Chen, S. J., "A Stable Partitioning Algorithm for VLSI Circuits", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1996, pp. 9.1.1-9.1.4.
Addison Wesley, 1997.
[64] Kring, C. and Newton, A. R. (1991). "A Cell-Replicating Approach to Mincut Based Circuit Partitioning", Proc. IEEE Int. Conf. on Computer-Aided Design, pp. 2-5.
[65] Krishnamurthy, B., "An Improved Min-Cut Algorithm for Partitioning VLSI Networks", IEEE Trans. Computers, C-33(5), 438-446, May, 1984.
[66] Krupnova, H., Abbara, A. and Saucier, G. (1997). "A Hierarchy-Driven FPGA Partitioning Method", Design Automation Conf., pp. 522-525.
[67] Kuo, M. T. and Cheng, C. K., "A New Network Flow Approach for Hierarchical Tree Partitioning", In: Proc. ACM/IEEE Design Automation Conf., June, 1997, pp. 512-517.
[68] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Network Partitioning into Tree Hierarchies", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 477-482.
[69] Kuo, M. T., Liu, L. T. and Cheng, C. K., "Finite State Machine Decomposition for I/O Minimization", In: Proc. IEEE Int. Symp. on Circuits and Systems, May, 1995, pp. 1061-1064.
[70] Kuo, M. T., Wang, Y., Cheng, C. K. and Fujita, M., "BDD-Based Logic Partitioning for Sequential Circuits", In: Proc. ASP/DAC, Chiba, Japan, January, 1997, pp. 607-612.
[71] Lomonosov, M. V. (1985). "Combinatorial Approaches to Multiflow Problems", Discrete Applied Mathematics, 11(1), 1-94.
[72] Landman, B. S. and Russo, R. L., "On a Pin Versus Block Relationship for Partitioning of Logic Graphs", IEEE Trans. on Computers, C-20, 1469-1479, December, 1971.
[73] Lawler, E. L., Combinatorial Optimization: Networks and Matroids, Holt, Rinehart and Winston, New York, 1976.
[74] Leighton, T. and Rao, S. (1988). "An Approximate Max-Flow Min-cut Theorem for Uniform Multicommodity Flow Problems with Applications to Approximation Algorithms", IEEE Symp. on Foundations of Computer Science, pp. 422-431.
[75] Leighton, T., Makedon, F., Plotkin, S., Stein, C., Tardos, E. and Tragoudas, S., "Fast Approximation Algorithms for Multicommodity Flow Problems", Tech. report no. STAN-CS-91-1375, Dept. of Computer Science, Stanford University.
[76] Leiserson, C. E. and Saxe, J. B. (1991). "Retiming Synchronous Circuitry", Algorithmica, 6(1), 5-35.
[77] Lengauer, T. and Muller, R. (1988). "Linear Arrangement Problems on Recursively Partitioned Graphs", Zeitschrift fur Operations Research, 32, 213-230.
[78] Lengauer, T., Combinatorial Algorithms for Integrated Circuit Layout, Wiley, 1990.
[79] Li, J., Lillis, J. and Cheng, C. K., "Linear decomposition algorithm for VLSI design applications", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1995, pp. 223-228.
[80] Li, J., Lillis, J., Liu, L. T. and Cheng, C. K., "New Spectral Linear Placement and Clustering Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1996, pp. 88-93.
[81] Liou, H. Y., Lin, T. T., Liu, L. T. and Cheng, C. K., "Circuit Partitioning for Pipelined Pseudo-Exhaustive Testing Using Simulated Annealing", In: Proc. IEEE Custom Integrated Circuits Conf., May, 1994, pp. 417-420.
[82] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "A Replication Cut for Two-Way Partitioning", IEEE Trans. Computer-Aided Design, May, 1995, pp. 623-630.
[83] Liu, L. T., Kuo, M. T., Cheng, C. K. and Hu, T. C., "Performance-Driven Partitioning Using a Replication Graph Approach", In: Proc. ACM/IEEE Design Automation Conf., June, 1995, pp. 206-210.
[84] Liu, L. T., Kuo, M. T., Huang, S. C. and Cheng, C. K., "A gradient method on the initial partition of Fiduccia-Mattheyses algorithm", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 229-234.
[85] Liu, L. T., Shih, M., Chou, N. C., Cheng, C. K. and Ku, W., "Performance-Driven Partitioning Using Retiming and Replication", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1993, pp. 296-299.
[86] Liu, L. T., Shih, M. and Cheng, C. K., "Data Flow Partitioning for Clock Period and Latency Minimization", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 658-663.
[87] Matula, D. W. and Shahrokhi, F., "The Maximum Concurrent Flow Problem and Sparsest Cuts", Tech. Report, Southern Methodist Univ., 1986.
[88] McFarland, M. C., S.J., "Computer-aided partitioning of behavioral hardware descriptions", In: Proc. ACM/IEEE Design Automation Conf., June, 1983, pp. 472-478.
[89] Motwani, R. and Raghavan, P. (1995). Randomized Algorithms, Cambridge University Press.
[90] Ng, T. K., Oldfield, J. and Pitchumani, V., "Improvements of a mincut partition algorithms", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1987, pp. 470-473.
[91] Nijssen, R. X. T., Jess, J. A. G. and Eindhoven, T. U., "Two-Dimensional Datapath Regularity Extraction", Physical Design Workshop, April, 1996, pp. 111-117.
[92] Parhi, K. K. and Messerschmitt, D. G. (1991). "Static Rate-Optimal Scheduling of Iterative Data-Flow Programs via Optimum Unfolding", IEEE Trans. on Computers, 40(2), 178-195.
[93] Riess, B. M., Doll, K. and Johannes, F. M., "Partitioning very large circuits using analytical placement techniques", In: Proc. ACM/IEEE Design Automation Conf., June, 1994, pp. 646-651.
[94] Roy, K. and Sechen, C., "A Timing Driven N-Way Chip and Multi-Chip Partitioner", Proc. IEEE/ACM Int. Conf. on Computer-Aided Design, pp. 240-247, November, 1993.
[95] Russo, R. L., Oden, P. H. and Wolff, P. K. Sr., "A heuristic procedure for the partitioning and mapping of computer logic graphs", IEEE Trans. on Computers, C-20, 1455-1462, December, 1971.
[96] Saab, Y., "A fast and robust network bisection algorithm", IEEE Trans. Computers, 44(7), 903-913, July, 1995.
[97] Saab, Y. and Rao, V. (1989). "An Evolution-Based Approach to Partitioning ASIC Systems", ACM/IEEE 26th Design Automation Conf., pp. 767-770.
[98] Sanchis, L. A., "Multiple-Way Network Partitioning", IEEE Trans. Computers, 38(1), 62-81, January, 1989.
[99] Sanchis, L. A., "Multiple-Way Network Partitioning with Different Cost Functions", IEEE Trans. on Computers, pp. 1500-1504, December, 1993.
[100] Schuler, D. M. and Ulrich, E. G. (1972). "Clustering and Linear Placement", Proc. 9th Design Automation Workshop, pp. 50-56.
[101] Schweikert, D. G. and Kernighan, B. W. (1972). "A Proper Model for the Partitioning of Electrical Circuits", Proc. 9th Design Automation Workshop, pp. 57-62.
[102] Sechen, C. and Chen, D. (1988). "An Improved Objective Function for Mincut Circuit Partitioning", Proc. Int. Conf. on Computer-Aided Design, pp. 502-505.
[103] Shahrokhi, F. and Matula, D. W., "The Maximum Concurrent Flow Problem", Journal of the ACM, 37(2), 318-334, April, 1990.
[104] Shapiro, J. F. (1979). Mathematical Programming: Structures and Algorithms, Wiley, New York.
[105] Sherwani, N. A. (1999). Algorithms for VLSI Physical Design Automation, 3rd edn., Kluwer Academic.
[106] Shih, M., Kuh, E. S. and Tsay, R.-S. (1992). "Performance-Driven System Partitioning on Multi-Chip Modules", Proc. 29th ACM/IEEE Design Automation Conf., pp. 53-56.
[107] Shih, M. and Kuh, E. S. (1993). "Quadratic Boolean Programming for Performance-Driven System Partitioning", Proc. 30th ACM/IEEE Design Automation Conf., pp. 761-765.
[108] Shin, H. and Kim, C., "A Simple Yet Effective Technique for Partitioning", IEEE Trans. on Very Large Scale Integration Systems, pp. 380-386, September, 1993.
[109] Wei, Y. C. and Cheng, C. K. (1991). "Ratio Cut Partitioning for Hierarchical Designs", IEEE Trans. on Computer-Aided Design, 10(7), 911-921.
[110] Wei, Y. C., Cheng, C. K. and Wurman, Z., "Multiple Level Partitioning: An Application to the Very Large Scale Hardware Simulators", IEEE Journal of Solid State Circuits, 26, 706-716, May, 1991.
[111] Woo, N. S. and Kim, J. (1993). "An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation", Proc. ACM/IEEE Design Automation Conf., pp. 202-207.
[112] Yang, H. and Wong, D. F. (1994). "Edge-Map: Optimal Performance Driven Technology Mapping for Iterative LUT Based FPGA Designs", Int. Conf. on Computer-Aided Design, pp. 150-155.
[113] Yang, H. and Wong, D. F., "Efficient Network Flow based Min-Cut Balanced Partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1994, pp. 50-55.
[114] Yeh, C. W., "On the Acceleration of Flow-Oriented Circuit Clustering", IEEE Trans. Computer-Aided Design, 14(10), 1305-1308, October, 1995.
[115] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "A general purpose, multiple-way partitioning algorithm", IEEE Trans. Computer-Aided Design, 13(12), 1480-1488, December, 1994.
[116] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Optimization by iterative improvement: an experimental evaluation on two-way partitioning", IEEE Trans. Computer-Aided Design, 14(2), 145-153, February, 1995.
[117] Yeh, C. W., Cheng, C. K. and Lin, T. T. Y., "Circuit clustering using a stochastic flow injection method", IEEE Trans. Computer-Aided Design, 14(2), 154-162, February, 1995.
[118] Zien, J. Y., Chan, P. K. and Schlag, M., "Hybrid spectral/iterative partitioning", In: Proc. IEEE Int. Conf. Computer-Aided Design, November, 1997, pp. 436-440.

Authors' Biographies

Sao-Jie Chen has been a member of the faculty in the Department of Electrical Engineering, National Taiwan University since 1982, where he is currently a full professor. During the fall of 1999, he held a visiting appointment at the Department of Computer Science and Engineering, University of California, San Diego. His current research interests include VLSI circuits design, VLSI physical design automation, and object-oriented software engineering. Dr. Chen is a member of the Association for Computing Machinery, the IEEE, and the IEEE Computer Society.

Chung-Kuan Cheng received the B.S. and M.S. degrees in electrical engineering from National Taiwan University, and the Ph.D. degree in electrical engineering and computer sciences from University of California, Berkeley in 1984. From 1984 to 1986 he was a senior CAD engineer at Advanced Micro Devices Inc. In 1986, he joined the University of California, San Diego, where he is a Professor in the Computer Science and Engineering Department and an Adjunct Professor in the Electrical and Computer Engineering Department. He served as a chief scientist at Mentor Graphics in 1999. He has been an associate editor of IEEE Trans. on Computer Aided Design since 1994. He is a recipient of the best paper award, IEEE Trans. on Computer-Aided Design 1997, and the NCR excellence in teaching award, School of Engineering, UCSD, 1991. His research interests include network optimization and design automation on microelectronic circuits.